Data Lifecycle Management
Every piece of data in ReGenesis follows a defined lifecycle from the moment it enters the system to the moment it is permanently deleted. This is not optional: it is a regulatory requirement under GDPR and CCPA/CPRA, and an expectation in SOC 2 audits.
The Eight Steps
| Step | What Happens | Security Control |
|---|---|---|
| 1. Ingest | Data enters the system (recording, file, message) | OAuth validation, malware scan, tenant tagging |
| 2. Normalize | Raw data is parsed, structured, classified | PII detection, data classification, speaker diarization |
| 3. Store | Processed data is persisted to appropriate stores | AES-256 encryption, tenant isolation (RLS), visibility tagging |
| 4. Derive | Sasha AI generates insights from stored data | Pseudonymization, no-training flags, model/version logging |
| 5. Evidence Pack | AI outputs are assembled into L0/L1/L2 packs | Cross-validation against source, provenance metadata |
| 6. Approve | Coach reviews and approves/edits AI outputs | Human-in-the-loop, edit tracking, visibility promotion |
| 7. Deliver | Approved content reaches the coachee | Notification without content, secure links, export controls |
| 8. Archive/Delete | Data reaches retention limit or user requests deletion | Configurable retention, deletion certificates, backup scrubbing |
Why This Matters
- Compliance: Demonstrates data lifecycle governance for SOC 2 audits and GDPR Article 5(1)(e) (storage limitation)
- Trust: Clients need to know that their deeply personal coaching data has a defined end-of-life
- Liability reduction: Data that does not exist cannot be breached. Aggressive but defensible retention policies reduce risk surface.
- AI governance: Every AI-derived output has a full provenance chain back to source data
Following a Session Through the Lifecycle
Let's trace what happens when a coaching session is recorded on Zoom and flows through ReGenesis:
Step 1: Ingest
- Zoom sends a webhook notification: "recording available"
- The Ingest Service validates the webhook signature (HMAC-SHA256)
- Downloads the recording via Zoom's API using the coach's OAuth token
- Runs a malware scan on the file
- Tags it with the tenant ID and session ID
- Uploads the encrypted recording to S3
- Logs the ingest event to the audit log
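The signature check in this step can be sketched as follows. This is a minimal illustration, not the Ingest Service's actual code: the function name, the `v0:{timestamp}:{body}` message layout, and the `v0=` prefix are assumptions modeled on the scheme Zoom describes for its webhooks, and should be checked against Zoom's current documentation.

```python
import hashlib
import hmac


def verify_webhook_signature(secret: str, timestamp: str, body: bytes, signature: str) -> bool:
    """Return True if `signature` matches an HMAC-SHA256 over the timestamp and raw body."""
    # Assumed message layout: "v0:{timestamp}:{raw body}".
    message = b"v0:" + timestamp.encode("utf-8") + b":" + body
    digest = hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()
    expected = "v0=" + digest  # assumed prefix on the signature header value
    # Constant-time comparison, so timing does not reveal how much of the value matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

The check runs against the raw request bytes, before any JSON parsing, because re-serializing the payload can change the bytes and invalidate the signature.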
Step 2: Normalize
- The NLP Pipeline picks up the new recording from a queue
- Transcribes the audio using Whisper or Deepgram
- Performs speaker diarization (labels who said what)
- Adds word-level timestamps
- Runs PII detection (phone numbers, addresses, SSNs)
- Classifies the data: this transcript contains "special category" content (personal development, emotional content)
- Stores the structured transcript in PostgreSQL
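As a rough illustration of the PII detection pass, the sketch below flags US-style phone numbers and SSNs with regular expressions. The patterns, the `PiiMatch` type, and the `detect_pii` function are hypothetical; a production detector would cover far more formats (and addresses, which regular expressions handle poorly).

```python
import re
from dataclasses import dataclass

# Illustrative patterns only: US-style SSNs and 10-digit phone numbers.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


@dataclass(frozen=True)
class PiiMatch:
    kind: str   # which pattern matched
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


def detect_pii(text: str) -> list[PiiMatch]:
    """Return every pattern match in `text`, ordered by position."""
    matches = []
    for kind, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append(PiiMatch(kind, m.start(), m.end()))
    return sorted(matches, key=lambda match: match.start)
```

Returning offsets rather than the matched text keeps the detector's own output free of the sensitive values it finds.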
Step 3: Store
- The transcript is encrypted at rest (AES-256, database-level)
- Row-Level Security ensures only this tenant can access it
- The visibility tag is set to `system_internal` (raw transcripts are not shown to anyone directly)
- Embeddings are generated and stored in the Vector DB under this tenant's namespace
Step 4: Derive (Sasha AI)
- Sasha prepares to generate insights
- Pseudonymization: Before sending to the LLM, personal names are replaced with tokens ("Participant A", "Coach B")
- The request to Anthropic Claude includes the `no-training` flag
- Sasha generates themes, patterns, action items, and growth areas
- Every LLM call is logged: model version, token counts, input hash, output hash
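The pseudonymization and logging steps could look roughly like this. It is a sketch under assumptions: the name-to-token map is supplied by the caller, the function and field names are hypothetical, and SHA-256 is an assumed choice for the input and output hashes.

```python
import hashlib
import re


def pseudonymize(text: str, tokens: dict[str, str]) -> str:
    """Replace each known personal name in `text` with its token."""
    # Longest names first, so "Maria Lopez" is replaced before "Maria".
    for name in sorted(tokens, key=len, reverse=True):
        pattern = rf"\b{re.escape(name)}\b"
        text = re.sub(pattern, lambda _m, token=tokens[name]: token, text)
    return text


def llm_call_record(model: str, prompt: str, output: str,
                    input_tokens: int, output_tokens: int) -> dict:
    """Build a log entry that stores hashes, never the prompt or output text."""
    return {
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_hash": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
```

Logging hashes instead of text lets an auditor confirm which input produced which output without the log becoming a second copy of the session content.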
Step 5: Evidence Pack
- The Evidence Pack Builder takes Sasha's raw output and structures it:
  - L0: The insight itself ("Coachee shows a pattern of avoiding delegation")
  - L1: The reasoning chain ("Mentioned in 3 sessions, correlates with control orientation...")
  - L2: Source references with transcript timestamps and video jump links
- Cross-validates the L0 claim against the L2 sources (hallucination check)
- Stores the pack with provenance metadata
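One way to picture the three-level pack and a basic source check is sketched below. The class and field names are hypothetical, and the check is deliberately narrow: it only confirms that each L2 quote actually appears in the cited transcript, which is one plausible ingredient of a hallucination check rather than the whole of it.

```python
from dataclasses import dataclass


@dataclass
class SourceRef:
    session_id: str
    start_seconds: float  # transcript timestamp where the quote begins
    quote: str


@dataclass
class EvidencePack:
    l0_insight: str
    l1_reasoning: str
    l2_sources: list[SourceRef]


def _normalize(text: str) -> str:
    # Ignore case and whitespace differences when comparing quotes.
    return " ".join(text.lower().split())


def unsupported_sources(pack: EvidencePack, transcripts: dict[str, str]) -> list[SourceRef]:
    """Return L2 references whose quote is missing from the cited transcript."""
    missing = []
    for ref in pack.l2_sources:
        transcript = transcripts.get(ref.session_id, "")
        quote = _normalize(ref.quote)
        if not quote or quote not in _normalize(transcript):
            missing.append(ref)
    return missing
```

A pack with any unsupported source would be held back or regenerated rather than sent to the coach for review.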
Step 6: Approve
- The coach receives a notification: "New insights ready for review" (no content in the notification)
- The coach opens the Evidence Pack in their dashboard
- They can: approve as-is, edit the wording, reject with feedback, or request re-generation
- All edits are tracked (who changed what, when)
- Approval changes the insight visibility from `coach_only` to `client_visible`
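The review step amounts to a small state change with an edit trail. The sketch below is illustrative only: the `Insight` class and its methods are hypothetical, while the `coach_only` and `client_visible` values come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Insight:
    text: str
    visibility: str = "coach_only"
    history: list = field(default_factory=list)

    def _track(self, actor: str, action: str, **details) -> None:
        # Record who did what, and when, for every change.
        entry = {"actor": actor, "action": action,
                 "at": datetime.now(timezone.utc).isoformat(), **details}
        self.history.append(entry)

    def edit(self, coach_id: str, new_text: str) -> None:
        self._track(coach_id, "edit", before=self.text, after=new_text)
        self.text = new_text

    def approve(self, coach_id: str) -> None:
        # Only a coach_only insight can be promoted to the coachee.
        if self.visibility != "coach_only":
            raise ValueError(f"cannot approve from visibility {self.visibility!r}")
        self._track(coach_id, "approve")
        self.visibility = "client_visible"
```

Because promotion happens only inside `approve`, nothing reaches the coachee without a recorded human decision.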
Step 7: Deliver
- The coachee receives a notification: "New insights from your recent session" (no content in notification)
- They log in and see the approved Evidence Pack
- L0 (insight) and approved L1 (reasoning) are visible
- L2 (source references) may be visible or coach-only, depending on configuration
- The coachee can provide feedback (helpful/not helpful, corrections)
Step 8: Archive/Delete
- After the retention period expires (default 2 years, configurable per tenant):
  - Transcripts are deleted first
  - AI-derived insights are deleted
  - Evidence packs are deleted
  - Artifacts (recordings) are deleted from S3
  - Vector DB embeddings are purged
  - A deletion certificate is generated and stored permanently
- If a user exercises their right to erasure (GDPR Art. 17), the same cascade runs immediately
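The cascade order above can be expressed as a fixed sequence of per-store delete calls followed by certificate generation. This is a sketch under assumptions: the store names, the `deleters` callables, and the certificate fields (including the SHA-256 digest) are hypothetical, not the actual certificate format.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Callable

# Order taken from the list above: transcripts first, embeddings last.
DELETION_ORDER = ["transcripts", "insights", "evidence_packs", "recordings", "embeddings"]


def run_deletion_cascade(session_id: str,
                         deleters: dict[str, Callable[[str], int]],
                         reason: str) -> dict:
    """Delete from each store in order, then return a deletion certificate."""
    deleted = {}
    for store in DELETION_ORDER:
        # Each deleter returns how many records it removed. An exception here
        # stops the cascade, so no certificate is issued for a partial run.
        deleted[store] = deleters[store](session_id)
    certificate = {
        "session_id": session_id,
        "reason": reason,  # e.g. "retention_expired" or "erasure_request"
        "deleted": deleted,
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(certificate, sort_keys=True).encode("utf-8")
    certificate["digest"] = hashlib.sha256(payload).hexdigest()
    return certificate
```

Retention expiry and an erasure request would call the same function with a different `reason`, which matches the point that both paths run the same cascade.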
What Stays After Deletion
Only these records survive deletion:
- Audit logs (immutable, anonymized references)
- Deletion certificates (proof that deletion occurred)
- Anonymized aggregate metrics (already de-identified, cannot be linked back)
Deleting data from the primary database is not enough. Backups must also be scrubbed within 30 days of a deletion event, because the GDPR erasure obligation extends to backup copies. The backup scrub scheduler tracks this deadline and generates a supplemental certificate when the scrub is complete.
Retention periods are configurable per tenant via the admin panel. Minimum retention is 90 days (to allow for dispute resolution). Maximum is 7 years (for SOC 2 audit log requirements). Default is 2 years.
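The retention bounds and the backup scrub window reduce to a few lines of date arithmetic. The sketch below uses the numbers stated above; the function names are hypothetical, and treating a year as 365 days is a simplifying assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_RETENTION_DAYS = 90            # allows for dispute resolution
MAX_RETENTION_DAYS = 7 * 365       # assumption: one year counted as 365 days
DEFAULT_RETENTION_DAYS = 2 * 365
BACKUP_SCRUB_WINDOW = timedelta(days=30)


def validate_retention_days(days: Optional[int]) -> int:
    """Return the tenant's retention period, or the default if none is set."""
    if days is None:
        return DEFAULT_RETENTION_DAYS
    if not MIN_RETENTION_DAYS <= days <= MAX_RETENTION_DAYS:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_DAYS} and {MAX_RETENTION_DAYS} days"
        )
    return days


def retention_expires_at(created_at: datetime, retention_days: int) -> datetime:
    """When data created at `created_at` becomes due for deletion."""
    return created_at + timedelta(days=retention_days)


def backup_scrub_deadline(deleted_at: datetime) -> datetime:
    """Latest time by which backups holding the deleted data must be scrubbed."""
    return deleted_at + BACKUP_SCRUB_WINDOW
```

For example, a deletion completed on 1 January would need its backups scrubbed by 31 January.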