Data Lifecycle Management

Every piece of data in ReGenesis follows a defined lifecycle from the moment it enters the system to the moment it is permanently deleted. This is not optional: it is a regulatory requirement under GDPR and CCPA/CPRA, and an expectation in SOC 2 audits.

The Eight Steps

Step	What Happens	Security Control
1. Ingest	Data enters the system (recording, file, message)	OAuth validation, malware scan, tenant tagging
2. Normalize	Raw data is parsed, structured, classified	PII detection, data classification, speaker diarization
3. Store	Processed data is persisted to appropriate stores	AES-256 encryption, tenant isolation (RLS), visibility tagging
4. Derive	Sasha AI generates insights from stored data	Pseudonymization, no-training flags, model/version logging
5. Evidence Pack	AI outputs are assembled into L0/L1/L2 packs	Cross-validation against source, provenance metadata
6. Approve	Coach reviews and approves/edits AI outputs	Human-in-the-loop, edit tracking, visibility promotion
7. Deliver	Approved content reaches the coachee	Notification without content, secure links, export controls
8. Archive/Delete	Data reaches retention limit or user requests deletion	Configurable retention, deletion certificates, backup scrubbing

Why This Matters

Compliance: Demonstrates data lifecycle governance for SOC 2 audits and GDPR Article 5(1)(e) (storage limitation)
Trust: Clients need to know that their deeply personal coaching data has a defined end-of-life
Liability reduction: Data that does not exist cannot be breached. Aggressive but defensible retention policies reduce the risk surface.
AI governance: Every AI-derived output has a full provenance chain back to source data

Following a Session Through the Lifecycle

Let's trace what happens when a coaching session is recorded on Zoom and flows through ReGenesis:

Step 1: Ingest

Zoom sends a webhook notification: "recording available"
The Ingest Service validates the webhook signature (HMAC-SHA256)
Downloads the recording via Zoom's API using the coach's OAuth token
Runs a malware scan on the file
Tags it with the tenant ID and session ID
Uploads the encrypted recording to S3
Logs the ingest event to the audit log
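The signature check in the second item can be sketched as follows. This is a minimal, generic HMAC-SHA256 verification using only the Python standard library; the secret, the payload, and the function names are hypothetical, and the exact string Zoom signs and the header it sends should be taken from Zoom's own webhook documentation rather than from this sketch.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 digest of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare the signature we compute with the one the sender supplied."""
    expected = sign_payload(secret, body)
    # compare_digest runs in constant time, so an attacker cannot learn
    # how many leading characters of a forged signature were correct.
    return hmac.compare_digest(expected, received_signature)


# Hypothetical secret and payload, for illustration only.
secret = b"example-webhook-secret"
body = b'{"event": "recording.completed"}'
good = sign_payload(secret, body)

print(is_valid_signature(secret, body, good))         # True
print(is_valid_signature(secret, body + b" ", good))  # False
```

The check must run on the raw request bytes, before any JSON parsing, because re-serializing the body can change whitespace and invalidate the digest.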

Step 2: Normalize

The NLP Pipeline picks up the new recording from a queue
Transcribes the audio using Whisper or Deepgram
Performs speaker diarization (labels who said what)
Adds word-level timestamps
Runs PII detection (phone numbers, addresses, SSNs)
Classifies the data: this transcript contains "special category" content (personal development, emotional content)
Stores the structured transcript in PostgreSQL
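To make the PII detection step concrete, here is a small pattern-based sketch. It is an assumption about how such a detector could look, not the pipeline's actual implementation: the patterns cover only US-style phone numbers and SSNs, and `PiiMatch`, `PII_PATTERNS`, and `detect_pii` are illustrative names. Addresses generally need more than regular expressions (for example a named-entity model).

```python
import re
from typing import List, NamedTuple


class PiiMatch(NamedTuple):
    kind: str
    start: int
    end: int
    text: str


# Illustrative patterns only; real detectors cover many more formats.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def detect_pii(text: str) -> List[PiiMatch]:
    """Return every pattern hit, ordered by position in the text."""
    matches = []
    for kind, pattern in PII_PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append(PiiMatch(kind, m.start(), m.end(), m.group()))
    return sorted(matches, key=lambda match: match.start)


sample = "Call me at 555-867-5309, my SSN is 123-45-6789."
for hit in detect_pii(sample):
    print(hit.kind, hit.text)
```

Keeping the character offsets lets a later step redact or tag the exact span in the transcript instead of searching for the text again.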

Step 3: Store

The transcript is encrypted at rest (AES-256, database-level)
Row-Level Security ensures only this tenant can access it
The visibility tag is set to system_internal (raw transcripts are not shown to anyone directly)
Embeddings are generated and stored in the Vector DB under this tenant's namespace
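The per-tenant namespace for embeddings can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: `NamespacedVectorStore` is a hypothetical stand-in for the real Vector DB, and it shows only the isolation property, namely that a query never looks outside the caller's tenant namespace.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NamespacedVectorStore:
    """Toy vector store that keeps each tenant's embeddings separate."""

    def __init__(self) -> None:
        self._namespaces: Dict[str, Dict[str, List[float]]] = defaultdict(dict)

    def upsert(self, tenant_id: str, item_id: str, vector: Sequence[float]) -> None:
        self._namespaces[tenant_id][item_id] = list(vector)

    def query(self, tenant_id: str, vector: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
        # Only the caller's namespace is searched; other tenants are never read.
        candidates = self._namespaces.get(tenant_id, {})
        scored = [(item_id, _cosine(vector, v)) for item_id, v in candidates.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = NamespacedVectorStore()
store.upsert("tenant-a", "segment-1", [1.0, 0.0])
store.upsert("tenant-b", "segment-9", [1.0, 0.0])
print(store.query("tenant-a", [1.0, 0.0]))
```

Making the tenant ID a required argument of every read and write keeps a missing filter from silently turning into a cross-tenant query.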

Step 4: Derive (Sasha AI)

Sasha prepares to generate insights
Pseudonymization: Before sending to the LLM, personal names are replaced with tokens ("Participant A", "Coach B")
The request to Anthropic Claude includes the no-training flag
Sasha generates themes, patterns, action items, and growth areas
Every LLM call is logged: model version, token counts, input hash, output hash
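The pseudonymization and call-logging steps can be sketched together. This is an illustration under assumptions, not Sasha's actual code: the name-to-token mapping is supplied by the caller, the model string is a placeholder, and no LLM request is made here. Hashing the prompt and completion lets the audit trail prove which text was sent and received without storing the text itself.

```python
import hashlib
import re
from datetime import datetime, timezone
from typing import Dict


def pseudonymize(text: str, name_to_token: Dict[str, str]) -> str:
    """Replace each known personal name with its token in a single pass."""
    if not name_to_token:
        return text
    # Longest names first so "Ann Lee" wins over "Ann".
    ordered = sorted(name_to_token, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")
    return pattern.sub(lambda m: name_to_token[m.group(0)], text)


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_call_log(model_version: str, prompt: str, completion: str,
                   input_tokens: int, output_tokens: int) -> dict:
    """Assemble the audit record for one LLM call (hashes only, no content)."""
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "input_hash": sha256_hex(prompt),
        "output_hash": sha256_hex(completion),
    }


names = {"Dana": "Participant A", "Morgan": "Coach B"}  # hypothetical mapping
prompt = pseudonymize("Dana told Morgan about the reorg.", names)
print(prompt)  # Participant A told Coach B about the reorg.
log = build_call_log("example-model-v1", prompt, "Theme: delegation", 12, 4)
print(log["input_hash"][:16])
```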

Step 5: Evidence Pack

The Evidence Pack Builder takes Sasha's raw output and structures it:
- L0: The insight itself ("Coachee shows a pattern of avoiding delegation")
- L1: The reasoning chain ("Mentioned in 3 sessions, correlates with control orientation...")
- L2: Source references with transcript timestamps and video jump links
Cross-validates the L0 claim against the L2 sources (hallucination check)
Stores the pack with provenance metadata
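One way to picture the three layers and the cross-validation step is the sketch below. The field names and the check itself are assumptions: the check only confirms that every L2 quote appears in the referenced session's transcript, which is the simplest possible form of a hallucination check and almost certainly weaker than what a production builder would do.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SourceRef:
    session_id: str
    start_seconds: float
    end_seconds: float
    quote: str


@dataclass
class EvidencePack:
    l0_insight: str
    l1_reasoning: str
    l2_sources: List[SourceRef]
    provenance: Dict[str, str] = field(default_factory=dict)


def unsupported_sources(pack: EvidencePack, transcripts: Dict[str, str]) -> List[SourceRef]:
    """Return the L2 references whose quote is not found in the transcript."""
    missing = []
    for ref in pack.l2_sources:
        transcript = transcripts.get(ref.session_id, "")
        if ref.quote.lower() not in transcript.lower():
            missing.append(ref)
    return missing


def passes_cross_validation(pack: EvidencePack, transcripts: Dict[str, str]) -> bool:
    # A pack with no sources at all cannot support its L0 claim.
    return bool(pack.l2_sources) and not unsupported_sources(pack, transcripts)


transcripts = {"s1": "I end up redoing the work myself instead of handing it off."}
pack = EvidencePack(
    l0_insight="Coachee shows a pattern of avoiding delegation",
    l1_reasoning="Mentioned in session s1",
    l2_sources=[SourceRef("s1", 312.0, 318.5, "redoing the work myself")],
    provenance={"model_version": "example-model-v1"},
)
print(passes_cross_validation(pack, transcripts))  # True
```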

Step 6: Approve

The coach receives a notification: "New insights ready for review" (no content in the notification)
The coach opens the Evidence Pack in their dashboard
They can: approve as-is, edit the wording, reject with feedback, or request re-generation
All edits are tracked (who changed what, when)
Approval changes the insight visibility from coach_only to client_visible
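The approval step behaves like a small state machine with an edit history. The sketch below is illustrative: the three visibility values come from this page, while the `Insight` class, its methods, and the history record layout are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List


class Visibility(Enum):
    SYSTEM_INTERNAL = "system_internal"
    COACH_ONLY = "coach_only"
    CLIENT_VISIBLE = "client_visible"


@dataclass
class Insight:
    text: str
    visibility: Visibility = Visibility.COACH_ONLY
    history: List[dict] = field(default_factory=list)

    def _record(self, actor: str, action: str, before: str, after: str) -> None:
        # Who changed what, and when.
        self.history.append({
            "actor": actor,
            "action": action,
            "before": before,
            "after": after,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def _require_pending(self) -> None:
        if self.visibility is not Visibility.COACH_ONLY:
            raise ValueError("only coach_only insights can be edited or approved")

    def edit(self, coach_id: str, new_text: str) -> None:
        self._require_pending()
        self._record(coach_id, "edit", self.text, new_text)
        self.text = new_text

    def approve(self, coach_id: str) -> None:
        self._require_pending()
        self._record(coach_id, "approve", self.visibility.value,
                     Visibility.CLIENT_VISIBLE.value)
        self.visibility = Visibility.CLIENT_VISIBLE


insight = Insight("Coachee avoids delegation")
insight.edit("coach-1", "Coachee tends to hold on to tasks")
insight.approve("coach-1")
print(insight.visibility.value)  # client_visible
```

Routing every change through `_record` means the promotion to client_visible cannot happen without leaving a history entry.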

Step 7: Deliver

The coachee receives a notification: "New insights from your recent session" (no content in notification)
They log in and see the approved Evidence Pack
L0 (insight) and approved L1 (reasoning) are visible
L2 (source references) may be visible or coach-only, depending on configuration
The coachee can provide feedback (helpful/not helpful, corrections)
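The two delivery rules, a notification that carries no content and a view that depends on L2 configuration, can be sketched like this. The dictionary keys and function names are hypothetical and chosen only for illustration.

```python
def build_notification(coachee_id: str, pack_id: str) -> dict:
    """Build a notification with a fixed message and an opaque ID only."""
    # No insight text is included, so nothing sensitive reaches email or push.
    return {
        "to": coachee_id,
        "message": "New insights from your recent session",
        "pack_id": pack_id,
    }


def coachee_view(pack: dict, l2_visible_to_coachee: bool) -> dict:
    """Return only the layers the coachee is allowed to see."""
    view = {"l0": pack["l0"]}
    if pack.get("l1_approved"):
        view["l1"] = pack["l1"]
    if l2_visible_to_coachee:
        view["l2"] = pack["l2"]
    return view


pack = {
    "l0": "Coachee tends to hold on to tasks",
    "l1": "Mentioned in 3 sessions",
    "l1_approved": True,
    "l2": ["session s1 at 05:12"],
}
print(build_notification("coachee-7", "pack-42"))
print(coachee_view(pack, l2_visible_to_coachee=False))
```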

Step 8: Archive/Delete

After the retention period expires (default 2 years, configurable per tenant):
- Transcripts are deleted first
- AI-derived insights are deleted
- Evidence packs are deleted
- Artifacts (recordings) are deleted from S3
- Vector DB embeddings are purged
- A deletion certificate is generated and stored permanently
If a user exercises their right to erasure (GDPR Art. 17), the same cascade runs immediately
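The cascade order above can be expressed as a small routine that walks the stores in sequence and emits a certificate. This is a sketch under assumptions: plain dictionaries stand in for PostgreSQL, S3, and the Vector DB, the certificate fields are invented for illustration, and hashing the session ID is just one way to keep the certificate free of direct identifiers.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Dict

# Order taken from the list above: transcripts first, embeddings last.
DELETION_ORDER = ["transcripts", "insights", "evidence_packs", "recordings", "embeddings"]


def run_deletion_cascade(stores: Dict[str, dict], session_id: str, reason: str) -> dict:
    """Delete one session from every store, then return a deletion certificate."""
    deleted = []
    for name in DELETION_ORDER:
        if stores[name].pop(session_id, None) is not None:
            deleted.append(name)

    certificate = {
        # Hashed reference, so the certificate itself holds no raw identifier.
        "session_ref": hashlib.sha256(session_id.encode("utf-8")).hexdigest(),
        "reason": reason,
        "deleted_stores": deleted,
        "completed_at": datetime.now(timezone.utc).isoformat(),
    }
    # Fingerprint of the certificate body, to make later tampering detectable.
    body = json.dumps(certificate, sort_keys=True).encode("utf-8")
    certificate["fingerprint"] = hashlib.sha256(body).hexdigest()
    return certificate


stores = {name: {"session-1": "data", "session-2": "data"} for name in DELETION_ORDER}
cert = run_deletion_cascade(stores, "session-1", reason="erasure_request")
print(cert["deleted_stores"])
```

The same function serves both triggers: the retention job and an erasure request differ only in the `reason` they pass and in when they call it.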

What Stays After Deletion

Only these records survive deletion:

Audit logs (immutable, anonymized references)
Deletion certificates (proof that deletion occurred)
Anonymized aggregate metrics (already de-identified, cannot be linked back)

Backup Scrubbing

Deleting data from the primary database is not enough. The GDPR right to erasure covers backup copies as well, so backups must also be scrubbed within 30 days of a deletion event. The backup scrub scheduler tracks this deadline and generates a supplemental certificate when the scrub is complete.
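The deadline tracking can be illustrated with simple date arithmetic. The sketch assumes the scheduler keeps a mapping from deletion event ID to the date the primary deletion completed; the names `scrub_deadline` and `overdue_scrubs` are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

BACKUP_SCRUB_WINDOW = timedelta(days=30)


def scrub_deadline(deleted_on: date) -> date:
    """Last day on which the backup scrub for a deletion may complete."""
    return deleted_on + BACKUP_SCRUB_WINDOW


def overdue_scrubs(pending: Dict[str, date], today: date) -> List[str]:
    """Return IDs of deletion events whose backup scrub is past its deadline."""
    return sorted(
        deletion_id
        for deletion_id, deleted_on in pending.items()
        if today > scrub_deadline(deleted_on)
    )


pending = {"del-1": date(2024, 1, 1), "del-2": date(2024, 2, 20)}
print(scrub_deadline(date(2024, 1, 1)))          # 2024-01-31
print(overdue_scrubs(pending, date(2024, 3, 1)))  # ['del-1']
```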

Retention Configuration

Retention periods are configurable per tenant via the admin panel. Minimum retention is 90 days (to allow for dispute resolution). Maximum is 7 years (for SOC 2 audit log requirements). Default is 2 years.
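The bounds above translate directly into a validation rule. This sketch assumes retention is stored in days and treats a year as 365 days; the function names are illustrative, and rejecting out-of-range values (rather than silently clamping them) is a design choice of the sketch, not something this page specifies.

```python
from datetime import date, timedelta
from typing import Optional

MIN_RETENTION_DAYS = 90           # allows for dispute resolution
MAX_RETENTION_DAYS = 7 * 365      # 7 years, assuming 365-day years
DEFAULT_RETENTION_DAYS = 2 * 365  # 2 years


def resolve_retention_days(requested: Optional[int] = None) -> int:
    """Return the tenant's retention period, or the default if none is set."""
    if requested is None:
        return DEFAULT_RETENTION_DAYS
    if not MIN_RETENTION_DAYS <= requested <= MAX_RETENTION_DAYS:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_DAYS} and "
            f"{MAX_RETENTION_DAYS} days, got {requested}"
        )
    return requested


def retention_expiry(created_on: date, retention_days: int) -> date:
    """Date on which data created on created_on becomes due for deletion."""
    return created_on + timedelta(days=retention_days)


print(resolve_retention_days())      # 730
print(resolve_retention_days(180))   # 180
print(retention_expiry(date(2024, 1, 1), resolve_retention_days(90)))  # 2024-03-31
```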
