Disaster Recovery & Business Continuity


Keeping the Platform Running When Things Go Wrong

Disaster Recovery (DR) and Business Continuity Planning (BCP) ensure that ReGenesis remains available and that data is recoverable even in the face of infrastructure failures, regional outages, cyberattacks, or provider disruptions. For an AI coaching platform where scheduled sessions and ongoing coaching relationships depend on availability, unplanned downtime directly damages the user experience and client trust.

Enterprise clients require documented DR/BCP with specific Recovery Point Objectives (RPO) -- how much data can be lost -- and Recovery Time Objectives (RTO) -- how long until service is restored. ReGenesis targets progressively tighter RPO/RTO as the platform matures: from RPO 24h / RTO 8h at MVP0 to RPO 5min / RTO 30min at Global. Daily database snapshots, Multi-AZ redundancy, and a planned cross-region replication strategy provide the foundation for data durability.

A distinctive challenge for ReGenesis is its dependency on Anthropic's Claude API for the Sasha coaching engine. If Claude experiences an outage, Sasha cannot generate new responses. The DR plan includes graceful degradation strategies for this scenario -- keeping previously generated insights visible from cache, allowing human coaching to continue without AI, and providing transparent status communication. These strategies need further development, as noted in the Blueprint Supplement gap analysis.

Business Continuity Overview

Blueprint Supplement Gap

DR/BCP needs more development before the McKinsey pilot. The current document describes the target architecture, but detailed runbooks, failover procedures, and DR testing have not yet been implemented. This section documents both the target state and what needs to be built.

Recovery Objectives by Stage

RPO and RTO targets get tighter as the platform matures and the business can invest in more resilient infrastructure. The approach is pragmatic: daily backups and basic health checks at MVP0, with a clear plan for how DR investment scales with each stage.

Metric	MVP0 Demo	Pilot (McKinsey)	GA Launch	Global
RPO	24 hours	1 hour	15 minutes	5 minutes
RTO	8 hours	4 hours	1 hour	30 minutes
MTTR	Best effort	2 hours	45 minutes	20 minutes
Availability	Best effort	99.9% (8.7h/yr)	99.95% (4.4h/yr)	99.99% (52min/yr)
Backup frequency	Daily snapshot	Continuous PITR + daily cross-region	Continuous PITR + real-time cross-region	Multi-region active-active
DR testing	None	Monthly backup restore	Quarterly failover drill	Monthly chaos engineering
Vector DB	Manual export	Daily pgvector snapshots	Continuous replication	Multi-region vector sync
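The downtime figures in the Availability row follow directly from the percentage targets. The sketch below shows the arithmetic, assuming a 365-day year; the function name is illustrative and not part of any ReGenesis codebase.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored in this sketch


def downtime_budget_hours(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in hours, allowed by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Stage names and targets are taken from the table above.
for stage, target in [("Pilot", 99.9), ("GA Launch", 99.95), ("Global", 99.99)]:
    hours = downtime_budget_hours(target)
    print(f"{stage}: {target}% -> {hours:.2f} h/yr ({hours * 60:.1f} min/yr)")
```

This gives roughly 8.76 hours per year at 99.9%, 4.38 hours at 99.95%, and about 52.6 minutes at 99.99%, which matches the rounded values in the table.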

How Each Stage Achieves Its Targets

MVP0 (RPO 24h / RTO 8h): Daily automated RDS snapshots. Manual recovery from snapshot. Acceptable because demo data is not production-critical.
Pilot (RPO 1h / RTO 4h): RDS point-in-time recovery (5-min granularity), Multi-AZ deployment, daily cross-region snapshot copies. Automated AZ failover. Vector DB (pgvector) daily snapshots.
GA (RPO 15min / RTO 1h): Cross-region read replica with async replication (under 15 min lag). Automated Route 53 health check failover. Warm standby in DR region. Real-time S3 cross-region replication. Vector DB continuous replication.
Global (RPO 5min / RTO 30min): Multi-region active-active with global database (Aurora Global Database or equivalent). Automated failover with under 1 min detection. Edge-level DNS failover. Full infrastructure pre-provisioned in all regions.
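One way to reason about the list above is to compare the worst-case data loss of each mechanism with the RPO target of each stage. The following is a minimal sketch under that assumption; the dictionary and function names are hypothetical, and the worst-case figures passed in are the ones stated above (24 hours for daily snapshots, 5 minutes for point-in-time recovery).

```python
from datetime import timedelta

# RPO targets per stage, copied from the Recovery Objectives table.
RPO_TARGETS = {
    "MVP0": timedelta(hours=24),
    "Pilot": timedelta(hours=1),
    "GA": timedelta(minutes=15),
    "Global": timedelta(minutes=5),
}


def stages_meeting_rpo(worst_case_loss: timedelta) -> list[str]:
    """Return the stages whose RPO target tolerates the given worst-case data loss."""
    return [stage for stage, rpo in RPO_TARGETS.items() if worst_case_loss <= rpo]


# Daily snapshots can lose up to a full day of writes.
print(stages_meeting_rpo(timedelta(hours=24)))
# Point-in-time recovery with 5-minute granularity.
print(stages_meeting_rpo(timedelta(minutes=5)))
```

Under these assumptions, daily snapshots alone only satisfy the MVP0 target, which is why the Pilot stage adds point-in-time recovery.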

What Could Go Wrong

Data Backup Strategy

Data Type	Backup Method	Frequency	Retention	Storage Location
PostgreSQL Database	RDS Automated Snapshots	Daily	35 days	Same region (Multi-AZ)
PostgreSQL Database	Cross-region snapshot copy	Daily	14 days	us-west-2 (backup region)
PostgreSQL Database	Point-in-time recovery	Continuous (5-min granularity)	35 days	Same region
S3 Data	Cross-region replication	Real-time	Same as source	us-west-2
Audit Logs	S3 WORM (immutable)	Real-time	1-7 years	Same region + cross-region
Application Configuration	Git repository	On every change	Indefinite	GitHub (encrypted)
Infrastructure State	Terraform state	On every apply	Versioned (30 days)	S3 + DynamoDB lock
Secrets	AWS Secrets Manager	On every change	Versioned	Multi-AZ (automatic)
Vector DB (pgvector)	pg_dump of vector tables + RDS snapshot	Daily (Pilot), Continuous (GA)	14 days	Same region + cross-region
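The retention periods above imply that older backups must eventually be expired. RDS handles this automatically for automated snapshots within the configured retention window. If cross-region copies were managed by a custom job instead, it would need its own pruning step. The sketch below illustrates only that selection logic; the tuple layout, location labels, and function name are assumptions, and no AWS API is called.

```python
from datetime import date

# Retention in days, taken from the table above for database snapshots.
RETENTION_DAYS = {"same-region": 35, "cross-region": 14}


def expired_snapshots(snapshots, today: date) -> list[str]:
    """Return IDs of snapshots older than the retention period for their location.

    `snapshots` is a list of (snapshot_id, location, created_on) tuples,
    a hypothetical shape chosen for this example.
    """
    return [
        snapshot_id
        for snapshot_id, location, created_on in snapshots
        if (today - created_on).days > RETENTION_DAYS[location]
    ]


inventory = [
    ("snap-a", "same-region", date(2025, 1, 20)),
    ("snap-b", "same-region", date(2025, 2, 10)),
    ("snap-c", "cross-region", date(2025, 2, 10)),
    ("snap-d", "cross-region", date(2025, 2, 20)),
]
print(expired_snapshots(inventory, today=date(2025, 3, 1)))
```

Keeping the retention values in one mapping means a change to the table above only needs to be mirrored in one place.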

Multi-AZ Architecture (Current)

Multi-Region Plan (Future)

GA Target

Full multi-region deployment is targeted for GA or post-GA. For the pilot, Multi-AZ within us-east-1 provides sufficient redundancy. Cross-region backup copies provide data durability.

Provider Failover: Claude API Down

The Sasha AI engine depends on Anthropic's Claude API. If Claude is unavailable:

Phase	Timing	Behavior
Detection	0-5 min	Health checks detect Claude API failure
Graceful degradation	5-15 min	Users see "AI features temporarily unavailable" banner
Continued operation	Duration of outage	Human coaching sessions continue normally; Sasha interactions queued or disabled
Cached responses	Duration of outage	Previously generated insights remain visible; no new AI generation
Recovery	When Claude returns	Queued requests processed; normal operation resumes
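The phases in the table resemble a circuit-breaker pattern: stop calling the provider after repeated failures, queue requests while degraded, and replay them on recovery. The following is a minimal sketch of that idea, not the Sasha implementation. The class name, thresholds, and the `call_model` callable (standing in for whatever function actually calls the Claude API) are all assumptions.

```python
import time
from collections import deque

BANNER = "AI features temporarily unavailable"


class SashaGateway:
    """Hypothetical wrapper that degrades gracefully when the AI provider fails."""

    def __init__(self, call_model, failure_threshold=3, retry_after_s=60.0,
                 clock=time.monotonic):
        self._call_model = call_model          # assumed: takes a prompt, returns a str
        self._failure_threshold = failure_threshold
        self._retry_after_s = retry_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None                 # time the breaker opened, or None
        self.queue = deque()                   # prompts waiting for recovery

    @property
    def degraded(self) -> bool:
        return self._opened_at is not None

    def _try(self, prompt):
        """Return the model reply, or None if the call is skipped or fails."""
        now = self._clock()
        if self._opened_at is not None and now - self._opened_at < self._retry_after_s:
            return None                        # breaker open: do not call the provider
        try:
            reply = self._call_model(prompt)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = now          # open (or re-open) the breaker
            return None
        self._failures = 0
        self._opened_at = None
        return reply

    def ask(self, prompt):
        """Answer a prompt, or queue it and return the degradation banner."""
        reply = self._try(prompt)
        if reply is None:
            self.queue.append(prompt)
            return {"status": "degraded", "message": BANNER}
        return {"status": "ok", "reply": reply}

    def drain(self):
        """Replay queued prompts in order; stop at the first failure."""
        replies = []
        while self.queue:
            reply = self._try(self.queue[0])
            if reply is None:
                break
            self.queue.popleft()
            replies.append(reply)
        return replies
```

A scheduled job could call `drain()` once health checks report that the provider is back. Whether queued coaching prompts should be replayed at all, or simply discarded after an outage, is a product decision this sketch does not settle.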

Alternative AI provider fallback (future consideration):

OpenAI GPT-4 as secondary AI provider
Requires separate integration and prompt adaptation
Not a simple swap -- Sasha's personality and coaching methodology are tuned for Claude
Decision: Evaluate cost/benefit at GA based on Claude reliability history

Status Page

A public status page provides transparency to clients during incidents:

Component	Status Indicators
Web Application	Operational / Degraded / Major Outage / Maintenance
API	Operational / Degraded / Major Outage / Maintenance
AI Coaching (Sasha)	Operational / Degraded / Unavailable
Authentication (SSO)	Operational / Degraded / Major Outage
Data Export	Operational / Degraded / Major Outage
Mobile App	Operational / Degraded / Major Outage

Platform: Statuspage.io or Instatus (to be selected)

DR Testing Schedule

Test Type	Frequency	Description	Status
Backup restore	Monthly	Restore database from snapshot to verify integrity	Not started
Failover test	Quarterly	Trigger AZ failover to verify automatic recovery	Not started
DR drill	Semi-annually	Full DR exercise: simulate region failure, execute runbook	Not started
Tabletop exercise	Annually	Walk through cyberattack scenario with full team	Not started
Chaos engineering	Post-GA	Intentional failure injection in production (Game Days)	Future

RTO/RPO Validation Criteria

Each DR test must validate that actual recovery times meet targets. The following acceptance criteria apply.

Test	Metric Validated	Pass Criteria	Evidence Collected
Backup restore test	RPO	Data loss window matches expected RPO	Timestamp comparison
AZ failover test	RTO (automatic)	Service restored within 5 minutes	CloudWatch metrics
Region failover drill	RTO (manual)	Restored within stage-appropriate RTO	Timestamped runbook log
Point-in-time recovery	RPO granularity	Recoverable to 5-minute granularity	Restored DB row counts
Backup integrity	Data integrity	All tables present with referential integrity	Automated verification script
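The pass criteria for RPO and RTO reduce to two timestamp differences: the gap between the last recoverable write and the failure (actual RPO), and the gap between the failure and service restoration (actual RTO). The sketch below shows one way to compute and compare them. The dataclass, function name, and timestamps are illustrative; the targets in the example are the Pilot values from the Recovery Objectives table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    actual_rpo: timedelta
    actual_rto: timedelta
    rpo_met: bool
    rto_met: bool


def evaluate_drill(last_recoverable_write: datetime, failure_at: datetime,
                   service_restored_at: datetime,
                   target_rpo: timedelta, target_rto: timedelta) -> DrillResult:
    """Compare measured recovery figures from a DR test against stage targets."""
    actual_rpo = failure_at - last_recoverable_write   # data written in this window is lost
    actual_rto = service_restored_at - failure_at      # time the service was unavailable
    return DrillResult(actual_rpo, actual_rto,
                       actual_rpo <= target_rpo, actual_rto <= target_rto)


# Example drill evaluated against Pilot targets (RPO 1 hour, RTO 4 hours).
result = evaluate_drill(
    last_recoverable_write=datetime(2025, 1, 1, 11, 20),
    failure_at=datetime(2025, 1, 1, 12, 0),
    service_restored_at=datetime(2025, 1, 1, 15, 30),
    target_rpo=timedelta(hours=1),
    target_rto=timedelta(hours=4),
)
print(result)
```

The fields of the result map onto the "Actual vs. target RTO" and "Actual vs. target RPO" entries of the quarterly report.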

Quarterly DR Report Requirements:

Date and time of DR test
Scenario tested
Actual vs. target RTO
Actual vs. target RPO
Deviations or failures observed
Remediation actions taken or planned
Security Officer sign-off

What's Missing and Needs to Be Built

Before McKinsey Pilot (required):

Documented incident response runbooks for top 5 failure scenarios
Database backup verification (monthly restore test)
Status page setup and integration with monitoring
Communication plan templates (client notification, internal escalation)
Basic AZ failover testing

Before GA (required):

Cross-region backup replication operational
DR drill executed and results documented
Formal BCP document reviewed and approved
Cyber insurance policy in place
Claude API fallback strategy implemented
Full DR testing program with quarterly cadence

Post-GA (planned):

Multi-region warm standby deployment
Automated region failover via Route 53
Chaos engineering program
Annual BCP/DR plan review and update

Communication Plans

Scenario	Internal Notification	Client Notification	Public Communication
Planned maintenance	1 week ahead via Slack	72 hours ahead via email	Status page scheduled event
Minor incident (under 30 min)	Slack #incidents	Post-incident via email	Status page update
Major incident (30 min+)	PagerDuty + Slack	Real-time via email + in-app	Status page + social media
Data breach	Immediate to Security Officer	Within 72 hours (per DPA)	As required by law
Extended outage (4h+)	Executive escalation	Hourly updates via email	Status page + dedicated update page
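For unplanned incidents, the table distinguishes scenarios by duration, so the escalation step can be expressed as a simple lookup. The sketch below encodes the three duration-based rows; planned maintenance and data breaches are triggered by other criteria and are left out. The function and dictionary names are hypothetical.

```python
from datetime import timedelta

# Channels per scenario, copied from the duration-based rows of the table above.
NOTIFICATIONS = {
    "minor": {"internal": "Slack #incidents",
              "client": "Post-incident via email",
              "public": "Status page update"},
    "major": {"internal": "PagerDuty + Slack",
              "client": "Real-time via email + in-app",
              "public": "Status page + social media"},
    "extended": {"internal": "Executive escalation",
                 "client": "Hourly updates via email",
                 "public": "Status page + dedicated update page"},
}


def classify_incident(duration: timedelta) -> str:
    """Map elapsed incident time to the scenario used in the communication plan."""
    if duration >= timedelta(hours=4):
        return "extended"
    if duration >= timedelta(minutes=30):
        return "major"
    return "minor"


scenario = classify_incident(timedelta(minutes=45))
print(scenario, NOTIFICATIONS[scenario]["client"])
```

Because an incident's duration grows while it is open, a check like this would need to be re-evaluated periodically so that a minor incident escalates once it crosses the 30-minute and 4-hour thresholds.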

Key Person Unavailability

Early-stage startups face a DR risk that is easy to overlook: a critical team member may suddenly become unavailable. For ReGenesis, this is especially relevant during MVP0 and Pilot, when the team is small.

Key Person	Risk	Mitigation
Jesse (Founder/CEO)	Strategic decisions blocked, client relationships disrupted	Documented decision framework, deputy authority for CTO on technical decisions, key client contacts shared with team
CTO / Lead Engineer	Deployments blocked, incident response impaired	Runbooks for all critical procedures, cross-trained backup engineer, IaC means infrastructure is reproducible
Security Officer	Compliance decisions delayed, incident response leadership gap	Deputy Security Officer designated, IR plan documented so any trained responder can lead
Sole DevOps Engineer	Infrastructure changes impossible, DR execution at risk	All infrastructure as code (Terraform), runbooks for common operations, cloud provider support contract

Mitigation principles:

Document everything -- procedures, credentials (in secrets manager), vendor contacts
Cross-train -- no single person should be the only one who knows how to do a critical task
Escalation paths -- define who takes over each responsibility if the primary is unavailable
Vendor support contracts -- AWS Business/Enterprise Support can assist with infrastructure issues even if the DevOps engineer is unavailable

Vendor Failure Contingencies

Vendor	Failure Scenario	Contingency
Anthropic (Claude API)	Extended outage (> 4 hours)	Graceful degradation mode (human coaching continues, AI features disabled). Evaluate OpenAI GPT-4 as emergency backup at GA stage.
AWS (Primary Cloud)	Regional outage	Cross-region DR to us-west-2. Multi-AZ handles single-AZ failures automatically.
Auth Provider (Clerk/Auth0)	Authentication service down	Cached session tokens remain valid for configured duration. Emergency bypass for admin access via pre-shared recovery codes.
Stripe	Payment processing down	Coaching sessions continue uninterrupted. Billing retries automatically. No user-facing impact.
Datadog	Monitoring platform down	CloudWatch provides baseline monitoring. PagerDuty alerts still function via CloudWatch SNS integration.

Cyber Insurance

Coverage	Target Amount	Status
Cyber liability	$5,000,000	Required before pilot
Data breach response	Included	Covers forensics, notification, credit monitoring
Business interruption	$1,000,000	Covers revenue loss during outage
Regulatory defense	$1,000,000	Covers legal costs for GDPR/regulatory actions
Social engineering	$500,000	Covers losses from phishing/fraud
