Disaster Recovery & Business Continuity
Keeping the Platform Running When Things Go Wrong
Disaster Recovery (DR) and Business Continuity Planning (BCP) ensure that ReGenesis remains available and that data is recoverable even in the face of infrastructure failures, regional outages, cyberattacks, or provider disruptions. For an AI coaching platform where scheduled sessions and ongoing coaching relationships depend on availability, unplanned downtime directly damages the user experience and client trust.
Enterprise clients require documented DR/BCP with specific Recovery Point Objectives (RPO) -- how much data can be lost -- and Recovery Time Objectives (RTO) -- how long until service is restored. ReGenesis targets progressively tighter RPO/RTO as the platform matures: from RPO 24h / RTO 8h at MVP0 to RPO 5min / RTO 30min at Global. Daily database snapshots, Multi-AZ redundancy, and a planned cross-region replication strategy provide the foundation for data durability.
ReGenesis also carries a provider dependency that infrastructure redundancy cannot remove: the Sasha coaching engine relies on Anthropic's Claude API. If Claude experiences an outage, Sasha cannot function. The DR plan includes graceful degradation strategies for this scenario -- keeping previously generated (cached) responses visible, allowing human coaching to continue without AI, and providing transparent status communication. This is an area that needs further development, as noted in the Blueprint Supplement gap analysis.
Business Continuity Overview
DR/BCP needs more development before the McKinsey pilot. The current document describes the target architecture, but detailed runbooks, failover procedures, and DR testing have not yet been implemented. This section documents both the target state and what needs to be built.
Recovery Objectives by Stage
RPO and RTO targets get tighter as the platform matures and the business can invest in more resilient infrastructure. The approach is pragmatic: daily backups and basic health checks at MVP0, with a clear plan for how DR investment scales with each stage.
| Metric | MVP0 Demo | Pilot (McKinsey) | GA Launch | Global |
|---|---|---|---|---|
| RPO | 24 hours | 1 hour | 15 minutes | 5 minutes |
| RTO | 8 hours | 4 hours | 1 hour | 30 minutes |
| MTTR | Best effort | 2 hours | 45 minutes | 20 minutes |
| Availability | Best effort | 99.9% (8.76h/yr) | 99.95% (4.38h/yr) | 99.99% (52.6min/yr) |
| Backup frequency | Daily snapshot | Continuous PITR + daily cross-region | Continuous PITR + real-time cross-region | Multi-region active-active |
| DR testing | None | Monthly backup restore | Quarterly failover drill | Monthly chaos engineering |
| Vector DB | Manual export | Daily pgvector snapshots | Continuous replication | Multi-region vector sync |
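The downtime budgets in the Availability row follow directly from the availability percentage. As a minimal sketch of that arithmetic (the function name and the 365-day year are assumptions made for illustration, not part of the platform):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# Availability targets taken from the table above.
for stage, target in [("Pilot", 99.9), ("GA Launch", 99.95), ("Global", 99.99)]:
    hours = allowed_downtime_hours(target)
    print(f"{stage}: {target}% allows {hours:.2f} h/yr ({hours * 60:.1f} min/yr)")
```

The budget covers all unplanned downtime in a year, so a single incident that consumes the full RTO at the Pilot stage (4 hours) would use almost half of the 99.9% allowance.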
How Each Stage Achieves Its Targets
- MVP0 (RPO 24h / RTO 8h): Daily automated RDS snapshots. Manual recovery from snapshot. Acceptable because demo data is not production-critical.
- Pilot (RPO 1h / RTO 4h): RDS point-in-time recovery (latest restorable time is typically within the last 5 minutes), Multi-AZ deployment, daily cross-region snapshot copies. Automated AZ failover. Vector DB (pgvector) daily snapshots.
- GA (RPO 15min / RTO 1h): Cross-region read replica with async replication (under 15 min lag). Automated Route 53 health check failover. Warm standby in DR region. Real-time S3 cross-region replication. Vector DB continuous replication.
- Global (RPO 5min / RTO 30min): Multi-region active-active with global database (Aurora Global Database or equivalent). Automated failover with under 1 min detection. Edge-level DNS failover. Full infrastructure pre-provisioned in all regions.
What Could Go Wrong
Data Backup Strategy
| Data Type | Backup Method | Frequency | Retention | Storage Location |
|---|---|---|---|---|
| PostgreSQL Database | RDS Automated Snapshots | Daily | 35 days | Same region (Multi-AZ) |
| PostgreSQL Database | Cross-region snapshot copy | Daily | 14 days | us-west-2 (backup region) |
| PostgreSQL Database | Point-in-time recovery | Continuous (latest restorable time typically within the last 5 minutes) | 35 days | Same region |
| S3 Data | Cross-region replication | Real-time | Same as source | us-west-2 |
| Audit Logs | S3 WORM (immutable) | Real-time | 1-7 years | Same region + cross-region |
| Application Configuration | Git repository | On every change | Indefinite | GitHub (encrypted) |
| Infrastructure State | Terraform state | On every apply | Versioned (30 days) | S3 + DynamoDB lock |
| Secrets | AWS Secrets Manager | On every change | Versioned | Multi-AZ (automatic) |
| Vector DB (pgvector) | pg_dump of vector tables + RDS snapshot | Daily (Pilot), Continuous (GA) | 14 days | Same region + cross-region |
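One way to monitor this strategy is to compare the age of the newest recovery point (snapshot, PITR restorable time, or replica position) against the stage's RPO. The sketch below assumes the recovery-point timestamp has already been obtained from the backup system; the names `RPO_TARGETS`, `data_loss_window`, and `meets_rpo` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# RPO targets per stage, copied from the recovery objectives table.
RPO_TARGETS = {
    "MVP0": timedelta(hours=24),
    "Pilot": timedelta(hours=1),
    "GA": timedelta(minutes=15),
    "Global": timedelta(minutes=5),
}


def data_loss_window(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Time between the newest recoverable state and the failure."""
    if last_recovery_point > failure_time:
        raise ValueError("recovery point cannot be later than the failure")
    return failure_time - last_recovery_point


def meets_rpo(stage: str, last_recovery_point: datetime, failure_time: datetime) -> bool:
    """True if restoring from the recovery point stays within the stage's RPO."""
    return data_loss_window(last_recovery_point, failure_time) <= RPO_TARGETS[stage]


# Example: a daily snapshot taken 20 hours before a failure.
failure = datetime(2025, 1, 2, 8, 0, tzinfo=timezone.utc)
snapshot = failure - timedelta(hours=20)
print(meets_rpo("MVP0", snapshot, failure))   # True: 20 h is within the 24 h target
print(meets_rpo("Pilot", snapshot, failure))  # False: 20 h exceeds the 1 h target
```

The example shows why daily snapshots alone satisfy the MVP0 target but not the Pilot target, which is where point-in-time recovery comes in.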
Multi-AZ Architecture (Current)
Multi-Region Plan (Future)
Full multi-region deployment is targeted for GA or post-GA. For the pilot, Multi-AZ within us-east-1 provides sufficient redundancy. Cross-region backup copies provide data durability.
Provider Failover: Claude API Down
The Sasha AI engine depends on Anthropic's Claude API. If Claude is unavailable:
| Phase | Timing | Behavior |
|---|---|---|
| Detection | 0-5 min | Health checks detect Claude API failure |
| Graceful degradation (notification) | 5-15 min | Users see "AI features temporarily unavailable" banner |
| Continued operation | Duration of outage | Human coaching sessions continue normally; Sasha interactions queued or disabled |
| Cached responses | Duration of outage | Previously generated insights remain visible; no new AI generation |
| Recovery | When Claude returns | Queued requests processed; normal operation resumes |
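The detect, degrade, queue, and recover sequence in the table can be sketched as a small wrapper around the model call. This is an illustrative sketch only: `DegradableAIClient`, the failure threshold of 3, and the injected `call_model` function are assumptions, and no real Anthropic SDK call is shown.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class DegradableAIClient:
    """Queues AI requests while the provider is failing, replays them on recovery."""

    def __init__(self, call_model: Callable[[str], str], failure_threshold: int = 3):
        self._call_model = call_model  # hypothetical function that calls the AI provider
        self._failure_threshold = failure_threshold
        self._consecutive_failures = 0
        self.queued: Deque[str] = deque()

    @property
    def degraded(self) -> bool:
        """True once enough consecutive failures have been seen to show the banner."""
        return self._consecutive_failures >= self._failure_threshold

    def ask(self, prompt: str) -> Optional[str]:
        """Return the reply, or None if the prompt was queued for later."""
        if self.degraded:
            self.queued.append(prompt)
            return None
        try:
            reply = self._call_model(prompt)
        except Exception:  # broad on purpose: any provider error counts as a failure
            self._consecutive_failures += 1
            self.queued.append(prompt)
            return None
        self._consecutive_failures = 0
        return reply

    def probe_and_recover(self, health_prompt: str = "ping") -> List[str]:
        """Health check; if the provider is back, replay queued prompts in order."""
        try:
            self._call_model(health_prompt)
        except Exception:
            return []
        self._consecutive_failures = 0
        replies: List[str] = []
        while self.queued:
            try:
                replies.append(self._call_model(self.queued[0]))
            except Exception:
                # Provider failed again mid-replay: re-enter degraded mode, keep the rest queued.
                self._consecutive_failures = self._failure_threshold
                break
            self.queued.popleft()
        return replies
```

A prompt is removed from the queue only after its replay succeeds, so a second outage during recovery does not drop requests. Whether queued coaching prompts are still worth answering after a long outage is a product decision this sketch does not address.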
Alternative AI provider fallback (future consideration):
- OpenAI GPT-4 as secondary AI provider
- Requires separate integration and prompt adaptation
- Not a simple swap -- Sasha's personality and coaching methodology are tuned for Claude
- Decision: Evaluate cost/benefit at GA based on Claude reliability history
Status Page
A public status page provides transparency to clients during incidents:
| Component | Status Indicators |
|---|---|
| Web Application | Operational / Degraded / Major Outage / Maintenance |
| API | Operational / Degraded / Major Outage / Maintenance |
| AI Coaching (Sasha) | Operational / Degraded / Unavailable |
| Authentication (SSO) | Operational / Degraded / Major Outage |
| Data Export | Operational / Degraded / Major Outage |
| Mobile App | Operational / Degraded / Major Outage |
Platform: Statuspage.io or Instatus (to be selected)
DR Testing Schedule
| Test Type | Frequency | Description | Status |
|---|---|---|---|
| Backup restore | Monthly | Restore database from snapshot to verify integrity | Not started |
| Failover test | Quarterly | Trigger AZ failover to verify automatic recovery | Not started |
| DR drill | Semi-annually | Full DR exercise: simulate region failure, execute runbook | Not started |
| Tabletop exercise | Annually | Walk through cyberattack scenario with full team | Not started |
| Chaos engineering | Post-GA | Intentional failure injection in production (Game Days) | Future |
RTO/RPO Validation Criteria
Each DR test must validate that actual recovery times meet targets. The following acceptance criteria apply.
| Test | Metric Validated | Pass Criteria | Evidence Collected |
|---|---|---|---|
| Backup restore test | RPO | Data loss window matches expected RPO | Timestamp comparison |
| AZ failover test | RTO (automatic) | Service restored within 5 minutes | CloudWatch metrics |
| Region failover drill | RTO (manual) | Restored within stage-appropriate RTO | Timestamped runbook log |
| Point-in-time recovery | RPO granularity | Recoverable to a point within 5 minutes of the simulated failure time | Restored DB row counts |
| Backup integrity | Data integrity | All tables present with referential integrity | Automated verification script |
Quarterly DR Report Requirements:
- Date and time of DR test
- Scenario tested
- Actual vs. target RTO
- Actual vs. target RPO
- Deviations or failures observed
- Remediation actions taken or planned
- Security Officer sign-off
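The report fields listed above map naturally onto a structured record, which makes actual-versus-target comparisons mechanical. The following is a sketch under assumptions: the class name `DrTestReport`, its field names, and the completeness rule (sign-off present, and every missed target has at least one remediation action) are illustrative choices, not an established ReGenesis format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class DrTestReport:
    tested_at: datetime                 # date and time of DR test
    scenario: str                       # scenario tested
    target_rto: timedelta
    actual_rto: timedelta
    target_rpo: timedelta
    actual_rpo: timedelta
    deviations: List[str] = field(default_factory=list)
    remediation: List[str] = field(default_factory=list)
    security_officer_signoff: bool = False

    @property
    def rto_met(self) -> bool:
        return self.actual_rto <= self.target_rto

    @property
    def rpo_met(self) -> bool:
        return self.actual_rpo <= self.target_rpo

    @property
    def complete(self) -> bool:
        """Assumed rule: signed off, and any missed target has a remediation action."""
        missed_target = not (self.rto_met and self.rpo_met)
        return self.security_officer_signoff and (not missed_target or bool(self.remediation))


# Example using the Pilot targets (RPO 1h / RTO 4h) with a recovery that ran long.
report = DrTestReport(
    tested_at=datetime(2025, 3, 1, 9, 0),
    scenario="Region failover drill",
    target_rto=timedelta(hours=4),
    actual_rto=timedelta(hours=5),
    target_rpo=timedelta(hours=1),
    actual_rpo=timedelta(minutes=20),
    deviations=["Runbook step for DNS cutover was out of date"],
)
print(report.rto_met, report.rpo_met, report.complete)
```

In this example the RPO target is met but the RTO target is missed, and the report stays incomplete until a remediation action is recorded and the Security Officer signs off.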
Before McKinsey Pilot (required):
- Documented incident response runbooks for top 5 failure scenarios
- Database backup verification (monthly restore test)
- Status page setup and integration with monitoring
- Communication plan templates (client notification, internal escalation)
- Basic AZ failover testing
Before GA (required):
- Cross-region backup replication operational
- DR drill executed and results documented
- Formal BCP document reviewed and approved
- Cyber insurance policy in place
- Claude API fallback strategy implemented
- Full DR testing program with quarterly cadence
Post-GA (planned):
- Multi-region warm standby deployment
- Automated region failover via Route 53
- Chaos engineering program
- Annual BCP/DR plan review and update
Communication Plans
| Scenario | Internal Notification | Client Notification | Public Communication |
|---|---|---|---|
| Planned maintenance | 1 week ahead via Slack | 72 hours ahead via email | Status page scheduled event |
| Minor incident (under 30 min) | Slack #incidents | Post-incident via email | Status page update |
| Major incident (30 min+) | PagerDuty + Slack | Real-time via email + in-app | Status page + social media |
| Data breach | Immediate to Security Officer | Within 72 hours (per DPA) | As required by law |
| Extended outage (4h+) | Executive escalation | Hourly updates via email | Status page + dedicated update page |
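For unplanned outages, the duration thresholds in the table (under 30 minutes, 30 minutes or more, 4 hours or more) determine which row applies. A minimal sketch of that lookup, with hypothetical names `classify_outage` and `plan_for` and the channel text copied from the table:

```python
from datetime import timedelta
from typing import Dict

# Channels copied from the communication plan table (unplanned outages only).
COMMUNICATION_PLAN: Dict[str, Dict[str, str]] = {
    "minor": {
        "internal": "Slack #incidents",
        "client": "Post-incident via email",
        "public": "Status page update",
    },
    "major": {
        "internal": "PagerDuty + Slack",
        "client": "Real-time via email + in-app",
        "public": "Status page + social media",
    },
    "extended": {
        "internal": "Executive escalation",
        "client": "Hourly updates via email",
        "public": "Status page + dedicated update page",
    },
}


def classify_outage(duration: timedelta) -> str:
    """Map elapsed outage time to the severity rows used in the table."""
    if duration < timedelta(0):
        raise ValueError("duration cannot be negative")
    if duration >= timedelta(hours=4):
        return "extended"
    if duration >= timedelta(minutes=30):
        return "major"
    return "minor"


def plan_for(duration: timedelta) -> Dict[str, str]:
    """Return the notification channels for an outage of the given duration."""
    return COMMUNICATION_PLAN[classify_outage(duration)]


print(classify_outage(timedelta(minutes=45)))      # major
print(plan_for(timedelta(hours=5))["client"])      # Hourly updates via email
```

Because an incident's duration grows while it is in progress, the classification should be re-evaluated as time passes so that a major incident escalates to the extended-outage plan at the 4-hour mark. Planned maintenance and data breaches are not duration-based and are left out of this lookup.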
Key Person Unavailability
Early-stage startups face a DR risk that infrastructure redundancy does not cover: the sudden unavailability of a critical team member. For ReGenesis, this is especially relevant during MVP0 and Pilot, when the team is small.
| Key Person | Risk | Mitigation |
|---|---|---|
| Jesse (Founder/CEO) | Strategic decisions blocked, client relationships disrupted | Documented decision framework, deputy authority for CTO on technical decisions, key client contacts shared with team |
| CTO / Lead Engineer | Deployments blocked, incident response impaired | Runbooks for all critical procedures, cross-trained backup engineer, IaC means infrastructure is reproducible |
| Security Officer | Compliance decisions delayed, incident response leadership gap | Deputy Security Officer designated, IR plan documented so any trained responder can lead |
| Sole DevOps Engineer | Infrastructure changes impossible, DR execution at risk | All infrastructure as code (Terraform), runbooks for common operations, cloud provider support contract |
Mitigation principles:
- Document everything -- procedures, credentials (in secrets manager), vendor contacts
- Cross-train -- no single person should be the only one who knows how to do a critical task
- Escalation paths -- define who takes over each responsibility if the primary is unavailable
- Vendor support contracts -- AWS Business/Enterprise Support can assist with infrastructure issues even if the DevOps engineer is unavailable
Vendor Failure Contingencies
| Vendor | Failure Scenario | Contingency |
|---|---|---|
| Anthropic (Claude API) | Extended outage (> 4 hours) | Graceful degradation mode (human coaching continues, AI features disabled). Evaluate OpenAI GPT-4 as emergency backup at GA stage. |
| AWS (Primary Cloud) | Regional outage | Cross-region DR to us-west-2. Multi-AZ handles single-AZ failures automatically. |
| Auth Provider (Clerk/Auth0) | Authentication service down | Cached session tokens remain valid for configured duration. Emergency bypass for admin access via pre-shared recovery codes. |
| Stripe | Payment processing down | Coaching sessions continue uninterrupted. Billing retries automatically. No user-facing impact. |
| Datadog | Monitoring platform down | CloudWatch provides baseline monitoring. PagerDuty alerts still function via CloudWatch SNS integration. |
Cyber Insurance
| Coverage | Target Amount | Status / Notes |
|---|---|---|
| Cyber liability | $5,000,000 | Required before pilot |
| Data breach response | Included | Covers forensics, notification, credit monitoring |
| Business interruption | $1,000,000 | Covers revenue loss during outage |
| Regulatory defense | $1,000,000 | Covers legal costs for GDPR/regulatory actions |
| Social engineering | $500,000 | Covers losses from phishing/fraud |