Monitoring, Alerting & Security Operations
Detecting and Responding to Threats in Real Time
Monitoring and alerting systems are the eyes and ears of platform security. While prevention controls (encryption, access control, firewalls) stop most threats, monitoring systems detect the threats that get through. For ReGenesis, this means continuously watching for unauthorized access attempts, unusual data access patterns, infrastructure anomalies, and AI behavior that deviates from expected norms.
Enterprise clients expect documented monitoring architecture, proven incident detection capabilities, and transparent communication during security events. SOC 2 CC7.1-CC7.4 specifically require continuous monitoring, anomaly detection, and documented incident response procedures. The monitoring stack feeds directly into the incident response process: when something looks wrong, the right people are alerted within minutes and know exactly what to do.
ReGenesis uses a layered monitoring approach: Application Performance Monitoring (APM) for platform health, Security Information and Event Management (SIEM) for threat detection, AWS native services for infrastructure monitoring, and custom dashboards for AI-specific metrics. This architecture is currently in early specification and needs further development before the McKinsey pilot -- a gap acknowledged in the Blueprint Supplement.
Monitoring Architecture
The monitoring architecture needs more specification before pilot launch. The current document describes the target architecture, but detailed implementation plans, tool selection, and staffing requirements need to be finalized by Q2 2026.
Monitoring Layers
What the Platform Monitors
| Category | Metric | Alert Threshold | Severity |
|---|---|---|---|
| Availability | Uptime percentage | < 99.9% over 5 min | Critical |
| Latency | API response time (p95) | > 500ms | Warning |
| Latency | API response time (p99) | > 2000ms | Critical |
| Error Rate | 5xx error rate | > 1% of requests | Critical |
| Error Rate | 4xx error rate | > 10% of requests | Warning |
| Authentication | Failed login rate | > 10 per user per hour | Critical |
| Authentication | MFA failure rate | > 50% for any user | Warning |
| Data Access | Unusual data access volume | > 3x baseline for user | Warning |
| Data Access | Cross-role data access attempt | Any occurrence | Critical |
| AI | Claude API latency | > 10 seconds | Warning |
| AI | Claude API error rate | > 5% | Critical |
| AI | Guardrail trigger rate | > 10% of sessions | Warning |
| AI | Token usage per session | > 50,000 tokens | Warning |
| Infrastructure | CPU utilization | > 80% sustained (15 min) | Warning |
| Infrastructure | Memory utilization | > 85% | Warning |
| Infrastructure | Disk utilization | > 80% | Warning |
| Infrastructure | Database connections | > 80% of max | Warning |
| Security | WAF block rate | > 100 blocks/min | Warning |
| Security | GuardDuty finding | Any Medium+ | Critical |
| Certificate | TLS cert expiry | < 30 days | Warning |
| Certificate | TLS cert expiry | < 7 days | Critical |
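Static thresholds like these are easier to review and test when they are kept as data rather than scattered through alarm definitions. The following is a minimal Python sketch of that idea, covering only a subset of the rows. The metric keys and the `evaluate` helper are hypothetical names chosen for illustration, and each sample is assumed to be already aggregated over the window the table names (for example, CPU averaged over 15 minutes).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Threshold:
    metric: str                        # hypothetical key, not a real CloudWatch metric name
    breached: Callable[[float], bool]  # predicate taken from the "Alert Threshold" column
    severity: str                      # "Warning" or "Critical", as in the table


# A subset of the table rows, transcribed as rules.
THRESHOLDS = [
    Threshold("api_latency_p95_ms", lambda v: v > 500, "Warning"),
    Threshold("api_latency_p99_ms", lambda v: v > 2000, "Critical"),
    Threshold("error_rate_5xx_pct", lambda v: v > 1.0, "Critical"),
    Threshold("cpu_utilization_pct", lambda v: v > 80, "Warning"),
    Threshold("tls_cert_days_left", lambda v: v < 7, "Critical"),
    Threshold("tls_cert_days_left", lambda v: 7 <= v < 30, "Warning"),
]


def evaluate(samples: dict[str, float]) -> list[tuple[str, str]]:
    """Return (metric, severity) for every rule breached by the given samples."""
    return [
        (t.metric, t.severity)
        for t in THRESHOLDS
        if t.metric in samples and t.breached(samples[t.metric])
    ]
```

In production these rules would live in whichever alerting tool is selected; the sketch only shows that the table can double as a machine-checkable specification.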
SIEM: Security Event Correlation
The SIEM (Security Information and Event Management) system aggregates logs from all sources and correlates events to detect threats that individual log entries would not reveal.
Correlation examples:
| Pattern | Events Correlated | Indicates |
|---|---|---|
| Brute force | Multiple failed logins from same IP across different accounts | Credential stuffing attack |
| Account takeover | Successful login from new IP + immediate data export | Compromised account |
| Insider threat | Coach accessing coachee data outside of scheduled sessions at unusual hours | Unauthorized data access |
| Data exfiltration | Large data export + new API key creation + unusual outbound traffic | Data breach in progress |
| Privilege escalation | Admin role change + immediate access to previously restricted data | Privilege abuse |
| AI manipulation | Repeated guardrail triggers + changing prompt patterns from same user | Prompt injection attempt |
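To make the first correlation pattern concrete, here is a minimal Python sketch that flags source IPs with failed logins against many distinct accounts inside a sliding window. The `LoginEvent` shape, the 5-minute window, and the five-account cutoff are illustrative assumptions, not values from this specification or from any SIEM product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginEvent:          # hypothetical event shape for illustration
    timestamp: float       # seconds since epoch
    source_ip: str
    account: str
    success: bool


def credential_stuffing_ips(events, window_s=300, min_accounts=5):
    """Return IPs whose failed logins hit >= min_accounts distinct accounts within window_s."""
    failures = defaultdict(list)
    for e in events:
        if not e.success:
            failures[e.source_ip].append((e.timestamp, e.account))

    flagged = set()
    for ip, items in failures.items():
        items.sort()
        start = 0
        for end in range(len(items)):
            # Shrink the window from the left until it spans at most window_s seconds.
            while items[end][0] - items[start][0] > window_s:
                start += 1
            accounts = {acct for _, acct in items[start:end + 1]}
            if len(accounts) >= min_accounts:
                flagged.add(ip)
                break
    return flagged
```

A single log line ("login failed for user3") is unremarkable on its own; only the grouping by IP and time reveals the pattern, which is the point of correlation.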
Anomaly Detection
Beyond static thresholds, the monitoring system uses baseline-relative anomaly detection:
- Behavioral baselines are established per user (typical login times, data access patterns, session frequency)
- Deviations from baseline trigger alerts (e.g., coach who normally accesses 5 records/day suddenly accesses 500)
- Machine learning models (if/when scale justifies) identify subtle patterns
For MVP0/Pilot, anomaly detection will use CloudWatch Anomaly Detection (statistical) and simple threshold-based rules. ML-based anomaly detection is a GA feature. The SIEM tool selection (Datadog Security Monitoring vs. Splunk vs. AWS Security Hub) should be finalized by Q2 2026.
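The baseline-relative check described above can be sketched in a few lines of Python. The 3x multiplier mirrors the "Unusual data access volume" threshold in the metrics table; the function name, the 7-day minimum history, and the floor of one record per day are assumptions added so that new or idle users do not trigger spurious alerts.

```python
from statistics import mean


def is_access_anomaly(history, today, multiplier=3.0, min_history=7):
    """Flag today's record count if it exceeds multiplier x the user's mean daily count.

    history: past daily record-access counts for one user (hypothetical input shape).
    """
    if len(history) < min_history:
        return False  # not enough data to establish a baseline yet
    # Floor the baseline at 1 so a user with an all-zero history is not flagged on first use.
    baseline = max(mean(history), 1.0)
    return today > multiplier * baseline
```

With a history of five records per day, a day with 500 accesses is flagged while a day with 12 is not.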
LLM-Specific Monitoring (Sasha / Anthropic Claude)
Every call to the Anthropic Claude API costs money, takes time, and carries risk. This monitoring domain is unique to AI-first platforms and one of the most critical for ReGenesis.
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Tokens per session | Input + output tokens per coaching interaction | Cost control and anomaly detection |
| Cost per session | Dollar cost of each Sasha interaction | Margin management -- a runaway prompt burns cash fast |
| Daily/monthly API spend | Aggregate Anthropic API cost | Budget tracking, burn rate awareness |
| API latency | Time from request to first token and to completion | User experience during AI interactions |
| Error rate | Anthropic API 4xx/5xx responses | Service degradation or rate limiting |
| Rate limit hits | 429 responses from Anthropic | Capacity planning, need for rate limit increase |
| Token budget utilization | Actual vs. budgeted token spend | Early warning for cost overruns |
| Prompt injection signals | Flagged inputs matching injection patterns | Security: someone trying to manipulate Sasha |
| Output quality flags | Responses flagged by content safety filters | Quality assurance, safety monitoring |
For an AI-first product, LLM costs are variable and can spike unexpectedly. A single user with a runaway prompt loop could generate thousands of dollars in API charges. Monitoring token usage with automatic circuit breakers is not optional -- it is financial survival.
Recommendation: Set daily and monthly spend alerts from day one. Even during MVP0, the platform should track every dollar spent on Claude API calls. This tracking becomes margin visibility once the platform is charging clients.
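As an illustration of the per-session circuit breaker and cost tracking discussed above, here is a minimal Python sketch. The `SessionBudget` class and `session_cost_usd` function are hypothetical; the 50,000-token cap simply reuses the per-session warning threshold from the metrics table, and per-million-token prices are passed in as parameters because this document does not specify them.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    max_tokens: int = 50_000   # illustrative cap, reusing the table's warning threshold
    used_tokens: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        """Add one API call's token usage to the session total."""
        self.used_tokens += input_tokens + output_tokens

    @property
    def tripped(self) -> bool:
        """True once the session has exceeded its budget; callers should stop calling the API."""
        return self.used_tokens > self.max_tokens


def session_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost of a session, given prices per million input and output tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
```

The application would check `tripped` before each Claude API call and end or pause the session when it returns true, so a runaway loop is bounded by the budget rather than by the monthly invoice.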
Metrics to Track per Stage
The monitoring footprint grows as the platform matures. This table shows what needs to be in place at each stage:
| Metric Domain | MVP0 Demo | Pilot (McKinsey) | GA Launch | Global |
|---|---|---|---|---|
| Health checks | Basic /health endpoint | Detailed component health with dependency checks | Deep health with multi-level checks | Multi-region health mesh |
| Error tracking | Console logging + CloudWatch | Structured logging + Datadog APM | Full distributed tracing | Cross-region trace correlation |
| LLM monitoring | Manual token count review | Automated cost dashboard + daily spend alerts | Real-time cost alerts + per-session circuit breakers | Per-region LLM cost allocation |
| Security monitoring | Basic auth logging in CloudWatch | Failed login alerts, access anomalies, GuardDuty | SIEM integration (Datadog Security) + anomaly detection | Global threat correlation, SOC integration |
| Uptime monitoring | Manual checks | External synthetic monitoring (e.g., Checkly) | Statuspage.io + SLA tracking + SLO burn-rate alerts | Multi-region SLA with 99.95% target |
| Alerting | Email to founder | PagerDuty for P1, Slack for P2 | Full 3-tier alerting with runbooks | Follow-the-sun on-call rotation |
| Dashboards | None (CloudWatch console) | Core operational + LLM cost dashboard | Executive + operational + security dashboards | Regional dashboards + global rollup |
| Compliance evidence | Manual screenshots | Automated evidence via Vanta/Drata | Continuous compliance monitoring | Multi-framework evidence (SOC 2, ISO, GDPR) |
SIEM Integration (GA Stage)
At the GA stage, all monitoring signals feed into a Security Information and Event Management (SIEM) system. The SIEM correlates events across sources to detect complex attack patterns that no single monitoring tool would catch.
Note: The SIEM is not needed for MVP0 or Pilot -- it is a GA requirement. However, the logging and metrics built now become the data feeds for the SIEM later. Building structured logs from the start means the SIEM integration becomes plug-and-play rather than a retrofit.
| SIEM Capability | Tool | Purpose |
|---|---|---|
| Log aggregation | Datadog Logs / CloudWatch | Centralize all log streams |
| Threat detection | Datadog Security / GuardDuty | ML-based anomaly detection |
| Correlation rules | Custom detection rules | ReGenesis-specific threat patterns |
| Incident workflow | PagerDuty + Datadog | Alert routing, escalation, response tracking |
| Compliance reporting | Datadog compliance dashboards | SOC 2 / ISO 27001 evidence collection |
| Retention | S3 + Glacier | 1-year hot, 7-year cold retention |
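The note above about building structured logs from the start can be illustrated with Python's standard `logging` module. This is a sketch under assumptions: the field names (`ts`, `event`, and so on) and the convention of passing extra context under a `fields` key are choices made for this example, not an established ReGenesis log schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Context passed as extra={"fields": {...}} is merged into the top-level object.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)
```

A handler configured with this formatter would be used as `logger.warning("login_failed", extra={"fields": {"user_id": "u-123", "source_ip": "203.0.113.7"}})`. Because every line is machine-parseable with consistent keys, a log pipeline can index the fields without custom parsing rules.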
Endpoint Detection & Response (EDR)
For ReGenesis employee devices that have access to production systems:
| Control | Tool | Coverage |
|---|---|---|
| Endpoint protection | CrowdStrike Falcon or SentinelOne | All employee laptops |
| Device compliance | MDM (Jamf for Mac, Intune for Windows) | All employee devices |
| Disk encryption | FileVault (Mac), BitLocker (Windows) | Required |
| Patch management | MDM-enforced automatic updates | OS + browsers within 14 days |
Alert Channels and Escalation
On-Call Rotation
| Aspect | Detail |
|---|---|
| Coverage | 24/7/365 |
| Primary on-call | Rotates weekly among qualified engineers |
| Secondary on-call | Backup for escalation |
| Acknowledgment SLA | 5 minutes for Critical alerts |
| Engagement SLA | 15 minutes to begin investigation for Critical |
| Escalation | Auto-escalate to secondary if primary does not acknowledge within 10 minutes |
| Runbooks | Standardized response procedures for common alert types |
| Post-incident | Blameless retrospective within 48 hours |
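The acknowledgment and escalation timings in the table can be expressed as a small decision function. This Python sketch is only a model of the policy for a Critical alert; in practice the escalation would be configured in the paging tool, and `escalation_state` is a hypothetical name.

```python
ACK_SLA_MIN = 5          # acknowledgment SLA for Critical alerts
ESCALATE_AFTER_MIN = 10  # auto-escalate to secondary if still unacknowledged


def escalation_state(now_min, acked_at_min=None):
    """Return who should currently be paged and whether the ack SLA was breached.

    now_min: minutes elapsed since the alert fired.
    acked_at_min: minute at which the alert was acknowledged, or None if not yet.
    """
    acked = acked_at_min is not None
    elapsed_to_ack = acked_at_min if acked else now_min
    if acked:
        page = None
    elif now_min >= ESCALATE_AFTER_MIN:
        page = "secondary"
    else:
        page = "primary"
    return {"page": page, "ack_sla_breached": elapsed_to_ack > ACK_SLA_MIN}
```

Note that the window between minute 5 and minute 10 is a state where the SLA is already breached but the primary is still the only person paged.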
Alert Response Runbooks
The following runbooks provide standardized, step-by-step procedures for the most common alert types. Every on-call engineer should be familiar with these before joining the rotation. Runbooks are stored in the internal wiki and linked from each PagerDuty alert.
Runbook 1: API 5xx Spike (> 1% Error Rate)
Trigger: CloudWatch alarm regenesis-api-5xx-rate fires when 5xx errors exceed 1% of total requests over two consecutive 5-minute evaluation periods.
| Step | Action | Owner | Time Target |
|---|---|---|---|
| 1 | Acknowledge the PagerDuty alert and join the #incident-response Slack channel | On-call engineer | 5 min |
| 2 | Check deployment history -- was there a deploy in the last 60 minutes? | On-call engineer | 2 min |
| 3 | If recent deploy identified, initiate rollback to last known good version | On-call engineer | 10 min |
| 4 | Check third-party dependency status (Anthropic, Stripe, identity providers) | On-call engineer | 5 min |
| 5 | Review application logs in CloudWatch / Datadog for error patterns and stack traces | On-call engineer | 10 min |
| 6 | Check database metrics: connection pool usage, query latency, replication lag | On-call engineer | 5 min |
| 7 | If database is the root cause, engage DBA on-call and consider read replica failover | On-call engineer + DBA | 15 min |
| 8 | Monitor error rate for 15 minutes after mitigation to confirm stability | On-call engineer | 15 min |
| 9 | If not resolved within 30 minutes, escalate to secondary on-call and engineering lead | On-call engineer | 30 min |
| 10 | Update status page, create post-incident Jira ticket, and schedule retrospective | On-call engineer | 60 min |
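The trigger condition for this runbook (error rate above 1% for two consecutive 5-minute periods) can be modeled as follows. This Python sketch mimics the evaluation logic for illustration and testing; it is not how the CloudWatch alarm itself is defined, and the input shape is an assumption.

```python
def alarm_fires(periods, threshold_pct=1.0, consecutive=2):
    """Return True if the 5xx rate exceeds threshold_pct for `consecutive` periods in a row.

    periods: list of (total_requests, errors_5xx) per 5-minute period, oldest first.
    """
    streak = 0
    for total, errors in periods:
        # Treat a period with no traffic as healthy rather than dividing by zero.
        rate = (errors / total * 100) if total else 0.0
        streak = streak + 1 if rate > threshold_pct else 0
        if streak >= consecutive:
            return True
    return False
```

Requiring two consecutive breaching periods trades roughly five minutes of detection delay for fewer pages caused by a single short burst of errors.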
Runbook 2: Failed Login Spike (> 50 in 5 Minutes)
Trigger: CloudWatch alarm regenesis-failed-logins fires when failed login attempts exceed 50 within a 5-minute window.
| Step | Action | Owner | Time Target |
|---|---|---|---|
| 1 | Acknowledge the PagerDuty alert and join the #incident-response Slack channel | On-call engineer | 5 min |
| 2 | Determine attack pattern: single IP vs. distributed IPs, single account vs. multiple accounts | On-call engineer | 10 min |
| 3 | If single IP or small IP range, add WAF block rule immediately | On-call engineer | 5 min |
| 4 | Enable aggressive rate limiting on the authentication endpoint if not already active | On-call engineer | 5 min |
| 5 | Check if any accounts were successfully compromised (successful login after failures from same IP) | On-call engineer | 10 min |
| 6 | If compromised accounts found, escalate to security lead and initiate account lockout | On-call engineer + Security lead | 15 min |
| 7 | Force password reset for any compromised accounts and invalidate all active sessions | Security lead | 15 min |
| 8 | Notify affected enterprise admins per the DPA communication requirements | Security lead | 30 min |
| 9 | Document attack details, IOCs, and response actions in the incident ticket | On-call engineer | 60 min |
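Step 5 (a successful login following failures from the same IP) lends itself to a quick script during an incident. The sketch below is a minimal Python version under assumptions: the event tuple layout and the three-failure cutoff are illustrative, and real auth logs would first need to be parsed into this shape.

```python
from collections import Counter


def possibly_compromised(events, min_failures=3):
    """Return accounts with a successful login from an IP that previously failed repeatedly.

    events: iterable of (timestamp, source_ip, account, success), sorted by timestamp.
    """
    failures_by_ip = Counter()
    flagged = set()
    for _, ip, account, success in events:
        if success:
            # A success from an IP with a trail of failures is the suspicious signal.
            if failures_by_ip[ip] >= min_failures:
                flagged.add(account)
        else:
            failures_by_ip[ip] += 1
    return flagged
```

The output is a candidate list for the security lead to review, not proof of compromise: a legitimate user behind a shared corporate IP can appear here too.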
Runbook 3: Claude API Degradation (Latency > 10s or Error Rate > 5%)
Trigger: CloudWatch alarms fire when Claude API latency exceeds 10 seconds or the Claude API error rate exceeds 5% of requests.
| Step | Action | Owner | Time Target |
|---|---|---|---|
| 1 | Acknowledge the PagerDuty alert and join the #incident-response Slack channel | On-call engineer | 5 min |
| 2 | Confirm whether the issue is Anthropic-side (check status.anthropic.com and Anthropic support channels) | On-call engineer | 5 min |
| 3 | Check ReGenesis integration layer: verify API keys are valid, review request formatting, check for recent code changes | On-call engineer | 10 min |
| 4 | If Anthropic-side, set the CLAUDE_DEGRADATION feature flag to activate graceful degradation mode (queued responses, cached fallbacks) | On-call engineer | 5 min |
| 5 | Notify enterprise admins via the admin dashboard banner and email that AI features are operating in degraded mode | On-call engineer | 15 min |
| 6 | Monitor Anthropic status page and ReGenesis error rates every 15 minutes for recovery | On-call engineer | Ongoing |
| 7 | Once Anthropic service recovers, run integration smoke tests before disabling degradation mode | On-call engineer | 10 min |
| 8 | Clear the CLAUDE_DEGRADATION feature flag and verify normal AI response times | On-call engineer | 5 min |
| 9 | Update enterprise admins that AI features have returned to normal operation and document the incident | On-call engineer | 30 min |
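The graceful degradation mode toggled in steps 4 and 8 could look roughly like the Python sketch below. Several things here are assumptions: the flag is read from an environment variable (this document does not say where feature flags are stored), and `answer`, `call_model`, and the fallback message are hypothetical names and wording.

```python
import os

FALLBACK_MESSAGE = "AI features are temporarily limited. Your request has been queued."


def degradation_active(env=os.environ):
    """True when the CLAUDE_DEGRADATION flag is set (assumed here to be an env variable)."""
    return env.get("CLAUDE_DEGRADATION", "").lower() in {"1", "true", "on"}


def answer(prompt, call_model, cache, env=os.environ):
    """Serve a cached or fallback response in degraded mode; otherwise call the model.

    call_model: callable that sends the prompt to the LLM (hypothetical integration point).
    cache: mapping of prompt -> previously generated response.
    """
    if degradation_active(env):
        return cache.get(prompt, FALLBACK_MESSAGE)
    return call_model(prompt)
```

Keeping the check in one function means the smoke tests in step 7 can exercise the live path by calling `answer` with the flag cleared in a test environment before it is cleared in production.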
Runbook 4: GuardDuty Medium+ Finding
Trigger: Amazon EventBridge (formerly CloudWatch Events) rule regenesis-guardduty-high-severity fires for any GuardDuty finding with severity >= Medium (4.0+).
| Step | Action | Owner | Time Target |
|---|---|---|---|
| 1 | Acknowledge the PagerDuty alert and join the #incident-response Slack channel | On-call engineer | 5 min |
| 2 | Review the GuardDuty finding details: finding type, affected resource, actor information, and recommended remediation | On-call engineer | 10 min |
| 3 | Classify the finding as true positive, false positive, or needs investigation | On-call engineer + Security lead | 15 min |
| 4 | If false positive, suppress the finding in GuardDuty with documented justification and close the alert | Security lead | 10 min |
| 5 | If true positive, isolate the affected resource (revoke credentials, restrict security group, disable user) | On-call engineer | 10 min |
| 6 | If severity >= 7.0 (High), escalate to CTO and initiate formal incident response process | Security lead | 15 min |
| 7 | Preserve forensic evidence: snapshot affected instances, export relevant logs to secure S3 bucket | On-call engineer | 30 min |
| 8 | Remediate the root cause: patch vulnerability, rotate credentials, update security group rules | On-call engineer + Security lead | 60 min |
| 9 | If finding indicates potential data breach, activate the breach response plan and notify legal counsel | Security lead + CTO | 60 min |
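The severity cutoffs in this runbook (4.0 for paging, 7.0 for CTO escalation) can be captured in an event pattern plus a small routing function. In the Python sketch below, the pattern uses the standard EventBridge numeric-matching syntax for GuardDuty findings; the `route_finding` helper and its return shape are hypothetical.

```python
import json

# EventBridge event pattern matching GuardDuty findings with severity >= 4 (Medium and above).
EVENT_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"severity": [{"numeric": [">=", 4]}]},
}


def route_finding(severity: float) -> dict:
    """Map a GuardDuty severity score to the runbook's paging and escalation decisions."""
    if severity >= 7.0:   # High: step 6 escalates to the CTO
        return {"page_on_call": True, "escalate_to_cto": True}
    if severity >= 4.0:   # Medium: page the on-call engineer only
        return {"page_on_call": True, "escalate_to_cto": False}
    return {"page_on_call": False, "escalate_to_cto": False}


if __name__ == "__main__":
    print(json.dumps(EVENT_PATTERN, indent=2))
```

Filtering at the rule level keeps Low findings out of the paging path entirely, while the routing function handles the Medium versus High distinction after the alert arrives.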
Runbook 5: LLM Cost Overrun (Daily Spend > 150% of Budget)
Trigger: Custom CloudWatch metric alarm fires when daily Anthropic API spend exceeds 150% of the daily budget threshold.
| Step | Action | Owner | Time Target |
|---|---|---|---|
| 1 | Acknowledge the alert and review the LLM cost dashboard to identify the spend trajectory | On-call engineer | 10 min |
| 2 | Identify the source of excess spend: which tenants, endpoints, or sessions are driving token usage | On-call engineer | 15 min |
| 3 | Check for runaway processes: automated loops, stuck retry logic, or misconfigured batch jobs calling the Claude API | On-call engineer | 10 min |
| 4 | If a runaway process is identified, kill the process immediately and disable the triggering mechanism | On-call engineer | 5 min |
| 5 | Review per-session and per-user token budgets -- tighten circuit breakers if thresholds are too generous | On-call engineer + Engineering lead | 15 min |
| 6 | Investigate potential abuse: check for users generating excessive AI interactions or attempting prompt-based token inflation | On-call engineer + Security lead | 30 min |
| 7 | Report the cost overrun to finance with root cause analysis and projected monthly impact | Engineering lead | 60 min |
| 8 | Adjust daily spend limits, per-session token caps, or rate limits to prevent recurrence | Engineering lead | 120 min |
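The trigger arithmetic for this runbook is simple enough to sketch directly. The Python function below is a hypothetical helper that compares accumulated daily spend to the budget and reports whether the 150% alarm condition holds; how spend is collected and published as a custom metric is outside the sketch.

```python
def spend_status(spend_usd, daily_budget_usd, alarm_ratio=1.5):
    """Return the spend-to-budget ratio and whether it exceeds the alarm ratio (150%)."""
    if daily_budget_usd <= 0:
        raise ValueError("daily_budget_usd must be positive")
    ratio = spend_usd / daily_budget_usd
    return {"ratio": round(ratio, 2), "alarm": ratio > alarm_ratio}
```

For example, with a $100 daily budget, $160 of spend gives a ratio of 1.6 and raises the alarm, while exactly $150 does not, because the trigger is defined as exceeding 150%.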