Monitoring, Alerting & Security Operations

Detecting and Responding to Threats in Real Time

Monitoring and alerting systems are the eyes and ears of platform security. While prevention controls (encryption, access control, firewalls) stop most threats, monitoring systems detect the threats that get through. For ReGenesis, this means continuously watching for unauthorized access attempts, unusual data access patterns, infrastructure anomalies, and AI behavior that deviates from expected norms.

Enterprise clients expect documented monitoring architecture, proven incident detection capabilities, and transparent communication during security events. SOC 2 CC7.1-CC7.4 specifically require continuous monitoring, anomaly detection, and documented incident response procedures. The monitoring stack feeds directly into the incident response process: when something looks wrong, the right people are alerted within minutes and know exactly what to do.

ReGenesis uses a layered monitoring approach: Application Performance Monitoring (APM) for platform health, Security Information and Event Management (SIEM) for threat detection, AWS native services for infrastructure monitoring, and custom dashboards for AI-specific metrics. This architecture is currently in early specification and needs further development before the McKinsey pilot -- a gap acknowledged in the Blueprint Supplement.

Monitoring Architecture

Blueprint Supplement Gap

The monitoring architecture needs more specification before pilot launch. The current document describes the target architecture, but detailed implementation plans, tool selection, and staffing requirements need to be finalized by Q2 2026.

Monitoring Layers

What the Platform Monitors

Category	Metric	Alert Threshold	Severity
Availability	Uptime percentage	< 99.9% over 5 min	Critical
Latency	API response time (p95)	> 500ms	Warning
Latency	API response time (p99)	> 2000ms	Critical
Error Rate	5xx error rate	> 1% of requests	Critical
Error Rate	4xx error rate	> 10% of requests	Warning
Authentication	Failed login rate	> 10 per user per hour	Critical
Authentication	MFA failure rate	> 50% for any user	Warning
Data Access	Unusual data access volume	> 3x baseline for user	Warning
Data Access	Cross-role data access attempt	Any occurrence	Critical
AI	Claude API latency	> 10 seconds	Warning
AI	Claude API error rate	> 5%	Critical
AI	Guardrail trigger rate	> 10% of sessions	Warning
AI	Token usage per session	> 50,000 tokens	Warning
Infrastructure	CPU utilization	> 80% sustained (15 min)	Warning
Infrastructure	Memory utilization	> 85%	Warning
Infrastructure	Disk utilization	> 80%	Warning
Infrastructure	Database connections	> 80% of max	Warning
Security	WAF block rate	> 100 blocks/min	Warning
Security	GuardDuty finding	Any Medium+	Critical
Certificate	TLS cert expiry	< 30 days	Warning
Certificate	TLS cert expiry	< 7 days	Critical
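
A minimal sketch of how a few rows of the table above could be expressed as data-driven rules. The `Rule` type, the metric keys, and the `evaluate` helper are hypothetical names chosen for illustration; in practice these thresholds would more likely live in CloudWatch or Datadog alarm definitions than in application code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    metric: str                        # hypothetical metric key
    breached: Callable[[float], bool]  # True when the threshold is crossed
    severity: str                      # "Warning" or "Critical", as in the table


# A subset of the table above; the metric keys are illustrative assumptions.
RULES = [
    Rule("api_latency_p95_ms", lambda v: v > 500, "Warning"),
    Rule("api_latency_p99_ms", lambda v: v > 2000, "Critical"),
    Rule("error_rate_5xx_pct", lambda v: v > 1.0, "Critical"),
    Rule("tls_cert_days_remaining", lambda v: v < 7, "Critical"),
    Rule("tls_cert_days_remaining", lambda v: v < 30, "Warning"),
]


def evaluate(samples: dict[str, float]) -> list[tuple[str, str]]:
    """Return (metric, severity) for breached rules, highest severity per metric."""
    alerts: dict[str, str] = {}
    for rule in RULES:
        value = samples.get(rule.metric)
        if value is None or not rule.breached(value):
            continue
        # Keep Critical when both a Warning and a Critical rule fire for one metric.
        if alerts.get(rule.metric) != "Critical":
            alerts[rule.metric] = rule.severity
    return sorted(alerts.items())
```

Calling `evaluate({"api_latency_p95_ms": 620, "tls_cert_days_remaining": 5})` reports a Warning for latency and a single Critical for the certificate, because the 7-day rule takes precedence over the 30-day rule.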

SIEM: Security Event Correlation

The SIEM (Security Information and Event Management) system aggregates logs from all sources and correlates events to detect threats that individual log entries would not reveal.

Correlation examples:

Pattern	Events Correlated	Indicates
Brute force	Multiple failed logins from same IP across different accounts	Credential stuffing attack
Account takeover	Successful login from new IP + immediate data export	Compromised account
Insider threat	Coach accessing coachee data outside of scheduled sessions at unusual hours	Unauthorized data access
Data exfiltration	Large data export + new API key creation + unusual outbound traffic	Data breach in progress
Privilege escalation	Admin role change + immediate access to previously restricted data	Privilege abuse
AI manipulation	Repeated guardrail triggers + changing prompt patterns from same user	Prompt injection attempt
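
As an illustration of the first pattern in the table, the sketch below flags source IPs whose failed logins span many distinct accounts within a short window. The `LoginEvent` shape, the 5-minute window, and the 5-account minimum are assumptions made for this example, not tuned values or the rule syntax of any particular SIEM product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LoginEvent:
    timestamp: float  # seconds since epoch
    source_ip: str
    account: str
    success: bool


def credential_stuffing_ips(events, window_s=300, min_accounts=5):
    """Return IPs with failed logins against >= min_accounts distinct accounts
    inside any sliding window of window_s seconds."""
    failures_by_ip = defaultdict(list)
    for e in events:
        if not e.success:
            failures_by_ip[e.source_ip].append(e)

    flagged = set()
    for ip, failures in failures_by_ip.items():
        failures.sort(key=lambda e: e.timestamp)
        start = 0
        for end in range(len(failures)):
            # Shrink the window from the left until it spans at most window_s.
            while failures[end].timestamp - failures[start].timestamp > window_s:
                start += 1
            accounts = {f.account for f in failures[start:end + 1]}
            if len(accounts) >= min_accounts:
                flagged.add(ip)
                break
    return flagged
```

Repeated failures against a single account from one IP are not flagged by this rule; that case is closer to a targeted brute-force attempt and would need its own rule.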

Anomaly Detection

Beyond static thresholds, the monitoring system uses baseline-relative anomaly detection:

Behavioral baselines are established per user (typical login times, data access patterns, session frequency).
Deviations from baseline trigger alerts (e.g., a coach who normally accesses 5 records/day suddenly accesses 500).
Machine learning models identify subtler patterns if and when scale justifies them.
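
A baseline-relative check can be very small. The sketch below assumes the baseline is the median of a user's recent daily record-access counts and reuses the 3x multiplier from the alert threshold table; the choice of median and the 7-day minimum history are illustrative assumptions, not a specified design.

```python
from statistics import median


def is_access_anomalous(history, today, multiplier=3.0, min_history=7):
    """Flag today's record-access count when it exceeds multiplier x the
    user's baseline (median of recent daily counts)."""
    if len(history) < min_history:
        return False  # not enough data to form a baseline yet
    # Floor the baseline at 1 so a mostly idle user is not flagged for one access.
    baseline = max(median(history), 1.0)
    return today > multiplier * baseline
```

With a history of roughly 5 records per day, a day with 500 accesses is flagged and a day with 12 is not.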

Current Implementation Plan

For MVP0/Pilot, anomaly detection will use CloudWatch Anomaly Detection (statistical) and simple threshold-based rules. ML-based anomaly detection is deferred to GA. The SIEM tool selection (Datadog Security Monitoring vs. Splunk vs. AWS Security Hub) should be finalized by Q2 2026.

LLM-Specific Monitoring (Sasha / Anthropic Claude)

Every call to the Anthropic Claude API costs money, takes time, and carries risk. This monitoring domain is unique to AI-first platforms and one of the most critical for ReGenesis.

Metric	What It Measures	Why It Matters
Tokens per session	Input + output tokens per coaching interaction	Cost control and anomaly detection
Cost per session	Dollar cost of each Sasha interaction	Margin management -- a runaway prompt burns cash fast
Daily/monthly API spend	Aggregate Anthropic API cost	Budget tracking, burn rate awareness
API latency	Time from request to first token and to completion	User experience during AI interactions
Error rate	Anthropic API 4xx/5xx responses	Service degradation or rate limiting
Rate limit hits	429 responses from Anthropic	Capacity planning, need for rate limit increase
Token budget utilization	Actual vs. budgeted token spend	Early warning for cost overruns
Prompt injection signals	Flagged inputs matching injection patterns	Security: someone trying to manipulate Sasha
Output quality flags	Responses flagged by content safety filters	Quality assurance, safety monitoring

Cost Control Is Monitoring

For an AI-first product, LLM costs are variable and can spike unexpectedly. A single user with a runaway prompt loop could generate thousands of dollars in API charges. Monitoring token usage with automatic circuit breakers is not optional -- it is financial survival.

Recommendation: Set daily and monthly spend alerts from day one. Even during MVP0, the platform should track every dollar spent on Claude API calls. This tracking becomes margin visibility once the platform is charging clients.
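
One possible shape for a per-session circuit breaker is sketched below. The `SessionBudget` class is hypothetical, the 50,000-token cap simply mirrors the per-session warning threshold in the alert table, and the per-million-token prices are caller-supplied placeholders rather than Anthropic's actual rates.

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    """Tracks token usage for one Sasha session and trips when a cap is hit."""
    max_tokens: int = 50_000  # mirrors the per-session warning threshold
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def tripped(self) -> bool:
        # Once tripped, the caller should stop issuing Claude API requests.
        return self.total_tokens >= self.max_tokens

    def cost_usd(self, usd_per_m_input: float, usd_per_m_output: float) -> float:
        # Prices are placeholders supplied by the caller, not real rates.
        return (self.input_tokens * usd_per_m_input
                + self.output_tokens * usd_per_m_output) / 1_000_000
```

The caller would check `tripped` before each Claude API request and call `record` with the token counts reported in each response, so a runaway loop stops at the cap instead of at the monthly invoice.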

Metrics to Track per Stage

The monitoring footprint grows as the platform matures. This table shows what needs to be in place at each stage:

Metric Domain	MVP0 Demo	Pilot (McKinsey)	GA Launch	Global
Health checks	Basic `/health` endpoint	Detailed component health with dependency checks	Deep health with multi-level checks	Multi-region health mesh
Error tracking	Console logging + CloudWatch	Structured logging + Datadog APM	Full distributed tracing	Cross-region trace correlation
LLM monitoring	Manual token count review	Automated cost dashboard + daily spend alerts	Real-time cost alerts + per-session circuit breakers	Per-region LLM cost allocation
Security monitoring	Basic auth logging in CloudWatch	Failed login alerts, access anomalies, GuardDuty	SIEM integration (Datadog Security) + anomaly detection	Global threat correlation, SOC integration
Uptime monitoring	Manual checks	External synthetic monitoring (e.g., Checkly)	Statuspage.io + SLA tracking + SLO burn-rate alerts	Multi-region SLA with 99.95% target
Alerting	Email to founder	PagerDuty for P1, Slack for P2	Full 3-tier alerting with runbooks	Follow-the-sun on-call rotation
Dashboards	None (CloudWatch console)	Core operational + LLM cost dashboard	Executive + operational + security dashboards	Regional dashboards + global rollup
Compliance evidence	Manual screenshots	Automated evidence via Vanta/Drata	Continuous compliance monitoring	Multi-framework evidence (SOC 2, ISO, GDPR)

SIEM Integration (GA Stage)

At the GA stage, all monitoring signals feed into a Security Information and Event Management (SIEM) system. The SIEM correlates events across sources to detect complex attack patterns that no single monitoring tool would catch.

Note: The SIEM is not needed for MVP0 or Pilot -- it is a GA requirement. However, the logging and metrics built now become the data feeds for the SIEM later. Building structured logs from the start means the SIEM integration becomes plug-and-play rather than a retrofit.

SIEM Capability	Tool	Purpose
Log aggregation	Datadog Logs / CloudWatch	Centralize all log streams
Threat detection	Datadog Security / GuardDuty	ML-based anomaly detection
Correlation rules	Custom detection rules	ReGenesis-specific threat patterns
Incident workflow	PagerDuty + Datadog	Alert routing, escalation, response tracking
Compliance reporting	Datadog compliance dashboards	SOC 2 / ISO 27001 evidence collection
Retention	S3 + Glacier	1-year hot, 7-year cold retention
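
To make the point about structured logs concrete, here is a minimal sketch using only the Python standard library. The logger name, the `fields` convention, and the field names are assumptions for illustration; the idea is that every event is emitted as one JSON object per line so a SIEM can ingest it later without custom parsing.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} become top-level keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("regenesis.security")  # hypothetical logger name
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()  # stand-in for stdout or a log shipper
logger = build_logger(stream)
logger.warning(
    "login_failed",
    extra={"fields": {"user_id": "u-123", "source_ip": "203.0.113.7"}},
)
```

Because the event name and its attributes are separate keys, correlation rules can filter on `event` and group by `source_ip` or `user_id` directly.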

Endpoint Detection & Response (EDR)

For ReGenesis employee devices that have access to production systems:

Control	Tool	Coverage
Endpoint protection	CrowdStrike Falcon or SentinelOne	All employee laptops
Device compliance	MDM (Jamf for Mac, Intune for Windows)	All employee devices
Disk encryption	FileVault (Mac), BitLocker (Windows)	Required
Patch management	MDM-enforced automatic updates	OS + browsers within 14 days

Alert Channels and Escalation

On-Call Rotation

Aspect	Detail
Coverage	24/7/365
Primary on-call	Rotates weekly among qualified engineers
Secondary on-call	Backup for escalation
Acknowledgment SLA	5 minutes for Critical alerts
Engagement SLA	15 minutes to begin investigation for Critical
Escalation	Auto-escalate to secondary if primary does not acknowledge within 10 minutes
Runbooks	Standardized response procedures for common alert types
Post-incident	Blameless retrospective within 48 hours
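
The acknowledgment and escalation timings in the table can be read as a small state function. The sketch below is illustrative only: the state labels are hypothetical, and in practice this logic would normally be configured as a PagerDuty escalation policy rather than written by hand.

```python
from datetime import timedelta

ACK_SLA = timedelta(minutes=5)          # acknowledgment SLA for Critical alerts
ESCALATE_AFTER = timedelta(minutes=10)  # auto-escalation to secondary on-call


def on_call_state(elapsed: timedelta, acknowledged: bool) -> str:
    """Describe who should hold a Critical alert after `elapsed` time."""
    if acknowledged:
        return "primary:investigating"
    if elapsed >= ESCALATE_AFTER:
        return "secondary:paged"
    if elapsed > ACK_SLA:
        return "primary:ack-sla-breached"
    return "primary:paged"
```

The window between 5 and 10 minutes is worth tracking separately: the SLA is already breached, but the alert has not yet moved to the secondary.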

Alert Response Runbooks

The following runbooks provide standardized, step-by-step procedures for the most common alert types. Every on-call engineer should be familiar with these before joining the rotation. Runbooks are stored in the internal wiki and linked from each PagerDuty alert.

Runbook 1: API 5xx Spike (> 1% Error Rate)

Trigger: CloudWatch alarm regenesis-api-5xx-rate fires when 5xx errors exceed 1% of total requests over two consecutive 5-minute evaluation periods.

Step	Action	Owner	Time Target
1	Acknowledge the PagerDuty alert and join the #incident-response Slack channel	On-call engineer	5 min
2	Check deployment history -- was there a deploy in the last 60 minutes?	On-call engineer	2 min
3	If recent deploy identified, initiate rollback to last known good version	On-call engineer	10 min
4	Check third-party dependency status (Anthropic, Stripe, identity providers)	On-call engineer	5 min
5	Review application logs in CloudWatch / Datadog for error patterns and stack traces	On-call engineer	10 min
6	Check database metrics: connection pool usage, query latency, replication lag	On-call engineer	5 min
7	If database is the root cause, engage DBA on-call and consider read replica failover	On-call engineer + DBA	15 min
8	Monitor error rate for 15 minutes after mitigation to confirm stability	On-call engineer	15 min
9	If not resolved within 30 minutes, escalate to secondary on-call and engineering lead	On-call engineer	30 min
10	Update status page, create post-incident Jira ticket, and schedule retrospective	On-call engineer	60 min
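
The trigger condition for this runbook (more than 1% 5xx over two consecutive 5-minute periods) can be sketched as follows. The function and its input shape are assumptions for illustration; the real evaluation is performed by the CloudWatch alarm itself.

```python
def alarm_fires(periods, threshold_pct=1.0, consecutive=2):
    """periods: list of (error_5xx_count, total_requests) tuples, one per
    5-minute evaluation period, oldest first."""
    streak = 0
    for errors, total in periods:
        # Treat a period with no traffic as healthy rather than dividing by zero.
        rate = 100.0 * errors / total if total else 0.0
        streak = streak + 1 if rate > threshold_pct else 0
        if streak >= consecutive:
            return True
    return False
```

Requiring two consecutive breaching periods filters out a single noisy interval, at the cost of up to 10 minutes before the page goes out.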

Runbook 2: Failed Login Spike (> 50 in 5 Minutes)

Trigger: CloudWatch alarm regenesis-failed-logins fires when failed login attempts exceed 50 within a 5-minute window.

Step	Action	Owner	Time Target
1	Acknowledge the PagerDuty alert and join the #incident-response Slack channel	On-call engineer	5 min
2	Determine attack pattern: single IP vs. distributed IPs, single account vs. multiple accounts	On-call engineer	10 min
3	If single IP or small IP range, add WAF block rule immediately	On-call engineer	5 min
4	Enable aggressive rate limiting on the authentication endpoint if not already active	On-call engineer	5 min
5	Check if any accounts were successfully compromised (successful login after failures from same IP)	On-call engineer	10 min
6	If compromised accounts found, escalate to security lead and initiate account lockout	On-call engineer + Security lead	15 min
7	Force password reset for any compromised accounts and invalidate all active sessions	Security lead	15 min
8	Notify affected enterprise admins per the DPA communication requirements	Security lead	30 min
9	Document attack details, IOCs, and response actions in the incident ticket	On-call engineer	60 min
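
Step 5 (a successful login after failures from the same IP) lends itself to a quick script during an incident. The sketch below assumes auth events are available as simple tuples and uses a 3-failure minimum; both the event shape and the threshold are illustrative assumptions.

```python
from collections import defaultdict


def possibly_compromised(events, min_failures=3):
    """events: iterable of (timestamp, source_ip, account, success) tuples.
    Returns accounts with a successful login from an IP that had already
    produced at least min_failures failed logins."""
    failures = defaultdict(int)
    compromised = set()
    for _, ip, account, success in sorted(events):  # process in time order
        if success:
            if failures[ip] >= min_failures:
                compromised.add(account)
        else:
            failures[ip] += 1
    return compromised
```

A match is a lead for the security lead to confirm, not proof of compromise: shared office NAT addresses can produce the same pattern.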

Runbook 3: Claude API Degradation (Latency > 10s or Error Rate > 5%)

Trigger: CloudWatch alarms for Claude API latency exceeding 10 seconds or Claude API error rate exceeding 5% of requests.

Step	Action	Owner	Time Target
1	Acknowledge the PagerDuty alert and join the #incident-response Slack channel	On-call engineer	5 min
2	Confirm whether the issue is Anthropic-side (check status.anthropic.com and Anthropic support channels)	On-call engineer	5 min
3	Check ReGenesis integration layer: verify API keys are valid, review request formatting, check for recent code changes	On-call engineer	10 min
4	If Anthropic-side, set the CLAUDE_DEGRADATION feature flag to activate graceful degradation mode (queued responses, cached fallbacks)	On-call engineer	5 min
5	Notify enterprise admins via the admin dashboard banner and email that AI features are operating in degraded mode	On-call engineer	15 min
6	Monitor Anthropic status page and ReGenesis error rates every 15 minutes for recovery	On-call engineer	Ongoing
7	Once Anthropic service recovers, run integration smoke tests before removing degradation mode	On-call engineer	10 min
8	Remove CLAUDE_DEGRADATION feature flag and verify normal AI response times	On-call engineer	5 min
9	Update enterprise admins that AI features have returned to normal operation and document the incident	On-call engineer	30 min
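
A rough sketch of what the degradation mode in steps 4 and 8 might look like in the integration layer. It assumes the `CLAUDE_DEGRADATION` flag is exposed as an environment variable and that `call_claude`, `cache`, and `queue` are supplied by the caller; all of these are hypothetical stand-ins, and a real deployment might read the flag from a feature-flag service instead.

```python
import os

FALLBACK_MESSAGE = (
    "Sasha is temporarily running in limited mode. Your message has been queued."
)


def degraded_mode_enabled(env=os.environ) -> bool:
    # Assumption: the feature flag is surfaced as an environment variable.
    return env.get("CLAUDE_DEGRADATION", "").lower() in {"1", "true", "on"}


def respond(prompt, call_claude, cache, queue, env=os.environ):
    """Return a response, serving from cache or queueing while degraded."""
    if not degraded_mode_enabled(env):
        return call_claude(prompt)  # normal path
    if prompt in cache:
        return cache[prompt]        # cached fallback
    queue.append(prompt)            # queued for replay after recovery
    return FALLBACK_MESSAGE
```

Keeping the flag check in one function makes the recovery step simple: once the flag is removed, every request returns to the normal path without a deploy.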

Runbook 4: GuardDuty Medium+ Finding

Trigger: CloudWatch Events (now Amazon EventBridge) rule regenesis-guardduty-high-severity fires for any GuardDuty finding with severity >= Medium (4.0+).

Step	Action	Owner	Time Target
1	Acknowledge the PagerDuty alert and join the #incident-response Slack channel	On-call engineer	5 min
2	Review the GuardDuty finding details: finding type, affected resource, actor information, and recommended remediation	On-call engineer	10 min
3	Classify the finding as true positive, false positive, or needs investigation	On-call engineer + Security lead	15 min
4	If false positive, suppress the finding in GuardDuty with documented justification and close the alert	Security lead	10 min
5	If true positive, isolate the affected resource (revoke credentials, restrict security group, disable user)	On-call engineer	10 min
6	If severity >= 7.0 (High), escalate to CTO and initiate formal incident response process	Security lead	15 min
7	Preserve forensic evidence: snapshot affected instances, export relevant logs to secure S3 bucket	On-call engineer	30 min
8	Remediate the root cause: patch vulnerability, rotate credentials, update security group rules	On-call engineer + Security lead	60 min
9	If finding indicates potential data breach, activate the breach response plan and notify legal counsel	Security lead + CTO	60 min
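
The severity thresholds used in this runbook (4.0 to page, 7.0 to escalate) could be encoded as a small routing function. The action names below are hypothetical labels for illustration and do not correspond to any GuardDuty or PagerDuty API.

```python
def route_finding(severity: float) -> list[str]:
    """Map a GuardDuty numeric severity to the runbook's response actions."""
    if severity < 4.0:
        return []  # below Medium: no page under this runbook
    actions = ["page_on_call", "triage_with_security_lead"]
    if severity >= 7.0:
        # High and above: step 6 of the runbook applies.
        actions += ["escalate_to_cto", "start_formal_incident_response"]
    return actions
```

For example, `route_finding(5.0)` pages the on-call engineer for triage, while `route_finding(8.0)` also escalates to the CTO.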

Runbook 5: LLM Cost Overrun (Daily Spend > 150% of Budget)

Trigger: Custom CloudWatch metric alarm fires when daily Anthropic API spend exceeds 150% of the daily budget threshold.

Step	Action	Owner	Time Target
1	Acknowledge the alert and review the LLM cost dashboard to identify the spend trajectory	On-call engineer	10 min
2	Identify the source of excess spend: which tenants, endpoints, or sessions are driving token usage	On-call engineer	15 min
3	Check for runaway processes: automated loops, stuck retry logic, or misconfigured batch jobs calling the Claude API	On-call engineer	10 min
4	If a runaway process is identified, kill the process immediately and disable the triggering mechanism	On-call engineer	5 min
5	Review per-session and per-user token budgets -- tighten circuit breakers if thresholds are too generous	On-call engineer + Engineering lead	15 min
6	Investigate potential abuse: check for users generating excessive AI interactions or attempting prompt-based token inflation	On-call engineer + Security lead	30 min
7	Report the cost overrun to finance with root cause analysis and projected monthly impact	Engineering lead	60 min
8	Adjust daily spend limits, per-session token caps, or rate limits to prevent recurrence	Engineering lead	120 min
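
The trigger and step 2 of this runbook can be combined in a short helper: check whether daily spend exceeds 150% of budget, and list the tenants driving it. The function name, the per-tenant spend mapping, and the dollar figures in the usage note are assumptions for illustration.

```python
from collections import Counter


def spend_overrun(spend_by_tenant: dict[str, float], daily_budget_usd: float,
                  trigger_ratio: float = 1.5, top_n: int = 3):
    """Return (fired, top_tenants) for the day's Anthropic API spend."""
    total = sum(spend_by_tenant.values())
    fired = total > trigger_ratio * daily_budget_usd
    # most_common sorts tenants by spend, highest first.
    top = Counter(spend_by_tenant).most_common(top_n)
    return fired, top
```

With a 100 USD daily budget and tenants spending 120, 20, and 15 USD, the alarm fires (155 > 150) and the first tenant is immediately visible as the main driver.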
