Incident Management
Vitals provides incident management capabilities that help teams detect, investigate, resolve, and learn from production incidents, all within VS Code.
Overview
The Incident Management system enables collaborative debugging workflows, automated remediation, and post-incident analysis without leaving your development environment.
Key Features
1. Incident Detection & Declaration
Create incidents manually or automatically when anomalies are detected:
// Incidents can be created from:
// - Manual declaration (Command Palette)
// - Automated detection from metric thresholds
// - CI/CD deployment failures
// - Integration webhooks (PagerDuty, Opsgenie)
Commands:
- Vitals: Create Incident - Manually declare a new incident
- Vitals: Show Incident Details - View incident timeline and details
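As a rough illustration of automated detection from a metric threshold, the sketch below declares an incident only after several consecutive samples breach the limit, so a single spike does not fire. The function name, the sample shape, and the consecutive-breach rule are assumptions made for this example; this document does not specify how Vitals evaluates thresholds.

```typescript
// Hypothetical sample shape; Vitals' real metric format is not documented here.
interface LatencySample {
  timestampMs: number;
  latencyMs: number;
}

// Returns true when the most recent `minConsecutive` samples all exceed the
// threshold. Requiring a run of breaches is an assumed debouncing rule.
function shouldDeclareIncident(
  samples: LatencySample[],
  thresholdMs: number,
  minConsecutive: number
): boolean {
  if (minConsecutive <= 0 || samples.length < minConsecutive) {
    return false;
  }
  const recent = samples.slice(samples.length - minConsecutive);
  return recent.every((s) => s.latencyMs > thresholdMs);
}

// Usage: a 2 s latency threshold that must be breached three times in a row.
const recentSamples: LatencySample[] = [
  { timestampMs: 0, latencyMs: 800 },
  { timestampMs: 1000, latencyMs: 2400 },
  { timestampMs: 2000, latencyMs: 2600 },
  { timestampMs: 3000, latencyMs: 3100 },
];
const declareNow = shouldDeclareIncident(recentSamples, 2000, 3); // true
```

Raising `minConsecutive` trades detection speed for fewer false alarms; with a value of 4 the same data would not trigger, because the first sample is below the threshold.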
2. Incident Lifecycle Management
Track incidents through their complete lifecycle:
- Detected - Incident identified
- Investigating - Team actively debugging
- Identified - Root cause found
- Monitoring - Fix deployed, monitoring for recurrence
- Resolved - Incident fully resolved
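The lifecycle above can be modeled as a small state machine. The sketch below is illustrative only: the lowercase status identifiers and the allowed transitions (including the step from Monitoring back to Investigating when a problem recurs) are assumptions, not documented Vitals behavior.

```typescript
// Hypothetical identifiers mirroring the five lifecycle stages listed above.
type LifecycleStatus =
  | 'detected'
  | 'investigating'
  | 'identified'
  | 'monitoring'
  | 'resolved';

// Assumed transition table: each stage advances to the next one, and a
// recurrence seen while monitoring sends the incident back to investigation.
const allowedTransitions: Record<LifecycleStatus, LifecycleStatus[]> = {
  detected: ['investigating'],
  investigating: ['identified'],
  identified: ['monitoring'],
  monitoring: ['resolved', 'investigating'],
  resolved: [],
};

function canTransition(from: LifecycleStatus, to: LifecycleStatus): boolean {
  return allowedTransitions[from].indexOf(to) !== -1;
}

// Returns the new status, or throws if the move is not in the table.
function advance(current: LifecycleStatus, next: LifecycleStatus): LifecycleStatus {
  if (!canTransition(current, next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```

Keeping the transitions in a lookup table makes the rules easy to audit and to change, for example if a team wants to allow resolving directly from Identified.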
3. Collaborative Debugging
Annotations
Add observations and notes during investigation:
# Command: Vitals: Add Incident Annotation
- Timestamp-based observations
- Link to specific metrics or logs
- Attach screenshots or diagnostic data
Hypothesis Tracking
Document and test theories systematically:
# Command: Vitals: Add Incident Hypothesis
- State: pending, confirmed, rejected
- Link to evidence (metrics, logs, traces)
- Track validation steps
4. Context Capture
Automatically capture diagnostic context:
- Metric Snapshots - Performance data at incident time
- Log Snapshots - Relevant log entries
- Trace IDs - Distributed traces for debugging
- Git Context - Recent commits and deployments
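One way to picture the captured context is as a single bundle holding the four kinds of data listed above, with metrics and logs trimmed to a time window around the incident. Every type, field, and function name below is hypothetical, and the windowing rule is an assumption; the document does not describe how Vitals selects what to capture.

```typescript
// Hypothetical shapes for captured context; field names are illustrative.
interface Timestamped {
  timestampMs: number;
}
interface MetricPoint extends Timestamped {
  metric: string;
  value: number;
}
interface LogLine extends Timestamped {
  message: string;
}
interface CapturedContext {
  metricSnapshots: MetricPoint[];
  logSnapshots: LogLine[];
  traceIds: string[];
  recentCommits: string[];
}

// Keep only items within `windowMs` of the incident time (inclusive).
function withinWindow<T extends Timestamped>(
  items: T[],
  incidentMs: number,
  windowMs: number
): T[] {
  return items.filter((item) => Math.abs(item.timestampMs - incidentMs) <= windowMs);
}

// Bundle the four context types; arrays are copied so later changes to the
// inputs do not alter the snapshot.
function captureContext(
  incidentMs: number,
  windowMs: number,
  metrics: MetricPoint[],
  logs: LogLine[],
  traceIds: string[],
  recentCommits: string[]
): CapturedContext {
  return {
    metricSnapshots: withinWindow(metrics, incidentMs, windowMs),
    logSnapshots: withinWindow(logs, incidentMs, windowMs),
    traceIds: traceIds.slice(),
    recentCommits: recentCommits.slice(),
  };
}
```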
5. Runbook Automation
Execute predefined remediation steps automatically:
Pre-configured Runbooks:
Kubernetes Pod Restart
Steps:
1. Verify pod status (automated)
2. Delete failing pod (automated)
3. Verify new pod startup (automated)
High CPU Mitigation
Steps:
1. Identify high CPU processes (automated)
2. Scale deployment (automated/manual)
3. Verify CPU normalization (automated)
Command: Vitals: Execute Runbook
Supported Actions:
- kubectl commands (Kubernetes operations)
- aws_cli commands (AWS operations)
- azure_cli commands (Azure operations)
- http requests (API calls)
- script execution (Custom scripts)
- notification (Alert team members)
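To make the runbook structure concrete, here is a sketch of how the Kubernetes Pod Restart runbook might be written as data. The step shape (`title`, `action`, `command`, `automated`), the `{{variable}}` placeholder syntax, and the `renderCommand` helper are all assumptions for illustration; the real `RunbookStep` type is not shown in this document. Only the six action names come from the list above.

```typescript
// Action names taken from the supported actions listed above.
type RunbookAction =
  | 'kubectl'
  | 'aws_cli'
  | 'azure_cli'
  | 'http'
  | 'script'
  | 'notification';

// Hypothetical step and runbook shapes.
interface SketchStep {
  title: string;
  action: RunbookAction;
  command: string; // may contain {{variable}} placeholders (assumed syntax)
  automated: boolean;
}
interface SketchRunbook {
  id: string;
  name: string;
  description: string;
  steps: SketchStep[];
}

const podRestart: SketchRunbook = {
  id: 'k8s-pod-restart',
  name: 'Kubernetes Pod Restart',
  description: 'Delete a failing pod and confirm its replacement starts.',
  steps: [
    { title: 'Verify pod status', action: 'kubectl', command: 'kubectl get pod {{pod}} -n {{namespace}}', automated: true },
    { title: 'Delete failing pod', action: 'kubectl', command: 'kubectl delete pod {{pod}} -n {{namespace}}', automated: true },
    { title: 'Verify new pod startup', action: 'kubectl', command: 'kubectl get pods -n {{namespace}}', automated: true },
  ],
};

// Substitute {{name}} placeholders; unknown names are left untouched so a
// missing variable is visible rather than silently dropped.
function renderCommand(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(variables, key) ? variables[key] : match
  );
}

const deleteCommand = renderCommand(podRestart.steps[1].command, {
  pod: 'api-pod-1234',
  namespace: 'prod',
});
// deleteCommand is 'kubectl delete pod api-pod-1234 -n prod'
```

Deleting a pod only results in a restart when a controller such as a Deployment recreates it, so the final verification step matters.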
6. Integration with External Services
PagerDuty
- Create incidents automatically
- Sync incident status
- Trigger escalations
Opsgenie
- Create alerts with priorities (P1-P4)
- Assign to on-call team
- Update alert status
Slack
- Send incident notifications
- Color-coded by severity
- Interactive incident updates
Microsoft Teams
- MessageCard format notifications
- Incident summary with actions
- Status updates
Command: Vitals: Configure Incident Integration
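The severity-dependent behavior of these integrations can be sketched as two lookup tables: one mapping severity to an Opsgenie priority and one mapping it to a Slack attachment color. The specific mappings, the hex colors, and the `buildSlackMessage` helper are assumptions; the payload uses Slack's attachment `color`, `title`, and `text` fields, and nothing is sent over the network.

```typescript
// Severity names follow the four levels used elsewhere in this document.
type Severity = 'critical' | 'high' | 'medium' | 'low';

// Assumed one-to-one mapping onto the P1-P4 priorities mentioned above.
const opsgeniePriority: Record<Severity, 'P1' | 'P2' | 'P3' | 'P4'> = {
  critical: 'P1',
  high: 'P2',
  medium: 'P3',
  low: 'P4',
};

// Assumed colors; any hex value is accepted by Slack attachments.
const slackColor: Record<Severity, string> = {
  critical: '#d32f2f',
  high: '#f57c00',
  medium: '#fbc02d',
  low: '#388e3c',
};

// Builds a message payload object only; sending it is out of scope here.
function buildSlackMessage(channel: string, title: string, severity: Severity) {
  return {
    channel: channel,
    attachments: [
      {
        color: slackColor[severity],
        title: `[${severity.toUpperCase()}] ${title}`,
        text: `Opsgenie priority: ${opsgeniePriority[severity]}`,
      },
    ],
  };
}

const slackPayload = buildSlackMessage('#incidents', 'API latency > 2s', 'critical');
// slackPayload.attachments[0].color is '#d32f2f'
```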
7. On-Call Management
Status Bar Integration
When you're on-call, see a badge in VS Code's status bar:
🚨 On-Call
Escalation Policies
Define multi-level escalation:
Primary → Backup → Manager
15 min → 15 min → 30 min delays
Command: Vitals: Show On-Call Schedule
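The Primary → Backup → Manager chain can be sketched as an ordered list of levels. This example reads each delay as the time a level has to acknowledge before the alert moves on, which is one plausible interpretation rather than confirmed Vitals behavior; the type and function names are hypothetical.

```typescript
// Hypothetical policy shape: each level gets a window to acknowledge.
interface EscalationLevel {
  role: string;
  ackTimeoutMinutes: number;
}

const escalationPolicy: EscalationLevel[] = [
  { role: 'Primary', ackTimeoutMinutes: 15 },
  { role: 'Backup', ackTimeoutMinutes: 15 },
  { role: 'Manager', ackTimeoutMinutes: 30 },
];

// Returns who should currently be paged, given how long the incident has
// gone unacknowledged. The last level stays responsible once reached.
function escalationTarget(policy: EscalationLevel[], minutesUnacknowledged: number): string {
  if (policy.length === 0) {
    throw new Error('Escalation policy must have at least one level');
  }
  let elapsed = 0;
  for (const level of policy) {
    elapsed += level.ackTimeoutMinutes;
    if (minutesUnacknowledged < elapsed) {
      return level.role;
    }
  }
  return policy[policy.length - 1].role;
}

const pagedAt20 = escalationTarget(escalationPolicy, 20); // 'Backup'
```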
8. Post-Mortem Generation
Generate comprehensive post-incident reports:
Includes:
- Executive summary
- Complete timeline with annotations
- Root cause analysis (AI-assisted)
- Impact assessment (duration, users affected, revenue loss)
- What went well / What went wrong
- Action items for prevention
- Lessons learned
Output: Markdown file exported to post-mortems/ directory
Command: Vitals: Generate Post-Mortem
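As an illustration of the Markdown output, the sketch below renders a few of the listed sections from a plain object. The `PostMortemDraft` shape, the section headings, and the checkbox format for action items are assumptions, and the timestamps and owner in the usage example are made up; the actual template Vitals uses is not shown in this document.

```typescript
// Hypothetical draft shape; the real PostMortem type is not documented here.
interface PostMortemDraft {
  title: string;
  summary: string;
  timeline: { at: string; note: string }[];
  rootCause: string;
  actionItems: { task: string; owner: string }[];
}

// Render a subset of the report sections as Markdown text.
function renderPostMortem(draft: PostMortemDraft): string {
  const lines: string[] = [
    `# Post-Mortem: ${draft.title}`,
    '',
    '## Executive Summary',
    draft.summary,
    '',
    '## Timeline',
  ];
  for (const entry of draft.timeline) {
    lines.push(`- ${entry.at}: ${entry.note}`);
  }
  lines.push('', '## Root Cause', draft.rootCause, '', '## Action Items');
  for (const item of draft.actionItems) {
    lines.push(`- [ ] ${item.task} (owner: ${item.owner})`);
  }
  return lines.join('\n');
}

// Usage with illustrative data.
const renderedReport = renderPostMortem({
  title: 'API latency above 2s',
  summary: 'A memory leak in the user service degraded API latency.',
  timeline: [
    { at: '14:02', note: 'High CPU on api-pod-1234' },
    { at: '14:20', note: 'Pod restarted via runbook' },
  ],
  rootCause: 'Memory leak in user service',
  actionItems: [{ task: 'Add memory limit to pod spec', owner: 'platform-team' }],
});
```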
Configuration
Add to your VS Code settings:
{
  "vitals.enableIncidentManagement": true,
  "vitals.incidentIntegrations": {
    "pagerduty": {
      "enabled": true,
      "serviceId": "P1234567"
    },
    "slack": {
      "enabled": true,
      "channel": "#incidents"
    }
  },
  "vitals.runbookAutoExecute": false,
  "vitals.postMortemTemplate": "standard",
  "vitals.onCallNotifications": true
}
Workflow Example
Typical Incident Flow
Detection
Alert: API latency > 2s threshold
→ Automatic incident creation
→ Notification sent to on-call engineer
Investigation
Engineer opens incident in VS Code
→ Reviews metric snapshots
→ Adds annotations: "High CPU on api-pod-1234"
→ Adds hypothesis: "Memory leak in user service"
Remediation
Execute runbook: "k8s-pod-restart"
→ Pod deleted automatically
→ New pod starts with fresh memory
→ Metrics normalize
Resolution
Update incident status to "Resolved"
→ Generate post-mortem
→ Create action items:
- Add memory limit to pod spec
- Implement memory profiling
Learning
Post-mortem exported to team wiki
→ Runbook updated with new steps
→ Alert threshold adjusted
API Reference
IncidentManager
class IncidentManager {
  // Create new incident
  createIncident(params: {
    title: string;
    description: string;
    severity?: IncidentSeverity;
    source: 'manual' | 'automated';
  }): Promise<Incident>;

  // Update incident status
  updateStatus(
    incidentId: string,
    status: IncidentStatus
  ): Promise<void>;

  // Add annotation
  addAnnotation(
    incidentId: string,
    annotation: {
      text: string;
      metricReference?: string;
    }
  ): Promise<void>;

  // Add hypothesis
  addHypothesis(
    incidentId: string,
    hypothesis: {
      description: string;
      state: 'pending' | 'confirmed' | 'rejected';
    }
  ): Promise<void>;

  // Capture diagnostic context
  captureMetricSnapshot(
    incidentId: string,
    metricName: string
  ): Promise<void>;

  captureLogSnapshot(
    incidentId: string,
    query: string
  ): Promise<void>;
}
RunbookEngine
class RunbookEngine {
  // Execute runbook
  executeRunbook(
    runbookId: string,
    variables?: Record<string, string>
  ): Promise<{
    success: boolean;
    steps: StepResult[];
  }>;

  // Register custom runbook
  registerRunbook(runbook: {
    id: string;
    name: string;
    description: string;
    steps: RunbookStep[];
  }): void;
}
PostMortemGenerator
class PostMortemGenerator {
  // Generate post-mortem
  generatePostMortem(
    incidentId: string,
    options?: {
      includeAI?: boolean;
      template?: 'standard' | 'detailed' | 'minimal';
    }
  ): Promise<PostMortem>;

  // Save as markdown
  saveAsMarkdown(
    postMortem: PostMortem,
    filepath: string
  ): Promise<void>;
}
Best Practices
1. Severity Classification
Use consistent severity levels:
- Critical - Complete service outage, data loss
- High - Major feature unavailable, severe degradation
- Medium - Minor feature impacted, workaround available
- Low - Cosmetic issue, minimal user impact
2. Annotation Guidelines
- Use timestamps for all observations
- Link to specific metrics/logs when possible
- Be objective and factual
- Include commands/queries executed
3. Hypothesis Testing
- State hypotheses clearly
- Document validation steps
- Update state based on evidence
- Don't delete rejected hypotheses (learning!)
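The "update, don't delete" guideline can be sketched as an append-only log in which a hypothesis changes state but is never removed. The class, its method names, and the second hypothesis in the usage example are hypothetical; only the three states come from this document.

```typescript
type HypothesisState = 'pending' | 'confirmed' | 'rejected';

// Hypothetical record shape for a tracked hypothesis.
interface TrackedHypothesis {
  id: number;
  description: string;
  state: HypothesisState;
  evidence: string[];
}

class HypothesisLog {
  private entries: TrackedHypothesis[] = [];

  // New hypotheses always start as 'pending'.
  add(description: string): TrackedHypothesis {
    const entry: TrackedHypothesis = {
      id: this.entries.length + 1,
      description: description,
      state: 'pending',
      evidence: [],
    };
    this.entries.push(entry);
    return entry;
  }

  // Record the evidence and the outcome; there is deliberately no delete.
  resolve(id: number, state: 'confirmed' | 'rejected', evidence: string): void {
    const matches = this.entries.filter((e) => e.id === id);
    if (matches.length === 0) {
      throw new Error(`Unknown hypothesis id: ${id}`);
    }
    matches[0].state = state;
    matches[0].evidence.push(evidence);
  }

  // Returns every hypothesis, rejected ones included.
  all(): TrackedHypothesis[] {
    return this.entries.slice();
  }
}

// Usage with illustrative data.
const hypothesisLog = new HypothesisLog();
const leakTheory = hypothesisLog.add('Memory leak in user service');
const dnsTheory = hypothesisLog.add('DNS resolution failures');
hypothesisLog.resolve(dnsTheory.id, 'rejected', 'Resolver latency normal during incident');
hypothesisLog.resolve(leakTheory.id, 'confirmed', 'Heap usage grows until pod restart');
```

After these calls `hypothesisLog.all()` still contains both entries, so the post-mortem can show which theories were ruled out and why.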
4. Runbook Maintenance
- Update runbooks after each incident
- Add new edge cases discovered
- Include rollback steps
- Test runbooks regularly in staging
5. Post-Mortem Writing
- Focus on systems, not individuals (blameless)
- Include both positives and negatives
- Create actionable items with owners
- Set realistic timelines for improvements
Metrics & Analytics
Track incident management effectiveness:
- MTTD (Mean Time To Detect) - Alert → Incident created
- MTTA (Mean Time To Acknowledge) - Incident created → Engineer assigned
- MTTI (Mean Time To Identify) - Investigation start → Root cause found
- MTTR (Mean Time To Resolve) - Incident created → Resolved
- Incident Volume - Trends over time
- Recurrence Rate - Same incident repeating
- Runbook Success Rate - Automated remediation effectiveness
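The four time-based metrics are plain averages over per-incident intervals, using the start and end points defined in the list above. The sketch below assumes a hypothetical record of timestamps per incident, expressed in minutes for readability; the field and function names are illustrative, and the sample data is made up.

```typescript
// Hypothetical per-incident timestamps, in minutes since the alert fired.
interface IncidentTimes {
  alertAt: number;
  createdAt: number;
  assignedAt: number;
  investigationStartedAt: number;
  rootCauseFoundAt: number;
  resolvedAt: number;
}

// Arithmetic mean; returns 0 for an empty list to avoid dividing by zero.
function meanOf(values: number[]): number {
  if (values.length === 0) {
    return 0;
  }
  let total = 0;
  for (const v of values) {
    total += v;
  }
  return total / values.length;
}

function incidentMetrics(incidents: IncidentTimes[]) {
  return {
    mttd: meanOf(incidents.map((i) => i.createdAt - i.alertAt)),
    mtta: meanOf(incidents.map((i) => i.assignedAt - i.createdAt)),
    mtti: meanOf(incidents.map((i) => i.rootCauseFoundAt - i.investigationStartedAt)),
    mttr: meanOf(incidents.map((i) => i.resolvedAt - i.createdAt)),
  };
}

// Usage with two illustrative incidents.
const averages = incidentMetrics([
  { alertAt: 0, createdAt: 2, assignedAt: 5, investigationStartedAt: 6, rootCauseFoundAt: 26, resolvedAt: 62 },
  { alertAt: 0, createdAt: 4, assignedAt: 9, investigationStartedAt: 10, rootCauseFoundAt: 20, resolvedAt: 44 },
]);
// averages is { mttd: 3, mtta: 4, mtti: 15, mttr: 50 } (minutes)
```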
Security Considerations
- Credentials Storage: All integration credentials stored in VS Code Secrets API
- Access Control: Configure integration permissions per team
- Audit Logging: All incident actions logged with timestamps
- Data Retention: Configure retention policies for sensitive data
Troubleshooting
Integrations Not Working
# Check credential storage
Command Palette → "Vitals: Configure Incident Integration"
Re-enter API keys
# Verify network connectivity
curl -X POST https://api.pagerduty.com/incidents
Runbooks Failing
# Check command availability
which kubectl # Should return path
which aws # Should return path
# Verify permissions
kubectl auth can-i delete pods # Should return 'yes'
Post-Mortems Not Generating
# Check workspace permissions
# Ensure write access to workspace folder
# post-mortems/ directory created automatically
Related Documentation
- CI/CD Integration - Correlate deployments with incidents
- Distributed Tracing - Debug with traces
- Premium Features - Enterprise features
Support
- GitHub Issues: Report bugs
- Discussions: Ask questions