Incident Management

Vitals provides incident management capabilities that help teams detect, investigate, resolve, and learn from production incidents, all within VS Code.

Overview

The Incident Management system enables collaborative debugging workflows, automated remediation, and post-incident analysis without leaving your development environment.

Key Features

1. Incident Detection & Declaration

Declare incidents manually, or have them created automatically when anomalies are detected:

typescript

// Incidents can be created from:
// - Manual declaration (Command Palette)
// - Automated detection from metric thresholds
// - CI/CD deployment failures
// - Integration webhooks (PagerDuty, Opsgenie)

Commands:

Vitals: Create Incident - Manually declare a new incident
Vitals: Show Incident Details - View incident timeline and details

2. Incident Lifecycle Management

Track incidents through their complete lifecycle:

Detected - Incident identified
Investigating - Team actively debugging
Identified - Root cause found
Monitoring - Fix deployed, monitoring for recurrence
Resolved - Incident fully resolved
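
The lifecycle above can be read as a small state machine. The sketch below is illustrative only: the lowercase status names, the transition table (including the step back to "investigating" when a root cause or fix does not hold), and the `canTransition` helper are assumptions, not documented Vitals behavior.

```typescript
// Hypothetical model of the five lifecycle states listed above.
type IncidentStatus =
  | "detected"
  | "investigating"
  | "identified"
  | "monitoring"
  | "resolved";

// Assumed transition table: forward progress, plus a step back to
// "investigating" if the identified cause or deployed fix turns out wrong.
const allowedTransitions: Record<IncidentStatus, IncidentStatus[]> = {
  detected: ["investigating"],
  investigating: ["identified"],
  identified: ["monitoring", "investigating"],
  monitoring: ["resolved", "investigating"],
  resolved: [],
};

// Returns true when moving from `from` to `to` is allowed by the table.
function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return allowedTransitions[from].includes(to);
}

console.log(canTransition("detected", "investigating")); // true
console.log(canTransition("detected", "resolved")); // false
```

Keeping the table as data rather than branching logic makes it easy to adjust if a team wants to allow, for example, reopening a resolved incident.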

3. Collaborative Debugging

Annotations

Add observations and notes during investigation:

text

# Command: Vitals: Add Incident Annotation
- Timestamp-based observations
- Link to specific metrics or logs
- Attach screenshots or diagnostic data

Hypothesis Tracking

Document and test theories systematically:

text

# Command: Vitals: Add Incident Hypothesis
- State: pending, confirmed, rejected
- Link to evidence (metrics, logs, traces)
- Track validation steps

4. Context Capture

Automatically capture diagnostic context:

Metric Snapshots - Performance data at incident time
Log Snapshots - Relevant log entries
Trace IDs - Distributed traces for debugging
Git Context - Recent commits and deployments

5. Runbook Automation

Execute predefined remediation steps automatically:

Pre-configured Runbooks:

Kubernetes Pod Restart

yaml

Steps:
  1. Verify pod status (automated)
  2. Delete failing pod (automated)
  3. Verify new pod startup (automated)

High CPU Mitigation

yaml

Steps:
  1. Identify high CPU processes (automated)
  2. Scale deployment (automated/manual)
  3. Verify CPU normalization (automated)

Command: Vitals: Execute Runbook

Supported Actions:

kubectl commands (Kubernetes operations)
aws_cli commands (AWS operations)
azure_cli commands (Azure operations)
http requests (API calls)
script execution (Custom scripts)
notification (Alert team members)

6. Integration with External Services

PagerDuty

Create incidents automatically
Sync incident status
Trigger escalations

Opsgenie

Create alerts with priorities (P1-P4)
Assign to on-call team
Update alert status

Slack

Send incident notifications
Color-coded by severity
Interactive incident updates

Microsoft Teams

MessageCard format notifications
Incident summary with actions
Status updates

Command: Vitals: Configure Incident Integration

7. On-Call Management

Status Bar Integration

When you're on-call, see a badge in VS Code's status bar:

text

🚨 On-Call

Escalation Policies

Define multi-level escalation:

text

Primary → Backup → Manager
15 min → 15 min → 30 min delays

Command: Vitals: Show On-Call Schedule
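
To make the escalation example concrete, here is a minimal sketch of how such a policy could be evaluated. It assumes each delay is the time a level has to acknowledge before the next level is paged (the source does not spell this out); `EscalationLevel` and `pagedRoles` are hypothetical names, not part of Vitals.

```typescript
// Hypothetical escalation policy mirroring the example above.
interface EscalationLevel {
  role: string;
  // Assumed meaning: minutes this level has to acknowledge
  // before the next level is paged.
  ackTimeoutMinutes: number;
}

const policy: EscalationLevel[] = [
  { role: "Primary", ackTimeoutMinutes: 15 },
  { role: "Backup", ackTimeoutMinutes: 15 },
  { role: "Manager", ackTimeoutMinutes: 30 },
];

// Returns the roles that should have been paged once `elapsedMinutes`
// have passed without an acknowledgement.
function pagedRoles(levels: EscalationLevel[], elapsedMinutes: number): string[] {
  const paged: string[] = [];
  let levelStart = 0; // minute at which the current level gets paged
  for (const level of levels) {
    if (elapsedMinutes < levelStart) break;
    paged.push(level.role);
    levelStart += level.ackTimeoutMinutes;
  }
  return paged;
}

console.log(pagedRoles(policy, 0).join(", ")); // Primary
console.log(pagedRoles(policy, 15).join(", ")); // Primary, Backup
console.log(pagedRoles(policy, 40).join(", ")); // Primary, Backup, Manager
```

Under this reading, the Primary is paged immediately, the Backup at 15 minutes, and the Manager at 30 minutes.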

8. Post-Mortem Generation

Generate comprehensive post-incident reports:

Includes:

Executive summary
Complete timeline with annotations
Root cause analysis (AI-assisted)
Impact assessment (duration, users affected, revenue loss)
What went well / What went wrong
Action items for prevention
Lessons learned

Output: Markdown file exported to post-mortems/ directory

Command: Vitals: Generate Post-Mortem

Configuration

Add to your VS Code settings:

json

{
  "vitals.enableIncidentManagement": true,
  "vitals.incidentIntegrations": {
    "pagerduty": {
      "enabled": true,
      "serviceId": "P1234567"
    },
    "slack": {
      "enabled": true,
      "channel": "#incidents"
    }
  },
  "vitals.runbookAutoExecute": false,
  "vitals.postMortemTemplate": "standard",
  "vitals.onCallNotifications": true
}
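
For teams that generate or lint these settings in scripts, the shape above can be described with a type. This is a sketch under assumptions: the key names come from the example, the field types are inferred from its values, and `findSettingsProblems` with its two checks (non-empty `serviceId`, channel starting with `#`) is a hypothetical helper rather than validation that Vitals is known to perform.

```typescript
// Assumed shape of the settings shown above; types are inferred
// from the example values, not taken from a published schema.
interface VitalsIncidentSettings {
  "vitals.enableIncidentManagement": boolean;
  "vitals.incidentIntegrations": {
    pagerduty?: { enabled: boolean; serviceId: string };
    slack?: { enabled: boolean; channel: string };
  };
  "vitals.runbookAutoExecute": boolean;
  "vitals.postMortemTemplate": string;
  "vitals.onCallNotifications": boolean;
}

// Hypothetical helper: lists problems worth fixing before relying
// on the configured integrations.
function findSettingsProblems(settings: VitalsIncidentSettings): string[] {
  const problems: string[] = [];
  const { pagerduty, slack } = settings["vitals.incidentIntegrations"];
  if (pagerduty?.enabled && pagerduty.serviceId.trim() === "") {
    problems.push("pagerduty is enabled but serviceId is empty");
  }
  if (slack?.enabled && !slack.channel.startsWith("#")) {
    problems.push("slack channel should look like #channel-name");
  }
  return problems;
}

const exampleSettings: VitalsIncidentSettings = {
  "vitals.enableIncidentManagement": true,
  "vitals.incidentIntegrations": {
    pagerduty: { enabled: true, serviceId: "P1234567" },
    slack: { enabled: true, channel: "#incidents" },
  },
  "vitals.runbookAutoExecute": false,
  "vitals.postMortemTemplate": "standard",
  "vitals.onCallNotifications": true,
};

console.log(findSettingsProblems(exampleSettings).length); // 0
```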

Workflow Example

Typical Incident Flow

Detection

text

Alert: API latency > 2s threshold
→ Automatic incident creation
→ Notification sent to on-call engineer

Investigation

text

Engineer opens incident in VS Code
→ Reviews metric snapshots
→ Adds annotations: "High CPU on api-pod-1234"
→ Adds hypothesis: "Memory leak in user service"

Remediation

text

Execute runbook: "k8s-pod-restart"
→ Pod deleted automatically
→ New pod starts with fresh memory
→ Metrics normalize

Resolution

text

Update incident status to "Resolved"
→ Generate post-mortem
→ Create action items:
  - Add memory limit to pod spec
  - Implement memory profiling

Learning

text

Post-mortem exported to team wiki
→ Runbook updated with new steps
→ Alert threshold adjusted

API Reference

IncidentManager

typescript

class IncidentManager {
  // Create new incident
  createIncident(params: {
    title: string;
    description: string;
    severity?: IncidentSeverity;
    source: 'manual' | 'automated';
  }): Promise<Incident>;

  // Update incident status
  updateStatus(
    incidentId: string, 
    status: IncidentStatus
  ): Promise<void>;

  // Add annotation
  addAnnotation(
    incidentId: string,
    annotation: {
      text: string;
      metricReference?: string;
    }
  ): Promise<void>;

  // Add hypothesis
  addHypothesis(
    incidentId: string,
    hypothesis: {
      description: string;
      state: 'pending' | 'confirmed' | 'rejected';
    }
  ): Promise<void>;

  // Capture diagnostic context
  captureMetricSnapshot(
    incidentId: string,
    metricName: string
  ): Promise<void>;

  captureLogSnapshot(
    incidentId: string,
    query: string
  ): Promise<void>;
}
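
The signatures above can be exercised with an in-memory stand-in, which is handy for experimenting with the call sequence. Everything below is a sketch: the reference names `Incident`, `IncidentSeverity`, and `IncidentStatus` without defining them, so the definitions here are assumptions, as are the `InMemoryIncidentManager` class, the `"medium"` default severity, and the generated ID format.

```typescript
// Hypothetical type definitions; the reference above only names these types.
type IncidentSeverity = "critical" | "high" | "medium" | "low";
type IncidentStatus = "detected" | "investigating" | "identified" | "monitoring" | "resolved";
type HypothesisState = "pending" | "confirmed" | "rejected";

interface Incident {
  id: string;
  title: string;
  description: string;
  severity: IncidentSeverity;
  source: "manual" | "automated";
  status: IncidentStatus;
  annotations: { text: string; metricReference?: string; at: Date }[];
  hypotheses: { description: string; state: HypothesisState }[];
}

// In-memory stand-in covering a subset of the IncidentManager signatures.
class InMemoryIncidentManager {
  private incidents = new Map<string, Incident>();
  private nextId = 1;

  async createIncident(params: {
    title: string;
    description: string;
    severity?: IncidentSeverity;
    source: "manual" | "automated";
  }): Promise<Incident> {
    const incident: Incident = {
      id: `inc-${this.nextId++}`, // assumed ID format
      title: params.title,
      description: params.description,
      severity: params.severity ?? "medium", // assumed default
      source: params.source,
      status: "detected",
      annotations: [],
      hypotheses: [],
    };
    this.incidents.set(incident.id, incident);
    return incident;
  }

  async updateStatus(incidentId: string, status: IncidentStatus): Promise<void> {
    this.find(incidentId).status = status;
  }

  async addAnnotation(
    incidentId: string,
    annotation: { text: string; metricReference?: string }
  ): Promise<void> {
    // Timestamp every observation at the moment it is recorded.
    this.find(incidentId).annotations.push({ ...annotation, at: new Date() });
  }

  async addHypothesis(
    incidentId: string,
    hypothesis: { description: string; state: HypothesisState }
  ): Promise<void> {
    this.find(incidentId).hypotheses.push(hypothesis);
  }

  private find(incidentId: string): Incident {
    const incident = this.incidents.get(incidentId);
    if (!incident) throw new Error(`Unknown incident: ${incidentId}`);
    return incident;
  }
}

// Usage, following the detection and investigation steps of the workflow example.
async function demo(): Promise<Incident> {
  const manager = new InMemoryIncidentManager();
  const incident = await manager.createIncident({
    title: "API latency > 2s",
    description: "Latency alert crossed its threshold",
    severity: "high",
    source: "automated",
  });
  await manager.updateStatus(incident.id, "investigating");
  await manager.addAnnotation(incident.id, { text: "High CPU on api-pod-1234" });
  await manager.addHypothesis(incident.id, {
    description: "Memory leak in user service",
    state: "pending",
  });
  return incident;
}
```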

RunbookEngine

typescript

class RunbookEngine {
  // Execute runbook
  executeRunbook(
    runbookId: string,
    variables?: Record<string, string>
  ): Promise<{
    success: boolean;
    steps: StepResult[];
  }>;

  // Register custom runbook
  registerRunbook(runbook: {
    id: string;
    name: string;
    description: string;
    steps: RunbookStep[];
  }): void;
}
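
As an illustration of what a custom runbook might look like, the sketch below models the Kubernetes Pod Restart steps as data and performs a dry run. `RunbookStep` is named but not defined in the reference, so its fields, the `{{placeholder}}` variable syntax, and the `renderCommand` helper are assumptions; the pod name and namespace are sample values.

```typescript
// Hypothetical shapes; the reference above does not define RunbookStep.
type RunbookAction = "kubectl" | "aws_cli" | "azure_cli" | "http" | "script" | "notification";

interface RunbookStep {
  name: string;
  action: RunbookAction;
  // Command template; {{name}} placeholders are an assumed syntax.
  command: string;
  automated: boolean;
}

interface Runbook {
  id: string;
  name: string;
  description: string;
  steps: RunbookStep[];
}

// Fills {{placeholders}} from `variables` and fails loudly on a missing one,
// so a half-rendered command is never executed.
function renderCommand(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    const value = variables[key];
    if (value === undefined) throw new Error(`Missing runbook variable: ${key}`);
    return value;
  });
}

const podRestart: Runbook = {
  id: "k8s-pod-restart",
  name: "Kubernetes Pod Restart",
  description: "Delete a failing pod and verify that a replacement starts",
  steps: [
    { name: "Verify pod status", action: "kubectl", command: "kubectl get pod {{pod}} -n {{namespace}}", automated: true },
    { name: "Delete failing pod", action: "kubectl", command: "kubectl delete pod {{pod}} -n {{namespace}}", automated: true },
    { name: "Verify new pod startup", action: "kubectl", command: "kubectl get pods -n {{namespace}}", automated: true },
  ],
};

// Dry run: print the rendered commands instead of executing them.
const variables = { pod: "api-pod-1234", namespace: "production" };
for (const step of podRestart.steps) {
  console.log(`${step.name}: ${renderCommand(step.command, variables)}`);
}
```

A dry run like this is a cheap way to review a runbook before allowing it to execute against a real cluster.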

PostMortemGenerator

typescript

class PostMortemGenerator {
  // Generate post-mortem
  generatePostMortem(
    incidentId: string,
    options?: {
      includeAI?: boolean;
      template?: 'standard' | 'detailed' | 'minimal';
    }
  ): Promise<PostMortem>;

  // Save as markdown
  saveAsMarkdown(
    postMortem: PostMortem,
    filepath: string
  ): Promise<void>;
}
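
To show the kind of output `saveAsMarkdown` might write, here is a small rendering sketch. The `PostMortem` type is not defined in the reference, so the fields below are assumptions covering only a few of the report sections; `renderMarkdown`, the timestamps, and the owner names are hypothetical.

```typescript
// Hypothetical shape; the reference above does not define PostMortem.
interface PostMortem {
  title: string;
  summary: string;
  timeline: { at: string; note: string }[];
  rootCause: string;
  actionItems: { task: string; owner: string }[];
}

// Renders a subset of the report sections as a Markdown string.
function renderMarkdown(pm: PostMortem): string {
  const lines = [
    `# Post-Mortem: ${pm.title}`,
    "",
    "## Executive Summary",
    pm.summary,
    "",
    "## Timeline",
    ...pm.timeline.map((entry) => `- ${entry.at}: ${entry.note}`),
    "",
    "## Root Cause",
    pm.rootCause,
    "",
    "## Action Items",
    ...pm.actionItems.map((item) => `- [ ] ${item.task} (owner: ${item.owner})`),
  ];
  return lines.join("\n") + "\n";
}

// Sample data loosely based on the workflow example; times and owners are made up.
const report = renderMarkdown({
  title: "API latency > 2s",
  summary: "API latency exceeded its threshold until the affected pod was restarted.",
  timeline: [
    { at: "10:02", note: "Incident created from latency alert" },
    { at: "10:15", note: "High CPU on api-pod-1234" },
  ],
  rootCause: "Memory leak in user service",
  actionItems: [
    { task: "Add memory limit to pod spec", owner: "platform-team" },
    { task: "Implement memory profiling", owner: "user-service-team" },
  ],
});

console.log(report);
```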

Best Practices

1. Severity Classification

Use consistent severity levels:

Critical - Complete service outage, data loss
High - Major feature unavailable, severe degradation
Medium - Minor feature impacted, workaround available
Low - Cosmetic issue, minimal user impact

2. Annotation Guidelines

Use timestamps for all observations
Link to specific metrics/logs when possible
Be objective and factual
Include commands/queries executed

3. Hypothesis Testing

State hypotheses clearly
Document validation steps
Update state based on evidence
Don't delete rejected hypotheses; they record what was ruled out and why

4. Runbook Maintenance

Update runbooks after each incident
Add new edge cases discovered
Include rollback steps
Test runbooks regularly in staging

5. Post-Mortem Writing

Focus on systems, not individuals (blameless)
Include both positives and negatives
Create actionable items with owners
Set realistic timelines for improvements

Metrics & Analytics

Track incident management effectiveness:

MTTD (Mean Time To Detect) - Alert → Incident created
MTTA (Mean Time To Acknowledge) - Incident created → Engineer assigned
MTTI (Mean Time To Identify) - Investigation start → Root cause found
MTTR (Mean Time To Resolve) - Incident created → Resolved
Incident Volume - Trends over time
Recurrence Rate - Same incident repeating
Runbook Success Rate - Automated remediation effectiveness
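
The time-based metrics above are plain averages over timestamp differences. The sketch below computes MTTD, MTTA, and MTTR using the start and end points given in the list; the `IncidentTimes` record, the helper names, and the sample timestamps are made up for illustration.

```typescript
// Hypothetical record holding the timestamps the metrics above need.
interface IncidentTimes {
  alertAt: Date;
  createdAt: Date;
  assignedAt: Date;
  resolvedAt: Date;
}

const minutesBetween = (start: Date, end: Date): number =>
  (end.getTime() - start.getTime()) / 60000;

const mean = (values: number[]): number =>
  values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;

function summarize(incidents: IncidentTimes[]) {
  return {
    // MTTD: alert → incident created
    mttdMinutes: mean(incidents.map((i) => minutesBetween(i.alertAt, i.createdAt))),
    // MTTA: incident created → engineer assigned
    mttaMinutes: mean(incidents.map((i) => minutesBetween(i.createdAt, i.assignedAt))),
    // MTTR: incident created → resolved
    mttrMinutes: mean(incidents.map((i) => minutesBetween(i.createdAt, i.resolvedAt))),
  };
}

// Made-up sample data for two incidents.
const sample: IncidentTimes[] = [
  {
    alertAt: new Date("2024-01-01T10:00:00Z"),
    createdAt: new Date("2024-01-01T10:02:00Z"),
    assignedAt: new Date("2024-01-01T10:07:00Z"),
    resolvedAt: new Date("2024-01-01T10:42:00Z"),
  },
  {
    alertAt: new Date("2024-01-02T14:00:00Z"),
    createdAt: new Date("2024-01-02T14:04:00Z"),
    assignedAt: new Date("2024-01-02T14:07:00Z"),
    resolvedAt: new Date("2024-01-02T15:24:00Z"),
  },
];

const summary = summarize(sample);
console.log(summary.mttdMinutes, summary.mttaMinutes, summary.mttrMinutes); // 3 4 60
```

For the sample, detection took 2 and 4 minutes (mean 3), acknowledgement 5 and 3 minutes (mean 4), and resolution 40 and 80 minutes (mean 60).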

Security Considerations

Credential Storage: All integration credentials are stored using the VS Code Secrets API
Access Control: Configure integration permissions per team
Audit Logging: All incident actions logged with timestamps
Data Retention: Configure retention policies for sensitive data

Troubleshooting

Integrations Not Working

bash

# Check credential storage
Command Palette → "Vitals: Configure Incident Integration"
Re-enter API keys

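# Verify network connectivity: any HTTP status code, even an authentication error, means the API is reachable
curl -sS -o /dev/null -w "%{http_code}\n" https://api.pagerduty.com/incidents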

Runbooks Failing

bash

# Check command availability
which kubectl  # Should return path
which aws      # Should return path

# Verify permissions
kubectl auth can-i delete pods  # Should return 'yes'

Post-Mortems Not Generating

bash

# Check workspace permissions
# Ensure write access to workspace folder
# post-mortems/ directory created automatically

Related Documentation

CI/CD Integration - Correlate deployments with incidents
Distributed Tracing - Debug with traces
Premium Features - Enterprise features

Support

GitHub Issues: Report bugs
Discussions: Ask questions
