While deployment frequency and lead time measure how fast you ship, Change Failure Rate (CFR) and Mean Time to Restore (MTTR) measure how well you ship. These stability metrics are the "quality" half of DORA—and they're often the hardest to measure without comprehensive incident tracking. This guide shows you how to measure CFR and MTTR using GitHub data as your primary source.
Understanding these metrics helps you balance speed with reliability, turning the false dichotomy of "fast vs. stable" into a measurable, improvable reality.
The Quality Side of DORA
Why Stability Metrics Matter
High deployment frequency is only valuable if those deployments don't break things. The DORA research found that high-performing teams don't sacrifice stability for speed—they achieve both. In fact, the practices that enable fast delivery (small batches, automation, good testing) also reduce failures.
- Change Failure Rate: The percentage of deployments that result in degraded service, incidents, or rollbacks
- Time to Restore: How quickly you recover when failures do occur
Together, these metrics capture your deployment risk profile. Low CFR means you rarely break things. Low MTTR means when you do, you fix it fast.
The Speed-Stability Relationship
Speed vs. Stability: Myth vs. Reality
"If we ship faster, we'll break more things"
"Elite performers ship faster AND break less"
Why This Works
Key Factors
- Smaller changes = easier to review, test, and debug
- Frequent deploys = faster feedback loops
- Automation = consistent, repeatable processes
- Practice = teams get better at deploying safely
The perceived tradeoff only appears when teams keep shipping in large batches.
Defining "Failure" for Your Organization
What Counts as a Failure?
Before you can measure CFR, you need to define what constitutes a "failure." This varies by organization:
- Narrow definition: Only customer-facing outages (production down)
- Moderate definition: Any incident requiring intervention (rollback, hotfix, manual remediation)
- Broad definition: Any deployment that didn't work as expected (including reverts for non-critical issues)
Failure Definition Examples
E-commerce Platform (Narrow)
- Orders can't be placed
- Payments fail
- Site is inaccessible
- NOT counted: Slow page loads, minor UI bugs
SaaS Product (Moderate)
- Any severity 1-2 incident
- Any rollback or revert
- Any emergency hotfix
- NOT counted: Feature doesn't match spec exactly
Internal Tooling (Broad)
- Any revert for any reason
- Any follow-up fix within 24 hours
- Any user-reported regression
Pick based on your risk tolerance and industry context.
Consistency Matters More Than Precision
The absolute CFR number matters less than consistency over time. If you track the same definition consistently, you can measure improvement. If you keep changing definitions, the trend becomes meaningless.
For industry-specific guidance, see our fintech engineering metrics guide or healthcare engineering metrics guide.
Measuring Change Failure Rate from GitHub
GitHub-Based Failure Detection
Without integrated incident tracking, you can detect failures through Git activity patterns:
Signal 1: Revert Commits
The strongest signal of deployment failure is a revert—an explicit acknowledgment that a change needed to be undone.
Detecting Reverts in GitHub
Pattern Matching
- PR title contains: "revert", "Revert", "rollback", "Rollback"
- Commit message starts with: "Revert "
- PR reverts specific SHA (GitHub auto-detection)
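As a rough sketch, here is what that pattern matching might look like in Python. The `pr` dictionaries and their field names are assumptions for illustration, not the exact GitHub API response shape, and GitHub's own SHA-based revert auto-detection isn't covered here:

```python
import re

# Illustrative revert detection. The `pr` dict shape and field names are
# assumptions for this example, not the exact GitHub API response schema.
REVERT_PATTERN = re.compile(r"\b(revert|rollback)\b", re.IGNORECASE)

def is_revert(pr: dict) -> bool:
    """True if the PR title or body matches a revert/rollback pattern."""
    text = f"{pr.get('title', '')} {pr.get('body', '')}"
    return bool(REVERT_PATTERN.search(text))

# Hypothetical PR data
prs = [
    {"title": "Add checkout retry logic", "body": "Implements retries"},
    {"title": 'Revert "Add checkout retry logic"', "body": "Reverting #123"},
]
print([pr["title"] for pr in prs if is_revert(pr)])  # -> the revert PR only
```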
Change Failure Rate Calculation
CFR = (failed deployments ÷ total deployments) × 100. With GitHub data alone, the simplest proxy is revert PRs divided by total merged PRs.
For example, 4 revert PRs out of 80 merged PRs in a month gives a CFR of 5% for that month.
Signal 2: Hotfix Branches
If your team uses hotfix branches for emergency fixes, these indicate something broke in production:
- Track merges from `hotfix/*` branches
- Count PRs with a "hotfix" label
- Identify PRs merged outside normal review process
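A minimal sketch of hotfix detection, assuming each PR record exposes its head branch name and label names (illustrative field names, not the GitHub API schema):

```python
# Illustrative hotfix detection; `head_branch` and `labels` are assumed
# field names, not the GitHub API schema.
def is_hotfix(pr: dict) -> bool:
    branch = pr.get("head_branch", "")
    labels = [label.lower() for label in pr.get("labels", [])]
    return branch.startswith("hotfix/") or "hotfix" in labels

prs = [
    {"head_branch": "hotfix/payment-timeout", "labels": []},
    {"head_branch": "feature/new-dashboard", "labels": ["enhancement"]},
]
print(sum(is_hotfix(pr) for pr in prs))  # -> 1 hotfix merge
```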
Signal 3: Rapid Follow-up Fixes
When a PR is merged and then another PR touching the same files is merged within a short window (24-48 hours), it may indicate the original change needed fixing.
Rapid Fix Detection
For Each Merged PR
- Get list of files modified
- Look for subsequent PRs within 24-48 hours
- Check if subsequent PRs modify same files
- Filter for "fix" patterns in title/description
Signal Weighting (Example)
- Revert: 100% failure signal
- Hotfix branch: 90% failure signal
- Rapid follow-up: 50% failure signal (might be iteration)
Combined CFR
Combined CFR = (weighted sum of failure signals ÷ total merged PRs) × 100
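Using the example weights above, a weighted CFR could be computed as follows (the counts in the usage line are made up for illustration):

```python
# Weighted CFR sketch using the example signal weights above.
def combined_cfr(total_merged: int, reverts: int, hotfixes: int, rapid_followups: int) -> float:
    weighted_failures = reverts * 1.0 + hotfixes * 0.9 + rapid_followups * 0.5
    return 100 * weighted_failures / total_merged if total_merged else 0.0

# e.g. 75 merged PRs, 2 reverts, 1 hotfix merge, 4 rapid follow-ups
print(f"{combined_cfr(75, 2, 1, 4):.1f}%")  # -> 6.5%
```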
📊 How CodePulse Tracks Change Failure Rate
CodePulse automatically detects failure signals from your GitHub activity:
- Revert detection from PR patterns and commit messages
- Hotfix identification from branch naming and labels
- Rapid follow-up analysis for potential fixes
- Dashboard - View CFR trends alongside other DORA metrics
Tracking Time to Restore Without Full Observability
What MTTR Measures
Mean Time to Restore (MTTR) captures how quickly you recover from failures. In a full observability setup, this is measured from incident detection to resolution. With GitHub data only, we measure recovery time from the code perspective.
GitHub-Based MTTR Approximation
MTTR Calculation Methods from GitHub
Method 1: Revert Time
- Start: Original PR merge time
- End: Revert PR merge time
- MTTR = End - Start
Method 2: Hotfix Cycle Time
- Start: Hotfix branch creation
- End: Hotfix PR merge time
- MTTR = End - Start
Method 3: Issue-to-Fix
- Start: Bug/incident issue creation
- End: Fixing PR merge time
- MTTR = End - Start (requires linked issues)
Aggregate MTTR
Average the restore times of every detected failure in the period to get your aggregate MTTR. Use the median rather than the mean when you have outliers (very long or very short recovery times) to get a more representative value.
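Here is a sketch of Method 1 (revert time) combined with median aggregation, using made-up timestamps to show how one outlier skews the mean but not the median:

```python
from datetime import datetime
from statistics import median

# Each pair holds (original PR merge time, revert PR merge time).
# The timestamps are illustrative, not real data.
failures = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 13, 30)),   # 3.5 h
    (datetime(2024, 5, 6, 9, 0), datetime(2024, 5, 6, 10, 0)),     # 1 h
    (datetime(2024, 5, 12, 14, 0), datetime(2024, 5, 14, 14, 0)),  # 48 h outlier
]

restore_hours = [(revert - original).total_seconds() / 3600 for original, revert in failures]
print(f"mean:   {sum(restore_hours) / len(restore_hours):.1f} h")  # skewed by the outlier
print(f"median: {median(restore_hours):.1f} h")                    # more representative
```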
Limitations of GitHub-Only MTTR
GitHub-based MTTR has blind spots:
- Detection time: You don't know when the failure was discovered, only when the fix was merged
- Deployment time: The fix is "done" at merge, but may take time to deploy
- Incidents without code fixes: Configuration changes, rollbacks via deployment tools, etc.
Despite limitations, GitHub-based MTTR tracks the engineering response time—how quickly your team produces a fix—which is valuable on its own.
Connecting Incidents to Code Changes
Building the Connection
The most accurate CFR and MTTR measurements connect incidents to specific code changes. Even without automated tooling, you can improve tracking:
- Issue linking: When creating a fix PR, link it to the incident issue or the original PR
- Labels: Tag PRs as "incident-fix", "hotfix", or "revert"
- Convention: Include original PR number in revert/fix PR title
PR Linking Conventions
Revert PR:
- Title: "Revert: Original feature title (#123)"
- Body: "Reverting #123 due to [reason]"
- Labels: revert, incident
Hotfix PR:
- Title: "Hotfix: Brief description of fix"
- Body: "Fixes incident from #123"
- Labels: hotfix, priority/critical
Follow-up fix:
- Title: "Fix: Issue with original feature"
- Body: "Addresses problem introduced in #123"
- Labels: fix, incident-related
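If your revert and fix PRs follow these conventions, attributing a failure back to the original change can be as simple as extracting the referenced PR number from the body. A minimal sketch (the regex and function name are illustrative):

```python
import re

# Extract referenced PR numbers ("#123") from a revert or fix PR body
# that follows the conventions above.
PR_REFERENCE = re.compile(r"#(\d+)")

def linked_pr_numbers(pr_body: str) -> list[int]:
    return [int(number) for number in PR_REFERENCE.findall(pr_body)]

print(linked_pr_numbers("Reverting #123 due to elevated checkout error rates"))  # -> [123]
```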
For detailed strategies, see our incident correlation guide.
Improving CFR Without Slowing Down
The Right Interventions
High CFR often tempts teams to add gates and slow down. But the DORA research shows elite teams achieve low CFR while maintaining high velocity. Here's how:
Prevention: Catch Issues Before Production
- Better testing: Focus on integration tests that catch real issues, not just unit test coverage numbers
- Code review focus: Review for correctness and edge cases, not just style (see code review culture guide)
- Pre-merge validation: Required CI checks that actually catch problems
Detection: Find Issues Quickly
- Monitoring: Alerting that detects anomalies quickly
- Staged rollouts: Canary deployments that limit blast radius
- Feature flags: Ability to disable problematic features without rollback
Recovery: Fix Issues Fast
- One-click rollbacks: Automated ability to revert to last good state
- Hotfix fast-path: Expedited review for critical fixes
- Runbooks: Documented response procedures
What NOT to Do
Common Mistakes vs. Better Approaches
Common mistakes:
- Adding more approval gates - delays don't improve quality, and reviewers get fatigued
- Requiring more manual QA - slows everything down and still misses issues
- Batching changes into big releases - bigger batches mean bigger failures that are harder to debug
- Blaming developers for failures - creates fear, hides problems, and stops experimentation
Better approaches:
- Invest in automation - consistent, repeatable processes catch issues early
- Improve test coverage - automated tests are faster and more reliable
- Ship smaller changes more often - easier to review, debug, and roll back
- Build fast recovery capabilities - reduces impact when failures do occur
DORA Benchmarks for Stability Metrics
| Level | Change Failure Rate | Time to Restore |
|---|---|---|
| Elite | 0-15% | < 1 hour |
| High | 16-30% | < 1 day |
| Medium | 16-30% | 1 day - 1 week |
| Low | 16-30% | > 1 week |
Note: High, Medium, and Low performers share a similar CFR range but differ dramatically in Time to Restore. This suggests that elite performers prevent more failures, while the remaining tiers are distinguished mainly by how quickly they recover when failures do occur.
💡 Start with MTTR
If you're new to stability metrics, start by improving MTTR rather than CFR. Fast recovery is achievable through process improvements (rollback automation, on-call procedures), while reducing CFR requires deeper technical investments. Plus, low MTTR reduces the business impact of any CFR level.
For implementation guidance on all four DORA metrics, see our DORA Four Keys implementation guide.
See these insights for your team
CodePulse connects to your GitHub and shows you actionable engineering metrics in minutes. No complex setup required.
Free tier available. No credit card required.
Related Guides
No CI/CD Access? Here's How Google Measures DORA Anyway
Measure DORA Four Keys (deployment frequency, lead time, change failure rate, time to restore) using only GitHub data—no CI/CD integration required.
DORA Metrics Are Being Weaponized. Here's the Fix
DORA metrics were designed for research, not management. Learn how to use them correctly as signals for improvement, not targets to game.
Find the Exact Commit That Broke Production in 5 Minutes
Learn to identify which code patterns lead to production incidents using file hotspots, risky change detection, and post-incident analysis.