The DORA Metric Everyone Ignores (Until Production Breaks)

Learn how to measure Change Failure Rate and Mean Time to Restore using GitHub data, even without full incident tracking integration.

12 min read · Updated December 25, 2025 · By CodePulse Team

While deployment frequency and lead time measure how fast you ship, Change Failure Rate (CFR) and Mean Time to Restore (MTTR) measure how well you ship. These stability metrics are the "quality" half of DORA—and they're often the hardest to measure without comprehensive incident tracking. This guide shows you how to measure CFR and MTTR using GitHub data as your primary source.

Understanding these metrics helps you balance speed with reliability, turning the false dichotomy of "fast vs. stable" into a measurable, improvable reality.

The Quality Side of DORA

Why Stability Metrics Matter

High deployment frequency is only valuable if those deployments don't break things. The DORA research found that high-performing teams don't sacrifice stability for speed—they achieve both. In fact, the practices that enable fast delivery (small batches, automation, good testing) also reduce failures.

  • Change Failure Rate: The percentage of deployments that result in degraded service, incidents, or rollbacks
  • Time to Restore: How quickly you recover when failures do occur

Together, these metrics capture your deployment risk profile. Low CFR means you rarely break things. Low MTTR means when you do, you fix it fast.

The Speed-Stability Relationship

Speed vs. Stability: Myth vs. Reality

Common Misconception

"If we ship faster, we'll break more things"

DORA Research Finding

"Elite performers ship faster AND break less"

Why This Works

Key Factors
  • Smaller changes = easier to review, test, and debug
  • Frequent deploys = faster feedback loops
  • Automation = consistent, repeatable processes
  • Practice = teams get better at deploying safely

The perceived tradeoff between speed and stability comes from working in large batches.

Detect code hotspots and knowledge silos with CodePulse

Defining "Failure" for Your Organization

What Counts as a Failure?

Before you can measure CFR, you need to define what constitutes a "failure." This varies by organization:

  • Narrow definition: Only customer-facing outages (production down)
  • Moderate definition: Any incident requiring intervention (rollback, hotfix, manual remediation)
  • Broad definition: Any deployment that didn't work as expected (including reverts for non-critical issues)

Failure Definition Examples

E-commerce Platform (Narrow)
  • Orders can't be placed
  • Payments fail
  • Site is inaccessible
  • NOT counted: Slow page loads, minor UI bugs
SaaS Product (Moderate)
  • Any severity 1-2 incident
  • Any rollback or revert
  • Any emergency hotfix
  • NOT counted: Feature doesn't match spec exactly
Internal Tooling (Broad)
  • Any revert for any reason
  • Any follow-up fix within 24 hours
  • Any user-reported regression

Pick based on your risk tolerance and industry context.

Consistency Matters More Than Precision

The absolute CFR number matters less than consistency over time. If you track the same definition consistently, you can measure improvement. If you keep changing definitions, the trend becomes meaningless.

For industry-specific guidance, see our fintech engineering metrics guide or healthcare engineering metrics guide.

Measuring Change Failure Rate from GitHub

GitHub-Based Failure Detection

Without integrated incident tracking, you can detect failures through Git activity patterns:

Signal 1: Revert Commits

The strongest signal of deployment failure is a revert—an explicit acknowledgment that a change needed to be undone.

Detecting Reverts in GitHub

Pattern Matching
  • PR title contains: "revert", "Revert", "rollback", "Rollback"
  • Commit message starts with: "Revert "
  • PR reverts specific SHA (GitHub auto-detection)

Change Failure Rate Calculation

CFR = (Revert PRs) / (Total PRs merged to main)

Example (monthly): 100 PRs merged, 5 revert PRs → CFR = 5/100 = 5% (a scripted version is sketched below).
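
As a rough sketch of this approach (assuming Python with the `requests` library, a hypothetical `your-org/your-repo` repository, and a `GITHUB_TOKEN` environment variable), the script below pulls merged PRs from the GitHub REST API, flags reverts by title pattern, and computes CFR:

```python
# Sketch: estimate Change Failure Rate from revert PRs.
# Repo name, token handling, and the revert pattern are assumptions to adapt.
import os
import re
import requests

GITHUB_API = "https://api.github.com"
REPO = "your-org/your-repo"  # hypothetical repository
REVERT_PATTERN = re.compile(r"\b(revert|rollback)\b", re.IGNORECASE)

def merged_prs(repo: str, base: str = "main") -> list[dict]:
    """Fetch closed PRs against `base` and keep only the merged ones."""
    headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
    prs, page = [], 1
    while True:  # in practice, bound this by date range or page count
        resp = requests.get(
            f"{GITHUB_API}/repos/{repo}/pulls",
            headers=headers,
            params={"state": "closed", "base": base, "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        prs.extend(pr for pr in batch if pr.get("merged_at"))
        page += 1
    return prs

def change_failure_rate(prs: list[dict]) -> float:
    """CFR = revert PRs / total merged PRs."""
    reverts = [pr for pr in prs if REVERT_PATTERN.search(pr["title"])]
    return len(reverts) / len(prs) if prs else 0.0

if __name__ == "__main__":
    prs = merged_prs(REPO)
    print(f"Merged PRs: {len(prs)}, CFR: {change_failure_rate(prs):.1%}")
```

On a month of data this reproduces the worked example above: 5 revert PRs out of 100 merged PRs gives 5%.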

Signal 2: Hotfix Branches

If your team uses hotfix branches for emergency fixes, these indicate something broke in production (a small detection sketch follows the list):

  • Track merges from hotfix/* branches
  • Count PRs with "hotfix" label
  • Identify PRs merged outside normal review process
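
Working from the same PR payloads as the sketch above, hotfix detection can be a simple predicate over the head branch and labels; the branch prefix and label names here are assumptions to adapt to your conventions:

```python
# Sketch: flag a merged PR as a hotfix, assuming PR dicts from the GitHub
# REST API (head ref and labels are part of the standard pull request payload).
HOTFIX_BRANCH_PREFIXES = ("hotfix/",)  # adjust to your branch naming
HOTFIX_LABELS = {"hotfix"}             # adjust to your labels

def is_hotfix(pr: dict) -> bool:
    branch = pr["head"]["ref"].lower()
    labels = {label["name"].lower() for label in pr.get("labels", [])}
    return branch.startswith(HOTFIX_BRANCH_PREFIXES) or bool(labels & HOTFIX_LABELS)
```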

Signal 3: Rapid Follow-up Fixes

When a PR is merged and then another PR touching the same files is merged within a short window (24-48 hours), it may indicate the original change needed fixing.

Rapid Fix Detection

For Each Merged PR
  • Get list of files modified
  • Look for subsequent PRs within 24-48 hours
  • Check if subsequent PRs modify same files
  • Filter for "fix" patterns in title/description
Signal Weighting (Example)
  • Revert: 100% failure signal
  • Hotfix branch: 90% failure signal
  • Rapid follow-up: 50% failure signal (might be iteration)

Combined CFR

(Full-weight failures + weighted partial signals) / Total deploys
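
A minimal sketch of the combined, weighted calculation might look like this; the weights, the 48-hour window, and the `files_for` helper (a stand-in for GitHub's "list pull request files" endpoint) are all assumptions:

```python
# Sketch: weighted CFR from multiple failure signals.
import re
from datetime import datetime, timedelta

FIX_PATTERN = re.compile(r"\bfix", re.IGNORECASE)
SIGNAL_WEIGHTS = {"revert": 1.0, "hotfix": 0.9, "rapid_followup": 0.5}

def parse_time(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def rapid_followup_targets(prs, files_for, window_hours=48):
    """PR numbers whose files were touched again by a 'fix' PR within the window.
    `files_for(pr)` is assumed to return the set of files the PR modified."""
    flagged = set()
    merged = sorted(prs, key=lambda p: p["merged_at"])
    for i, pr in enumerate(merged):
        t0 = parse_time(pr["merged_at"])
        for later in merged[i + 1:]:
            if parse_time(later["merged_at"]) - t0 > timedelta(hours=window_hours):
                break
            if FIX_PATTERN.search(later["title"]) and files_for(pr) & files_for(later):
                flagged.add(pr["number"])
                break
    return flagged

def weighted_cfr(prs, signals_for) -> float:
    """`signals_for(pr)` returns the set of signals a PR triggered;
    each PR contributes only its single strongest signal weight."""
    if not prs:
        return 0.0
    total = sum(
        max((SIGNAL_WEIGHTS[s] for s in signals_for(pr)), default=0.0) for pr in prs
    )
    return total / len(prs)
```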

📊 How CodePulse Tracks Change Failure Rate

CodePulse automatically detects failure signals from your GitHub activity:

  • Revert detection from PR patterns and commit messages
  • Hotfix identification from branch naming and labels
  • Rapid follow-up analysis for potential fixes
  • Dashboard - View CFR trends alongside other DORA metrics

Tracking Time to Restore Without Full Observability

What MTTR Measures

Mean Time to Restore (MTTR) captures how quickly you recover from failures. In a full observability setup, this is measured from incident detection to resolution. With GitHub data only, we measure recovery time from the code perspective.

GitHub-Based MTTR Approximation

MTTR Calculation Methods from GitHub

Method 1: Revert Time
  • Start: Original PR merge time
  • End: Revert PR merge time
  • MTTR = End - Start
Method 2: Hotfix Cycle Time
  • Start: Hotfix branch creation
  • End: Hotfix PR merge time
  • MTTR = End - Start
Method 3: Issue-to-Fix
  • Start: Bug/incident issue creation
  • End: Fixing PR merge time
  • MTTR = End - Start (requires linked issues)

Aggregate MTTR

Mean or Median across all incidents in period

Use median when you have outliers (very long or very short recovery times) to get a more representative value.
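
Here is a rough sketch of Method 1 (revert time), assuming the revert PR references the original PR as "#123" in its title or body (the linking convention recommended in the next section) and that `prs` are merged-PR dicts from the GitHub API:

```python
# Sketch: approximate MTTR from revert pairs (Method 1).
import re
from datetime import datetime
from statistics import median

REVERT_PATTERN = re.compile(r"\b(revert|rollback)\b", re.IGNORECASE)
PR_REF = re.compile(r"#(\d+)")

def parse_time(ts: str) -> datetime:
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def restore_times_hours(prs: list[dict]) -> list[float]:
    """Hours between an original PR's merge and the merge of the PR that reverted it."""
    by_number = {pr["number"]: pr for pr in prs}
    durations = []
    for pr in prs:
        if not REVERT_PATTERN.search(pr["title"]):
            continue
        refs = PR_REF.findall(f"{pr['title']} {pr.get('body') or ''}")
        original = next((by_number[int(r)] for r in refs if int(r) in by_number), None)
        if original:
            delta = parse_time(pr["merged_at"]) - parse_time(original["merged_at"])
            durations.append(delta.total_seconds() / 3600)
    return durations

# Median is more robust to outliers than the mean:
# median(restore_times_hours(prs))
```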

Limitations of GitHub-Only MTTR

GitHub-based MTTR has blind spots:

  • Detection time: You don't know when the failure was discovered, only when the fix was merged
  • Deployment time: The fix is "done" at merge, but may take time to deploy
  • Incidents without code fixes: Configuration changes, rollbacks via deployment tools, etc.

Despite limitations, GitHub-based MTTR tracks the engineering response time—how quickly your team produces a fix—which is valuable on its own.

Identify bottlenecks slowing your team with CodePulse

Connecting Incidents to Code Changes

Building the Connection

The most accurate CFR and MTTR measurements connect incidents to specific code changes. Even without automated tooling, you can improve tracking:

  • Issue linking: When creating a fix PR, link it to the incident issue or the original PR
  • Labels: Tag PRs as "incident-fix", "hotfix", or "revert"
  • Convention: Include original PR number in revert/fix PR title
PR Linking Conventions:

Revert PR:
  Title: "Revert: Original feature title (#123)"
  Body: "Reverting #123 due to [reason]"
  Labels: revert, incident

Hotfix PR:
  Title: "Hotfix: Brief description of fix"
  Body: "Fixes incident from #123"
  Labels: hotfix, priority/critical

Follow-up fix:
  Title: "Fix: Issue with original feature"
  Body: "Addresses problem introduced in #123"
  Labels: fix, incident-related

For detailed strategies, see our incident correlation guide.

Improving CFR Without Slowing Down

The Right Interventions

High CFR often tempts teams to add gates and slow down. But the DORA research shows elite teams achieve low CFR while maintaining high velocity. Here's how:

Prevention: Catch Issues Before Production

  • Better testing: Focus on integration tests that catch real issues, not just unit test coverage numbers
  • Code review focus: Review for correctness and edge cases, not just style (see code review culture guide)
  • Pre-merge validation: Required CI checks that actually catch problems

Detection: Find Issues Quickly

  • Monitoring: Alerting that detects anomalies quickly
  • Staged rollouts: Canary deployments that limit blast radius
  • Feature flags: Ability to disable problematic features without rollback

Recovery: Fix Issues Fast

  • One-click rollbacks: Automated ability to revert to last good state
  • Hotfix fast-path: Expedited review for critical fixes
  • Runbooks: Documented response procedures

What NOT to Do

Common Mistakes vs. Better Approaches

What Hurts Velocity Without Helping CFR
  • Adding more approval gates - delays don't improve quality, reviewers get fatigued
  • Requiring more manual QA - slows everything, still misses issues
  • Batching changes into big releases - bigger batches = bigger failures, harder to debug
  • Blaming developers for failures - creates fear, hides problems, stops experimentation
What Actually Works
  • Invest in automation - consistent, repeatable processes catch issues early
  • Improve test coverage - automated tests are faster and more reliable
  • Ship smaller changes more often - easier to review, debug, and roll back
  • Build fast recovery capabilities - reduces impact when failures do occur

DORA Benchmarks for Stability Metrics

Level   | Change Failure Rate | Time to Restore
Elite   | 0-15%               | < 1 hour
High    | 16-30%              | < 1 day
Medium  | 16-30%              | 1 day - 1 week
Low     | 16-30%              | > 1 week

Note: High, Medium, and Low performers share the same CFR range but differ dramatically in MTTR. In other words, elite performers prevent more failures outright, while the tiers below elite are separated mainly by how quickly they recover when failures do occur.
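
For quick reference, the tiers in the table can be encoded directly (CFR as a fraction, time to restore in hours); this is a sketch of the table's boundaries, not an official DORA formula:

```python
# Sketch: classify each stability metric against the benchmark tiers above.
def cfr_tier(cfr: float) -> str:
    # The table only separates Elite (0-15%) from the other tiers (16-30%).
    return "Elite" if cfr <= 0.15 else "High / Medium / Low"

def mttr_tier(hours: float) -> str:
    if hours < 1:
        return "Elite"
    if hours < 24:
        return "High"
    if hours <= 24 * 7:
        return "Medium"
    return "Low"

print(cfr_tier(0.05), mttr_tier(6))  # -> Elite High
```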

💡 Start with MTTR

If you're new to stability metrics, start by improving MTTR rather than CFR. Fast recovery is achievable through process improvements (rollback automation, on-call procedures), while reducing CFR requires deeper technical investments. Plus, low MTTR reduces the business impact of any CFR level.

For implementation guidance on all four DORA metrics, see our DORA Four Keys implementation guide.

See these insights for your team

CodePulse connects to your GitHub and shows you actionable engineering metrics in minutes. No complex setup required.

Free tier available. No credit card required.