
The DORA Metric Everyone Ignores (Until Production Breaks)

Learn how to measure Change Failure Rate and Mean Time to Restore using GitHub data, even without full incident tracking integration.

14 min read · Updated March 26, 2026 · By CodePulse Team

Deployment frequency tells you how fast you ship. Change failure rate and mean time to restore tell you whether that speed is sustainable. This guide covers how to measure both stability metrics using GitHub data, what the benchmarks actually mean, and how to improve them without adding gates that slow your team down.

Quick Answer

What are change failure rate and mean time to restore?

Change failure rate (CFR) is the percentage of deployments that cause production incidents, rollbacks, or degraded service. Mean time to restore (MTTR) is how quickly you recover when failures happen. According to the 2024 DORA report, elite teams maintain CFR below 15% while restoring service in under one hour. CodePulse tracks both metrics automatically by detecting revert patterns, hotfix branches, and rapid follow-up fixes in your GitHub data.

Most engineering leaders track deployment frequency and lead time because they are easy to measure. The stability half of DORA, CFR and MTTR, gets less attention because it requires connecting code changes to production outcomes. That connection is harder without incident management tooling. But GitHub data alone provides enough signal to start measuring both metrics today.

According to the 2024 DORA State of DevOps Report, the teams that ship the fastest also break the least. That finding challenges the common assumption that speed and stability are trade-offs. They are not. The same practices that enable frequent deployment (small batches, good testing, automation) also prevent failures.

"If your team is choosing between shipping fast and shipping safely, you have an engineering problem, not a prioritization problem."

What do change failure rate and MTTR actually measure?

Change failure rate captures the percentage of deployments that result in degraded service and require remediation. MTTR (which the 2024 DORA report renamed to "failed deployment recovery time") measures how quickly your team restores service after a failure.

Together they form your deployment risk profile. Low CFR means you rarely break things. Low MTTR means when you do, the impact is contained. A team with 5% CFR and 30-minute MTTR is in a fundamentally different position than a team with 5% CFR and 3-day MTTR, even though the failure rates are identical.

Performance Level | Change Failure Rate | Recovery Time
Elite             | 0-15%               | Less than 1 hour
High              | 16-30%              | Less than 1 day
Medium            | 16-30%              | 1 day to 1 week
Low               | 16-30%              | More than 6 months

Notice that medium and low performers have similar CFR but differ dramatically in recovery time. The 2024 DORA data suggests that what separates high performers from low performers is not how often things break, but how fast they recover.

For a broader view of all four DORA metrics, see our complete DORA metrics guide.

How do you define "failure" for your organization?

Before you can measure CFR, you need to decide what counts as a failure. This is where most teams get stuck. The answer depends on your risk tolerance and industry context.

Narrow definition: Only customer-facing outages where the product is completely unusable. An e-commerce platform might count only events where orders cannot be placed or payments fail.

Moderate definition: Any incident requiring intervention, including rollbacks, hotfixes, and manual remediation. A SaaS product might count any severity 1-2 incident or any revert.

Broad definition: Any deployment that did not work as expected, including reverts for non-critical issues and follow-up fixes. Internal tooling teams often use this definition because the cost of false positives is low.

"Pick a failure definition and stick with it for at least six months. The trend matters more than the absolute number, and changing definitions destroys your trend line."

The right definition depends on your situation, but consistency matters more than precision. If you track the same definition over time, you can measure improvement. If you keep changing definitions, the trend becomes meaningless.

Detect code hotspots and knowledge silos with CodePulse

How do you measure change failure rate from GitHub data?

Without integrated incident tracking, GitHub activity patterns provide three reliable failure signals.

Signal 1: Revert commits

The strongest failure signal. A revert is an explicit acknowledgment that a change needed to be undone. Detection patterns include PR titles containing "revert" or "rollback", commit messages starting with "Revert ", and GitHub auto-generated revert PRs.
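A minimal sketch of revert detection, assuming PR titles and commit messages pulled from the GitHub API. The `is_revert` helper and its regex patterns are illustrative; tune them to your team's conventions.

```python
import re

# Illustrative patterns for revert/rollback detection in PR titles.
REVERT_PATTERNS = [
    re.compile(r"\brevert\b", re.IGNORECASE),
    re.compile(r"\brollback\b", re.IGNORECASE),
]

def is_revert(pr_title: str, commit_message: str = "") -> bool:
    """Return True if a PR looks like a revert or rollback.

    GitHub's auto-generated revert PRs are titled 'Revert "..."' and
    their commit messages start with 'Revert ', so both are checked.
    """
    if commit_message.startswith("Revert "):
        return True
    return any(p.search(pr_title) for p in REVERT_PATTERNS)
```

Because the patterns use word boundaries, a title like "Add revertible migration" is not flagged, while 'Revert "Add caching layer"' is.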

Signal 2: Hotfix branches

If your team uses hotfix/* branches for emergency fixes, these indicate something broke in production. Track merges from hotfix branches, count PRs with "hotfix" labels, and identify PRs merged outside normal review process.

Signal 3: Rapid follow-up fixes

When a PR is merged and then another PR touching the same files is merged within 24-48 hours, it may indicate the original change needed fixing. This signal is noisier than reverts or hotfixes, so weight it at roughly 50% compared to a confirmed revert.
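One way to sketch rapid follow-up detection, assuming PR records as plain dicts with merge timestamps and touched file sets (in practice these would come from the GitHub pulls and files endpoints):

```python
from datetime import datetime, timedelta

# Window within which a second PR touching the same files counts as a
# follow-up fix; 48 hours here, matching the guideline above.
FOLLOW_UP_WINDOW = timedelta(hours=48)

def rapid_follow_ups(merged_prs):
    """Return the numbers of PRs that appear to have needed a quick fix.

    merged_prs: list of dicts with 'number', 'merged_at' (datetime),
    and 'files' (set of touched paths), sorted by merge time.
    """
    flagged = set()
    for i, pr in enumerate(merged_prs):
        for later in merged_prs[i + 1:]:
            if later["merged_at"] - pr["merged_at"] > FOLLOW_UP_WINDOW:
                break  # list is sorted; no later PR can be in the window
            if pr["files"] & later["files"]:
                flagged.add(pr["number"])
                break
    return flagged
```

File overlap is a crude proxy; two unrelated PRs can touch the same hotspot file, which is exactly why this signal should be weighted below confirmed reverts.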

CFR Calculation:

  Basic:    CFR = (Revert PRs) / (Total PRs merged to main)
  Weighted: CFR = (Reverts*1.0 + Hotfixes*0.9 + Rapid-fixes*0.5) / Total deploys

  Example (monthly):
    100 PRs merged
    3 reverts + 2 hotfixes + 4 rapid follow-ups
    Basic CFR  = 3/100 = 3%
    Weighted CFR = (3 + 1.8 + 2) / 100 = 6.8%
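The weighted formula above can be expressed as a small function. The default weights (1.0 / 0.9 / 0.5) mirror the example; adjust them to your own confidence in each signal.

```python
def weighted_cfr(total_prs, reverts, hotfixes, rapid_fixes,
                 w_revert=1.0, w_hotfix=0.9, w_rapid=0.5):
    """Return change failure rate as a percentage of merged PRs."""
    weighted_failures = (reverts * w_revert
                         + hotfixes * w_hotfix
                         + rapid_fixes * w_rapid)
    return 100 * weighted_failures / total_prs

# Monthly example: 100 PRs, 3 reverts, 2 hotfixes, 4 rapid follow-ups.
# weighted_cfr(100, 3, 2, 4) -> 6.8 (percent)
```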

📊 How to Track CFR in CodePulse

CodePulse automatically detects all three failure signals from your GitHub data:

  • Navigate to Dashboard to view CFR trends alongside deployment frequency
  • Revert detection from PR patterns and commit messages
  • Hotfix identification from branch naming conventions
  • Set up alert rules to get notified when CFR exceeds your threshold

How do you track MTTR without full observability tooling?

Mean time to restore captures how quickly you recover from failures. In a full observability setup, you measure from incident detection to resolution. With GitHub data alone, you measure engineering response time: how quickly your team produces and ships a fix.

Three calculation methods

Method 1, revert time: Start at the original PR merge time. End at the revert PR merge time. This captures how quickly the team identified the problem and rolled back.

Method 2, hotfix cycle time: Start at hotfix branch creation. End at hotfix PR merge. This captures how quickly the team wrote and shipped a targeted fix.

Method 3, issue-to-fix: Start at the bug or incident issue creation. End at the fixing PR merge. This requires linked issues but provides the most complete picture.
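Method 1 reduces to a timestamp subtraction. A sketch, with illustrative timestamps standing in for the `merged_at` fields of the original PR and its revert:

```python
from datetime import datetime

def revert_recovery_hours(original_merged_at, revert_merged_at):
    """Hours between merging the original PR and merging its revert."""
    return (revert_merged_at - original_merged_at).total_seconds() / 3600

# e.g. original merged at 14:00, revert merged at 14:45 -> 0.75 hours
```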

MTTR Aggregation:

  Use median, not mean. Outliers (3-day recovery from one bad deploy)
  will skew the average and mask your typical performance.

  Example (quarterly):
    Incident 1: 45 minutes (revert)
    Incident 2: 2 hours (hotfix)
    Incident 3: 30 minutes (revert)
    Incident 4: 18 hours (complex fix)
    Incident 5: 1 hour (hotfix)

    Mean MTTR:   4.45 hours (skewed by incident 4)
    Median MTTR: 1 hour (better representation)
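The quarterly example above, reproduced with Python's standard library, shows how one long recovery skews the mean while the median stays representative:

```python
from statistics import mean, median

# Recovery times from the quarterly example, in hours:
# 45 min, 2 h, 30 min, 18 h, 1 h
recovery_hours = [0.75, 2, 0.5, 18, 1]

mean_mttr = mean(recovery_hours)      # 4.45, skewed by the 18-hour incident
median_mttr = median(recovery_hours)  # 1.0, the typical recovery time
```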

What GitHub-based MTTR misses

This approach has blind spots. You cannot measure detection time, because GitHub tells you when the fix was merged, not when the failure was discovered. You miss deployment lag between merge and actual release. And you miss incidents resolved without code changes, like configuration rollbacks through deployment tools. Despite these gaps, engineering response time is still valuable: it tracks the part you can directly improve through better processes and tooling.

Identify bottlenecks slowing your team with CodePulse

How do you improve CFR without slowing down deployment frequency?

High CFR tempts teams to add approval gates and manual QA steps. This is the wrong response. The DORA research consistently shows that elite teams achieve low CFR while maintaining high deployment frequency. They do it through prevention, detection, and recovery, not gates.

🔥 Our Take

DORA metrics were designed to study thousands of organizations, not manage your specific team. Using CFR as a KPI that engineers are evaluated against will lead to gaming: teams will stop reverting (hiding failures) rather than actually preventing them.

Track CFR as a signal, not a target. If your CFR goes up, investigate why. If it goes down, understand what changed. The goal is learning, not leaderboard positions. When a measure becomes a target, it ceases to be a good measure.

Prevention: catch issues before production

  • Integration tests over unit tests: Focus testing effort on tests that exercise real integration points. A 95% unit test coverage number means nothing if your API contract tests are missing.
  • Review for correctness, not style: Code review should focus on logic errors and edge cases. Style enforcement belongs in linters. See our code review culture guide for review quality strategies.
  • Smaller changes: According to a LinearB analysis of 2 million PRs, PRs over 400 lines are 3x more likely to be rejected. Small PRs are easier to review thoroughly and easier to roll back when something goes wrong.

Detection: find issues quickly

  • Staged rollouts: Canary deployments that limit blast radius to a small percentage of traffic
  • Feature flags: Ability to disable problematic features without rolling back the entire deployment
  • Monitoring alerts: Automated anomaly detection on error rates, latency, and key business metrics

Recovery: fix issues fast

  • One-click rollbacks: Automated ability to revert to the last known good state in under 5 minutes
  • Hotfix fast-path: Expedited review process for critical production fixes
  • Runbooks: Documented response procedures so recovery does not depend on specific individuals being available

"You do not get sub-one-hour MTTR by accident. You get it because your engineers already built automated rollback procedures and designed systems for instant diagnostics."

What hurts velocity without helping CFR | What actually works
Adding more approval gates              | Invest in automated testing
Requiring more manual QA                | Improve integration test coverage
Batching changes into big releases      | Ship smaller changes more frequently
Blaming developers for failures         | Build fast recovery capabilities

What does the 2024 DORA report change about stability metrics?

The 2024 DORA report made several significant changes to how stability metrics work.

MTTR renamed: "Mean time to restore" is now "failed deployment recovery time." More importantly, it moved from the stability cluster to throughput, reflecting the finding that fast recovery enables faster deployment rather than just preventing damage.

Performance tiers replaced: The familiar low/medium/high/elite classification has been replaced with seven team archetypes. This acknowledges that engineering teams have diverse profiles and a single linear ranking oversimplifies reality.

Fifth metric added: Rework rate joins the framework, measuring the percentage of changes that are later modified or reverted within a short window. According to the CD Foundation, this metric captures quality issues that CFR alone misses, things that do not cause incidents but still require rework.

These changes matter because teams that built dashboards around the four classic DORA metrics with elite/high/medium/low tiers need to update their mental models. The benchmarks in the table above still hold as general guidance, but the official framework is now more nuanced. See our DORA implementation guide for how to adapt your tracking.

For hands-on benchmarking, try our DORA metrics calculator to see where your team stands against industry data.

🔥 Our Take

The move from four tiers to seven archetypes is the most important change in the 2024 report. It validates what we have seen across hundreds of teams: there is no single "right" profile for high performance.

A platform team deploying infrastructure changes has fundamentally different stability requirements than a product team shipping user-facing features. Comparing their CFR numbers is meaningless. Focus on your own trend, not someone else's benchmark.

For related reading on the testing and quality side, check out our test failure rate guide and risky deployments detection guide.

Frequently Asked Questions

What is a good change failure rate?

According to the DORA research, elite teams maintain a change failure rate between 0% and 15%. High performers fall in the 16-30% range. The key insight is that elite teams achieve low failure rates while also deploying more frequently, not by slowing down. Track your CFR consistently over time rather than chasing a specific number.

See these insights for your team

CodePulse connects to your GitHub and shows you actionable engineering metrics in minutes. No complex setup required.

Free tier available. No credit card required.