Most engineering teams track code coverage and call it a quality strategy. Then an incident hits production, and the module with 92% coverage is the one that caused it. The problem is not that you lack metrics. The problem is you are tracking metrics that describe the past instead of predicting the future. This guide covers the 8 software quality metrics that actually correlate with production incidents, and explains why the ones you are probably tracking do not.
Software quality metrics have a credibility problem. Teams collect dozens of numbers, present them in dashboards, and still get blindsided by outages. According to a Synopsys analysis, fixing a bug found in production costs up to 100x more than catching it during design. And yet most quality dashboards are oriented around catching bugs after they happen, not predicting where they will occur.
That gap between measurement and prediction is where engineering teams lose millions. CloudQA estimates businesses lose $3.1 trillion annually to poor software quality. The teams spending 30-50% of their sprint cycles firefighting defects are the ones measuring the wrong things.
Why Most Quality Metrics Are Useless
The standard quality dashboard at most companies includes code coverage percentage, bug count, and maybe cyclomatic complexity. These metrics are not wrong. They are just not predictive. They tell you what your code looks like today, not where it will break tomorrow.
Code coverage is the biggest offender. Microsoft Research analyzed 100 large open-source Java projects and found that coverage has an insignificant correlation with post-release bugs at the project level, and no correlation at the file level. A separate IEEE empirical study confirmed this: there is a low to moderate correlation between coverage and test effectiveness, and coverage should not be used as a quality target.
🔥 Our Take
LOC is NOT a quality metric. Stop putting it on quality dashboards.
Lines of code measures volume, not quality. A 500-line function is not twice as good as a 250-line function. More often, it is twice as likely to contain a bug. We have seen teams use LOC targets that actively incentivize bloated, copy-pasted code. If your quality dashboard includes lines of code as anything other than a size indicator, you are measuring the wrong thing. The same applies to commit counts and PR volume. Activity is not quality.
The root cause is that most quality metrics are lagging indicators. They report on what already happened. Bug counts tell you that bugs exist. Coverage tells you which lines execute during tests, not whether those tests catch real failures. Cyclomatic complexity is useful as a general heuristic, but research from an empirical study on code complexity and bugs found the correlation varies significantly across projects.
What you need are leading indicators: metrics that change before an incident happens, giving you time to act. The 8 metrics in the next section all have this property.
8 Quality Metrics That Predict Incidents
These are the software quality metrics that reliably predict production problems. Each one has research backing, and each one gives you a signal you can act on before something breaks.
| # | Metric | What It Predicts | Action Threshold |
|---|---|---|---|
| 1 | Code Churn Rate | Instability, rework, and upcoming defects | >8% churn on any module |
| 2 | Change Failure Rate | Deployment reliability | >15% of changes cause incidents |
| 3 | Review Coverage Rate | Unreviewed code reaching production | <90% of PRs reviewed |
| 4 | Merge Without Approval Rate | Bypassed safety gates | Any non-zero rate outside hotfixes |
| 5 | File Hotspot Frequency | Which files will break next | Files changed >10x in 30 days |
| 6 | Test Failure Rate | Eroding test reliability | >5% persistent failure rate |
| 7 | PR Size (Lines Changed) | Review effectiveness and risk | >400 lines per PR |
| 8 | Failed Deployment Recovery Time | Incident response capability | >1 hour for critical services |
Let's walk through each one and explain why it matters.
1. Code Churn Rate
Code churn measures the percentage of recently written code that gets rewritten or deleted within a short window (typically 2-3 weeks). CodeScene's research shows that code churn is a strong predictor of post-release defects. High churn means code is unstable, requirements are unclear, or the design is wrong. All three lead to production problems.
A healthy churn rate sits below 5%. When a module exceeds 8%, investigate immediately. For more on tracking this, see our Code Churn Guide.
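As a back-of-the-envelope sketch, churn rate can be computed from per-module line counts. The numbers below are hypothetical; in practice you would derive them from `git log --numstat` over your chosen window:

```python
def churn_rate(lines_written: int, lines_reworked: int) -> float:
    """Share of recently written lines that were rewritten or deleted
    within the observation window (e.g. the last 3 weeks)."""
    if lines_written == 0:
        return 0.0
    return lines_reworked / lines_written

# Hypothetical per-module counts; flag anything above the 8% threshold.
modules = {
    "billing/invoice.py": (1200, 130),  # ~10.8% churn -> investigate
    "auth/session.py":    (800, 24),    # 3% churn -> healthy
}
flagged = [name for name, (written, reworked) in modules.items()
           if churn_rate(written, reworked) > 0.08]
print(flagged)  # ['billing/invoice.py']
```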
2. Change Failure Rate
Change failure rate is the percentage of deployments that cause a production incident or require a rollback. The DORA framework places this as a core stability metric. The 2024 Accelerate State of DevOps report added Rework Rate alongside it, measuring the proportion of unplanned deployments made to fix user-visible issues.
Elite teams keep change failure rate below 5%. Most teams sit between 16% and 30%, according to the 2024 DORA findings. If yours is above 15%, your deployment pipeline has a quality problem that no amount of testing will solve alone.
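Computing the metric is straightforward once you tag each deployment with whether it caused an incident or required a rollback. A minimal sketch, with an illustrative field name:

```python
def change_failure_rate(deployments) -> float:
    """Fraction of deployments that caused an incident or required a rollback."""
    deployments = list(deployments)
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d["caused_incident"])
    return failed / len(deployments)

# 2 failures out of 20 deployments -> 10%, above the elite 5% bar.
history = [{"caused_incident": i in (3, 11)} for i in range(20)]
print(f"{change_failure_rate(history):.0%}")  # 10%
```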
3. Review Coverage Rate
This measures the percentage of pull requests that receive at least one review before merging. Research from Coding Horror cites studies showing code inspections detect about 60% of defects, far more than testing alone (which catches roughly 30%). An AT&T organization of 200+ people reported a 90% decrease in defects after introducing reviews.
"80% code coverage is theater if you are testing the wrong things. Review coverage catches what tests cannot: design flaws, security gaps, and logic errors that only a human reader would notice."
4. Merge Without Approval Rate
Every PR that merges without explicit approval is a skipped safety gate. Research on merge request deviations found that 37% of merge requests deviate from standard review processes. While some of these are legitimate (dependency updates, rebases), unreviewed changes to business logic or infrastructure are where incidents originate.
Our internal data at CodePulse shows that teams with merge-without-approval rates above 20% have significantly higher incident rates. The fix is not banning all self-merges. It is knowing which ones bypass review and whether the risk is acceptable.
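One way to make that distinction explicit is to exempt labelled routine changes and measure the bypass rate on everything else. The labels and fields below are hypothetical:

```python
def merge_without_approval_rate(prs,
                                exempt_labels=frozenset({"hotfix", "dependencies"})):
    """Share of merged PRs that bypassed approval, ignoring exempt labels."""
    counted = [pr for pr in prs if not (set(pr["labels"]) & exempt_labels)]
    if not counted:
        return 0.0
    bypassed = sum(1 for pr in counted if not pr["approved"])
    return bypassed / len(counted)

prs = [
    {"approved": False, "labels": ["hotfix"]},   # exempt: legitimate self-merge
    {"approved": True,  "labels": []},
    {"approved": False, "labels": ["feature"]},  # counts as a bypassed gate
    {"approved": True,  "labels": ["feature"]},
]
print(round(merge_without_approval_rate(prs), 2))  # 0.33: 1 bypass of 3 counted PRs
```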
5. File Hotspot Frequency
File hotspots are modules that get changed frequently across many PRs. Research consistently shows that the degree to which a source file changes over time correlates with faults and vulnerabilities. According to Sonar's analysis, functions exceeding a cyclomatic complexity of 15 contain significantly more defects. When you combine high change frequency with high complexity, you have found your next incident.
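A sketch of combining the two signals: change counts from the last 30 days of history, and cyclomatic complexity from a tool such as radon or lizard. All numbers below are hypothetical:

```python
from collections import Counter

def hotspots(touched_files, complexity, min_changes=10, max_complexity=15):
    """Files changed more than `min_changes` times whose cyclomatic
    complexity also exceeds `max_complexity`, riskiest first."""
    freq = Counter(touched_files)
    risky = [f for f, n in freq.items()
             if n > min_changes and complexity.get(f, 0) > max_complexity]
    return sorted(risky, key=lambda f: freq[f] * complexity[f], reverse=True)

touches = ["api/orders.py"] * 14 + ["core/pricing.py"] * 12 + ["util/fmt.py"] * 20
complexity = {"api/orders.py": 22, "core/pricing.py": 18, "util/fmt.py": 4}
# util/fmt.py changes often but is simple, so it is not flagged.
print(hotspots(touches, complexity))  # ['api/orders.py', 'core/pricing.py']
```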
6. Test Failure Rate
Not just "are tests passing," but what percentage of test runs fail across your CI pipeline. A test suite with a 10% flaky test rate trains developers to ignore failures. Once that habit takes hold, real failures get merged too. For a deeper treatment, see our Test Failure Rate Guide.
7. PR Size (Lines Changed)
Our GitHub Code Quality Metrics research shows that PRs over 400 lines receive far less thorough reviews. Large PRs get rubber-stamped because reviewers lack the cognitive bandwidth to evaluate 1,000+ lines of changes. The result: defects pass through the one process designed to catch them.
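A quick check for whether your PR size distribution is healthy, measured in lines changed per merged PR (the sample sizes are made up):

```python
def pr_size_health(sizes, limit=400, target=0.80):
    """Return (share of PRs at or under `limit` lines, whether target met)."""
    share = sum(1 for s in sizes if s <= limit) / len(sizes)
    return share, share >= target

sizes = [120, 80, 950, 300, 45, 410, 220, 60, 1300, 180]
share, ok = pr_size_health(sizes)
print(f"{share:.0%} under 400 lines; target met: {ok}")  # 70%; False
```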
8. Failed Deployment Recovery Time
The DORA team redefined Mean Time to Recovery as "Failed Deployment Recovery Time" in 2024 to focus specifically on deployment failures rather than all incidents. This metric reflects your team's ability to detect, diagnose, and fix deployment problems. Elite teams recover in under an hour. If yours takes days, you have a systemic issue in your deployment and rollback process.
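Measuring it is simply the median gap between a failed deployment and service restoration. A sketch with hypothetical timestamps:

```python
from datetime import datetime
from statistics import median

def median_recovery_minutes(incidents):
    """incidents: (failed_at, restored_at) datetime pairs."""
    return median((end - start).total_seconds() / 60 for start, end in incidents)

incidents = [
    (datetime(2025, 1, 3, 14, 0), datetime(2025, 1, 3, 14, 25)),  # 25 min
    (datetime(2025, 2, 9, 9, 30), datetime(2025, 2, 9, 11, 0)),   # 90 min
    (datetime(2025, 3, 1, 22, 5), datetime(2025, 3, 1, 22, 50)),  # 45 min
]
print(median_recovery_minutes(incidents))  # 45.0 -> inside the 1-hour target
```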
The Quality vs Speed Tradeoff Is a Myth
Engineering leaders frequently frame quality and speed as opposing forces. Ship fast or ship safely, pick one. This framing is wrong, and the data proves it.
The 2024 DORA report found that high-performing teams have both higher throughput AND lower change failure rates. They deploy more frequently and break things less often. This is not a contradiction. Teams that deploy small changes frequently get faster feedback loops, which means they catch problems before they compound.
"High velocity without quality is not speed. It is a liability that compounds with every deployment. The fastest teams we have seen are the ones that rarely roll back."
The real tradeoff is not quality vs speed. It is between short-term velocity and sustainable velocity. A team that skips reviews to ship faster this sprint will spend the next sprint firefighting. The CTO Fraction analysis estimates that for every $1 spent resolving a production bug, companies incur $30 in secondary costs through customer compensation, trust erosion, and engineering distraction.
The 2025 DORA report reinforced this further. Faros AI's analysis of the findings noted that AI adoption improves individual throughput (21% more tasks, 98% more PRs) but increases delivery instability at the organizational level. Speed without quality gates is speed toward incidents.
The Real Quality Equation:
Sustainable Velocity = Deployment Frequency × (1 - Change Failure Rate)
Example:
Team A: 10 deploys/week × (1 - 0.05) = 9.5 effective deploys/week
Team B: 20 deploys/week × (1 - 0.30) = 14.0 effective deploys/week
Team B deploys 2x more but only delivers 47% more value.
And they carry 12x the incident burden: 6 failed deploys per week versus 0.5.
Building a Quality Dashboard That Drives Action
A quality dashboard should answer one question: "What is most likely to break next?" If your dashboard only answers "what broke last time," it is a retrospective tool, not a quality tool.
Here is how to structure a dashboard that uses the 8 predictive metrics above.
Layer 1: Early Warning Signals
These metrics should be front and center, with alert thresholds configured:
- Code churn rate per module (alert at >8%)
- File hotspot frequency with complexity overlay (alert at >10 changes/month + complexity >15)
- Merge without approval rate (alert at any increase over baseline)
Layer 2: Process Health Indicators
These track whether your quality processes are functioning:
- Review coverage rate (target: >95%)
- PR size distribution (target: 80% of PRs under 400 lines)
- Test failure rate trend (alert on sustained increase)
Layer 3: Outcome Metrics
These confirm whether your early warnings are working:
- Change failure rate (target: <10%)
- Failed deployment recovery time (target: <1 hour)
📊 How to Track This in CodePulse
CodePulse tracks several of these predictive metrics out of the box:
- Risky Changes flags PRs with high churn, large size, or missing reviews before they merge
- File Hotspots identifies modules with high change frequency and complexity
- Review Network visualizes review coverage and identifies gaps in your review process
- Benchmarks compares your quality metrics against industry standards
The key distinction: layer 1 metrics should trigger alerts. Layer 2 metrics belong in weekly reviews. Layer 3 metrics belong in monthly reporting. If you are reviewing outcome metrics daily, you are reacting instead of preventing.
For a broader look at code quality tooling, see our Code Quality Tools Comparison.
Quality Is a System Property, Not a Developer Responsibility
When an incident happens, the instinct is to ask "who wrote the bug?" This is the wrong question. It assumes quality is an individual skill issue when, in reality, quality is a system property.
Consider: a developer writes buggy code. But the code passed CI. No reviewer flagged it. It deployed without a feature flag. The monitoring did not catch the regression for 4 hours. Five systems failed before the customer saw the bug. Blaming the developer ignores the other four.
"Every production incident is a system failure, not a people failure. If one person can ship a bug to production, your system has a single point of failure, and it is not the developer."
This is why measuring software quality at the individual level is counterproductive. It creates fear, which leads to slower shipping, which leads to larger batches, which leads to more risk. The research backs this up: Full Scale found that teams where 80% of high-complexity work is handled by 2-3 people show 67% higher burnout rates and 45% more production incidents within 6 months. Knowledge silos are a quality problem.
What System-Level Quality Looks Like
Instead of tracking who wrote the bug, track these system properties:
- Review distribution: Are reviews concentrated in 2-3 people, or spread across the team? Concentrated reviews mean concentrated knowledge, which means concentrated risk.
- Knowledge coverage: How many people can modify each critical module? If the answer is 1, that module is one resignation away from becoming a liability.
- Feedback loop speed: How quickly does a developer learn their change caused a problem? Same day is good. Next sprint is a failure.
- Recovery capability: Can you roll back a bad deploy in minutes, or does it require a war room?
For more on identifying knowledge risks, read our Tech Lead Metrics Guide, which covers how Staff+ engineers can use these signals to protect their codebase.
The Practical Shift
Moving from individual blame to system measurement requires three changes:
- Blameless post-mortems. Document what happened and what system changes prevent recurrence. Never document who caused it.
- Team-level metrics. Report quality metrics at the team or repository level, not the individual level. If you need individual data, use it for coaching conversations, not performance reviews.
- Investment in prevention. For every hour spent on incident response, spend two hours improving the system that allowed the incident. This ratio is the difference between teams that improve and teams that just survive.
FAQ
What are the most important software quality metrics to track?
Focus on predictive metrics: code churn rate, change failure rate, review coverage, merge without approval rate, file hotspot frequency, test failure rate, PR size, and failed deployment recovery time. These 8 metrics correlate with future incidents rather than just describing past performance.
Is code coverage a reliable indicator of software quality?
No. Microsoft Research found that code coverage has an insignificant correlation with post-release bugs. Coverage tells you which lines execute during tests, not whether those tests catch meaningful failures. An 80% coverage target can be met with tests that assert nothing useful. Use coverage as a floor (ensuring critical paths are tested), not as a quality target.
How do you measure software quality without slowing down development?
The premise is flawed. High-performing teams in the DORA research ship faster AND have fewer failures. Quality and speed are not tradeoffs when you invest in small PRs, automated testing, fast reviews, and quick rollback capability. The teams that feel slow are usually the ones spending time on incident response rather than prevention.
What change failure rate should engineering teams target?
Elite teams maintain a change failure rate below 5%. Most teams sit between 16% and 30%. If yours is above 15%, prioritize reducing PR size, improving review coverage, and implementing deployment safeguards like feature flags and canary releases.
How do software quality metrics differ from DORA metrics?
DORA metrics include two quality-adjacent metrics (change failure rate and failed deployment recovery time). Software quality metrics go deeper, covering code-level signals like churn, hotspots, and review coverage that predict problems before they become DORA-level failures. Think of DORA as the outcome layer and quality metrics as the early warning layer.
For related reading, explore our Code Quality Tools Comparison to evaluate tooling options. Our GitHub Code Quality Metrics guide covers GitHub-specific quality signals. And the Tech Lead Metrics Guide explains how Staff+ engineers can operationalize these metrics in their teams.
See these insights for your team
CodePulse connects to your GitHub and shows you actionable engineering metrics in minutes. No complex setup required.
Free tier available. No credit card required.
Related Guides
Forget SonarQube. Here's Code Quality From Your Git History
Learn how to extract actionable code quality metrics from your GitHub repository, including churn rate, hotspots, review coverage, and more.
Code Quality Tools in 2026: Most Are Useless (3 Aren't)
Compare engineering analytics tools specifically for code quality features: hotspot detection, knowledge silos, test analytics, and review sentiment.
Your CI Is Crying for Help. Here's What It's Telling You
Understand what test failure rate measures, identify patterns causing CI failures, and implement strategies to improve your pipeline reliability.
High Code Churn Isn't Bad. Unless You See This Pattern
Learn what code churn rate reveals about your codebase health, how to distinguish healthy refactoring from problematic rework, and when to take action.
The Tech Lead Dashboard Your Manager Shouldn't See
The metrics that matter for tech leads—code quality, technical debt, architecture decisions, and delivery. Different from EM metrics, focused on technical excellence.
