
The Quality Cliff

A 35-engineer team's incident rate dropped 71% after discovering which 0.8% of their codebase caused 67% of failures

Results at a glance:

  • Production incidents: 14.3/month → 4.1/month (-71%)
  • Change failure rate: 18% → 4.2% (-77%)
  • Hotspot coverage: 1.2 → 3.4 reviewers per PR (+183%)
  • Review depth on hotspots: 0.8 → 3.1 comments per PR (+287%)

Executive Summary

Datastream Analytics maintained rigorous PR practices—100% review coverage, mandatory CI, and thorough QA. Yet production incidents increased 180% year-over-year. Analysis of 1,247 PRs and 43 production incidents revealed that process compliance masked a deeper problem: 12 files (0.8% of the codebase) were responsible for 67% of all incidents, and these files received the least thorough reviews.

The Paradox

By every traditional measure, Datastream's engineering practices looked strong:

  • 100% PR review coverage
  • Mandatory CI passing before merge
  • Dedicated QA environment
  • Weekly security scans

So why were production incidents at an all-time high?

Key Finding: Hotspot Concentration

Correlation analysis between file change frequency and incident occurrence revealed a striking pattern:

12 files = 67% of production incidents

These "hotspots" represented only 0.8% of the codebase but generated two-thirds of all production issues. They were changed frequently, poorly understood, and—critically—reviewed superficially.
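To make the mechanics concrete, here is a minimal sketch of how this kind of hotspot analysis can be reproduced from a repository's own history. It is not Datastream's tooling: the revision range and the set of incident-linked commit SHAs (`incident_shas`) are assumptions you would supply from your own postmortem records.

```python
# Hypothetical sketch: rank files by how often they appear in
# incident-linked commits, alongside their overall change frequency.
import subprocess
from collections import Counter

def changed_files(rev_range="HEAD~500..HEAD"):
    """Map each commit SHA in the range to the files it touched."""
    out = subprocess.run(
        ["git", "log", "--name-only", "--pretty=format:@%H", rev_range],
        capture_output=True, text=True, check=True,
    ).stdout
    files_by_commit, sha = {}, None
    for line in out.splitlines():
        if line.startswith("@"):          # commit header line
            sha = line[1:]
            files_by_commit[sha] = []
        elif line.strip():                # file path line
            files_by_commit[sha].append(line.strip())
    return files_by_commit

def hotspots(files_by_commit, incident_shas, top=12):
    """Print the files most often implicated in incident commits."""
    change_count, incident_count = Counter(), Counter()
    for sha, files in files_by_commit.items():
        for f in files:
            change_count[f] += 1
            if sha in incident_shas:
                incident_count[f] += 1
    total = sum(1 for s in files_by_commit if s in incident_shas) or 1
    for f in sorted(incident_count, key=incident_count.get, reverse=True)[:top]:
        print(f"{f}: {change_count[f]} changes, "
              f"in {incident_count[f] / total:.0%} of incident commits")
```

Run against a real repository, the output makes the concentration visible: a handful of paths with both high churn and a disproportionate share of incident commits.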

Change Failure Rate by Module

| Module | Change Failure Rate | % of PRs | Risk Level |
| --- | --- | --- | --- |
| Payment Processing | 23% | 8% | Critical |
| API Gateway | 18% | 12% | High |
| Data Pipeline | 15% | 6% | High |
| Core Domain | 4% | 31% | Low |
| UI Components | 2% | 43% | Low |

The payment processing module had a 23% failure rate—nearly 1 in 4 changes caused issues in production. Yet these PRs received the same level of review as low-risk UI changes.
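Change failure rate here is the share of merged changes later linked to a production incident or rollback. A minimal sketch of the per-module computation, with illustrative records standing in for real PR data:

```python
# Hypothetical sketch: change failure rate per module. Each record is
# assumed to carry the module a PR touched and whether the change was
# later tied to an incident or rollback; the sample data is illustrative.
from collections import defaultdict

prs = [
    {"module": "Payment Processing", "caused_failure": True},
    {"module": "Payment Processing", "caused_failure": False},
    {"module": "UI Components", "caused_failure": False},
]

def change_failure_rate(prs):
    totals, failures = defaultdict(int), defaultdict(int)
    for pr in prs:
        totals[pr["module"]] += 1
        failures[pr["module"]] += pr["caused_failure"]
    rate = lambda m: failures[m] / totals[m]
    return {m: rate(m) for m in sorted(totals, key=rate, reverse=True)}

for module, rate in change_failure_rate(prs).items():
    print(f"{module}: {rate:.0%} of changes caused a failure")
```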

Review Depth Analysis

| Metric | Hotspot Files | Other Files | Gap |
| --- | --- | --- | --- |
| Avg Comments per PR | 0.8 | 2.4 | -67% |
| Avg Review Time | 8 min | 22 min | -64% |
| Reviewers per PR | 1.2 | 1.8 | -33% |

The most critical code received the most superficial reviews. Why? The same three developers owned these modules and reviewed each other's work, creating blind spots where shared assumptions went unchallenged.
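The same comparison can be run against your own repositories with the GitHub REST API. Below is a hedged sketch, not Datastream's pipeline: `OWNER`, `REPO`, and the `HOTSPOTS` path set are placeholders, and it samples only recent closed PRs.

```python
# Hypothetical sketch: compare review depth on hotspot vs. other PRs
# using the GitHub REST API (pulls, files, comments, reviews endpoints).
import os
import requests

OWNER, REPO = "your-org", "your-repo"    # placeholders
HOTSPOTS = {"billing/ledger.py"}          # placeholder hotspot paths
API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def get(path, **params):
    r = requests.get(f"{API}{path}", headers=HEADERS, params=params)
    r.raise_for_status()
    return r.json()

def review_depth(sample=50):
    buckets = {"hotspot": [], "other": []}
    for pr in get(f"/repos/{OWNER}/{REPO}/pulls", state="closed", per_page=sample):
        n = pr["number"]
        files = {f["filename"] for f in get(f"/repos/{OWNER}/{REPO}/pulls/{n}/files")}
        comments = len(get(f"/repos/{OWNER}/{REPO}/pulls/{n}/comments"))
        reviewers = {rv["user"]["login"]
                     for rv in get(f"/repos/{OWNER}/{REPO}/pulls/{n}/reviews")}
        key = "hotspot" if files & HOTSPOTS else "other"
        buckets[key].append((comments, len(reviewers)))
    for key, rows in buckets.items():
        if rows:
            avg_c = sum(c for c, _ in rows) / len(rows)
            avg_r = sum(r for _, r in rows) / len(rows)
            print(f"{key}: {avg_c:.1f} comments/PR, {avg_r:.1f} reviewers/PR")
```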

Results After Intervention

| Metric | Before | After (3 months) | Change |
| --- | --- | --- | --- |
| Production Incidents/Month | 14.3 | 4.1 | -71% |
| Change Failure Rate (overall) | 18% | 4.2% | -77% |
| Hotspot Coverage (reviewers per PR) | 1.2 | 3.4 | +183% |
| Review Depth (comments per PR) | 0.8 | 3.1 | +287% |

Targeted interventions—hotspot alerts, mandatory cross-team review, and focused refactoring—reduced incidents by 71% in three months.
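Of the three interventions, the hotspot alert is the easiest to approximate. One possible shape, assuming a `hotspots.txt` list of flagged paths, a `PR_NUMBER` environment variable from CI, and placeholder repo coordinates, is a gate that blocks merges to hotspot files without two approvals:

```python
# Hypothetical sketch of a "hotspot alert" CI gate: fail the build when
# a PR touches a known hotspot file with fewer than two approvals.
import os
import sys
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
OWNER, REPO = "your-org", "your-repo"                 # placeholders
HOTSPOTS = set(open("hotspots.txt").read().split())   # one path per line

def main():
    pr = os.environ["PR_NUMBER"]
    files = requests.get(f"{API}/repos/{OWNER}/{REPO}/pulls/{pr}/files",
                         headers=HEADERS).json()
    touched = {f["filename"] for f in files} & HOTSPOTS
    if not touched:
        return  # no hotspot files changed; nothing to enforce
    reviews = requests.get(f"{API}/repos/{OWNER}/{REPO}/pulls/{pr}/reviews",
                           headers=HEADERS).json()
    approvers = {r["user"]["login"] for r in reviews if r["state"] == "APPROVED"}
    if len(approvers) < 2:
        print(f"Hotspot files changed ({', '.join(sorted(touched))}) "
              f"but only {len(approvers)} approval(s); 2 required.")
        sys.exit(1)

if __name__ == "__main__":
    main()
```

GitHub's built-in CODEOWNERS rules can enforce part of this natively; the script form adds the approval-count threshold on top.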

The Lesson

"Quality isn't about reviewing everything equally—it's about reviewing the risky code deeply."

About Datastream Analytics

A real-time data processing platform serving financial services companies. Founded in 2018, they process over 2 billion events daily for clients across banking, insurance, and trading.

Names and some details have been changed to protect confidentiality. Incident counts and improvement metrics are representative of actual results.
