In August 2023, McKinsey told the C-suite they could finally measure developer productivity. The engineering world pushed back hard. Three years later, the data sides with the critics.
Don't Trust the Metrics
Opinion piece backed by industry research and our 803K PR dataset
In August 2023, McKinsey published "Yes, you can measure software developer productivity." The report proposed measuring individual developer output using a mix of inner/outer loop metrics and capability scores. It landed in every VP of Engineering's inbox within 48 hours, forwarded by CFOs and board members who finally had ammunition for "why can't we just measure this?"
The engineering community's response was swift and devastating.
Within weeks, some of the most respected names in software engineering published detailed rebuttals. This wasn't Twitter noise from disgruntled ICs. These were people who have shaped how the industry thinks about software delivery for decades.
"Absurdly naive... the damage is potentially going to take years to undo."
Beck and Orosz picked apart the report's core flaw: McKinsey's framework measures only effort and output, not outcomes and impact. They cited a Facebook case study where survey scores were turned into performance metrics, creating perverse incentives. Managers pressured teams: "Give me a 5 and I'll make sure you get an 'exceeds expectations.'"
"Attempting to measure the individual contribution of a person is like trying to measure the individual contribution of a piston in an engine. The question itself makes no sense."
North pointed out that McKinsey had recommended pulling senior developers out of design sessions to maximize their coding time. The implication: the most important work a senior engineer does (enabling other developers, architectural thinking, mentoring) would be penalized because it doesn't show up in commit logs. As North put it: "Code, code monkey!"
"We have no way of reasonably measuring productivity because we cannot measure output. Any true measure of software development productivity must be based on delivered business value."
Fowler's critique cuts deepest: lines of code, PRs, and velocity points are all proxies for output, not measures of value. Well-factored code is shorter. The best architectural decision might be to delete a system entirely. And the most valuable thing a developer does in a given week might be a whiteboard conversation that prevents six months of wasted effort.
"When a measure becomes a target, it ceases to be a good measure." This isn't abstract philosophy. It happens every time an organization ties individual metrics to performance reviews. Here's what the gaming looks like in practice:
| Metric | How It Gets Gamed | Real-World Impact |
|---|---|---|
| Lines of Code | Verbose code, copy-paste, avoid refactoring | Bill Atkinson removed 2,000 lines from QuickDraw (a real improvement) and reported it as negative productivity |
| PR Count | Split work into trivially small PRs | Individual metric improves while team throughput drops from reviewer burden |
| Cycle Time | Prototype on laptops before "starting" the clock | Artificially low numbers hide actual development duration |
| Review Velocity | Rubber-stamp approvals to hit speed targets | Our data shows 90% of 1,000+ line PRs ship without meaningful review |
| Commit Count | Merge commits to others' PRs to inflate numbers | Signal-to-noise ratio collapses, making real trends invisible |
| "Productivity Score" | Redundant emails, unnecessary chat messages | Microsoft's own metric created busywork instead of measuring actual work |
The Bill Atkinson story says it all. While working on Apple's QuickDraw, Atkinson spent weeks optimizing the codebase, removing 2,000 lines while making the software faster and more reliable. When asked to fill out a form tracking lines of code written, he wrote "-2000." Management was not amused. The work that made the product better registered as negative productivity.
That was 1982. We're still making the same mistake.
We analyzed 803,000 merged pull requests from public GitHub repositories. The takeaway: individual output metrics are unreliable signals of anything useful.
A developer merging 15 PRs per week might be shipping trivial changes. A developer merging 2 might be shipping the architectural work that unblocks the entire team. PR count alone tells you nothing about impact.
Teams with the fastest cycle times sometimes have the worst review quality. Speed without scrutiny creates tech debt. Our data shows 71% of PRs are self-merged. Fast cycles may just mean no one is looking.
Cycle time varies 50x across top projects. The difference is team process, not individual talent. The same developer ships faster in a team with clear review norms than in one without them.
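To make the team-level framing concrete, here is a minimal sketch of the kind of aggregation involved, assuming a simple `PullRequest` record with an author, open/merge timestamps, the merging user, and a reviewer list. The field names are hypothetical, and this is not the pipeline behind the numbers above; it only illustrates that the unit of analysis is the pool of PRs, not any one person.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class PullRequest:
    author: str
    opened_at: datetime
    merged_at: datetime
    merged_by: str
    reviewers: list[str]  # people other than the author who reviewed the PR

def median_cycle_time_hours(prs: list[PullRequest]) -> float:
    """Median open-to-merge time across the whole team, in hours."""
    return median((pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs)

def self_merge_rate(prs: list[PullRequest]) -> float:
    """Share of PRs merged by their own author with no other reviewer involved."""
    unreviewed = sum(1 for pr in prs if pr.merged_by == pr.author and not pr.reviewers)
    return unreviewed / len(prs)
```

Both numbers describe a process. Comparing them across repos or across months tells you something about how work flows; comparing them across individuals tells you almost nothing.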
"In my almost ten years of software engineering, I've never once seen an organization deploy developer productivity metrics and do anything useful with them except bludgeon developers that management already don't like."
Bryan Finster put it simply in his response to McKinsey: "Measure the system, not the people." Individual output metrics fracture teams into what he called "a pandemonium of developers," all optimizing for personal metrics at the expense of collective outcomes.
DORA metrics (deployment frequency, lead time, change failure rate, time to restore) work because they measure the system's capability, not any one person's contribution. A team with high deployment frequency and low change failure rates is healthy. It doesn't matter whether any single developer's commit count looks impressive.
Cycle time, throughput, and review wait time at the team level show where process breaks down. If PRs sit for days before review, that's a team coordination problem, not an individual performance problem.
Where do PRs stall? Is one reviewer overloaded? Are certain repos consistently slower? These questions lead to real improvements without pointing fingers.
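As a rough illustration, two of the four DORA signals can be sketched from nothing more than a deployment log. The `Deployment` record and its `caused_failure` flag are assumptions for the example, not the DORA reference definitions; the point is that neither number identifies an individual.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    deployed_at: datetime
    caused_failure: bool  # required a rollback, hotfix, or incident response

def deployment_frequency(deploys: list[Deployment], window_days: int = 30) -> float:
    """Average deployments per day over the trailing window."""
    cutoff = max(d.deployed_at for d in deploys) - timedelta(days=window_days)
    return sum(d.deployed_at >= cutoff for d in deploys) / window_days

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Fraction of deployments that led to a failure needing remediation."""
    return sum(d.caused_failure for d in deploys) / len(deploys)
```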
Nicole Forsgren, who led the research behind the DORA framework, has noted that AI is breaking traditional metrics. Lines of code are meaningless. Commits are beside the point. Use DORA to spot trends and start conversations, not as report cards.
Review network maps and workload balance metrics show when someone is carrying too much or when knowledge silos are forming, without turning that data into a ranking.
Tom DeMarco coined "you can't control what you can't measure" in 1982. It became the mantra of a generation of managers. Then, 27 years later, he retracted it.
"Do I still believe that metrics are a must? My answers are no, no, and no. Metrics cost a ton of money. At worst, metrics can do actual harm."
The strongest argument against individual developer productivity metrics comes from Google's Project Aristotle research: the number one factor in high-performing teams is psychological safety. Stack-ranking developers on output metrics directly undermines that safety. MIT Sloan found that 92% of monitored employees feel distrust toward their employer, and 42% plan to leave within a year.
You cannot simultaneously demand trust and creativity from your engineers while surveilling their every commit.
The goal is not to measure less. It's to measure the right things. Here's what high-trust engineering organizations track:
Team-level cycle time trends, PR throughput, and review participation rates. Are things getting faster or slower over time? Where are the bottlenecks?
Review depth (comments per PR), self-merge rates, code churn. These tell you about process quality without assigning blame.
Review network health, knowledge distribution, bus factor. Is knowledge concentrated in one person? Are review loads balanced across the team? (A rough sketch of how such signals can be computed appears below.)
Weekend and after-hours work patterns, workload distribution, burnout risk. Our data shows 25% of commits happen on weekends. That's a process problem, not a badge of honor.
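For illustration only, here is a sketch of two of those signals, assuming you already have per-person review counts and per-author change counts for some window. These are deliberately crude heuristics, not CodePulse's formulas.

```python
from collections import Counter

def top_reviewer_share(reviews_by_person: dict[str, int]) -> float:
    """Fraction of all reviews handled by the single busiest reviewer."""
    total = sum(reviews_by_person.values())
    return max(reviews_by_person.values()) / total

def naive_bus_factor(changes_by_author: dict[str, int], threshold: float = 0.5) -> int:
    """Smallest number of people who together account for `threshold` of changes.
    A value of 1 means one person carries most of the knowledge."""
    total = sum(changes_by_author.values())
    covered = 0
    for i, (_, count) in enumerate(Counter(changes_by_author).most_common(), start=1):
        covered += count
        if covered / total >= threshold:
            return i
    return len(changes_by_author)
```

A lopsided reviewer share or a bus factor of 1 is a prompt to spread knowledge and rebalance reviews, not a line on anyone's performance review.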
McKinsey's report wasn't just wrong. It was dangerous.
It gave non-technical executives a framework for something they desperately wanted: a simple way to rank and compare developers. That desire is understandable. But the implementation (measuring individual output via commit metrics, PR counts, and code volume) optimizes for the wrong thing. It rewards busywork and penalizes the work that actually moves organizations forward: mentoring, architectural decisions, code deletion, and cross-team collaboration.
We built CodePulse around a different belief: engineering metrics should measure team delivery health and surface systemic bottlenecks. Individual data exists for self-reflection and mentorship, never for stack ranking. DORA metrics are signals to start conversations, not KPIs to hit. Velocity is not a goal. If you need a number to bludgeon developers with, we're not your tool. That's by design.
CodePulse measures systems, not people. Team-level insights in 5 minutes.