In August 2023, McKinsey told the C-suite they could finally measure developer productivity. The engineering world pushed back hard. Three years later, the data sides with the critics.
Don't Trust the Metrics
Opinion piece backed by industry research and our 803K PR dataset
In August 2023, McKinsey published "Yes, you can measure software developer productivity." The report proposed measuring individual developer output using a mix of inner/outer loop metrics and capability scores. It landed in every VP of Engineering's inbox within 48 hours, forwarded by CFOs and board members who finally had ammunition for "why can't we just measure this?"
The engineering community's response was swift and devastating.
Within weeks, some of the most respected names in software engineering published detailed rebuttals. This wasn't Twitter noise from disgruntled ICs. These were people who have shaped how the industry thinks about software delivery for decades.
"Absurdly naive... the damage is potentially going to take years to undo."
Beck and Orosz picked apart the report's core flaw: McKinsey's framework measures only effort and output, not outcomes and impact. They cited a Facebook case study where survey scores were turned into performance metrics, creating perverse incentives. Managers pressured teams: "Give me a 5 and I'll make sure you get an 'exceeds expectations.'"
"Attempting to measure the individual contribution of a person is like trying to measure the individual contribution of a piston in an engine. The question itself makes no sense."
North pointed out that McKinsey had recommended pulling senior developers out of design sessions to maximize their coding time. The implication: the most important work a senior engineer does (enabling other developers, architectural thinking, mentoring) would be penalized because it doesn't show up in commit logs. As North put it: "Code, code monkey!"
"We have no way of reasonably measuring productivity because we cannot measure output. Any true measure of software development productivity must be based on delivered business value."
Fowler's critique cuts deepest: lines of code, PRs, and velocity points are all proxies for output, not measures of value. Well-factored code is shorter. The best architectural decision might be to delete a system entirely. And the most valuable thing a developer does in a given week might be a whiteboard conversation that prevents six months of wasted effort.
"When a measure becomes a target, it ceases to be a good measure." This isn't abstract philosophy. It happens every time an organization ties individual metrics to performance reviews. Here's what the gaming looks like in practice:
| Metric | How It Gets Gamed | Real-World Impact |
|---|---|---|
| Lines of Code | Verbose code, copy-paste, avoid refactoring | Bill Atkinson removed 2,000 lines from QuickDraw (a real improvement) and reported it as negative productivity |
| PR Count | Split work into trivially small PRs | Individual metric improves while team throughput drops from reviewer burden |
| Cycle Time | Prototype on laptops before "starting" the clock | Artificially low numbers hide actual development duration |
| Review Velocity | Rubber-stamp approvals to hit speed targets | Our data shows 90% of 1,000+ line PRs ship without meaningful review |
| Commit Count | Merge commits to others' PRs to inflate numbers | Signal-to-noise ratio collapses, making real trends invisible |
| "Productivity Score" | Redundant emails, unnecessary chat messages | Microsoft's own metric created busywork instead of measuring actual work |
The Bill Atkinson story says it all. While working on Apple's QuickDraw, Atkinson spent weeks optimizing the codebase, removing 2,000 lines while making the software faster and more reliable. When asked to fill out a form tracking lines of code written, he wrote "-2000." Management was not amused. The work that made the product better registered as negative productivity.
That was 1982. We're still making the same mistake.
We analyzed 803,000 merged pull requests from public GitHub repositories. The takeaway: individual output metrics are unreliable signals of anything useful.
A developer merging 15 PRs per week might be shipping trivial changes. A developer merging 2 might be shipping the architectural work that unblocks the entire team. PR count alone tells you nothing about impact.
Teams with the fastest cycle times sometimes have the worst review quality. Speed without scrutiny creates tech debt. Our data shows 71% of PRs are self-merged. Fast cycles may just mean no one is looking.
Cycle time varies 50x across top projects. The difference is team process, not individual talent. The same developer ships faster in a team with clear review norms than in one without them.
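To make the team-level framing concrete, here is a minimal sketch of the kind of aggregation involved, assuming a simple `PullRequest` record with an author, open/merge timestamps, the merging user, and a reviewer list. The field names are hypothetical, and this is not the pipeline behind the numbers above; it only illustrates that the unit of analysis is the pool of PRs, not any one person.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

@dataclass
class PullRequest:
    author: str
    opened_at: datetime
    merged_at: datetime
    merged_by: str
    reviewers: list[str]  # people other than the author who reviewed the PR

def median_cycle_time_hours(prs: list[PullRequest]) -> float:
    """Median open-to-merge time across the whole team, in hours."""
    return median((pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs)

def self_merge_rate(prs: list[PullRequest]) -> float:
    """Share of PRs merged by their own author with no other reviewer involved."""
    unreviewed = sum(1 for pr in prs if pr.merged_by == pr.author and not pr.reviewers)
    return unreviewed / len(prs)
```

Both numbers describe a process. Comparing them across repos or across months tells you something about how work flows; comparing them across individuals tells you almost nothing.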
"In my almost ten years of software engineering, I've never once seen an organization deploy developer productivity metrics and do anything useful with them except bludgeon developers that management already don't like."
Bryan Finster put it simply in his response to McKinsey: "Measure the system, not the people." Individual output metrics fracture teams into what he called "a pandemonium of developers," all optimizing for personal metrics at the expense of collective outcomes.
DORA metrics (deployment frequency, lead time, change failure rate, time to restore) work because they measure the system's capability, not any one person's contribution. A team with high deployment frequency and low change failure rates is healthy. It doesn't matter whether any single developer's commit count looks impressive.
Cycle time, throughput, and review wait time at the team level show where process breaks down. If PRs sit for days before review, that's a team coordination problem, not an individual performance problem.
Where do PRs stall? Is one reviewer overloaded? Are certain repos consistently slower? These questions lead to real improvements without pointing fingers.
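As a rough illustration, two of the four DORA signals can be sketched from nothing more than a deployment log. The `Deployment` record and its `caused_failure` flag are assumptions for the example, not the DORA reference definitions; the point is that neither number identifies an individual.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    deployed_at: datetime
    caused_failure: bool  # required a rollback, hotfix, or incident response

def deployment_frequency(deploys: list[Deployment], window_days: int = 30) -> float:
    """Average deployments per day over the trailing window."""
    cutoff = max(d.deployed_at for d in deploys) - timedelta(days=window_days)
    return sum(d.deployed_at >= cutoff for d in deploys) / window_days

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Fraction of deployments that led to a failure needing remediation."""
    return sum(d.caused_failure for d in deploys) / len(deploys)
```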
Nicole Forsgren, who led the research behind the DORA framework, has noted that AI is breaking traditional metrics. Lines of code are meaningless. Commits are beside the point. Use DORA to spot trends and start conversations, not as report cards.
Review network maps and workload balance metrics show when someone is carrying too much or when knowledge silos are forming, without turning that data into a ranking.
Tom DeMarco coined "you can't control what you can't measure" in 1982. It became the mantra of a generation of managers. Then, 27 years later, he retracted it.
"Do I still believe that metrics are a must? My answers are no, no, and no. Metrics cost a ton of money. At worst, metrics can do actual harm."
The strongest argument against individual developer productivity metrics comes from Google's Project Aristotle research: the number one factor in high-performing teams is psychological safety. Stack-ranking developers on output metrics directly undermines that safety. MIT Sloan found that 92% of monitored employees feel distrust toward their employer, and 42% plan to leave within a year.
You cannot simultaneously demand trust and creativity from your engineers while surveilling their every commit.
The goal is not to measure less. It's to measure the right things. Here's what high-trust engineering organizations track:
Team-level cycle time trends, PR throughput, and review participation rates. Are things getting faster or slower over time? Where are the bottlenecks?
Review depth (comments per PR), self-merge rates, code churn. These tell you about process quality without assigning blame.
Review network health, knowledge distribution, bus factor. Is knowledge concentrated in one person? Are review loads balanced across the team? (A rough sketch of how such signals can be computed appears below.)
Weekend and after-hours work patterns, workload distribution, burnout risk. Our data shows 25% of commits happen on weekends. That's a process problem, not a badge of honor.
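For illustration only, here is a sketch of two of those signals, assuming you already have per-person review counts and per-author change counts for some window. These are deliberately crude heuristics, not CodePulse's formulas.

```python
from collections import Counter

def top_reviewer_share(reviews_by_person: dict[str, int]) -> float:
    """Fraction of all reviews handled by the single busiest reviewer."""
    total = sum(reviews_by_person.values())
    return max(reviews_by_person.values()) / total

def naive_bus_factor(changes_by_author: dict[str, int], threshold: float = 0.5) -> int:
    """Smallest number of people who together account for `threshold` of changes.
    A value of 1 means one person carries most of the knowledge."""
    total = sum(changes_by_author.values())
    covered = 0
    for i, (_, count) in enumerate(Counter(changes_by_author).most_common(), start=1):
        covered += count
        if covered / total >= threshold:
            return i
    return len(changes_by_author)
```

A lopsided reviewer share or a bus factor of 1 is a prompt to spread knowledge and rebalance reviews, not a line on anyone's performance review.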
McKinsey's report wasn't just wrong. It was dangerous.
It gave non-technical executives a framework for something they desperately wanted: a simple way to rank and compare developers. That desire is understandable. But the implementation (measuring individual output via commit metrics, PR counts, and code volume) optimizes for the wrong thing. It rewards busywork and penalizes the work that actually moves organizations forward: mentoring, architectural decisions, code deletion, and cross-team collaboration.
We built CodePulse around a different belief: engineering metrics should measure team delivery health and surface systemic bottlenecks. Individual data exists for self-reflection and mentorship, never for stack ranking. DORA metrics are signals to start conversations, not KPIs to hit. Velocity is not a goal. If you need a number to bludgeon developers with, we're not your tool. That's by design.
CodePulse measures systems, not people. Team-level insights in 5 minutes.