Story Points Are a Scam. Here's What Actually Works

Story points are often waste. Learn how to use historical throughput and cycle time to forecast delivery dates with higher accuracy and less meeting time.

15 min read · Updated January 3, 2026 · By CodePulse Team

We spend hours in "Planning Poker" debating whether a task is a 3 or a 5. Then we miss the deadline anyway. Story points were supposed to make estimation easier. Instead, they became a ritual that feels productive but predicts nothing. There is a better way: Probabilistic Forecasting based on your actual historical data, not your team's collective optimism.

"Your last 10 sprints are the best predictor of your next sprint. Not your estimates. Not your planning poker hands. Your actual throughput data."

This guide will show you why estimation fails, how story points get gamed, and how to replace guesswork with probability-based forecasting that actually predicts when you'll ship. You don't need a PhD in statistics. You need your Git history.

🔥 Our Take

The moment velocity becomes a goal, it stops being useful.

You'll get more PRs, but they'll be smaller, simpler, and less valuable. You'll hit your numbers while missing your targets. Velocity is an output, not an input. You improve velocity by removing friction, not by demanding more. The #NoEstimates movement isn't about being lazy—it's about recognizing that historical data beats collective guessing.

Why Estimation Fails: The Psychology Problem

Software estimation isn't just hard—it's psychologically rigged against you. Before we abandon estimation, it's worth understanding why it fails so reliably.

The Planning Fallacy

In 1979, psychologists Daniel Kahneman and Amos Tversky identified the "planning fallacy": people systematically underestimate how long tasks will take, even when they've done similar tasks before and overrun their estimates.

The planning fallacy persists because:

  • We focus on best-case scenarios: We imagine the task going smoothly, not the interruptions, blockers, and edge cases.
  • We ignore base rates: "This time will be different" when it never is.
  • We underweight unknown unknowns: We estimate what we can see, not what we'll discover.
  • We're motivated to be optimistic: Nobody wants to say "6 months" when the PM wants to hear "6 weeks."

The Planning Fallacy in Action:

Developer's mental model when estimating:
  "Write the API endpoint: 2 hours
   Add validation: 1 hour
   Write tests: 2 hours
   Done: 5 hours = 1 story point"

What actually happens:
  - Write the API endpoint: 2 hours
  - Realize the database schema needs changes: 1 hour
  - Update the schema, fix broken tests: 3 hours
  - Write the endpoint again: 1.5 hours
  - Add validation: 1 hour
  - Edge case discovered in code review: 2 hours
  - Write tests: 2 hours
  - Fix flaky test in CI: 1 hour
  - Total: 13.5 hours = ???? story points

The estimate was off by 170%. This is normal.

The Social Pressure Problem

Estimation doesn't happen in a vacuum. It happens in rooms with social dynamics:

  • Anchoring: The first person to speak sets the range. Everyone else adjusts relative to that anchor.
  • Authority deference: Junior engineers defer to senior opinions, even when they have valid concerns.
  • Deadline pressure: "We promised this for Q2" distorts estimates downward. Nobody wants to be the bearer of bad news.
  • Groupthink: Dissenting opinions get smoothed away. Consensus doesn't mean accuracy.

"Planning Poker doesn't eliminate bias. It democratizes it. Instead of one person being wrong, you get consensus on being wrong."

The Story Points Problem: Why They Get Gamed

Story points were invented to decouple effort from time. The idea was noble: let teams discuss relative complexity without committing to hours. But story points have a fatal flaw: they become a target.

When Story Points Become a Metric

The moment someone asks "How many story points did we complete?" the system starts to break down:

What Happens                    Why It's Bad                             The Incentive
Points inflate over time        "3" today was "2" last quarter           Higher velocity looks good in reports
Work gets split artificially    One feature becomes 10 small tickets     More "completed" points per sprint
Large work gets avoided         Important refactors never get done       Big tickets hurt velocity numbers
Complexity gets hidden          Real risk gets underestimated            High estimates get questioned

The Inconsistency Problem

Story points are supposed to be relative ("this is twice as complex as that"). In practice:

  • A "3" today isn't a "3" next month: As teams learn, complexity perception changes.
  • A "3" for Team A isn't a "3" for Team B: Points are not comparable across teams.
  • A "3" in Frontend isn't a "3" in Backend: Even within teams, context matters enormously.

Story Point Inconsistency Example:

Team estimates "Add user settings page" as 5 points.

What they didn't account for:
  - The settings API doesn't exist yet (discovered mid-sprint)
  - The design file has 3 states they didn't see
  - The form validation needs 8 edge cases
  - There's a permission check that requires backend work

Actual effort: 13 points worth of work.

What happens next sprint?
  - Team inflates estimates to avoid this happening again
  - Manager thinks "velocity dropped from 40 to 30"
  - Team gets pressure to "improve velocity"
  - Estimates inflate further

The system is eating itself.

Throughput: The Metric That Doesn't Lie

If estimation is broken, what should we use instead? Throughput: the count of items (PRs, tickets, features) completed per unit of time.

Why Throughput Works

Throughput doesn't require anyone to guess anything. It's purely empirical:

  • Objective: A merged PR is a merged PR. No debate needed.
  • Historical: Based on what actually happened, not what we hoped.
  • Inclusive: Automatically includes meetings, blockers, sick days, production incidents—all the "overhead" that estimates ignore.
  • Predictive: Your last 10 weeks are your best predictor of your next 10 weeks.

"But some tickets are bigger than others!"

Yes. But over a sufficient time window (4-6 weeks), the "size" averages out. This is the Law of Large Numbers at work. Your historical throughput bakes in the variation because it's measuring actual completions, not estimated effort.
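
If you want to see this for yourself, counting weekly completions takes a few lines of Python. This is a minimal sketch, assuming you've already pulled merge or close dates from your Git host or tracker export (the dates below are made up):

from collections import Counter
from datetime import date

# Hypothetical input: merge/close dates pulled from a Git host API or CSV export.
merge_dates = [
    date(2025, 9, 1), date(2025, 9, 3), date(2025, 9, 10),
    date(2025, 9, 11), date(2025, 9, 18), date(2025, 9, 19),
]

# Bucket completions by ISO year/week; the count in each bucket is that week's throughput.
weekly_throughput = Counter(d.isocalendar()[:2] for d in merge_dates)

for (year, week), count in sorted(weekly_throughput.items()):
    print(f"{year}-W{week:02d}: {count} items completed")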

Throughput vs Velocity

Velocity (Story Points)           Throughput (Item Count)
Based on estimates                Based on completions
Can be gamed                      Hard to game (work is done or it isn't)
Inconsistent over time            Stable over time
Not comparable across teams       Comparable with normalization
Requires estimation ceremony      Automatic from Git/Jira data

The Forecast Confidence Model

We've developed a simple framework for communicating forecasts to stakeholders. Instead of a single date (which is always wrong), provide three dates with confidence levels.

The Three-Date Forecast

The Forecast Confidence Model:

OPTIMISTIC (50% confidence):
  "If everything goes well, we could finish by [DATE]"
  This is your best-case scenario. Half the time you'll beat it,
  half the time you won't.

LIKELY (85% confidence):
  "We're highly likely to finish by [DATE]"
  This is your planning date. Safe for external commitments.
  You'll miss it ~15% of the time.

SAFE (95% confidence):
  "We're almost certain to finish by [DATE]"
  This is your buffer. Use for contractual commitments.
  Only miss this if something extraordinary happens.

Example:
  "Based on our historical throughput:
   - 50% confident: November 1
   - 85% confident: November 15
   - 95% confident: November 30

   Which level of risk is appropriate for this commitment?"
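
To turn those confidence levels into calendar dates for a status update, here's a minimal sketch; the week counts are placeholders, and the next sections show how to derive them from your throughput:

from datetime import date, timedelta

# Placeholder forecasts (in weeks) at each confidence level -- derive these from
# your own throughput data (see the calculation and Monte Carlo sections below).
forecast_weeks = {"50%": 2.5, "85%": 3.2, "95%": 5.0}

start = date.today()
for confidence, weeks in forecast_weeks.items():
    finish = start + timedelta(weeks=weeks)
    print(f"{confidence} confident: finish by {finish:%B %d}")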

"The question isn't 'When will it be done?' The question is 'How much schedule risk are you willing to accept?' Forecasting answers that question. Estimation doesn't."

How to Calculate Each Date

Here's a simple method using your throughput data:

  1. Count your backlog: How many items (PRs, tickets) need to ship?
  2. Calculate your throughput range: Look at your weekly completions for the last 10-12 weeks. Note the minimum, average, and maximum.
  3. Project the dates:
    • 50% date = Backlog / Best weekly throughput
    • 85% date = Backlog / Average weekly throughput
    • 95% date = Backlog / Worst weekly throughput

Example Calculation:

Backlog: 40 items to complete
Last 12 weeks throughput (items/week):
  12, 15, 8, 14, 10, 16, 11, 9, 13, 15, 12, 14

Statistics:
  Best week:     16 items
  Average:       12.4 items
  Worst week:    8 items

Forecasts:
  50% (best-case):    40 / 16 = 2.5 weeks
  85% (planning):     40 / 12.4 = 3.2 weeks
  95% (safe):         40 / 8 = 5 weeks

"We're 85% confident we'll finish in 3-4 weeks.
 If we hit our best pace, could be done in 2.5 weeks.
 Worst case, 5 weeks."
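
The same arithmetic in Python, using the example data above. This is a sketch of the simple method, not a CodePulse feature:

# The example data above: 12 weeks of throughput and a 40-item backlog.
weekly_throughput = [12, 15, 8, 14, 10, 16, 11, 9, 13, 15, 12, 14]
backlog = 40

best = max(weekly_throughput)                               # 16 items/week
worst = min(weekly_throughput)                              # 8 items/week
average = sum(weekly_throughput) / len(weekly_throughput)   # ~12.4 items/week

print(f"50% (best-case): {backlog / best:.1f} weeks")       # 2.5 weeks
print(f"85% (planning):  {backlog / average:.1f} weeks")    # 3.2 weeks
print(f"95% (safe):      {backlog / worst:.1f} weeks")      # 5.0 weeks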

Monte Carlo Simulation: The Pro Move

Averages can be dangerous. They hide risk. (If you cross a river that is "on average" 3 feet deep, you might still drown in a 10-foot section.)

Monte Carlo Simulation gives you better forecasts by running thousands of "what if" scenarios using random samples from your historical data.

How Monte Carlo Works

Instead of using the average, you run a simulation:

  1. Sample randomly from history: Pick a random week's throughput from your historical data.
  2. Simulate progress: Subtract that throughput from the remaining work.
  3. Repeat until done: Keep sampling random weeks until the backlog hits zero.
  4. Record how long it took: This is one simulation result.
  5. Run 10,000 times: Build up a distribution of possible outcomes.

Monte Carlo Example:

Historical weekly throughput: [8, 12, 15, 10, 14, 9, 16, 11, 13, 12]
Backlog: 40 items

Simulation 1:
  Week 1: Random sample = 12. Remaining: 28
  Week 2: Random sample = 8.  Remaining: 20
  Week 3: Random sample = 15. Remaining: 5
  Week 4: Random sample = 10. Remaining: 0
  Result: 4 sampled weeks (≈3.5 weeks, since the backlog empties mid-week)

Simulation 2:
  Week 1: Random sample = 8.  Remaining: 32
  Week 2: Random sample = 9.  Remaining: 23
  Week 3: Random sample = 10. Remaining: 13
  Week 4: Random sample = 8.  Remaining: 5
  Week 5: Random sample = 11. Remaining: 0
  Result: 5 sampled weeks (≈4.5 weeks, since the backlog empties mid-week)

... (repeat 10,000 times)

Results distribution:
  10th percentile: 3 weeks   (optimistic)
  50th percentile: 4 weeks   (median)
  85th percentile: 5 weeks   (likely)
  95th percentile: 6 weeks   (safe)
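
The whole simulation fits in a short Python script. This sketch samples historical weeks with replacement, as described above, and reads the confidence levels off the percentiles of the results:

import random

history = [8, 12, 15, 10, 14, 9, 16, 11, 13, 12]   # weekly throughput, last 10 weeks
backlog = 40
runs = 10_000

results = []
for _ in range(runs):
    remaining, weeks = backlog, 0
    while remaining > 0:
        remaining -= random.choice(history)   # sample one random historical week
        weeks += 1
    results.append(weeks)

results.sort()
for pct in (10, 50, 85, 95):
    # The simulated completion time at the pct-th percentile.
    print(f"{pct}th percentile: {results[int(runs * pct / 100) - 1]} weeks")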

Why Monte Carlo Is Better

  • Captures variability: Shows the range of outcomes, not just the average.
  • No assumptions about distribution: Uses your actual historical pattern, whatever shape it has.
  • Natural confidence intervals: The percentiles directly give you probability-based dates.
  • Easy to update: Add more historical data, run again.

You can run Monte Carlo simulations in a spreadsheet, or use specialized tools. The key insight is that randomness in your simulation reflects real-world uncertainty.

📊 Historical Data in CodePulse

CodePulse automatically tracks the throughput data you need for forecasting:

  • Dashboard shows PRs merged per week with 8-week trends
  • Executive Summary provides throughput trends and variance
  • Repositories lets you see throughput per project
  • Export historical data for Monte Carlo simulations

How to Transition: From Estimation to Forecasting

You can't flip a switch and stop estimating. Teams have habits, stakeholders have expectations, and "how long will this take?" isn't going away. Here's how to transition.

Phase 1: Run Both in Parallel (4-6 weeks)

Keep doing estimation meetings, but also track actual throughput:

  • Continue Planning Poker (don't disrupt yet)
  • Record weekly throughput: PRs merged, tickets closed
  • At the end of each sprint, compare estimate vs actual
  • Build up data showing forecast accuracy vs estimation accuracy

Parallel Tracking Example:

Sprint 1:
  Estimated: 35 story points
  Completed: 28 story points (80% accuracy)
  Throughput: 14 items

Sprint 2:
  Estimated: 30 story points
  Completed: 32 story points (107% accuracy)
  Throughput: 12 items

Sprint 3:
  Estimated: 35 story points
  Completed: 25 story points (71% accuracy)
  Throughput: 13 items

Estimation accuracy swing: 71% - 107% of plan (~41% of the average)
Throughput swing: 12 - 14 items per sprint (~15% of the average)

Throughput is more predictable than estimates.
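
To make that comparison concrete, here's a small sketch using the sprint numbers above; "swing" is defined here as the best-to-worst spread divided by the average, so both measures are on the same scale:

# Completion ratio (completed / estimated) and raw throughput for the three sprints above.
estimate_accuracy = [28 / 35, 32 / 30, 25 / 35]   # 0.80, 1.07, 0.71
throughput = [14, 12, 13]                         # items completed per sprint

def swing(values):
    """Best-to-worst spread as a fraction of the average."""
    return (max(values) - min(values)) / (sum(values) / len(values))

print(f"Estimation swing: {swing(estimate_accuracy):.0%}")   # ~41%
print(f"Throughput swing: {swing(throughput):.0%}")          # ~15%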

Phase 2: Introduce Probabilistic Forecasts (2-4 weeks)

  • When asked "When will X be done?", respond with three dates
  • "Based on our throughput, we're 50%/85%/95% confident in these dates"
  • Keep tracking accuracy: did reality fall within your ranges?
  • Share wins: "We said 85% by Nov 15, and we shipped Nov 12"

Phase 3: Reduce Estimation Ceremony (4-8 weeks)

  • Stop estimating small items: Anything clearly less than a week doesn't need points.
  • Use rough sizes: "Small / Medium / Large" instead of Fibonacci points.
  • Focus on breaking down work: Split large items into smaller ones instead of estimating them.

Phase 4: Full Throughput-Based Planning

  • Forecasts replace estimates for delivery commitments
  • Planning focuses on prioritization and sequencing, not estimation
  • Time spent in estimation meetings drops 60-80%
  • Forecast accuracy improves because it's based on data

Handling Stakeholder Pushback

"But when will it be done?" is the question that kills forecasting initiatives. Here's how to handle common pushback.

"I Need a Single Date for the Roadmap"

Response: "Use the 85% date. That's the date we're highly confident we'll hit. If you need even more certainty, use the 95% date."

Reframe the conversation:

Before (estimation-based):
  PM: "When will this be done?"
  Dev: "March 15."
  (March 15 arrives)
  PM: "Why is it late?"
  Dev: "It was more complex than we thought."

After (forecast-based):
  PM: "When will this be done?"
  Dev: "We're 50% confident by March 1, 85% by March 15,
        95% by March 30. Which confidence level do you need?"
  PM: "Let's use March 15 for planning."
  (March 12 arrives, feature ships)
  PM: "Nice, we beat the forecast!"

Same information, different framing, better outcome.

"Our Estimates Have Always Been Fine"

Response: "Let's measure. I'll track forecast accuracy vs estimation accuracy for the next quarter. We can compare."

Almost always, this comparison shows forecasts are more accurate and less effort. The data speaks for itself.

"The Business Needs to Know Exactly When"

Response: "The business needs to know risk. A single date hides risk. A range with probabilities shows exactly how much risk they're accepting."

  • For marketing launches: Use the 95% date, then try to beat it.
  • For internal planning: Use the 85% date with contingency.
  • For rough roadmaps: Use the 50% date with clear caveats.

"But Story Points Help the Team Discuss Complexity"

Response: "Keep the discussion, drop the numbers. Asking 'Is this bigger or smaller than that?' is useful. Pretending that discussion produces accurate predictions isn't."

You can have complexity conversations without assigning fake-precise numbers. "This feels risky because X" is more useful than "I think it's an 8."

Forecasting with CodePulse

CodePulse provides the historical data foundation you need for forecasting:

🔮 Building Forecasts from CodePulse Data

Use these data points for your forecasts:

  • Throughput trends: Dashboard shows PRs merged over time
  • Cycle time volatility: High variance in cycle time = higher forecast risk
  • Team capacity: Developers shows active contributors (not just headcount)
  • Repository breakdown: See throughput per project for project-level forecasts
  • Export data: Pull historical throughput for Monte Carlo simulations

Key Metrics for Forecasting

Metric                  What It Tells You                      Where to Find It
PRs Merged / Week       Raw throughput for forecasts           Dashboard, 8-week trend
Cycle Time              Delivery speed and variability         Dashboard, cycle time breakdown
Active Contributors     Effective team size (not headcount)    Developers page
Deployment Frequency    PRs per working day                    Dashboard

Action Plan: Start This Week

This Week

  1. Export your throughput: Get the last 10-12 weeks of PRs merged from Dashboard.
  2. Calculate your range: Find your min, max, and average weekly throughput.
  3. Try one forecast: For your current work, calculate 50%/85%/95% dates using the simple method above.

This Month

  1. Run parallel tracking: Keep doing estimates, but also record throughput and forecast accuracy.
  2. Present forecasts to stakeholders: Start using three-date forecasts in status updates.
  3. Measure accuracy: Did reality fall within your forecast range?

This Quarter

  1. Reduce estimation ceremony: Cut Planning Poker time by 50%.
  2. Compare accuracy: Show stakeholders forecast accuracy vs estimation accuracy over 3 months.
  3. Build the case for change: If forecasts are more accurate with less effort, make the switch.

"Every hour spent in estimation meetings is an hour not spent shipping. Your historical data is already a better predictor than your collective guessing."

For more on data-driven engineering management, see our guides on Capacity Planning with PR Data, Building the Business Case for Engineering Metrics, and Progress Tracking Without Jira.

See these insights for your team

CodePulse connects to your GitHub and shows you actionable engineering metrics in minutes. No complex setup required.

Free tier available. No credit card required.