Your team committed to 40 story points. You delivered 28. Again. Sprint after sprint, the pattern repeats: optimistic planning, frantic mid-sprint scrambles, and demoralized retrospectives asking "why do we keep missing?" The answer is not "estimate better." The answer is that story points are the wrong tool. This guide shows how to use PR data to predict spillover before it happens and break the cycle of missed commitments.
"Story points measure effort guesses. PRs measure actual completions. One of these is useful for forecasting. Hint: it is not the one that requires a meeting."
Sprint spillover is not a team discipline problem. It is a visibility problem. When you cannot see that a sprint is going sideways until day 8 of 10, intervention comes too late. But PR data gives you leading indicators by day 3. This guide introduces the Spillover Early Warning System: a framework for predicting and preventing missed sprint commitments using metrics that do not require anyone to guess.
🔥 Our Take
Story points are a relic of a pre-data era. They persist because teams are comfortable with them, not because they work.
Every study on estimation accuracy shows the same thing: teams overestimate what they can complete. Planning Poker does not fix this because it democratizes the optimism bias instead of eliminating it. Throughput-based forecasting using your PR history is more accurate, requires less ceremony, and cannot be gamed. The only reason to keep story points is organizational inertia. If you want to actually predict when work will ship, use data that measures completions, not intentions.
The Spillover Problem: Why 70% of Sprints Miss
Research from multiple agile studies shows a consistent pattern: approximately 70% of sprints do not complete all committed work. This is not a few teams doing poorly. This is the default outcome of sprint-based planning.
The Statistics Are Damning
| Metric | Industry Average | Source |
|---|---|---|
| Sprints with spillover | 68-72% | State of Agile Reports 2021-2024 |
| Average commitment completion rate | 72-78% | Scrum.org benchmarks |
| Teams that regularly hit 90%+ completion | Less than 20% | VersionOne surveys |
| Stories re-estimated mid-sprint | 35-40% | Rally/Broadcom data |
| Velocity prediction accuracy (story points) | +/- 25-40% | Multiple studies |
Compare this to throughput-based forecasting (counting completed items rather than estimated points): teams using historical throughput for forecasting typically achieve +/- 10-15% accuracy. Same teams, same work, different measurement system.
"We do not have an estimation problem. We have a measurement problem. Story points measure intentions. PRs merged measure reality. One of these correlates with delivered value."
Why Traditional Sprint Planning Fails
| Failure Mode | What Happens | How PR Data Fixes It |
|---|---|---|
| Planning Fallacy | Teams imagine best-case scenarios | Historical throughput includes actual interruptions |
| Anchoring Bias | First estimate sets the range | PR data is objective, no discussion needed |
| Social Pressure | Nobody wants to be the pessimist | Data does not feel social pressure |
| Point Inflation | Teams inflate points to hit velocity targets | PR count cannot be inflated without splitting work |
| Invisible Work | Bugs, support, meetings not in estimates | Historical throughput bakes in all overhead |
| Scope Creep | Work expands after commitment | Cycle time trends reveal mid-sprint changes |
Root Causes of Sprint Spillover (Data-Driven)
Spillover has patterns. When you analyze PR data from teams with chronic spillover, you see the same root causes repeatedly. Fixing spillover means identifying which pattern is dominant for your team.
The Spillover Causes Framework
| Cause | PR Data Signal | Frequency | Fix |
|---|---|---|---|
| Oversized PRs | Average PR size >400 lines | 35% of cases | Enforce smaller PRs, break work down |
| Review Bottlenecks | Wait-for-review > Coding time | 25% of cases | Review SLAs, load balancing |
| High WIP | >2 active PRs per developer | 20% of cases | WIP limits, finish before starting |
| Cycle Time Variance | Standard deviation > mean | 10% of cases | Reduce blockers, improve process |
| Late-Sprint Starts | >40% PRs opened in final third | 10% of cases | Earlier starts, better task breakdown |
How Each Cause Creates Spillover
Oversized PRs (35% of spillover)
Large PRs are the single biggest predictor of spillover. When a "2-day" story turns into a 700-line PR, it sits in review for 3 days instead of 4 hours. The math does not work.
PR Size Impact on Cycle Time:

| PR Size (lines) | Avg Cycle Time | Review Wait | Spillover Risk |
|---|---|---|---|
| < 100 | 6-8 hours | 1-2 hours | Low (5%) |
| 100-300 | 1-2 days | 4-8 hours | Moderate (15%) |
| 300-500 | 2-4 days | 1-2 days | High (35%) |
| 500-1000 | 4-7 days | 2-4 days | Very High (60%) |
| > 1000 | 7-14+ days | 4-7+ days | Almost Certain (85%) |

Every 100 lines above 200 adds roughly 0.5 days to cycle time. An 800-line PR takes 4x longer than two 400-line PRs combined.
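The rule of thumb translates directly into code. A minimal sketch (the helper name is mine, not a CodePulse function):

```python
def extra_cycle_days(pr_lines):
    """~0.5 extra cycle-time days per 100 lines above 200 (rule of thumb above)."""
    return max(0.0, (pr_lines - 200) / 100 * 0.5)

# An 800-line PR carries roughly 3 extra days of cycle time
print(extra_cycle_days(800))
```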
Review Bottlenecks (25% of spillover)
Work is done, but it cannot merge. PRs pile up waiting for the same 1-2 reviewers. By the time reviews happen, the sprint is over.
High WIP (20% of spillover)
When every developer has 3+ PRs open, none of them are getting finished. Context switching kills throughput. See our High Activity, Low Progress Guide for the detailed diagnosis.
Cycle Time Variance (10% of spillover)
Even if your average cycle time is fine, high variance means unpredictable delivery. Some PRs ship in hours, others take weeks. You cannot plan with that variance.
Late-Sprint Starts (10% of spillover)
Work that starts on day 7 of a 10-day sprint cannot complete in time. If more than 40% of your PRs open in the final third of the sprint, you are setting up for spillover before any code is written.
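The 40% late-start check is easy to script from PR open timestamps. A minimal sketch, assuming you already have the open dates (the function name and the rounding of the two-thirds cutoff are my conventions):

```python
from datetime import date, timedelta

def late_start_share(pr_open_dates, sprint_start, sprint_end):
    """Fraction of PRs opened in the final third of the sprint."""
    sprint_days = (sprint_end - sprint_start).days
    # Final third begins two-thirds of the way through the sprint
    cutoff = sprint_start + timedelta(days=round(sprint_days * 2 / 3))
    if not pr_open_dates:
        return 0.0
    late = sum(1 for opened in pr_open_dates if opened >= cutoff)
    return late / len(pr_open_dates)

# 10-day sprint: the final third starts around day 7
opened = [date(2024, 6, 3), date(2024, 6, 4), date(2024, 6, 5),
          date(2024, 6, 10), date(2024, 6, 11)]
share = late_start_share(opened, date(2024, 6, 3), date(2024, 6, 13))
print(f"{share:.0%} of PRs opened in the final third")  # 40% -- right at the warning line
```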
📊 How to See This in CodePulse
Identify your dominant spillover cause:
- Dashboard shows cycle time breakdown including wait-for-review time
- Forecasting displays throughput trends and delivery predictions
- Developer Analytics reveals WIP per developer and PR size patterns
- Repository Metrics shows PR size distribution and merge rates
Using PR Data to Predict Spillover Risk
The advantage of PR data is that it gives you early warning. By day 3 of a sprint, you can predict with high accuracy whether you will complete your commitments.
The Three Key Predictors
Spillover Risk Score
Calculate spillover risk using weighted PR metrics. Each factor scales from 0-100. Total risk above 60 indicates high spillover probability.
Factor Calculations
| Factor | Calculation | Why It Matters |
|---|---|---|
| PR Size Factor | ((Avg Lines - 100) / 5) capped at 100 | Large PRs have exponentially longer cycle times |
| Cycle Time Factor | (Avg Hours / Sprint Hours) x 100 | If one PR takes 50% of sprint time, you can only fit 2 |
| WIP Factor | ((Active PRs/Dev - 1) x 40) capped at 100 | WIP above 2 means context switching is killing flow |
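The factor formulas above can be combined into a simple calculator. This sketch uses the three calculations from the table; the weights are illustrative assumptions, since the article does not specify a weighting:

```python
def clamp(value, lo=0.0, hi=100.0):
    """Keep a factor inside the 0-100 range."""
    return max(lo, min(hi, value))

def spillover_risk(avg_pr_lines, avg_cycle_hours, sprint_hours,
                   active_prs_per_dev, weights=(0.4, 0.3, 0.3)):
    """Weighted spillover risk score on a 0-100 scale.

    Factor formulas come from the table above; the default weights
    are an assumption, not values from this guide.
    """
    size_factor = clamp((avg_pr_lines - 100) / 5)               # large PRs -> long cycles
    cycle_factor = clamp(avg_cycle_hours / sprint_hours * 100)  # share of sprint per PR
    wip_factor = clamp((active_prs_per_dev - 1) * 40)           # context switching above WIP 2
    w_size, w_cycle, w_wip = weights
    return w_size * size_factor + w_cycle * cycle_factor + w_wip * wip_factor

# 350-line PRs, 40h average cycle in an 80h sprint, 2.5 active PRs per dev
score = spillover_risk(350, 40, 80, 2.5)
print(round(score, 1))  # under the 60-point warning line, but not by much
```

Anything above 60 warrants a mid-sprint intervention; recompute the score daily so you see the trend, not just a snapshot.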
The Spillover Early Warning System
Stop waiting until the retrospective to discover your sprint failed. Implement checkpoints that surface problems while there is still time to act.
Sprint Health Checkpoints
Day 1-2: Launch Check
- All sprint items have PRs or branches created
- No items larger than 400 expected lines
- Review assignments distributed evenly
- WIP per developer at or below 2
Day 3-4 (Mid-Sprint): Momentum Check
- At least 25% of PRs have received first review
- No PR waiting more than 24 hours for review
- Average cycle time on track (< 40% of sprint length)
- No developer with 0 merged PRs yet
Day 6-7: Completion Check
- At least 60% of sprint PRs merged
- No PR with more than 2 review cycles
- Items still in progress have clear path to merge
- Scope cuts identified if needed
Day 8-10: Close-Out Check
- All remaining PRs in final review
- No new PRs being opened
- Carryover items clearly identified
- Root cause of any spillover documented
Warning Signs by Day
Sprint Warning Signs (10-day sprint example):
Day 2 Warnings (Severe Impact):
- < 50% of items have PRs started
- Any item estimated at > 5 days work
- Review queue already building
Day 4 Warnings (High Impact):
- < 20% of PRs merged
- Average PR age > 48 hours
- WIP per developer > 3
Day 6 Warnings (Moderate Impact):
- < 50% of PRs merged
- Any PR waiting > 72 hours for review
- New scope added to sprint
Day 8 Warnings (Limited Recovery):
- < 70% of PRs merged
- PRs still being opened
- Large PRs still in progress
After Day 8, spillover is likely unavoidable. Intervention windows shrink rapidly.
"By day 4 of a 10-day sprint, your outcome is 80% determined. The signals are there. Most teams just do not look at them until the retrospective."
Automated Monitoring
Manual checks get skipped when teams are busy. Automate the warning system:
- Daily dashboard review: Add sprint health to daily standup agenda
- Threshold alerts: Notify when any metric crosses warning threshold
- Trend detection: Alert when trajectory suggests missing deadline
- Review queue monitoring: Alert when queue exceeds 8-hour SLA
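A threshold check of this kind is a few lines of code. This sketch uses the warning values suggested in this guide; the metric names and the `check_alerts` helper are my conventions, not a CodePulse API:

```python
# Warning thresholds mirror the alert values suggested in this guide;
# tune them to your team's baseline.
WARNING_THRESHOLDS = {
    "wait_for_review_hours": 8,   # review queue SLA
    "active_prs_per_dev": 3,      # WIP ceiling
    "avg_pr_size_lines": 400,     # right-sizing target
    "cycle_time_hours": 72,       # open-to-merge budget
}

def check_alerts(metrics):
    """Return the names of metrics that crossed their warning threshold."""
    return [name for name, limit in WARNING_THRESHOLDS.items()
            if metrics.get(name, 0) > limit]

todays_metrics = {
    "wait_for_review_hours": 11,
    "active_prs_per_dev": 2,
    "avg_pr_size_lines": 520,
    "cycle_time_hours": 30,
}
print(check_alerts(todays_metrics))  # ['wait_for_review_hours', 'avg_pr_size_lines']
```

Wire the output into your standup notes or a chat webhook so the warnings surface daily without anyone remembering to look.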
🔔 Setting Up Spillover Alerts in CodePulse
Create early warning alerts for spillover signals:
- Navigate to Alert Rules
- Create alert: Wait-for-review time exceeds 8 hours
- Create alert: Any developer with more than 3 active PRs
- Create alert: Average PR size exceeds 400 lines
- Create alert: Cycle time exceeds 72 hours
Reducing Spillover Without Gaming Velocity
The goal is not to hit arbitrary story point targets. The goal is predictable, sustainable delivery. Here is how to reduce spillover through process improvements rather than point manipulation.
Strategy 1: Right-Size Your PRs
The single highest-impact change you can make. Smaller PRs mean faster reviews, fewer merge conflicts, and more predictable cycle times.
| Before | After | Impact |
|---|---|---|
| One 800-line PR per feature | Three 250-line PRs per feature | 40% faster cycle time |
| Reviews take 2-4 hours | Reviews take 20-30 minutes | Reviewers stay engaged |
| Merge conflicts common | Merge conflicts rare | Less rework |
| Feedback late in process | Feedback early and often | Less wasted effort |
Strategy 2: Implement WIP Limits
Stop starting and start finishing. A strict WIP limit forces completion before new work begins. Counterintuitively, doing less at once means delivering more over time.
WIP Limit Implementation:
Step 1: Measure current WIP
- Count active PRs per developer right now
- Typical finding: 3-5 per person
Step 2: Set limit at current - 1
- If average is 4, set limit at 3
- This is the "easy" step
Step 3: Reduce by 1 every 2 weeks
- 3 -> 2 is the hard step
- Stay at 2 - this is sustainable
Step 4: Enforce
- New work cannot start until WIP < limit
- Blocked? Help unblock, don't work around it
Expected results:
- Week 1-2: Painful adjustment, velocity dips
- Week 3-4: Flow improves, cycle time drops
- Week 5+: Throughput increases 20-40%
Strategy 3: Fix Review Bottlenecks
- Set SLAs: First review within 4 hours, all reviews within 24 hours
- Distribute load: No one reviews more than 2x team average
- Make it easy: Smaller PRs get reviewed faster
- Protected time: Block 1-2 hours daily for review
Strategy 4: Use Throughput for Planning
Instead of estimating story points, count items. Your historical throughput tells you how many items you can complete per sprint. Use that number.
Throughput-Based Sprint Planning
Use historical throughput with a 15% buffer. This accounts for estimation optimism and unexpected work.
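The throughput-plus-buffer rule is a few lines of arithmetic. A sketch using the 15% buffer from the text (the sample sprint history is made up):

```python
def sprint_commitment(items_completed_per_sprint, buffer=0.15):
    """Commit to average historical throughput minus a safety buffer."""
    avg = sum(items_completed_per_sprint) / len(items_completed_per_sprint)
    return int(avg * (1 - buffer))  # round down: under-commit, over-deliver

history = [12, 9, 11, 10, 13, 11]   # items merged in the last six sprints
print(sprint_commitment(history))   # average is 11 -> commit to 9 items
```

Note that rounding down is deliberate: the cost of finishing early and pulling in stretch work is far lower than the cost of spillover.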
For more on transitioning from estimation to throughput, see our Stop Estimating, Start Forecasting guide.
Strategy 5: Commit vs Forecast Separation
Not everything in a sprint needs to be a commitment. Separate what you are committing to from what you are forecasting as stretch goals:
- Commit (70% of capacity): What you will definitely complete
- Stretch (20% of capacity): What you will attempt if things go well
- Reserve (10% of capacity): Buffer for unplanned work
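The 70/20/10 split can be computed directly from your planned item count. A minimal sketch (function name and dict shape are mine):

```python
def split_capacity(sprint_items):
    """Split planned items into commit / stretch / reserve buckets (70/20/10)."""
    commit = round(sprint_items * 0.70)
    stretch = round(sprint_items * 0.20)
    reserve = sprint_items - commit - stretch   # remainder buffers unplanned work
    return {"commit": commit, "stretch": stretch, "reserve": reserve}

print(split_capacity(10))  # {'commit': 7, 'stretch': 2, 'reserve': 1}
```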
For more on capacity planning, see our Capacity Planning with PR Data guide.
Frequently Asked Questions
Q: We use Scrum and story points are required. What do we do?
A: You can use story points for relative sizing discussions while using throughput for actual forecasting. Run both in parallel: do your Planning Poker, but track items completed (not points completed) for your predictions. After a few months, you will have data showing which method is more accurate. Most teams find throughput wins by a wide margin.
Q: How do we handle items that are genuinely different sizes?
A: Over a sufficient time window (4-6 sprints), size variation averages out. Your historical throughput already includes the mix of small and large items your team naturally works on. If you have occasional truly massive items, break them down or track them separately. Most "large" items can be split into multiple PRs anyway.
Q: Will smaller PRs mean more overhead?
A: Initially, yes. You will have more PRs to review. But smaller PRs review faster (a 100-line PR takes 15 minutes to review; a 500-line PR takes 2 hours). The net time is less, and the feedback loop is faster. After 2-3 sprints, teams report that smaller PRs feel lighter, not heavier.
Q: What if management still wants story point velocity reports?
A: Give them what they need while internally using what works. Report story points for compliance, but make decisions based on throughput. Over time, educate stakeholders on why throughput is a better predictor. Show them the accuracy comparison. Most leaders care about predictability more than any specific methodology.
Q: How quickly can we see improvement?
A: PR size reduction shows impact within 1-2 sprints. WIP limits typically hurt before they help (expect a dip in sprint 1, improvement by sprint 3). Review bottleneck fixes are immediate but require enforcement. Overall, expect measurable spillover reduction within 4-6 sprints if you address your dominant cause.
Q: Is some spillover acceptable?
A: Yes. Targeting 100% sprint completion every time means you are sandbagging. 85-90% completion is healthy: it means you are challenging yourselves while remaining realistic. Below 70% consistently indicates a systemic problem. Above 95% consistently suggests you are not committing to enough.
Action Plan: This Sprint
This Week
- Baseline your metrics: Check your current average PR size, cycle time, and WIP per developer in Dashboard
- Calculate your risk score: Use the formula above to see your current spillover risk
- Identify dominant cause: Which of the five causes matches your data most closely?
This Sprint
- Implement checkpoints: Add the Day 3-4 momentum check to your standup routine
- Set one improvement target: Focus on your dominant cause only
- Track daily: Monitor the key metric for your target cause
This Quarter
- Transition to throughput: Start tracking items completed alongside story points
- Compare accuracy: After 4-6 sprints, compare throughput predictions to point-based predictions
- Reduce ceremony: As throughput proves more accurate, reduce time spent on estimation meetings
For related guidance, see our guides on Stop Estimating, Start Forecasting, Capacity Planning with PR Data, and High Activity, Low Progress.
See these insights for your team
CodePulse connects to your GitHub and shows you actionable engineering metrics in minutes. No complex setup required.
Free tier available. No credit card required.
Related Guides
Story Points Are a Scam. Here's What Actually Works
Story points are often waste. Learn how to use historical throughput and cycle time to forecast delivery dates with higher accuracy and less meeting time.
Stop Guessing Capacity. Your PRs Already Know
Use PR metrics to build data-driven capacity models, plan sprints realistically, and communicate engineering capacity to stakeholders.
Lots of Commits, No Features: The Productivity Illusion
Your engineers are committing code, but features aren't launching. Use the HALT Framework to diagnose churn, abandon rate, review bottlenecks, and WIP explosion - with targeted fixes for each.
Agile vs DevOps: Why the Debate Is Missing the Point
Agile and DevOps are not competitors—they are complementary. Agile handles planning and prioritization; DevOps handles delivery and operations. This guide shows how they work together.
