Dashboards are great for analysis, but they require you to look at them. Alerts bring the insights to you—notifying you when PRs are stuck, SLAs are breached, or metrics drift outside acceptable ranges. Done well, alerts let you stay informed without constantly checking dashboards.
This guide covers how to set up effective alerts for engineering metrics, including what to alert on, how to configure thresholds, how to avoid the dreaded alert fatigue, and how to build an alert system that actually drives action.
Our Take
Most engineering teams have more alerts than they need and fewer alerts that matter. If your team ignores alerts, you don't have an "alert fatigue" problem—you have a "bad alerts" problem. Every alert should pass this test: "If this fires at 3 AM, would I be glad we set it up?" If the answer is no, delete it.
Why Real-Time Alerts Beat Reports
The Problem with Dashboard-Only Visibility
Dashboards are valuable, but they have limitations:
- They require active checking: If you don't look, you don't know
- Delayed awareness: Issues fester until someone happens to check
- Context switching: You have to stop what you're doing to look
- Information overload: Dashboards show everything, making it hard to spot what matters
"The best alert is one that tells you something you didn't know, at a time when you can still do something about it."
The Value of Proactive Alerts
Alerts solve these problems by:
- Pushing information to you: No need to remember to check
- Catching issues early: Immediate notification when something goes wrong
- Filtering signal from noise: Only alert on what exceeds thresholds
- Enabling faster response: Minutes instead of hours or days
Alerts vs. Reports: When to Use Each
Use Real-Time Alerts for:
- Threshold breaches (PR stuck > 24 hours)
- Anomalies (cycle time spiked this week)
- Process violations (PR merged without approval)
- Time-sensitive issues (Friday deploy with no reviewer)
- Individual PR states that need immediate attention

Use Daily Digests for:
- Aggregate metrics (5 PRs are stuck today)
- Trend summaries (cycle time up 15% this week)
- Team health snapshots (review load imbalance)
- Non-urgent patterns worth noting

Use Weekly Reports for:
- Trends over time (cycle time month-over-month)
- Comparative analysis (team A vs team B)
- Executive summaries (weekly health scorecard)
- Deep dives (why did cycle time increase?)
The Alert Fatigue Problem
How Alert Fatigue Kills Your System
Alert fatigue is the silent killer of engineering alerting systems. It happens gradually: you start with a few important alerts, add more over time, and suddenly your team ignores them all.
"An alert that's ignored is worse than no alert at all. It creates the illusion of monitoring while providing none of the benefit."
Signs your team has alert fatigue:
- Alerts go unacknowledged for hours or days
- Team members mute or filter alert channels
- "It's probably nothing" attitude toward new alerts
- Real issues missed because alerts are ignored
- New team members ask "do we actually look at these?"
- Alert channels have hundreds of unread messages
Our Take
If your team has more than 10 alerts firing per day on average, you have too many. The math is simple: even if the team spends a generous 45 minutes a day on triage, 10 alerts/day means each one gets less than 5 minutes of attention. That's not alerting—that's noise generation. Cut ruthlessly until every alert feels important.
The Psychology of Alert Fatigue
Understanding why alert fatigue happens helps you prevent it:
- Cry wolf effect: After enough false positives, the brain learns to dismiss all alerts from that source
- Cognitive overload: More than 3-5 pieces of information at once overwhelms working memory
- Decision fatigue: Each alert requires a decision; too many depletes willpower
- Learned helplessness: If alerts can't be acted on, people stop trying
Alert Prioritization Framework
The P0-P3 Priority System
Not all alerts deserve the same response. Implement a clear priority system that everyone understands:
P0 - CRITICAL (Immediate Response Required)
├── Definition: Production impact or security risk
├── Response time: Minutes
├── Notification: Phone call, SMS, @here in dedicated channel
├── Engineering examples:
│   ├── Merge to main with failing security scan
│   ├── PR merged without any approval to protected branch
│   └── Suspected credential leak in committed code
└── Owner: On-call engineer

P1 - HIGH (Same Business Day)
├── Definition: Process violation or significant blocker
├── Response time: 2-4 hours
├── Notification: Direct Slack message + channel post
├── Engineering examples:
│   ├── PR stuck in review > 48 hours
│   ├── SLA breach on critical path work
│   └── Test failure rate spike > 50%
└── Owner: Team lead or PR author

P2 - MEDIUM (Next Business Day)
├── Definition: Needs attention but not urgent
├── Response time: 24-48 hours
├── Notification: Team channel only
├── Engineering examples:
│   ├── PR awaiting review > 24 hours
│   ├── Cycle time trending up week-over-week
│   └── Review coverage dropped below target
└── Owner: Team or assigned reviewer

P3 - LOW (Weekly Review)
├── Definition: Informational, track for trends
├── Response time: Within a week
├── Notification: Daily/weekly digest only
├── Engineering examples:
│   ├── PR size exceeds guideline (non-blocking)
│   ├── Test flakiness detected
│   └── Minor metric drift
└── Owner: Team during retrospective
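To keep delivery consistent with this matrix, it helps to encode it as data rather than scattering priority logic across scripts. Below is a minimal sketch in Python; the channel names and response targets mirror the matrix above, but the schema itself is an illustrative assumption, not a required format.

```python
# Minimal sketch: the P0-P3 matrix as data, so routing code stays generic.
# Channel names and response targets are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class PriorityPolicy:
    response_target_minutes: int
    channels: list[str] = field(default_factory=list)
    page_on_call: bool = False   # phone/SMS + @here for P0
    digest_only: bool = False    # hold for the daily/weekly digest

PRIORITY_POLICIES = {
    "P0": PriorityPolicy(15, ["#eng-alerts-critical"], page_on_call=True),
    "P1": PriorityPolicy(4 * 60, ["#eng-alerts"]),
    "P2": PriorityPolicy(24 * 60, ["#eng-alerts"]),
    "P3": PriorityPolicy(7 * 24 * 60, digest_only=True),
}

def policy_for(priority: str) -> PriorityPolicy:
    """Look up how an alert of a given priority should be delivered."""
    return PRIORITY_POLICIES[priority]
```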
Which Metrics Deserve Real-Time Alerts
The question isn't "can we alert on this?" but "should we?" Here's a framework for deciding:
REAL-TIME ALERTS (Within minutes)
┌──────────────────────────────────────────────────────────────┐
│ Criteria:
│ - Actionable right now
│ - Getting worse by the minute
│ - Clear owner who can respond
├──────────────────────────────────────────────────────────────┤
│ Examples:
│ ✓ PR merged without approval
│ ✓ Critical PR stuck > 4 hours
│ ✓ Failing checks overridden to merge
│ ✓ SLA breach on customer-facing work
│ ✗ Cycle time increased 10% this week (use digest)
│ ✗ PR is large (informational, not urgent)
└──────────────────────────────────────────────────────────────┘

DAILY DIGEST (Once per day, morning)
┌──────────────────────────────────────────────────────────────┐
│ Criteria:
│ - Needs attention today but not this minute
│ - Multiple items that should be reviewed together
│ - Trend data that's meaningful in aggregate
├──────────────────────────────────────────────────────────────┤
│ Examples:
│ ✓ 5 PRs awaiting review > 24 hours
│ ✓ Yesterday's velocity summary
│ ✓ Review load imbalance across team
│ ✓ PRs approved but not merged
└──────────────────────────────────────────────────────────────┘

WEEKLY REPORT (Monday morning)
┌──────────────────────────────────────────────────────────────┐
│ Criteria:
│ - Trend data over time
│ - Strategic, not tactical
│ - Requires analysis, not immediate action
├──────────────────────────────────────────────────────────────┤
│ Examples:
│ ✓ Week-over-week cycle time comparison
│ ✓ Review coverage trends
│ ✓ Team contribution balance
│ ✓ SLA compliance rate
└──────────────────────────────────────────────────────────────┘
Essential Alerts to Set Up
Category 1: Stuck Work Alerts
Stuck PRs are one of the biggest sources of wasted time. These alerts ensure nothing falls through the cracks. For more on reducing cycle time, see our guide on reducing PR cycle time.
PR awaiting review too long:
- Threshold: PR open > 24 hours with no review
- Action: Notify author and potential reviewers
- Escalation: After 48 hours, notify team lead
PR stuck in review:
- Threshold: Changes requested > 24 hours ago, no update
- Action: Remind author to address feedback
- Context: Include link to PR and pending comments
Approved but not merged:
- Threshold: Approved > 8 hours ago, not merged
- Action: Remind author to merge or explain delay
- Context: Check for failing CI that might be blocking
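If you wire these up yourself rather than relying on a tool, the detection side is straightforward. Here is a rough sketch of the first alert (PR open > 24 hours with no review) against the GitHub REST API; the repository name and token handling are placeholders, and pagination, business-hours logic, and notification delivery are left out.

```python
# Sketch: find open PRs older than a threshold that have no reviews yet.
# REPO and the token env var are placeholders; pagination is omitted.
import os
from datetime import datetime, timedelta, timezone
import requests

GITHUB_API = "https://api.github.com"
REPO = "your-org/your-repo"  # placeholder
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

def stuck_prs(threshold_hours: int = 24) -> list[dict]:
    cutoff = datetime.now(timezone.utc) - timedelta(hours=threshold_hours)
    prs = requests.get(f"{GITHUB_API}/repos/{REPO}/pulls",
                       params={"state": "open"}, headers=HEADERS, timeout=10).json()
    stuck = []
    for pr in prs:
        opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
        if opened > cutoff or pr.get("draft"):
            continue  # too recent, or a draft PR (common false positive)
        reviews = requests.get(
            f"{GITHUB_API}/repos/{REPO}/pulls/{pr['number']}/reviews",
            headers=HEADERS, timeout=10).json()
        if not reviews:
            stuck.append(pr)
    return stuck
```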
Category 2: SLA Breach Alerts
If you have review SLAs (and you should—see our PR SLA implementation guide), alert when they're breached:
Time to first review SLA:
- Example SLA: First review within 4 hours during business hours
- Alert: When PR passes 4-hour mark with no review
- Route to: Assigned reviewers, then backup reviewers
Cycle time SLA:
- Example SLA: PRs merged within 2 business days
- Alert: When PR approaches or exceeds target
- Route to: Author and manager
Category 3: Quality Risk Alerts
Alert when review quality might be compromised:
Rubber-stamp reviews:
- Condition: Large PR approved in under 5 minutes
- Alert: "PR #123 (450 lines) approved after 3 minutes—verify review quality"
- Route to: Team lead or secondary reviewer
Merge without approval:
- Condition: PR merged with zero approvals
- Alert: Immediate notification to team lead
- Context: Was this an emergency? Document justification
Failing checks merged:
- Condition: CI checks failing but PR merged anyway
- Alert: "PR #456 merged with failing tests—investigate"
- Route to: Author and on-call
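The rubber-stamp condition at the top of this category is a heuristic, and worth making concrete. In the sketch below the inputs arrive as plain values (whether they come from a webhook or a poll is up to your tooling), and "review time" is approximated as the gap between the PR opening and the approval, since time actually spent reading the diff isn't observable; both thresholds are illustrative.

```python
# Sketch: flag large PRs approved suspiciously quickly. Thresholds and
# the time approximation (PR opened -> approved) are assumptions.
from datetime import datetime, timezone

def is_possible_rubber_stamp(pr_size_lines: int,
                             opened_at: datetime,
                             approved_at: datetime,
                             min_lines: int = 300,
                             max_minutes: float = 5.0) -> bool:
    elapsed_minutes = (approved_at - opened_at).total_seconds() / 60
    return pr_size_lines >= min_lines and elapsed_minutes <= max_minutes

# Example: a 450-line PR approved 3 minutes after opening gets flagged.
opened = datetime(2024, 6, 4, 14, 0, tzinfo=timezone.utc)
approved = datetime(2024, 6, 4, 14, 3, tzinfo=timezone.utc)
print(is_possible_rubber_stamp(450, opened, approved))  # True
```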
Category 4: Trend Alerts
Alert when metrics trend in the wrong direction:
Cycle time increasing:
- Condition: Week-over-week cycle time up > 20%
- Alert: Weekly summary with trend data
- Route to: Engineering manager
Review coverage dropping:
- Condition: Percentage of PRs with reviews drops below threshold
- Alert: "Only 85% of PRs were reviewed this week (target: 95%)"
- Route to: Team lead
🔔 How CodePulse Helps
CodePulse's Alerts page lets you create custom alert rules:
- Set thresholds for any metric (cycle time, PR count, review coverage)
- Choose comparison operators (greater than, less than, etc.)
- Receive email notifications when alerts trigger
- View active and historical alerts on your Dashboard
Slack Channel Organization Strategies
The Channel Structure That Works
Poor channel organization is a leading cause of alert fatigue. Here are three proven patterns:
Pattern 1: By Severity (Recommended for most teams)
#eng-alerts-critical
├── P0 alerts only
├── @here notifications enabled
├── Expected volume: < 5 per week
└── Must acknowledge within 15 minutes

#eng-alerts
├── P1 and P2 alerts
├── No @here, no @channel
├── Expected volume: 5-15 per day
└── Check at least twice daily

#eng-metrics-digest
├── Daily summaries, weekly reports
├── P3 and informational
├── Muting OK
└── Review during planning/retros
Pattern 2: By Team (For larger organizations)
#team-platform-alerts
├── All alerts for Platform team's repos
├── Team members only
└── Reduces noise for unrelated work

#team-frontend-alerts
├── All alerts for Frontend team's repos
├── Team members only
└── Context stays relevant

#eng-alerts-cross-team
├── Alerts that span multiple teams
├── Escalations
└── Organization-wide issues
Pattern 3: Hybrid (Severity + Team)
#eng-alerts-critical (all teams, P0 only)
#team-platform-alerts (platform team, P1-P3)
#team-frontend-alerts (frontend team, P1-P3)
#eng-metrics-weekly (all teams, weekly digest)
Our Take
The hybrid pattern works best for teams of 15-50 engineers. Smaller teams can use severity-only. Larger orgs need team-based routing. But whatever you choose, be consistent. The worst outcome is channels that exist but nobody knows which one to watch.
Example Alert Messages That Get Action
The Anatomy of an Effective Alert
Good alerts have five elements: severity indicator, clear problem, relevant context, suggested action, and a direct link. Bad alerts have only one or two.
Bad alert (no context, no action):
PR #1234 needs review
Good alert (complete information):
🟡 P2: PR Awaiting Review > 24 Hours
📋 PR: Add user authentication flow (#1234)
👤 Author: @alice
⏱️ Waiting: 26 hours (opened Tuesday 2pm)
👥 Requested reviewers: @bob, @carol
📊 Context:
• PR size: 245 lines (+180, -65)
• Linked issue: AUTH-456 (high priority)
• CI status: ✅ All checks passing
🎯 Action needed: Review or reassign
🔗 https://github.com/org/repo/pull/1234
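If you assemble alerts yourself, posting a message in this shape takes only a few lines. A minimal sketch using a Slack incoming webhook follows; the webhook URL and the pr fields are placeholders.

```python
# Sketch: post a structured alert to Slack via an incoming webhook.
# The webhook URL and all pr fields are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_stuck_pr_alert(pr: dict) -> None:
    reviewers = ", ".join("@" + r for r in pr["reviewers"])
    text = (
        f"🟡 P2: PR Awaiting Review > 24 Hours\n"
        f"📋 PR: {pr['title']} (#{pr['number']})\n"
        f"👤 Author: @{pr['author']}\n"
        f"⏱️ Waiting: {pr['hours_waiting']} hours\n"
        f"👥 Requested reviewers: {reviewers}\n"
        f"🎯 Action needed: Review or reassign\n"
        f"🔗 {pr['url']}"
    )
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)
    response.raise_for_status()
```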
Templates for Common Alerts
═══════════════════════════════════════════════════════════════
TEMPLATE: Stuck PR Alert
═══════════════════════════════════════════════════════════════
🟡 P2: PR Awaiting Review > {threshold}
📋 PR: {title} (#{number})
👤 Author: @{author}
⏱️ Waiting: {hours_waiting} hours (opened {date_opened})
👥 Reviewers: {reviewers}
📊 Size: {additions}+ / {deletions}- ({total_changes} lines)
🎯 Action: Please review or ping for reassignment
🔗 {pr_url}
═══════════════════════════════════════════════════════════════
TEMPLATE: SLA Breach Alert
═══════════════════════════════════════════════════════════════
🔴 P1: Review SLA Breached
📋 PR: {title} (#{number})
👤 Author: @{author}
⏱️ SLA target: {sla_hours}h | Actual: {actual_hours}h
👥 Reviewers: {reviewers}
📊 Impact: {linked_issue_priority} priority work blocked
🎯 Escalation: {team_lead} - please assign a reviewer immediately
🔗 {pr_url}
═══════════════════════════════════════════════════════════════
TEMPLATE: Quality Risk Alert
═══════════════════════════════════════════════════════════════
🔴 P1: Potential Rubber-Stamp Review
📋 PR: {title} (#{number})
📊 Size: {total_lines} lines changed
⏱️ Review time: {review_minutes} minutes
👤 Reviewer: @{reviewer}
⚠️ Concern: Large PR approved very quickly
🎯 Action: @{team_lead} - verify review quality
🔗 {pr_url}
═══════════════════════════════════════════════════════════════
TEMPLATE: Process Violation Alert
═══════════════════════════════════════════════════════════════
🔴 P0: PR Merged Without Approval
📋 PR: {title} (#{number})
👤 Author: @{author}
🕐 Merged at: {merge_time}
⚠️ Approvals: 0
🎯 Immediate action required:
1. Was this an emergency? Document in PR
2. If not: revert and get proper approval
🔗 {pr_url}
Configuring Alert Thresholds
Start with Baselines
Before setting thresholds, understand your current performance. For guidance on what metrics to baseline, see our engineering metrics dashboard guide.
- Measure current metrics over 30-90 days
- Calculate averages and standard deviations
- Identify natural variation vs. problems
Example baseline analysis:

PR cycle time (last 90 days):
Mean: 1.8 days
Std dev: 0.6 days
90th percentile: 2.8 days

Threshold options:
Conservative: Alert at > 3 days (mean + 2 std dev)
Moderate: Alert at > 2.5 days (90th percentile)
Aggressive: Alert at > 2 days (above average)

Recommendation: Start conservative, tighten over time
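Deriving those threshold options from your own history takes only a few lines. A sketch with illustrative sample data:

```python
# Sketch: compute conservative/moderate/aggressive thresholds from
# historical cycle times (days). The sample data is illustrative.
import statistics

cycle_times_days = [1.2, 2.5, 1.8, 0.9, 3.1, 1.6, 2.2, 1.4, 2.9, 1.7]

mean = statistics.mean(cycle_times_days)
std_dev = statistics.stdev(cycle_times_days)
p90 = statistics.quantiles(cycle_times_days, n=10)[-1]  # 90th percentile

print(f"Conservative (mean + 2 std dev): {mean + 2 * std_dev:.1f} days")
print(f"Moderate (90th percentile):      {p90:.1f} days")
print(f"Aggressive (mean):               {mean:.1f} days")
```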
Threshold Setting Guidelines
For stuck PR alerts:
- Start with 24 hours for first review alert
- Adjust based on team norms and timezone distribution
- Consider business hours vs. calendar hours
For SLA alerts:
- Set threshold at the SLA target, not above it
- Consider warning alerts at 80% of SLA
- Example: 4-hour SLA → warn at 3.2 hours, alert at 4 hours
For trend alerts:
- Use percentage change, not absolute values
- 20-30% week-over-week change is usually significant
- Consider requiring consecutive weeks before alerting
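A sketch of that logic, combining percentage change with a two-consecutive-week requirement (the 20% threshold and the sample numbers are illustrative):

```python
# Sketch: fire a trend alert only after two consecutive weeks of
# week-over-week increases above the threshold.
def week_over_week_change(current: float, previous: float) -> float:
    return (current - previous) / previous  # e.g. 0.25 for a 25% increase

def should_alert(weekly_values: list[float], threshold: float = 0.20) -> bool:
    if len(weekly_values) < 3:
        return False
    last_two = [
        week_over_week_change(weekly_values[i], weekly_values[i - 1])
        for i in (len(weekly_values) - 2, len(weekly_values) - 1)
    ]
    return all(change > threshold for change in last_two)

# A sustained rise triggers; a single spike does not.
print(should_alert([1.8, 2.3, 2.9]))  # True  (+28%, then +26%)
print(should_alert([1.8, 2.3, 2.2]))  # False (+28%, then -4%)
```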
Iterating on Thresholds
Thresholds should evolve over time:
- Start conservative: Better to miss some alerts than drown in noise
- Track false positives: If most alerts aren't actionable, raise threshold
- Track false negatives: If issues slip through, lower threshold
- Tighten as you improve: As team gets faster, lower cycle time thresholds
Slack Workflow Builder Integrations
Automating Alert Response
Slack Workflow Builder can automate common responses to alerts, reducing manual toil and ensuring consistent handling.
Workflow 1: Alert Acknowledgment
Trigger: Emoji reaction 👀 on alert message
Actions:
1. Add thread reply: "@{user} is investigating"
2. Update message with: "⏳ Being handled by @{user}"
3. Set reminder for user: "Follow up on alert" in 2 hours
Result: Team knows who's handling it, owner has reminder
Workflow 2: Escalation Request
Trigger: Emoji reaction 🆘 on alert message
Actions:
1. Send DM to team lead: "Alert escalation requested"
2. Add thread reply: "Escalated to @{team_lead}"
3. Post to #eng-alerts-critical if P2 or lower
Result: One-click escalation without context loss
Workflow 3: False Positive Tracking
Trigger: Emoji reaction ❌ on alert message
Actions:
1. Add thread reply: "Marked as false positive by @{user}"
2. Log to spreadsheet: alert type, date, user, reason (collected via form)
3. If 5+ false positives this week: notify alert admin
Result: Data for threshold tuning, trend visibility
Workflow 4: Daily Digest Summary
Trigger: Scheduled, 9am Monday-Friday
Actions:
1. Collect unresolved alerts from past 24 hours
2. Group by priority and type
3. Post summary to #eng-metrics-digest:
"📊 Daily Alert Summary
🔴 P0/P1: {count} ({unresolved} unresolved)
🟡 P2: {count} ({unresolved} unresolved)
🟢 P3: {count}
Top issues: {list of oldest unresolved}"
Result: Morning awareness without notification fatigue
Escalation Patterns When Alerts Are Ignored
Time-Based Escalation
When alerts go unacknowledged, automatic escalation prevents issues from festering. Here's a proven pattern:
ESCALATION LADDER: Stuck PR
T+0h: PR opened
└─ No alert (normal state)
T+4h: First alert
├─ Recipients: Assigned reviewers only
├─ Channel: Thread in #eng-alerts
└─ Message: "PR awaiting first review"
T+8h: Second alert (if no activity)
├─ Recipients: Reviewers + PR author
├─ Channel: #eng-alerts (new message)
└─ Message: "Still awaiting review - needs attention"
T+24h: Team escalation
├─ Recipients: Add team channel
├─ Channel: #team-{name} + #eng-alerts
└─ Message: "PR stuck 24h+ - team please help unblock"
T+48h: Manager escalation
├─ Recipients: Add team lead DM
├─ Channel: Previous + DM to lead
└─ Message: "Requires intervention - SLA significantly breached"
T+72h: Skip-level escalation
├─ Recipients: Add engineering manager
├─ Channel: Previous + DM to EM
└─ Message: "Chronic blocker - process review needed"Our Take
Our Take
Escalation should feel uncomfortable—for everyone. If your escalation ladder gets used frequently, you have a process problem, not an alerting problem. The goal is for escalation to happen rarely because people respond to the initial alert.
Ownership-Based Escalation
Sometimes the right escalation is lateral (to a different owner), not vertical (to management):
LATERAL ESCALATION: Reviewer Unavailable
If assigned reviewer:
├─ Is OOO (calendar check)
├─ Has >3 pending reviews (overloaded)
└─ Hasn't responded in 8 hours
Then:
1. Find backup reviewer (CODEOWNERS or rotation)
2. Auto-assign backup
3. Notify original reviewer: "Reassigned to @backup due to {reason}"
4. Track for reviewer load balancing
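The reassignment check itself is simple to express. In the sketch below the inputs (OOO status, pending review count, hours since last response) are passed in as plain values; looking them up from a calendar, GitHub, or a CODEOWNERS rotation is left to your tooling, and the thresholds mirror the ones above.

```python
# Sketch: decide whether to laterally escalate to a backup reviewer.
# Inputs are plain values; sourcing them is left to your integrations.
from dataclasses import dataclass

@dataclass
class Reassignment:
    reassign: bool
    reason: str = ""

def check_reviewer(is_ooo: bool, pending_reviews: int,
                   hours_since_response: float) -> Reassignment:
    if is_ooo:
        return Reassignment(True, "reviewer is OOO")
    if pending_reviews > 3:
        return Reassignment(True, "reviewer has >3 pending reviews")
    if hours_since_response >= 8:
        return Reassignment(True, "no response in 8 hours")
    return Reassignment(False)

# Example: an overloaded reviewer triggers reassignment to a backup.
print(check_reviewer(is_ooo=False, pending_reviews=5, hours_since_response=2))
```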
The Quiet Hours Concept
Non-Urgent Alerts Don't Need 24/7 Delivery
Not every alert needs to interrupt dinner or wake someone up. Implementing quiet hours improves quality of life without sacrificing coverage for what truly matters.
QUIET HOURS CONFIGURATION

P0 - Critical:
└─ Always delivered immediately (no quiet hours)

P1 - High:
├─ Quiet hours: 10pm - 7am local time
└─ Held alerts delivered at 7am

P2 - Medium:
├─ Quiet hours: 8pm - 9am local time
├─ Weekends: Held until Monday 9am
└─ Held alerts delivered in morning digest

P3 - Low:
├─ Quiet hours: 6pm - 10am local time
├─ Weekends: No delivery
└─ Delivered in daily digest only

TIMEZONE HANDLING:
├─ Use each user's local timezone
├─ For team channels: Use team's primary timezone
└─ Cross-timezone teams: Deliver during overlap hours
"An engineer who sleeps well reviews code better than one who was woken up by a P3 alert about a non-urgent PR."
Implementing Quiet Hours in Slack
# Approach 1: Slack's Built-in DND
Encourage team members to set personal notification schedules:
Settings > Notifications > Notification schedule

# Approach 2: Alert Queue System
Your alerting tool holds non-critical alerts:
1. Check alert priority
2. Check recipient's timezone
3. If P2+ and within quiet hours: queue
4. Deliver queued alerts at quiet hours end

# Approach 3: Digest Channels
Route P2/P3 to digest-only channels:
- Real-time: #eng-alerts-critical (P0/P1 only)
- Digest: #eng-alerts-daily (P2/P3, posted at 9am)
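A sketch of the queueing decision at the heart of Approach 2, using the quiet-hour windows from the configuration above (weekend handling is omitted and timezone lookup is simplified to one zone per recipient):

```python
# Sketch: hold non-critical alerts during quiet hours. Windows mirror
# the configuration above; weekend rules are omitted for brevity.
from datetime import datetime, time
from typing import Optional
from zoneinfo import ZoneInfo

QUIET_HOURS = {           # priority -> (start, end), local time
    "P1": (time(22, 0), time(7, 0)),
    "P2": (time(20, 0), time(9, 0)),
    "P3": (time(18, 0), time(10, 0)),
}

def should_queue(priority: str, tz: str, now: Optional[datetime] = None) -> bool:
    """True if the alert should be held until quiet hours end."""
    if priority not in QUIET_HOURS:
        return False  # P0: always delivered immediately
    start, end = QUIET_HOURS[priority]
    local = (now or datetime.now(ZoneInfo(tz))).astimezone(ZoneInfo(tz)).time()
    return local >= start or local < end  # window spans midnight

# Example: a P2 alert at 22:30 Berlin time is queued until the morning.
late = datetime(2024, 6, 4, 22, 30, tzinfo=ZoneInfo("Europe/Berlin"))
print(should_queue("P2", "Europe/Berlin", late))  # True
```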
Measuring Alert Effectiveness
Meta-Metrics: Alerts About Your Alerts
Track these metrics to ensure your alerting system is healthy:
ALERT HEALTH DASHBOARD
Volume Metrics:
├── Alerts per day/week (by priority)
├── Trend: Is volume increasing or decreasing?
└── Target: <10 P1/P2 alerts per day for a 10-person team
Response Metrics:
├── Time to acknowledge (first reaction/reply)
├── Time to resolve (underlying issue fixed)
├── Acknowledgment rate (% alerts that get any response)
└── Targets:
P0: Acknowledge <15min, Resolve <1hr
P1: Acknowledge <1hr, Resolve <4hr
P2: Acknowledge <4hr, Resolve <24hr
Quality Metrics:
├── False positive rate (alerts that needed no action)
├── False negative rate (issues missed by alerts)
├── Action rate (alerts that resulted in meaningful action)
└── Targets:
False positive: <10%
False negative: <5%
Action rate: >80%
Engagement Metrics:
├── Click-through rate (% that clicked the link)
├── Thread participation (% with discussion)
├── Escalation rate (% that escalated)
└── Use for: Identifying poorly-formatted alerts
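Most of these meta-metrics fall out of a simple alert log. A sketch of the quality metrics, with an illustrative Alert record (a real log would carry more fields, such as timestamps for acknowledge and resolve times):

```python
# Sketch: compute quality metrics from an alert log. The Alert record
# and its fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Alert:
    priority: str
    acknowledged: bool
    action_taken: bool     # someone acted on it
    false_positive: bool   # e.g. marked with ❌ in Slack

def alert_health(alerts: list[Alert]) -> dict:
    total = len(alerts)
    if total == 0:
        return {}
    return {
        "acknowledgment_rate": sum(a.acknowledged for a in alerts) / total,
        "action_rate": sum(a.action_taken for a in alerts) / total,
        "false_positive_rate": sum(a.false_positive for a in alerts) / total,
    }

# Targets from above: action rate > 0.80, false positive rate < 0.10.
```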
Monthly Alert Review Process
Schedule a monthly review to tune your alerting system:
- Volume review: Are we drowning in alerts? Cut low-value ones.
- False positive review: Which alerts fire but don't need action? Raise thresholds.
- False negative review: What slipped through? Add missing alerts.
- Response time review: Are alerts being acknowledged quickly enough?
- Threshold review: Have our metrics improved? Tighten thresholds.
MONTHLY ALERT REVIEW TEMPLATE
Date: {date}
Reviewer: {name}
VOLUME SUMMARY:
- Total alerts this month: {count}
- By priority: P0={x}, P1={y}, P2={z}, P3={w}
- Trend vs last month: {+/-x%}
TOP 5 NOISIEST ALERTS:
1. {alert_name}: {count} times, {action_rate}% action rate
2. ...
RECOMMENDED CHANGES:
- [ ] Raise threshold on {alert} from X to Y
- [ ] Disable {alert} - consistently ignored
- [ ] Add alert for {gap} - missed issues
- [ ] Change routing for {alert} from #channel to #other
UNRESOLVED FROM LAST MONTH:
- {item} - status: {status}
Alert Anti-Patterns to Avoid
Anti-Pattern 1: Too Many Alerts
❌ Problem: Alert on every PR state change
"PR opened" "PR updated" "PR approved" "PR merged"
Result: 50+ alerts per day, all ignored

✅ Solution: Alert on exceptions only
"PR stuck" "SLA breach" "Merged without approval"
Result: 5-10 meaningful alerts per day
Anti-Pattern 2: Unclear Alerts
❌ Problem: Vague alert messages
"Metric threshold exceeded"
"PR needs attention"
Result: Recipients don't know what to do
✅ Solution: Specific, actionable alerts
"Cycle time hit 4.2 days (threshold: 3 days).
Top contributor: PR #1234 open 6 days.
Action: Review PR or raise threshold."
Result: Clear problem, clear action
Anti-Pattern 3: No Ownership
❌ Problem: Alerts go to large channels with no @mention
Posted to #engineering (200 people)
Result: "Someone else will handle it" → nobody handles it

✅ Solution: Clear routing and ownership
Post to #team-platform (15 people)
@mention: @platform-reviewers
Escalation: @alice (team lead)
Result: Named individuals feel responsible
Anti-Pattern 4: Alert and Forget
❌ Problem: Set up alerts, never review them
Thresholds from 2 years ago
Team has improved but alerts still fire constantly
Result: Learned helplessness, ignore all alerts

✅ Solution: Monthly alert hygiene
Review alert volume monthly
Tighten thresholds as team improves
Delete alerts that aren't actionable
Result: Alerts stay relevant and respected
Anti-Pattern 5: Duplicate Alerts
❌ Problem: Same issue triggers multiple alert types
"PR stuck" + "SLA warning" + "SLA breach" + "Cycle time high"
All for the same PR
Result: 4x noise, same information

✅ Solution: Deduplicate and consolidate
One alert per issue, with escalation built in
"PR stuck (24h)" → escalates to "SLA breach (48h)"
Result: One notification stream per problem
Our Take
The #1 predictor of alert system success isn't the tool you use—it's whether someone owns alert hygiene. Assign an "Alert DRI" (directly responsible individual) who reviews alert health monthly. Without ownership, entropy wins and alert fatigue becomes inevitable.
Building an Alert Playbook
Document how to respond to each alert type. This is especially valuable for on-call rotations and new team members.
═══════════════════════════════════════════════════════════════
ALERT PLAYBOOK
═══════════════════════════════════════════════════════════════

ALERT: PR awaiting review > 24 hours
PRIORITY: P2 (Medium)
RECIPIENTS: Assigned reviewers → team channel → team lead
EXPECTED VOLUME: 2-5 per day

Response steps:
1. Check if reviewers are available (not OOO)
2. If reviewers busy, reassign to available reviewer
3. If no reviewers available, escalate to team lead
4. If PR is urgent, ping in team channel

Expected resolution: Within 4 hours of alert
Escalation: If unresolved after 4 hours, notify manager

Common false positives:
- Draft PRs (exclude from alerting)
- PRs marked "WIP" (exclude from alerting)
- Weekends/holidays (adjust for business hours)

───────────────────────────────────────────────────────────────

ALERT: Merge without approval
PRIORITY: P0 (Critical)
RECIPIENTS: Author, team lead, engineering manager
EXPECTED VOLUME: < 1 per week

Response steps:
1. Verify if this was an emergency hotfix
2. If emergency: Document justification in PR comments
3. If not emergency: Discuss with author, consider revert
4. If pattern: Address in 1:1, review branch protection

Expected resolution: Within 2 hours of alert
Escalation: Multiple occurrences → process review in retro

Common false positives:
- Admin overrides for migrations (document in advance)
- Bot commits (filter by author)

───────────────────────────────────────────────────────────────

ALERT: Cycle time exceeded 3 days
PRIORITY: P2 (Medium)
RECIPIENTS: PR author, team lead
EXPECTED VOLUME: 5-10 per week

Response steps:
1. Identify blocking stage (waiting? review? CI?)
2. If waiting for review: trigger stuck PR workflow
3. If waiting for author: ping author
4. If CI issues: escalate to platform team

Expected resolution: PR merged within 24 hours of alert
Escalation: If PR exceeds 5 days, manager involvement

Common false positives:
- Large refactors (expected longer cycle)
- Vacation/holiday periods
- Dependencies on external teams
Getting Started This Week
Week 1: Audit and Plan
- List all current alerts (if any)
- For each, check its action rate; keep only alerts that are acted on more than 50% of the time
- Identify top 3 issues that slip through (need new alerts)
- Define your channel structure
Week 2: Implement Core Alerts
- Set up stuck PR alert (24-hour threshold)
- Set up merge without approval alert
- Set up weekly metrics digest
- Configure quiet hours for P2/P3
Week 3: Add Workflows
- Build acknowledgment workflow (👀 reaction)
- Build escalation workflow (🆘 reaction)
- Build false positive tracking (❌ reaction)
Week 4: Measure and Iterate
- Review alert volume and action rate
- Adjust thresholds based on data
- Document playbook for each alert type
- Schedule monthly alert review
"The best alerting system is one your team actually trusts. Build that trust by ensuring every alert is worth their attention."
Conclusion
Well-configured alerts turn your engineering metrics from passive data into active intelligence—surfacing issues when they matter and keeping your team informed without overwhelming them.
The key principles to remember:
- Quality over quantity: Fewer, better alerts beat more, ignored alerts
- Clear ownership: Every alert needs someone responsible for responding
- Continuous tuning: Alert systems need maintenance like any other system
- Context matters: Good alerts tell you what's wrong and what to do about it
- Respect attention: Your team's focus is precious—only interrupt when it matters
For related guidance, see our articles on implementing PR SLAs, reducing cycle time, and building effective metrics dashboards.