For platform engineers and staff developers, the idea of treating engineering analytics like infrastructure—version-controlled, automated, and programmatic—is appealing. This guide explores the "analytics as code" approach: building your own metrics pipeline from GitHub data, when it makes sense, and when you should buy instead of build.
We'll cover the technical approach, including code examples for the GitHub API, along with the hidden complexity that makes DIY analytics harder than it first appears.
What Is Analytics as Code?
The Philosophy
"Analytics as code" applies infrastructure-as-code principles to engineering metrics:
- Version controlled: Metric definitions live in Git, not UI configs
- Automated: Data collection runs in pipelines, not manual exports
- Reproducible: Anyone can understand how metrics are calculated
- Extensible: Adding new metrics follows standard development workflow
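In practice, a metric definition committed to the repository might look something like the following hypothetical sketch; the dataclass and field names are illustrative, not a standard:

# metrics/cycle_time.py -- hypothetical; lives in Git and is reviewed like any other change
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str
    unit: str
    source: str  # where the raw data comes from

PR_CYCLE_TIME = MetricDefinition(
    name="pr_cycle_time",
    description="Median hours from PR creation to merge, per team per week",
    unit="hours",
    source="GitHub pull request createdAt/mergedAt",
)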
The Appeal for Technical Teams
For staff engineers and platform teams, building your own analytics has appeal:
Why engineers want to build:

Control:
- "We can measure exactly what we care about"
- "No vendor lock-in"
- "Customize to our specific workflow"

Integration:
- "Embed metrics in our developer portal"
- "Connect to our existing data warehouse"
- "Use our own visualization tools"

Cost:
- "GitHub API is free"
- "We have the skills in-house"
- "Avoid per-seat SaaS pricing"

Learning:
- "Our team will understand it deeply"
- "Good project for platform engineers"
The DIY Approach: GitHub API + Your Pipeline
GitHub API Basics
GitHub provides both REST and GraphQL APIs for accessing repository data. For analytics, GraphQL is generally more efficient because you can fetch exactly the fields you need in fewer requests.
Authentication
# Personal Access Token (for development/small scale)
curl -H "Authorization: token YOUR_PAT" \
  https://api.github.com/repos/owner/repo/pulls

# GitHub App (for production/organization-wide)
# 1. Create GitHub App
# 2. Install on organization
# 3. Generate installation access token
# 4. Use token for API calls (expires after 1 hour)
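Steps 3 and 4 are where most of the friction is. A minimal sketch of minting an installation token with the PyJWT library; the app ID, installation ID, and private key path are placeholders you would supply:

import time
import requests
import jwt  # PyJWT

def get_installation_token(app_id, installation_id, private_key_path):
    """Exchange a GitHub App JWT for a one-hour installation access token."""
    with open(private_key_path) as f:
        private_key = f.read()
    now = int(time.time())
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": app_id},  # JWT valid for under 10 minutes
        private_key,
        algorithm="RS256",
    )
    response = requests.post(
        f"https://api.github.com/app/installations/{installation_id}/access_tokens",
        headers={
            "Authorization": f"Bearer {app_jwt}",
            "Accept": "application/vnd.github+json",
        },
    )
    response.raise_for_status()
    return response.json()["token"]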
Fetching Pull Request Data (REST)
import requests
from datetime import datetime, timedelta

def get_merged_prs(owner, repo, token, since_days=30):
    """Fetch PRs merged in the last N days."""
    headers = {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github.v3+json"
    }
    since = datetime.now() - timedelta(days=since_days)
    prs = []
    page = 1
    while True:
        response = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/pulls",
            headers=headers,
            params={
                "state": "closed",
                "sort": "updated",
                "direction": "desc",
                "per_page": 100,
                "page": page
            }
        )
        response.raise_for_status()
        page_prs = response.json()
        if not page_prs:
            break
        for pr in page_prs:
            if pr.get("merged_at"):
                merged_at = datetime.fromisoformat(
                    pr["merged_at"].replace("Z", "+00:00")
                )
                if merged_at >= since.replace(tzinfo=merged_at.tzinfo):
                    prs.append(pr)
                else:
                    # PRs are sorted by updated time, so older merges
                    # can still appear on later pages; keep paging
                    pass
        page += 1
        # Basic rate limit check
        remaining = int(response.headers.get("X-RateLimit-Remaining", 0))
        if remaining < 10:
            break  # Stop before hitting the limit
    return prs

Fetching with GraphQL (More Efficient)
import requests

QUERY = """
query($owner: String!, $repo: String!, $cursor: String) {
  repository(owner: $owner, name: $repo) {
    pullRequests(
      first: 100
      states: [MERGED]
      orderBy: {field: UPDATED_AT, direction: DESC}
      after: $cursor
    ) {
      pageInfo {
        hasNextPage
        endCursor
      }
      nodes {
        number
        title
        createdAt
        mergedAt
        additions
        deletions
        changedFiles
        author {
          login
        }
        reviews(first: 10) {
          nodes {
            author { login }
            state
            submittedAt
          }
        }
      }
    }
  }
}
"""

def fetch_prs_graphql(owner, repo, token):
    """Fetch PRs with review data in a single query."""
    headers = {
        "Authorization": f"bearer {token}",
        "Content-Type": "application/json"
    }
    all_prs = []
    cursor = None
    while True:
        response = requests.post(
            "https://api.github.com/graphql",
            headers=headers,
            json={
                "query": QUERY,
                "variables": {
                    "owner": owner,
                    "repo": repo,
                    "cursor": cursor
                }
            }
        )
        data = response.json()
        if "errors" in data:
            raise Exception(f"GraphQL errors: {data['errors']}")
        prs = data["data"]["repository"]["pullRequests"]
        all_prs.extend(prs["nodes"])
        if not prs["pageInfo"]["hasNextPage"]:
            break
        cursor = prs["pageInfo"]["endCursor"]
    return all_prs

Calculating Cycle Time
from datetime import datetime
from statistics import median

def calculate_cycle_time(prs):
    """Calculate median cycle time from PR data."""
    cycle_times = []
    for pr in prs:
        created = datetime.fromisoformat(
            pr["createdAt"].replace("Z", "+00:00")
        )
        merged = datetime.fromisoformat(
            pr["mergedAt"].replace("Z", "+00:00")
        )
        # Cycle time in hours
        hours = (merged - created).total_seconds() / 3600
        cycle_times.append(hours)
    if not cycle_times:
        return None
    return {
        "median_hours": median(cycle_times),
        "min_hours": min(cycle_times),
        "max_hours": max(cycle_times),
        "count": len(cycle_times)
    }

# Usage
prs = fetch_prs_graphql("myorg", "myrepo", token)
metrics = calculate_cycle_time(prs)
print(f"Median cycle time: {metrics['median_hours']:.1f} hours")

Hidden Complexity: Rate Limits, Data Quality, Maintenance
Rate Limit Challenges
GitHub's API has strict rate limits that complicate production use:
GitHub API Rate Limits:

REST API:
- 5,000 requests/hour (authenticated)
- Pagination required for large result sets
- Each page = 1 request

GraphQL API:
- Point-based system (not a simple request count)
- Complex queries cost more points
- 5,000 points/hour

Real-world impact:
- 50 repos × 100 PRs each = 5,000+ requests
- Fetching reviews, comments, and files adds more requests
- Historical backfill can take days
- Incremental sync is required for production
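For GraphQL, the point budget can be inspected from inside a query via the rateLimit object, which complements the REST header checks shown in the next section. A minimal sketch:

import requests

RATE_LIMIT_QUERY = """
{
  rateLimit {
    limit
    cost
    remaining
    resetAt
  }
}
"""

def check_graphql_budget(token):
    """Report remaining GraphQL points and when the hourly window resets."""
    response = requests.post(
        "https://api.github.com/graphql",
        headers={"Authorization": f"bearer {token}"},
        json={"query": RATE_LIMIT_QUERY},
    )
    response.raise_for_status()
    info = response.json()["data"]["rateLimit"]
    print(f"{info['remaining']}/{info['limit']} points left, resets at {info['resetAt']}")
    return info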
Rate Limit Handling Code
import time

def handle_rate_limit(response):
    """Check rate limits and wait if needed."""
    remaining = int(response.headers.get("X-RateLimit-Remaining", 0))
    reset_time = int(response.headers.get("X-RateLimit-Reset", 0))
    if remaining < 100:
        wait_seconds = max(reset_time - time.time(), 0) + 10
        print(f"Rate limit low. Waiting {wait_seconds}s...")
        time.sleep(wait_seconds)
    if response.status_code == 403:
        # Secondary rate limit hit
        retry_after = int(response.headers.get("Retry-After", 60))
        print(f"Secondary rate limit. Waiting {retry_after}s...")
        time.sleep(retry_after)

Data Quality Issues
Raw GitHub data requires significant cleaning:
- Bot commits: Dependabot, Renovate, GitHub Actions pollute metrics
- Force pushes: Can lose commit history
- Squash merges: Hide individual commit timestamps
- Timezones: All times are UTC, working hours vary
- Deleted branches: Reference data may be missing
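Bot filtering (shown just below) is the most common cleanup. Timezone handling is another: GitHub timestamps are UTC, so anything touching working hours needs an explicit conversion. A minimal sketch using the standard library, with the team timezone as an assumed input:

from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def to_team_time(iso_utc, tz_name="America/Los_Angeles"):  # tz_name is an assumed input
    """Convert a GitHub UTC timestamp to the team's local time."""
    utc_dt = datetime.fromisoformat(iso_utc.replace("Z", "+00:00"))
    return utc_dt.astimezone(ZoneInfo(tz_name))

merged_local = to_team_time("2024-03-15T02:30:00Z")
print(merged_local.isoformat())  # team-local wall-clock time rather than UTC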
# Bot detection
BOT_PATTERNS = [
    "[bot]",
    "dependabot",
    "renovate",
    "github-actions",
    "greenkeeper",
]

def is_bot(username):
    """Detect bot accounts."""
    if not username:
        return True
    username_lower = username.lower()
    return any(pattern in username_lower for pattern in BOT_PATTERNS)

# Filter PRs (author can be null for deleted accounts)
human_prs = [pr for pr in prs if not is_bot((pr.get("author") or {}).get("login"))]

Maintenance Burden
Building is just the start. Ongoing maintenance includes:
- API changes and deprecations
- Schema updates when GitHub adds fields
- Handling new edge cases as repos evolve
- Performance optimization as data grows
- Dashboard maintenance and feature requests
- On-call for data pipeline failures
When to Build vs Buy
Build When:
- Unique requirements: You need metrics no vendor provides
- Deep integration: Metrics must embed in custom internal tools
- Learning opportunity: Building is a valuable platform team project
- Small scale: Just a few repos, simple metrics
- Air-gapped: No ability to use SaaS products
Buy When:
- Time-to-value: You need insights now, not in 6 months
- Scale: Many repos, teams, or organizations
- Standard metrics: DORA, cycle time, review coverage
- Limited platform capacity: Team should focus on core product
- Enterprise features: SSO, audit logs, support SLAs
Build vs Buy Calculation:

DIY costs:
- Initial build: 2-3 engineers × 2-3 months = 4-9 eng-months
- Ongoing maintenance: 0.25-0.5 engineer, ongoing
- Opportunity cost: what else could that time produce?

SaaS costs:
- $X per seat per month
- Implementation: days, not months
- Maintenance: vendor handles it

Break-even analysis:
- Engineering cost: $150K/year fully loaded
- 9 eng-months = ~$112K initial investment
- 0.5 engineer ongoing = $75K/year
- vs. SaaS at $50/seat/month × 50 engineers = $30K/year
- DIY break-even: 3+ years (if no maintenance growth)
- Reality: DIY usually costs more than expected
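As a sanity check, the comparison is easy to re-run with your own inputs; the defaults below are just the illustrative figures from this section:

def cumulative_cost(years, eng_cost_per_year=150_000, build_eng_months=9,
                    maintain_fraction=0.5, seats=50, seat_price_per_month=50):
    """Cumulative DIY vs SaaS spend after a given number of years (illustrative figures)."""
    diy = (eng_cost_per_year * build_eng_months / 12          # initial build
           + eng_cost_per_year * maintain_fraction * years)   # ongoing maintenance
    saas = seats * seat_price_per_month * 12 * years
    return diy, saas

for years in (1, 2, 3, 5):
    diy, saas = cumulative_cost(years)
    print(f"Year {years}: DIY ~${diy:,.0f} vs SaaS ~${saas:,.0f}")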
Using APIs for Custom Dashboards
Hybrid Approach
You don't have to choose all-or-nothing. Many teams use a platform for data collection and core metrics, then build custom dashboards on top:
- Use platform API: Pull aggregated metrics programmatically
- Embed in portal: Display in your internal developer portal
- Custom visualizations: Build team-specific views
- Integrate with other data: Join with incident, OKR, or product data
# Example: Embed metrics in internal portal
# Fetch from platform API (hypothetical)
def get_team_metrics(team_id, period="weekly"):
    response = requests.get(
        f"https://api.yourplatform.com/v1/teams/{team_id}/metrics",
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"period": period}
    )
    return response.json()

# Display in your portal
metrics = get_team_metrics("platform-team")
print(f"""
Team Dashboard:
Cycle Time: {metrics['cycle_time_hours']:.1f} hours
Deployment Frequency: {metrics['deploy_frequency_per_week']:.1f}/week
Change Failure Rate: {metrics['cfr_percent']:.1f}%
""")

Embedding Metrics in Developer Portals
Backstage Integration Example
If you use Backstage or a similar developer portal, you can embed engineering metrics as a plugin or widget:
// Backstage plugin component (simplified)
import React, { useEffect, useState } from 'react';

export const TeamMetricsCard = ({ teamId }) => {
  const [metrics, setMetrics] = useState(null);

  useEffect(() => {
    fetch(`/api/metrics/teams/${teamId}`)
      .then(res => res.json())
      .then(setMetrics);
  }, [teamId]);

  if (!metrics) return <Loading />;

  return (
    <Card>
      <CardHeader title="Engineering Metrics" />
      <CardContent>
        <MetricRow
          label="Cycle Time"
          value={`${metrics.cycleTimeHours.toFixed(1)}h`}
        />
        <MetricRow
          label="Deploy Frequency"
          value={`${metrics.deployFrequency}/week`}
        />
        <MetricRow
          label="PR Throughput"
          value={`${metrics.prsPerWeek} PRs/week`}
        />
      </CardContent>
    </Card>
  );
};

📊 CodePulse API Access
CodePulse handles the complexity of GitHub data collection while providing API access for custom integrations:
- Pre-calculated DORA metrics and cycle time data
- Bot filtering and data quality built in
- Export to CSV for custom analysis
- Export guide - Get your data in the format you need
💡 Start Simple
If you're drawn to the analytics-as-code approach, start with a simple proof of concept: one metric, one repo, one week of data. You'll quickly discover the edge cases and maintenance burden. Then decide if the investment makes sense for your team.