How to Measure Productivity in Software Engineering Teams
At UshaOm, I watched an engineer game our story point system for an entire quarter. He consistently took the highest-pointed tickets, broke them into pieces, and closed them fast. His "velocity" was double anyone else's. His actual output? About average. The system was measuring his skill at estimating, not his skill at engineering.
If you measure something, people optimize for it. This creates a dangerous trap in engineering: the metrics that are easiest to measure are almost always the wrong ones. Lines of code, commits per developer, hours logged, story points—these are all easily quantifiable and largely meaningless. And when engineers know they're being measured on them, they optimize ruthlessly: writing verbose code to boost LOC metrics, creating tiny commits to boost commit counts, working while sick to boost hours, inflating estimates to boost velocity.
The challenge is finding metrics that actually reflect productivity—that measure whether your team is creating value, shipping fast, and improving continuously—without creating surveillance culture where engineers feel watched and evaluated.
What NOT to Measure
Let's start with what doesn't work, because many organizations are still trying to optimize these metrics.
Lines of Code (LOC) or Files Changed: This is perhaps the most toxic metric in engineering. High LOC can indicate verbose, inefficient code or unnecessary changes. A developer who deletes 1,000 lines of redundant code has just done something valuable, but LOC metrics would rate this as negative contribution. It encourages code bloat and discourages refactoring.
Number of Commits: Trivial to game. Engineers can split a feature into 50 commits to look productive, or squash 50 commits into one. Commit count tells you nothing about value delivered.
Bugs Fixed: Incentivizes reactive work over prevention. Better to prevent bugs from existing in the first place, but "preventing bugs that never existed" doesn't show up in metrics. Also creates perverse incentives to create bugs you can then fix.
Hours Logged or Presence: Completely divorced from productivity. Some of the most productive developers work unconventional hours. Some people log 12 hours at their desk while writing 3 lines of code. Hours worked measures presence, not output.
Velocity or Story Points: Velocity is useful for planning sprints, but it's gameable and context-dependent. Teams learn that estimating generously improves velocity, so estimates become inflated. Comparing velocity across teams is meaningless. A team that estimates 50 story points and delivers 50 is not necessarily more productive than a team that estimates 40 and delivers 40. And measuring individual velocity creates silos and competition.
Code Coverage Percentage: A team with 100% code coverage can still ship bugs if they're testing the wrong things. A team with 60% coverage that's testing the critical paths might be more productive. Coverage metrics encourage writing tests that pass but don't catch real problems.
On-time Delivery of Estimated Tasks: Estimates are fiction. No one knows how long something will take. When you measure whether teams hit estimates, teams respond by padding estimates and shipping less ambitious features. You're not measuring productivity; you're measuring estimation accuracy.
The common thread: all these metrics are easy to game and don't correlate with value delivered to customers.
What TO Measure
The metrics that actually reflect productivity are harder to measure but worth the effort.
Cycle Time (Lead Time for Changes)
This is the time from when work is started (or requested) until it's deployed to production and customers can use it.
Why it matters: Cycle time is the most comprehensive measure of how efficiently your entire organization converts ideas into value. It includes planning, development, review, testing, and deployment. Short cycle time correlates strongly with business outcomes: faster iteration, faster learning, faster response to market changes.
How to measure it: Use commit timestamps and deployment timestamps. Most CI/CD platforms track this automatically. The metric is: time from first commit until that change reaches production.
The useful breakdowns:
- Code development time (commit to PR open)
- Review time (PR open to approval)
- Deployment time (approval to production)
This granularity shows you where your bottlenecks are. If review time is 3 days, that's your problem. If deployment takes 2 hours, that's your problem.
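As a minimal sketch of the breakdown above, here's how you might compute the stages from a single change's event timestamps. The field names (`first_commit`, `pr_opened`, and so on) are illustrative, not a real API; in practice they'd come from your source-control or CI/CD platform's event history.

```python
from datetime import datetime

# Hypothetical event timestamps for one change; in practice these come
# from your source-control / CI/CD API.
pr = {
    "first_commit": datetime(2024, 5, 1, 9, 0),
    "pr_opened":    datetime(2024, 5, 1, 15, 0),
    "approved":     datetime(2024, 5, 3, 11, 0),
    "deployed":     datetime(2024, 5, 3, 13, 0),
}

def hours(a, b):
    """Elapsed hours between two timestamps."""
    return (b - a).total_seconds() / 3600

breakdown = {
    "development_h": hours(pr["first_commit"], pr["pr_opened"]),
    "review_h":      hours(pr["pr_opened"], pr["approved"]),
    "deployment_h":  hours(pr["approved"], pr["deployed"]),
}
breakdown["total_cycle_h"] = sum(breakdown.values())

for stage, h in breakdown.items():
    print(f"{stage}: {h:.1f}h")
```

With these example timestamps, review time (44 hours) dominates the 52-hour cycle, which is exactly the kind of bottleneck the breakdown is meant to surface.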
Target: Industry leaders have cycle times under 1 hour for typical changes. 80th percentile is around 8-24 hours.
Deployment Frequency
How often can your organization confidently ship code to production?
Why it matters: Deployment frequency correlates with quality, learning speed, and revenue growth. Organizations that deploy frequently also tend to have fewer incidents and faster incident recovery. This seems counterintuitive but it's consistent: frequent deployments force you to make the process safe, reliable, and fast.
How to measure it: Count successful deployments per day/week/month. The metric is: number of times code reaches production.
The nuance: Don't count failed deployments. A failed deployment that gets rolled back isn't progress.
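A small sketch of counting only successful deployments, grouped by ISO week. The deployment log entries here are made-up illustrative data; in practice you'd pull them from your CI/CD platform's deployment history.

```python
from collections import Counter
from datetime import date

# Hypothetical deployment log: (date, outcome).
deployments = [
    (date(2024, 5, 6), "success"),
    (date(2024, 5, 6), "rolled_back"),  # excluded: a rollback isn't progress
    (date(2024, 5, 7), "success"),
    (date(2024, 5, 7), "success"),
    (date(2024, 5, 8), "success"),
]

# Count only successful deployments, keyed by (ISO year, ISO week).
per_week = Counter(
    d.isocalendar()[:2]
    for d, outcome in deployments
    if outcome == "success"
)
print(per_week)
```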
Target: Industry leaders deploy multiple times per day. Healthy organizations are at 1-5 deployments per day. Organizations still in "big bang quarterly releases" are at 1-4 per quarter.
Change Failure Rate
Of all the deployments you make, what percentage cause incidents or require rollbacks?
Why it matters: This is your actual quality metric. It measures whether the code you're shipping works. Not "whether you think it works based on tests," but whether customers experience problems.
How to measure it: (Number of deployments that cause incidents) / (total deployments). This requires incident tracking: when something breaks in production, you need to correlate it back to which deployment caused it.
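The formula is simple enough to express directly; the numbers below are illustrative, not a benchmark.

```python
def change_failure_rate(incident_deploys: int, total_deploys: int) -> float:
    """Fraction of deployments that caused a production incident."""
    if total_deploys == 0:
        return 0.0
    return incident_deploys / total_deploys

# Illustrative: 6 of 50 deployments this month triggered incidents.
rate = change_failure_rate(6, 50)
print(f"Change failure rate: {rate:.0%}")
```

The hard part isn't the arithmetic, it's the numerator: you need incident tracking that reliably attributes each incident to the deployment that caused it.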
Target: Industry leaders are under 15%. 50th percentile is around 30-50%.
Mean Time to Recovery (MTTR)
When something breaks in production, how long does it take to fix?
Why it matters: You can't prevent all failures. But you can get really good at fixing them fast. Teams with excellent MTTR often end up with better actual uptime than teams trying to prevent every failure, because exhaustive prevention slows delivery and still misses problems.
How to measure it: From incident detection to full resolution. Most incident tracking systems (PagerDuty, Opsgenie, etc.) track this automatically.
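If your incident tracker doesn't compute this for you, the calculation from detection and resolution timestamps is straightforward. The incident records here are hypothetical; in practice they'd come from your tracker's export or API.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 2, 10, 25)),
    (datetime(2024, 5, 9, 14, 0), datetime(2024, 5, 9, 16, 0)),
    (datetime(2024, 5, 20, 3, 0), datetime(2024, 5, 20, 3, 35)),
]

recovery_minutes = [
    (resolved - detected).total_seconds() / 60
    for detected, resolved in incidents
]
mttr = mean(recovery_minutes)
print(f"MTTR: {mttr:.0f} minutes")
```

One caution: a few very long incidents can skew the mean, so it's worth looking at the median and the worst cases alongside MTTR.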
Target: Industry leaders are under 30 minutes. Average is 1-4 hours.
Flow Efficiency (or Flow Health)
This is a deeper metric that measures what percentage of time your team is actively moving work forward versus being blocked or waiting.
Why it matters: Identifies systemic bottlenecks. If work sits waiting for review 60% of the time, that's a bottleneck worth addressing. If work is blocked on dependencies 30% of the time, that's architectural friction.
How to measure it: Use project management or issue tracking systems to measure:
- Active work time (when someone is actively developing)
- Waiting time (PR waiting for review, deployment waiting for approval, etc.)
- Blocked time (waiting on dependencies, information, or decisions)
Formula: Active time / Total cycle time = Flow efficiency ratio.
Target: 40% flow efficiency is below average. 60%+ is good. 70%+ is excellent.
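The formula above can be sketched directly; the hours used here are illustrative, not real measurements.

```python
def flow_efficiency(active_hours: float, waiting_hours: float,
                    blocked_hours: float) -> float:
    """Active time divided by total cycle time."""
    total = active_hours + waiting_hours + blocked_hours
    return active_hours / total if total else 0.0

# Illustrative: 12h of active work, 16h waiting on review, 4h blocked.
eff = flow_efficiency(12, 16, 4)
print(f"Flow efficiency: {eff:.0%}")
```

In this example most of the cycle is spent waiting for review, so that queue, not developer speed, is the thing to fix.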
Developer Satisfaction and Retention
Ultimately, if engineers are miserable, productivity is already declining.
Why it matters: The best engineers leave unhappy organizations. Retention of senior engineers directly correlates with team stability and knowledge preservation. Organizations with high turnover end up onboarding constantly instead of building momentum.
How to measure it: Annual surveys asking engineers to rate:
- Overall satisfaction (1-10)
- Clarity of role and goals
- Psychological safety
- Ability to focus and do deep work
- Growth opportunities
- Trust in leadership
Why this matters more than you think: Teams with high developer satisfaction also have shorter cycle times, fewer bugs, and higher deployment frequency. When engineers are engaged, they care about doing good work. When they're disengaged, they optimize for political safety and tenure.
Building a Measurement Program Without Creating Surveillance
The biggest risk with measurement is that it becomes surveillance. When engineers feel watched and judged, several things happen:
- They optimize for metrics instead of value
- They hide problems instead of escalating them
- The best engineers leave
- Collaboration decreases (people work in silos to own their metrics)
To avoid this:
Share the metrics publicly: Don't measure individual engineers on these metrics. Measure teams and the organization. When metrics are team-level, they become shared problems rather than individual indictments.
Focus on systems, not people: When cycle time is long, don't blame engineers for being slow. Ask: what's blocking work? What process is creating bottlenecks? How is the system structured? Engineers almost never want to be slow. Usually, the system makes slowness inevitable.
Use metrics for questions, not judgment: Metrics should answer the question "what should we improve next?" not "who's underperforming?" If deployment is rare, the question is "why is deployment risky?" not "why aren't people deploying?"
Include both leading and lagging indicators: Lagging indicators (cycle time, quality) tell you what happened. Leading indicators (code review queue size, deployment readiness, incident severity) tell you what's about to happen. Leading indicators let you prevent problems rather than measure them after they occur.
Measure the system, not the individuals: Cycle time, deployment frequency, and quality are system metrics. They're affected by architecture, processes, tools, and culture—not by how hard individuals are working.
Common Anti-Patterns to Avoid
Individual developer velocity metrics: Creates silos and competition. Good systems require collaboration, and collaboration isn't rewarded by individual metrics.
Comparing teams on velocity or productivity: Different teams work on different things. A team maintaining legacy systems might have lower velocity than a team building new features. Comparison is meaningless and demoralizing.
Measuring on-call engineers on the same metrics as feature developers: On-call rotation is essential work. Engineers handling incidents are less productive on features but are delivering essential value. Measure separately.
Trusting one tool's metrics without reconciling them against another's: If GitHub says you deployed 50 times but your monitoring says production was updated 45 times, the gap is important. Metrics should trace back to actual business outcomes.
Setting targets for metrics and then measuring everyone against those targets: This is how you get Goodhart's Law (metrics become targets and stop being useful measures). Instead, measure trends. Is it improving? That matters more than absolute numbers.
Measuring without a plan to act on the data: If you're not going to use these metrics to make decisions, don't collect them. Measurement without action just creates reporting overhead.
Implementing Measurement
Start simple:
- Pick three metrics: Cycle time, deployment frequency, and change failure rate. These three together paint a complete picture.
- Get visibility: Use your CI/CD platform and incident tracking system to collect these automatically. Manual collection is error-prone and expensive.
- Track trends over time: Absolute numbers matter less than whether things are improving. Is cycle time trending down? Is deployment frequency trending up? Is quality improving?
- Share publicly: Every week, post these metrics somewhere visible (a Slack channel, wiki, or dashboard). No secrecy.
- Ask questions, don't judge: When a metric moves negatively, ask "why?" with genuine curiosity, not blame.
- Make improvements: Use the metrics to drive process changes. If cycle time is long, investigate where time is being lost. If quality is poor, strengthen your testing or review process.
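A minimal sketch of the "track trends over time" step: given weekly median cycle times (illustrative data), compare recent weeks against earlier ones instead of judging a single absolute number. This assumes a lower-is-better metric.

```python
# Illustrative weekly median cycle times, in hours (lower is better).
weekly_cycle_time_h = [30, 26, 27, 22, 19, 18]

def trend(series):
    """Compare the mean of the last 3 points against the previous 3."""
    recent = sum(series[-3:]) / 3
    earlier = sum(series[-6:-3]) / 3
    return "improving" if recent < earlier else "flat or worsening"

print(trend(weekly_cycle_time_h))
```

Averaging over a few weeks smooths out one-off spikes, which keeps the weekly post from turning into noise-chasing.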
The organizations with the highest productivity don't have the most complex measurement systems. They have simple, clear metrics that everyone understands and agrees matter. They measure outcomes, not activity. And they use measurement to continuously improve the system, not to evaluate people.
That's how you get measurement that drives real productivity without creating a culture of surveillance.
Related Reading
- Programmer Productivity: Why Measuring Output Is the Wrong Question
- Developer Productivity: Stop Measuring Output, Start Measuring Impact
- Code Productivity: Why Your Best Engineers Aren't Your Most Productive
- Coding Metrics That Actually Matter
- DORA Metrics: The Complete Guide for Engineering Leaders
- Sprint Velocity: The Misunderstood Metric