Engineering Bottleneck Detection: Finding Constraints Before They Kill Velocity
At Salesken, we once spent three sprints optimizing our ML model training pipeline — shaving minutes off each training run, parallelizing data preprocessing, upgrading GPU instances. Then I mapped the full delivery flow and realized the bottleneck was code review. PRs sat for two days on average. We'd optimized the wrong thing because we hadn't looked at the whole system.
A bottleneck in software development is like the neck of a bottle: no matter how much liquid you pour, the flow rate is limited by the narrowest point. You can optimize everything upstream and downstream, but until you widen that narrowest point, throughput doesn't improve.
The same principle applies to engineering organizations. You can have brilliant architects, fast developers, and clean code. But if code review is slow, deployments are gated, or incident response is chaotic, those become bottlenecks that constrain the entire organization's velocity.
The challenge is identifying bottlenecks before they become acute problems. Most organizations only notice bottlenecks when they're already strangling the system—pull request queues are three weeks long, deployments happen quarterly, incident response is 48 hours. By then, damage has been done.
The organizations with the highest velocity spot bottlenecks early using systematic detection methods. This article shows how.
Common Bottleneck Patterns
Before diving into detection methods, let's identify the typical bottlenecks that slow engineering teams.
Code Review Bottleneck
Symptoms: Pull requests sit 2-5 days waiting for review. Authors context-switch to other work while waiting.
Root causes:
- Reviewers are overloaded (too much other work)
- PRs are too large (takes too long to review)
- Review expectations aren't clear (reviewers over-scrutinize)
- Certain people are always needed for review (knowledge concentration)
Impact: Cycle time explodes. A 2-day coding task becomes 5-7 days when stuck in review queues. At scale, a team of 50 might have 200+ PRs in flight, each waiting.
CI/CD Pipeline Bottleneck
Symptoms: Builds take 45+ minutes. Deployments are infrequent. When deployment does happen, something breaks because the gap between development and deployment is so long.
Root causes:
- Tests run serially instead of in parallel
- Low-value tests (running exhaustive suites on every commit for things that rarely break)
- Infrastructure limitations (builds are I/O bound)
- Approval gates in the pipeline
Impact: Developers can't ship often. Feedback loops are slow. Bugs take longer to reach production. Risk accumulates.
Deployment Gate Bottleneck
Symptoms: Code is ready to ship but can't deploy because it requires:
- Manual approval from a busy person
- Waiting for a change control board that meets once per week
- Waiting for a maintenance window
- Waiting for a "deployment day"
Root causes:
- Fear of deploying (previous bad experiences)
- Compliance requirements that mandate approval
- Organizational policy that requires governance
- Lack of rollback capability
Impact: Code sits ready-to-ship for days or weeks. Business requests get delayed. Risk accumulates in waiting code.
Knowledge Concentration Bottleneck
Symptoms: Certain engineers are always needed to approve code, make decisions, or handle incidents.
Root causes:
- Knowledge lives in few people's heads
- Code ownership isn't distributed
- Mentoring isn't systematized
- Architecture decisions aren't documented
Impact: These people become the organization's scalability limit. They're always firefighting, and the organization can't grow beyond what these few people can manage.
On-Call Bottleneck
Symptoms: One person or a small group is constantly on-call. Incidents pull them from planned work constantly.
Root causes:
- Systems are fragile (too many incidents)
- On-call rotation is too narrow
- Incident response isn't systematized
- No runbooks for common incidents
Impact: On-call people burn out. Planned work doesn't happen because they're always handling fires. Quality of incident response degrades from exhaustion.
Incident Response Bottleneck
Symptoms: When production breaks, it takes 4+ hours to fix. Multiple people investigate the same problem. Communication is chaotic.
Root causes:
- No runbooks for common incidents
- Slow log/metric access
- Communication isn't structured
- No clear incident commander
- Root cause analysis is poor
Impact: Every incident bleeds time and attention. MTTR is high. Customer impact is prolonged.
Dependency Bottleneck
Symptoms: Team A's work is blocked waiting for Team B. Team B's work is blocked waiting on infrastructure provisioning.
Root causes:
- System design has tight coupling
- Shared resources aren't provisioned efficiently
- Communication between teams is slow
- Architectural decisions create unavoidable dependencies
Impact: Parallel work isn't possible. Critical path elongates. Velocity becomes unpredictable because external blockers aren't controllable.
How to Detect Bottlenecks: Three Methods
Method 1: Statistical Analysis of Cycle Time
The simplest bottleneck detection method: analyze where time is spent in your cycle.
How to do it:
Track the time in each stage:
- Code development (from start to PR open): average 1-2 days
- Code review (PR open to approval): average ? days
- Deployment (approval to production): average ? days
Calculate percentiles. Where are the outliers?
If code review takes 2 days on average but the 95th percentile is 8 days, you have a code review bottleneck. When PRs get stuck, they get stuck for a long time.
What to measure:
- PR review turnaround (P50, P95)
- Time from approval to deployment (P50, P95)
- Number of PRs waiting for review at any time
- PR age (how long since opened)
Red flags:
- P95 review time > 24 hours
- Consistently multiple PRs waiting > 1 day
- P95 deployment time > 1 hour
- More than 10% of PRs are in review queue at any time
This method is simple and requires only git history + CI/CD logs.
Method 2: Trend Monitoring and Constraint Theory
Goldratt's Theory of Constraints observes that work piles up in front of a system's constraint, so the resource with the longest queue is usually your bottleneck.
How to apply this:
- Track queue sizes in each stage
- The stage with the longest queue is your bottleneck
What to monitor:
- PRs waiting for review: Queue size growing? This is a bottleneck.
- Work waiting for deployment: Growing queue? Deployment is a bottleneck.
- Incidents waiting for resolution: Queue size > team size? Incident response is a bottleneck.
- Blocked work waiting on dependencies: Growing over time? Dependencies are a bottleneck.
How to detect: Weekly, calculate:
- Average queue size in each stage
- Trend (is it growing, stable, shrinking?)
- P95 queue wait time
If queue size is growing over time, it indicates a constraint forming. This is early warning.
Example: Code review queue is 5 PRs on average, growing to 15 PRs. This trend indicates a bottleneck forming. You can fix it now (add reviewers, reduce PR size, improve tools) before it becomes acute.
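The trend check can be as simple as a least-squares slope over weekly snapshots. A sketch with invented queue sizes and an assumed alert threshold:

```python
# Hypothetical weekly snapshots of the review queue (PRs awaiting review),
# oldest first. In practice, record these weekly from your git hosting API.
weekly_queue_sizes = [5, 6, 8, 11, 15]

def trend_slope(samples):
    """Least-squares slope: how much the queue grows per sample interval."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

slope = trend_slope(weekly_queue_sizes)
if slope > 1:  # assumed threshold: queue growing faster than 1 PR/week
    print(f"Early warning: review queue growing ~{slope:.1f} PRs per week")
```

A slope fit is deliberately dumb but robust: it ignores week-to-week noise and answers the only question that matters here, "is the queue growing?"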
Method 3: Proactive Pattern Detection with AI Agents
The newest approach: use AI agents to continuously analyze your development system and alert you to bottlenecks forming.
What this means:
- Agents analyze PR age distributions: "PRs are aging 2x faster than last month, suggesting code review bottleneck"
- Agents track deployment frequency trends: "Deployment frequency dropped 30%, suggesting gating bottleneck"
- Agents correlate metrics: "When on-call team size dropped to 2, incident MTTR increased 3x, suggesting on-call bottleneck"
- Agents detect knowledge concentration: "Sarah accounts for over 80% of approvals on the auth service, suggesting knowledge bottleneck"
- Agents forecast future bottlenecks: "At current growth rate, code review queue will exceed team capacity in 3 weeks"
Tools like Glue exemplify this approach: continuously monitoring your codebase, development process, and team dynamics to surface constraints before they become acute.
The advantage: Humans are terrible at spotting trends in noisy data. Agents excel at it. Continuous monitoring catches problems early when they're easier to fix.
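The simplest of these checks, approval concentration, is easy to sketch yourself. Here the approval log, names, and the 50% threshold are all invented for illustration:

```python
from collections import Counter

# Hypothetical approval log for one service: (pr_id, approver) pairs.
approvals = [
    (101, "sarah"), (102, "sarah"), (103, "dev2"), (104, "sarah"),
    (105, "sarah"), (106, "dev3"), (107, "sarah"), (108, "sarah"),
]

counts = Counter(approver for _, approver in approvals)
total = sum(counts.values())

# Flag anyone accounting for more than half of all approvals:
# a crude proxy for knowledge concentrated in one head.
for approver, n in counts.most_common():
    share = n / total
    if share > 0.5:
        print(f"{approver} handles {share:.0%} of approvals: possible knowledge bottleneck")
```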
Eliminating Bottlenecks: Action Framework
Once you've identified a bottleneck, the fix depends on the type.
Code Review Bottleneck
Immediate actions:
- Establish review SLA (2-hour target)
- Create a review assignment rotation so a reviewer is always available
- Automate trivial reviews (linting, formatting, dependency updates)
- Make PRs smaller (max 400 lines)
System improvements:
- Distribute code ownership so review isn't bottlenecked on one person
- Train more engineers in critical areas
- Create clear review standards so reviewers don't over-scrutinize
Expected impact: Review turnaround drops from 2-5 days to 2-6 hours. Cycle time decreases 30-50%.
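The 400-line guideline can be enforced mechanically in CI. A hypothetical check might parse `git diff --numstat` output against the target branch; the limit, base ref, and sample data below are assumptions to tune for your team:

```python
import subprocess

MAX_CHANGED_LINES = 400  # the guideline above; adjust for your team

def count_changed_lines(numstat_output):
    """Sum added + deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat_output.strip().splitlines():
        if not line:
            continue
        added, deleted, _path = line.split("\t")
        if added == "-":  # binary files report "-" for line counts
            continue
        total += int(added) + int(deleted)
    return total

def pr_too_large(base_ref="origin/main"):
    """Run in CI: True if the current branch's diff exceeds the limit."""
    diff = subprocess.run(
        ["git", "diff", "--numstat", base_ref],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_changed_lines(diff) > MAX_CHANGED_LINES

# Parsing example on canned numstat output (tab-separated columns):
sample = "310\t120\tsrc/app.py\n-\t-\tassets/logo.png"
print(count_changed_lines(sample))  # prints 430
```

Failing the build on oversized PRs sounds harsh, but it converts a soft norm into a default that no reviewer has to argue for.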
CI/CD Bottleneck
Immediate actions:
- Measure where time is spent in pipeline
- Parallelize test execution
- Move slow tests to optional ("run nightly, not on every commit")
System improvements:
- Optimize slow tests
- Fix flaky tests
- Implement fast-fail (run quick checks first)
- Cache builds and dependencies
Expected impact: Build times drop from 45+ minutes to <15 minutes. Deployment frequency increases.
Deployment Gate Bottleneck
Immediate actions:
- Move approval gates to automated quality checks
- Document which automated checks must pass for a change to be considered safe to deploy
- Delegate approval authority (don't require VP approval)
System improvements:
- Improve test coverage and confidence
- Build feature flags so deployment and feature release are separate
- Improve monitoring so problems are detected quickly
- Build fast rollback capability
Expected impact: Code reaches production hours instead of weeks after it's ready.
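Feature flags are the key enabler here: the code deploys dark, and the feature is released later by flipping a flag, with no new deployment. A minimal sketch with invented names (a real flag store is a config service, not a module-level dict):

```python
FLAGS = {"new_checkout": False}  # hypothetical flag

def legacy_checkout_flow(cart):
    return "legacy"

def new_checkout_flow(cart):
    return "new"

def checkout(cart):
    # Deployment put both paths in production; the flag decides which runs.
    flow = new_checkout_flow if FLAGS["new_checkout"] else legacy_checkout_flow
    return flow(cart)

print(checkout(["book"]))      # flag off: legacy path
FLAGS["new_checkout"] = True   # "release" the feature, no redeploy needed
print(checkout(["book"]))      # flag on: new path
```

Because release is now a config change, rollback is also a config change, which removes much of the fear that motivated the approval gates in the first place.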
Knowledge Concentration Bottleneck
Immediate actions:
- Document critical knowledge (architecture, decisions, runbooks)
- Pair high-knowledge person with others for knowledge transfer
- Distribute code review responsibility
System improvements:
- Create runbooks for common problems
- Make documentation searchable and accessible
- Include knowledge transfer in onboarding
- Make decision documentation standard practice
Expected impact: Organization becomes less dependent on specific individuals. Scalability improves.
On-Call Bottleneck
Immediate actions:
- Expand on-call rotation (instead of 2 people, make it 4-5)
- Create runbooks for common incidents
- Reduce alert noise (only page for real problems)
System improvements:
- Improve system reliability (fewer incidents)
- Improve mean time to recovery (fix problems faster)
- Improve monitoring (surface problems faster)
- Systematize incident response (don't make it ad hoc)
Expected impact: On-call burden is distributed. Individual incidents may take slightly longer to resolve at first, but the total burden on any one person decreases.
Incident Response Bottleneck
Immediate actions:
- Create runbooks for top 10 incident types
- Establish clear incident roles (commander, communication, technical lead)
- Improve log/metric access
- Run blameless postmortems
System improvements:
- Improve system design to reduce incident causes
- Improve monitoring to detect issues earlier
- Build better dashboards
- Systematize root cause analysis
Expected impact: MTTR decreases 50%+. Confidence in incident response increases.
Monitoring for New Bottlenecks
Bottleneck elimination isn't one-time work. As you grow and systems change, new bottlenecks form.
Continuous monitoring:
- Weekly, calculate key metrics: review turnaround, deployment frequency, cycle time, on-call burden
- Track trends: are metrics improving or degrading?
- Look for correlations: when X changed, did Y also change?
- Alert on thresholds: if review queue exceeds 10 PRs, investigate
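The weekly check can be a short script comparing the latest snapshot against the red-flag thresholds from the detection methods above. Metric names, values, and the deploys-per-week floor here are illustrative assumptions:

```python
# Hypothetical weekly snapshot, built from the measurements described
# above (git history, CI/CD logs, pager data).
snapshot = {
    "p95_review_hours": 30,
    "review_queue_size": 12,
    "deploys_per_week": 3,
    "p95_deploy_minutes": 40,
}

# Breach checks per metric. The first two mirror the red flags above;
# the deploy-frequency floor is an assumed target.
thresholds = {
    "p95_review_hours": lambda v: v > 24,
    "review_queue_size": lambda v: v > 10,
    "deploys_per_week": lambda v: v < 5,
    "p95_deploy_minutes": lambda v: v > 60,
}

alerts = [name for name, check in thresholds.items() if check(snapshot[name])]
print("Investigate:", alerts)
```

Wire the output into whatever channel your team actually reads; a threshold breach that lands in an unread dashboard is the reactive old way with extra steps.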
Quarterly deep dives:
- Analyze full cycle time distribution
- Look for queue sizes that are growing
- Interview teams about what's slowing them down
- Identify the top 3 bottlenecks
Annual assessment:
- Has organization architecture changed in ways that created new bottlenecks?
- Have team sizes grown in ways that broke previous solutions?
- What new bottlenecks are emerging as you scale?
The goal is continuous evolution, not static optimization. Every few months, conditions change. Your detection and elimination process has to adapt.
The Evolution from Dashboards to Proactive Detection
Traditional approach: leaders check dashboards. When metrics look bad, they investigate.
Modern approach: AI agents continuously analyze your system and alert leaders when patterns suggest bottlenecks.
The old way is reactive. You only know about problems after they've already slowed the organization. The new way is proactive. You detect constraints forming and can address them before they cause pain.
Systems like Glue represent this evolution: continuous monitoring of your codebase and development process, automatic detection of patterns that indicate bottlenecks, proactive surfacing of constraints before they become acute.
The benefit: by the time a human would notice a bottleneck from a dashboard, agents have already been monitoring it for weeks and can suggest what's causing it and how to fix it.
Conclusion: Bottlenecks Are Opportunities
A bottleneck is where the system's constraint lives. It's also where the biggest leverage is.
If code review is your bottleneck and you fix it, cycle time improves dramatically. If CI/CD is your bottleneck and you fix it, deployment frequency jumps. If knowledge concentration is your bottleneck and you fix it, organization scales.
The organizations that grow fastest aren't the ones trying to optimize everything equally. They're the ones that identify the constraint, attack it relentlessly, and move to the next constraint as the first is eliminated.
This is how engineering organizations scale: through systematic bottleneck identification and elimination, continuously, using both data analysis and intelligent monitoring.
Related Reading
- Cycle Time: Definition, Formula, and Why It Matters
- Lead Time: Definition, Measurement, and How to Reduce It
- PR Size and Code Review: Why Smaller Is Better
- DORA Metrics: The Complete Guide for Engineering Leaders
- Engineering Efficiency Metrics: The 12 Numbers That Actually Matter
- Code Dependencies: The Complete Guide