I Built a CLI Tool to Score Codebase Health - Here's What I Learned

Sahil Singh, Founder & CEO
June 23, 2026 · 11 min read

Three months ago, I built a CLI tool to score codebase health. I called it `codebase-health-score` (original, I know). The goal was simple: give a team a single number that represents the overall health of their system.

The project started as a weekend hack. It turned into something people actually use. And in the process, I learned that "codebase health" is a much more slippery concept than it initially appeared.

Why I Built This

Our team was going through the classic scenario: we had inherited a codebase from an acquisition. It was "messy." But how messy? Nobody could quantify it.

When we tried to plan the cleanup work, the same debates kept coming up:

  • "How bad is it really?"
  • "Should we refactor X before Y?"
  • "Is this codebase maintainable, or do we need a rewrite?"

Everyone had opinions. Nobody had data.

I thought: what if I built a tool that analyzed the codebase and produced a score? Not a vanity metric, but something that actually correlates to real engineering friction.

The Challenge: What Actually Determines Health?

I quickly discovered that codebase health isn't a single property. It's a combination of properties, all of which matter, and none of which tells the whole story:

Complexity

The first instinct is to measure code complexity: cyclomatic complexity, lines of code per function, nesting depth.

# Bad complexity example
def process_order(order_id, customer_id, apply_discount=False,
                  skip_inventory=False, force_shipping=None,
                  override_price=None, is_internal=False):
    customer = get_customer(customer_id)
    if not customer:
        return {"error": "Customer not found"}

    order = get_order(order_id)
    if not order:
        return {"error": "Order not found"}

    # ... 200 more lines mixing pricing logic, shipping logic,
    # inventory logic, and error handling

High complexity correlates with bugs. But it's not perfectly predictive - sometimes simple code is wrong, and sometimes complex code is necessary.

Weight in my scoring: 15%

The reason it's not higher: complexity is easy to hide. You can refactor a function from 100 lines to 50 lines by moving the complexity to three other functions. The system-level complexity hasn't improved; it's just been redistributed.
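
If you want to eyeball these numbers on your own repo without my tool, radon is an off-the-shelf analyzer that reports per-function cyclomatic complexity (my scorer computes its own variant, so treat this as an illustration):

pip install radon
radon cc -s -a src/   # per-function cyclomatic complexity, plus a repo-wide average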

Churn

Churn is the frequency with which a file or function changes. High-churn code is code that:

  • Is still being figured out
  • Has bugs being fixed repeatedly
  • Has changing requirements
  • Is in the critical path of the system

High churn is a strong predictor of future bugs. If a file changes 50 times in a quarter, it probably contains unresolved design decisions.

# Check churn on a file
git log --oneline --follow path/to/file.py | wc -l
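
To rank every file by churn rather than checking one at a time, a rough one-liner over a recent window looks something like this (the 90-day window and the top-20 cutoff are arbitrary):

# Rank files by number of commits touching them in the last 90 days
git log --since="90 days ago" --name-only --pretty=format: | \
  grep -v '^$' | sort | uniq -c | sort -rn | head -20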

I discovered something interesting: files with high churn are often high-complexity files, and they correlate strongly with production incidents. When we looked at our incident logs and traced each incident back to the files involved, the top 5 files by incident involvement were in the top 10 files by churn.

Weight in my scoring: 25%

This is why churn matters more than static complexity. Churn tells you where the system is still in flux. Where there's flux, there are bugs.

Contributor Concentration

This one surprised me. But it's actually one of the strongest health signals.

If one person understands 80% of the codebase, your system is fragile. If that person leaves, goes on vacation, or gets promoted, you're in trouble.

Conversely, if knowledge is distributed across 5+ people, the system is more resilient.

I measure this by calculating what percentage of commits come from the top N contributors:

def calculate_contributor_concentration(repo_path):
    """
    Calculate what % of commits come from top 5 contributors.
    Healthy = spread across many people.
    Unhealthy = concentrated in 1-2 people.
    """
    commits_by_contributor = count_commits_by_author(repo_path)
    total_commits = sum(commits_by_contributor.values())

    top_5_commits = sum(sorted(commits_by_contributor.values(),
                                reverse=True)[:5])
    concentration = (top_5_commits / total_commits) * 100

    # Less than 60% in top 5 = healthy
    # 60-80% = concerning
    # 80%+ = critical

    return concentration
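
The count_commits_by_author helper is where the real work happens; a minimal version of it can lean on git log and a Counter (a sketch, not the exact code in the tool):

import subprocess
from collections import Counter

def count_commits_by_author(repo_path):
    # One author name per commit; Counter turns that into commits per author.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%an"],
        capture_output=True, text=True, check=True,
    )
    return Counter(name for name in log.stdout.splitlines() if name)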

The correlation here is almost spooky. I measured this against several teams and found:

  • Teams with >80% concentration in top 5 had 3.2x more onboarding issues
  • Teams with >80% concentration had 2.7x higher risk of critical incidents when knowledge-holder was unavailable
  • Teams with <60% concentration had 40% better retention of junior engineers

Weight in my scoring: 20%

Test Coverage

I was initially going to make this 30% of the score. But then I realized: raw test coverage is a vanity metric. You can have 95% coverage built on bad tests, or 40% coverage that happens to sit exactly on the paths that matter most.

What actually matters: are the critical paths tested? Are failure cases tested? Or is the coverage just hitting lines without testing actual behavior?
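
If you use coverage.py, scoping the report to a single package is one way to check critical-path coverage directly (the src/payments/ path here is just an example):

pip install coverage
coverage run -m pytest
coverage report --include="src/payments/*"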

Still, some correlation exists: codebases with <30% coverage tend to have higher incident rates than codebases with >70% coverage.

Weight in my scoring: 10%

Documentation Coverage

This is tricky to measure automatically. I settled on a heuristic: what percentage of exported functions/classes have associated comments or docstrings?

def measure_documentation_ratio(codebase_path):
    """
    Count public functions with docstrings vs total public functions.
    """
    public_items = find_public_apis(codebase_path)
    documented_items = [item for item in public_items
                        if has_docstring(item)]

    return len(documented_items) / len(public_items)
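
For Python, both helpers can be approximated with the standard-library ast module, roughly along these lines (a simplified sketch; the tool's real traversal handles more edge cases):

import ast
from pathlib import Path

def docstring_ratio(codebase_path):
    # Share of public (non-underscore) functions/classes that carry a docstring.
    public = documented = 0
    for path in Path(codebase_path).rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                if node.name.startswith("_"):
                    continue
                public += 1
                if ast.get_docstring(node):
                    documented += 1
    return documented / public if public else 1.0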

In practice, I found this correlates weakly with actual system health. Some of the most stable, well-understood codebases have minimal documentation. Some heavily documented codebases are disasters because the code changed and the docs didn't.

But absence of documentation does correlate with onboarding difficulty. New engineers in undocumented codebases take 50% longer to ramp.

Weight in my scoring: 10%

Technical Debt Markers

Some patterns are markers of accumulated technical debt:

  • TODO/FIXME comments that haven't been touched in 6+ months
  • Dead code (imports that aren't used, functions that aren't called)
  • Dependency version obsolescence (packages that haven't been updated in 2+ years)

# Example: show when each TODO line was last touched (via git blame)
grep -rn "TODO" --include="*.py" . | while IFS=: read -r file line _; do
  git blame --date=short -L "$line,$line" -- "$file"
done
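
For the dead-code and unused-import markers, existing linters do the heavy lifting better than git archaeology; pyflakes flags unused imports and vulture takes a broader pass at dead code (both are third-party tools, separate from my scorer):

# Unused imports and variables
pip install pyflakes
python -m pyflakes src/

# Broader dead-code pass: unused functions, classes, attributes
pip install vulture
vulture src/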

These are weak individual signals, but together they form a pattern. A codebase with many of these markers is showing signs of neglect.

Weight in my scoring: 10%

Coupling & Modularity

This one requires actual analysis of dependencies: how many modules import from how many other modules?

A healthy architecture has:

  • Clear boundaries (modules don't reach deep into other modules' internals)
  • Limited cross-cutting dependencies
  • A recognizable structure (layered, services-based, whatever pattern you chose)

An unhealthy architecture has:

  • Circular dependencies
  • Every file importing from every other file
  • No clear separation of concerns

def calculate_coupling_metrics(repo_path):
    """
    Measure how tightly coupled modules are.
    """
    import_graph = build_import_graph(repo_path)

    # Count circular dependencies
    cycles = find_cycles(import_graph)

    # Count modules importing from more than 10 other modules
    overly_connected = [
        module for module in import_graph
        if len(import_graph[module]) > 10
    ]

    # Calculate average connectivity
    avg_imports_per_module = sum(
        len(deps) for deps in import_graph.values()
    ) / len(import_graph)

    return {
        'cycles': len(cycles),
        'overly_connected': len(overly_connected),
        'avg_connectivity': avg_imports_per_module
    }
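
If you want to reproduce the graph-building piece for Python, a bare-bones build_import_graph and find_cycles can lean on ast plus networkx for cycle detection (this simplification ignores relative-import resolution and package structure):

import ast
from collections import defaultdict
from pathlib import Path
import networkx as nx  # third-party: pip install networkx

def build_import_graph(repo_path):
    # Map each module (by file stem) to the top-level modules it imports.
    graph = defaultdict(set)
    for path in Path(repo_path).rglob("*.py"):
        module = path.stem
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                graph[module].update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                graph[module].add(node.module.split(".")[0])
    return dict(graph)

def find_cycles(import_graph):
    # Circular dependencies show up as directed cycles in the import graph.
    digraph = nx.DiGraph({m: list(deps) for m, deps in import_graph.items()})
    return list(nx.simple_cycles(digraph))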

This is computationally expensive for large codebases, but it's one of the strongest predictors of system health. Systems with circular dependencies are fragile. Systems with high connectivity are hard to change.

Weight in my scoring: 10%

The Scoring Formula

I settled on this weighting:

Health Score = (0.25 × Churn) + (0.20 × Contributor Concentration) +
               (0.15 × Complexity) + (0.10 × Coupling) +
               (0.10 × Test Coverage) + (0.10 × Documentation) +
               (0.10 × Technical Debt Markers)

Scale: 0-100, where:

  • 0-30: Critical. High risk of incidents. Hard to change. Fragile.
  • 30-50: Unhealthy. Velocity is degraded. Rework is common.
  • 50-70: Acceptable. Maintainable. Some areas need attention.
  • 70-85: Healthy. Good balance of velocity and stability.
  • 85-100: Excellent. Well-maintained, stable, easy to change.
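
In code, the whole formula is just a weighted sum over component scores that have each been normalized to 0-100 (weights as listed above):

WEIGHTS = {
    "churn": 0.25,
    "contributor_concentration": 0.20,
    "complexity": 0.15,
    "coupling": 0.10,
    "test_coverage": 0.10,
    "documentation": 0.10,
    "tech_debt_markers": 0.10,
}

def health_score(component_scores):
    # component_scores: dict keyed like WEIGHTS, each value already on a 0-100 scale
    return round(sum(WEIGHTS[k] * component_scores[k] for k in WEIGHTS), 1)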

Real-World Findings

I've now run this tool on about 30 different codebases (some open source, some from friends' companies). Some patterns emerged that surprised me:

1. Age Doesn't Predict Score

You'd think old codebases score lower. They don't. Some 10-year-old codebases score in the 80s. Some 2-year-old codebases score in the 30s.

What matters: has anyone been actively maintaining it? Or has it been in "maintenance mode" while people shipped the next shiny thing?

2. Size Doesn't Predict Score

A 500-file monolith can score higher than a microservices architecture with 50 services. Modularity matters more than scale.

3. Contributor Concentration is Underrated

I initially weighted this at 15%. I moved it to 20% because the correlation with real problems was stronger than I expected. When I talked to teams and asked "are you worried about knowledge loss?", it almost perfectly correlated with high concentration scores.

4. Coupling is More Predictive Than Complexity

This one surprised me the most. Systems with moderate complexity but high coupling were worse than systems with high complexity but low coupling. Why? Because complexity can be refactored locally. Coupling forces you to refactor everything.

5. Open Source Codebases Score Surprisingly High

Most open-source projects I tested scored 65+. Why? Because:

  • Contributor concentration is naturally distributed (many contributors)
  • Dead code gets cleaned up (merge requests catch unused code)
  • Tests are emphasized (because you can't babysit every external contribution)
  • Documentation is required (because you can't have conversations with all users)

Enterprise codebases, on average, score lower. Why? I think it's because of the "it works, don't touch it" mentality.

Code Example: Running the Tool

Here's what usage looks like:

$ pip install codebase-health-score

$ health-score /path/to/repo

Analyzing repository: myapp/
  ✓ Cloned metrics
  ✓ Analyzed commit history (2847 commits)
  ✓ Built import graph (482 modules)
  ✓ Evaluated tests (67% coverage)
  ✓ Scanned for issues (23 TODO markers, 12 dead code imports)

┌─────────────────────────────────────────────┐
│ CODEBASE HEALTH SCORE: 62/100               │
│ Status: ACCEPTABLE                          │
└─────────────────────────────────────────────┘

Breakdown:
  Churn (25%):              68/100  ✓
  Contributor Concentration (20%): 45/100  ⚠ (too concentrated)
  Complexity (15%):         72/100  ✓
  Coupling (10%):           38/100  ✗ (circular deps detected)
  Test Coverage (10%):      67/100  ✓
  Documentation (10%):      71/100  ✓
  Technical Debt (10%):     52/100  ⚠

Recommendations:
  1. Reduce contributor concentration
     - Pair junior devs with knowledge holders
     - Distribute responsibility across team

  2. Address circular dependencies
     - Reorder package imports in service_layer
     - Consider extracting shared utilities

  3. Increase test coverage on critical paths
     - Payment module: 42% → target 80%

$ health-score /path/to/repo --watch

Watching for changes...
[re-runs on every commit]

What I'd Do Differently

If I were starting over:

  1. Weight churn more heavily. It's the strongest single predictor of problems.

  2. Focus on critical paths. Not all code is equal. Scoring the payment module is more important than scoring a utilities file. Future versions should allow path-specific scoring.

  3. Include incident correlation. Ideally, the tool would correlate scores against actual incident history. "Here's your score, and here's how it correlated with your 47 incidents last year."

  4. Make it continuous. Run this on every commit and trend the score over time. The delta matters more than the absolute number.
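
As a rough sketch of point 4, a post-commit hook can append the score to a small history file so the trend is visible over time (this assumes the CLI accepts a path and prints the summary line shown in the sample output above):

#!/bin/sh
# .git/hooks/post-commit - hypothetical sketch, not shipped with the tool
score=$(health-score . | grep -o 'HEALTH SCORE: [0-9]*' | grep -o '[0-9]*$')
echo "$(git rev-parse --short HEAD),$(date +%F),$score" >> .health-history.csv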

Open Source & Beyond

The tool is available at github.com/glue-tools-ai/codebase-health-score. It's MIT licensed. PRs welcome, especially for adding language support (currently Python, with Go and TypeScript in progress).

What I've learned: there's hunger for this. Teams want to understand their codebase health. They want signals for where to invest cleanup effort. And most existing tools are either too broad ("code quality score") or too narrow ("complexity analyzer").

This tool sits in the middle: give me a repo, and I'll tell you in 2 minutes whether your codebase is healthy, where the problems are, and what actions would help most.

The Bigger Picture

Building this tool reminded me of something: codebases are alive. They change. They decay. But unlike biological systems, they can be resurrected and healed.

The teams that do well treat their codebase health like they treat physical fitness: they don't wait for a crisis (a cardiac incident) to pay attention to it. They check their metrics regularly. They understand what drives health. And they invest in the fundamentals (exercise, sleep, good diet) rather than emergency measures.

A codebase-health-score of 62/100 doesn't sound alarming until you realize what it implies: the team is probably spending 2-3x longer on feature development than it should be. That isn't an alarm bell so much as the cost of technical debt made visible.

The tool won't save you. But it'll tell you the truth about your codebase. And that truth is the foundation for fixing anything.


Resources & References

  • github.com/glue-tools-ai/codebase-health-score - The open source tool
  • DORA Metrics: 4 Key Metrics for Software Delivery - Gold standard for measuring engineering health
  • Refactoring Guru: Code Smell Detection - On identifying problematic code patterns
  • Stack Overflow Developer Survey: Code Quality Priorities 2023 - What engineers actually care about
