PMs need to understand training data quality, model accuracy in context, and drift over time to build ML products effectively without needing the math.
At Salesken, our product managers were some of the smartest people in the room. But they were making decisions with incomplete information because the codebase was a black box to them.
Machine learning for product managers is the discipline of applying ML techniques to make better product decisions and to build products that use ML. It encompasses two things: using ML as a tool (predict which customers will churn, which features matter most) and building ML-powered features (recommendation engines, content moderation, personalization).
Most product managers don't build ML models themselves. They work with data scientists and ML engineers. But they need enough understanding of ML to know: what's possible, what's hard, what data is needed, and what could go wrong.
Product managers work with constraints. Roadmaps are shaped by what's feasible. Understanding ML lets PMs expand the feasible set: "We thought personalization was months away. But an ML-based recommendation system could ship in weeks." Or it lets them understand why something isn't feasible: "We want a content moderation system, but we don't have labeled training data. That's the blocker."
Understanding ML also makes you a better collaborator. Data scientists speak in probabilities and trade-offs. When a data scientist says "this model has 85% accuracy," you need to be able to ask: is that good? (It depends on what happens in the other 15%.) What could go wrong? (It depends on which 15% it misclassifies.)
1. ML for Decision-Making
Using ML to understand the product and make better decisions. Examples: predicting churn, forecasting demand, ranking feature importance.
You don't ship these as product features. You use them internally. Example: "Our churn model predicts that 2% of enterprise customers will churn in the next quarter. The top churn factors are: low feature adoption and long response times to support." That informs product decisions.
2. ML-Powered Features
Building product features that use ML. Examples: recommendations, personalization, prediction, classification.
These are what users see. Users interact with the output ("here are recommendations for you") but don't see the ML.
3. Enabling User Agency
Building tools that let users apply ML to their own problems. Example: email spam filters, fraud detection, anomaly detection. Users configure what patterns matter to them, and the system learns.
ML projects have different constraints:
Data dependency: ML projects need data. Lots of it. Good data. If you don't have data, or if your data is biased, the ML project fails. This constraint doesn't exist in normal software projects.
Impossible to predict exactly: Normal software is deterministic. Given input X, you get output Y. ML is probabilistic. Given input X, you get output Y with 87% confidence. You need to think in probabilities.
Bias and fairness: ML models can amplify bias in training data. An ML hiring filter trained on historical hiring data will discriminate the same way your historical hiring did. Normal software doesn't have this problem (assuming you're not purposely discriminating).
Harder to test: You can test software with test cases. ML is harder. You can test that the model runs, but you need to test whether it works well, which requires labeled data and statistical analysis.
Model drift: Models trained on historical data degrade over time as the world changes. This is called drift. Your churn model trained on 2020 data might not work in 2024. Normal software doesn't have this problem.
1. Define the problem: What exactly are we trying to predict or classify? "Predict which customers will churn" is vague. "Predict which enterprise customers will churn in the next 90 days, for the purpose of targeting retention campaigns" is specific.
2. Collect data: What data do we need? Do we have it? Is it clean? This often takes 40% of the project time.
3. Explore and prepare: Data scientists dive deep. Are there patterns? What features matter? What's the signal-to-noise ratio? This exploration is critical. Bad data exploration leads to bad models.
4. Build the model: Train on historical data. Test on held-out data. Evaluate: does it work? How well? What's the accuracy? The false positive rate?
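The train-then-evaluate-on-held-out-data loop can be sketched in a few lines. The data, the churn pattern, and the threshold "model" below are all made up for illustration; the point is the workflow, not the model:

```python
import random

# Hypothetical data: each "customer" is (days_since_last_login, churned).
# In this toy dataset, churners tend to have been inactive longer.
random.seed(42)
data = [(random.randint(0, 30), 0) for _ in range(80)] + \
       [(random.randint(20, 60), 1) for _ in range(20)]
random.shuffle(data)

# Hold out 30% for testing. The model never sees these rows during training.
split = int(len(data) * 0.7)
train, test = data[:split], data[split:]

def accuracy(threshold, rows):
    # Predict "churn" when inactivity exceeds the threshold; score correctness.
    return sum((days > threshold) == bool(churned) for days, churned in rows) / len(rows)

# "Training": pick the inactivity threshold that best separates the train set.
# A stand-in for a real model; fit on train, then score on held-out test data.
best_threshold = max(range(0, 61), key=lambda t: accuracy(t, train))

print(f"chosen threshold: {best_threshold} days")
print(f"held-out accuracy: {accuracy(best_threshold, test):.0%}")
```

If you score the model on the same rows it was trained on, you get an optimistic number; the held-out score is the honest one.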
5. Deploy: Take the model from laptop to production. This is often harder than building it. Production models need to: handle new data they weren't trained on, degrade gracefully when they're uncertain, be monitored for drift.
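"Degrade gracefully when uncertain" often means wrapping the model so low-confidence predictions fall back to something safe. This is a sketch with hypothetical names, not a real serving API:

```python
# Below this confidence, don't act on the prediction; fall back instead.
CONFIDENCE_FLOOR = 0.7

def predict_with_fallback(model, features):
    # The model is assumed to return a (label, confidence) pair.
    label, confidence = model(features)
    if confidence < CONFIDENCE_FLOOR:
        # Degrade gracefully: route to a rule, a human, or a neutral default.
        return {"label": "unknown", "confidence": confidence, "fallback": True}
    return {"label": label, "confidence": confidence, "fallback": False}

# A toy model for demonstration: churn risk from days of inactivity.
def toy_churn_model(features):
    score = min(features["inactive_days"] / 60, 1.0)
    return ("churn" if score > 0.5 else "retain", max(score, 1 - score))

print(predict_with_fallback(toy_churn_model, {"inactive_days": 55}))  # confident
print(predict_with_fallback(toy_churn_model, {"inactive_days": 33}))  # falls back
```

The product decision hiding in this sketch is the fallback path itself: what the user sees when the model declines to answer is a design choice, not an engineering detail.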
6. Monitor: In production, does the model perform as expected? Is accuracy holding? Is drift happening? Monitoring is ongoing.
Training data: Historical data used to teach the model. "Here are examples of churned customers and non-churned customers. Learn the difference."
Test data: Held-out data used to evaluate the model. Important: must be separate from training data, or you'll be optimizing the model to your test set, not to reality.
Accuracy: What percentage of predictions are correct? 85% accuracy means 85% of predictions are right and 15% are wrong. Caution: with imbalanced data (say, 2% of customers churn), a model that predicts "no churn" for everyone scores 98% accuracy while being useless. That is why precision and recall matter.
Precision: Of the things the model predicts as positive, how many are actually positive? Churn prediction example: "The model predicts 100 customers will churn. Of those 100, 80 actually do. Precision = 80%." High precision = few false positives.
Recall: Of the actual positive cases, how many does the model catch? Churn prediction: "There are 200 customers who will actually churn. The model predicts 80 of them. Recall = 40%." High recall = few false negatives.
The precision-recall tradeoff: You can usually increase one at the expense of the other. A stricter model raises fewer false alarms (high precision) but misses more true positives (low recall). A looser model catches more true positives (high recall) but raises more false alarms (low precision).
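The three metrics above can be computed by hand from a handful of predictions. The labels below are made up for illustration (1 = "will churn", 0 = "will not churn"):

```python
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # correctly flagged
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false alarms
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # missed churners
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))

accuracy  = (tp + tn) / len(actual)  # share of all predictions that are right
precision = tp / (tp + fp)           # of those flagged, how many really churn
recall    = tp / (tp + fn)           # of real churners, how many we caught

print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%}")
# accuracy=70% precision=67% recall=50%
```

Note how the three numbers disagree: 70% accuracy sounds fine, but the model is missing half of the churners. Which number matters is a product question, not a modeling one.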
Model drift: Over time, the relationship between inputs and outputs changes. A model trained on 2020 hiring data doesn't work well in 2024 if hiring practices changed. Drift is normal and requires retraining.
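A minimal drift check compares a feature's recent distribution to the distribution the model was trained on. Production systems use statistical tests (PSI, Kolmogorov-Smirnov); the numbers below are invented, but the principle of monitoring inputs and alerting on shift is the same:

```python
from statistics import mean, pstdev

# Hypothetical feature: weekly logins per customer.
training_window = [5, 7, 6, 8, 5, 6, 7, 6, 5, 7]  # values at training time
recent_window   = [2, 3, 1, 2, 3, 2, 1, 3, 2, 2]  # the same feature, observed now

baseline_mean = mean(training_window)
baseline_std  = pstdev(training_window)

# Flag drift if the recent mean is more than 2 standard deviations away.
shift = abs(mean(recent_window) - baseline_mean) / baseline_std
if shift > 2:
    print(f"drift detected: feature shifted {shift:.1f} standard deviations")
```

A check like this runs on a schedule; when it fires, the usual response is to investigate and retrain.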
"We have data, so we should build an ML model." Data is necessary, not sufficient. You also need: a clear problem, sufficient labeled data, and a good evaluation metric. Having data doesn't mean an ML project will succeed.
"Higher accuracy is always better." Depends on the cost of different mistakes. For medical diagnosis, false negatives (missing an illness) are worse than false positives (diagnosing something that isn't there). For spam filtering, false positives (marking legitimate email as spam) might be worse. The right accuracy depends on your use case.
"The model is the project." Wrong. Building the model is 20% of the work. Data collection and cleaning is 40%. Deployment is 30%. Monitoring is 10%. Many teams fail at deployment or monitoring, not at modeling.
"We trained the model, so we're done." Models decay over time. You need monitoring and retraining. A deployed model is the start of the work, not the end.
Q: How much ML knowledge do I need as a PM?
A: Enough to ask good questions: "How much data do we need? How long will it take to collect? What could go wrong? How will we know if the model is working?" You don't need to build models yourself. You need to understand enough to plan and monitor.
Q: How long does an ML project take?
A: Longer than you think. A typical ML project: 2-4 weeks to define the problem and assess feasibility, 2-8 weeks to collect and prepare data, 2-4 weeks to build and evaluate, 2-4 weeks to deploy, then ongoing monitoring. 2-6 months is typical. It goes faster if you have good data and a clear problem, slower if data collection is hard.
Q: When should we build an ML feature vs. using rules?
A: ML is useful when: the pattern is complex (hard to express as rules), data changes frequently (rules would need constant updates), or you have lots of data. Rules are simpler and often sufficient. Don't use ML just because it's trendy.
Q: How do we avoid bias in ML models?
A: Bias in models comes from bias in data. If your training data is biased, your model will be biased. Solutions: audit your training data for bias, use representative data, test your model on different groups, have fairness metrics alongside accuracy. Prevention is easier than fixing bias after deployment.
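"Test your model on different groups" can be as simple as computing the same metric per group instead of trusting one aggregate number. The data and group names here are made up:

```python
# Each row is (group, actual, predicted); labels are hypothetical.
predictions = [
    ("group_a", 1, 1), ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 0, 1),
    ("group_b", 1, 0), ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 1),
]

# Tally (correct, total) separately for each group.
groups = {}
for group, actual, predicted in predictions:
    correct, total = groups.get(group, (0, 0))
    groups[group] = (correct + (actual == predicted), total + 1)

for group, (correct, total) in sorted(groups.items()):
    print(f"{group}: accuracy {correct / total:.0%} ({correct}/{total})")
# group_a: accuracy 75% (3/4)
# group_b: accuracy 50% (2/4)
```

A single aggregate accuracy (here 62.5%) would hide the gap between the two groups; the per-group breakdown is what surfaces it.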