Model Evaluation
Beyond accuracy — precision, recall, F1-score, and how to tell if your model is actually good.
Suppose a rare disease affects 1% of people. A model that predicts “no disease” for everyone is right 99% of the time, yet it misses every person who actually has the disease, so it is useless for screening. When classes are imbalanced, accuracy is often a misleading metric.
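The arithmetic behind this can be checked with a short sketch. The numbers are made up for illustration (1,000 people, 10 of them sick), and the “model” is simply a rule that always answers “No”.

```python
# Hypothetical screening data: 1,000 people, 10 of whom have the disease.
# 1 = has the disease ("Yes"), 0 = does not ("No").
actual = [1] * 10 + [0] * 990

# A "model" that ignores its input and always predicts "No".
predicted = [0] * len(actual)

# Accuracy: fraction of predictions that match the true label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# How many of the truly sick people did the model flag?
found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"accuracy: {accuracy:.2%}")                        # accuracy: 99.00%
print(f"sick patients found: {found} of {sum(actual)}")   # sick patients found: 0 of 10
```

The accuracy looks excellent, but the model finds none of the people it exists to find.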
Most classifiers output a score, not a hard label; you pick a threshold to turn scores into Yes/No. Moving the threshold changes the confusion matrix and, with it, accuracy, precision, and recall.
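As a minimal sketch of thresholding, the snippet below turns scores into labels at three different thresholds. The scores are invented, the helper name `apply_threshold` is hypothetical, and treating a score equal to the threshold as “Yes” is an assumed convention.

```python
def apply_threshold(scores, threshold):
    """Turn scores into 0/1 labels: 1 ("Yes") when score >= threshold (assumed convention)."""
    return [1 if s >= threshold else 0 for s in scores]

# Made-up scores for six examples, sorted for readability.
scores = [0.05, 0.20, 0.45, 0.55, 0.80, 0.95]

for threshold in (0.3, 0.5, 0.9):
    print(threshold, apply_threshold(scores, threshold))
# 0.3 [0, 0, 1, 1, 1, 1]
# 0.5 [0, 0, 0, 1, 1, 1]
# 0.9 [0, 0, 0, 0, 0, 1]
```

A lower threshold says “Yes” more often, and a higher threshold says “Yes” only for the most confident scores.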
The Confusion Matrix
To truly evaluate a model, we look at four outcomes:
- True Positive (TP): Predicted “Yes”, actually “Yes”. (Good)
- True Negative (TN): Predicted “No”, actually “No”. (Good)
- False Positive (FP): Predicted “Yes”, actually “No”. (Type I Error - The “False Alarm”)
- False Negative (FN): Predicted “No”, actually “Yes”. (Type II Error - The “Miss”)
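The four outcomes above can be counted directly from paired labels. This is a small sketch, assuming labels are encoded as 1 for “Yes” and 0 for “No”; the function name `confusion_counts` and the example data are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = "Yes", 0 = "No")."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1   # predicted "Yes", actually "Yes"
        elif p == 0 and a == 0:
            tn += 1   # predicted "No", actually "No"
        elif p == 1 and a == 0:
            fp += 1   # false alarm
        else:
            fn += 1   # miss
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Made-up labels for eight examples.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```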
Better Metrics
1. Precision (Quality)
“Of all the times the model said ‘Yes’, how many were actually ‘Yes’?”
High precision means few “False Alarms”.
2. Recall (Completeness)
“Of all the ‘Yes’ cases in the data, how many did the model find?”
High recall means the model doesn’t “Miss” many cases.
3. F1-Score (The Balance)
The harmonic mean of precision and recall, so it is high only when both are high. It is a useful single number for an imbalanced dataset when false alarms and misses matter about equally.
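In terms of the confusion matrix, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 × precision × recall / (precision + recall). The sketch below computes all three from raw counts. The counts are invented, the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not a universal rule.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    Returns 0.0 for a metric whose denominator is zero (an assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 correct "Yes" predictions, 2 false alarms, 8 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.5 0.615
```

Here F1 (about 0.615) sits below the plain average of precision and recall (0.65) because the harmonic mean is pulled toward the lower of the two values.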
Generalization: Train vs Test
A model that memorizes its training data but fails on new data is overfitting. To detect this, we split the data into two parts and compare performance on each:
- Training Set: Used to fit the model, for example by adjusting its weights.
- Test Set: A “hidden” set used only at the very end to see how the model performs on data it has never seen before.
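The split and the train-versus-test comparison can be sketched in plain Python. Everything here is illustrative: the data are coin-flip labels with no pattern to learn, `train_test_split` is a hypothetical helper written for this example, and the “model” is a lookup table that can only memorize.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and hold out test_fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical data: inputs 0..99 with random labels, so memorizing cannot generalize.
label_rng = random.Random(1)
rows = [(x, label_rng.randint(0, 1)) for x in range(100)]
train, test = train_test_split(rows)

# A "model" that only memorizes: a lookup table built from the training set.
memory = dict(train)

def predict(x):
    return memory.get(x, 0)  # unseen input: fall back to "No"

def accuracy(data):
    return sum(predict(x) == y for x, y in data) / len(data)

print("train accuracy:", accuracy(train))  # train accuracy: 1.0
print("test accuracy: ", accuracy(test))   # noticeably lower; exact value depends on the seed
```

A large gap between the two numbers is the signal to look for: perfect performance on the training set says nothing about data the model has never seen.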
Takeaways
- Accuracy is misleading for imbalanced data.
- Precision is lowered by “false alarms”; recall is lowered by “misses”.
- Overfitting is when a model memorizes data rather than learning patterns.