
Model Evaluation

Beyond accuracy — precision, recall, F1-score, and how to tell if your model is actually good.

Suppose a rare disease affects 1% of people, and a model predicts “no disease” for everyone. It is right 99% of the time. Is it a good model? No. It misses every person who actually has the disease, so it’s useless. In machine learning, accuracy is often a misleading metric.
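The rare-disease scenario can be checked with a few lines of arithmetic. This is a minimal sketch: the population size of 1,000 and the always-negative "model" are assumptions chosen for illustration.

```python
# Hypothetical population: 1,000 people, 1% of whom have the disease (1 = sick).
actual = [1] * 10 + [0] * 990

# A "model" that predicts "no disease" for everyone.
predicted = [0] * len(actual)

# Accuracy: fraction of predictions that match the truth.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)

# Fraction of the truly sick people that the model found.
found = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
found_rate = found / sum(actual)

print(f"accuracy = {accuracy:.2%}")            # accuracy = 99.00%
print(f"sick people found = {found_rate:.2%}")  # sick people found = 0.00%
```

The accuracy looks excellent even though the model never identifies a single sick person.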

Most classifiers output a score, not a hard label — you pick a threshold to turn scores into Yes/No. Slide the threshold below and watch the confusion matrix and the accuracy / precision / recall metrics update live.

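[Interactive figure: threshold slider from 0 to 1, shown at t = 0.50. Actual positives are plotted above the line, actual negatives below the line.]

At t = 0.50:

  • True Positive: 6
  • False Positive: 3
  • False Negative: 2
  • True Negative: 5
  • Accuracy: 69%
  • Precision: 67%
  • Recall: 75%
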
Lower the threshold → catches more positives (recall up) but more false alarms (precision down).
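The effect of moving the threshold can be sketched in code. The scores and labels below are made up for illustration and are not the data behind the figure; treating `score >= threshold` as a "Yes" prediction is also an assumed convention.

```python
# Hypothetical classifier scores paired with true labels (1 = positive).
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30, 0.20, 0.10]
actual = [1, 1, 0, 1, 1, 0, 0, 0]


def confusion_at(threshold, scores, actual):
    """Return (tp, fp, fn, tn), predicting positive when score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, actual):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


for t in (0.70, 0.50, 0.25):
    tp, fp, fn, tn = confusion_at(t, scores, actual)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"t = {t:.2f}: precision = {precision:.2f}, recall = {recall:.2f}")
# t = 0.70: precision = 1.00, recall = 0.50
# t = 0.50: precision = 0.75, recall = 0.75
# t = 0.25: precision = 0.67, recall = 1.00
```

On this toy data, each step down in threshold raises recall and lowers precision.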

The Confusion Matrix

To truly evaluate a model, we look at four outcomes:

  • True Positive (TP): Predicted “Yes”, actually “Yes”. (Good)
  • True Negative (TN): Predicted “No”, actually “No”. (Good)
  • False Positive (FP): Predicted “Yes”, actually “No”. (Type I Error)
  • False Negative (FN): Predicted “No”, actually “Yes”. (Type II Error - The “Miss”)
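The four outcomes above can be tallied directly from a list of predictions and a list of true labels. A minimal sketch follows; the `outcome` helper and the six example labels are hypothetical.

```python
from collections import Counter


def outcome(predicted, actual):
    """Name the confusion-matrix cell for one prediction."""
    if predicted == "Yes":
        return "TP" if actual == "Yes" else "FP"
    return "FN" if actual == "Yes" else "TN"


# Hypothetical predictions and ground truth for six examples.
predicted = ["Yes", "Yes", "No", "No", "Yes", "No"]
actual = ["Yes", "No", "No", "Yes", "Yes", "No"]

counts = Counter(outcome(p, a) for p, a in zip(predicted, actual))
print(dict(counts))  # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```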

Better Metrics

1. Precision (Quality)

“Of all the times the model said ‘Yes’, how many were actually ‘Yes’?”

$$\text{Precision} = \frac{TP}{TP + FP}$$

High precision means few “False Alarms”.

2. Recall (Completeness)

“Of all the ‘Yes’ cases in the data, how many did the model find?”

$$\text{Recall} = \frac{TP}{TP + FN}$$

High recall means the model doesn’t “Miss” many cases.
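As a worked check, the precision and recall formulas can be applied to the counts from the threshold example at t = 0.50 (TP = 6, FP = 3, FN = 2, TN = 5). This is a small sketch; returning 0.0 when a denominator is zero is an assumed convention, not something the formulas define.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 if the model made no positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); 0.0 if the data contains no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Counts from the threshold example at t = 0.50.
tp, fp, fn, tn = 6, 3, 2, 5

accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"accuracy  = {accuracy:.0%}")           # accuracy  = 69%
print(f"precision = {precision(tp, fp):.0%}")  # precision = 67%
print(f"recall    = {recall(tp, fn):.0%}")     # recall    = 75%
```

The results match the percentages reported for that threshold.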

3. F1-Score (The Balance)

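The harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Because the harmonic mean is pulled toward the smaller of the two values, F1 is high only when both precision and recall are high. That makes it a useful single number for imbalanced datasets, where accuracy can be misleading.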
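To see why a harmonic mean is used rather than a plain average, here is a short sketch comparing the two. The `f1_score` helper and the example precision/recall values are hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced model: precision and recall are equal, so F1 equals both.
print(f"{f1_score(0.75, 0.75):.2f}")  # 0.75

# Lopsided model: perfect precision, but it finds only 10% of positives.
lopsided_f1 = f1_score(1.0, 0.10)
plain_average = (1.0 + 0.10) / 2
print(f"F1 = {lopsided_f1:.2f}, plain average = {plain_average:.2f}")
# F1 = 0.18, plain average = 0.55
```

The plain average hides the poor recall, while F1 stays close to the weaker of the two numbers.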

Generalization: Train vs Test

A model that memorizes its training data but fails on new data is overfitting. To detect this, we always split our data:

  • Training Set: Used to adjust the weights.
  • Test Set: A “hidden” set used only at the very end to see how the model performs on data it has never seen before.
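The split described above, and the gap it reveals, can be sketched with a deliberately extreme example. Everything here is an assumption made for illustration: the 80/20 ratio, the random labels (so there is no real pattern to learn), and the lookup-table "model" that only memorizes.

```python
import random

rng = random.Random(0)  # fixed seed so the split is repeatable

# Hypothetical dataset of (feature, label) pairs with purely random labels.
data = [(i, rng.choice([0, 1])) for i in range(200)]

# Shuffle, then hold out the last 20% as the test set.
rng.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# A "model" that memorizes: a lookup table of the training examples.
memory = dict(train)


def predict(x):
    """Return the memorized label, or fall back to 0 for unseen inputs."""
    return memory.get(x, 0)


def accuracy(examples):
    """Fraction of examples whose label the model predicts correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


print(f"train accuracy = {accuracy(train):.2f}")  # train accuracy = 1.00
print(f"test accuracy  = {accuracy(test):.2f}")   # expected to be near chance
```

A perfect training score next to a much lower test score is the signature of overfitting. The test score is only meaningful because none of the test examples were used to build `memory`.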

Takeaways

  • Accuracy is misleading for imbalanced data.
  • Precision drops when the model raises “false alarms”; recall drops when it “misses” real cases.
  • Overfitting is when a model memorizes its training data rather than learning patterns that carry over to new data.