Evaluating AI Models (Class 10)
Updated as of October 2025
After building an AI model, the next step is Evaluation. Evaluation is the process of testing an AI model to measure its reliability and performance. This process uses a new, unseen “test dataset” to compare the model’s predictions against the actual answers. This stage is the final check before a model is used for real tasks, like diagnosing diseases or filtering spam. This page explains how a model’s performance is measured quantitatively.
Testing Models: The Train-Test Split
A model should not be tested using the same data it was trained on. This is because the model might “memorize” the training set, leading to a perfect score that is deceptive. This error is known as overfitting.
Think of training data as practice problems and test data as the final, unseen exam. Overfitting occurs when a model memorizes the practice problems but fails to understand the underlying concepts. The model learns the training data “too well” and cannot generalize—or perform well—on new, unseen data. Underfitting is the opposite: the model is too simple and fails to learn the patterns, performing poorly on both training and test data.
The goal is a model with good generalization. The train-test split—separating data into a training set and a testing set—is the standard method to simulate a model’s performance in the real world.
| Model State | Performance on Training Data | Performance on Test Data | Analogy: The Student |
|---|---|---|---|
| Underfitting | Poor | Poor | Did not study. Fails both practice and the final exam. |
| Good Fit (Generalization) | Good | Good | Understood the concepts. Passes both practice and the final exam. |
| Overfitting | Excellent | Poor | Memorized the answers. Aces the practice but fails the final exam. |
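In practice, the split itself is usually a single library call. Below is a minimal sketch using scikit-learn's train_test_split; the toy data and the 80/20 split ratio are illustrative choices, not requirements.

```python
# A minimal train-test split sketch (assumes scikit-learn is installed).
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples with one feature each, plus their labels.
X = [[i] for i in range(10)]          # features
y = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]    # labels

# Hold back 20% of the data as the unseen "final exam" (the test set).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), "training samples,", len(X_test), "test samples")  # 8 and 2
```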
The Language of Evaluation: Prediction vs. Reality
Model evaluation for classification compares what the model predicted and what the reality was. We will use two scenarios: a spam filter (classifies emails as “Spam” or “Not Spam”) and a medical test (classifies a scan as “Has Disease” or “Healthy”).
Positive and Negative Class
In a binary classification problem (two outcomes), one outcome is the “Positive” (1) class and the other is the “Negative” (0) class. By convention, the Positive class is the “thing of interest,” often the rarer event. The Negative class is the “normal” or common event.
- Spam Filter: Positive (1) = “Spam.” Negative (0) = “Not Spam.”
- Medical Test: Positive (1) = “Has Disease.” Negative (0) = “Healthy.”
The Four Possible Outcomes
For every prediction, there are four possible outcomes. These are the building blocks of all evaluation metrics.
- True Positive (TP): Prediction: Positive ("Spam"). Reality: Positive ("Spam"). Result: Correct. The model correctly identified a spam email.
- True Negative (TN): Prediction: Negative ("Not Spam"). Reality: Negative ("Not Spam"). Result: Correct. The model correctly identified a legitimate email.
- False Positive (FP) (Type 1 Error): Prediction: Positive ("Spam"). Reality: Negative ("Not Spam"). Result: Incorrect. A "false alarm." The model incorrectly flagged a good email as spam.
- False Negative (FN) (Type 2 Error): Prediction: Negative ("Not Spam"). Reality: Positive ("Spam"). Result: Incorrect. A "dangerous miss." The model incorrectly allowed a spam email into the inbox.
A simple rule: The second word (Positive/Negative) is the model’s prediction. The first word (True/False) is the verdict on that prediction.
- False Positive (FP): The model predicted “Positive,” and that prediction was “False.”
- False Negative (FN): The model predicted “Negative,” and that prediction was “False.”
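The four outcomes can be counted directly from a list of actual labels and a list of predictions. The short sketch below uses plain Python with made-up labels (1 = Positive/"Spam", 0 = Negative/"Not Spam").

```python
# Counting the four outcomes from actual labels and model predictions.
# 1 = Positive ("Spam"), 0 = Negative ("Not Spam"); the labels are made up.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # correct "Spam"
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # correct "Not Spam"
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false alarm
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # dangerous miss

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)  # TP: 3 TN: 3 FP: 1 FN: 1
```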
The Confusion Matrix: A “Report Card” for Your AI
A Confusion Matrix is a table that visualizes a model’s performance by organizing the counts of TP, TN, FP, and FN into a grid. It shows if the system is “confusing” two classes.
The standard matrix has “Actual” values in the rows and “Predicted” values in the columns.
This matrix can be read quickly:
- The Main Diagonal (TN, TP): All correct predictions. These numbers should be high.
- The Off-Diagonal (FP, FN): All incorrect predictions (the “confusions”). These numbers should be low.
Worked Example: Spam Filter Matrix
A spam filter is tested on 100 emails. The matrix below summarizes the results.
| | Predicted: Not Spam (0) | Predicted: Spam (1) | Total Actual |
|---|---|---|---|
| Actual: Not Spam (0) | TN = 25 | FP = 55 | 80 |
| Actual: Spam (1) | FN = 10 | TP = 10 | 20 |
| Total Predicted | 35 | 65 | 100 |
Analysis: The model correctly caught 10 spam emails (TP) and let 25 good emails through (TN). However, it *missed* 10 spam emails (FN) and *incorrectly flagged 55 good emails* as spam (FP). This model creates a problem by filtering too many legitimate emails.
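The same matrix can be rebuilt in code. The sketch below assumes scikit-learn is available and reconstructs the 100 test emails from the counts in the table above.

```python
# Rebuilding the spam-filter confusion matrix with scikit-learn.
from sklearn.metrics import confusion_matrix

# Reconstruct the 100 test emails from the counts above (1 = Spam, 0 = Not Spam):
# 25 TN, 55 FP, 10 FN, 10 TP.
actual    = [0] * 25 + [0] * 55 + [1] * 10 + [1] * 10
predicted = [0] * 25 + [1] * 55 + [0] * 10 + [1] * 10

# Rows are actual values, columns are predicted values, in label order [0, 1].
print(confusion_matrix(actual, predicted, labels=[0, 1]))
# [[25 55]
#  [10 10]]
```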
Metric 1: Accuracy (The One You Know, and Why It’s a Trap)
Accuracy is the simplest metric. It answers the question: "Overall, how often was the model correct?"
Formula: Accuracy = Number of correct predictions / Total number of predictions
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Calculation (Spam Filter Example):
Accuracy = (10 + 25) / (10 + 25 + 55 + 10) = 35 / 100 = 35%
The Accuracy Trap: Why a 99% Accurate Model Can Be Useless
Accuracy is misleading on imbalanced datasets. An imbalanced dataset is one where one class is very common and the other is very rare (e.g., disease detection, bank fraud).
Consider a rare disease scenario on 1,000 people. Only 1 person has the disease (Positive), and 999 are healthy (Negative). A “lazy” model that predicts “Healthy” for everyone would have this matrix:
- TP = 0 (never predicted “Has Disease”)
- FN = 1 (the 1 sick person was missed)
- TN = 999 (all 999 healthy people were correctly predicted)
- FP = 0 (never predicted “Has Disease”)
Accuracy Calculation:
Accuracy = (0 + 999) / (0 + 999 + 0 + 1) = 999 / 1000 = 99.9%
This model has 99.9% accuracy but is 100% useless. It failed its only purpose: to find the 1 sick person. The high score is a “trap” because it only reflects the model’s “lazy” guess of the majority class. This is why more specialized metrics are needed.
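A few lines of Python make the trap visible; the labels below recreate the rare-disease scenario (1 = "Has Disease", 0 = "Healthy").

```python
# The "accuracy trap": a lazy model that always predicts "Healthy" (0).
actual    = [1] + [0] * 999   # 1 sick person, 999 healthy people
predicted = [0] * 1000        # the model never predicts "Has Disease"

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
accuracy = (tp + tn) / len(actual)

print(f"Accuracy: {accuracy:.1%}")       # 99.9% -- looks excellent
print(f"Sick people found (TP): {tp}")   # 0 -- useless for its real purpose
```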
Metrics 2 & 3: Precision vs. Recall (The Specialist Metrics)
Precision and Recall are used for imbalanced datasets. They provide a better look at performance by focusing on the Positive class.
Precision: The Metric of “Purity”
Question: “Of all the times the model predicted ‘Positive’, how often was it *actually* correct?” It measures the *purity* of the positive predictions.
Formula:
Precision = TP / (TP + FP)
(The denominator is the total number of times the model predicted “Positive”).
Calculation (Spam Filter Example):
Precision = 10 / (10 + 55) = 10 / 65 = 15.4%
Interpretation: When this filter predicts “Spam,” it is only correct 15.4% of the time. It is not precise.
When to Prioritize Precision
Prioritize Precision when the cost of a False Positive (FP) is high.
- Spam Filter: An FP is when a legitimate email (e.g., a job offer) is sent to the spam folder. The cost of missing this email is high. The model must be sure every email it flags is *actually* spam.
- Video Recommendations: An FP is a bad recommendation. If this happens too often, the user gets annoyed and leaves.
Recall: The Metric of “Completeness”
Question: “Of all the *actual ‘Positive’ cases* that exist, how many did the model successfully *find*?” This is also known as Sensitivity.
Formula:
Recall = TP / (TP + FN)
(The denominator is the total number of *actual* positive cases).
Calculation (Spam Filter Example):
Recall = 10 / (10 + 10) = 10 / 20 = 50%
Interpretation: This filter only *found* 50% of the total spam emails. It missed the other 50% (False Negatives).
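Both specialist metrics follow directly from the spam-filter counts. The sketch below plugs in TP = 10, FP = 55, FN = 10 from the matrix above.

```python
# Precision and Recall for the spam-filter example (TP = 10, FP = 55, FN = 10).
tp, fp, fn = 10, 55, 10

precision = tp / (tp + fp)   # purity: how many flagged emails were really spam
recall    = tp / (tp + fn)   # completeness: how much of the real spam was found

print(f"Precision: {precision:.1%}")  # 15.4%
print(f"Recall:    {recall:.1%}")     # 50.0%
```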
When to Prioritize Recall
Prioritize Recall when the cost of a False Negative (FN) is high.
- Medical Diagnosis: An FN is when a sick patient is told they are “Healthy.” The cost is catastrophic; the patient does not get treatment. One would rather have some FPs (telling a healthy person to get more tests) than one FN.
- Bank Fraud Detection: An FN is when a fraudulent transaction is approved. The cost is stolen money. The model must *catch* all fraudulent transactions.
| Scenario | Positive Class | Costly Error | Why this error is bad (The Cost) | Metric to Prioritize |
|---|---|---|---|---|
| Medical Diagnosis | "Has Disease" | False Negative (FN) | A sick patient is missed and does not get treatment. | RECALL |
| Spam Filter | "Is Spam" | False Positive (FP) | An important email is lost in the spam folder. | PRECISION |
| Bank Fraud | "Is Fraud" | False Negative (FN) | A fraudulent charge is approved; money is stolen. | RECALL |
| Video Recommendations | "Good Recommendation" | False Positive (FP) | A bad video is recommended; the user leaves. | PRECISION |
The Great Trade-Off: You Can’t Always Have Both
In most models, there is an inverse relationship between Precision and Recall: as one increases, the other tends to decrease. This is the Precision-Recall Trade-off.
This is controlled by the model’s classification threshold. Most models output a probability score (e.g., “75% likely to be spam”). The developer sets a “cutoff point” (the threshold) for the final “Yes” or “No” decision.
- To Get High Recall (Medical Test): Set a LOW threshold (e.g., 10%). The model flags a scan as “Positive” if it is *even 10% sure*. This catches all sick patients (High Recall) but also flags many healthy people (Low Precision).
- To Get High Precision (Spam Filter): Set a HIGH threshold (e.g., 95%). The model flags an email as “Positive” only if it is *95% or more sure*. This flags only definite spam (High Precision) but misses many other spam emails (Low Recall).
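The sketch below illustrates the trade-off with made-up probability scores: sweeping the threshold from low to high pushes Precision up and Recall down.

```python
# Sweeping the classification threshold on made-up probability scores.
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.90, 0.95]
labels = [0,    0,    1,    0,    1,    0,    1,    1,    1,    1]

def precision_recall(threshold):
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for t in (0.10, 0.50, 0.90):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.10  precision=0.67  recall=1.00
# threshold=0.50  precision=0.83  recall=0.83
# threshold=0.90  precision=1.00  recall=0.33
```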
Metric 4: The F1 Score (The “Balance” Metric)
When a balance between Precision and Recall is needed, the F1 Score is used. It is the Harmonic Mean of Precision and Recall.
Formula:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean is used because it *punishes* extreme, unbalanced scores. The F1 Score is always pulled down toward the *weaker* of the two metrics.
Calculation (Spam Filter Example):
(Using Precision = 0.154 and Recall = 0.50)
F1 Score = 2 * (0.154 * 0.50) / (0.154 + 0.50) = 0.154 / 0.654 = 23.5%
The low F1 Score confirms the model has a poor balance.
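A small helper can compute all four metrics from the confusion-matrix counts at once; the function below is an illustrative sketch, checked against the spam-filter numbers.

```python
# All four metrics from the four confusion-matrix counts.
def evaluate(tp, tn, fp, fn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Spam-filter example: TP = 10, TN = 25, FP = 55, FN = 10.
acc, prec, rec, f1 = evaluate(tp=10, tn=25, fp=55, fn=10)
print(f"Accuracy={acc:.1%}  Precision={prec:.1%}  Recall={rec:.1%}  F1={f1:.1%}")
# Accuracy=35.0%  Precision=15.4%  Recall=50.0%  F1=23.5%
```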
Chapter in Action: A Full Worked Example
Problem: An AI model predicts heart attack risk. “Positive” (1) = “High Risk,” “Negative” (0) = “Low Risk.” The model is tested on 100 patients. Calculate and interpret all four metrics.
Confusion Matrix: Heart Attack Risk

| | Predicted: Low Risk (0) | Predicted: High Risk (1) | Total Actual |
|---|---|---|---|
| Actual: Low Risk (0) | TN = 20 | FP = 20 | 40 |
| Actual: High Risk (1) | FN = 10 | TP = 50 | 60 |
| Total Predicted | 30 | 70 | 100 |
Step-by-Step Solution
- Accuracy (Overall Correctness): (50 + 20) / 100 = 70 / 100 = 70%
- Precision (Purity of Positive Predictions): 50 / (50 + 20) = 50 / 70 = 71.4%
- Recall (Completeness in Finding Positives): 50 / (50 + 10) = 50 / 60 = 83.3%
- F1 Score (Balanced Metric): 2 * (0.714 * 0.833) / (0.714 + 0.833) ≈ 0.769 = 76.9%
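The same arithmetic can be verified in a few lines of Python, plugging in the counts from the heart-attack matrix.

```python
# Checking the heart-attack example: TP = 50, TN = 20, FP = 20, FN = 10.
tp, tn, fp, fn = 50, 20, 20, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.1%}")   # 70.0%
print(f"Precision: {precision:.1%}")  # 71.4%
print(f"Recall:    {recall:.1%}")     # 83.3%
print(f"F1 Score:  {f1:.1%}")         # 76.9%
```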
Final Interpretation
To evaluate this, we must ask: What is the cost of each error?
- False Positives (FP = 20): 20 “Low Risk” patients were told they are “High Risk.” This causes anxiety and unnecessary tests.
- False Negatives (FN = 10): 10 “High Risk” patients were told they are “Low Risk.” This error is catastrophic. These patients will not seek treatment.
Conclusion: For a medical problem, the cost of a False Negative is far higher. Therefore, Recall is the most important metric. The 10 missed cases (FN=10) are the model’s biggest failure. The primary goal must be to increase Recall.
Q&A: Test Your Knowledge
Q1: What is overfitting?
Hint: Think about memorizing practice problems versus understanding concepts for a final exam.
Answer: Overfitting is when an AI model learns its training data “too well,” including its noise and specific details. This causes the model to perform very well on the training data but fail to generalize and perform poorly on new, unseen test data. It’s like memorizing answers instead of learning the subject.
Q2: When is Accuracy a bad or misleading metric?
Hint: Consider a situation where 99% of the data belongs to one class (e.g., “Healthy”) and 1% belongs to another (e.g., “Sick”).
Answer: Accuracy is misleading on imbalanced datasets. If a dataset has 999 “Healthy” people and 1 “Sick” person, a lazy model that just predicts “Healthy” every time will be 99.9% accurate, but it will be useless because it fails to find the one case it was built to detect.
Q3: You are building a model to detect shoplifters. What is the most important metric to maximize?
Hint: What is the cost of a False Positive (accusing an innocent customer) versus a False Negative (missing an actual shoplifter)?
Answer: This is tricky, but most would argue for Precision.
- A False Negative (missing a shoplifter) has a cost: stolen goods.
- A False Positive (accusing an innocent customer) has a very high cost: brand damage, potential lawsuits, and a terrible customer experience.
Therefore, the model must be very *precise*. When it flags someone, it needs to be correct. You would rather miss a few shoplifters (lower Recall) than accuse innocent people (lower Precision).
Frequently Asked Questions (FAQs)
Can a model have 100% Precision and 100% Recall?
Yes, but it is extremely rare in real-world applications. A “perfect” model that makes no mistakes (FP = 0 and FN = 0) would have 100% in all four metrics. In practice, data is messy, and models are imperfect. There is almost always a trade-off.
What is the difference between a Type 1 Error and a Type 2 Error?
They are statistical terms for the two types of errors:
- Type 1 Error = False Positive (FP): A “false alarm.” You predict something *is* present when it *is not*. (e.g., A fire alarm rings, but there is no fire).
- Type 2 Error = False Negative (FN): A “miss.” You predict something *is not* present when it *is*. (e.g., A fire starts, but the alarm fails to ring).
Why is the F1 Score a “harmonic mean” and not a simple average?
A simple average ( (Precision + Recall) / 2 ) can be misleading. If a model has 100% Precision but 0% Recall, the simple average is 50%, which looks “okay.” The harmonic mean *punishes* this imbalance. The F1 Score for 100% Precision and 0% Recall would be 0%, which correctly reflects that the model is failing at one of its key tasks.
Beyond Binary: Evaluating Multi-Class Classification
The concepts so far focus on binary classification (Spam/Not Spam). Multi-class classification involves three or more categories (e.g., classifying news as “Sports,” “Weather,” or “Politics”).
In this case, the confusion matrix becomes larger. For a 3-class problem, it would be a 3×3 matrix. “True Positives” still run along the diagonal, and all other cells represent errors.
| | Predicted: Sports | Predicted: Weather | Predicted: Politics |
|---|---|---|---|
| Actual: Sports | TP (Sports) | Error | Error |
| Actual: Weather | Error | TP (Weather) | Error |
| Actual: Politics | Error | Error | TP (Politics) |
To calculate Precision and Recall for the whole model, you can use two main methods:
- Macro-Averaging: Calculate Precision and Recall for *each class* individually (e.g., Precision for “Sports,” Precision for “Weather”) and then find the simple average of those scores. This treats all classes as equally important, even if one is rare.
- Micro-Averaging: Sum all individual True Positives, False Positives, and False Negatives from all classes to get one “giant” TP, FP, and FN count, then calculate one overall Precision and Recall. This method gives more weight to the more common classes.
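A short sketch of both averaging methods, assuming scikit-learn is available; the three-class labels are made up, and the `average` argument switches between macro and micro.

```python
# Macro vs. micro averaging for a 3-class problem (labels are illustrative).
from sklearn.metrics import precision_score, recall_score

# 0 = Sports, 1 = Weather, 2 = Politics
actual    = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
predicted = [0, 0, 0, 1, 1, 2, 2, 2, 2, 0]

print("Macro precision:", precision_score(actual, predicted, average="macro"))  # ~0.67
print("Micro precision:", precision_score(actual, predicted, average="micro"))  # 0.70
print("Macro recall:   ", recall_score(actual, predicted, average="macro"))     # ~0.67
print("Micro recall:   ", recall_score(actual, predicted, average="micro"))     # 0.70
```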
Evaluating Different AI Problems: Regression
Not all AI models classify. Regression models predict a continuous numerical value, not a class. For example: predicting a house price, tomorrow’s temperature, or the number of sales.
For these models, we cannot use Accuracy or a confusion matrix. Instead, we measure the *error*—the distance between the model’s prediction and the actual value.
| Metric | Full Name | Interpretation |
|---|---|---|
| MAE | Mean Absolute Error | The average absolute difference between predictions and actual values. It is easy to understand and is in the same unit as the output (e.g., “the model is off by an average of 5.5 degrees”). |
| MSE | Mean Squared Error | The average of the *squared* differences. This metric heavily punishes large errors (e.g., being off by 10 is 100x worse than being off by 1). It is useful when large errors are very undesirable. |
| RMSE | Root Mean Squared Error | The square root of the MSE. It brings the metric back into the original units (like MAE) while still retaining the property of punishing large errors more. It is one of the most common metrics for regression. |
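A minimal sketch of the three error metrics for a temperature-prediction example; the actual and predicted values are invented for illustration.

```python
# MAE, MSE and RMSE for a temperature-prediction model (values are made up).
actual    = [30.0, 32.0, 28.0, 35.0, 31.0]   # actual temperatures (degrees C)
predicted = [28.0, 33.0, 25.0, 36.0, 30.0]   # model's predictions (degrees C)

errors = [p - a for p, a in zip(predicted, actual)]

mae  = sum(abs(e) for e in errors) / len(errors)   # average miss, original units
mse  = sum(e ** 2 for e in errors) / len(errors)   # squared units, punishes big misses
rmse = mse ** 0.5                                  # back to original units

print(f"MAE:  {mae:.2f}")   # 1.60
print(f"MSE:  {mse:.2f}")   # 3.20
print(f"RMSE: {rmse:.2f}")  # 1.79
```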
Module: Metrics Are Not the Whole Story
A high F1 score or a low RMSE does not automatically mean a model is “good.” Evaluation must also consider real-world factors that numerical metrics do not capture.
- Fairness and Bias: Does the model perform equally well for all groups of people? A medical model that is 95% accurate for one demographic but only 70% accurate for another is a biased and unfair model, even if its overall accuracy is high.
- Computational Cost: How much processing power and time does the model need to make a prediction (this is called “inference time”)? A model for a self-driving car must be extremely fast, while a model for a weekly report can be slower.
- Interpretability: Can humans understand *why* the model made its decision? For a loan application model, a bank needs to explain *why* an applicant was denied. A “black box” model that just says “No” is often unusable.
Next Steps: Where to Go From Here
Understanding evaluation is the first step. Now, see how these metrics are applied in different AI domains and continue your journey in data science.