Best Practices for Model Evaluation in Supervised Learning and Classification

Introduction

No matter why we use Machine Learning, there's one common goal: we want our models to make good predictions. After fitting our model to training data, we want to know if it will generalize well to new, unseen data and not just memorize the training data.

This article will discuss basic terms and techniques in Model Evaluation.

Performance estimation and Model Selection

Estimating the performance of a model usually involves the following steps:

  1. Feed the training data to the learning algorithm to fit a model.

  2. Make predictions on data from the test set.

  3. Count the number of wrong predictions on the test dataset to compute the model's prediction accuracy.

Ideally, the estimated performance of a model tells us how well it will perform on unseen data, which is often the main problem we want to solve in applications of machine learning.
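As a minimal sketch of these three steps, the snippet below trains a classifier on a training split, makes predictions on the test split, and counts the wrong predictions to estimate accuracy. The dataset and classifier used here (Iris and k-nearest neighbors from scikit-learn) are just assumptions for the sake of the example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Feed the training data to the learning algorithm
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
clf = KNeighborsClassifier().fit(X_train, y_train)

# 2. Make predictions on the test set
y_pred = clf.predict(X_test)

# 3. Count the wrong predictions to estimate accuracy
n_wrong = (y_pred != y_test).sum()
accuracy = 1 - n_wrong / len(y_test)
print(f"Estimated accuracy: {accuracy:.3f}")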

Running a learning algorithm over the training dataset involves experimenting with hyperparameters. Every time we change them, we get a different model. Besides tweaking hyperparameters of a specific learning algorithm, we are often interested in selecting the best-performing algorithm in terms of predictive and computational performance.

To summarize:

  • We want to estimate the generalization performance of our model on unseen data.

  • We want to increase this performance by tweaking the learning algorithm and selecting the best model from a given hypothesis space.

  • We want to identify the best algorithms suited for our problem, compare them, and select the best-performing one as well as the best-performing model from the algorithm's hypothesis space.

Note: Biased performance estimates are okay in model selection or algorithm selection if the bias affects all models equally. In other words, when comparing models we only care about their "relative" performance.
If, for example, the performance estimates are pessimistically biased and we underestimate every performance by 15 percentage points, the ranking order of our models does not change:

Model ranking    Estimated performance    Biased performance
Model 1          90%                      75%
Model 2          80%                      65%
Model 3          60%                      45%

Some assumptions

This part might seem a bit mathematical or theoretical to some, but I find it important for the discussion that follows.

We assume that the training examples are i.i.d (independent and identically distributed), which means that all examples have been drawn from the same probability distribution and are statistically independent of each other.
Statistical independence will be explained in a later section, where the context makes it easier to follow.

We focus on prediction accuracy, defined as the number of all correct predictions divided by the number of examples in the dataset.
We define prediction accuracy as:

$$\text{ACC} = 1 - \text{ERR}$$

The prediction error is computed as the expected value of the 0-1 loss over the \(n\) examples in a dataset \(S\):

$$\text{ERR}_S=\frac{1}{n}\sum_{i=1}^{n}L(\hat{y}_i, y_i)$$

The 0-1 loss is defined as:

$$L(\hat{y}_i, y_i) = \begin{cases} 0 & \text{if } \hat{y}_i = y_i \\ 1 & \text{if } \hat{y}_i \ne y_i \end{cases} \quad$$

Although this is not central to the article, it is worth stating formally: a model with good generalization performance maximizes the prediction accuracy or, equivalently, minimizes the probability of making a wrong prediction:

$$C(h) = \Pr_{(\mathbf{x}, y)\sim \mathcal{D}} [h(\mathbf{x}) \ne y]$$

\(\mathcal{D}\) is the generating distribution the dataset has been drawn from, \(\mathbf{x}\) is the feature vector of a training example with class label \(y\).

Finally, we define the Kronecker delta as \(\delta(L(\hat{y}_i, y_i)) = 1 - L(\hat{y}_i, y_i)\), which equals 1 for a correct prediction and 0 otherwise.
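To make these definitions concrete, here is a minimal sketch that computes the 0-1 loss, the error, and the accuracy exactly as defined above. The label arrays are made up purely for illustration.

import numpy as np

# Hypothetical true labels and predictions for illustration
y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 2, 2, 1, 0, 1])

zero_one_loss = (y_pred != y_true).astype(int)  # L(y_hat_i, y_i)
err = zero_one_loss.mean()                      # ERR_S = (1/n) * sum of losses
acc = 1 - err                                   # ACC = 1 - ERR
delta = 1 - zero_one_loss                       # Kronecker delta: 1 if correct, 0 otherwise

print(err, acc)  # 0.333..., 0.666...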

Resubstitution Validation and the Holdout Method

The holdout method is the simplest estimation technique and is familiar to everyone interested in machine learning. We split the dataset into two parts: a training set and a test set. We train our model on the training set, evaluate it on the test set, and record the number of correct predictions. The fraction of correct predictions out of the total number of test samples gives us the estimated accuracy of our model.

Resubstitution validation, or resubstitution evaluation, involves training and evaluating our model on the same training dataset. This method usually introduces an optimistic bias, making it hard to determine whether the model will perform well on new, unseen data or has simply memorized the training data.
However, comparing the resubstitution (training) accuracy with the holdout (test) accuracy lets us quantify the optimism bias as the difference between the two.
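A minimal sketch of this comparison, assuming an arbitrary scikit-learn classifier (a decision tree here) and the Iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = clf.score(X_train, y_train)  # resubstitution accuracy (optimistic)
test_acc = clf.score(X_test, y_test)     # holdout accuracy
optimism_bias = train_acc - test_acc     # estimated optimism bias

print(f"train={train_acc:.3f}, test={test_acc:.3f}, bias={optimism_bias:.3f}")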

Splitting the dataset into training and test sets is a process of random subsampling. This assumes that, for each class, all data points have been drawn from the same probability distribution. In simpler terms, training and test examples come from the same general pool, so on average both sets should reflect the same class proportions. However, there are two issues with this approach, which we discuss next.

Stratification

Let us consider the Iris dataset, which has a total of 150 samples and consists of 3 flower species distributed uniformly:

  • 33.3% Setosa [50 examples]

  • 33.3% Versicolor [50 examples]

  • 33.3% Virginica [50 examples]

We will divide this dataset randomly into 70% training data and 30% test data. However, when we do this, we violate the assumption of statistical independence: the split is drawn without replacement, so once an example has been assigned to the training set, the class proportions of the remaining pool change and the draws are no longer independent.
As a result, the class proportions in the training and test sets can deviate noticeably from the original ones, as the following example shows.

from mlxtend.data import iris_data
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = iris_data()
# Non-stratified 70/30 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=108)
# Count examples per class in each subset
print(np.bincount(y_train))
print(np.bincount(y_test))

And we get the following output:

[26 36 43]
[24 14 7]

We have created two imbalanced datasets. The class ratio the algorithm will use to learn a model is 24.7% / 34.3% / 41.0%. The test set is also imbalanced, and even worse, it is imbalanced in the opposite direction: 53.3% / 31.1% / 15.6%. This is shown in the figure below.
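Continuing from the snippet above, these ratios can be computed directly from the class counts:

print(np.bincount(y_train) / len(y_train))  # approx. [0.247, 0.343, 0.410]
print(np.bincount(y_test) / len(y_test))    # approx. [0.533, 0.311, 0.156]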

It could be even worse if the dataset were highly imbalanced to begin with; in the worst-case scenario, the test set might not contain any instance of the minority class at all. The remedy is stratification, which means that we keep the original class proportions when dividing the dataset into training and test sets.

X, y = iris_data()
# Stratified 70/30 split keeps the original class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    stratify=y)
print(np.bincount(y_train))
print(np.bincount(y_test))

And we get the expected output:

[35 35 35]
[15 15 15]

Figure 2

When working with large, balanced datasets, random subsampling in a non-stratified fashion is usually not a big concern. Still, stratification is easy to implement and usually beneficial in machine learning applications.

Pessimistic Bias

This problem mostly affects small datasets, or very complex models that need to be trained on a larger amount of data. A model's capacity is a measure of its complexity and flexibility. If the model has not reached its capacity, its estimated generalization performance will be pessimistically biased, since the algorithm could have learned a better model from the withheld data. This issue is unavoidable as long as we hold back a portion of the data to assess predictive performance. In practice, after deciding which model to use, we can still refit it on the entire dataset and potentially obtain a better model; the dilemma is that we then have no unseen data left to estimate the performance of this final model.
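A minimal sketch of this common practice, assuming a generic scikit-learn estimator: evaluate on a holdout split first, then refit the chosen model on all available data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)

# Estimate generalization performance on the holdout set (pessimistically biased)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
holdout_acc = model.score(X_test, y_test)

# Refit on the entire dataset: potentially a better model,
# but we no longer have unseen data to evaluate it on
final_model = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Holdout accuracy of the selected model: {holdout_acc:.3f}")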

Confidence Intervals

Besides using the holdout method to estimate generalization performance, it is beneficial to put a confidence interval around this estimate. A simple approach is the normal approximation: we treat each prediction as a Bernoulli trial, so the number of correct predictions follows a binomial distribution, which can be approximated by a normal distribution when the test set is reasonably large. The confidence interval can then be computed as:

$$\text{ACC}_S\pm z\sqrt{\frac{1}{n}\text{ACC}_S(1 - \text{ACC}_S)}$$

Here, as before, \(\text{ACC}_S = \frac{1}{n}\sum_{i=1}^{n}\delta(L(\hat{y}_i, y_i))\), \(n\) is the number of test examples, \(\alpha\) is the error level, and \(z\) is the \(1-\frac{\alpha}{2}\) quantile of the standard normal distribution. For a 95% confidence interval we have \(\alpha = 0.05\) and \(z = 1.96\).
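Before moving to the full experiment below, here is a minimal helper implementing exactly this interval, where n is the number of examples in the test set. The accuracy of 0.88 and n = 45 are made-up values for demonstration.

import numpy as np

def normal_approx_ci(acc, n, z=1.96):
    # Normal-approximation confidence interval for an accuracy
    # estimated from n test examples
    half_width = z * np.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

# Hypothetical example: accuracy of 0.88 measured on 45 test examples
print(normal_approx_ci(0.88, 45))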

The code below estimates the 95% confidence interval for the predictive accuracy of a logistic regression model, using only the first two features of the Iris dataset as input. It also clearly shows the effect that stratification has on the model's predictive performance.

from sklearn.linear_model import LogisticRegression

def compute_accuracy(stratified=False, random_state=None):
    # Use only the first two features of the Iris dataset
    X, y = iris_data()
    X = X[:, :2]
    stratify = y if stratified else None
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=stratify, random_state=random_state)
    logreg = LogisticRegression(C=1e5)  # large C essentially removes regularization
    logreg.fit(X_train, y_train)
    pred = logreg.predict(X_test)
    # Fraction of correct predictions on the test set
    return np.mean(pred == y_test)

num_experiments = 30
z = 1.96  # z-value for a 95% confidence interval

# Non-stratified splits
accuracies = [compute_accuracy(random_state=185) for _ in range(num_experiments)]
mean_acc = np.mean(accuracies)
n = len(accuracies)

var_acc = (1/n) * mean_acc * (1 - mean_acc)
std_acc = np.sqrt(var_acc)

ci_lower = mean_acc - z * (std_acc / np.sqrt(n))
ci_upper = mean_acc + z * (std_acc / np.sqrt(n))

# Stratified splits
accuracies_s = [compute_accuracy(stratified=True, random_state=185)
                for _ in range(num_experiments)]
mean_acc_s = np.mean(accuracies_s)
n_s = len(accuracies_s)

var_acc_s = (1/n_s) * mean_acc_s * (1 - mean_acc_s)
std_acc_s = np.sqrt(var_acc_s)

ci_lower_s = mean_acc_s - z * (std_acc_s / np.sqrt(n_s))
ci_upper_s = mean_acc_s + z * (std_acc_s / np.sqrt(n_s))

print(f"Mean accuracy: {mean_acc}")
print(f"95% confidence interval: ({ci_lower}, {ci_upper})")
print(f"Mean accuracy (stratified): {mean_acc_s}")
print(f"95% confidence interval (stratified): ({ci_lower_s}, {ci_upper_s})")

Output:

Mean accuracy: 0.7111111111111114
95% confidence interval: (0.6814990274173577, 0.740723194804865)
Mean accuracy (stratified): 0.8444444444444446
95% confidence interval (stratified): (0.8207654573308237, 0.8681234315580654)

Summary

This post explained simple but useful ways to estimate a model's performance and why we want to estimate it in the first place. The main methods covered were the holdout method and resubstitution validation. We also discussed two related issues, stratification and pessimistic bias due to model capacity, and showed how confidence intervals can give a better picture of the estimated predictive accuracy.
