Model Evaluation in Machine Learning


Machine Learning is growing at a very fast pace and has become an integral part of almost every business. Machine Learning revolves around making predictions, and those predictions must be accurate to create real value for an organization or business. Training a model is therefore not just a step to tick off: a poorly trained model is likely to make incorrect predictions on future samples. Thus, it becomes essential to evaluate a trained model before releasing it into production.

In this article, I will discuss the various model evaluation techniques you can use to assess your model and decide whether to rely on the predictions it makes.

Model Evaluation Techniques

Model evaluation is vital in measuring the performance of a machine learning model, as it assesses how accurate the predictions made by the trained model are. There are two broad kinds of machine learning problems, regression and classification, and the evaluation method differs according to the type of problem you are trying to solve.

On a broader scale, we can divide model evaluation techniques into two categories: holdout and cross-validation. Both methods evaluate the model on test data it did not see during training. If we evaluated the model on the same data used to train it, the model could simply memorize that data and score close to 100%, giving an overly optimistic picture of an overfit model. Hence the model is always trained on one data set, while predictions and evaluation are performed on a separate data set.


In the holdout method, the data is divided into three subsets:

· Training set: the subset of data used to train the model.

· Validation set: another subset of data used to assess the performance of the trained model. This set also helps in selecting the best-performing model by fine-tuning the model's parameters.

· Test set: a final subset of unseen data used to estimate the future performance of the model.

The main idea of the holdout method is to use different subsets of the data for training and testing.
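The holdout split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation; the 60/20/20 ratios and the function name are my own choices for the example:

```python
import random

def holdout_split(data, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the data and split it into train/validation/test subsets.

    The ratios are illustrative; the test set takes whatever remains
    after the training and validation fractions are carved off.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder becomes the test set
    return train, val, test

# Example: split 100 samples 60/20/20
train, val, test = holdout_split(list(range(100)))
```

In practice you would typically use a library routine (for example, scikit-learn provides `train_test_split`), but the principle is the same: each sample lands in exactly one of the three subsets.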


The cross-validation technique of model evaluation involves dividing the data into a training set, used to train the model, and an independent set, used to assess the model. The most common method under cross-validation is k-fold cross-validation. In the k-fold technique, the data is partitioned into k equal-sized subsets, where the value of k is chosen by the user and typically ranges from 5 to 10. One of the k subsets acts as the test set and the remaining k-1 subsets are used as training data. This process is repeated k times, so that each subset serves as the test set exactly once, and all the errors are averaged to estimate the overall effectiveness of the model.
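The k-fold index rotation described above can be sketched as follows. This is a bare-bones illustration of the splitting logic only (no shuffling, no model training); the function name is my own:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Each sample appears in exactly one test fold; the remaining
    k-1 folds form the training set for that round.
    """
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Example: 5-fold split of 10 samples -> 5 rounds, each with 8 train / 2 test
folds = list(k_fold_indices(10, 5))
```

In each round you would fit the model on `train_idx`, score it on `test_idx`, and finally average the k scores. Libraries such as scikit-learn offer this as `KFold`, usually with optional shuffling.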

Model Evaluation Metrics

There is a variety of metrics available for model evaluation, and the choice of metric depends on the type of data and the type of machine learning task, e.g. regression, classification, or clustering, whether supervised or unsupervised. Precision and recall are among the most commonly used metrics. Here I will discuss the metrics related to supervised machine learning, which fall into two groups: those used for classification problems and those used for regression problems.

Metrics used for Classification problems:

1. Confusion Matrix:

The confusion matrix breaks the model's predictions down into correctly classified and incorrectly classified samples. For data with N classes, an N×N confusion matrix is generated.

As an example, for a data set of patients, heart disease can be predicted as positive or negative. The confusion matrix can then be represented as a 2×2 matrix.

Terms related to Confusion Matrix:

True Positive — the number of patients who actually have heart disease and are correctly predicted as positive.

False Negative — the number of patients who actually have heart disease but are predicted as negative.

False Positive — the number of patients who do not have heart disease but are predicted as positive.

True Negative — the number of patients who do not have heart disease and are correctly predicted as negative.

Type I Error — a False Positive represents a Type I error, where the model predicts heart disease for a patient who actually has none.

Type II Error — a False Negative represents a Type II error, where the model predicts no heart disease for a patient who is actually suffering from it.

In this example, a Type II error is clearly more dangerous than a Type I error, since a sick patient would go undetected.
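The four counts above can be computed directly from paired label lists. A minimal sketch for the binary heart-disease example (labels and helper name are illustrative, with 1 = disease, 0 = healthy):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = disease, 0 = healthy)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn

# Toy predictions for six patients
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)  # (2, 1, 1, 2)
```

Here the one False Negative (a sick patient predicted healthy) is the Type II error the text warns about.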

2. Classification Accuracy:

Accuracy is another way of evaluating a model on classification problems. It is the ratio of correct predictions to the total number of predictions.

Classification accuracy works well only when the classes in the dataset are balanced. When classes are imbalanced, or when the cost of a misclassification is high, accuracy can be misleading and should not be the only approach to model evaluation.
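The imbalance pitfall is easy to demonstrate. In this sketch (the data is invented for illustration), a model that always predicts the majority class still reaches 95% accuracy while never detecting a single positive case:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Imbalanced data: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A useless model that always predicts the majority class
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)  # 0.95, despite missing every positive
```

This is exactly why accuracy alone is not enough on imbalanced data: the 95% score hides a recall of zero on the positive class.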

3. AUC-ROC Curve:

AUC-ROC is another measure of model accuracy in the case of binary classifiers. AUC (Area Under the Curve) represents the degree of separability: the higher the AUC, the better the model is at distinguishing between the classes. The ROC curve is plotted between the True Positive Rate (TPR), also called Sensitivity, and the False Positive Rate (FPR), also called 1-Specificity.
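AUC has a useful equivalent interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch using that rank-based definition (the function name and toy scores are illustrative; libraries compute the same value by integrating the ROC curve):

```python
def auc_roc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win,
    matching the trapezoidal area under the ROC curve.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation: every positive scores above every negative -> AUC = 1.0
auc = auc_roc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2])
```

An AUC of 0.5 corresponds to random guessing, while 1.0 means the classifier's scores separate the two classes perfectly.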

4. F-measure or F Score:

The F Score is another measure of model accuracy that combines Precision and Recall as their harmonic mean. Precision can be defined as the number of correct positive results divided by the total number of predicted positive results.

Recall can be defined as the number of correct positive results divided by the total number of actual positives.
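These two definitions, and the F1 score as their harmonic mean, can be sketched directly from the confusion-matrix counts (the function name and toy labels are illustrative):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (F1) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)   # correct positives / predicted positives
    recall = tp / (tp + fn)      # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: tp=2, fp=1, fn=1
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two values, so a model cannot score well by excelling at precision while neglecting recall, or vice versa.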

Regression Metrics

Regression metrics are used to measure model accuracy on regression problems. Mean Absolute Error and Root Mean Square Error are the two most widely used.

1. Mean Absolute Error (MAE):

Mean Absolute Error is the mean of the absolute differences between the original and predicted values. MAE measures how far, on average, the predicted values deviate from the original values.
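As a minimal sketch of that definition (the toy values are invented for illustration):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Absolute errors are 1.0, 0.0, 2.0 -> MAE = 1.0
mae = mean_absolute_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

MAE is expressed in the same units as the target variable, which makes it easy to interpret directly.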

2. Root Mean Square Error (RMSE):

Root Mean Square Error is similar to Mean Absolute Error, with the difference that RMSE is the square root of the mean of the squared differences between the predicted and original values.
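A sketch of RMSE on the same pattern (toy values again invented for illustration):

```python
import math

def root_mean_square_error(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Squared errors are 1.0, 0.0, 4.0 -> RMSE = sqrt(5/3) ~= 1.291
rmse = root_mean_square_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```

Because the errors are squared before averaging, RMSE penalizes large errors more heavily than MAE does, which is why RMSE is never smaller than MAE on the same predictions.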


All in all, these are the model evaluation metrics most commonly used in machine learning, covering both classification and regression problems. I hope this information helps you learn the concepts at a basic level; you can explore each of them separately to understand them in more detail.


