Honest Evaluation of Classification Models


Authors

Summary

Supervised classification is an area of machine learning that has attracted growing interest in recent years. The literature contains many proposals for classification paradigms and learning algorithms that can be applied to specific classification tasks. Honest classifier evaluation and fair comparison among classification models are therefore essential both to draw correct conclusions from the results obtained and to choose the best model or paradigm. However, many researchers focus their work on proposing new classification algorithms while setting the honest evaluation of the results aside.

This tutorial, hosted at the Second Asian Conference on Machine Learning (ACML 2010), presents an overview of performance evaluation methodologies for classifiers. It is organized in four parts.

In the first part, we introduce the classification problem and motivate the importance of an honest validation of classification models and of model comparison.

The second part is devoted to the scores that can be used to measure the goodness of a classifier. The classification error is the most studied and also the most commonly used score; however, other scores may be of interest in certain application domains.

The third part covers estimation methods. We present and motivate the problem of estimating the value of a score for a classifier given a (finite) data set, and we elaborate on different estimation methods, their properties, and their application domains.

Finally, the fourth part of the tutorial is dedicated to classifier comparison. In this part, we introduce statistical hypothesis testing and the different types of statistical tests that can be used to compare two or more classification models over one or more data sets.

Additionally, each part presents conclusions and recommendations for performing honest classifier evaluation according to the specific characteristics of the problem or data set, as well as general best practices in classifier evaluation, so that fair conclusions can be drawn from the results.
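To make the estimation and comparison steps above concrete, the following is a minimal sketch of the kind of pipeline the tutorial discusses: estimating a score (the classification error) by k-fold cross-validation and collecting the per-fold paired differences that a statistical test would then analyze. The toy data set, the two classifiers, and the choice of k = 10 are illustrative assumptions, not material taken from the tutorial.

```python
# Hedged sketch (assumptions: toy 1-D data, two simple classifiers, k = 10).
# Estimates classification error via k-fold cross-validation and computes the
# per-fold paired differences that would feed a paired statistical test.
import random

random.seed(0)

# Toy two-class data: class 0 centered at -1, class 1 centered at +1.
data = [(random.gauss(-1, 1), 0) for _ in range(100)] + \
       [(random.gauss(+1, 1), 1) for _ in range(100)]
random.shuffle(data)

def threshold_clf(train):
    # Classifier A: predict class 1 iff x exceeds the midpoint of class means.
    m0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    m1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    t = (m0 + m1) / 2
    return lambda x: 1 if x > t else 0

def majority_clf(train):
    # Classifier B (baseline): always predict the most frequent training class.
    label = 1 if 2 * sum(y for _, y in train) >= len(train) else 0
    return lambda x: label

def kfold_errors(data, build, k=10):
    # Return the test error of the classifier on each of the k folds.
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        clf = build(train)
        errors.append(sum(clf(x) != y for x, y in test) / len(test))
    return errors

err_a = kfold_errors(data, threshold_clf)
err_b = kfold_errors(data, majority_clf)
print("CV error estimate, classifier A:", sum(err_a) / len(err_a))
print("CV error estimate, classifier B:", sum(err_b) / len(err_b))

# The per-fold paired differences are what a paired statistical test
# (e.g. a Wilcoxon signed-rank test) would operate on.
diffs = [a - b for a, b in zip(err_a, err_b)]
print("per-fold differences:", [round(d, 2) for d in diffs])
```

Note that, as the tutorial stresses, a single point estimate is not enough for a fair comparison: the folds here share training data, so the differences are not independent, which is precisely why the choice of estimation method and statistical test matters.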