Data Science

Model Evaluation

Domino | 2018-06-14 | 10 min read


This Domino Data Science Field Note provides some highlights of Alice Zheng’s report, “Evaluating Machine Learning Models”, including evaluation metrics for supervised learning models and offline evaluation mechanisms. The full in-depth report, which also covers offline vs. online evaluation mechanisms, hyperparameter tuning, and potential A/B testing pitfalls, is available for download. A distilled slide deck that complements the report is also available.

Why Model Evaluation Matters

Data scientists make models. Oftentimes, we’ll hear data scientists discuss how they are responsible for building a model as a product, or for making a slew of models that build on each other and impact business strategy. Evaluating a model’s performance is an aspect of machine learning model development that is both fundamental and challenging. Unlike statistical models, which assume that the distribution of data will remain the same, machine learning models may face data whose distribution drifts over time. Evaluating the model and detecting distribution drift enables people to identify when the machine learning model needs retraining. In the “Evaluating Machine Learning Models” report, Zheng advocates for considering model evaluation at the start of any project, as it will help answer questions like “how can I measure success for this project?” and avoid “working on ill-formulated projects where good measurement is vague or infeasible.”

Evaluation Metrics for Supervised Learning Models

Zheng indicates that “there are multiple stages in developing a machine learning model… and it follows that there are multiple places where one needs to evaluate the model.” Zheng advocates for considering model evaluation during the prototyping stage, or when “we try out different models to find the best one (model selection).” Zheng also points out that “evaluation metrics are tied to machine learning tasks” and that “there are different metrics for the tasks.” The supervised learning tasks whose evaluation metrics Zheng covers in the report include classification, regression, and ranking. Zheng also mentions two packages to consider: R’s Metrics package and scikit-learn’s model evaluation module.


Regarding classification, Zheng notes that the most popular metrics for measuring classification performance include accuracy, the confusion matrix, log-loss, and AUC (area under the curve). Accuracy “measures how often the classifier makes the correct predictions” as it is “the ratio between the number of correct predictions and the total number of predictions (the number of data points in the test set)”, whereas a confusion matrix (or confusion table) “shows a more detailed breakdown of correct and incorrect classifications for each class.” Zheng notes that a confusion matrix is useful for understanding the distinction between classes, particularly when “the cost of misclassification might differ for the two classes, or one might have a lot more test data of one class than the other.” For example, the consequences of a false positive and of a false negative in a cancer diagnosis are different.
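To make the difference between accuracy and a confusion matrix concrete, here is a minimal sketch in plain Python. The labels and predictions are made up for illustration, and the helper names `accuracy` and `confusion_matrix` are hypothetical, not taken from Zheng's report or from any library.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_matrix(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))

# Hypothetical, imbalanced test set: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

# Per-class accuracy, read off the confusion matrix.
positive_accuracy = cm[(1, 1)] / (cm[(1, 1)] + cm[(1, 0)])
negative_accuracy = cm[(0, 0)] / (cm[(0, 0)] + cm[(0, 1)])

print(f"overall accuracy:  {acc:.2f}")
print(f"positive class:    {positive_accuracy:.2f}")
print(f"negative class:    {negative_accuracy:.2f}")
```

In this toy data the overall accuracy hides the fact that half of the positive examples are missed, which is the kind of per-class detail the confusion matrix exposes.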

As for log-loss (logarithmic loss), Zheng notes that “if the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log-loss can be used.” The probability “can be understood as a gauge of confidence”, and log-loss “is a ‘soft’ measurement of accuracy that incorporates this idea of probabilistic confidence.” As for AUC, Zheng describes it as “one way to summarize the ROC curve into a single number, so that it can be compared easily and automatically.” The ROC curve itself is a whole curve and “provides nuanced details of the classifier.” For even more explanations on AUC and ROC, Zheng recommends this tutorial.
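The following sketch shows one way to compute both quantities by hand for a binary classifier. It assumes labels of 0 and 1 and made-up predicted probabilities. The AUC here uses the standard pairwise interpretation (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as half), which equals the area under the ROC curve. The function names are illustrative only.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the true labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.6, 0.7]

loss = log_loss(y_true, p_pred)
area = auc(y_true, p_pred)

# A confident wrong prediction is penalized more than a hesitant one.
confident_wrong = log_loss([1], [0.01])
hesitant_wrong = log_loss([1], [0.4])

print(f"log-loss: {loss:.4f}, AUC: {area:.2f}")
```

Note how log-loss depends on the probabilities themselves, while AUC depends only on how the scores order the examples.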


Zheng notes that “one of the primary ranking metrics, precision-recall, is also popular for classification tasks.” While these are two metrics, they are commonly used together. Zheng indicates that “mathematically, precision and recall can be defined as the following:

  • precision = # happy correct answers / # total items returned by ranker
  • recall = # happy correct answers / # total relevant items.”
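As an illustration of these two definitions, the sketch below computes precision and recall for the top k items returned by a ranker that orders items by a numeric score. The item names, scores, relevance judgments, and the helper `precision_recall_at_k` are all hypothetical.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall over the top-k items of a ranked list."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

# Hypothetical raw scores assigned to each item by a classifier.
scores = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.3, "e": 0.1}
# Hypothetical set of items that are actually relevant.
relevant = {"a", "c", "e"}

# The ranker simply orders items by raw score, highest first.
ranked = sorted(scores, key=scores.get, reverse=True)

precision, recall = precision_recall_at_k(ranked, relevant, k=2)
print(f"precision@2 = {precision:.2f}, recall@2 = {recall:.2f}")
```

Returning more items (a larger k) can only keep recall the same or raise it, while precision may go up or down, which is one reason the two are usually reported together.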

Zheng adds that “in an underlying implementation, the classifier may assign a numeric score to each item instead of a categorical class label, and the ranker may simply order the items by the raw score.” Zheng also notes that personalized recommendation is another task that can be framed either as a ranking problem or as a regression problem: “the recommender might act either as a ranker or a score predictor. In the first case, the output is a ranked list of items for each user. In the case of score prediction, the recommender needs to return a predicted score for each user-item pair; this is an example of a regression model.”


With regression, Zheng indicates in the report that “in a regression task, the model learns to predict numeric scores.” As noted earlier, personalized recommendation is one example, in which we “try to predict a user’s rating for an item.” Zheng also notes that one of “the most commonly used metrics for regression tasks is RMSE (root-mean-square error)”, which is also known as RMSD (root-mean-square deviation). Yet Zheng cautions that while RMSE is commonly used, it has some weaknesses. RMSE is particularly “sensitive to large outliers. If the regressor performs really badly on a single data point, the average error could be very big”; in other words, “the mean is not robust (to large outliers).” Zheng notes that there will always be “outliers” in real data and “the model will probably not perform very well on them. So it’s important to look at robust estimators of performance that aren’t affected by large outliers.” Zheng mentions that looking at the median absolute percentage error is useful because it “gives us a relative measure of the typical error.”
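The sensitivity of RMSE to a single outlier, compared with the median absolute percentage error, can be seen in a small sketch. The target values and predictions are invented for illustration, and the function names are hypothetical.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def median_ape(y_true, y_pred):
    """Median of |(true - predicted) / true|; assumes no true value is zero."""
    return statistics.median(abs((t - p) / t) for t, p in zip(y_true, y_pred))

# Hypothetical targets and two sets of predictions that differ in one point.
y_true = [10, 20, 30, 40, 50]
y_pred_clean = [11, 22, 27, 44, 45]     # every prediction is off by 10%
y_pred_outlier = [11, 22, 27, 44, 150]  # the last prediction is badly wrong

rmse_clean = rmse(y_true, y_pred_clean)
rmse_outlier = rmse(y_true, y_pred_outlier)
mape_clean = median_ape(y_true, y_pred_clean)
mape_outlier = median_ape(y_true, y_pred_outlier)

print(f"RMSE:       {rmse_clean:.2f} -> {rmse_outlier:.2f}")
print(f"median APE: {mape_clean:.2f} -> {mape_outlier:.2f}")
```

In this toy data one bad prediction inflates RMSE by more than a factor of ten, while the median absolute percentage error is unchanged.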

Offline Evaluation Mechanisms

Zheng advocates in the report that

“the model must be evaluated on a dataset that’s statistically independent from the one it was trained on. Why? Because its performance on the training set is an overly optimistic estimate of its true performance on new data. The process of training the model has already adapted to the training data. A more fair evaluation would measure the model’s performance on data that it hasn’t yet seen. In statistical terms, this gives an estimate of the generalization error, which measures how well the model generalizes to new data.”

Zheng also indicates that researchers can use hold-out validation as a way to obtain data the model has not yet seen. In hold-out validation, “assuming that all data points are i.i.d. (independently and identically distributed), we simply randomly hold out part of the data for validation. We train the model on the larger portion of the data and evaluate validation metrics on the smaller hold-out set.” Zheng also points out that resampling techniques such as bootstrapping or cross-validation may be used when a mechanism that generates additional datasets is needed. Bootstrapping “generates multiple datasets by sampling from a single, original dataset. Each of the ‘new’ datasets can be used to estimate a quantity of interest. Since there are multiple datasets and therefore multiple estimates, one can also calculate things like the variance or a confidence interval for the estimate.” Cross-validation, Zheng notes, is “useful when the training dataset is so small that one can’t afford to hold out part of the data just for validation purposes.” While there are many variants of cross-validation, one of the most commonly used is k-fold cross-validation, which
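Hold-out validation and bootstrapping can each be sketched in a few lines of plain Python. The example below is illustrative only: the data are made up, the helper names `holdout_split` and `bootstrap_estimates` are hypothetical, and the bootstrap uses a simple percentile interval, which is one common choice among several.

```python
import random
import statistics

def holdout_split(data, holdout_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the data for validation."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def bootstrap_estimates(data, statistic, n_resamples=1000, seed=0):
    """Apply `statistic` to datasets resampled with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hold-out: train on the larger portion, validate on the smaller one.
data = list(range(100))  # stand-in for 100 i.i.d. data points
train, validation = holdout_split(data, holdout_fraction=0.2)

# Bootstrap: hypothetical per-example validation errors of some model.
errors = [0.5, 1.2, 0.3, 2.1, 0.9, 1.7, 0.4, 1.1, 0.8, 1.5]
estimates = sorted(bootstrap_estimates(errors, statistics.mean))

# Percentile-based 95% interval for the mean error.
lower = estimates[int(0.025 * len(estimates))]
upper = estimates[int(0.975 * len(estimates))]
spread = statistics.stdev(estimates)

print(f"train size: {len(train)}, validation size: {len(validation)}")
print(f"mean error: {statistics.mean(errors):.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```

Because each resampled dataset yields its own estimate, the spread of those estimates gives a sense of how much the metric could vary.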

“divides the training dataset into k folds… each of the k folds takes turns being the hold-out validation set; a model is trained on the rest of the k - 1 folds and measured on the held-out fold. The overall performance is taken to be the average of the performance on all k folds. Repeat this procedure for all of the hyperparameter settings that need to be evaluated, then pick the hyperparameters that resulted in the highest k-fold average.”

Zheng also points out that the scikit-learn cross-validation module may be useful.
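The k-fold procedure described above, including picking the hyperparameter with the highest k-fold average, can be sketched without any library. Everything in this example is an assumption made for illustration: the toy one-dimensional dataset, the tiny nearest-neighbor classifier, and the helper names. In practice, scikit-learn's `sklearn.model_selection` module provides `KFold` and `cross_val_score` for the same purpose.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def k_fold_average(fit, score, X, y, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average over folds."""
    fold_scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(X)) if i not in held_out_set]
        model = fit([X[i] for i in train], [y[i] for i in train])
        fold_scores.append(
            score(model, [X[i] for i in held_out], [y[i] for i in held_out])
        )
    return sum(fold_scores) / k

def make_knn(n_neighbors):
    """Toy 1-D nearest-neighbor classifier; n_neighbors is the hyperparameter."""
    def fit(X_train, y_train):
        return list(zip(X_train, y_train))  # the "model" is the training data

    def score(model, X_val, y_val):
        correct = 0
        for x, label in zip(X_val, y_val):
            nearest = sorted(model, key=lambda pair: abs(pair[0] - x))[:n_neighbors]
            votes = sum(lbl for _, lbl in nearest)
            predicted = 1 if 2 * votes > len(nearest) else 0
            correct += int(predicted == label)
        return correct / len(X_val)  # accuracy on the held-out fold

    return fit, score

# Hypothetical dataset: points below 10 are class 0, the rest class 1.
X = list(range(20))
y = [0] * 10 + [1] * 10

cv_scores = {}
for n_neighbors in (1, 3, 5):
    fit, score = make_knn(n_neighbors)
    cv_scores[n_neighbors] = k_fold_average(fit, score, X, y, k=5)

best_n_neighbors = max(cv_scores, key=cv_scores.get)
print(cv_scores, "best:", best_n_neighbors)
```

Using the same seed for every hyperparameter setting means each setting is scored on identical folds, so differences in the averages reflect the setting rather than the split.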


As data scientists spend so much time making models, considering evaluation metrics early on may help them accelerate their work and set up their projects for success. Yet evaluating machine learning models is a known challenge. This Domino Data Science Field Note provides a few insights excerpted from Zheng’s report. The full in-depth report is available for download.


Domino Data Science Field Notes provide highlights of data science research, trends, techniques, and more, that help data scientists and data science leaders accelerate their work or careers. If you are interested in your data science work being covered in this blog series, please send us an email at writeforus(at)dominodatalab(dot)com.
