How to Properly Compare Machine Learning Models [Breakdowns]
There are a lot of sources of variance that can lead you to the wrong answers.
Hey, it’s Devansh 👋👋
In my series Breakdowns, I go through complicated literature on Machine Learning to extract the most valuable insights. Expect concise, jargon-free, but still useful analysis aimed at helping you understand the intricacies of Cutting Edge AI Research and the applications of Deep Learning at the highest level.
If you’d like to support my writing, please consider buying and rating my 1 Dollar Ebook on Amazon or becoming a premium subscriber to my sister publication Tech Made Simple using the button below.
p.s. you can learn more about the paid plan here.
When picking between multiple models, how do you pick which one is best?
That is a more complex question than you might think. In today’s AI climate, models and pipelines are getting more complex and opaque. This means that a simple comparison across a few metrics is not sufficient, since there are all kinds of nuances that we have to take into consideration. The only way we can truly know which model is better is to run “multiple trials optimizing the learning pipeline over sources of variation such as data sampling, data augmentation, parameter initialization, and hyperparameters choices.” Unfortunately, doing so is often too expensive for most teams, so they cut corners and rush the evaluation phase of the machine learning pipeline. That is far from ideal, because it can lead to people picking the wrong models. The paper “Accounting for Variance in Machine Learning Benchmarks” goes into this issue in more detail.
In this article, I will be going over some interesting takeaways from the paper, including the recommendations by the authors to build fantastic benchmarks for Machine Learning Pipelines. Let me know what you think, and which of the takeaways you found most insightful.
We show a counter-intuitive result that adding more sources of variation to an imperfect estimator approaches better the ideal estimator at a 51x reduction in compute cost.
Why Evaluating Models Matters
Simple: evaluation helps us compare Machine Learning models and determine which one is best. The reason we have multiple metrics and multiple ways of comparing models is that different metrics are more relevant for different problems.
What most model comparisons get wrong.
Spend some time reading the passage above. It hints at something really crucial: having extremely complex models directly prevents you from going over multiple configurations (and data splits) and comparing the models over a diverse set of tests. When we don’t test over these sources of variation, we might actually get the objectively wrong answer.
Above is a pretty concise illustration of the various ways we could induce variance into our learning agents. The numbers can’t be ignored. The variance can literally change the results of your comparison. Below is a passage that sums up the main point of this section.
Naturally, there are factors aside from pure metrics that we care about. If the model is too expensive, it’s not worth anything. You might be wondering how we can compute model complexity. Here is a video that introduces one of the best metrics for computing efficiency (performance with respect to complexity), the Bayesian Information Criterion, in less than a minute.
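For reference, the BIC trades off how well a model fits the data against how many parameters it uses: BIC = k·ln(n) − 2·ln(L̂), where k is the number of parameters, n the number of observations, and L̂ the maximized likelihood; lower is better. Here is a minimal sketch (the numbers are made up purely for illustration):

```python
import numpy as np

def bic(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Bayesian Information Criterion: lower is better.
    Penalizes parameter count relative to goodness of fit."""
    return num_params * np.log(num_samples) - 2.0 * log_likelihood

# A model with a slightly better fit but far more parameters can still
# lose on BIC (illustrative numbers, not from the paper).
print(bic(log_likelihood=-1200.0, num_params=10, num_samples=5000))  # ~2485
print(bic(log_likelihood=-1195.0, num_params=50, num_samples=5000))  # ~2816
```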
How to Design Better Benchmarks for ML Pipelines
If you’ve read this far, the next question on your mind is going to be about how we can get to building better benchmarks for our models. Fear not. As promised, here are the aspects you want to focus on for great comparison benchmarks.
Randomize as many sources of variation as possible
Good model comparisons randomize a lot of the arbitrary choices we make during the machine learning process: the random seed for initialization, the order in which the data is presented, how we initialize the learners, and so on. Randomizing these choices doesn’t magically produce better-performing models; it gives you a more trustworthy estimate of how well each pipeline actually performs (there’s a rough sketch of this below). To quote,
“a benchmark that varies these arbitrary choices will not only evaluate the associated variance (section 2), but also reduce the error on the expected performance as they enable measures of performance on the test set that are less correlated (3). This counter-intuitive phenomenon is related to the variance reduction of bagging (Breiman, 1996a; Buhlmann et al., 2002), and helps characterizing better the expected behavior of a machine-learning pipeline, as opposed to a specific fit.”
I found the comparison to bagging particularly interesting. This is why I recommend taking some time to go over a wide range of ML concepts: it helps you build the associations you need to understand things better and be more innovative.
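To make that concrete, here is a rough sketch of what randomizing those choices can look like. It uses scikit-learn with a toy dataset as a stand-in (my own illustration, not the authors’ setup): each trial re-randomizes both the data split and the learner’s initialization, and we report the spread of test scores rather than a single lucky number.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

scores = []
for seed in range(10):  # each trial re-randomizes the "arbitrary" choices
    # Randomize the train/test split...
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # ...and the weight initialization / data shuffling inside the learner.
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed),
    )
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

# Report expected performance and its variance, not one specific fit.
print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```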
Use Multiple Data Splits
Most people use a single train-test-validation split. They split their data once and are done with it. More industrious people might also run some cross-validation. I would recommend also playing around with the ratios used for building the sets. In the words of the team, “For pipeline comparisons with more statistical power, it is useful to draw multiple tests, for instance generating random splits with a out-of-bootstrap scheme (detailed in appendix B).”
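Here is one way that could look in code: a rough sketch of an out-of-bootstrap evaluation (my own reading of the idea, not the authors’ exact procedure), where each training set is a bootstrap resample of the data and everything that wasn’t drawn becomes that trial’s test set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
n = len(y)
rng = np.random.default_rng(0)

scores = []
for _ in range(20):
    # Bootstrap resample for training: draw n indices with replacement.
    train_idx = rng.integers(0, n, size=n)
    # The samples that were never drawn form the held-out test set.
    test_mask = np.ones(n, dtype=bool)
    test_mask[train_idx] = False

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_mask], y[test_mask]))

print(f"accuracy over 20 out-of-bootstrap splits: "
      f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```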
Account for variance to detect meaningful improvements
It’s important to always remember that there is a degree of randomness in your results. Running multiple tests is one way to reduce it, but it will never go away unless you go through every possible permutation (which might be impossible, and is definitely needlessly expensive). Minor improvements might just be the result of random chance, as the sketch below illustrates. When comparing models, always keep a few close-performing ones on hand.
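As a rough illustration of what “accounting for variance” can mean in practice (a generic sanity check, not the paper’s specific statistical procedure): compare the average per-trial gain of one pipeline over another against the trial-to-trial spread of that gain before declaring a winner.

```python
import numpy as np

# Test scores from, say, 10 randomized trials of two competing pipelines
# (placeholder numbers purely for illustration).
scores_a = np.array([0.913, 0.921, 0.905, 0.918, 0.910,
                     0.924, 0.908, 0.915, 0.919, 0.911])
scores_b = np.array([0.917, 0.915, 0.912, 0.922, 0.909,
                     0.920, 0.914, 0.918, 0.916, 0.913])

diff = scores_b - scores_a
mean_gain = diff.mean()
# Standard error of the mean difference across trials.
sem = diff.std(ddof=1) / np.sqrt(len(diff))

print(f"mean gain of B over A: {mean_gain:.4f} (+/- {1.96 * sem:.4f})")
if abs(mean_gain) < 1.96 * sem:
    print("The gap is within the noise -- keep both pipelines in contention.")
else:
    print("The gap is larger than the trial-to-trial variance.")
```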
Closing
This was an interesting paper. The authors did a great job showing how many of the arbitrary choices in the Machine Learning process can skew the results, and it speaks to the need for comprehensive testing that accounts for randomness. The fact that this paper validates so much of what I have been saying was the icing on the cake. While nothing the paper claimed was controversial, the extent to which it showed how variance can change results was certainly eye-opening for me personally.
That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
Check out my other articles on Medium. : https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819