Amateur Hour

Evaluation in Deep Learning

Or use some statistics
Oct 30, 2022. Filed under Technical
Tags: machine learning, research

Over time, I’ve found that a lot of deep learning research is a mirage. While I’ve usually (but not always!) been able to reproduce the authors’ results on their chosen benchmarks, the performance gains often disappear if I choose other benchmarks to train/test against. This gets even worse when I try to apply their results to my own problems, which involve an extremely different data domain (surface electromyography) than the standard vision/text/speech that most people work with. Even in my own research, I’ve found that deep learning is shockingly finicky and that very minor changes can cause surprisingly large swings in benchmark results.

Because of this, I’ve come to believe that the standard deep learning evaluation methodology is flawed. In general, the typical practice seems to be:

  1. Decide on the question you want to investigate (e.g. does my fancy new architecture work well?).
  2. Pick some kind of test set to benchmark on (either a standard benchmark like ImageNet or LibriSpeech, or a test set for your particular application).
  3. Based on your question, train the models you want to evaluate.
  4. Benchmark those models and report the results.

Papers will usually include some “intuition-building” results – small theorems to prove some behavior, examinations of particular data points in the test set, or “artificial” benchmarks to demonstrate a specific behavior – but I’ve found these to be more about building intuition than about answering the practical “should you use XYZ or not?” question. At its core, deep learning research is very benchmark-driven.

I think the core problem is the mismatch between step 3 and step 1 – training a model requires choosing all kinds of nuisance parameters that are unrelated to our research question. If I want to test whether a Transformer outperforms an RNN (for example), then I have to also decide on a huge range of other things like “do I use Adam or SGD w/ Momentum?” or “what data augmentation do I use?” or “how many parameters should I use in my model?”1

This impedance mismatch between evaluating modeling artifacts and answering questions about modeling strategies has caused me to re-evaluate my own standards for evaluation at work in two ways:

  1. Checking that results are robust to random seeds.
  2. Checking that results are robust to nuisance parameters.

I’ll tackle each of these in turn.

Robustness to Random Seeds

Given our software stacks, deep learning is an almost inherently stochastic process. Even assuming you exactly fix all the hyperparameters of your model, there is still inherent randomness in:

  1. The weights you initialize your model with
  2. The batch sampling order
  3. The data augmentation randomness (and in things like dropout)
  4. The inherent non-determinism from CUDA (you could use deterministic matmuls, but are you really going to make training slower?)
  5. Scheduling randomness (e.g. for things like Hogwild or asynchronous SGD).
  6. The train/validation split you use (if you’re doing early stopping).
  7. The hyperparameter search random seed (if you’re doing HP search).

These sources of randomness might seem small, but they add up. I’ve seen a surprising number of cases where just changing the random seed flips which modeling strategy comes out ahead. If your results depend on the exact random seed you use, how confident are you that they’re actually true?
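Most of these sources (the CUDA and scheduling non-determinism aside) can be pinned to an explicit seed, which makes it easy to rerun the exact same configuration under a handful of different seeds. Here’s a minimal, PyTorch-flavored sketch of what I mean – PyTorch itself, the `seed_everything` name, and the exact determinism flags are my own assumptions about your stack, not anything standard:

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Pin the controllable sources of randomness to a single seed."""
    random.seed(seed)                 # Python RNG: shuffles, augmentation choices, etc.
    np.random.seed(seed)              # NumPy RNG: many augmentation libraries use this
    torch.manual_seed(seed)           # PyTorch CPU + CUDA RNGs: weight init, dropout, sampling
    torch.cuda.manual_seed_all(seed)  # every GPU, in case you train on more than one

    # Optional: trade some training speed for deterministic CUDA kernels.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True, warn_only=True)
```

With something like this in place, “run the same experiment under seeds 0 through 4” becomes a one-line change, which is all the next step needs.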

There’s a simple solution to this: treat all of this “random seed variation”2 as i.i.d. noise, so that each trained model is a draw from some underlying distribution. That lets you use some basic statistics to account for this randomness.

Luckily, the majority of ML metrics are averages across test samples – things like accuracy, word error rate, AUC, etc. So by the central limit theorem, that underlying distribution will be very close to Gaussian, assuming your test set is reasonably large.3 That means you only need 2 or 3 runs per configuration to satisfy the assumptions of a Student’s t-test. You might not get a lot of statistical power, but frankly, if the improvement from your modeling strategy isn’t many times larger than the standard error here, you should probably rethink how practically relevant your result will be anyway.
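Concretely, that just means training each modeling strategy under a few different seeds and running an ordinary two-sample t-test on the resulting metrics. A quick sketch with SciPy – the accuracy numbers here are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical test-set accuracies, one per random seed, for two modeling strategies.
baseline_acc = np.array([0.842, 0.838, 0.845, 0.840])
candidate_acc = np.array([0.851, 0.856, 0.849, 0.853])

# Welch's t-test: treats each seed's metric as an i.i.d. draw from that strategy's
# (approximately Gaussian) underlying distribution.
t_stat, p_value = stats.ttest_ind(candidate_acc, baseline_acc, equal_var=False)

# Standard error of the difference in means, to sanity-check practical relevance.
std_err = np.sqrt(candidate_acc.var(ddof=1) / len(candidate_acc)
                  + baseline_acc.var(ddof=1) / len(baseline_acc))

diff = candidate_acc.mean() - baseline_acc.mean()
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, diff = {diff:.4f} ± {std_err:.4f}")
```

If the mean improvement isn’t several standard errors wide, that’s usually my cue to go back and question the result rather than to just collect more seeds.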

Robustness to Nuisance Parameters

After establishing that our results aren’t just due to random chance, the next thing I try to show is that they’re “reasonably” robust to the exact modeling process we use. After all, I’m not the only one working on the model – while I’m investigating the model architecture, one of my colleagues might be introducing a new data augmentation, an auxiliary loss function, or some other improvement that’s orthogonal to mine. It would be nice to know that our improvements will stack together and not just disappear once we bring them together.

I haven’t found a silver bullet for this (unfortunately), but I think you can go a long way by just thinking through your experimental design. Lots of other domains have this kind of “generalizability” question, so this isn’t in any way unique to ML.
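One cheap design that has served me reasonably well is to cross the comparison I actually care about with a few plausible settings of the nuisance parameters, rather than fixing them at a single value. A toy sketch of that kind of grid – `train_and_evaluate` is a placeholder for your real training pipeline, not a real API:

```python
import itertools
import random


def train_and_evaluate(arch: str, optimizer: str, augmentation: str, seed: int) -> float:
    """Placeholder for a real training pipeline; returns a test-set metric."""
    random.seed(hash((arch, optimizer, augmentation, seed)))
    return random.uniform(0.80, 0.90)  # stand-in accuracy


# The comparison we actually care about...
architectures = ["rnn", "transformer"]
# ...crossed with nuisance parameters a colleague might change out from under us.
optimizers = ["adam", "sgd_momentum"]
augmentations = ["light", "heavy"]
seeds = [0, 1, 2]

results = {
    config: train_and_evaluate(*config)
    for config in itertools.product(architectures, optimizers, augmentations, seeds)
}

# If "transformer beats rnn" holds in (nearly) every optimizer/augmentation cell,
# the improvement is far more likely to survive orthogonal changes to the pipeline.
```

The grid grows quickly, of course, so in practice I only vary the one or two nuisance parameters I most expect to interact with the change under test.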


  1. Of course, what’s a nuisance parameter depends on the exact question we want answered: while for some questions “what optimizer should I use” isn’t that relevant, for other questions it might be crucial. ↩︎

  2. Yes, I know that technically they’re not all controllable by random seeds, but the name still fits. ↩︎

  3. Technically you also need to establish that your variable has finite variance, but most metrics are inherently bounded. The only time I’ve personally seen this pop up has been when doing some Bayesian modeling and trying to use the expected log predictive density, which could potentially have unbounded variance. ↩︎