Amateur Hour

Evaluation in Deep Learning

Or use some statistics
Oct 30, 2022. Filed under Technical
Tags: machine learning, research

Over time, I’ve found that a lot of deep learning research is a mirage. While I’ve usually (but not always!) been able to reproduce the authors’ results on their chosen benchmarks, the performance gains often disappear if I choose other benchmarks to train/test against. This gets even worse when I try to apply their results to my own problems, which involve an extremely different data domain (surface electromyography) than the standard vision/text/speech that most people work with. Even in my own research, I’ve found that deep learning is shockingly finicky and that very minor changes can cause surprisingly large swings in benchmark results.

Because of this, I’ve come to believe that the standard deep learning evaluation methodology is flawed. In general, the typical practice seems to be:

  1. Decide on the question you want to investigate (e.g. does my fancy new architecture work well?).
  2. Pick some kind of test set to benchmark on (either a standard benchmark like ImageNet or LibriSpeech, or a test set for your particular application).
  3. Based on your question, train the models you want to evaluate.
  4. Benchmark those models and report the results.

Papers will usually include some “intuition-building” results – small theorems to prove some behavior, examinations of particular data points in the test set, or “artificial” benchmarks to demonstrate a specific behavior – but I’ve found these to be more about building intuition than about answering the practical “should you use XYZ or not?” question. At its core, deep learning research is very benchmark-driven.

I think the core problem is the mismatch between step 3 and step 1 – training a model requires choosing all kinds of nuisance parameters that are unrelated to our research question. If I want to test whether a Transformer outperforms an RNN (for example), then I have to also decide on a huge range of other things like “do I use Adam or SGD w/ Momentum?” or “what data augmentation do I use?” or “how many parameters should I use in my model?”1

This impedance mismatch between evaluating modeling artifacts and answering questions about modeling strategies has caused me to re-evaluate my own standards for evaluation at work in two ways:

  1. Checking that results are robust to random seeds.
  2. Checking that results are robust to nuisance parameters.

I’ll tackle each of these in turn.

Robustness to Random Seeds

Given our software stacks, deep learning is an almost inherently stochastic process. Even assuming you exactly fix all the hyperparameters of your model, there is still inherent randomness in:

  1. The weights you initialize your model with
  2. The batch sampling order
  3. The data augmentation randomness (and in things like dropout)
  4. The inherent non-determinism from CUDA (you could use deterministic matmuls, but are you really going to make training slower?)
  5. Scheduling randomness (e.g. for things like Hogwild or asynchronous SGD).
  6. The train/validation split you use (if you’re doing early stopping).
  7. The hyperparameter search random seed (if you’re doing HP search).

These sources of randomness might seem small, but they add up. I’ve seen a surprising number of cases where just changing the random seed flips which modeling strategy comes out ahead. If your results depend on the exact random seed you use, how confident are you that they’re actually true?
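Most of these sources (the CUDA and scheduling non-determinism aside) can be pinned to an explicit seed, which makes it easy to rerun the exact same configuration under a handful of different seeds. Here’s a minimal, PyTorch-flavored sketch of what I mean – PyTorch itself, the `seed_everything` name, and the exact determinism flags are my own assumptions about your stack, not anything standard:

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    """Pin the controllable sources of randomness to a single seed."""
    random.seed(seed)                 # Python RNG: shuffles, augmentation choices, etc.
    np.random.seed(seed)              # NumPy RNG: many augmentation libraries use this
    torch.manual_seed(seed)           # PyTorch CPU + CUDA RNGs: weight init, dropout, sampling
    torch.cuda.manual_seed_all(seed)  # every GPU, in case you train on more than one

    # Optional: trade some training speed for deterministic CUDA kernels.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True, warn_only=True)
```

With something like this in place, “run the same experiment under seeds 0 through 4” becomes a one-line change, which is all the next step needs.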

There’s a simple solution to this: treat all of this “random seed variation”2 as i.i.d. noise, so that each trained model is a draw from some underlying distribution. That lets you use some basic statistics to account for this randomness.

Luckily, the majority of ML metrics are averages across test samples – things like accuracy, word error rate, AUC, etc. So by the central limit theorem, that underlying distribution will be very close to Gaussian, assuming your test set is reasonably large.3 That means you only need 2 or 3 runs per configuration to satisfy the assumptions of a Student’s t-test. You might not get a lot of statistical power, but frankly, if the improvement from your modeling strategy isn’t many times larger than the standard error here, you should probably rethink how practically relevant your result will be anyway.
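Concretely, that just means training each modeling strategy under a few different seeds and running an ordinary two-sample t-test on the resulting metrics. A quick sketch with SciPy – the accuracy numbers here are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical test-set accuracies, one per random seed, for two modeling strategies.
baseline_acc = np.array([0.842, 0.838, 0.845, 0.840])
candidate_acc = np.array([0.851, 0.856, 0.849, 0.853])

# Welch's t-test: treats each seed's metric as an i.i.d. draw from that strategy's
# (approximately Gaussian) underlying distribution.
t_stat, p_value = stats.ttest_ind(candidate_acc, baseline_acc, equal_var=False)

# Standard error of the difference in means, to sanity-check practical relevance.
std_err = np.sqrt(candidate_acc.var(ddof=1) / len(candidate_acc)
                  + baseline_acc.var(ddof=1) / len(baseline_acc))

diff = candidate_acc.mean() - baseline_acc.mean()
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, diff = {diff:.4f} ± {std_err:.4f}")
```

If the mean improvement isn’t several standard errors wide, that’s usually my cue to go back and question the result rather than to just collect more seeds.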

Robustness to Nuisance Parameters

After establishing that our results aren’t just due to random chance, the next thing I try to show is that they’re “reasonably” robust to the exact modeling process we use. After all, I’m not the only one working on the model – while I’m investigating the model architecture, one of my colleagues might be introducing a new data augmentation, an auxiliary loss function, or some other improvement that’s orthogonal to mine. It would be nice to know that our improvements will stack together and not just disappear once we bring them together.

I haven’t found a silver bullet for this (unfortunately), but I think you can go a long way by just thinking through your experimental design. Lots of other domains have this kind of “generalizability” question, so this isn’t in any way unique to ML.
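One cheap design that has served me reasonably well is to cross the comparison I actually care about with a few plausible settings of the nuisance parameters, rather than fixing them at a single value. A toy sketch of that kind of grid – `train_and_evaluate` is a placeholder for your real training pipeline, not a real API:

```python
import itertools
import random


def train_and_evaluate(arch: str, optimizer: str, augmentation: str, seed: int) -> float:
    """Placeholder for a real training pipeline; returns a test-set metric."""
    random.seed(hash((arch, optimizer, augmentation, seed)))
    return random.uniform(0.80, 0.90)  # stand-in accuracy


# The comparison we actually care about...
architectures = ["rnn", "transformer"]
# ...crossed with nuisance parameters a colleague might change out from under us.
optimizers = ["adam", "sgd_momentum"]
augmentations = ["light", "heavy"]
seeds = [0, 1, 2]

results = {
    config: train_and_evaluate(*config)
    for config in itertools.product(architectures, optimizers, augmentations, seeds)
}

# If "transformer beats rnn" holds in (nearly) every optimizer/augmentation cell,
# the improvement is far more likely to survive orthogonal changes to the pipeline.
```

The grid grows quickly, of course, so in practice I only vary the one or two nuisance parameters I most expect to interact with the change under test.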


  1. Of course, what’s a nuisance parameter depends on the exact question we want answered: while for some questions “what optimizer should I use” isn’t that relevant, for other questions it might be crucial. ↩︎

  2. Yes, I know that technically they’re not all controllable by random seeds, but the name still fits. ↩︎

  3. Technically you also need to establish that your variable has finite variance, but most metrics are inherently bounded. The only time I’ve personally seen this pop up has been when doing some Bayesian modeling and trying to use the expected log predictive density, which could potentially have unbounded variance. ↩︎