Debugging is a core skill for software engineers, but IMO it’s one that we often neglect. There’s a ton of advice on the internet about how to be a better systems designer, or how to solve algorithm challenges (albeit mostly for interviews), but very little on how to get better at debugging.
I wanted to help fill that gap by documenting a real-world bug (https://github.com/HypothesisWorks/hypothesis/issues/3446) that stumped me for quite some time. Luckily, it’s in an open source codebase, which means I can share a lot of the actual code without having to censor details of our internal codebase. I might’ve gotten some details wrong (it’s been a couple of months since the bug, and I went on vacation for about 4 weeks in the middle of debugging it), but I’ve reconstructed this as best I could from my notes at the time.
WARNING: this is going to be long, detailed, and somewhat meandering – I wanted to document the entire winding journey, with all the false starts, red herrings, and debugging hacks it took. Bug write-ups are usually polished so that you only see the final solution; my hope is that by showing all the gory details, you’ll at least be able to pick up a few tricks.
If you don’t want all the details, you can skip to the end, where I talk about some of the lessons I hope you’ll walk away with.
Some Context
I work at CTRL-labs, a division of Facebook building neural interface technologies for AR/VR. As part of my work, I wrote and maintain a library for building “real-time streaming” PyTorch neural networks. These operate over “streams” of data – that is, data that is delivered to you over time in, say, 100 μs chunks. We want to run the neural network on the data as it comes in, with the minimum amount of buffering necessary.
Crucially though, we do not control how much data we receive on each chunk – depending on the exact device firmware and the bluetooth status, sometimes we can receive 20 samples of data at once and sometimes we can receive 70 samples of data at once. We want the output logits of the network to be the same, regardless of this behavior – otherwise, debugging “why did my model not respond to XXX” becomes fiendishly difficult.
To test this, we use hypothesis, a library for property-based testing. The way we implement this test is to create an abstract base class where the user defines how to construct a network and we define test methods that check for windowing invariance (among other things). A simplified version of this looks something like
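(What follows is a from-memory sketch rather than our real code: `Assertions` and `create_pytorch_network` are the real names, but the strategy type, settings, and test body shown here are illustrative.)

```python
import abc

import hypothesis
import hypothesis.strategies as st
import torch


class Assertions(abc.ABC):
    @abc.abstractmethod
    def create_pytorch_network(self) -> st.SearchStrategy:
        """Return a strategy that generates networks to test."""

    @hypothesis.settings(deadline=None)  # settings values illustrative
    @hypothesis.given(data=st.data())
    def test_windowing_invariance(self, data):
        network = data.draw(self.create_pytorch_network())
        # Feed the same samples through in differently-sized chunks and
        # check that the output logits are identical (details elided).
```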
That is, the user inherits from `Assertions`, implements `create_pytorch_network`, and then the base class automatically creates the test cases using that common network generator.
The Symptom
One day, I was alerted to a strange failure in our CircleCI tests. You can see the full stacktrace here, but the important part is:
(Stacktrace elided: the key line was Hypothesis complaining that the test case exceeded its maximum allowed size.)
This test was intermittent (which probably explains how it got merged in the first place) – in retrospect, I estimate it fails maybe 5-10% of the time. That’s enough to be very noticeable and annoying to people, but also easy enough to miss during development. At first, I thought it was a legitimate failure that just didn’t get caught often.
But when I looked further, I noticed the failure didn’t actually make much sense. Our test case shouldn’t be exceeding the max allowed size – the actual generator in `create_pytorch_network` looked something like:
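(Again reconstructed from memory – the exact layers and bounds below are illustrative, not the real ones.)

```python
import hypothesis.strategies as st
import torch


@st.composite
def small_networks(draw):
    # Draw a handful of small integers and build a tiny network out of them –
    # nowhere near the amount of data Hypothesis allows per example.
    hidden = draw(st.integers(min_value=1, max_value=8))
    kernel = draw(st.integers(min_value=1, max_value=4))
    return torch.nn.Sequential(
        torch.nn.Conv1d(1, hidden, kernel_size=kernel),
        torch.nn.ReLU(),
        torch.nn.Conv1d(hidden, 1, kernel_size=1),
    )
```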
which should pretty easily fit inside Hypothesis’s maximum allowable size.
Finding a Reproducer
The first step to debugging any failure is to figure out how to reliably reproduce it. It’s possible to debug unreproducible errors (e.g. by using appropriate logging to reconstruct what happened or by using something like rr to make it reproducible), but it’s a lot more involved.
The obvious thing to do was to follow the directions from Hypothesis, which helpfully told me to specify a `--hypothesis-seed=...` flag to reproduce the error. Unfortunately, that did not work – I tried the flag both locally and in the CI environment [1] and could not reproduce the failure.
The second thing to try was to just run the actual test many times until I reproduced the failure. I wrote a small bash script:
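(Roughly the following – the test path and pytest arguments are stand-ins for the real ones.)

```bash
#!/usr/bin/env bash
# Re-run the single flaky test until it fails (or we give up after 10000 runs).
for i in $(seq 1 10000); do
    python -m pytest tests/test_streaming.py -k test_windowing_invariance -x || break
done
```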
and ran it both locally and in CircleCI. Unfortunately, there was still no reproduction. At this point, I was a little stumped – after all, how could this test be intermittently failing in CI, yet I could run it 10000 times without a single failure? I decided to just file an issue with Hypothesis asking for help and to come back another day (after my scheduled vacation).
The day after I got back from vacation, I took another look at the bug and had the bright idea that maybe it had to do with some interaction across test cases. I set up a few bash scripts to run in parallel:
- One running `pytest` on just that file 10000 times
- One running `pytest` on the entire package 10000 times
- One reproducing exactly the CI test command 10000 times
That turned out to be the trick: option 1 (running `pytest` on just that file) was enough to reproduce the problem in CI. Once I had that, I checked whether the same script also reproduced the failure on my laptop, which it did. The problem was somehow tied to running all the tests in that file together.
Overall, I’d say that finding a minimum reproducer probably took 3-4 hours, although most of that was waiting for these scripts to finish executing.
Minimizing the Reproducer
Once I had a reproducer, it was time to minimize the test case. Ideally, I wanted something self-contained in a single file – that’s much easier to share with others, especially the Hypothesis maintainers on GitHub.
I’m sure there’s a smarter way to do this, but I have a very simple algorithm:
- Make a copy of the test file and write a bash script to run `pytest` on it 10000 times.
- Recursively copy and paste any internal dependencies into that file.
- Delete everything that doesn’t seem related. Double-check that you can still reproduce the bug between deletions.
- Start renaming everything to avoid giving away too many details of our internal codebase.
- Check that it reproduces on a fresh environment with minimal dependencies (to check it’s not some weird monkey-patching thing).
Getting this down to something reasonable took about 2 hours of work, and I was able to push the reproducer to https://github.com/alanhdu/hypothesis_3446/blob/591e2efc1e3241f72c2a01352cc3e7e7a116f7ab/test_hypothesis.py and open a bug report on GitHub.
67 lines of code to reproduce the bug – not bad in terms of length, but pretty bad in terms of complexity. There were several things that stuck out to me as suspicious:
- Why were we hitting a “max allowable size” error on such a small test case?
- Why did the failure only appear when I ran all tests in the file at once?
- Why did the failure only appear using the test case inheritance setup?
- Why did the failure not reproduce when specifying `--hypothesis-seed=<...>`?
Getting a breakpoint
Now that I had an easy reproducer, it was time to whip out `pdb`, the Python debugger (technically, I used pdb++, but that’s a minor detail). Using a debugger is much easier than having to deal with print statements.
My first thought was to just use `pytest --pdb` to try to reproduce the problem, but that didn’t work – it just left me at a breakpoint pointed at the `@hypothesis.settings` line, which did not let me really poke around in Hypothesis’s internals.
So instead, I created a new conda environment and cloned Hypothesis locally, using `pip install -e` to install the local version. This let me insert `breakpoint()` calls inside hypothesis’s own codebase and get a debugger going. You don’t technically have to do this – I have sometimes directly modified the files in `anaconda/env/test/lib/python3.8/site-packages/hypothesis` – but that can be risky if you don’t remember to revert your changes afterwards (or to blow away and recreate the environment).
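For concreteness, the setup was something along these lines (the environment name and Python version are illustrative):

```bash
conda create -n hypothesis-debug python=3.8
conda activate hypothesis-debug
git clone https://github.com/HypothesisWorks/hypothesis.git
# The Python package lives in the hypothesis-python/ subdirectory of the repo.
pip install -e hypothesis/hypothesis-python
```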
From there, I just used ripgrep to find the actual error message. Luckily, that message appeared in exactly one place in the codebase, so I was able to insert a `breakpoint()` right at this line.
Debugging
With that set up, I could finally start debugging. Now that I knew where the error message was raised, I could use the breakpoint to inspect the call stack and try to figure out how everything fit together. In particular, how are examples actually being generated, and what defines the “max allowable size”?
Making things deterministic
This took a while (and I went down a wrong turn looking heavily at the `Shrinker` class, which ended up being a red herring), but eventually it became clear that the key method was `ConjectureRunner.generate_new_examples`. That eventually calls into `ConjectureRunner.new_conjecture_data`, which in turn gets handed `self.random`.
Just from the name, it seemed like `self.random` must be the source of non-determinism. To test that assumption, I added a `self.original_state = self.random.getstate()` to the constructor immediately after this line, which meant I could then read that attribute back in a `pdb` session. I could then patch the `__init__` to always call `self.random.setstate(...)` with a state captured from a failing run.
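For reference, this is the round trip the patch relies on: `getstate()` captures the full state of a `random.Random`, and `setstate()` restores it, so replaying a captured state replays the exact same sequence of draws.

```python
import random

rng = random.Random()
state = rng.getstate()          # snapshot the full generator state
first = [rng.random() for _ in range(3)]

rng.setstate(state)             # rewind to the snapshot...
second = [rng.random() for _ in range(3)]
assert first == second          # ...and the draws repeat exactly
```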
What’s wrong with the seeding?
With this change, I had a 100% reproducible bug for the first time ever. Now the question was: why did passing `--hypothesis-seed=...` on the CLI not make things reproducible if a manual `self.random.setstate(...)` did? Clearly, I thought, there must be some kind of hidden non-determinism in how the seed was being set.
I tried to just read through the code to understand where the seed was being set (making strategic use of ripgrep to look for strings like `setstate`), but this ended up going nowhere – the initialization logic was pretty complicated, and I didn’t understand the codebase well enough to piece things together. I decided to work on something else for a while.
After a couple of days, I came back to this and decided I could just cheat – instead of reading the code, I could instrument it to record everywhere the seed was being set. So I wrote a custom wrapper around `random.Random` that would record the stack trace of every call to the `random.Random` constructor and every call to `setstate`, and monkey-patched it into my test script before hypothesis was imported for the first time:
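(A rough reconstruction of that instrumentation – the real version was messier, but the idea was the same.)

```python
import random
import traceback

# Record a stack trace for every Random() construction and every setstate()
# call, so we can see exactly who is (re)seeding what.
SEED_TRACES = []

_OriginalRandom = random.Random


class TracingRandom(_OriginalRandom):
    def __init__(self, *args, **kwargs):
        SEED_TRACES.append(traceback.format_stack())
        super().__init__(*args, **kwargs)

    def setstate(self, state):
        SEED_TRACES.append(traceback.format_stack())
        super().setstate(state)


# Monkey-patch *before* hypothesis is imported, so every Random it creates
# goes through the tracing subclass.
random.Random = TracingRandom

import hypothesis  # noqa: E402
```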
In `pdb`, this let me see all the places where seeds were set and narrow down my search. Finally, I hit pay dirt: the seed was being set in a `get_random_for_wrapped_test` function, which looked like:
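(Paraphrased and abbreviated from hypothesis’s core.py at the time – `Random`, `int_from_bytes`, and `function_digest` are hypothesis’s own imports, and the fallback branches are elided.)

```python
def get_random_for_wrapped_test(test, wrapped_test):
    settings = wrapped_test._hypothesis_internal_use_settings
    wrapped_test._hypothesis_internal_use_generated_seed = None

    if wrapped_test._hypothesis_internal_use_seed is not None:
        # An explicitly-set seed wins over everything else.
        return Random(wrapped_test._hypothesis_internal_use_seed)
    elif settings.derandomize:
        return Random(int_from_bytes(function_digest(test)))
    else:
        # Otherwise, fall back to a globally-seeded Random (elided).
        ...
```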
`pdb` confirmed that the `Random` was being set up using `wrapped_test._hypothesis_internal_use_seed`. Grepping for the `_hypothesis_internal_use_seed` string, I landed at this function:
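(Again paraphrased from hypothesis’s source at the time – this is the decorator that ends up applied when a seed is set explicitly.)

```python
def seed(seed):
    """Run this test reproducibly, deriving all randomness from the given seed."""

    def accept(test):
        test._hypothesis_internal_use_seed = seed
        current_settings = getattr(test, "_hypothesis_internal_use_settings", None)
        # The crucial line: explicitly seeding a test also silently
        # disables the example database for it.
        test._hypothesis_internal_use_settings = Settings(
            current_settings, database=None
        )
        return test

    return accept
```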
Finally, I had hit something – it looks like setting the seed directly also implicitly sets the `database` setting to `None`. If I removed that line, then I could (finally) reproduce the error using `--hypothesis-seed=<...>` instead of manually calling `self.random.setstate` in the `ConjectureRunner.__init__` function!
This finally answered the question of “why doesn’t `--hypothesis-seed=<...>` reproduce the error?” – it’s because that flag implicitly turns off the database!
NOTE: At this point, I realized I could’ve found the answer much more simply by just using the `--hypothesis-verbosity=debug` flag – that would have immediately logged the database files we were using, and comparing the output with and without `--hypothesis-seed=<...>` would’ve immediately told me that `database=None` in one but not the other. So in some sense, this was a very roundabout route to what I needed.
It’s the database!
As part of this, I noticed that if I modified the reproduction test to use `hypothesis.settings(database=None)`, then I could no longer reproduce the failure at all! That meant the error must be coming from the database somehow. Going back to `ConjectureRunner.__init__`, it looked like there was a `database_key` field that seemed relevant.
To see what was going on, I added a `print(self.database_key)` statement in the constructor and checked the outputs – it turned out both test functions were sharing the same database key! I added a quick hack to ensure they had different database keys by replacing this line with:
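(I don’t have the exact hack saved, but it was something in this spirit, applied to my local clone of `ConjectureRunner.__init__` – a diagnostic hack, not a real fix.)

```python
import secrets

# Salt the key so that no two runner instances can ever share a database
# entry, which rules out any cross-test collisions.
if database_key is not None:
    database_key = database_key + secrets.token_bytes(8)
self.database_key = database_key
```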
With the key collision cleared up, there was no longer any error! This answered another one of my questions: why was it necessary to run both tests at once to reproduce the error? It must be because, with only one test running, there is nothing for its database key to collide with.
The last part was to check why the database keys were conflicting. From here, I followed `pdb` up the call stack from `ConjectureRunner.__init__` and saw that the database key was being computed by a function called `function_digest`.
Looking at that `function_digest` function, I could see it was basically hashing the source code, name, and arguments of the function. But because of the inheritance, all of those things are exactly the same for every test class that inherits the method!
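To make that concrete, here’s a toy illustration (the class names are made up): two test classes that inherit the same test method share a single underlying function object, so any digest built from its source, name, and argument spec has to be identical.

```python
class NetworkAssertions:
    def test_windowing_invariance(self):
        pass


class TestConvNetwork(NetworkAssertions):
    pass


class TestLSTMNetwork(NetworkAssertions):
    pass


# Both "tests" are literally the same function underneath, so hashing its
# source / name / argspec yields the same database key for both.
assert (
    TestConvNetwork.test_windowing_invariance
    is TestLSTMNetwork.test_windowing_invariance
)
```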
Now, I could finally say I understood what was going on:
- Why were we hitting a “max allowable size” error on such a small test case? Because of some kind of database corruption caused by the conflicting database keys.
- Why did the failure only appear when I ran all tests in the file at once? Without multiple tests, there are no collisions in the database.
- Why did the failure only appear using the test case inheritance setup? Without inheritance, the different tests would be assigned different database keys.
- Why did the failure not reproduce when specifying `--hypothesis-seed=<...>`? Because that implicitly sets `database=None`, which makes the database keys irrelevant.
Now, I could finally go back, fix the intermittent error (by setting `database=None` in our default hypothesis settings), and update the bug report with all my latest findings so we could discuss potential solutions.
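(Our actual fix lives in internal code, but it amounts to registering a settings profile along these lines; the profile name is made up.)

```python
import hypothesis

# Disable the local example database everywhere, sidestepping the shared
# database-key collision between inherited tests.
hypothesis.settings.register_profile("no_database", database=None)
hypothesis.settings.load_profile("no_database")
```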
Lessons
Now, this was a fairly tricky bug – overall, I estimate it took me around 30-40 hours of dedicated work (plus more hours in my off time marinating on the bug) to track down the problem. As you can see, it was a long, winding road, with multiple false starts and places where I was stumped until I came back with fresh eyes.
So, what have we learned? I think a couple things:
- Be patient: sometimes, you just need to walk away and work on something else for a while, coming back with fresh eyes on the problem
- Bugs are fixable: you actually can go in and debug problems, even when they are tricky and involve other people’s code. The joy of using open-source software is that you can fix your own problems and not just wait for someone to come by.
- Know your tools: most of this was only possible because I’ve been using Python for years and know a bunch of ways to get information out of it (e.g. using `pdb` to dynamically explore a codebase, or monkey-patching classes to trace where they’re used). Learning how these tools work is worth it.
- Debugging is like a science: at every step, I had specific questions I wanted to answer, and often a working assumption that I wanted to test. Those questions and assumptions guide what information you should go out and gather – this is much more effective than “let’s print everything, stare at the logs, and think very hard”.
[1] CircleCI helpfully allows you to “re-run with SSH”, so I could SSH onto the actual worker and inspect its state interactively as well.