Debugging is a core skill for software engineers, but IMO it’s one that we often neglect. There’s a ton of advice on the internet about how to be a better systems designer, or how to solve algorithm challenges (albeit mostly for interviews), but very little on how to get better at debugging.
I wanted to help fill that gap by documenting a real-world bug (https://github.com/HypothesisWorks/hypothesis/issues/3446) that stumped me for quite some time. Luckily, it’s in an open source codebase, which means I can share a lot of the actual code without having to censor details of our internal codebase. I might’ve gotten some details wrong (it’s been a couple of months since the bug, and I went on vacation for about 4 weeks in the middle of debugging it), but I’ve reconstructed this as best I could from my notes at the time.
WARNING: this is going to be long, detailed, and somewhat meandering – I wanted to document the entire winding journey, with all the false starts, red herrings, and debugging hacks it took. Bug write-ups are usually polished so that you only see the final solution; my hope is that by showing all the gory details, you’ll at least be able to pick up a few tricks.
If you don’t want all the details, you can skip to the end, where I talk about some of the lessons I hope you’ll walk away with.
Some Context
I work at CTRL-labs, a division of Facebook building neural interface technologies for AR/VR. As part of my work, I wrote and maintain a library for building “real-time streaming” PyTorch neural networks. These operate over “streams” of data – that is, data that is delivered to you over time in, say, 100 μs chunks. We want to run the neural network on the data as it comes in, with the minimum amount of buffering necessary.
Crucially though, we do not control how much data we receive on each chunk – depending on the exact device firmware and the bluetooth status, sometimes we can receive 20 samples of data at once and sometimes we can receive 70 samples of data at once. We want the output logits of the network to be the same, regardless of this behavior – otherwise, debugging “why did my model not respond to XXX” becomes fiendishly difficult.
To test this, we use hypothesis, a library for property-based testing. The way we implement this test is to create an abstract base class where the user defines how to construct a network and we define test methods that check for windowing invariance (among other things). A simplified version of this looks something like
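(What follows is a from-memory sketch rather than our real code: `Assertions` and `create_pytorch_network` are the real names, but the strategy type, settings, and test body shown here are illustrative.)

```python
import abc

import hypothesis
import hypothesis.strategies as st
import torch


class Assertions(abc.ABC):
    @abc.abstractmethod
    def create_pytorch_network(self) -> st.SearchStrategy:
        """Return a strategy that generates networks to test."""

    @hypothesis.settings(deadline=None)  # settings values illustrative
    @hypothesis.given(data=st.data())
    def test_windowing_invariance(self, data):
        network = data.draw(self.create_pytorch_network())
        # Feed the same samples through in differently-sized chunks and
        # check that the output logits are identical (details elided).
```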
That is, the user inherits from `Assertions`, implements `create_pytorch_network`, and then the base class automatically creates the test cases using that common network generator.
The Symptom
One day, I was alerted to a strange failure in our CircleCI tests. You can see the full stacktrace here, but the important part is:
(Stacktrace elided: the key line was Hypothesis complaining that the test case exceeded its maximum allowed size.)
This test was intermittent (which probably explains how it got merged in the first place) – in retrospect, I estimate it fails maybe 5-10% of the time. That’s enough to be very noticeable and annoying to people, but also easy enough to miss during development. At first, I thought it was a legitimate failure that just didn’t get caught often.
But when I looked further, I noticed the failure didn’t actually make much sense. Our test case shouldn’t be exceeding the max allowed size – the actual generator in `create_pytorch_network` looked something like:
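(Again reconstructed from memory – the exact layers and bounds below are illustrative, not the real ones.)

```python
import hypothesis.strategies as st
import torch


@st.composite
def small_networks(draw):
    # Draw a handful of small integers and build a tiny network out of them –
    # nowhere near the amount of data Hypothesis allows per example.
    hidden = draw(st.integers(min_value=1, max_value=8))
    kernel = draw(st.integers(min_value=1, max_value=4))
    return torch.nn.Sequential(
        torch.nn.Conv1d(1, hidden, kernel_size=kernel),
        torch.nn.ReLU(),
        torch.nn.Conv1d(hidden, 1, kernel_size=1),
    )
```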
which should pretty easily fit inside Hypothesis’s maximum allowable size.
Finding a Reproducer
The first step to debugging any failure is to figure out how to reliably reproduce it. It’s possible to debug unreproducible errors (e.g. by using appropriate logging to reconstruct what happened or by using something like rr to make it reproducible), but it’s a lot more involved.
The obvious thing to do was to follow the directions from Hypothesis, which helpfully told me to specify a `--hypothesis-seed=...` flag to reproduce the error. Unfortunately, that did not work – I tried the flag both locally and in the CI environment [1] and could not reproduce the failure.
The second thing to try was to just run the actual test many times until I reproduced the failure. I wrote a small bash script:
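(Roughly the following – the test path and pytest arguments are stand-ins for the real ones.)

```bash
#!/usr/bin/env bash
# Re-run the single flaky test until it fails (or we give up after 10000 runs).
for i in $(seq 1 10000); do
    python -m pytest tests/test_streaming.py -k test_windowing_invariance -x || break
done
```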
and ran it both locally and in CircleCI. Unfortunately, there was still no reproduction. At this point, I was a little stumped – after all, how could this test be intermittently failing in CI, yet I could run it 10000 times without a single failure? I decided to just file an issue with Hypothesis asking for help and to come back another day (after my scheduled vacation).
The day after I got back from vacation, I took another look at the bug and had the bright idea that maybe it had to do with some interaction across test cases. I set up a few bash scripts to run in parallel:
- One running `pytest` on just that file 10000 times
- One running `pytest` on the entire package 10000 times
- One reproducing exactly the CI test command 10000 times
That turned out to be the trick: option 1 (running `pytest` on just that file) was enough to reproduce the problem in CI. Once I had that, I checked whether the same script also reproduced the failure on my laptop, which it did. The problem was somehow tied to running all the tests in that file together.
Overall, I’d say that finding a minimum reproducer probably took 3-4 hours, although most of that was waiting for these scripts to finish executing.
Minimizing the Reproducer
Once I had a reproducer, it was time to minimize the test case. Ideally, I wanted something self-contained in a single file – that’s much easier to share with others, especially the Hypothesis maintainers on GitHub.
I’m sure there’s a smarter way to do this, but I have a very simple algorithm:
- Make a copy of the test file and write a bash script to run `pytest` on it 10000 times.
- Recursively copy and paste any internal dependencies into that file.
- Delete everything that doesn’t seem related. Double-check that you can still reproduce the bug between deletions.
- Start renaming everything to avoid giving away too many details of our internal codebase.
- Check that it reproduces on a fresh environment with minimal dependencies (to check it’s not some weird monkey-patching thing).
Getting this down to something reasonable took about 2 hours of work, and I was able to push the reproducer to https://github.com/alanhdu/hypothesis_3446/blob/591e2efc1e3241f72c2a01352cc3e7e7a116f7ab/test_hypothesis.py and open a bug report on GitHub.
67 lines of code to reproduce the bug – not bad in terms of length, but pretty bad in terms of complexity. There were several things that stuck out to me as suspicious:
- Why were we hitting a “max allowable size” error on such a small test case?
- Why did the failure only appear when I ran all tests in the file at once?
- Why did the failure only appear using the test case inheritance setup?
- Why did the failure not reproduce when specifying `--hypothesis-seed=<...>`?
Getting a breakpoint
Now that I had an easy reproducer, it was time to whip out `pdb`, the Python debugger (technically, I used pdb++, but that’s a minor detail). Using a debugger is much easier than having to deal with print statements.
My first thought was to just use `pytest --pdb` to try to reproduce the problem, but that didn’t work – it just left me at a breakpoint pointed at the `@hypothesis.settings` line, which did not let me really poke around in Hypothesis’s internals.
So instead, I created a new conda environment and cloned Hypothesis locally, using `pip install -e` to install the local version. This let me insert `breakpoint()` calls inside hypothesis’s own codebase and get a debugger going. You don’t technically have to do this – I have sometimes directly modified the files in `anaconda/env/test/lib/python3.8/site-packages/hypothesis` – but that can be risky if you don’t remember to revert your changes afterwards (or to blow away and recreate the environment).
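For concreteness, the setup was something along these lines (the environment name and Python version are illustrative):

```bash
conda create -n hypothesis-debug python=3.8
conda activate hypothesis-debug
git clone https://github.com/HypothesisWorks/hypothesis.git
# The Python package lives in the hypothesis-python/ subdirectory of the repo.
pip install -e hypothesis/hypothesis-python
```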
From there, I just used ripgrep to find the actual error message. Luckily, that message appeared in exactly one place in the codebase, so I was able to insert a `breakpoint()` right at this line.
Debugging
With that set up, I could finally start debugging. Now that I knew where the error message was raised, I could use the breakpoint to inspect the call stack and try to figure out how everything fit together. In particular, how are examples actually being generated, and what defines the “max allowable size”?
Making things deterministic
This took a while (and I went down a wrong turn looking heavily at the `Shrinker` class, which ended up being a red herring), but eventually it became clear that the key method was `ConjectureRunner.generate_new_examples`. That eventually calls into `ConjectureRunner.new_conjecture_data`, which in turn gets handed `self.random`.
Just from the name, it seemed like `self.random` must be the source of non-determinism. To test that assumption, I added a `self.original_state = self.random.getstate()` to the constructor immediately after this line, which meant I could then read that attribute back in a `pdb` session. I could then patch the `__init__` to always call `self.random.setstate(...)` with a state captured from a failing run.
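For reference, this is the round trip the patch relies on: `getstate()` captures the full state of a `random.Random`, and `setstate()` restores it, so replaying a captured state replays the exact same sequence of draws.

```python
import random

rng = random.Random()
state = rng.getstate()          # snapshot the full generator state
first = [rng.random() for _ in range(3)]

rng.setstate(state)             # rewind to the snapshot...
second = [rng.random() for _ in range(3)]
assert first == second          # ...and the draws repeat exactly
```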
What’s wrong with the seeding?
With this change, I had a 100% reproducible bug for the first time ever. Now the question was: why did passing `--hypothesis-seed=...` on the CLI not make things reproducible if a manual `self.random.setstate(...)` did? Clearly, I thought, there must be some kind of hidden non-determinism in how the seed was being set.
I tried to just read through the code to understand where the seed was being set (making strategic use of ripgrep to look for strings like `setstate`), but this ended up going nowhere – the initialization logic was pretty complicated, and I didn’t understand the codebase well enough to piece things together. I decided to work on something else for a while.
After a couple of days, I came back to this and decided I could just cheat – instead of reading the code, I could instrument it to record everywhere the seed was being set. So I wrote a custom wrapper around `random.Random` that would record the stack trace of every call to the `random.Random` constructor and every call to `setstate`, and monkey-patched it into my test script before hypothesis was imported for the first time:
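(A rough reconstruction of that instrumentation – the real version was messier, but the idea was the same.)

```python
import random
import traceback

# Record a stack trace for every Random() construction and every setstate()
# call, so we can see exactly who is (re)seeding what.
SEED_TRACES = []

_OriginalRandom = random.Random


class TracingRandom(_OriginalRandom):
    def __init__(self, *args, **kwargs):
        SEED_TRACES.append(traceback.format_stack())
        super().__init__(*args, **kwargs)

    def setstate(self, state):
        SEED_TRACES.append(traceback.format_stack())
        super().setstate(state)


# Monkey-patch *before* hypothesis is imported, so every Random it creates
# goes through the tracing subclass.
random.Random = TracingRandom

import hypothesis  # noqa: E402
```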
In `pdb`, this let me see all the places where seeds were set and narrow down my search. Finally, I hit pay dirt: the seed was being set in a `get_random_for_wrapped_test` function, which looked like:
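(Paraphrased and abbreviated from hypothesis’s core.py at the time – `Random`, `int_from_bytes`, and `function_digest` are hypothesis’s own imports, and the fallback branches are elided.)

```python
def get_random_for_wrapped_test(test, wrapped_test):
    settings = wrapped_test._hypothesis_internal_use_settings
    wrapped_test._hypothesis_internal_use_generated_seed = None

    if wrapped_test._hypothesis_internal_use_seed is not None:
        # An explicitly-set seed wins over everything else.
        return Random(wrapped_test._hypothesis_internal_use_seed)
    elif settings.derandomize:
        return Random(int_from_bytes(function_digest(test)))
    else:
        # Otherwise, fall back to a globally-seeded Random (elided).
        ...
```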
`pdb` confirmed that the `Random` was being set up using `wrapped_test._hypothesis_internal_use_seed`. Grepping for the `_hypothesis_internal_use_seed` string, I landed at this function:
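(Again paraphrased from hypothesis’s source at the time – this is the decorator that ends up applied when a seed is set explicitly.)

```python
def seed(seed):
    """Run this test reproducibly, deriving all randomness from the given seed."""

    def accept(test):
        test._hypothesis_internal_use_seed = seed
        current_settings = getattr(test, "_hypothesis_internal_use_settings", None)
        # The crucial line: explicitly seeding a test also silently
        # disables the example database for it.
        test._hypothesis_internal_use_settings = Settings(
            current_settings, database=None
        )
        return test

    return accept
```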
Finally, I had hit something – it looks like setting the seed directly also implicitly sets the `database` setting to `None`. If I removed that line, then I could (finally) reproduce the error using `--hypothesis-seed=<...>` instead of manually calling `self.random.setstate` in the `ConjectureRunner.__init__` function!
This finally answered the question of “why doesn’t `--hypothesis-seed=<...>` reproduce the error?” – it’s because that flag implicitly turns off the database!
NOTE: At this point, I realized I could’ve found the answer much more simply by just using the `--hypothesis-verbosity=debug` flag – that would have immediately logged the database files we were using, and comparing the output with and without `--hypothesis-seed=<...>` would’ve immediately told me that `database=None` in one but not the other. So in some sense, this was a very roundabout route to what I needed.
It’s the database!
As part of this, I noticed that if I modified the reproduction test to use `hypothesis.settings(database=None)`, then I could no longer reproduce the failure at all! That meant the error must be coming from the database somehow. Going back to `ConjectureRunner.__init__`, it looked like there was a `database_key` field that seemed relevant.
To see what was going on, I added a `print(self.database_key)` statement in the constructor and checked the outputs – it turned out both test functions were sharing the same database key! I added a quick hack to ensure they had different database keys by replacing this line with:
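(I don’t have the exact hack saved, but it was something in this spirit, applied to my local clone of `ConjectureRunner.__init__` – a diagnostic hack, not a real fix.)

```python
import secrets

# Salt the key so that no two runner instances can ever share a database
# entry, which rules out any cross-test collisions.
if database_key is not None:
    database_key = database_key + secrets.token_bytes(8)
self.database_key = database_key
```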
With the key collision cleared up, there was no longer any error! This answered another one of my questions: why was it necessary to run both tests at once to reproduce the error? It must be because, with only one test running, there is nothing for its database key to collide with.
The last part was to check why the database keys were conflicting. From here, I followed `pdb` up the call stack from `ConjectureRunner.__init__` and saw that the database key was being computed by a function called `function_digest`.
Looking at that `function_digest` function, I could see it was basically hashing the source code, name, and arguments of the function. But because of the inheritance, all of those things are exactly the same for every test class that inherits the method!
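To make that concrete, here’s a toy illustration (the class names are made up): two test classes that inherit the same test method share a single underlying function object, so any digest built from its source, name, and argument spec has to be identical.

```python
class NetworkAssertions:
    def test_windowing_invariance(self):
        pass


class TestConvNetwork(NetworkAssertions):
    pass


class TestLSTMNetwork(NetworkAssertions):
    pass


# Both "tests" are literally the same function underneath, so hashing its
# source / name / argspec yields the same database key for both.
assert (
    TestConvNetwork.test_windowing_invariance
    is TestLSTMNetwork.test_windowing_invariance
)
```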
Now, I could finally say I understood what was going on:
- Why were we hitting a “max allowable size” error on such a small test case? Because of some kind of database corruption caused by the conflicting database keys.
- Why did the failure only appear when I ran all tests in the file at once? Without multiple tests, there are no collisions in the database.
- Why did the failure only appear using the test case inheritance setup? Without inheritance, the different tests would be assigned different database keys.
- Why did the failure not reproduce when specifying `--hypothesis-seed=<...>`? Because that implicitly sets `database=None`, which makes the database keys irrelevant.
Now, I could finally go back, fix the intermittent error (by setting `database=None` in our default hypothesis settings), and update the bug report with all my latest findings so we could discuss potential solutions.
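(Our actual fix lives in internal code, but it amounts to registering a settings profile along these lines; the profile name is made up.)

```python
import hypothesis

# Disable the local example database everywhere, sidestepping the shared
# database-key collision between inherited tests.
hypothesis.settings.register_profile("no_database", database=None)
hypothesis.settings.load_profile("no_database")
```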
Lessons
Now, this was a fairly tricky bug – overall, I estimate it took me around 30-40 hours of dedicated work (plus more hours in my off time marinating on the bug) to track down the problem. As you can see, it was a long, winding road, with multiple false starts and places where I was stumped until I came back with fresh eyes.
So, what have we learned? I think a couple things:
- Be patient: sometimes, you just need to walk away and work on something else for a while, coming back with fresh eyes on the problem
- Bugs are fixable: you actually can go in and debug problems, even when they are tricky and involve other people’s code. The joy of using open-source software is that you can fix your own problems and not just wait for someone to come by.
- Know your tools: most of this was only possible because I’ve been using Python for years and know a bunch of ways to get information out of it (e.g. using `pdb` to dynamically explore a codebase, or monkey-patching classes to trace where they’re used). Learning how these tools work is worth it.
- Debugging is like a science: at every step, I had specific questions I wanted to answer, and often a working assumption that I wanted to test. Those questions and assumptions guide what information you should go out and gather – this is much more effective than “let’s print everything, stare at the logs, and think very hard”.
[1] CircleCI helpfully allows you to “re-run with SSH”, so I could SSH onto the actual worker and inspect its state interactively as well.