Amateur Hour

Intermediate Property-based Testing

Or finding properties to actually test
Jul 13, 2023. Filed under Technical
Tags: software

For such an effective way of ferreting out bugs in systems, property-based testing is drastically underused today. I’ve been using property-based testing for almost 10 years now, and in that time I’ve met maybe 3 other coworkers who also use it, even for systems that are ripe for it.

IMO, a big reason for this is that property-based testing has a “draw-the-owl” problem: most articles begin with fairly artificial examples (e.g. the oft-used x == decode(encode(x)) example) but then don’t go much further. I know when I started out, I definitely had problems figuring out how to actually apply property-based testing to the systems I was working with, and I’m sure many others have the same problem.

[Image: the “how to draw an owl” joke]

There are entire research papers whose main contribution is basically “we came up with a clever property to test and discovered bugs in some programs”.

Note that this post is geared towards those who already know what property-based testing is. I’ll give a brief introduction for those who are brand-new, but you’ll probably want to look at some other introductory material as well.

What is Property-Based Testing?

Normal unit testing has a bit of a chicken-and-egg problem: the reason why we need test suites is that we are generally pretty bad at finding all the different edge cases and interactions present in a system. But because we’re bad at such things, it’s easy to not test exactly those edge cases that cause problems in production!

Property-based testing is one potential solution to this: the basic idea is to test properties that are true for a large set of test inputs. You then have the computer automatically search through that set to find counterexamples.

The hello world of property-based testing is the “encode/decode” example – let’s say that you are writing your own JSON serializer and deserializer. Then a normal test would look something like:

def test_my_json_encoder_works():
    assert my_json.encode([1, 2, 3]) == "[1, 2, 3]"

But a really naive property-based test would look something like:

def test_my_json_roundtrip():
    # Generate a random list of integers
    xs = [random.randint(1, 10) for __ in range(random.randint(1, 10))]
    # Assert my property
    assert xs == my_json.decode(my_json.encode(xs))

Of course, an actual property-based testing system will implement more effective test-case generators that explore more of the input space. Good ones will also implement “shrinking” for you, automatically minimizing any test case that causes a failure.1
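For comparison, here’s roughly what the same round-trip test looks like when written with a library like Hypothesis – just a sketch, with my_json still standing in for the hypothetical serializer:

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_my_json_roundtrip(xs):
    # Hypothesis generates the input lists (and shrinks any failing one) for us
    assert xs == my_json.decode(my_json.encode(xs))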

Then there are two questions you need to answer:

1. What properties should you actually test?
2. How do you generate the test inputs?

We’ll be focusing on the first question here. The second question is (IMO) trickier but also less necessary (you can get pretty far with pretty naive generators), so we’ll leave it to a hypothetical “Advanced Property-Based Testing” blog post.

How to actually find good properties?

The most obvious properties are “definitional” properties – these are cases where the definition of the function naturally implies a specification (e.g. a “sort” means that the output should have the same elements as the input, but in ascending order). This is good for things where you can reason about a logical specification, but in my experience there are relatively few places where you can do this easily (if it were easy, we could all just use Coq or Lean).
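As a quick sketch of what a definitional property can look like in practice, here’s a Hypothesis-style test for a hypothetical my_sort function (the name is made up for illustration):

from collections import Counter

from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_my_sort_meets_its_definition(xs):
    result = my_sort(xs)  # my_sort is the hypothetical function under test
    # The output should be in ascending order...
    assert all(a <= b for a, b in zip(result, result[1:]))
    # ...and contain exactly the same elements as the input.
    assert Counter(result) == Counter(xs)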

If those don’t exist, there are 3 rough categories of properties that I usually turn to when writing a test suite:

1. Model-checking / test oracles
2. Invariants
3. Metamorphic testing

The boundaries are a little fuzzy, but I’ll give a bunch of examples of each to hopefully make things clearer.

Model-Checking / Test Oracles

A lot of times, you can define an “oracle” that tells you what the right answer “should be”. If you have one, then the property is just whether your system agrees with that oracle. Some common examples:
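One classic pattern here is checking an optimized implementation against a slow but obviously correct brute-force version. A minimal sketch with Hypothesis (the maximum-subarray functions below are purely illustrative):

from hypothesis import given, strategies as st

def brute_force_max_subarray(xs):
    # Obviously correct but O(n^2): try every non-empty contiguous subarray.
    return max(sum(xs[i:j]) for i in range(len(xs)) for j in range(i + 1, len(xs) + 1))

def fast_max_subarray(xs):
    # Kadane's algorithm -- the optimized implementation we actually want to ship.
    best = cur = xs[0]
    for x in xs[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best

@given(st.lists(st.integers(), min_size=1))
def test_fast_matches_brute_force_oracle(xs):
    assert fast_max_subarray(xs) == brute_force_max_subarray(xs)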

Invariants

An invariant is a change you can make to the input without logically changing the output – assert decode(encode(x)) == x is a simple example of this. A more involved example comes from a real-time streaming framework (a la MediaPipe) I use at work. Let’s say I want to implement a Python computation to plug into this framework – such computations operate over batches of samples and return batches of samples (e.g. fn(input_batch: Sequence[T]) -> Sequence[U]). But we don’t actually control the batching behavior – things like network lag, the sensor firmware we’re looking at, and more can all affect how many samples are delivered at once. This creates a natural invariant for us to enforce:

full_stream = [...]
output = []
for batch in divide_into_batches(full_stream):
    output += streaming_compute(batch)

should always return the same result, no matter how the stream is split into batches.2 (See the sketch at the end of this section for what this looks like as an actual test.) Some other examples:

Note that for ML algorithms you often can’t get exact invariants, but you should be able to get “approximate” invariants (e.g. outputs that agree up to some numerical tolerance).
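To make the batching invariant above concrete, here’s a minimal sketch of what such a test can look like with Hypothesis; the RunningSum class is just a toy stand-in for the real streaming computation:

from hypothesis import given, strategies as st

class RunningSum:
    # Toy stateful streaming computation: emits the cumulative sum so far.
    def __init__(self):
        self.total = 0

    def process(self, batch):
        out = []
        for x in batch:
            self.total += x
            out.append(self.total)
        return out

@given(
    st.lists(st.integers(), min_size=1),
    st.lists(st.integers(min_value=1, max_value=5), min_size=1),
)
def test_output_does_not_depend_on_batching(stream, batch_sizes):
    # Reference run: feed the whole stream as one batch.
    expected = RunningSum().process(stream)

    # Batched run: split the stream using whatever batch sizes Hypothesis picked.
    batched = RunningSum()
    output, i, sizes = [], 0, iter(batch_sizes)
    while i < len(stream):
        n = next(sizes, 1)  # fall back to size-1 batches once the sizes run out
        output += batched.process(stream[i:i + n])
        i += n

    assert output == expected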

Metamorphic Testing

This is a generalization of an invariant: you make some transformation to the system input which corresponds to some known transformation of the system output. In effect, we want the following commutative diagram:

[Image: commutative diagram]

I find this is the most helpful category for testing complicated “scientific” algorithms. For these algorithms, it’s often hard to derive “perfect” solutions to test against, but you can sanity-check that they behave correctly under a number of such transformations:
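For instance, one such metamorphic relation for a simple smoothing filter could look like the sketch below (moving_average is just a toy stand-in for a more interesting algorithm):

from hypothesis import given, strategies as st

def moving_average(xs, window=3):
    # Toy stand-in for a more interesting "scientific" algorithm.
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]

@given(st.lists(st.floats(-1e6, 1e6), min_size=3), st.floats(-1e3, 1e3))
def test_offset_input_offsets_output(xs, c):
    # Metamorphic relation: adding a constant offset to every input sample
    # should add the same offset to every output sample (up to float error).
    shifted = moving_average([x + c for x in xs])
    expected = [y + c for y in moving_average(xs)]
    assert all(abs(a - b) < 1e-6 for a, b in zip(shifted, expected))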

Hopefully, you can see how metamorphic testing lets you test a bunch of things, even if you don’t have a good oracle specifying what the “correct” solution even is!


  1. Broadly speaking, there are two main generations of property-based testing libraries. The first is the QuickCheck family, which uses type-directed test case generation and shrinking. The second generation is what I call “Hypothesis-like”, which uses explicit “strategies” that control both generation and shrinking. While almost all of what I’m talking about is possible with both, you’ll generally have a better experience using Hypothesis-like libraries (since the strategies give you much better shrinking for free). ↩︎

  2. The obvious implementation is to just have the streaming computation operate sample-by-sample internally, but in Python it’s usually much faster to operate over a whole batch (since you can vectorize the underlying NumPy/SciPy/whatever operations). ↩︎