Most unit tests are liars. They pass confidently with a handful of carefully chosen examples, then fail spectacularly in production the moment real data walks through the door. You write test_sort_with_empty_list, test_sort_with_one_item, test_sort_with_duplicates, and call it done — only to discover months later that your sort implementation chokes on lists with NaN, or negative zeros, or 2 million elements.
Property-based testing flips this model. Instead of writing specific input-output pairs, you describe properties that should hold for all valid inputs, and let the testing framework generate thousands of cases automatically — including edge cases you’d never think to write. When a property fails, the framework shrinks the failing input to the smallest possible example that still triggers the bug.
This post walks through practical property-based testing in both Python (using Hypothesis) and Go (using rapid), with patterns you can apply to your own codebase today.
Example-Based Tests Hit a Ceiling
Consider a simple function that merges overlapping time ranges:
def merge_ranges(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
if not ranges:
return []
sorted_ranges = sorted(ranges, key=lambda r: r[0])
merged = [sorted_ranges[0]]
for start, end in sorted_ranges[1:]:
if start <= merged[-1][1]:
merged[-1] = (merged[-1][0], max(merged[-1][1], end))
else:
merged.append((start, end))
return merged
Traditional tests might cover empty input, a single range, two non-overlapping ranges, and two overlapping ranges. That feels thorough — until someone passes ranges where start > end, or ranges that are fully contained within others, or thousands of tiny overlapping intervals.
The Property-Based Approach
Instead of guessing which inputs matter, define what “correct” means in general terms. For merge_ranges, there are several invariant properties:
- Coverage: Every input range is covered by at least one output range
- No overlaps: Output ranges never overlap
- Sorted: Output ranges are in ascending order
- Consistency: Merging an already-merged result returns the same result
Python with Hypothesis
from hypothesis import given, settings
from hypothesis import strategies as st
# Strategy: generate a list of (int, int) tuples where start <= end
range_strategy = st.tuples(st.integers(min_value=0, max_value=1000),
st.integers(min_value=0, max_value=1000))
valid_range = range_strategy.map(lambda t: (min(t), max(t)))
range_list = st.lists(valid_range, min_size=0, max_size=50)
@given(range_list)
@settings(max_examples=500)
def test_merge_ranges_no_overlaps(ranges):
result = merge_ranges(ranges)
for i in range(len(result) - 1):
assert result[i][1] < result[i + 1][0], \
f"Overlapping ranges in output: {result[i]} and {result[i + 1]}"
@given(range_list)
def test_merge_ranges_coverage(ranges):
result = merge_ranges(ranges)
for start, end in ranges:
covered = any(rs <= start and end <= re for rs, re in result)
assert covered, f"Range ({start}, {end}) not covered by {result}"
@given(range_list)
def test_merge_ranges_idempotent(ranges):
"""Merging an already-merged result should return the same result."""
first = merge_ranges(ranges)
second = merge_ranges(first)
assert first == second
Hypothesis generates hundreds of random inputs per test. When it finds a failure, it automatically shrinks the example — so instead of reporting a bug with a list of 47 random ranges, it might tell you the simplest failing case is [(-5, -5), (-5, 10)].
Go with rapid
The same approach in Go using rapid, which provides a type-safe, generics-based API:
package mergetest
import (
"sort"
"testing"
"pgregory.net/rapid"
)
type Range struct {
Start int
End int
}
func MergeRanges(ranges []Range) []Range {
if len(ranges) == 0 {
return nil
}
sorted := make([]Range, len(ranges))
copy(sorted, ranges)
sort.Slice(sorted, func(i, j int) bool {
return sorted[i].Start < sorted[j].Start
})
merged := []Range{sorted[0]}
for _, r := range sorted[1:] {
last := &merged[len(merged)-1]
if r.Start <= last.End {
if r.End > last.End {
last.End = r.End
}
} else {
merged = append(merged, r)
}
}
return merged
}
func TestMergeRanges_NoOverlaps(t *testing.T) {
rapid.Check(t, func(t *rapid.T) {
ranges := rapid.SliceOf(rapid.Custom(func(t *rapid.T) Range {
a := rapid.IntRange(0, 1000).Draw(t, "start")
b := rapid.IntRange(0, 1000).Draw(t, "end")
if a > b {
a, b = b, a
}
return Range{Start: a, End: b}
})).Draw(t, "ranges")
result := MergeRanges(ranges)
for i := 0; i < len(result)-1; i++ {
if result[i].End >= result[i+1].Start {
t.Fatalf("overlapping output ranges: %v and %v",
result[i], result[i+1])
}
}
})
}
Notice how rapid.Custom() lets you build complex generators from primitives. The Draw() method consumes from an internal random byte stream, which enables rapid’s automatic minimization — when a test fails, rapid traces which bytes produced which values and systematically shrinks them.
Common Property Patterns
Coming up with properties is the hardest part of property-based testing. Here are five patterns that cover most real-world cases:
1. Roundtrip: decode(encode(x)) == x
Any serialization format should satisfy this. JSON, protobuf, MessagePack, custom binary formats — if you can encode and decode, you can test it:
func TestJSONRoundtrip(t *testing.T) {
rapid.Check(t, func(t *rapid.T) {
original := rapid.Custom(func(t *rapid.T) Event {
return Event{
ID: rapid.String().Draw(t, "id"),
Type: rapid.String().Draw(t, "type"),
Payload: rapid.SliceOf(rapid.Byte()).Draw(t, "payload"),
}
}).Draw(t, "event")
data, err := json.Marshal(original)
if err != nil {
t.Fatalf("marshal failed: %v", err)
}
var decoded Event
if err := json.Unmarshal(data, &decoded); err != nil {
t.Fatalf("unmarshal failed: %v", err)
}
if decoded != original {
t.Fatalf("roundtrip mismatch: got %v, want %v",
decoded, original)
}
})
}
2. Oracle: Compare Against a Reference Implementation
When rewriting or optimizing a function, keep the old version around as an oracle. Generate random inputs and verify both produce identical results:
@given(st.lists(st.integers()))
def test_optimized_sort_matches_builtin(values):
assert optimized_sort(values) == sorted(values)
This is one of the highest-value property tests you can write — it catches every class of correctness bug in the new implementation, because you have a known-correct baseline to compare against.
3. Invariant: Something Should Never Happen
State the negative — your function should never return duplicate IDs, your stack should never have a negative depth, your parser should never crash on valid input:
@given(st.text(min_size=0, max_size=500))
def test_parser_never_crashes(input_text):
"""The parser must not raise exceptions on any string input."""
result = parse(input_text) # Should handle all input gracefully
assert isinstance(result, ParseResult)
4. Idempotence: f(f(x)) == f(x)
Operations that should converge — sorting, deduplication, normalization, merging — are naturally idempotent. Applying them twice should produce the same result as applying them once:
@given(st.text())
def test_normalize_idempotent(text):
once = normalize_unicode(text)
twice = normalize_unicode(once)
assert once == twice
5. Commutativity and Associativity
Mathematical properties translate directly to tests. String concatenation is associative for merge operations. Set union is commutative. Addition is both. If your operation claims these properties, test them:
func TestSetUnionCommutative(t *testing.T) {
rapid.Check(t, func(t *rapid.T) {
a := rapid.MapOf(
rapid.IntRange(0, 100),
rapid.String(),
).Draw(t, "a")
b := rapid.MapOf(
rapid.IntRange(0, 100),
rapid.String(),
).Draw(t, "b")
assertEqual(t, union(a, b), union(b, a))
})
}
Stateful Testing: Testing State Machines
Both Hypothesis and rapid support stateful (model-based) testing, where you define a model of your system’s expected behavior and the framework generates random sequences of operations to verify the real implementation matches the model.
In rapid, this looks like defining a state machine with rapid.MakeStateMachine — you specify actions, preconditions, and invariants. The framework then generates random action sequences, which is invaluable for catching bugs in concurrent data structures, caches, and protocol implementations.
In Hypothesis, the @st.composite decorator combined with RuleBasedStateMachine provides the same capability. You define rules that transition between states, and Hypothesis explores the state space looking for sequences that violate your invariants.
Integration with Existing Test Suites
You don’t need to rewrite your entire test suite. The best approach is incremental:
- Start with roundtrip tests — serialization, encode/decode, parse/render. These are easy to write and catch surprising bugs.
- Add oracle tests for any function you’re optimizing or rewriting. Compare old vs new.
- Add invariant tests for your core data structures — “this list should always be sorted,” “this map should never have null values.”
- Gradually expand to more complex properties and stateful testing as the team gets comfortable.
One practical tip: both Hypothesis and rapid integrate cleanly with standard test runners. Hypothesis tests are just pytest functions with a decorator. rapid tests are standard testing.T functions that call rapid.Check. No new build steps, no CI changes — just add them and run.
When Property-Based Testing Shines (And When It Doesn’t)
Property-based testing is most valuable for:
- Libraries and utilities with well-defined algebraic properties
- Parsers, serializers, and data format converters
- Optimization rewrites where you have a reference implementation
- Concurrent data structures and state machines
- Any code where edge cases matter (financial calculations, cryptography, compression)
It’s less useful for:
- UI behavior and visual rendering
- Integration tests with external services (though you can generate request payloads)
- One-off scripts with simple logic
The key insight is that property-based tests complement — not replace — example-based tests. Keep your specific examples for documentation and regression. Add properties to catch the cases you didn’t think of.
Getting Started
For Python projects, install Hypothesis and point it at one of your pure functions:
pip install hypothesis
For Go projects, add rapid as a test dependency:
go get pgregory.net/rapid
Pick one function — something with clear input/output behavior, something you understand well. Write a roundtrip property or an oracle test. Run it with a few hundred examples. See what happens. More often than not, you’ll find at least one edge case your manual tests missed.
The real payoff comes months later, when a new edge case appears in production and your property tests already catch it in CI — because the framework generated that exact input at random during a routine test run.