Lecture 3: Property-Based Testing (PBT) with well-scoped and well-typed terms
Modules referenced in this lecture
Overview and Goals
How do we know that the code that we have written so far is correct? We could write a bunch of unit tests, but it is more fun to use property-based testing with Haskell’s QuickCheck library (QC). With this approach, we define properties of our definitions, and then use QuickCheck to test those properties extensively with randomly generated values.
In this lecture, we will talk about how to generate random well-scoped (and well-typed) terms and use them to test the parser/pretty printer and evaluator that we defined in the previous lecture.
1. QuickCheck properties for parsing and pretty printing.
Recall that we can connect parsing and generation via a chain of transformations:
      inject        scope-check
Tm Z ────────► N.Tm ──────────► Either ScopeCheckError (Tm Z)
      inject        pretty-print          parse                           scope-check
Tm Z ────────► N.Tm ──────────► String ──────────► Either ParseError N.Tm ──────────► Either ScopeCheckError (Tm Z)
Two QuickCheck properties
(defined in Tutorial.Scoped.ScopeCheck) verify that our
transformations are correct:
-- inject then project recovers the original term
prop_project_round_trip :: S.Tm Z -> Bool
prop_project_round_trip i =
  projectTm (injectTm i) == Right i

-- | Pretty-printing a term and parsing it back yields the original named term.
prop_parse_round_trip :: S.Tm Z -> Bool
prop_parse_round_trip i =
  parse (pp i) == Right i

However, to run QuickCheck properties we need a way to generate random closed terms!
QuickCheck crash course
QuickCheck provides two central abstractions for property-based testing:
- Gen a: a type for generators of values of type a. Combinators like QC.sized, QC.oneof, and QC.elements, together with the monad operations, build generators compositionally.
- Arbitrary a: a typeclass that packages a default generator (arbitrary :: Gen a) and a shrinker (shrink :: a -> [a]):

class Arbitrary a where
  arbitrary :: Gen a
  shrink :: a -> [a]
  shrink _ = [] -- default: no shrinking
QuickCheck uses arbitrary to generate test inputs and
shrink to reduce a failing case to a smaller one.
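As a small illustration of these two abstractions, the sketch below defines an Arbitrary instance for a toy expression type. The names Expr, eval, and prop_evalAdd are hypothetical and exist only for this example; they are not part of the tutorial's modules.

```haskell
import Test.QuickCheck

-- Hypothetical toy type, used only to illustrate Gen and Arbitrary.
data Expr = Lit Int | Add Expr Expr
  deriving (Eq, Show)

instance Arbitrary Expr where
  -- sized exposes the current size budget; halve it at each Add node
  arbitrary = sized go
    where
      go :: Int -> Gen Expr
      go n
        | n <= 1    = Lit <$> arbitrary
        | otherwise = oneof
            [ Lit <$> arbitrary
            , Add <$> go (n `div` 2) <*> go (n `div` 2)
            ]
  -- candidates: the immediate subterms, then smaller versions of each side
  shrink (Lit n)   = Lit <$> shrink n
  shrink (Add a b) =
    [a, b] ++ [Add a' b | a' <- shrink a] ++ [Add a b' | b' <- shrink b]

eval :: Expr -> Int
eval (Lit n)   = n
eval (Add a b) = eval a + eval b

-- A property over two arbitrary expressions.
prop_evalAdd :: Expr -> Expr -> Bool
prop_evalAdd a b = eval (Add a b) == eval a + eval b
```

In GHCi, quickCheck prop_evalAdd runs the property on randomly generated pairs of expressions, using arbitrary to produce them and shrink to minimize any failing pair.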
2. A well-scoped generator for pure lambda calculus terms
Let’s start with a simple well-scoped generator that targets only
pure lambda calculus terms (Lam, App,
Var). It is a good warm-up before tackling the full
language.
Therefore we want to implement Gen (S.Tm n) — a
generator that only ever produces well-scoped de Bruijn terms, so that
every randomly generated term is a legitimate input to our
properties.
The key idea is to carry the scope — the number of variables currently in scope — as an implicit runtime parameter for the generator. When the generator recurses under a binder it increments the scope, making the newly bound variable available.
The Fin.universe operation enumerates all indices from
0 to n-1. Then the QC.elements
operation picks a random one. However, this only makes sense if there
is at least one free variable in scope. If n is zero, then
we don’t have any variables to pick from. Therefore, we just return the
smallest closed term.
-- At small size generate either a variable or "\x.x" depending on scope
genBase :: forall n. SNat n -> Gen (Tm n)
genBase SZ = return tmId
genBase SS = QC.elements (map Var Fin.universe)

genScopedPureLC :: forall n. SNatI n => QC.Gen (Tm n)
genScopedPureLC = QC.sized go
  where
    go :: forall n. SNatI n => Int -> QC.Gen (Tm n)
    go sz | sz <= 1 = genBase snat
    go sz =
      let
        -- generate a random name and increment the number of free variables
        gen1 = bind <$> genLocalName <*> go (sz - 1)
        -- recursive calls for App divide size by two
        gen = go (sz `div` 2)
      in
        QC.oneof [genBase snat, Lam <$> gen1, App <$> gen <*> gen ]

The generator is parameterized by the number of free variables in scope, n :: Nat, which is a type-level natural number. To use this number at runtime we need a singleton, which the SNatI n constraint provides.
The go function does the main work of generation. Its only explicit argument is sz, a size budget; the number of variables in scope is carried implicitly by the SNatI n constraint. During property-based testing the QC.sized operation passes the current size budget to go, so that QuickCheck can generate progressively larger terms. The snat call converts the SNatI n constraint into an explicit SNat n value that genBase can inspect.
When the size is larger than one, there are three choices for generation via QC.oneof:
- genBase snat: a variable or tmId (same as before, but available at any size).
- Lam <$> gen1: wrap a binder; gen1 generates the body in scope S n by calling go (sz - 1). The budget decreases by 1 (not halved) because a lambda chain λx.λy.λz.… is a linear sequence, not a tree.
- App <$> gen <*> gen: apply two sub-terms, each in the same scope n but with budget sz `div` 2 so that both branches together stay within budget.
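The same scheme can be sketched in a self-contained form. The definitions below are stand-ins, not the tutorial's code: Nat, Fin, SNat, and universe are hand-written here instead of coming from a library, binder names are omitted, and the scope is passed as an explicit SNat argument rather than through an SNatI constraint. The helper inScope is a hypothetical untyped re-check used only to observe that generated indices stay in range.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}
import Test.QuickCheck (Gen, elements, generate, oneof, sized, vectorOf)

-- Stand-ins (assumptions) for the library's Nat, Fin, SNat and the tutorial's Tm.
data Nat = Z | S Nat

data Fin (n :: Nat) where
  FZ :: Fin ('S n)
  FS :: Fin n -> Fin ('S n)

data SNat (n :: Nat) where
  SZ :: SNat 'Z
  SS :: SNat n -> SNat ('S n)

data Tm (n :: Nat) where
  Var :: Fin n -> Tm n
  Lam :: Tm ('S n) -> Tm n      -- binder names omitted in this sketch
  App :: Tm n -> Tm n -> Tm n

-- All indices below n.
universe :: SNat n -> [Fin n]
universe SZ     = []
universe (SS n) = FZ : map FS (universe n)

-- Base case: the identity function in the empty scope, otherwise a variable.
genBase :: SNat n -> Gen (Tm n)
genBase SZ       = pure (Lam (Var FZ))
genBase n@(SS _) = elements (map Var (universe n))

genTm :: SNat n -> Gen (Tm n)
genTm n0 = sized (go n0)
  where
    go :: SNat m -> Int -> Gen (Tm m)
    go n sz
      | sz <= 1   = genBase n
      | otherwise = oneof
          [ genBase n
          , Lam <$> go (SS n) (sz - 1)                       -- one more variable in scope
          , App <$> go n (sz `div` 2) <*> go n (sz `div` 2)  -- split the budget
          ]

finToInt :: Fin n -> Int
finToInt FZ     = 0
finToInt (FS i) = 1 + finToInt i

-- Untyped re-check: every index is below k plus the number of enclosing binders.
inScope :: Int -> Tm n -> Bool
inScope k (Var i)   = finToInt i < k
inScope k (Lam b)   = inScope (k + 1) b
inScope k (App f a) = inScope k f && inScope k a
```

For example, all (inScope 0) <$> generate (vectorOf 100 (genTm SZ)) checks a batch of closed terms. The check is redundant by construction, since the index type already rules out out-of-scope variables.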
Binders also need names for pretty-printing. QuickCheck draws them from a small pool:
genLocalName :: QC.Gen LocalName
genLocalName = LocalName <$> QC.elements ["x", "y", "z", "w", "v", "u", "t", "s"]

This can produce name collisions (e.g. "x" can appear in nested binders). However, this is not a problem: injectTm freshens names when the same name is already in scope. Correctness is unaffected because LocalName equality ignores the stored string.
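One way such an equality could be set up is sketched below. This is a guess at the shape of the definition, not the tutorial's actual LocalName; the toy Tm type with Int indices is likewise hypothetical. The point is that a derived Eq on terms then compares binders up to renaming.

```haskell
-- Hypothetical reconstruction: a name that carries its string only for display.
newtype LocalName = LocalName { name :: String }

instance Eq LocalName where
  _ == _ = True   -- all local names compare equal

instance Show LocalName where
  show = name

-- Toy de Bruijn terms whose binders remember a display name.
data Tm = Var Int | Lam LocalName Tm | App Tm Tm
  deriving (Eq, Show)

-- Two spellings of the identity function.
idX, idY :: Tm
idX = Lam (LocalName "x") (Var 0)
idY = Lam (LocalName "y") (Var 0)
```

With these definitions idX == idY holds, because the derived instance for Tm delegates to the constant-True instance for LocalName, while terms that differ in their indices still compare unequal.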
3. Seeing the round trip in action
With this generator we can run our tests:
ghci> import Tutorial.Scoped.ScopeCheck
ghci> import Tutorial.Scoped.Gen
ghci> import Test.QuickCheck
ghci> quickCheck (forAllShrinkShow genScopedPureLC shrinkScoped pp prop_project_round_trip)
+++ OK, passed 100 tests.
ghci> quickCheck (forAllShrinkShow genScopedPureLC shrinkScoped pp prop_parse_round_trip)
+++ OK, passed 100 tests.
The forAllShrinkShow combinator tells QuickCheck which scoped
generator, scoped shrinker, and pretty printer to use when testing.
The round-trip properties confirm that scope-checking the named term recovers exactly the original de Bruijn term.
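The inject/scope-check half of the round trip can be sketched on a simplified syntax. Everything here is an assumption made for illustration: DB uses plain Int indices instead of Fin n, inject invents one distinct name per binder depth rather than reusing stored names, and roundTrips is a hypothetical helper. The sketch assumes a closed, well-scoped input.

```haskell
import Data.List (elemIndex)

-- Hypothetical simplified syntax: plain Int indices instead of Fin n.
data DB = DVar Int | DLam DB | DApp DB DB
  deriving (Eq, Show)

data Named = NVar String | NLam String Named | NApp Named Named
  deriving (Eq, Show)

-- inject: pick a distinct name for each binder depth (closed input assumed).
inject :: DB -> Named
inject = go []
  where
    go env (DVar i)   = NVar (env !! i)
    go env (DLam b)   = let x = "x" ++ show (length env)
                        in NLam x (go (x : env) b)
    go env (DApp f a) = NApp (go env f) (go env a)

-- project: scope-check by looking each name up in the stack of binders.
project :: Named -> Either String DB
project = go []
  where
    go env (NVar x)   =
      maybe (Left ("unbound variable: " ++ x)) (Right . DVar) (elemIndex x env)
    go env (NLam x b) = DLam <$> go (x : env) b
    go env (NApp f a) = DApp <$> go env f <*> go env a

-- The round-trip statement for a single term.
roundTrips :: DB -> Bool
roundTrips t = project (inject t) == Right t
```

For instance, roundTrips (DLam (DLam (DApp (DVar 1) (DVar 0)))) evaluates to True, while project (NVar "y") reports an unbound variable.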
4. Testing eval
Now let’s consider some properties that we might test for our evaluator.
Big-step properties
The most basic property is that if a term evaluates to some result, then the result is a value. We can write this test as follows:
-- if a term evaluates, it produces a value
prop_evalVal :: Tm Z -> Property
prop_evalVal = \t ->
  discardAfter 1000000 $
    case eval t of
      Just v ->
        counterexample ("not a value: " ++ pp v) $
          property (isVal v)
      Nothing ->
        discard

We can strengthen this property by requiring evaluation to produce a result. The stronger property only holds for well-typed terms.
-- all terms evaluate to values
prop_eval_exists_Val :: Tm Z -> Property
prop_eval_exists_Val = \t ->
  discardAfter 1000000 $
    case eval t of
      Just v ->
        counterexample ("not a value: " ++ pp v) $
          property (isVal v)
      Nothing ->
        property False

What else can we test? We can define a generalization of call-by-value evaluation, called reduce, that works with open terms.
For closed terms it should agree with eval:
prop_eval_reduce :: Tm Z -> Property
prop_eval_reduce = \t ->
  case eval t of
    -- wrap the Bool in property so both branches have type Property
    Just v -> property (reduce t == Just v)
    Nothing -> discard

We also test that reduce produces an inert term (one that cannot step further). This property is checked at both the closed scope Z and the open scope S Z (one free variable):
prop_reduce_inert :: forall n. SNatI n => Tm n -> Property
prop_reduce_inert = \t ->
  case reduce t of
    Just v -> property (isInert v)
    Nothing -> discard

5. Going further: A well-typed generator?
The well-scoped generator (genScopedFull) produces any
structurally valid term — but “well-scoped” does not mean “well-typed”.
In an untyped setting, terms like the ω combinator
-- (λx. x x) (λx. x x)
omega :: Tm Z
omega = let self = Lam (bind1 (LocalName "x") (App (Var FZ) (Var FZ)))
        in App self self

are perfectly well-scoped and will be generated regularly. Running
eval omega diverges: the evaluator loops forever.
This means properties tested with genScopedTm can hang
rather than fail, making it difficult to get useful QuickCheck output.
This is a real problem, because our well-scoped generator can produce
such terms in practice.
ghci> quickCheck (forAll genScopedPureLC prop_evalVal)
*** Failed! Timeout of 1000000 microseconds exceeded. (after 25 tests):
App (Lam (bind1 (Var 0))) (App (App (Lam (bind1 (App (Var 0) (Var 0)))) (Lam (bind1 (App (Var 0) (Var 0))))) (Lam (bind1 (Var 0))))
term: (\ x. x) ((\ u. u u) (\ v. v v) (\ x. x))

If we add shrinking, we can even see the smallest such term.
ghci> quickCheck (forAllShrinkShow genScopedPureLC shrinkScoped Tutorial.Top.pp prop_evalVal)
*** Failed! Timeout of 1000000 microseconds exceeded. (after 65 tests and 4 shrinks):
(\ x. x x) (\ s. s s)
term: (\ x. x x) (\ s. s s)

One solution is to add timeouts to our properties. However, if we do so, we end up discarding many test cases.
prop_evalVal:
+++ OK, passed 1000 tests; 2133 discarded.
Alternatively, we can restrict generation to well-typed
terms. In the simply-typed lambda calculus every well-typed term is strongly
normalizing, so eval is guaranteed to terminate on every
well-typed input.
This also lets us state stronger type-soundness properties. For example, every well-typed closed term must evaluate to a value:
prop_eval_exists_Val :: Tm Z -> Property
prop_eval_exists_Val = \t ->
  case eval t of
    Just v -> property (isVal v)
    Nothing -> property False -- well-typed terms must not get stuck

And repeatedly applying the step function must eventually produce a value for closed terms.

-- | Iterating step on a closed term must reach a value without getting stuck.
prop_stepVal :: Tm Z -> Property
prop_stepVal e =
  let loop e =
        if isVal e then property True
        else case step e of
          Nothing ->
            counterexample ("stuck at: " ++ pp e) $
              property False
          Just e' -> loop e'
  in within 1000000 $ loop e

These properties hold by type soundness (progress + preservation) and would not be testable with the well-scoped generator alone.
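The iteration inside prop_stepVal can be made concrete with a toy small-step evaluator. This is a minimal sketch under several assumptions: terms use plain Int de Bruijn indices, only the pure lambda calculus is covered, substitution assumes the substituted term is closed, and stepsToVal uses a step budget in place of the within timeout. None of these names come from the tutorial's modules.

```haskell
-- Hypothetical untyped stand-in for the tutorial's Tm Z.
data Tm = Var Int | Lam Tm | App Tm Tm
  deriving (Eq, Show)

isVal :: Tm -> Bool
isVal (Lam _) = True
isVal _       = False

-- subst j s t: replace index j in t by the closed term s;
-- indices above j drop by one because a binder has been removed.
subst :: Int -> Tm -> Tm -> Tm
subst j s (Var k)
  | k == j    = s
  | k > j     = Var (k - 1)
  | otherwise = Var k
subst j s (Lam b)   = Lam (subst (j + 1) s b)
subst j s (App f a) = App (subst j s f) (subst j s a)

-- One call-by-value step, left to right; Nothing if no rule applies.
step :: Tm -> Maybe Tm
step (App (Lam b) a)
  | isVal a   = Just (subst 0 a b)
  | otherwise = App (Lam b) <$> step a
step (App f a) = (\f' -> App f' a) <$> step f
step _         = Nothing

-- Iterate step for at most n steps; Nothing means stuck or out of budget.
stepsToVal :: Int -> Tm -> Maybe Tm
stepsToVal n t
  | isVal t   = Just t
  | n <= 0    = Nothing
  | otherwise = step t >>= stepsToVal (n - 1)

idT, omega :: Tm
idT   = Lam (Var 0)
omega = App w w where w = Lam (App (Var 0) (Var 0))
```

Here stepsToVal 10 (App idT (App idT idT)) reaches Just idT after two steps, whereas step omega returns Just omega, so stepsToVal never reaches a value for omega no matter how large the budget is.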
6. Two dimensions of generation
Therefore, we can define a generator for our syntax that is parameterized over two orthogonal dimensions:
data Constraint = Scoped | Typed
data Language = PureLC | Full

- Scoped: generate any well-scoped term; types are ignored.
- Typed: generate a well-typed term by first picking a random type and context, then generating a term of that type.
- PureLC: only use Lam, App, and Var (the pure lambda calculus).
- Full: also include Unit, Pair, MatchPair, Inj, MatchUnit, and MatchSum.
For the Typed constraint, genTypedTm works
by type-directed synthesis. At each recursive call it has a target
type and a typing context (a vector mapping each de Bruijn
index to its type), and it only produces terms that have exactly that
type.
The high-level strategy is:
- Introduction forms are selected by the target type. If the goal is a :-> b, generate a Lam whose body is generated at type b in an extended context. If the goal is a :* b, generate a Pair. And so on.
- Elimination forms (function application, pattern matching) require an arbitrary intermediate type, generated fresh for that sub-call. For example, to generate an App targeting type b, first pick a random argument type a, then generate a function of type a :-> b and an argument of type a.
- Variables are filtered by type: only those indices whose context entry matches the target type are offered as candidates.
- Fallback: when no introduction or variable candidate applies (e.g. at size 0 targeting a function type with no matching variable), the generator falls back to Unit.
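A much-reduced sketch of type-directed generation is shown below. It is not the tutorial's genTypedTm: the language has only Unit and function types, indices are plain Ints, the context is a list, and Lam carries its argument type so that the hypothetical checker typeOf stays simple. In this reduced language an introduction form always exists, so no fallback case is needed.

```haskell
import Test.QuickCheck (Gen, generate, oneof, sized, vectorOf)

data Ty = TUnit | Ty :-> Ty
  deriving (Eq, Show)
infixr 5 :->

-- Lam carries its argument type here (an assumption of this sketch).
data Tm = Var Int | Lam Ty Tm | App Tm Tm | Unit
  deriving (Eq, Show)

genTy :: Int -> Gen Ty
genTy n
  | n <= 0    = pure TUnit
  | otherwise = oneof [pure TUnit, (:->) <$> genTy (n `div` 2) <*> genTy (n `div` 2)]

-- genTyped ctx ty sz: a term of type ty, where index i has type ctx !! i.
genTyped :: [Ty] -> Ty -> Int -> Gen Tm
genTyped ctx ty sz = oneof (intro : vars ++ elims)
  where
    -- introduction form chosen by the target type
    intro = case ty of
      TUnit   -> pure Unit
      a :-> b -> Lam a <$> genTyped (a : ctx) b (sz - 1)
    -- variables filtered by type
    vars = [pure (Var i) | (i, t) <- zip [0 ..] ctx, t == ty]
    -- elimination form: pick a fresh argument type, then build an application
    elims
      | sz <= 1   = []
      | otherwise =
          [ do
              a <- genTy 2
              App <$> genTyped ctx (a :-> ty) (sz `div` 2)
                  <*> genTyped ctx a (sz `div` 2)
          ]

-- Top level: pick a target type, then a closed term of that type.
genClosed :: Gen (Ty, Tm)
genClosed = sized $ \sz -> do
  ty <- genTy 3
  tm <- genTyped [] ty sz
  pure (ty, tm)

-- A simple checker, used to confirm that generated terms have the requested type.
typeOf :: [Ty] -> Tm -> Maybe Ty
typeOf ctx (Var i)
  | i >= 0 && i < length ctx = Just (ctx !! i)
  | otherwise                = Nothing
typeOf ctx (Lam a b) = (a :->) <$> typeOf (a : ctx) b
typeOf ctx (App f x) = case typeOf ctx f of
  Just (a :-> b) | typeOf ctx x == Just a -> Just b
  _                                       -> Nothing
typeOf _ Unit = Just TUnit
```

Sampling with generate (vectorOf 100 genClosed) and checking typeOf [] tm == Just ty for each pair gives a quick sanity check that the generator only produces terms of the requested type.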
The top-level call also picks a random context and a random target
type before calling genTypedTm, so the resulting term may
have free variables with known types — or be closed if the context is
empty.
Shrinking
Random generation also requires shrinking. Because we have two
different constraints for the terms that we generate, we need two
different shrinking functions: shrinkScoped and
shrinkTyped.
Both shrinking functions take a counterexample (a term) and produce a
list of smaller terms that satisfy the same constraint. For
shrinkScoped, the constraint is that the smaller terms must
have the same scope as the original term. The type system guarantees
this constraint automatically. For shrinkTyped, the smaller
terms must still be well-typed, but because we are not tracking types in
the type system, they need not have the same type as the original term.
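To see how the index type enforces the scope constraint during shrinking, consider the following sketch. It is not the tutorial's shrinkScoped: Nat, Fin, and Tm are hand-written stand-ins, binder names are omitted, and shrinkTm and size are hypothetical names.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

-- Stand-ins (assumptions) for the library's Nat and Fin and the tutorial's Tm.
data Nat = Z | S Nat

data Fin (n :: Nat) where
  FZ :: Fin ('S n)
  FS :: Fin n -> Fin ('S n)

data Tm (n :: Nat) where
  Var :: Fin n -> Tm n
  Lam :: Tm ('S n) -> Tm n
  App :: Tm n -> Tm n -> Tm n

-- Every candidate has type Tm n, so it lives in the same scope as the input.
shrinkTm :: Tm n -> [Tm n]
shrinkTm (Var _)   = []
shrinkTm (Lam b)   = [Lam b' | b' <- shrinkTm b]   -- shrink under the binder
shrinkTm (App f a) =
  [f, a] ++ [App f' a | f' <- shrinkTm f] ++ [App f a' | a' <- shrinkTm a]

-- Number of constructors, to confirm that candidates really are smaller.
size :: Tm n -> Int
size (Var _)   = 1
size (Lam b)   = 1 + size b
size (App f a) = 1 + size f + size a
```

Note that Lam b cannot be shrunk to its body b directly: b has type Tm ('S n), so offering it as a candidate of type Tm n would be rejected by the type checker. That rejection is the automatic guarantee in action.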
7. Historical Notes
QuickCheck. Koen Claessen and John Hughes introduced QuickCheck in “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs” (2000). Property-based testing is now widely used across languages. The technique of generating well-scoped (or well-typed) terms directly — rather than generating strings and parsing them — avoids a large class of trivially-failing test cases and is the standard approach for testing language implementations.
Well-scoped and well-typed term generation. Generating terms directly in an indexed representation — rather than generating strings and filtering out ill-scoped ones — was popularized in the property-based testing literature by Pałka et al. (“Testing an Optimising Compiler by Generating Random Lambda Terms”, 2011), who used it to find bugs in GHC. Type-directed generation (producing only well-typed terms) is the natural extension: it avoids both scope errors and divergence, and was used by Lampropoulos, Paraskevopoulou, and Pierce (“Generating Good Generators for Inductive Relations”, 2018) and others for testing type-preserving compilers and proof assistants.
Exercises
1. Extending genTm. After adding
Let to the language (Exercise 3 in Lecture 1 and Exercise 2
above), extend genTm in Tutorial.Scoped.Gen to
also generate let-expressions.
-- In the Full branch of gens inside genTm:
, Let <$> gen <*> gen1

- Why is gen1 the right generator for the binder part of Let?
- What scope does the body of the let run in? How does this differ from App?
- Check that prop_project_round_trip still passes after this change.
2. An open-term round trip.
projectTmWith and injectTmWith work on open
terms too. Write a QuickCheck property analogous to
prop_project_round_trip for terms with one free variable
(Tm (S Z)), and test it:
prop_project_round_trip_open :: S.Tm (S Z) -> Bool
prop_project_round_trip_open t =
  -- hint: use projectTmWith and injectTmWith with a suitable initial context
  undefined

You will need to choose a name for the free variable and pass it to injectTmWith. Likewise, you will need to prime projectTmWith with an association list that maps that name to FZ.

- What initial Vec (S Z) String do you pass to injectTmWith?
- What initial [(String, Fin (S Z))] do you pass to projectTmWith?
- Does the choice of name matter? Why or why not?
3. Substitution laws. State and test the following
equational laws as QuickCheck properties on Tm Z:
- Identity: applyE idE t == t
- Composition: applyE f (applyE g t) == applyE (compE f g) t
- Instantiate-shift: instantiate1 (Bind1 (weaken t)) u == t for any t, u :: Tm Z
For the composition law, use concrete environments,
e.g. g = idE and f = idE, or build simple
environments with (.:). Can you find a counterexample to
any of these properties if you get the implementation of
lift wrong?
4. Small step. Develop a small-step version of
reduce, called step. The small-step function
step either returns a reduct or Nothing (if
the term is inert). Two properties connect it to eval and
reduce:
-- for well-typed closed terms, step always reaches a value
prop_stepVal :: Tm Z -> Property
prop_stepVal =
  let loop e =
        if isVal e then property True
        else case step e of
          Nothing -> counterexample ("stuck at: " ++ pp e) (property False)
          Just e' -> loop e'
  in loop

-- stepping preserves the final evaluation result
prop_evalStep :: Tm Z -> Property
prop_evalStep = \e ->
  case step e of
    Nothing -> property (isVal e)
    -- wrap the Bool in property so both branches have type Property
    Just e' -> property (eval e == eval e')

5. Full reduction (normalization). The
reduce function in Tutorial.Scoped.Eval is a
weak reducer: the Lam case returns the lambda
unchanged without looking inside the body. Implement full
reduction (also called normalization), which reduces everywhere
— including under binders:
normalize :: Tm n -> Maybe (Tm n)
normalize = _

Hints:
- The key new case is Lam b. Use unbindl1 b to extract the stored local name and the body (of type Tm (S n)). Recursively normalize the body, then re-package it with bind1. Because Tm (S n) is an open term, the recursive call is well-typed without any extra constraint.
- For App, first normalize both sub-terms. If the function normalizes to Lam b and the argument to nv, perform the beta step with instantiate1 b nv and normalize the result. If the function is inert, return the application of the normalized sub-terms.
- The other cases (Pair, Inj, MatchSum, MatchPair, MatchUnit) follow the same pattern as in reduce, but calling normalize recursively instead of reduce, and also normalizing inside constructor arguments.

- normalize should agree with reduce on terms that contain no redexes under binders. State and test this as a QuickCheck property on Tm (S Z):

prop_normalize_reduce :: Tm (S Z) -> Property
prop_normalize_reduce t = ...

Hint: define a predicate noLambdaRedex :: Tm n -> Bool that holds when there are no beta redexes inside any lambda body, and use QC.classify to report what fraction of generated terms satisfy it.
- Define isNormal :: Tm n -> Bool that holds when a term contains no beta redexes anywhere (including under binders). Then state and test:

prop_normalize_normal :: Tm (S Z) -> Property
prop_normalize_normal t =
  case normalize t of
    Just nf -> property (isNormal nf)
    Nothing -> discard

- On closed terms, does normalize t succeed exactly when eval t succeeds? State this as a property and test it. Does it hold for well-scoped terms? For well-typed terms? Explain any discrepancies you observe.