Lecture 3: Property-Based Testing (PBT) with well-scoped and well-typed terms

Modules referenced in this lecture

Overview and Goals

How do we know that the code that we have written so far is correct? We could write a bunch of unit tests, but it is more fun to use property-based testing, with Haskell’s Quickcheck library (QC). With this approach, we define properties of our definitions, and then use Quickcheck test those properties extensively with randomly generated values.

In this lecture, we will talk about how to generate random well-scoped (and well-typed) terms and use them to test the parser/pretty printer and evaluator that we defined in the previous lecture.

1. Quick check properties for parsing and pretty printing.

Recall that we can connect parsing and generation — via a chain of transformations:


        inject            scope-check
Tm Z    ────────►  N.Tm   ──────────►  Either ScopeCheckError (Tm Z)


         inject            pretty-print          parse             scope-check
Tm Z    ────────►  N.Tm   ──────────►  String  ──────────► Either ParseError N.Tm  ──────────►  Either ScopeCheckError (Tm Z)

Two QuickCheck properties (defined in Tutorial.Scoped.ScopeCheck) verify that our transformations are correct:

-- inject then project recovers the original term
prop_project_round_trip :: S.Tm Z -> Bool
prop_project_round_trip i =
    projectTm (injectTm i) == Right i

-- pretty-print then parse recovers the named term

-- | Pretty-printing a term and parsing it back yields the original named term.
prop_parse_round_trip :: S.Tm Z -> Bool
prop_parse_round_trip i =
   parse (pp i) == Right i

However, to run QuickCheck properties we need a way to generate random closed values!

Quickcheck crash course

QuickCheck provides two central abstractions for property-based testing:

QuickCheck uses arbitrary to generate test inputs and shrink to reduce a failing case to a smaller one.

2. A well-scoped generator for pure lambda calculus terms

Let’s start with a simple well-scoped generator that targets only pure lambda calculus terms (Lam, App, Var). It is a good warm-up before tackling the full language.

Therefore we want to implement Gen (S.Tm n) — a generator that only ever produces well-scoped de Bruijn terms, so that every randomly generated term is a legitimate input to our properties.

The key idea is to carry the scope — the number of variables currently in scope — as an implicit runtime parameter for the generator. When the generator recurses under a binder it increments the scope, making the newly bound variable available.

The Fin.universe operation enumerates all indices from 0 to n-1. Then the QC.elements operation picks a random one. However, this only makese sense if there is at least one free variable in scope. If n is zero, then we don’t have any variables to pick from. Therefore, we just return the smallest closed term.

-- At small size generate either a variable or "\x.x" depending on scope
genBase :: forall n. SNat n -> Gen (Tm n) 
genBase SZ = return tmId
genBase SS = QC.elements (map Var Fin.universe)
genScopedPureLC :: forall n. SNatI n => QC.Gen (Tm n)
genScopedPureLC = QC.sized go
    where
      go :: forall n. SNatI n => Int -> QC.Gen (Tm n)
      go sz | sz <= 1 = genBase snat
      go sz = 
         let
           -- generate a random name and increment the number of free variables
           gen1 = bind <$> genLocalName <*> go (sz - 1)

           -- recursive calls for App divide size by two
           gen  = go (sz `div` 2)
         in 
         QC.oneof [genBase snat, Lam <$> gen1, App <$> gen <*> gen ]

The generator is parameterized by the number of free variables in scope n :: Nat, which is a type-level natural number. To use this number at runtime we need a singleton provided by the SNatI n.

The go function does the main work of generation. The first argument n is the number of variables in scope, the second is sz, a size budget. During property-based testing the QC.sized operation passes the current size budget to the inner function go to generate progressively larger terms. The snat call converts the SNatI n constraint into an explicit SNat n value that go can inspect and pass to recursive calls.

The local helper go :: SNat n -> Int -> QC.Gen (Tm n) carries both the scope witness and the size budget. When the size is larger than one, there are three choices for generation via QC.oneof:

Binders also need names for pretty-printing. QuickCheck draws them from a small pool:

genLocalName :: QC.Gen LocalName
genLocalName = LocalName <$> QC.elements ["x", "y", "z", "w", "v", "u", "t", "s"]

This can produce name collisions (e.g. "x" can appear in nested binders). However, this is not a problem:injectTm freshens names when the same name is already in scope. Correctness is unaffected because LocalName equality ignores the stored string.

3. Seeing the round trip in action

With this generator we can run our tests:

ghci> import Tutorial.Scoped.ScopeCheck
ghci> import Tutorial.Scoped.Gen
ghci> import Test.QuickCheck
ghci> quickCheck (forAllShrinkShow genScopedPureLC shrinkScoped pp prop_project_round_trip)
+++ OK, passed 100 tests.
ghci> quickCheck (forAllShrinkShow genScopedPureLC shrinkScoped pp prop_parse_round_trip)
+++ OK, passed 100 tests.

The forAllShrinkShow instructs quickcheck which scoped generator, scoped shrinker, and pretty printer to use when testing.

The round-trip properties confirm that scope-checking the named term recovers exactly the original de Bruijn term.

4. Testing eval

Now let’s consider some properties that we might test for our evaluator.

Big-step properties

The most basic property is that if a term evaluates to some result, then the result is a value. We can write this test as follows:


-- if a term evaluates, it produces a value
prop_evalVal :: Tm Z -> Property
prop_evalVal = \t ->
    discardAfter 1000000 $
    case eval t of
        Just v -> 
            counterexample ("not a value: " ++ pp v) $
            property (isVal v)
        Nothing -> 
            discard

We can strengthen this property by requiring the evaluation to produce a result. This property only holds for well-typed terms.


-- all terms evaluate to values
prop_eval_exists_Val :: Tm Z -> Property
prop_eval_exists_Val = \t ->
    discardAfter 1000000 $
    case eval t of
        Just v -> 
            counterexample ("not a value: " ++ pp v) $
            property (isVal v)
        Nothing -> 
            property False

What else can we test? We can define a generalization of call-by-value evaluation called reduce that works with open terms.

For closed terms it should agree with eval:

prop_eval_reduce :: Tm Z -> Property
prop_eval_reduce = \t ->
    case eval t of
        Just v  -> reduce t == Just v
        Nothing -> discard

We also test that reduce produces an inert term (one that cannot step further). This property is checked at both the closed scope Z and the open scope S Z (one free variable):

prop_reduce_inert :: forall n. SNatI n => Tm n -> Property
prop_reduce_inert = \t ->
    case reduce t of
        Just v  -> property (isInert v)
        Nothing -> discard

5. Going further: A well-typed generator?

The well-scoped generator (genScopedFull) produces any structurally valid term — but “well-scoped” does not mean “well-typed”. In an untyped setting, terms like the ω combinator

-- (λx. x x) (λx. x x)
omega :: Tm Z
omega = let self = Lam (bind1 (LocalName "x") (App (Var FZ) (Var FZ)))
        in App self self

are perfectly well-scoped and will be generated regularly. Running eval omega diverges: the evaluator loops forever.

This means properties tested with genScopedTm can hang rather than fail, making it difficult to get useful QuickCheck output. This is a real problem, because our well-scoped generator can produce such terms in practice.

ghci> quickCheck (forAll genScopedPureLC prop_evalVal)
*** Failed! Timeout of 1000000 microseconds exceeded. (after 25 tests):
App (Lam (bind1 (Var 0))) (App (App (Lam (bind1 (App (Var 0) (Var 0)))) (Lam (bind1 (App (Var 0) (Var 0))))) (Lam (bind1 (Var 0))))
term: (\ x. x) ((\ u. u u) (\ v. v v) (\ x. x))

If we add shrinking, we can even see this smallest term.

ghci> quickCheck (forAllShrinkShow genScopedPureLC shrinkScoped Tutorial.Top.pp prop_evalVal)
*** Failed! Timeout of 1000000 microseconds exceeded. (after 65 tests and 4 shrinks):
(\ x. x x) (\ s. s s)
term: (\ x. x x) (\ s. s s)

The solution is to add timeouts to our properties. However, if we do so, we end up discarding many cases.

prop_evalVal:
+++ OK, passed 1000 tests; 2133 discarded.

Alternatively, we can restrict generation to well-typed terms. In the simply-typed lambda calculus every term is strongly normalizing, so eval is guaranteed to terminate on every well-typed input.

This also lets us state stronger type-soundness properties. For example, every well-typed closed term must evaluate to a value:

prop_eval_exists_Val :: Tm Z -> Property
prop_eval_exists_Val = \t ->
    case eval t of
        Just v  -> property (isVal v)
        Nothing -> property False   -- well-typed terms must not get stuck

And the step function must produces a value for closed terms.


-- | the step 
prop_stepVal :: Tm Z -> Property
prop_stepVal e =
    let loop e =
          if isVal e then property True
          else case step e of
             Nothing ->
                counterexample ("stuck at: " ++ pp e) $
                property False
             Just e' -> loop e'
    in within 1000000 $ loop e

These properties hold by type soundness (progress + preservation) and would not be testable with the well-scoped generator alone.

6. Two dimensions of generation

Therefore, we can define a generator for our syntax that is parameterized over two orthogonal dimensions:

data Constraint = Scoped | Typed
data Language   = PureLC | Full

For the Typed constraint, genTypedTm works by type-directed synthesis. At each recursive call it has a target type and a typing context (a vector mapping each de Bruijn index to its type), and it only produces terms that have exactly that type.

The high-level strategy is:

The top-level call also picks a random context and a random target type before calling genTypedTm, so the resulting term may have free variables with known types — or be closed if the context is empty.

Shrinking

Random generation also requires shrinking. Because we have two different constraints for the terms that we generate, we need two different shrinking functions: shrinkScoped and shrinkTyped.

Both shrinking function take a counter example (a term) and produce a list of smaller terms that satisfy the same constraint. For shrinkScoped, the constraint is that the smaller terms must have the same scope as the original term. The type system guarantees this constraint automatically. For shrinkTyped the smaller terms must still be well-typed, but because we are not tracking types in the type system, they may not have the same type.

6. Historical Notes

QuickCheck. Koen Claessen and John Hughes introduced QuickCheck in “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs” (2000). Property-based testing is now widely used across languages. The technique of generating well-scoped (or well-typed) terms directly — rather than generating strings and parsing them — avoids a large class of trivially-failing test cases and is the standard approach for testing language implementations.

Well-scoped and well-typed term generation. Generating terms directly in an indexed representation — rather than generating strings and filtering out ill-scoped ones — was popularized in the property-based testing literature by Pałka et al. (“Testing an Optimising Compiler by Generating Random Lambda Terms”, 2011), who used it to find bugs in GHC. Type-directed generation (producing only well-typed terms) is the natural extension: it avoids both scope errors and divergence, and was used by Lampropoulos, Paraskevopoulou, and Pierce (“Generating Good Generators for Inductive Relations”, 2018) and others for testing type-preserving compilers and proof assistants.

Exercises

1. Extending genTm. After adding Let to the language (Exercise 3 in Lecture 1 and Exercise 2 above), extend genTm in Tutorial.Scoped.Gen to also generate let-expressions.

-- In the Full branch of gens inside genTm:
, Let <$> gen <*> gen1

2. An open-term round trip. projectTmWith and injectTmWith work on open terms too. Write a QuickCheck property analogous to prop_project_round_trip for terms with one free variable (Tm (S Z)), and test it:

prop_project_round_trip_open :: S.Tm (S Z) -> Bool
prop_project_round_trip_open t =
    -- hint: use projectTmWith and injectTmWith with a suitable initial context
    undefined

You will need to choose a name for the free variable and pass it to injectTmWith. Likewise, you will need to prime projectTmWith with an association list that maps that name to FZ.

3. Substitution laws. State and test the following equational laws as QuickCheck properties on Tm Z:

For the composition law, use concrete environments, e.g. g = idE and f = idE, or build simple environments with (.:). Can you find a counterexample to any of these properties if you get the implementation of lift wrong?

4. Small Step Develop a small-step version of reduce, called step. The small-step function step either returns a reduct or Nothing (if the term is inert). Two properties connect it to eval and reduce:

-- for well-typed closed terms, step always reaches a value
prop_stepVal :: Tm Z -> Property
prop_stepVal = 
    let loop e =
          if isVal e then property True
          else case step e of
                 Nothing -> counterexample ("stuck at: " ++ pp e) (property False)
                 Just e' -> loop e'
    in loop

-- stepping preserves the final evaluation result
prop_evalStep :: Tm Z -> Property
prop_evalStep = \e ->
    case step e of
        Nothing -> property (isVal e)
        Just e' -> eval e == eval e'

5. Full reduction (normalization). The reduce function in Tutorial.Scoped.Eval is a weak reducer: the Lam case returns the lambda unchanged without looking inside the body. Implement full reduction (also called normalization), which reduces everywhere — including under binders:

normalize :: Tm n -> Maybe (Tm n)
normalize = _

Hints: