Lecture 1: Well-Scoped De Bruijn Representations
Modules referenced in this lecture
Overview and Goals
This lecture covers a well-scoped representation of lambda calculus terms. By well-scoped, we mean that the type system statically tracks the scope of each term. If the type of the term says that its scope is empty, then it will be a type error to reference a free variable in the term.
During the next fifty minutes, we will define this representation using de Bruijn indices (i.e. natural numbers) to represent variables. We will also implement a substitution-based interpreter with this representation, where type checking ensures that we haven’t made any scope errors.
The goals of this lecture are to:
cover the building blocks of working with the lambda calculus and implementing capture-avoiding substitution
convince you that the small amount of dependently-typed programming needed for well-scoped lambda terms pays off, even in Haskell
show off the rebound library
In general, the ideas that we will cover this week should translate to any functional programming language with minimal support for dependent types. I’m using Haskell, but if you are working in Agda, Idris, Lean, or Rocq, you can translate this code to that setting.
1. The Problem: Variable Binding
Variable binding is the central challenge in implementing any language with binders (λ-abstractions, let-expressions, pattern matching). The core operation is substitution: to evaluate
(λx. x + x) 42
we substitute 42 for x in
x + x. But substitution in the presence of free variables
requires care. A naive string-substitution breaks on examples like
(λx. λy. x) y → λz. y -- must rename inner binder to avoid capturing y
Capture-avoiding substitution handles this by tracking free variables and renaming binders that would capture them. Unfortunately, implementations of this operation are a notorious source of bugs.
Furthermore, with named variables, λx. x and
λy. y are alpha-equivalent but structurally
different. When implementing lambda calculus terms, we want to decide
when terms are equivalent up-to the names of bound variables.
De Bruijn indices solve both problems. We replace variable names with numbers that count the binders between a use and its binding site:
λx. x becomes λ. 0
λx. λy. x becomes λ. λ. 1
λx. λy. x + y becomes λ. λ. 1 + 0
Alpha-equivalence collapses to structural equality — we get a correct
implementation in Haskell for free (by deriving the Eq type
class). And because there are no names, there is nothing to capture.
While we will still need to do some work to define our substitution
function, statically checking the scopes will help us develop a correct
implementation.
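To make the translation concrete, here is a minimal sketch of converting named terms to plain (not yet well-scoped) de Bruijn terms. The types Named and DB and the functions toDB and alphaEq are hypothetical names introduced only for this illustration. The list of binders in scope is kept innermost first, so the position of a name in the list is its index.

```haskell
import Data.List (elemIndex)

-- Named lambda terms (hypothetical surface syntax for this sketch).
data Named
  = NVar String
  | NLam String Named
  | NApp Named Named

-- Nameless terms: a variable is the number of binders between
-- its use and its binding site.
data DB
  = DVar Int
  | DLam DB
  | DApp DB DB
  deriving (Eq, Show)

-- Convert a closed named term. Returns Nothing for an unbound variable.
-- elemIndex finds the first (innermost) binder with that name,
-- so shadowing is handled correctly.
toDB :: Named -> Maybe DB
toDB = go []
  where
    go ctx (NVar x)   = DVar <$> elemIndex x ctx
    go ctx (NLam x b) = DLam <$> go (x : ctx) b
    go ctx (NApp f a) = DApp <$> go ctx f <*> go ctx a

-- Alpha-equivalence of closed terms is structural equality
-- after conversion, using the derived Eq instance.
alphaEq :: Named -> Named -> Bool
alphaEq s t = case (toDB s, toDB t) of
  (Just a, Just b) -> a == b
  _                -> False
```

With these definitions, alphaEq (NLam "x" (NVar "x")) (NLam "y" (NVar "y")) holds, because both sides convert to DLam (DVar 0).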
2. Finite Sets and Scopes: Nat and Fin
The well-scoped approach tracks the number of variables in scope at the type level, ruling out out-of-scope variable references statically. Let’s start with the definition of natural numbers, zero and successor.
data Nat = Z | S Nat
In Haskell, we can use these numbers in types (via the DataKinds extension). For example, here are some names that we can use for type-level numbers:
type N1 = S Z
type N2 = S (S Z)
type N3 = S (S (S Z))
A term in scope n may refer to variables 0
through n-1. We represent numbers in this finite scope as a
GADT:
data Fin n where
  FZ :: Fin (S n)           -- zero, valid in any non-empty scope
  FS :: Fin n -> Fin (S n)  -- successor
Fin Z is uninhabited, which means a closed term (scope
Z) cannot contain any variables — the type checker enforces
this statically. Some examples:
- FZ :: Fin (S Z) is the only variable in scope 1
- FZ :: Fin (S (S Z)) is variable 0 in scope 2
- FS FZ :: Fin (S (S Z)) is variable 1 in scope 2
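As a small self-contained sketch of these definitions, the following module repeats Nat and Fin and adds a helper toInt (a hypothetical name, not taken from the lecture's modules) that forgets the scope and recovers the plain index. The commented-out definition shows the kind of out-of-scope reference that the type checker rejects.

```haskell
{-# LANGUAGE DataKinds, GADTs, KindSignatures #-}

data Nat = Z | S Nat

data Fin (n :: Nat) where
  FZ :: Fin (S n)           -- zero, valid in any non-empty scope
  FS :: Fin n -> Fin (S n)  -- successor

-- Forget the scope and recover the plain de Bruijn index.
-- (toInt is a helper assumed for this sketch.)
toInt :: Fin n -> Int
toInt FZ     = 0
toInt (FS x) = 1 + toInt x

-- The two variables that exist in scope 2.
v0, v1 :: Fin (S (S Z))
v0 = FZ
v1 = FS FZ

-- Rejected by the type checker: index 2 does not exist in scope 2,
-- because the innermost FZ would need the type Fin Z.
-- bad :: Fin (S (S Z))
-- bad = FS (FS FZ)
```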
3. Types, Terms and Binders
Our object language is a simply-typed lambda calculus with binary products and sums:
data Ty = One | Ty :-> Ty | Ty :* Ty | Ty :+ Ty
Terms are indexed by their scope:
data Tm (n :: Nat) where
  Var       :: Fin n -> Tm n
  Lam       :: Bind1 n -> Tm n
  Unit      :: Tm n
  Pair      :: Tm n -> Tm n -> Tm n
  Inj       :: Int -> Tm n -> Tm n
  App       :: Tm n -> Tm n -> Tm n
  MatchUnit :: Tm n -> Tm n -> Tm n
  MatchPair :: Tm n -> Bind2 n -> Tm n
  MatchSum  :: Tm n -> Bind1 n -> Bind1 n -> Tm n
Although written in GADT syntax, Tm is not a proper GADT
— all constructors produce Tm n for the same
n. The interesting type-level work happens in
Fin n. A term of type Tm Z is closed.
When we cross a binder, the scope grows. Bind1 n
packages a body that lives in scope S n:
data Bind1 n where
  Bind1 :: Tm (S n) -> Bind1 n
Bind2 n packages a body under two simultaneous binders
(used in MatchPair):
data Bind2 n where
  Bind2 :: Tm (S (S n)) -> Bind2 n
Inside a Bind1 n, FZ is the newly-bound
variable and FS x refers to whatever x
referred to in the outer scope. Inside a Bind2 n,
FZ is the second-bound variable, FS FZ the
first-bound, and FS (FS x) escapes to the outer scope.
Some concrete examples:
-- λx. x
ex_id :: Tm Z
ex_id = Lam (Bind1 (Var FZ))
-- λx. λy. x (x = FS FZ inside inner body, y = FZ)
ex_const :: Tm Z
ex_const = Lam (Bind1 (Lam (Bind1 (Var (FS FZ)))))
-- λf. λg. λx. f (g x)
ex_comp :: Tm Z
ex_comp = Lam (Bind1 (Lam (Bind1 (Lam (Bind1
  (App (Var (FS (FS FZ))) (App (Var (FS FZ)) (Var FZ))))))))
-- indices: f = FS (FS FZ), g = FS FZ, x = FZ
4. Substitution Environments
Substitution is an operation that replaces the free variables in
terms with other terms. We will use a data structure, called an
environment, to represent this mapping. It associates
every variable in some scope m with a term, and all of these terms must be in
the same scope n.
The simplest way to express a substitution environment is using a function with the following type:
type Env m n = Fin m -> Tm n
An Env m n maps each of m variables to a
term in scope n. You should think of this as an abstract
data structure that maps indices to terms. Above, we represent it as a
function (with a finite domain), but it could also be implemented with a
length-indexed vector, array, tree, or other structure. In the fourth
lecture we will discuss various trade-offs in the representation. First,
however, let’s look at the interface we need from this structure.
The most important operation that we need for this data structure is to be able to apply it to a term, replacing all of the free variables in the term with their definitions in the environment. This operation is sometimes called parallel substitution because it performs multiple substitutions at once.
applyE :: Env m n -> Tm m -> Tm n
However, we also need to be able to build substitutions, using the following operations.
List-like structure: empty and extension
Another way to think about an environment is like a list, where the ith position in the list is a mapping for index i.
zeroE :: Env Z m
zeroE = \n -> case n of {}
infixr 5 .:
(.:) :: Tm n -> Env m n -> Env (S m) n
t .: env = \f -> case f of
  FZ   -> t
  FS x -> env x
Like cons: t .: env extends env with a
binding for the outermost variable. Right-associativity lets us write
t1 .: t2 .: env like list cons.
For example, we can create an environment using the example terms above. Here, the type of the environment tells us that there are exactly three closed terms.
exampleE :: Env N3 Z
exampleE = ex_id .: ex_const .: ex_comp .: zeroE
More operations
Here’s another example environment: an identity substitution for terms with three free variables. We map each index to a variable with that index.
id3E :: Env N3 N3
id3E = Var f0 .: Var f1 .: Var f2 .: zeroE
But that is overly complicated. We can also create an identity substitution for every scope, with the following definition:
idE :: Env n n
idE = Var
Furthermore, this definition hints at how we might update all free variables in a term. The shift environment increments every index by one.
shift :: Env n (S n)
shift = Var . FS
This environment weakens a scope: applying applyE shift to a term in
scope n produces the same term in scope S n.
In this larger scope, the variable Var FZ is not used.
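The following sketch exercises the environment operations above by looking up individual indices. To stay short and self-contained it assumes a cut-down term language with only Var, Lam, and App, and it stores binder bodies directly rather than through Bind1. The names idTm, constTm, and twoE are hypothetical.

```haskell
{-# LANGUAGE DataKinds, EmptyCase, GADTs, KindSignatures, StandaloneDeriving #-}

data Nat = Z | S Nat

data Fin (n :: Nat) where
  FZ :: Fin (S n)
  FS :: Fin n -> Fin (S n)
deriving instance Eq (Fin n)
deriving instance Show (Fin n)

-- Cut-down term language (assumption for this sketch: no Bind1 wrapper).
data Tm (n :: Nat) where
  Var :: Fin n -> Tm n
  Lam :: Tm (S n) -> Tm n
  App :: Tm n -> Tm n -> Tm n
deriving instance Eq (Tm n)
deriving instance Show (Tm n)

type Env m n = Fin m -> Tm n

-- The empty environment: there is no index to look up.
zeroE :: Env Z n
zeroE x = case x of {}

infixr 5 .:
(.:) :: Tm n -> Env m n -> Env (S m) n
t .: env = \x -> case x of
  FZ    -> t
  FS x' -> env x'

idE :: Env n n
idE = Var

shift :: Env n (S n)
shift = Var . FS

idTm, constTm :: Tm Z
idTm    = Lam (Var FZ)              -- λx. x
constTm = Lam (Lam (Var (FS FZ)))   -- λx. λy. x

-- Index 0 maps to idTm, index 1 maps to constTm.
twoE :: Env (S (S Z)) Z
twoE = idTm .: constTm .: zeroE
```

Here twoE FZ is idTm and twoE (FS FZ) is constTm, while shift FZ is Var (FS FZ): every index moves up by one and index 0 is left unused in the result.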
5. Applying a Substitution
applyE traverses a term, replacing variables via the
environment and lifting under each binder:
-- | Apply a substitution environment to a term
applyE :: Env m n -> Tm m -> Tm n
applyE env (Var x) = env x
applyE env (Lam (Bind1 b)) = Lam (Bind1 (applyE (up env) b))
applyE _ Unit = Unit
applyE env (Pair a b) = Pair (applyE env a) (applyE env b)
applyE env (Inj i t) = Inj i (applyE env t)
applyE env (App f a) = App (applyE env f) (applyE env a)
applyE env (MatchUnit a b) = MatchUnit (applyE env a) (applyE env b)
applyE env (MatchPair a (Bind2 b)) =
  MatchPair (applyE env a) (Bind2 (applyE (up (up env)) b))
applyE env (MatchSum a (Bind1 b1) (Bind1 b2)) =
  MatchSum (applyE env a)
    (Bind1 (applyE (up env) b1))
    (Bind1 (applyE (up env) b2))
-- | Lift an environment under one binder
up :: Env m n -> Env (S m) (S n)
up env = Var FZ .: (applyE shift . env)
The key cases are the binders. For Lam (Bind1 b), the
body lives in scope S m, so we apply
up env :: Env (S m) (S n). For
MatchPair (Bind2 b), which binds two variables, we apply
up (up env).
In the up environment, the new outermost variable maps
to itself; every other variable x maps to
env x weakened into the larger scope.
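To see how up protects bound variables, here is a sketch on a cut-down language (only Var, Lam, and App, with binder bodies stored directly instead of through Bind1, which is a simplification for this example). The terms open, weakened, and closed are hypothetical names.

```haskell
{-# LANGUAGE DataKinds, EmptyCase, GADTs, KindSignatures, StandaloneDeriving #-}

data Nat = Z | S Nat

data Fin (n :: Nat) where
  FZ :: Fin (S n)
  FS :: Fin n -> Fin (S n)
deriving instance Eq (Fin n)
deriving instance Show (Fin n)

data Tm (n :: Nat) where
  Var :: Fin n -> Tm n
  Lam :: Tm (S n) -> Tm n
  App :: Tm n -> Tm n -> Tm n
deriving instance Eq (Tm n)
deriving instance Show (Tm n)

type Env m n = Fin m -> Tm n

zeroE :: Env Z n
zeroE x = case x of {}

infixr 5 .:
(.:) :: Tm n -> Env m n -> Env (S m) n
t .: env = \x -> case x of
  FZ    -> t
  FS x' -> env x'

shift :: Env n (S n)
shift = Var . FS

applyE :: Env m n -> Tm m -> Tm n
applyE env (Var x)   = env x
applyE env (Lam b)   = Lam (applyE (up env) b)
applyE env (App f a) = App (applyE env f) (applyE env a)

-- Lift an environment under one binder.
up :: Env m n -> Env (S m) (S n)
up env = Var FZ .: (applyE shift . env)

-- λy. y x, where x is the single free variable (index 1 under the binder).
open :: Tm (S Z)
open = Lam (App (Var FZ) (Var (FS FZ)))

-- Weakening: the bound y stays at index 0, the free x moves from 1 to 2.
weakened :: Tm (S (S Z))
weakened = applyE shift open
-- Lam (App (Var FZ) (Var (FS (FS FZ))))

-- Substituting the closed term λz. z for x yields a closed term.
closed :: Tm Z
closed = applyE (Lam (Var FZ) .: zeroE) open
-- Lam (App (Var FZ) (Lam (Var FZ)))
```

In both results the occurrence of the bound variable is still Var FZ, because up maps the newly bound index to itself and only the remaining indices go through the original environment.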
Note that applyE and up are mutually
recursive. In a proof assistant like Agda or Rocq, this mutual recursion
through non-structural calls makes termination checking difficult. The
standard fix is to first define a renaming function (mapping
variables to variables, not terms), use that for shifting, and then
define substitution separately. We sidestep this here since Haskell has
no termination checker.
Finally, let’s define a helper function for instantiating binders. To
evaluate (λx. body) arg, we need to
instantiate the binder: substitute arg for
FZ in body. An environment that does this is
arg .: idE, which maps FZ → arg and
FS x → Var x:
instantiate1 :: Bind1 n -> Tm n -> Tm n
instantiate1 (Bind1 body) t = applyE (t .: idE) body
For double binders we supply two terms:
instantiate2 :: Bind2 n -> Tm n -> Tm n -> Tm n
instantiate2 (Bind2 body) t1 t2 = applyE (t1 .: t2 .: idE) body
6. Working with well-scoped terms
Now let’s write a function that uses our well-scoped representation!
The end of the file includes a substitution-based evaluator for
closed terms. This evaluator could encounter a runtime type error; for
simplicity we return Nothing in that case.
eval :: Tm Z -> Maybe (Tm Z)
The type of the evaluator also reflects that we only evaluate
closed terms. The argument must type check in a scope that
contains no variables. As a result, the variable case is impossible.
Because Fin Z is uninhabited, there is no x to match on,
and the empty pattern match is vacuously exhaustive:
eval (Var x) = case x of {}
In other cases, we use instantiate1 and
instantiate2 to substitute for the arguments in binders.
Observe the order of instantiation when working with pairs. As stated
above, inside a Bind2, FZ is the second-bound
variable and FS FZ is the first-bound variable. So the
second component of the pair has index 0 and the first component has
index 1.
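The ordering can be checked on a small example. The sketch below assumes a cut-down language with only Var, Unit, Inj, Pair, and MatchPair. Its eval cases for Unit, Inj, and Pair (evaluate the components) are an assumption of this sketch, since those cases are not shown in this text. The names left, right, fstExample, and sndExample are hypothetical.

```haskell
{-# LANGUAGE DataKinds, EmptyCase, GADTs, KindSignatures, StandaloneDeriving #-}

data Nat = Z | S Nat

data Fin (n :: Nat) where
  FZ :: Fin (S n)
  FS :: Fin n -> Fin (S n)
deriving instance Eq (Fin n)
deriving instance Show (Fin n)

data Tm (n :: Nat) where
  Var       :: Fin n -> Tm n
  Unit      :: Tm n
  Inj       :: Int -> Tm n -> Tm n
  Pair      :: Tm n -> Tm n -> Tm n
  MatchPair :: Tm n -> Bind2 n -> Tm n
deriving instance Eq (Tm n)
deriving instance Show (Tm n)

data Bind2 (n :: Nat) where
  Bind2 :: Tm (S (S n)) -> Bind2 n
deriving instance Eq (Bind2 n)
deriving instance Show (Bind2 n)

type Env m n = Fin m -> Tm n

infixr 5 .:
(.:) :: Tm n -> Env m n -> Env (S m) n
t .: env = \x -> case x of
  FZ    -> t
  FS x' -> env x'

idE :: Env n n
idE = Var

shift :: Env n (S n)
shift = Var . FS

up :: Env m n -> Env (S m) (S n)
up env = Var FZ .: (applyE shift . env)

applyE :: Env m n -> Tm m -> Tm n
applyE env (Var x)    = env x
applyE _   Unit       = Unit
applyE env (Inj i t)  = Inj i (applyE env t)
applyE env (Pair s t) = Pair (applyE env s) (applyE env t)
applyE env (MatchPair s (Bind2 t)) =
  MatchPair (applyE env s) (Bind2 (applyE (up (up env)) t))

instantiate2 :: Bind2 n -> Tm n -> Tm n -> Tm n
instantiate2 (Bind2 body) t1 t2 = applyE (t1 .: t2 .: idE) body

eval :: Tm Z -> Maybe (Tm Z)
eval (Var x)         = case x of {}
eval Unit            = Just Unit                   -- assumed value cases
eval (Inj i t)       = Inj i <$> eval t
eval (Pair s t)      = Pair <$> eval s <*> eval t
eval (MatchPair e m) = do
  v <- eval e
  case v of
    Pair v1 v2 -> eval (instantiate2 m v2 v1)      -- second component gets index 0
    _          -> Nothing

left, right :: Tm Z
left  = Inj 0 Unit
right = Inj 1 Unit

-- match (left, right) with (x, y) → x : x is first-bound, so it is index 1.
fstExample :: Tm Z
fstExample = MatchPair (Pair left right) (Bind2 (Var (FS FZ)))

-- match (left, right) with (x, y) → y : y is second-bound, so it is index 0.
sndExample :: Tm Z
sndExample = MatchPair (Pair left right) (Bind2 (Var FZ))
```

Evaluating fstExample gives Just left and evaluating sndExample gives Just right, which matches the convention that FS FZ names the first component and FZ the second.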
Function application:
eval (App m n) = do
  mv <- eval m
  nv <- eval n
  case mv of
    Lam b -> eval (instantiate1 b nv)
    _     -> Nothing
Pair elimination:
eval (MatchPair e m) = do
  v <- eval e
  case v of
    Pair v1 v2 -> eval (instantiate2 m v2 v1)
    _          -> Nothing
7. Using rebound
Some of the infrastructure in Scratch.hs is entirely
generic: it doesn’t depend on the specific constructors of
Tm. The rebound library packages this
machinery so you do not have to rewrite it for each new language.
The file Tutorial.Scoped.Syntax shows the updated
version. The differences are:
- Import the relevant modules from rebound
- Provide the type parameter to Bind1/Bind2 in the Tm definition
- Derive Generic1 for Tm
- Give instances of SubstVar Tm and Subst Tm Tm (the Generic1 machinery fills in applyE automatically)
Historical Notes
Named variables and capture-avoiding substitution. The difficulty of substitution under binders — and the need for alpha-renaming — was understood by Church (1941) and Kleene (1952). The Barendregt convention (assume all bound variables are distinct from free variables) is a paper-level workaround; implementing it correctly in software is error-prone and was a recognized problem in early theorem provers and language implementations.
De Bruijn indices. Introduced by Nicolaas de Bruijn in “Lambda Calculus Notation with Nameless Dummies” (1972) specifically to give a canonical representation of lambda terms in which alpha-equivalent terms are syntactically identical. The name “de Bruijn index” (counting binders outward from the use site) is the convention adopted here; de Bruijn himself used the opposite convention in some papers. An alternative is de Bruijn levels, which count inward from the outermost binder.
Parallel substitution. The use of an environment
Fin m -> Tm n that maps all variables simultaneously,
rather than one at a time, is sometimes called parallel
substitution or vector substitution. The version we present
here is inspired by the Autosubst tool for working with de Bruijn
indices in the Rocq proof assistant, which itself was inspired by the “Explicit
Substitutions” of Abadi, Cardelli, Curien, and Lévy (1991).
Well-scoped de Bruijn terms. Tracking the number of
free variables at the type level using a natural-number index appears in
Altenkirch and Reus, “Monadic
Presentations of Lambda Terms Using Generalized Inductive Types”
(1999), and in Bird and Paterson, “De
Bruijn Notation as a Nested Datatype” (1999). These papers inspired
Kmett’s bound
library for working with well-scoped lambda calculus terms
efficiently.
Renamings and the termination problem. The mutual
recursion between applyE and up (noted in
Section 5) is a real obstacle in proof assistants. The standard fix —
define applyRen on renamings first, use it to implement
up, then build applyE on top — is described
in, e.g., Benton, Hur, Kennedy, and McBride, “Strongly
Typed Term Representations in Coq” (2012). Exercise 4 asks you to
implement this version.
Exercises
1. Writing terms. Write the following closed terms
in Tutorial.Scoped.Scratch. Work out the de Bruijn indices
by hand before trying it in GHCi.
- ex_fst :: Tm Z, the projection λp. match p with (x, y) → x
- ex_snd :: Tm Z, the projection λp. match p with (x, y) → y
- ex_s :: Tm Z, the S combinator λf. λg. λx. f x (g x)
Pay attention to the variable ordering inside Bind2. In
MatchPair e (Bind2 body), FZ refers to the
second pair component and FS FZ to the first.
2. Weakening. Consider the following function:
weaken :: Tm n -> Tm (S n)
weaken = applyE shift
This embeds a term from scope n into the larger scope
S n, leaving the new variable FZ unused.
Verify in GHCi that weaken ex_id :: Tm (S Z) type-checks
and that instantiate1 (Bind1 (weaken e)) t returns
e unchanged for any e and t.
Why does applyE shift have the right type? Trace through
the definition of shift and applyE to convince
yourself.
3. Adding let. Extend the language in
Tutorial.Scoped.Scratch with a
let-expression:
Let :: Tm n -> Bind1 n -> Tm n
Let e b binds the value of e in the body
b. You will need to:
1. Add the constructor to Tm.
2. Add a case to applyE. What environment do you pass into the binder?
3. Add a case to eval. Here let x = e in b is a by-value let, so evaluate e first, then instantiate b.
Check: is Let e b equivalent to
App (Lam b) e? Do both evaluate the same way?
4. Renamings. A renaming is an environment that maps variables to variables rather than terms:
type Ren m n = Fin m -> Fin n
Define applyRen :: Ren m n -> Tm m -> Tm n by
structural recursion, analogous to applyE. You will also
need upRen :: Ren m n -> Ren (S m) (S n).
Now observe: every renaming r :: Ren m n induces a
substitution Var . r :: Env m n, and
applyRen r = applyE (Var . r).
The reason proof assistants define applyRen separately
is that applyRen is structurally recursive (it
produces variables, not terms), so it can be used to define
up without the mutual recursion that troubles Agda and
Rocq.
5. Alternative representation of Env.
data Vec n a where
  VNil  :: Vec Z a
  (:::) :: a -> Vec n a -> Vec (S n) a
Can you implement substitution environments as length-indexed lists?
type Env m n = Vec m (Tm n)
First, define the lookup operation for accessing the definition of a particular index.
applyEnv :: Env m n -> Fin m -> Tm n
Now consider the other operations on environments. The implementation
of zeroE and (.:) is straightforward. But what
about the other operations, such as idE,
shift, and the composition of two environments?