Porting Lean to the ESP32-C3 RISC-V microcontroller July 31, 2024 on György Kurucz's blog

The goal of this project is to cross-compile Lean programs to the ESP32-C3. (You can find all the code in this repository.)

Lean is a dependently typed pure functional programming language and a proof assistant, though in this article I will mostly be using it just as a programming language. Its compiler can either emit LLVM IR or C code. I will be using the C output for simplicity, but using the LLVM IR instead should just involve setting up the right compiler flags.

The ESP32-C3 is a RISC-V microcontroller running at up to 160 MHz. Its main selling point is its integrated Wi-Fi peripheral, but I will not be using it for this project. It has 384 KiB of internal RAM, and I will be fitting everything into this including the program code, the heap and the stack. (Running code directly from flash memory would also be possible in order to save RAM, but it’s potentially slower than running from RAM.)

What’s the motivation behind this project? While the Lean compiler is pretty good, Lean programs will never be as fast or as small as programs written in C. In an environment where every byte and every CPU clock cycle matters, wouldn’t it make sense to write everything in C?

As far as microcontrollers go, the ESP32-C3 is actually a pretty beefy machine. 160 MHz and 384 KiB of RAM might seem tiny compared to the 3+ GHz and 8+ GB of RAM in your usual desktop machine, but it is significantly more than what e.g. the Commodore 64 or the original Game Boy had. Here is a table comparing the ESP32-C3 to some historical computers:

| | CPU | RAM |
|---|---|---|
| Apollo Guidance Computer (1966) | 16 bit, 2 MHz | 2 KiB |
| Commodore 64 (1982) | 8 bit, 1 MHz | 64 KiB |
| Game Boy (1989) | 8 bit, 4 MHz | 8 KiB + 8 KiB VRAM |
| ESP32-C3 (2021) | 32 bit, 160 MHz | 384 KiB |

Additionally, the success of projects like NodeMCU, MicroPython, or Espruino shows that running higher level languages on microcontrollers is very much feasible, and the developer effort saved by doing so can be worth it.

With that said, I will jump into describing the whole stack that goes into getting Lean running on the ESP32-C3. You can use the table of contents below to jump to a particular section that interests you; I would say ~80% of any section should still be understandable without having read the preceding ones.

Table Of Contents

Lean runtime

In this section I will describe how Lean represents objects in memory, and how it uses reference counting to manage them. Note that I will assume a 32 bit architecture (as the ESP32-C3 has a 32 bit core), though Lean is more commonly used on 64 bit platforms.

Object representation

Lean objects are allocated on the heap, and referenced by a pointer. They consist of a header containing a reference count and some other information, followed by some arbitrary contents:

Unfortunately, this can be quite inefficient for small objects. For example, it does not make sense to allocate small integers or booleans on the heap. As an optimization, certain small objects are stored directly as a value representing themselves (instead of a pointer pointing to them). For example, the number 13 would be represented directly by the constant 0x1b:

But wait, how did 13 turn into 27? It turns out that in order to distinguish pointers from these so-called scalar values, the LSB (least significant bit) needs to be set to 1. The actual value is shifted left by one, so that the value n is encoded as (n << 1) | 1. This is how the value 13 is encoded:

Note that the LSB of pointers is guaranteed to be 0, since the objects in the heap are always aligned to 4 bytes. This way, scalars and pointers can be clearly distinguished based on their LSB.
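In C, the encoding and the scalar check look roughly like this (a sketch mirroring the `lean_box`, `lean_unbox`, and `lean_is_scalar` helpers from `lean.h`):

```c
#include <stdint.h>

// Encode a small integer as a tagged scalar: shift left by one, set the LSB.
static uintptr_t box(uintptr_t n)       { return (n << 1) | 1; }
// Decode: shift the tag bit back out.
static uintptr_t unbox(uintptr_t v)     { return v >> 1; }
// Pointers to heap objects are 4-byte aligned, so their LSB is always 0.
static int       is_scalar(uintptr_t v) { return (v & 1) == 1; }
```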

Memory management

Most managed languages nowadays use some sort of tracing garbage collector to automatically clean up unused objects. Lean uses reference counting instead. The header of each heap allocated object contains a reference count. It gets incremented if there is a new reference to the object, and decremented when a reference is no longer needed. When the reference count reaches zero, the object is freed.
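A minimal sketch of the mechanism (the field names are illustrative, not Lean's actual header layout):

```c
#include <stdlib.h>

// Every heap object carries a reference count in its header.
typedef struct { int rc; /* ... tag, other header bits, payload ... */ } obj;

static obj *obj_new(void) {
    obj *o = malloc(sizeof(obj));
    o->rc = 1; // the creator holds the first reference
    return o;
}
static void obj_inc(obj *o) { o->rc++; }
static int  obj_dec(obj *o) { // returns 1 if the object was freed
    if (--o->rc == 0) { free(o); return 1; }
    return 0;
}
```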

Lean opts to use reference counting for two major reasons. First, a big weakness of reference counting, the handling of cycles, is not an issue, since due to immutability it is guaranteed that the object graph is acyclic.

Second, it allows for an optimization that is particularly useful for languages where all values are immutable. Take the following Lean code:

def rep (n : Nat) (v : α) :=
  match n with
  | 0 => #[]
  | n + 1 => Array.push (rep n v) v

This returns an array containing the value v n times. But since the array is immutable, it is created and destroyed n times, always with one extra element.

As a special optimization, when the Array.push builtin function receives an argument with a reference count of exactly 1, it can mutate the existing array instead of having to create a new one. Since the reference count guarantees that this mutation cannot be observed by any other part of the code, this does not violate functional purity.
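Here is an illustrative C sketch of the idea (not Lean's actual `lean_array_push` implementation): if the caller holds the only reference, the push mutates in place; otherwise it copies.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { int rc; size_t size, cap; int data[]; } array;

static array *array_mk(size_t cap) {
    array *a = malloc(sizeof(array) + cap * sizeof(int));
    a->rc = 1; a->size = 0; a->cap = cap;
    return a;
}

// Consumes the caller's reference to `a`, returns the pushed-to array.
static array *array_push(array *a, int v) {
    if (a->rc == 1 && a->size < a->cap) {
        a->data[a->size++] = v; // sole owner: mutate in place, unobservably
        return a;
    }
    // Shared (or full): copy, then drop our reference to the original.
    array *b = array_mk(a->cap > a->size ? a->cap : 2 * a->size + 1);
    memcpy(b->data, a->data, a->size * sizeof(int));
    b->size = a->size;
    b->data[b->size++] = v;
    if (--a->rc == 0) free(a);
    return b;
}
```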

You can read more about this specific optimization, and much more, in “Counting Immutable Beans: Reference counting optimized for purely functional programming” by Ullrich and de Moura [1].

Getting the runtime to compile & run on the ESP32-C3

The runtime includes its own special allocator to speed up the allocation of small objects. It allocates 8 MiB segments from the system to then suballocate the objects. This is many times larger than the whole 384 KiB of RAM we have, so it definitely won’t cut it. The easiest solution is to just disable this special allocator, and rely on the malloc implementation being fast enough.

The next challenge is that the runtime is written in C++. If possible I would like to avoid having to deal with exceptions and vtables and other C++ features requiring special runtime considerations compared to C.

I went on a quest to slim down the runtime, and removed as many unnecessary things as possible:

What remains is mostly the core functionality used by compiled Lean programs:

Toolchain: compiler and libraries

Overview

I wanted to have full control over the whole toolchain to be able to customize things if necessary, so I decided to build the standard C and C++ libraries for myself. I compiled everything with the -ffunction-sections flag, which allows garbage collection of individual functions at link time, reducing code size. The full tech-stack ended up being the following:

The Nix package manager

If you ever tried to put together some software project with multiple customized dependencies that need to be compiled, you know how painful it can be. And even if you manage to get it working, and document how you did it, the chances of someone else getting it working on the first try (or even you yourself on another computer) is basically zero.

You did not forget to record all the custom build system flags you used, right? Also project X needs a one line modification to get it to compile in your environment, don’t forget about that. Also project Y needs an older version of dependency Z, so you have to temporarily downgrade it while compiling that one project, you recorded the exact version needed, right?

And even if you documented every step perfectly, some project is bound to depend on something in your environment you did not notice, but someone else might not have installed.

Also if you want to take your project and continue working on it let’s say 1 or 2 years later, chances are that something will have gotten broken by then. Even if you recorded the versions of ALL packages installed on your system, reproducing the exact same environment is impossible on most operating systems.

Nix is a package manager that builds packages in a reproducible manner. Every build is isolated from the rest of the world and only supplied with dependencies that were explicitly declared, so that it’s not possible to forget declaring a dependency. Every external input (e.g. source code downloaded from a git repository) must be pinned by its hash so that it cannot change implicitly.

If a nix package works today, it is pretty much guaranteed to continue working many years into the future without any changes.

Nix also makes it very easy to define one’s own packages using its package definition DSL called the nix language.

Because of these aforementioned advantages, I decided to use Nix for this project from the very start. This guarantees that anyone (either you, or future me many years from now) can just clone the project repository, and run nix build to compile the exact versions of everything I used for the project, including the final binary that runs on the ESP32-C3.

Targeting the ESP32-C3 with nixpkgs

The main package collection for Nix is called nixpkgs, which I will be using as a base to build my packages on.

If the description above made it seem like Nix is a magic bullet and everything is perfect and beautiful in the Nix ecosystem, I have some bad news: building software for embedded targets is not a very common (or commonly tested) use-case of nixpkgs, and some hacks will be needed. But importantly, these are reproducible hacks! Once they are working, we can just shove them into a git repository and forget about them; they will keep working no matter how ugly they are.

With all that said, let’s see what it takes to target an embedded system with nixpkgs. nixpkgs has an stdenv (standard environment) for building packages that includes common dependencies like a C compiler and a libc. I want to instantiate a package set with an stdenv that comes with clang, picolibc, libc++, and preconfigured compiler flags which are needed to compile code for the ESP32-C3.

The most important compiler flag is the exact target processor architecture, which is configured as rv32imczicsr.1

The package definition for picolibc is fairly straightforward. To get libc++ to compile, we have to apply some patches to nixpkgs: this consists of turning on some options that stock nixpkgs only enables for WebAssembly (I guess embedded environments and WebAssembly have quite a bit in common), and adding some extra flags on top to turn off some more libc++ features (like locale support, which we definitely don’t need).

Building the customized Lean runtime is moderately involved; the solution is to graft it onto the original Lean source tree so that we can use the build system that’s already in place.

Lean likes to use its own C compiler wrapper to apply some extra flags, and it takes quite a bit of work to convince it that this is really rather not a good idea in this situation. (It tries to do things like link with libstdc++, the GNU implementation of the C++ standard library, which we don’t have, or turn on threading support.)

Compiling Lean’s Init library is fairly easy compared to everything else, though I don’t think compiling it separately from the rest of Lean is an intended use-case.

Finally, building the example app itself is not too hard either (though it still does take a moderately sized list of carefully crafted compiler flags), since all the hard work has already been done by setting up the stdenv and Lean.

Hacking on the project yourself

The project also contains a QEMU version compiled from Espressif’s repository, with support for the ESP32-C3, so you don’t even need a real device to try it out. Just run nix run .#qemu in the project root. You should see a message printed by the main function written in Lean.

For further information, see the README file.

Booting and application initialization

Memory layout

Here is how I decided to lay out the 384 KiB of RAM:

I decided to place the stack at the start so that a stack overflow can’t corrupt anything else (the stack grows downwards). I am not sure why placing it at the end of the RAM is the common practice, since that way it could silently corrupt the heap.

The code section (usually called .text by the compiler) is pretty self-explanatory; the example application I will be showing later weighs in at around 64 KiB. One notable thing about the code section is that it needs to be accessed through a different address space, in somewhat of a Harvard architecture style. On the ESP32-C3 the range 0x40380000-0x403dffff is used to access the RAM for code execution, while 0x3fc80000-0x3fcdffff is used for regular data access. (So the above code section starts at 0x40388000 in this mapping.) Some linker script trickery is needed to make sure that all the addresses in the linked code are correct.
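The two views of the same SRAM differ by a fixed offset, so converting between them (as the linker script effectively has to) is simple arithmetic:

```c
#include <stdint.h>

// The same physical SRAM is visible at two base addresses on the ESP32-C3:
// through the data bus at 0x3fc80000, and through the instruction bus at
// 0x40380000. Converting between the two views is a constant offset.
#define DRAM_BASE 0x3fc80000u
#define IRAM_BASE 0x40380000u

static uint32_t dram_to_iram(uint32_t addr) { return addr - DRAM_BASE + IRAM_BASE; }
static uint32_t iram_to_dram(uint32_t addr) { return addr - IRAM_BASE + DRAM_BASE; }
```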

The data section contains read only constants (such as strings) and global variables with a defined value.

The bss section contains zero-initialized global variables. (The name “bss” historically stands for “Block Started by Symbol”.) Forgetting to zero this actually caused me to spend a day debugging seemingly unrelated malloc issues, which only manifested on real hardware and not in QEMU. It turns out that malloc stores the head of the free list in a global variable which it expects to be NULL by default. In QEMU all RAM is zeroed by default, but on real hardware it can be full of random junk values. Adding memset(&_bss_start, 0, (&_bss_end - &_bss_start)); to zero the bss section fixed the issue. Oops.

Finally, the remaining space is designated as the heap. All we have to do in the linker script is to specify the __heap_start and __heap_end symbols, and from there picolibc’s malloc implementation knows that it can allocate memory in this range. In my example application around 284 KiB is left for the heap, which is plenty.
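Conceptually, malloc obtains memory from this range through an sbrk-style bump pointer. Here is a self-contained sketch (using a static array in place of the real __heap_start/__heap_end region, so it can run anywhere):

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of an sbrk-style bump allocator over a fixed [start, end) range,
// which is roughly what the libc's malloc does with the __heap_start /
// __heap_end symbols from the linker script. The "heap" here is a static
// array purely so the example is self-contained.
static uint8_t heap[1024];
static uint8_t *heap_end = heap; // the current program break

static void *my_sbrk(ptrdiff_t incr) {
    if (incr > (ptrdiff_t)(heap + sizeof(heap) - heap_end))
        return (void *)-1; // out of memory
    uint8_t *prev = heap_end;
    heap_end += incr;
    return prev; // the newly available region starts here
}
```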

To run our program, the code and data sections need to be copied to their specific addresses, and then execution needs to begin from a specific entry address. This can be achieved by packaging our application in the firmware image format expected by the ROM boot loader.

Booting

Booting on the ESP32-C3 happens through the following steps:

  1. On CPU Reset, the program counter is set to 0x40000000, and the CPU starts executing.

  2. At 0x40000000 the chip’s internal ROM is located. The values on various GPIO pins can affect what happens, but by default, the ROM code loads the application code from the external SPI flash memory to RAM. The application in the specific firmware image format is expected to be located at the beginning of the flash.

    The firmware image contains a number of segments to be loaded to specific addresses in RAM. The ROM code goes through them, and copies each segment to RAM.

  3. Finally, the firmware image also contains an entry point address. The ROM code jumps to it, and from here we have full responsibility over execution.
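The loading procedure in steps 2 and 3 can be sketched as follows. Note that this is a simplified, illustrative layout: the real ESP32-C3 image header has more fields (flash parameters, a checksum, an extended header), so treat the exact offsets here as assumptions.

```c
#include <stdint.h>
#include <string.h>

// Simplified firmware image: a magic byte (0xE9 for ESP images), a segment
// count, a 4-byte entry address, then per segment a 4-byte load address,
// a 4-byte length, and the segment data itself.
typedef struct { uint8_t magic; uint8_t segments; uint32_t entry; } img_header;

// Copies each segment into `ram` (simulating RAM based at address 0 for the
// example) and returns the entry point, or 0 on a bad magic byte.
static uint32_t load_image(const uint8_t *img, uint8_t *ram) {
    img_header h;
    h.magic = img[0];
    h.segments = img[1];
    memcpy(&h.entry, img + 2, 4); // memcpy: the field may be unaligned
    if (h.magic != 0xE9) return 0;
    const uint8_t *p = img + 6;
    for (uint8_t i = 0; i < h.segments; ++i) {
        uint32_t addr, len;
        memcpy(&addr, p, 4);
        memcpy(&len, p + 4, 4);
        p += 8;
        memcpy(ram + addr, p, len); // copy the segment to its load address
        p += len;
    }
    return h.entry;
}
```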

Note that unfortunately the ROM code is not open source, but parts of it have been reverse engineered, for example for security analysis. The ROM and the Wi-Fi MAC are the only non-open source software parts in the ESP32 ecosystem that I am aware of.

The entry point of my application is named call_start_cpu0 (this could be any name, it’s specified in the linker script). Here is the entry point function (we could also be using picolibc’s crt0 instead, but I was too lazy to set it up):

void __attribute__((noreturn)) call_start_cpu0() {
  __asm__("la sp, _stack_end;");
  memset(&_bss_start, 0, (&_bss_end - &_bss_start));
  disable_wdt();
  main(0, 0);
  while (true)
    ;
}

The very first thing we do is load the stack pointer to the sp register so that we can call functions and have local variables, pretty important things for most programs. Next, we zero the bss section, which as I already mentioned, is pretty important. We also disable the default watchdog timers. Next, we call the main function, which is the entry point generated by the Lean compiler. In case main were to return, we enter a busy loop. (Trying to return here would probably lead to a stack underflow, followed by the CPU going off trying to execute some random code.)

Lean startup

The main function emitted by the Lean compiler looks something like this:

int main(int argc, char **argv) {
  lean_object *in;
  lean_object *res;
  lean_initialize_runtime_module();
  lean_set_panic_messages(false);
  res = initialize_Main(1 /* builtin */, lean_io_mk_world());
  lean_set_panic_messages(true);
  lean_io_mark_end_initialization();
  if (lean_io_result_is_ok(res)) {
    lean_dec_ref(res);
    lean_init_task_manager();
    res = _lean_main(lean_io_mk_world());
  }
  lean_finalize_task_manager();
  if (lean_io_result_is_ok(res)) {
    int ret = 0;
    lean_dec_ref(res);
    return ret;
  } else {
    lean_io_result_show_error(res);
    lean_dec_ref(res);
    return 1;
  }
}

Notable parts are:

Example application: RGB LED controller

The development board I have has a WS2812B RGB LED, so making it light up in various patterns was a pretty obvious choice for an example application.

First, I define an interface for arbitrary lighting patterns:

structure LightPattern where
  State : Type
  init : State
  step : State → State
  color : State → Color

And yes, all the fields after State can refer to its value (a type) to define their own types, this is what dependent typing is all about.

To make this a little less abstract, let’s define a simple blinking pattern with an arbitrary color:

def LightPattern.blink (c : Color) : LightPattern := {
  State := Bool
  init := true
  step := Bool.not
  color := (if · then c else default)
}

State is a Bool, as I already mentioned previously. step is simply the boolean negation. color will be c when the state is true, and off otherwise. (default is defined as { red := 0, green := 0, blue := 0 }.) The syntax for defining color might be a little confusing, the · shorthand is used to define the function, which is equivalent to fun s => if s then c else default.

Let’s define a more advanced pattern, which cycles through the three color channels:

inductive ColorChannel := | Red | Green | Blue
def ColorChannel.toColor : ColorChannel → Color
| Red => { red := 0xFF, green := 0, blue := 0 }
| Green => { red := 0, green := 0xFF, blue := 0 }
| Blue => { red := 0, green := 0, blue := 0xFF }

def LightPattern.cycle : LightPattern := {
  State := ColorChannel
  init := .Red
  step := fun
    | .Red => .Green
    | .Green => .Blue
    | .Blue => .Red
  color := ColorChannel.toColor
}

Next, let’s define a “light pattern transformer”, that takes an existing pattern and modifies it. In particular I want to be able to tune the speed of these patterns. I will be running the update loop at around 1000 Hz, which would be much too fast to see the blinking pattern for instance. The following pattern will only run the step function of the underlying pattern p every n steps:

def LightPattern.every (n : UInt16) (p : LightPattern) : LightPattern := {
  State := UInt16 × p.State,
  init := (0, p.init),
  step := fun (i, s) => if i == n then (0, p.step s) else (i + 1, s),
  color := fun (_, s) => p.color s
}

Let’s unpack this:

Finally, I wanted to also have a candle-like flicker effect. First, we need a (pseudo) random number generator for this (source):

abbrev Prng.State := UInt32
def Prng.step (state : Prng.State) : Prng.State := state + 0x9E3779B9
def Prng.get (z : Prng.State) : UInt32 :=
  let z := z ^^^ (z >>> 16)
  let z := z * 0x21F0AAAD
  let z := z ^^^ (z >>> 15)
  let z := z * 0x735A2D97
  z ^^^ (z >>> 15)
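For illustration, here is the same generator transcribed to C: a counter stepped by the golden-ratio constant, with a splitmix-style mixing function on top.

```c
#include <stdint.h>

// Step: advance the counter by the 32-bit golden ratio constant.
static uint32_t prng_step(uint32_t state) { return state + 0x9E3779B9u; }

// Get: scramble the counter with xor-shifts and odd multiplications.
// Every stage is a bijection on uint32_t, so distinct states give
// distinct outputs.
static uint32_t prng_get(uint32_t z) {
    z ^= z >> 16;
    z *= 0x21F0AAADu;
    z ^= z >> 15;
    z *= 0x735A2D97u;
    return z ^ (z >> 15);
}
```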

The only state of our candle will be the 32 bit PRNG state. (Unfortunately, this will be allocated on the heap since it’s a 32 bit value.) Here is the code for the candle:

def LightPattern.flicker : LightPattern := {
  State := Prng.State
  init := 0
  step := Prng.step
  color := fun state =>
    let random := Prng.get state
    let brightness := 0x80 + (random &&& 0x7F)
    {
      red := brightness.toUInt8,
      green := 150 * brightness / 255 |>.toUInt8,
      blue := 50 * brightness / 255 |>.toUInt8,
    }
}

All the candle-specific logic happens in the color function. We get the current value of the PRNG, calculate a brightness between 128 and 255 from it, and then mix up a warm white looking color while being very careful to avoid floating point math. (Floating point has some issue I haven’t looked into yet and produces bad values.)

Next, let’s define our application. We want to dynamically switch between patterns from a predefined array, which we parameterize our app over:

structure AppState (patterns : Array LightPattern) where
  patternIndex : Fin patterns.size
  state : (patterns.get patternIndex).State

patternIndex is the index of the currently playing light pattern, while state is the current state of the current light pattern, and its type depends on the current patternIndex.

Here is the rest of the app implementation, understanding it is an exercise left to the reader:

variable (patterns : Array LightPattern)

/-- Switch to the next pattern. -/
def AppState.nextPattern (s : AppState patterns) : AppState patterns :=
  let patternIndex := s.patternIndex + Fin.ofNat' 1 (
    match e : patterns.size with
    | 0 => Fin.elim0 (e ▸ s.patternIndex)
    | _ + 1 => (by omega))
  {
    patternIndex,
    state := (patterns.get patternIndex).init,
  }

/-- Update the state of the current pattern. -/
def AppState.step (s : AppState patterns) : AppState patterns :=
  { s with state := (patterns.get s.patternIndex).step s.state }

/-- Get the current output color. -/
def AppState.color (s : AppState patterns) : Color :=
  (patterns.get s.patternIndex).color s.state

/-- Initialize a new app state. -/
def AppState.init (hPatterns : patterns.size > 0) : AppState patterns :=
  let patternIndex := Fin.ofNat' 0 hPatterns
  {
    patternIndex,
    state := (patterns.get patternIndex).init,
  }

Next, we define our array of light patterns (assuming a 1000 Hz update frequency):

def patterns : Array LightPattern := #[
  every 500 (blink white),
  every 100 (blink white),
  every 500 (blink red),
  every 100 (blink red),
  every 500 (blink green),
  every 100 (blink green),
  every 500 (blink blue),
  every 100 (blink blue),

  every 500 cycle,
  every 100 cycle,

  every 50 flicker,

  breathe 1000 .Red,
  breathe 1000 .Green,
  breathe 1000 .Blue
]

And finally, we are ready to define our main loop:

def main : IO Unit := do
  let mut appState : AppState patterns
    := AppState.init patterns (by simp [patterns])
  let mut buttonPressedPrev := false
  while true do
    -- Read the button state, advance to the next pattern when pressed
    let buttonPressed := !(← readGpio 9)
    if !buttonPressedPrev && buttonPressed then
      appState := appState.nextPattern
    buttonPressedPrev := buttonPressed

    -- Update the pattern state and output the current color to the LED
    appState := appState.step
    setLedState appState.color

    -- TODO:
    -- Use proper timer instead of just a wait, so the frequency does not
    -- depend on the time it takes to do the update.
    delayUs 1000

Wait, didn’t I say that Lean was a pure language? What are these mutable variables and while loop doing here? As it turns out it’s all syntax sugar, and a pretty advanced one at that. All the “impure” features are compiled down to pure monadic code by Lean. You can learn more by reading the relevant chapter of the Lean manual.

readGpio, setLedState, and delayUs are implemented in C:

@[extern "c_read_gpio"]
opaque readGpio : Fin 22 → IO Bool
@[extern "c_set_led_state"]
opaque setLedState : (@& Color) → IO Unit
@[extern "c_delay_us"]
opaque delayUs : Nat → IO Unit

For example, here is what the c_set_led_state function looks like:

lean_object *c_set_led_state(lean_object *state, lean_object *w) {
  uint32_t data = lean_ctor_get_uint8(state, 0) << 16 |
                  lean_ctor_get_uint8(state, 1) << 8 |
                  lean_ctor_get_uint8(state, 2);
  send_led_data(data);
  return lean_io_result_mk_ok(lean_box(0));
}

See the appendix below for how the send_led_data function works.

Results

Here are some clips of some LED patterns (in particular every 100 cycle, breathe 1000 .Blue, and every 50 flicker) on my Waveshare ESP32-C3-Zero development board:

Conclusions

Overall I am pleased with the results; beforehand I wasn’t even quite sure if running Lean in such a constrained environment would be possible. For now this is still very much just a proof-of-concept, though in the future I might try to build something actually useful with an ESP32 and Lean. Here is a list of ideas for how this project could be taken further:

Finally, if you are interested in any of the topics mentioned in this article, here are some useful resources:

Appendix: Bit banging the WS2812 protocol

The WS2812 family of addressable LEDs need to be driven with very particular timing constraints. A 1 is signaled with the data line being high for 800 ns and low for 450 ns, while a 0 is signaled by it being high for 400 ns and low for 850 ns. We need to transfer 24 bits like this, 8 for each color channel.

Using a 20 MHz CPU clock, a single clock cycle is 50 ns, meaning that the transfer of a single bit needs to take exactly 25 cycles.
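To make the numbers concrete, the cycle budget works out like this:

```c
#include <stdint.h>

// At a 20 MHz clock, one cycle is 50 ns. Both WS2812 bit symbols take the
// same 1250 ns total, i.e. exactly 25 cycles per bit.
#define NS_PER_CYCLE 50u

static uint32_t cycles(uint32_t ns) { return ns / NS_PER_CYCLE; }
```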

Most instructions take 1 clock cycle on the ESP32-C3, except for branch instructions, which take 1 cycle if the branch is not taken, but take 3 if the branch is taken! (At least according to this article I found on the topic.) And be thankful that the ESP32-C3 does not have a branch predictor, so that even if the timings are somewhat complicated, they are at least deterministic.

With all that said, here is the working code that I ended up with:

__attribute__((noinline)) void send_led_data(uint32_t data) {
  const size_t gpio = 10;
  for (size_t i = 0; i < 24; ++i) {
    const uint32_t bit = data & (1 << (23 - i)); // shift + and: 2 cycles
    GPIO_OUT_W1TS_REG = 1 << gpio;               // GPIO high: 1 cycle
    if (bit) {
      // if branch not taken: 1 cycle
      __asm__("nop; nop; nop; nop; nop;"
              "nop; nop; nop; nop; nop;"); // wait: 10 cycles
      GPIO_OUT_W1TC_REG = 1 << gpio;       // GPIO low: 1 cycle
      __asm__("nop;");                     // wait: 1 cycle
      // loop variable inc + jump taken: 4 cycles
    } else {
      // if branch taken: 3 cycles
      __asm__("nop; nop;");          // wait: 2 cycles
      GPIO_OUT_W1TC_REG = 1 << gpio; // GPIO low: 1 cycle
      __asm__("nop; nop; nop; nop; nop; nop; nop; nop; nop;"); // wait: 9 cycles
      // loop variable inc + jump not taken: 2 cycles
    }
  }
}

Writing it took some effort involving constantly looking at the disassembly to see what smart ways clang managed to come up with to helpfully “optimize” my code, and then accounting for them. E.g. you might notice that the jump back to the beginning of the loop takes a different number of cycles in the two branches of the if! It turns out that clang laid out the code in such a way that the else branch of the if is placed before the beginning of the loop, so execution just falls through into the beginning of the loop again without having to jump back. This is faster, as you can see, it saved 2 clock cycles for us…

This code will probably keep working up until I upgrade to the next version of clang, and it comes up with some other way to outsmart us.

Also you might notice that the clock cycles in an iteration above only add up to 20, and not 25. I had to take out a few nops, because the LED was not working with 25 for some reason. I still don’t know what the issue is; either there are still some clock cycles that I am not accounting for, or maybe the clock is slightly slower than 20 MHz for some reason. (I would need an oscilloscope or logic analyzer to debug this, which I don’t have.)

Overall, I would rate the experience of writing this function 0/10. If I ever need to write such timing sensitive code again I will make sure to do it in assembly, because that will probably cause much fewer headaches. Optimizing compilers are awesome, except in rare cases like this, when they are working actively against you.


  1. Just rolls off the tongue, right? rv32 stands for “RISC-V 32 bit”, i is the base integer instruction set, m is the extension with multiplication and division instructions, c signifies support for compressed instruction encoding, and finally zicsr signals support for “Control and Status Register” (CSR) instructions.

  2. I use a 16 instead of a 32 bit counter since a 32 bit value would have to be allocated on the heap due to the way Lean’s object representation works, whereas a 16 bit value will be a scalar. It would be nice if Lean also had 31 bit integers like OCaml.

Bibliography

[1]
S. Ullrich and L. de Moura, “Counting immutable beans: Reference counting optimized for purely functional programming.” 2020. Available: https://arxiv.org/abs/1908.05647