The goal of this project is to cross-compile Lean programs to the ESP32-C3. (You can find all the code in this repository.)
Lean is a dependently typed pure functional programming language and a proof assistant, though in this article I will mostly be using it just as a programming language. Its compiler can either emit LLVM IR or C code. I will be using the C output for simplicity, but using the LLVM IR instead should just involve setting up the right compiler flags.
The ESP32-C3 is a RISC-V microcontroller running at up to 160 MHz. Its main selling point is its integrated Wi-Fi peripheral, but I will not be using it for this project. It has 384 KiB of internal RAM, and I will be fitting everything into this including the program code, the heap and the stack. (Running code directly from flash memory would also be possible in order to save RAM, but it’s potentially slower than running from RAM.)
What’s the motivation behind this project? While the Lean compiler is pretty good, Lean programs will never be as fast or as small as programs written in C. In an environment where every byte and every CPU clock cycle matters, wouldn’t it make sense to write everything in C?
As far as microcontrollers go, the ESP32-C3 is actually a pretty beefy machine. 160 MHz and 384 KiB of RAM might seem tiny compared to the 3+ GHz and 8+ GB of RAM in your usual desktop machine, but it is significantly more than what e.g. the Commodore 64 or the original Game Boy had. Here is a table comparing the ESP32-C3 to some historical computers:
| | CPU | RAM |
|---|---|---|
| Apollo Guidance Computer (1966) | 16 bit, 2 MHz | 2 KiB |
| Commodore 64 (1982) | 8 bit, 1 MHz | 64 KiB |
| Game Boy (1989) | 8 bit, 4 MHz | 8 KiB + 8 KiB VRAM |
| ESP32-C3 (2021) | 32 bit, 160 MHz | 384 KiB |
Additionally, the success of projects like NodeMCU, MicroPython or Espruino shows that running higher level languages on microcontrollers is very much feasible, and the developer effort saved by doing so can be worth it.
With that said, I will jump into describing the whole stack that goes into getting Lean running on the ESP32-C3. You can use the table of contents below to jump to a particular section that interests you; I would say ~80% of any section should still be understandable without having read the preceding ones.
- Lean runtime: If you are interested in how the Lean runtime represents objects, and why its authors decided to use reference counting instead of a garbage collector for memory management, this is the section for you.
- Toolchain: compiler and libraries: Instead of downloading the prebuilt toolchain provided by Espressif, I decided to compile and configure everything for myself to have more control. If you are curious about my adventures compiling my own libc and libc++, read here.
- Booting and application initialization: Normally the boot loader provided by ESP-IDF handles setting up the hardware, but I decided to not use ESP-IDF. If you want to know how the boot chain and hardware setup works on the ESP32-C3, read this one.
- Example application: RGB LED controller: Now that I have a working Lean environment on the ESP32-C3, I wanted to write a demo application for this blog post. If you want to see some Lean code (and maybe get inspired to try Lean, which I would definitely recommend), and want to know how to blink an LED in a very complicated way (with dependent types!), read here.
- Results: If you just want to see some clips of colorful flashing LEDs that are underpinned by all this work, this is the place.
Lean runtime
In this section I will describe how Lean represents objects in memory, and how it uses reference counting to manage them. Note that I will assume a 32 bit architecture (as the ESP32-C3 has a 32 bit core), though Lean is more commonly used on 64 bit platforms.
Object representation
Lean objects are allocated on the heap, and referenced by a pointer. They consist of a header containing a reference count and some other information, followed by some arbitrary contents.
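For reference, the header looks roughly like this (a simplified sketch of the `lean_object` definition in the runtime's `lean.h`):

```c
// Simplified: every heap-allocated Lean object starts with this header.
typedef struct {
    int      m_rc;        // reference count
    unsigned m_cs_sz:16;  // object size, used by the compacted-region machinery
    unsigned m_other:8;   // per-kind extra data (e.g. number of fields of a constructor)
    unsigned m_tag:8;     // what kind of object this is (constructor, array, string, ...)
} lean_object;
// The object's actual payload (fields, array elements, string bytes, ...) follows the header.
```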
Unfortunately, this can be quite inefficient for small objects. For example, it does not make sense to allocate small integers or booleans on the heap. As an optimization, certain small objects are stored directly as a value representing themselves (instead of a pointer pointing to them). For example, the number 13 would be represented directly by the constant `0x1b`.
But wait, how did 13 turn into 27? It turns out that in order to distinguish pointers from these so-called scalar values, the LSB (least significant bit) needs to be set to 1. The actual value is shifted left by one, so that the value `n` is encoded as `(n << 1) | 1`. This is how the value 13 is encoded: `(13 << 1) | 1 = 27 = 0x1b`.
Note that the LSB of pointers is guaranteed to be 0, since the objects in the heap are always aligned to 4 bytes. This way, scalars and pointers can be clearly distinguished based on their LSB.
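The helpers in `lean.h` that implement this encoding boil down to something like the following (a simplified sketch, building on the header definition above):

```c
#include <stdbool.h>
#include <stddef.h>

// A scalar has its least significant bit set; real pointers are at least
// 4-byte aligned, so their LSB is always 0.
static inline bool lean_is_scalar(lean_object * o) {
    return ((size_t)o & 1) == 1;
}

// Encode a small value directly in the "pointer": n becomes (n << 1) | 1,
// so lean_box(13) is the constant 0x1b from the example above.
static inline lean_object * lean_box(size_t n) {
    return (lean_object *)((n << 1) | 1);
}

// Decode it again by dropping the tag bit.
static inline size_t lean_unbox(lean_object * o) {
    return (size_t)o >> 1;
}
```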
Memory management
Most managed languages nowadays use some sort of tracing garbage collector to automatically clean up unused objects. Lean uses reference counting instead. The header of each heap allocated object contains a reference count. It gets incremented if there is a new reference to the object, and decremented when a reference is no longer needed. When the reference count reaches zero, the object is freed.
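In simplified C terms, the bookkeeping looks something like this (a sketch building on the sketches above, with made-up names; the runtime's real `lean_inc`/`lean_dec` additionally handle persistent and multi-threaded objects):

```c
// Hypothetical helper standing in for the runtime's deallocation path:
// releases the object's children, then frees its memory.
void free_object_and_children(lean_object * o);

static inline void rc_inc(lean_object * o) {
    if (!lean_is_scalar(o))   // scalars are not heap-allocated, nothing to count
        o->m_rc++;
}

static inline void rc_dec(lean_object * o) {
    if (!lean_is_scalar(o) && --o->m_rc == 0)
        free_object_and_children(o);
}
```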
Lean opts to use reference counting for two major reasons. First, a big weakness of reference counting, the handling of cycles, is not an issue, since due to immutability it is guaranteed that the object graph is acyclic.
Second, it allows for an optimization that is particularly useful for languages where all values are immutable. Take the following Lean code:
```lean
def rep (n : Nat) (v : α) :=
  match n with
  | 0 => #[]
  | n + 1 => Array.push (rep n v) v
```
This returns an array containing the value `v` repeated `n` times. But since the array is immutable, it is created and destroyed `n` times, always with one extra element.
As a special optimization, when the `Array.push` builtin function receives an argument with a reference count of exactly 1, it can mutate the existing array instead of having to create a new one. Since the reference count guarantees that this mutation cannot be observed by any other part of the code, this does not violate functional purity.
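In C terms the check is just an exclusivity test on the reference count; roughly (a sketch with made-up helper names, in the spirit of the runtime's `lean_array_push`):

```c
// Hypothetical helpers standing in for the real array internals.
lean_object * copy_array(lean_object * arr);                      // copy elements, bump their refcounts
lean_object * push_in_place(lean_object * arr, lean_object * v);  // grow storage if needed, append v

lean_object * array_push_sketch(lean_object * arr, lean_object * v) {
    // lean_is_exclusive(arr) is essentially "the reference count is 1".
    if (!lean_is_exclusive(arr)) {
        lean_object * fresh = copy_array(arr);  // someone else can still see arr: copy first
        rc_dec(arr);
        arr = fresh;
    }
    // We now hold the only reference, so mutating in place is unobservable from Lean code.
    return push_in_place(arr, v);
}
```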
You can read more about this specific optimization, and much more, in this paper: [1] S. Ullrich and L. de Moura, “Counting immutable beans: Reference counting optimized for purely functional programming,” 2020. Available: https://arxiv.org/abs/1908.05647.
Getting the runtime to compile & run on the ESP32-C3
The runtime includes its own special allocator to speed up the allocation of small objects. It allocates 8 MiB segments from the system to then suballocate the objects. This is many times larger than the whole 384 KiB of RAM we have, so it definitely won’t cut it. The easiest solution is to just disable this special allocator, and rely on the `malloc` implementation being fast enough.
The next challenge is that the runtime is written in C++. If possible I would like to avoid having to deal with exceptions and vtables and other C++ features requiring special runtime considerations compared to C.
I went on a quest to slim down the runtime, and removed as many unnecessary things as possible:
- There were some uses of C++ exceptions, but none of them were hard to remove.
- I removed a bunch of unneeded features, like multithreading or debugging support.
- I removed library functions assuming the presence of an OS, like file IO or process management.
What remains is mostly the core functionality used by compiled Lean programs:
- Primitives for allocating objects and manipulating reference counts.
- Optimized functions for calling closures.
- Implementations for special-case types like strings, arrays, and natural numbers.
- IO primitives, so things like `IO.print` work.
Toolchain: compiler and libraries
Overview
I wanted to have full control over the whole toolchain to be able to customize things if necessary, so I decided to build the standard C and C++ libraries for myself. I compiled everything with the `-ffunction-sections` flag, which allows garbage collection of individual functions at link time, reducing code size. The full tech-stack ended up being the following:
- C/C++ compiler: Clang version 18. Clang is a multi-target compiler, so unlike GCC, I don’t have to build a special RISC-V cross-compiler. I just pass the `-target riscv32-none-elf -march=rv32imczicsr` flags and it works.
- C standard library: picolibc. Picolibc is specifically designed for small embedded systems with limited RAM. It includes a `malloc` implementation, which is needed for the Lean runtime. It also includes standard input/output functions, which are also used by the Lean runtime to implement `IO.print`. I hooked them up to the built-in USB serial peripheral on the ESP32-C3.
- C++ standard library: libc++. libc++ is developed under the umbrella of the LLVM project, and so is a natural choice to use together with clang. C++ is needed to compile the Lean runtime. I think `std::vector` is the most important thing it needs from the C++ standard library. (Which uses C++’s `new` operator, which uses `malloc` under the hood. Fortunately all the C++ stuff I needed ended up just working relatively painlessly.)
- Lean compiler: No custom modifications.
- Lean runtime: My own slimmed down version, discussed above.
- Lean standard library (`Init`): No modifications. (Though I do have a dirty hack to reduce its code size massively, see below.)
- QEMU for testing: Espressif’s QEMU branch with support for the ESP32-C3.
The Nix package manager
If you ever tried to put together some software project with multiple customized dependencies that need to be compiled, you know how painful it can be. And even if you manage to get it working, and document how you did it, the chances of someone else getting it working on the first try (or even you yourself on another computer) are basically zero.
You did not forget to record all the custom build system flags you used, right? Also, project X needs a one line modification to get it to compile in your environment; don’t forget about that. Also, project Y needs an older version of dependency Z, so you have to temporarily downgrade it while compiling that one project; you recorded the exact version needed, right?
And even if you documented every step perfectly, some project is bound to depend on something in your environment you did not notice, but someone else might not have installed.
Also, if you want to take your project and continue working on it, let’s say 1 or 2 years later, chances are that something will have broken by then. Even if you recorded the versions of ALL packages installed on your system, reproducing the exact same environment is impossible on most operating systems.
Nix is a package manager that builds packages in a reproducible manner. Every build is isolated from the rest of the world and only supplied with dependencies that were explicitly declared, so that it’s not possible to forget declaring a dependency. Every external input (e.g. source code downloaded from a git repository) must be pinned by its hash so that it cannot change implicitly.
If a nix package works today, it is pretty much guaranteed to continue working many years into the future without any changes.
Nix also makes it very easy to define one’s own packages using its package definition DSL called the nix language.
Because of these aforementioned advantages, I decided to use Nix for this project from the very start. This guarantees that anyone (either you, or future me many years from now) can just clone the project repository, and run `nix build` to compile the exact versions of everything I used for the project, including the final binary that runs on the ESP32-C3.
Targeting the ESP32-C3 with nixpkgs
The main package collection for Nix is called `nixpkgs`, which I will be using as a base to build my packages on.
If the description above made it seem like Nix is a magic bullet and everything is perfect and beautiful in the Nix ecosystem, I have some bad news: building software for embedded targets is not a very common (or commonly tested) use-case of `nixpkgs`, and some hacks will be needed. But importantly, these are reproducible hacks! Once they are working, we can just shove them into a git repository and forget about them; they will keep working no matter how ugly they are.
With all that said, let’s see what it takes to target an embedded system with `nixpkgs`. `nixpkgs` has an stdenv (standard environment) for building packages that includes common dependencies like a C compiler and a libc. I want to instantiate a package set with an stdenv that comes with clang, picolibc, libc++, and preconfigured compiler flags which are needed to compile code for the ESP32-C3.
The most important compiler flag is the exact target processor architecture, which is configured as `rv32imczicsr`. (Just rolls off the tongue, right?) `rv32` stands for “RISC-V 32 bit”, `i` is the base integer instruction set, `m` is the extension with multiplication and division instructions, `c` signifies support for compressed instruction encoding, and finally `zicsr` signals support for “Control and Status Register” (CSR) instructions.
The package definition for picolibc is fairly straightforward. Getting libc++ to compile requires some patches to `nixpkgs`: this consists of turning on some options that are only enabled for WebAssembly in stock `nixpkgs` (I guess embedded environments and WebAssembly have quite a bit in common), and adding some extra flags on top to turn off some more libc++ features (like locale support, which we definitely don’t need).
Building the customized Lean runtime is moderately involved; the solution is to graft it onto the original Lean source tree so that we can use the build system that’s already in place.
Lean likes to use its own C compiler wrapper to apply some extra flags, and it takes quite a bit of work to convince it that this is really rather not a good idea in this situation. (It tries to do things like link with `stdc++`, the GNU implementation of the C++ standard library, which we don’t have, or turn on threading support.)
Compiling Lean’s `Init` library is fairly easy compared to everything else, though I don’t think compiling it separately from the rest of Lean is an intended use-case.
Finally, building the example app itself is not too hard either (though it still does take a moderately sized list of carefully crafted compiler flags), since all the hard work has already been done by setting up the stdenv and Lean.
Hacking on the project yourself
The project also contains a QEMU version compiled from Espressif’s repository, with support for the ESP32-C3, so you don’t even need a real device to try it out. Just run `nix run .#qemu` in the project root. You should see a message printed by the main function written in Lean.
For further information, see the README file.
Booting and application initialization
Memory layout
Here is how I decided to lay out the 384 KiB of RAM:
I decided to place the stack at the start so that a stack overflow can’t corrupt anything else (the stack grows downwards). I am not sure why placing it at the end of the RAM is the common practice, since that way it could silently corrupt the heap.
The code section (usually called `.text` by the compiler) is pretty self-explanatory; the example application I will be showing later weighs in at around 64 KiB. One notable thing about the code section is that it needs to be accessed through a different address space, in somewhat of a Harvard architecture style. On the ESP32-C3 the range `0x40380000`-`0x403dffff` is used to access the RAM for code execution, while `0x3fc80000`-`0x3fcdffff` is used for regular data access. (So the above code section starts at `0x40388000` in this mapping.) Some linker script trickery is needed to make sure that all the addresses in the linked code are correct.
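Both ranges are backed by the same internal SRAM, so the two views differ only by a constant offset. A tiny helper (my own illustration, using the addresses above) captures what the linker script has to account for:

```c
#include <stdint.h>

// The same internal SRAM is visible twice on the ESP32-C3:
//   0x3fc80000..0x3fcdffff  data bus view (loads/stores)
//   0x40380000..0x403dffff  instruction bus view (code fetch)
#define DRAM_VIEW_BASE 0x3fc80000u
#define IRAM_VIEW_BASE 0x40380000u

// Convert a data-view address into the address the CPU fetches code from.
static inline uintptr_t dram_to_iram(uintptr_t dram_addr) {
    return dram_addr - DRAM_VIEW_BASE + IRAM_VIEW_BASE;
}
```

For example, a code section placed at `0x3fc88000` in the data view is fetched by the CPU from `dram_to_iram(0x3fc88000) == 0x40388000`, matching the address above.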
The data section contains read-only constants (such as strings) and global variables with a defined value.
The bss section contains zero-initialized global variables. (Don’t know what the name “bss” stands for though.) Forgetting to zero this actually caused me to spend a day debugging seemingly unrelated malloc issues, which only manifested on real hardware and not in QEMU. It turns out that malloc stores the head of the free list in a global variable which it expects to be `NULL` by default. In QEMU all RAM is zeroed by default, but on real hardware it could be full of random junk values. Adding `memset(&_bss_start, 0, (&_bss_end - &_bss_start));` to zero the bss section fixed the issue. Oops.
Finally, the remaining space is designated as the heap. All we have to do in the linker script is to specify the `__heap_start` and `__heap_end` symbols, and from there picolibc’s malloc implementation knows that it can allocate memory in this range. In my example application around 284 KiB is left for the heap, which is plenty.
To run our program, the code and data sections need to be copied to their specific addresses, and then execution needs to begin from a specific entry address. This can be achieved by packaging our application in the firmware image format expected by the ROM boot loader.
Booting
Booting on the ESP32-C3 happens through the following steps:
1. On CPU reset, the program counter is set to `0x40000000`, and the CPU starts executing.
2. At `0x40000000` the chip’s internal ROM is located. The values on various GPIO pins can affect what happens, but by default, the ROM code loads the application code from the external SPI flash memory to RAM. The application, in the specific firmware image format, is expected to be located at the beginning of the flash.
3. The firmware image contains a number of segments to be loaded to specific addresses in RAM. The ROM code goes through them, and copies each segment to RAM.
4. Finally, the firmware image also contains an entry point address. The ROM code jumps to it, and from here we have full responsibility over execution.
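To make the image format a bit more concrete, here is a heavily simplified sketch of its layout in C (the real thing is ESP-IDF’s `esp_image_header_t`, which has more fields plus a checksum/hash at the end; this is just the gist):

```c
#include <stdint.h>

// Heavily simplified sketch of what the ROM boot loader parses (not the full format).
typedef struct __attribute__((packed)) {
    uint8_t  magic;          // 0xE9 marks an application image
    uint8_t  segment_count;  // number of load segments that follow
    uint8_t  spi_mode;       // SPI flash settings the ROM uses while reading the image
    uint8_t  spi_speed_size; // packed flash speed/size nibbles
    uint32_t entry_addr;     // where the ROM jumps after loading (call_start_cpu0 for us)
    // ... the real header continues with chip ID, revision requirements, etc.
} image_header_sketch;

// Each segment is a (destination, length) pair followed by the payload bytes.
typedef struct __attribute__((packed)) {
    uint32_t load_addr;      // destination address in RAM
    uint32_t data_len;       // number of payload bytes that follow this header
} image_segment_header_sketch;
```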
Note that unfortunately the ROM code is not open source, but parts of it have been reverse engineered, for example for security analysis. The ROM and the Wi-Fi MAC are the only non-open source software parts in the ESP32 ecosystem that I am aware of.
The entry point of my application is named `call_start_cpu0` (this could be any name; it’s specified in the linker script). Here is the entry point function (we could also be using picolibc’s crt0 instead, but I was too lazy to set it up):
```c
void __attribute__((noreturn)) call_start_cpu0() {
    __asm__("la sp, _stack_end;");
    memset(&_bss_start, 0, (&_bss_end - &_bss_start));
    disable_wdt();
    main(0, 0);
    while (true)
        ;
}
```
The very first thing we do is load the stack pointer into the `sp` register so that we can call functions and have local variables, pretty important things for most programs. Next, we zero the bss section, which, as I already mentioned, is pretty important. We also disable the default watchdog timers. Next, we call the `main` function, which is the entry point generated by the Lean compiler. In case `main` were to return, we enter a busy loop. (Trying to return here would probably lead to a stack underflow, followed by the CPU going off trying to execute some random code.)
Lean startup
The `main` function emitted by the Lean compiler looks something like this:
```c
int main(int argc, char **argv) {
    lean_object *in;
    lean_object *res;
    lean_initialize_runtime_module();
    lean_set_panic_messages(false);
    res = initialize_Main(1 /* builtin */, lean_io_mk_world());
    lean_set_panic_messages(true);
    lean_io_mark_end_initialization();
    if (lean_io_result_is_ok(res)) {
        lean_dec_ref(res);
        lean_init_task_manager();
        res = _lean_main(lean_io_mk_world());
    }
    lean_finalize_task_manager();
    if (lean_io_result_is_ok(res)) {
        int ret = 0;
        lean_dec_ref(res);
        return ret;
    } else {
        lean_io_result_show_error(res);
        lean_dec_ref(res);
        return 1;
    }
}
```
Notable parts are:
- The call to `lean_initialize_runtime_module` does any initialization needed by the runtime. As I mentioned already, I slimmed down the runtime a lot, so this doesn’t do too much.
- `initialize_Main` initializes some global constants of the `Main` module. It also calls `initialize_Init` to initialize the standard library. The standard library initialization turned out to be an issue, since apparently Lean does not remove the initialization code for unused constants. The standard library is fairly big, and leaving its initialization as-is makes the code size of the application larger than the 384 KiB size of our whole RAM, so we clearly need to do something about this. My trick is to override the `initialize_Init` function with a stub implementation that does nothing: `lean_object *initialize_Init(uint8_t builtin, lean_object *w) { return lean_io_result_mk_ok(lean_box(0)); }` As long as we carefully avoid using any constants from the standard library that would have needed initialization, this seems to just work. (Though it’s definitely not a safe solution.)
- Finally, the call to `_lean_main` executes the `main` IO monad of our Lean program. This is what a hello world program in Lean would look like: `def main : IO Unit := do IO.println "hello world"`
Example application: RGB LED controller
The development board I have has a WS2812B RGB LED, so making it light up in various patterns was a pretty obvious choice for an example application.
First, I define an interface for arbitrary lighting patterns:
```lean
structure LightPattern where
  State : Type
  init : State
  step : State → State
  color : State → Color
```
- `State` is the type of the internal state of the pattern. For example, for a simple blinking pattern it could be a `Bool`.
- `init` is the initial state.
- `step` is a function to compute the next state from the current one.
- `color` is the function to get the current output color of the LED based on the state.
And yes, all the fields after `State` can refer to its value (a type) to define their own types; this is what dependent typing is all about.
To make this a little less abstract, let’s define a simple blinking pattern with an arbitrary color:
```lean
def LightPattern.blink (c : Color) : LightPattern := {
  State := Bool
  init := true
  step := Bool.not
  color := (if · then c else default)
}
```
`State` is a `Bool`, as I already mentioned previously. `step` is simply the boolean negation. `color` will be `c` when the state is `true`, and off otherwise. (`default` is defined as `{ red := 0, green := 0, blue := 0 }`.) The syntax for defining `color` might be a little confusing: the `·` shorthand is used to define the function, which is equivalent to `fun s => if s then c else default`.
Let’s define a more advanced pattern, which cycles through the three color channels:
```lean
inductive ColorChannel := | Red | Green | Blue

def ColorChannel.toColor : ColorChannel → Color
  | Red => { red := 0xFF, green := 0, blue := 0 }
  | Green => { red := 0, green := 0xFF, blue := 0 }
  | Blue => { red := 0, green := 0, blue := 0xFF }

def LightPattern.cycle : LightPattern := {
  State := ColorChannel
  init := .Red
  step := fun
    | .Red => .Green
    | .Green => .Blue
    | .Blue => .Red
  color := ColorChannel.toColor
}
```
Next, let’s define a “light pattern transformer” that takes an existing pattern and modifies it. In particular I want to be able to tune the speed of these patterns. I will be running the update loop at around 1000 Hz, which would be much too fast to see the blinking pattern, for instance. The following pattern will only run the `step` function of the underlying pattern `p` every `n` steps:
```lean
def LightPattern.every (n : UInt16) (p : LightPattern) : LightPattern := {
  State := UInt16 × p.State,
  init := (0, p.init),
  step := fun (i, s) => if i == n then (0, p.step s) else (i + 1, s),
  color := fun (_, s) => p.color s
}
```
Let’s unpack this:
- We define the state to be a 16 bit counter, plus whatever state `p` has. `×` denotes a product type, a.k.a. a pair. (I use a 16 instead of a 32 bit counter since a 32 bit value would have to be allocated on the heap due to the way Lean’s object representation works, whereas a 16 bit value will be a scalar. It would be nice if Lean also had 31 bit integers like OCaml.)
- `init` is zero and the initial state of `p`.
- `step` is the core of the logic: when the counter reaches `n`, we call the `step` of the underlying `p`. Otherwise, we just increment the counter, and leave the state of `p` as-is.
- `color` just outputs the color of `p`.
Finally, I wanted to also have a candle-like flicker effect. First, we need a (pseudo) random number generator for this (source):
```lean
abbrev Prng.State := UInt32

def Prng.step (state : Prng.State) : Prng.State := state + 0x9E3779B9

def Prng.get (z : Prng.State) : UInt32 :=
  let z := z ^^^ (z >>> 16)
  let z := z * 0x21F0AAAD
  let z := z ^^^ (z >>> 15)
  let z := z * 0x735A2D97
  z ^^^ (z >>> 15)
```
The only state of our candle will be the 32 bit PRNG state. (Unfortunately, this will be allocated on the heap since it’s a 32 bit value.) Here is the code for the candle:
```lean
def LightPattern.flicker : LightPattern := {
  State := Prng.State
  init := 0
  step := Prng.step
  color := fun state =>
    let random := Prng.get state
    let brightness := 0x80 + (random &&& 0x7F)
    {
      red := brightness.toUInt8,
      green := 150 * brightness / 255 |>.toUInt8,
      blue := 50 * brightness / 255 |>.toUInt8,
    }
}
```
All the candle-specific logic happens in the `color` function. We get the current value of the PRNG, calculate a brightness between 128 and 255 with it, and then mix up a warm white looking color while being very careful to avoid floating point maths. (Floating point has some issue I haven’t looked into yet and calculates bad values.)
Next, let’s define our application. We want to dynamically switch between patterns from a predefined array, which we parameterize our app over:
```lean
structure AppState (patterns : Array LightPattern) where
  patternIndex : Fin patterns.size
  state : (patterns.get patternIndex).State
```
`patternIndex` is the index of the currently playing light pattern, while `state` is the current state of the current light pattern, and its type depends on the current `patternIndex`.
Here is the rest of the app implementation, understanding it is an exercise left to the reader:
```lean
variable (patterns : Array LightPattern)

/-- Switch to the next pattern. -/
def AppState.nextPattern (s : AppState patterns) : AppState patterns :=
  let patternIndex := s.patternIndex + Fin.ofNat' 1 (
    match e : patterns.size with
    | 0 => Fin.elim0 (e ▸ s.patternIndex)
    | _ + 1 => (by omega))
  {
    patternIndex,
    state := (patterns.get patternIndex).init,
  }

/-- Update the state of the current pattern. -/
def AppState.step (s : AppState patterns) : AppState patterns :=
  { s with state := (patterns.get s.patternIndex).step s.state }

/-- Get the current output color. -/
def AppState.color (s : AppState patterns) : Color :=
  (patterns.get s.patternIndex).color s.state

/-- Initialize a new app state. -/
def AppState.init (hPatterns : patterns.size > 0) : AppState patterns :=
  let patternIndex := Fin.ofNat' 0 hPatterns
  {
    patternIndex,
    state := (patterns.get patternIndex).init,
  }
```
Next, we define our array of light patterns (assuming a 1000 Hz update frequency):
```lean
def patterns : Array LightPattern := #[
  every 500 (blink white),
  every 100 (blink white),
  every 500 (blink red),
  every 100 (blink red),
  every 500 (blink green),
  every 100 (blink green),
  every 500 (blink blue),
  every 100 (blink blue),
  every 500 cycle,
  every 100 cycle,
  every 50 flicker,
  breathe 1000 .Red,
  breathe 1000 .Green,
  breathe 1000 .Blue
]
```
And finally, we are ready to define our main loop:
```lean
def main : IO Unit := do
  let mut appState : AppState patterns
    := AppState.init patterns (by simp [patterns])
  let mut buttonPressedPrev := false
  while true do
    -- Read the button state, advance to the next pattern when pressed
    let buttonPressed := !(← readGpio 9)
    if !buttonPressedPrev && buttonPressed then
      appState := appState.nextPattern
    buttonPressedPrev := buttonPressed
    -- Update the pattern state and output the current color to the LED
    appState := appState.step
    setLedState appState.color
    -- TODO:
    -- Use proper timer instead of just a wait, so the frequency does not
    -- depend on the time it takes to do the update.
    delayUs 1000
```
Wait, didn’t I say that Lean was a pure language? What are these mutable variables and while loop doing here? As it turns out it’s all syntax sugar, and a pretty advanced one at that. All the “impure” features are compiled down to pure monadic code by Lean. You can learn more by reading the relevant chapter of the Lean manual.
`readGpio`, `setLedState`, and `delayUs` are implemented in C:
```lean
@[extern "c_read_gpio"]
opaque readGpio : Fin 22 → IO Bool

@[extern "c_set_led_state"]
opaque setLedState : (@& Color) → IO Unit

@[extern "c_delay_us"]
opaque delayUs : Nat → IO Unit
```
For example, here is what the `c_set_led_state` function looks like:
```c
lean_object *c_set_led_state(lean_object *state, lean_object *w) {
    uint32_t data = lean_ctor_get_uint8(state, 0) << 16 |
                    lean_ctor_get_uint8(state, 1) << 8 |
                    lean_ctor_get_uint8(state, 2);
    send_led_data(data);
    return lean_io_result_mk_ok(lean_box(0));
}
```
See the appendix below for how the `send_led_data` function works.
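The other two externs are similar. As a rough illustration (not the actual code from the repository), a `c_read_gpio` in the same style might look something like this, assuming a `GPIO_IN_REG` register-access macro defined like the `GPIO_OUT_*` ones used in the appendix:

```c
// Rough sketch, not the repository's actual implementation.
// Assumes lean_unbox/lean_box/lean_io_result_mk_ok from <lean/lean.h>, and a
// GPIO_IN_REG macro that reads the GPIO input register (an assumption here).
lean_object *c_read_gpio(lean_object *pin, lean_object *w) {
    size_t gpio = lean_unbox(pin);                // the Fin 22 argument arrives as a scalar Nat
    uint32_t level = (GPIO_IN_REG >> gpio) & 1;   // read the pin's current input level
    return lean_io_result_mk_ok(lean_box(level)); // Bool is returned as a 0/1 scalar
}
```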
Results
Here are some clips of some LED patterns (in particular `every 100 cycle`, `breathe 1000 .Blue`, and `every 50 flicker`) on my Waveshare ESP32-C3-Zero development board.
Conclusions
Overall I am pleased with the results; beforehand I wasn’t even quite sure if running Lean in such a constrained environment would be possible. For now this is still very much just a proof-of-concept, though in the future I might try to build something actually useful with an ESP32 and Lean. Here is a list of ideas for how this project could be taken further:
- Transition to first compiling everything to LLVM bitcode instead of directly to machine code; this would enable link-time optimization to further reduce code size, and might simplify the cross-compilation setup.
- Use ESP-IDF, it already has a bunch of useful hardware abstractions, and could potentially make supporting other devices from the ESP32 family much easier.
- Get threading support by putting FreeRTOS into the stack. (ESP-IDF already uses it.)
- Get Wi-Fi working, would be pretty important for real-world applications.
- Try upstreaming some of the needed modifications to Lean.
- Investigate if unused initialization code could somehow be detected and removed; this is definitely a big pain point.
- I think adding flags to make some features optional in the runtime (e.g. C++ exceptions, threads, and various OS interfaces) should be relatively uncontroversial.
- Make the build system in general behave a bit better when cross-compiling.
- Add 31 bit integers. (Maybe by special-casing the `Fin` type?)
- Debug why floating point operations are giving bad results.
Finally, if you are interested in any of the topics mentioned in this article, here are some useful resources:
- Lean: I recommend the Functional Programming in Lean book; it’s quite a comprehensive introduction to programming in Lean. If you are by chance interested in using Lean as a proof assistant, take a look at the separate book titled Theorem Proving in Lean 4.
- ESP32: If you want to program in a memory safe language for the ESP32 family of devices, I would recommend trying Rust, specifically the esp-rs project, which seems to have an active community around it.
- Nix: Not going to lie, I find the resources in the Nix ecosystem to be pretty lacking if you want to do anything somewhat non-trivial. Be prepared to read a lot of `nixpkgs` source code if you want to do something on the level presented in this article. The best I can do is link you to the official learning resource collection on the NixOS website, but I cannot vouch for the quality of anything there.
Appendix: Bit banging the WS2812 protocol
The WS2812 family of addressable LEDs needs to be driven with very particular timing constraints. A `1` is signaled with the data line being high for 800 ns and low for 450 ns, while a `0` is signaled by it being high for 400 ns and low for 850 ns. We need to transfer 24 bits like this, 8 for each color channel.
Using a 20 MHz CPU clock, a single clock cycle is 50 ns, meaning that the transfer of a single bit needs to take exactly 25 cycles (800 ns + 450 ns = 400 ns + 850 ns = 1250 ns = 25 × 50 ns).
Most instructions take 1 clock cycle on the ESP32-C3, except for branch instructions, which take 1 cycle if the branch is not taken, but take 3 if the branch is taken! (At least according to this article I found on the topic.) And be thankful that the ESP32-C3 does not have a branch predictor, so that even if the timings are somewhat complicated, they are at least deterministic.
With all that said, here is the working code that I ended up with:
```c
__attribute__((noinline)) void send_led_data(uint32_t data) {
    const size_t gpio = 10;
    for (size_t i = 0; i < 24; ++i) {
        const uint32_t bit = data & (1 << (23 - i)); // shift + and: 2 cycles
        GPIO_OUT_W1TS_REG = 1 << gpio;               // GPIO high: 1 cycle
        if (bit) {
            // if branch not taken: 1 cycle
            __asm__("nop; nop; nop; nop; nop;"
                    "nop; nop; nop; nop; nop;");     // wait: 10 cycles
            GPIO_OUT_W1TC_REG = 1 << gpio;           // GPIO low: 1 cycle
            __asm__("nop;");                         // wait: 1 cycle
            // loop variable inc + jump taken: 4 cycles
        } else {
            // if branch taken: 3 cycles
            __asm__("nop; nop;");                    // wait: 2 cycles
            GPIO_OUT_W1TC_REG = 1 << gpio;           // GPIO low: 1 cycle
            __asm__("nop; nop; nop; nop; nop; nop; nop; nop; nop;"); // wait: 9 cycles
            // loop variable inc + jump not taken: 2 cycles
        }
    }
}
```
Writing it took some effort involving constantly looking at the disassembly to see what smart ways clang managed to come up with to helpfully “optimize” my code, and then accounting for them. E.g. you might notice that the jump back to the beginning of the loop takes a different number of cycles in the two branches of the `if`! It turns out that clang laid out the code in such a way that the `else` branch of the `if` is placed before the beginning of the loop, so execution just falls through into the beginning of the loop again without having to jump back. This is faster; as you can see, it saved 2 clock cycles for us…
This code will probably keep working up until I upgrade to the next version of clang, and it comes up with some other way to outsmart us.
Also you might notice that the clock cycles in an iteration above only add up to 20, and not 25. I had to take out a few `nop`s, because the LED was not working with 25 for some reason. I still don’t know what the issue is; either there are still some clock cycles that I am not accounting for, or maybe the clock is slightly slower than 20 MHz for some reason. (I would need an oscilloscope or logic analyzer to debug this, which I don’t have.)
Overall, I would rate the experience of writing this function 0/10. If I ever need to write such timing sensitive code again I will make sure to do it in assembly, because that will probably cause much fewer headaches. Optimizing compilers are awesome, except in rare cases like this, when they are working actively against you.