title: Learning to write an LLVM backend, one small step at a time
date: 2026-03-19
license: CC-BY-4.0
slug: llvm-ia16-intro
summary: An introduction to a learning journal series about building an LLVM backend from scratch, targeting the 8086.
tags: llvm, compilers, ia16, 8086, backend

I can read x86 assembly reasonably well.
I understand what the compiler is *trying* to do when I look at its output.
But ask me how it gets there — how a compiler decides which registers to use,
how it turns an abstract operation into concrete instructions,
what all those optimisation passes are actually doing — and I'd have to admit I don't really know.
I've always treated the compiler as a black box.
This project is about opening it.

The target architecture chose itself.
I've spent years reading about computer history —
[OS/2 Museum](https://www.os2museum.com/) is a favourite — and the 8086 sits at the centre of a lot of it.
I'm not an expert: I can't write large programs in assembler,
but I understand the instruction set, I know the registers, I have a feel for the addressing modes.
Using hardware I already know means one fewer unknown when the compiler does something unexpected.

So: building an LLVM backend for the 8086, from scratch, documenting every step.
This is a learning journal.

## The full picture: what a compiler does

Before getting to the part I'm actually going to build, it helps to look at the whole pipeline.
A compiler takes source code and produces machine instructions — but it does this in several steps,
each with a clear input and output.

```mermaid
flowchart TD
    SRC["Source code"]
    AST["Abstract Syntax Tree"]
    IR["LLVM IR\nIntermediate Representation"]
    OPT["Optimised LLVM IR"]
    OBJ["Object file"]
    EXE["Executable"]
    RUN["Running program"]

    SRC -->|"Lexing & parsing"| AST
    AST -->|"Semantic analysis"| IR
    IR -->|"Optimisation passes"| OPT
    OPT -->|"Code generation"| ASM
    ASM -->|"Assembler"| OBJ
    OBJ -->|"Linker"| EXE
    EXE -->|"Run"| RUN

    subgraph "this project"
        ASM["Assembly text\n.asm"]
    end
```

The **Intermediate Representation** is the key abstraction in the middle.
Instead of going straight from source code to machine instructions —
which would need a separate optimiser for every language and every CPU —
compilers first translate everything into a common form that is not tied to any specific architecture.
Optimisation passes work on that form,
and then the code generator translates the optimised result into real instructions for a specific machine.

In [GCC](https://gcc.gnu.org/), this intermediate form is called
[GIMPLE](https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html) at the high level and
[RTL](https://gcc.gnu.org/onlinedocs/gccint/RTL.html) (Register Transfer Language) closer to the machine.
In [LLVM](https://llvm.org/), it is called
[LLVM IR](https://llvm.org/docs/LangRef.html) — a readable, typed representation that looks a bit like
assembly for an imaginary machine.
For example, a function that adds two 16-bit integers looks like this in LLVM IR:

```llvm
define i16 @add(i16 %a, i16 %b) {
  %r = add i16 %a, %b
  ret i16 %r
}
```

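To make the target concrete, here is roughly what that function could look like on the other side. This is a hand-written sketch in NASM syntax, not output from any backend — it assumes a hypothetical stack-based calling convention (arguments pushed right to left, result returned in `AX`), one plausible choice among the conventions still to be decided:

```nasm
; Hypothetical small-model code for @add.
; Assumes the caller pushed %b, then %a, so after the CALL:
;   [sp]   return address
;   [sp+2] %a
;   [sp+4] %b
; The 8086 has no SP-relative addressing, so BX stands in for SP.
add:
        mov     bx, sp
        mov     ax, [bx + 2]    ; load %a
        add     ax, [bx + 4]    ; add %b
        ret                     ; result in AX
```

A real backend would also have to decide who cleans up the stack and which registers a callee may clobber — decisions this series will have to make explicitly.
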
The code generator — the **backend** — takes optimised LLVM IR and translates it into assembly for a specific target.
That is what I'm going to build.

## Why LLVM and not GCC?

A fair question, since [gcc-ia16](https://github.com/tkchia/gcc-ia16) already exists —
a real, working GCC port targeting the 8086.
I could just use that.
But this project is about learning, not about producing the most practical toolchain.

GCC's backend is older and more tangled up with the rest of the compiler.
LLVM's backend is a more self-contained module with cleaner interfaces —
or at least that's what I've read.
I'll find out if that's true.
LLVM IR is also easy to write by hand,
which matters because I won't be using a C frontend in the early stages.

There's also the ecosystem: LLVM ships with backends for many architectures —
from small embedded targets to large general-purpose CPUs.
Some of them are simple enough to learn from —
I expect to figure out which ones in the next few posts.

LLVM's existing x86 backend is also worth mentioning.
Modern 32-bit and 64-bit x86 has its roots in the 8086,
so there is probably a lot to learn from it.
But extending it to support 16-bit would be more complex than building a new backend in parallel —
the existing backend carries a lot of assumptions about the target that would need to be unpicked.

## The goal structure

Having a clear set of goals helps keep scope under control — or so I hope.

**Dream goal:** [Clang](https://clang.llvm.org/) support — a full C/C++ frontend targeting the 8086.
Far enough away to be motivating without being a near-term distraction.

**Strategic goal:** a proper 8086 backend.
The Intel 8086 — 16-bit registers, segmented memory, DOS-era calling conventions.
An architecture quirky enough that several LLVM assumptions will probably need to be worked around.
I don't know exactly which ones yet.

**Operative goal:** [small memory model](#dos-memory-models) only.
The 8086's segmented memory is genuinely complex —
far pointers, multiple memory models, segment register management.
That's all real and eventually interesting, but it's not where I start.
LLVM works naturally with flat memory models,
and the DOS small memory model — one code segment, one data segment, no far pointers —
is close enough to flat that the framework handles it well.
One fewer thing to fight.

**Tactical goal:** a stripped-down 8086 subset.
Rather than targeting the full architecture right away,
I'll start with a small subset — a handful of instructions, a few registers, a simple calling convention.
Real hardware, real instruction encodings,
but only as much of the ISA as the simplest possible program needs.
I think that's a good way to separate "does the LLVM infrastructure work?"
from "is the hardware modelled correctly?" — but I'll see if that holds up in practice.

## Decisions made so far

A few choices have already been made — mostly about moving complex parts outside the backend.
These may turn out to be wrong; if so, I'll document that too.

### External assembler

The backend will emit [NASM](https://www.nasm.us/)-compatible text assembly.
NASM handles binary encoding and produces object files.
Binary encoding is complex enough on its own,
so keeping it out of scope until the core problems are solved makes sense.

### External linker and output formats

[NASM](https://www.nasm.us/) supports multiple output formats.
For a DOS `.EXE` executable, NASM can produce
[OMF](https://en.wikipedia.org/wiki/Relocatable_Object_Module_Format) (Object Module Format) object files,
which a DOS-compatible linker such as [OpenWatcom's WLINK](https://github.com/open-watcom/open-watcom-v2)
can then link into an executable.
For simpler flat `.COM` binaries, NASM can produce a flat binary directly.
The exact linker choice and output format are something I'll need to settle later.

### Development environment

The goal is to track current LLVM — starting with LLVM 22 on Linux.
As new versions are released I'll try to rebase, and note any API changes that affected the code.

### Validation target

The tactical goal is a backend that can correctly compile generic LLVM IR —
no target-specific intrinsics or attributes, just standard operations.
The first concrete test is a function that adds two 16-bit integers and returns the result,
written directly in LLVM IR,
compiled through [llc](https://llvm.org/docs/CommandGuide/llc.html) (the LLVM static compiler),
assembled with NASM, linked, and run under any x86 emulator —
[DOSBox](https://www.dosbox.com/), [QEMU](https://www.qemu.org/), or [PCjs](https://www.pcjs.org/) in the browser.
Whether a small instruction set is truly enough for generic IR is something I expect to find out along the way.

## A note on AI assistance

AI tools are part of this project —
for navigating the LLVM codebase, generating boilerplate,
and getting initial explanations of subsystems I haven't encountered before.
That includes helping draft this post —
the words get written together, but the decisions and the verification are mine.

They also get things wrong fairly often:
made-up API signatures, incorrect claims about how passes work, references to examples that don't exist.
I've learned to treat AI output as a starting point, not an answer,
and to check against actual LLVM source before accepting anything.
Where AI was relevant to a decision or discovery, I'll say so.

## What's next

Before the backend can generate any code, LLVM needs to know the target exists at all.
The very first step isn't writing any code generation logic —
it's getting LLVM to recognise `ia16` as a valid architecture.
That's where the next post starts.

The first milestone is not "correct code."
It's "no crash."
That's worth celebrating on its own terms.

---

The plan is simple: take the smallest possible program,
and push it through [llc](https://llvm.org/docs/CommandGuide/llc.html)
until real 8086 instructions come out the other end.

I don't know yet where it will break.

That's the point.

## Sidebar: DOS memory models { #dos-memory-models }

The Intel 8086 has a 20-bit physical address space — 1MB of addressable memory.
But its registers are only 16 bits wide — enough to address only 64KB directly.
Intel's solution was segmentation: four segment registers (`CS`, `DS`, `SS`, `ES`)
each point to a 64KB window into the full address space.
Any memory access is relative to one of these windows.
A full 20-bit address is formed by combining a segment register value with a 16-bit offset.

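The arithmetic is worth seeing once in executable form: the segment value is shifted left four bits (multiplied by 16) and the offset is added, giving a 20-bit result that wraps at 1MB. A short Python sketch with arbitrary example values — nothing here is specific to this project:

```python
def physical_address(segment: int, offset: int) -> int:
    """Compute the 20-bit physical address the 8086 forms from a
    segment:offset pair: (segment * 16 + offset) modulo 1MB."""
    return ((segment << 4) + offset) & 0xFFFFF

# Segments start every 16 bytes and overlap, so many different
# segment:offset pairs name the same physical byte.
print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(physical_address(0x1235, 0x0000)))  # 0x12350 again
```

That aliasing — up to 4,096 names for the same byte — is part of what made far pointers awkward for compilers.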
This worked well for hardware, but created a problem for compilers:
how should a program be laid out in this segmented space?
A pointer could be just a 16-bit offset within the current segment —
fast and small, but limited to 64KB —
or a full 32-bit segment:offset pair that could reach anywhere in the 1MB space,
at the cost of size and speed.
Every function call, every data access, every pointer had to answer this question.

The answer was the **memory model** — a compile-time contract that defined how a program used the segmented address space.
This concept was invented by Intel themselves, well before DOS or the IBM PC existed.
Intel's [PL/M-86 Compiler Operator's Manual](https://bitsavers.org/pdf/intel/ISIS_II/9800478A_ISIS-II_PLM_Compiler_Operators_Manual_Apr79.pdf)
(1979) already defined three models — `small`, `medium`, and `large` — as compiler controls,
with `small` as the default.
By 1981, Intel's [PL/M-86 User's Guide](https://bitsavers.org/pdf/intel/ISIS_II/121636-002_PLM86_Users_Guide_Nov81.pdf)
had added a fourth: `compact`.

When C compilers arrived on DOS, they inherited these concepts.
Early versions of [Lattice C](https://en.wikipedia.org/wiki/Lattice_C) and
[Microsoft C](https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B#Early_versions) 1.x
(which was rebranded Lattice C) supported only a single memory model.
Version 2.x of both compilers introduced support for multiple models.
By [Microsoft C 3.00](https://www.pcjs.org/software/pcx86/lang/microsoft/c/3.00/) (1985),
the first version fully developed by Microsoft rather than licensed from Lattice,
proper separate library sets existed for `small`, `medium`, and `large`.

| Model | Code segments | Data segments | Function pointer | Data pointer |
|-------|---------------|---------------|------------------|--------------|
| `tiny` | 1 (shared) | 1 (shared) | near | near |
| `small` | 1 | 1 | near | near |
| `medium` | multiple | 1 | far | near |
| `compact` | 1 | multiple | near | far |
| `large` | multiple | multiple | far | far |

Microsoft C 3.00 also introduced the `near` and `far` keywords,
which allowed individual pointers to override the selected memory model.
This appears to be their first appearance in a C compiler —
a feature that would become a staple of DOS-era C programming.

`tiny` deserves a special note: it was never a distinct compiler model in the same sense as the others.
It was `small` code linked and then converted to a flat `.COM` executable using `EXE2BIN` —
a DOS utility that stripped the EXE header and produced a raw binary image.
The result had to fit entirely within 64KB, code and data together.

For this project, `small` is the starting point: one code segment, one data segment, all pointers near.
It is the closest DOS memory model to the flat memory model that LLVM works with naturally.