Commit 1890ca2

Add introductory post for LLVM ia16 backend series
Introduces the learning journal series for building an LLVM backend targeting the 8086.
1 file changed: 260 additions, 0 deletions

title: Learning to write an LLVM backend, one small step at a time
date: 2026-03-19
license: CC-BY-4.0
slug: llvm-ia16-intro
summary: An introduction to a learning journal series about building an LLVM backend from scratch, targeting the 8086.
tags: llvm, compilers, ia16, 8086, backend

I can read x86 assembly reasonably well.
I understand what the compiler is *trying* to do when I look at its output.
But ask me how it gets there — how a compiler decides which registers to use,
how it turns an abstract operation into concrete instructions,
what all those optimisation passes are actually doing — and I'd have to admit I don't really know.
I've always treated the compiler as a black box.
This project is about opening it.

The target architecture chose itself.
I've spent years reading about computer history —
[OS/2 Museum](https://www.os2museum.com/) is a favourite — and the 8086 sits at the centre of a lot of it.
I'm not an expert: I can't write large programs in assembler,
but I understand the instruction set, I know the registers, I have a feel for the addressing modes.
Using hardware I already know means one fewer unknown when the compiler does something unexpected.

So: building an LLVM backend for the 8086, from scratch, documenting every step.
This is a learning journal.

## The full picture: what a compiler does

Before getting to the part I'm actually going to build, it helps to look at the whole pipeline.
A compiler takes source code and produces machine instructions — but it does this in several steps,
each with a clear input and output.

```mermaid
flowchart TD
    SRC["Source code"]
    AST["Abstract Syntax Tree"]
    IR["LLVM IR\nIntermediate Representation"]
    OPT["Optimised LLVM IR"]
    OBJ["Object file"]
    EXE["Executable"]
    RUN["Running program"]

    SRC -->|"Lexing & parsing"| AST
    AST -->|"Semantic analysis"| IR
    IR -->|"Optimisation passes"| OPT
    OPT -->|"Code generation"| ASM
    ASM -->|"Assembler"| OBJ
    OBJ -->|"Linker"| EXE
    EXE -->|"Run"| RUN

    subgraph "this project"
        ASM["Assembly text\n.asm"]
    end
```

The **Intermediate Representation** is the key abstraction in the middle.
Instead of going straight from source code to machine instructions —
which would need a separate optimiser for every language and every CPU —
compilers first translate everything into a common form that is not tied to any specific architecture.
Optimisation passes work on that form,
and then the code generator translates the optimised result into real instructions for a specific machine.

In [GCC](https://gcc.gnu.org/), this intermediate form is called
[GIMPLE](https://gcc.gnu.org/onlinedocs/gccint/GIMPLE.html) at the high level and
[RTL](https://gcc.gnu.org/onlinedocs/gccint/RTL.html) (Register Transfer Language) closer to the machine.
In [LLVM](https://llvm.org/), it is called
[LLVM IR](https://llvm.org/docs/LangRef.html) — a readable, typed representation that looks a bit like
assembly for an imaginary machine.
For example, a function that adds two 16-bit integers looks like this in LLVM IR:

```llvm
define i16 @add(i16 %a, i16 %b) {
  %r = add i16 %a, %b
  ret i16 %r
}
```

The code generator — the **backend** — takes optimised LLVM IR and translates it into assembly for a specific target.
That is what I'm going to build.

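To make the pairing concrete, here is a sketch of the kind of NASM-style output a finished backend might emit for `@add`.
It assumes a cdecl-like small-model calling convention (arguments on the stack, 16-bit result in `AX`) and an underscore-prefixed symbol name;
the actual calling convention and symbol naming are decisions this series still has to make.

```nasm
bits 16

; Hand-written sketch, not compiler output: a plausible 8086 rendering of @add,
; assuming a cdecl-style convention with arguments on the stack and the result in AX.
_add:
    push bp             ; set up a stack frame
    mov  bp, sp
    mov  ax, [bp+4]     ; first argument  (%a)
    add  ax, [bp+6]     ; second argument (%b)
    pop  bp             ; tear down the frame
    ret                 ; near return, result in AX
```

Even this tiny function already embeds several backend decisions:
whether to build a stack frame, where arguments live, and which register carries the return value.
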
## Why LLVM and not GCC?

A fair question, since [gcc-ia16](https://github.com/tkchia/gcc-ia16) already exists —
a real, working GCC port targeting the 8086.
I could just use that.
But this project is about learning, not about producing the most practical toolchain.

GCC's backend is older and more tangled up with the rest of the compiler.
LLVM's backend is a more self-contained module with cleaner interfaces —
or at least that's what I've read.
I'll find out if that's true.
LLVM IR is also easy to write by hand,
which matters because I won't be using a C frontend in the early stages.

There's also the ecosystem: LLVM ships with backends for many architectures —
from small embedded targets to large general-purpose CPUs.
Some of them are simple enough to learn from —
I expect to figure out which ones in the next few posts.

LLVM's existing x86 backend is also worth mentioning.
Modern 32-bit and 64-bit x86 has its roots in the 8086,
so there is probably a lot to learn from it.
But extending it to support 16-bit would be more complex than building a new backend in parallel —
the existing backend carries a lot of assumptions about the target that would need to be unpicked.

## The goal structure

Having a clear set of goals helps keep scope under control — or so I hope.

**Dream goal:** [Clang](https://clang.llvm.org/) support — a full C/C++ frontend targeting the 8086.
Far enough away to be motivating without being a near-term distraction.

**Strategic goal:** a proper 8086 backend.
The Intel 8086 — 16-bit registers, segmented memory, DOS-era calling conventions.
An architecture quirky enough that several LLVM assumptions will probably need to be worked around.
I don't know exactly which ones yet.

**Operative goal:** [small memory model](#dos-memory-models) only.
The 8086's segmented memory is genuinely complex —
far pointers, multiple memory models, segment register management.
That's all real and eventually interesting, but it's not where I start.
LLVM works naturally with flat memory models,
and the DOS small memory model — one code segment, one data segment, no far pointers —
is close enough to flat that the framework handles it well.
One fewer thing to fight.

**Tactical goal:** a stripped-down 8086 subset.
Rather than targeting the full architecture right away,
I'll start with a small subset — a handful of instructions, a few registers, a simple calling convention.
Real hardware, real instruction encodings,
but only as much of the ISA as the simplest possible program needs.
I think that's a good way to separate "does the LLVM infrastructure work?"
from "is the hardware modelled correctly?" — but I'll see if that holds up in practice.

## Decisions made so far

A few choices have already been made — mostly about moving complex parts outside the backend.
These may turn out to be wrong; if so, I'll document that too.

### External assembler

The backend will emit [NASM](https://www.nasm.us/)-compatible text assembly.
NASM handles binary encoding and produces object files.
Binary encoding is complex enough on its own,
so keeping it out of scope until the core problems are solved makes sense.

### External linker and output formats

[NASM](https://www.nasm.us/) supports multiple output formats.
For a DOS `.EXE` executable, NASM can produce
[OMF](https://en.wikipedia.org/wiki/Relocatable_Object_Module_Format) (Object Module Format) object files,
which a DOS-compatible linker such as [OpenWatcom's WLINK](https://github.com/open-watcom/open-watcom-v2)
can then link into an executable.
For simpler flat `.COM` binaries, NASM can produce a flat binary directly.
The exact choice of linker and output format is something I'll need to settle later.

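For orientation, the two output paths might look roughly like this on the command line, assuming NASM and OpenWatcom's WLINK;
the file names are placeholders and none of this is settled.

```sh
# Path 1: DOS .EXE via an OMF object file and a DOS-aware linker.
nasm -f obj prog.asm -o prog.obj
wlink system dos file prog.obj name prog.exe

# Path 2: flat .COM binary straight out of NASM (code assembled with org 100h).
nasm -f bin prog.asm -o prog.com
```
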
### Development environment

The goal is to track current LLVM — starting with LLVM 22 on Linux.
As new versions are released I'll try to rebase, and note any API changes that affected the code.

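For reference, a not-yet-upstream backend is usually enabled through LLVM's experimental-targets switch at configure time.
A minimal sketch, assuming an in-tree target named `IA16` (this project's working name, nothing upstream):

```sh
# From an llvm-project checkout: configure with the experimental target enabled.
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_TARGETS_TO_BUILD=X86 \
  -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=IA16

# Build just llc, which is all the early posts need.
ninja -C build llc
```
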
### Validation target

The tactical goal is a backend that can correctly compile generic LLVM IR —
no target-specific intrinsics or attributes, just standard operations.
The first concrete test is a function that adds two 16-bit integers and returns the result,
written directly in LLVM IR,
compiled through [llc](https://llvm.org/docs/CommandGuide/llc.html) (the LLVM Static Compiler),
assembled with NASM, linked, and run under any x86 emulator —
[DOSBox](https://www.dosbox.com/), [QEMU](https://www.qemu.org/), or [PCjs](https://www.pcjs.org/) in the browser.
Whether a small instruction set is truly enough for generic IR is something I expect to find out along the way.

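Spelled out as commands, that first test might eventually look something like this;
the `ia16` architecture name is hypothetical and none of it works yet.

```sh
# Hypothetical first test (the ia16 target does not exist yet).
llc -march=ia16 add.ll -o add.asm   # generic LLVM IR -> 8086 assembly text
nasm -f obj add.asm -o add.obj      # assemble with NASM
# ...then link with a small test harness and run the result in an emulator.
```
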
## A note on AI assistance

AI tools are part of this project —
for navigating the LLVM codebase, generating boilerplate,
and getting initial explanations of subsystems I haven't encountered before.
That includes helping draft this post —
the words get written together, but the decisions and the verification are mine.

They also get things wrong fairly often:
made-up API signatures, incorrect claims about how passes work, references to examples that don't exist.
I've learned to treat AI output as a starting point, not an answer,
and to check against actual LLVM source before accepting anything.
Where AI was relevant to a decision or discovery, I'll say so.

## What's next

Before the backend can generate any code, LLVM needs to know the target exists at all.
The very first step isn't writing any code generation logic —
it's getting LLVM to recognise `ia16` as a valid architecture.
That's where the next post starts.

The first milestone is not "correct code."
It's "no crash."
That's worth celebrating on its own terms.

---

The plan is simple: take the smallest possible program,
and push it through [llc](https://llvm.org/docs/CommandGuide/llc.html)
until real 8086 instructions come out the other end.

I don't know yet where it will break.

That's the point.

## Sidebar: DOS memory models { #dos-memory-models }

The Intel 8086 has a 20-bit physical address space — 1MB of addressable memory.
But its registers are only 16 bits wide, enough to address only 64KB directly.
Intel's solution was segmentation: four segment registers (`CS`, `DS`, `SS`, `ES`)
each point to a 64KB window into the full address space.
Any memory access is relative to one of these windows.
A full 20-bit address is formed by combining a segment register value with a 16-bit offset (segment × 16 + offset):
for example, `1234h:5678h` resolves to `12340h + 5678h = 179B8h`.

This worked well for hardware, but created a problem for compilers:
how should a program be laid out in this segmented space?
A pointer could be just a 16-bit offset within the current segment —
fast and small, but limited to 64KB —
or a full 32-bit segment:offset pair that could reach anywhere in the 1MB space,
at the cost of size and speed.
Every function call, every data access, every pointer had to answer this question.

The answer was the **memory model** — a compile-time contract that defined how a program used the segmented address space.
This concept was invented by Intel themselves, well before DOS or the IBM PC existed.
Intel's [PL/M-86 Compiler Operator's Manual](https://bitsavers.org/pdf/intel/ISIS_II/9800478A_ISIS-II_PLM_Compiler_Operators_Manual_Apr79.pdf)
(1979) already defined three models — `small`, `medium`, and `large` — as compiler controls,
with `small` as the default.
By 1981, Intel's [PL/M-86 User's Guide](https://bitsavers.org/pdf/intel/ISIS_II/121636-002_PLM86_Users_Guide_Nov81.pdf)
had added a fourth: `compact`.

When C compilers arrived on DOS, they inherited these concepts.
Early versions of [Lattice C](https://en.wikipedia.org/wiki/Lattice_C) and
[Microsoft C](https://en.wikipedia.org/wiki/Microsoft_Visual_C%2B%2B#Early_versions) 1.x
(which was a rebranded Lattice C) supported only a single memory model.
Version 2.x of both compilers introduced support for multiple models.
By [Microsoft C 3.00](https://www.pcjs.org/software/pcx86/lang/microsoft/c/3.00/) (1985),
the first version fully developed by Microsoft rather than licensed from Lattice,
proper separate library sets existed for `small`, `medium`, and `large`.

| Model | Code segments | Data segments | Function pointer | Data pointer |
|-------|---------------|---------------|------------------|--------------|
| `tiny` | 1 (shared) | 1 (shared) | near | near |
| `small` | 1 | 1 | near | near |
| `medium` | multiple | 1 | far | near |
| `compact` | 1 | multiple | near | far |
| `large` | multiple | multiple | far | far |

Microsoft C 3.00 also introduced the `near` and `far` keywords,
which allowed individual pointers to override the selected memory model.
This appears to be their first appearance in a C compiler —
a feature that would become a staple of DOS-era C programming.

`tiny` deserves a special note: it was never a distinct compiler model in the same sense as the others.
It was `small`-model code, linked and then converted to a flat `.COM` executable using `EXE2BIN`,
a DOS utility that stripped the EXE header and produced a raw binary image.
The result had to fit entirely within 64KB, code and data together.

For this project, `small` is the starting point: one code segment, one data segment, all pointers near.
It is the closest DOS memory model to the flat memory model that LLVM works with naturally.
