A high-performance, lock-free Multi-Producer Single-Consumer (MPSC) channel implementation in Zig, achieving ~180 billion messages per second (u32) and ~93 billion messages per second (u64) on commodity hardware in a cache-resident batch benchmark.
Platform: Linux only. Uses futex, eventfd, epoll, memfd_create, NUMA sysfs, and SCM_RIGHTS — all Linux-specific APIs. macOS and Windows are not supported.
Throughput on AMD Ryzen 7 5700 (8 cores, 16 threads, DDR4-3200):
| Config | u32 (4B msg) | u64 (8B msg) |
|---|---|---|
| 1P1C | 29 B/s | 15 B/s |
| 2P1C | 58 B/s | 29 B/s |
| 4P1C | 112 B/s | 58 B/s |
| 6P1C | 160 B/s | 85 B/s |
| 8P1C | 180 B/s | 93 B/s |
B/s = billion messages per second. 500M msgs/producer, 32K batch, 32K ring, @memset fill. Each per-producer ring (128–256 KB) fits within the 512 KB dedicated L2 per Zen 3 core — throughput reflects cache-resident end-to-end message movement. See docs/methodology.md for full methodology.
How these numbers are achieved: Each producer calls reserve(32768) → @memset (fills 128KB for u32) → commit(32768). The consumer advances head pointers. At 1P1C, a single core writes 128KB into L2 cache per batch at near-peak L2 bandwidth (~118 GB/s on Zen 3), yielding ~29 B/s. At 8P, each producer has its own L2 cache (512KB per Zen 3 core) with zero cross-producer contention, scaling to ~180 B/s (~80% efficiency — the ~20% loss is L3 coherency traffic and consumer polling overhead, not lock contention). Ring coordination (reserve/commit/consume) takes <0.1% of total time — 99.9% is the @memset write. See docs/ringmpsc.md §5.3 for the complete first-principles derivation.
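The hot loop described above can be sketched in a few lines. This is a sketch, not the library's exact signatures: `reserve`/`commit` are the names used in the quickstart, but the slice return type and the absence of error handling here are assumptions.

```zig
const BATCH: usize = 32 * 1024;

// Producer hot loop (sketch): reserve a contiguous batch of slots,
// fill them with one @memset, then publish the whole batch at once.
fn producerLoop(producer: anytype, total: usize) void {
    var sent: usize = 0;
    while (sent < total) : (sent += BATCH) {
        const slots = producer.reserve(BATCH); // may spin while the ring is full
        @memset(slots, 0xAAAA_AAAA);           // 128 KB of u32 payload, stays in L2
        producer.commit(BATCH);                // single release-store publishes the batch
    }
}
```

Because the reserve/commit bookkeeping is two pointer updates per 32K messages, essentially all wall time is the `@memset` itself.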
Single-message send comparison against a Vyukov-style CAS MPSC and a mutex-guarded queue (M/s = million messages per second):

| Config | ringmpsc | CAS MPSC | Mutex | ringmpsc vs Mutex |
|---|---|---|---|---|
| 1P1C | 54 M/s | 115 M/s | 24 M/s | 2x |
| 2P1C | 117 M/s | 23 M/s | 29 M/s | 4x |
| 4P1C | 237 M/s | 20 M/s | 10 M/s | 24x |
| 8P1C | 514 M/s | 14 M/s | 4 M/s | 129x |
At 8 producers, both the mutex and the CAS queue collapse under contention while ringmpsc scales to 514 M/s (~9.5x over 1P1C). The CAS MPSC (Vyukov-style bounded array) wins at 1P1C, where there is no contention, but degrades from 115 M/s to 14 M/s at 8P — an 8x degradation. See docs/benchmarks.md for the full competitive analysis, including batch throughput, latency, and comparison against rigtorp-style queues.
RingMPSC uses a ring-decomposed architecture where each producer has a dedicated SPSC ring buffer, eliminating producer-producer contention entirely.
```
Producer 0 ──► [Ring 0] ──┐
Producer 1 ──► [Ring 1] ──┼──► Consumer (polls all rings)
Producer 2 ──► [Ring 2] ──┤
Producer N ──► [Ring N] ──┘
```
- 128-byte Cache Line Alignment — Head and tail pointers separated by 128 bytes to prevent prefetcher-induced false sharing
- Cached Sequence Numbers — Producers cache the consumer's head position, refreshing only when the ring appears full
- Batch Operations — `consumeBatch` processes all available items with a single atomic head update
- Adaptive Backoff — Crossbeam-style exponential backoff (spin -> yield -> park)
- Zero-Copy API — `reserve`/`commit` pattern for direct ring buffer writes
- NUMA-Aware Allocation — Per-ring heap allocation with producer-local NUMA binding
- Futex-Based Wait Strategies — 6 pluggable strategies from busy-spin to futex-blocking with zero CPU when idle
- EventNotifier — Optional eventfd-based consumer wake for epoll/io_uring event loop integration (zero CPU when idle, instant wake on data)
- Adaptive Consumer Skip — Exponential backoff on empty rings, reducing consumer polling from O(P) to O(active_producers)
- Zero-Copy BufferPool — Lock-free slab allocator for large message passing via pointer-ring pattern. Ring carries 4-byte handles; payloads stay in pre-allocated pool with zero copies.
- Cross-Process SharedRing — SPSC ring buffer shared across process boundaries via memfd_create + MAP_SHARED. Automatic cleanup on fd close, heartbeat-based crash detection, cross-process futex wake.
- SCM_RIGHTS fd Passing — Send/receive file descriptors between processes over Unix sockets for SharedRing establishment.
- Graceful Shutdown — Coordinated close with drain and fully-drained detection
- Latency Tracking — Per-batch min/max/avg nanosecond latency (comptime-gated, zero overhead when disabled)
- Channel Stats Snapshot — Single `snapshot()` call returning all health metrics
- Thin Event Loop — Managed consumer loop with automatic drain on stop
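Two of the layout points above (128-byte head/tail separation and the cached head) can be illustrated with a struct sketch; the field names are illustrative, not the library's actual layout.

```zig
const std = @import("std");

fn Ring(comptime T: type, comptime bits: u6) type {
    return struct {
        buffer: [1 << bits]T = undefined,
        // Consumer-owned: advanced once per consumed batch.
        head: std.atomic.Value(usize) align(128) = std.atomic.Value(usize).init(0),
        // Producer-owned: published with .release on commit. The 128-byte
        // separation keeps the adjacent-line prefetcher from coupling
        // head and tail into false sharing.
        tail: std.atomic.Value(usize) align(128) = std.atomic.Value(usize).init(0),
        // Producer-local snapshot of head, refreshed only when the ring
        // appears full, so the common-case send does no cross-core load.
        cached_head: usize align(128) = 0,
    };
}
```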
See docs/ringmpsc.md for the full algorithm writeup with memory ordering analysis.
```zig
const ringmpsc = @import("ringmpsc");

// SPSC ring buffer (zero-overhead, type alias for Ring)
var ring = ringmpsc.spsc.Channel(u64, .{}){};
_ = ring.send(&[_]u64{ 1, 2, 3 });

// MPSC channel (multi-producer, heap-allocated rings)
var channel = try ringmpsc.mpsc.Channel(u64, .{}).init(allocator);
defer channel.deinit();
const producer = try channel.register();
_ = producer.sendOne(42);

// MPMC channel (work-stealing consumers)
var mpmc = ringmpsc.mpmc.MpmcChannel(u64, .{}){};
const prod = try mpmc.register();
var cons = try mpmc.registerConsumer();
_ = prod.sendOne(42);
var buf: [16]u64 = undefined;
_ = cons.steal(&buf);
```

```sh
# Build
zig build -Doptimize=ReleaseFast

# Unit tests
zig build test

# Integration tests
zig build test-spsc
zig build test-mpsc
zig build test-fifo
zig build test-chaos
zig build test-determinism
zig build test-stress
zig build test-fuzz
zig build test-mpmc
zig build test-spmc
zig build test-simd
zig build test-blocking
zig build test-wraparound

# Benchmarks
zig build bench-high-perf      # Primary throughput benchmark (u32 + u64)
zig build bench-mpsc           # MPSC baseline
zig build bench-spsc           # SPSC latency
zig build bench-simd           # SIMD operations
zig build bench-suite          # Full benchmark suite with HTML report
zig build bench-analysis       # Bottleneck analysis
zig build bench-cross-process  # SharedRing vs in-process (4.6% overhead)
zig build bench-zero-copy      # BufferPool crossover analysis by message size
zig build bench-wake-latency   # Spin vs eventfd wake latency (110ns vs 6us p50)
zig build bench                # Full suite: multi-run stats, baselines, HTML report

# Examples
zig build example-basic          # SPSC send/recv + reserve/commit
zig build example-mpsc           # Multi-producer pipeline with handler
zig build example-backpressure   # Blocking send with timeout
zig build example-shutdown       # Graceful shutdown with drain
zig build example-event-loop     # Managed consumer loop
zig build example-event-notifier # eventfd + epoll consumer
zig build example-zero-copy      # Zero-copy large messages via BufferPool
zig build example-shared-memory  # Cross-process ring via memfd + mmap
zig build example-cross-process  # Full IPC with fd passing over Unix socket
zig build example-logging        # Structured logging pipeline (real-world pattern)

# Clean benchmark run (recommended)
sudo taskset -c 0-15 nice -n -20 zig-out/bin/bench-high-perf
```

- docs/architecture.md — System architecture: type hierarchy, memory layout, data flow, memory ordering model
- docs/api.md — Complete API reference for all channel types, primitives, and utilities
- docs/ringmpsc.md — Algorithm design: ring decomposition, cache layout, batch consumption
- docs/methodology.md — Benchmark methodology: measurement procedure, threats to validity, reproducibility
- docs/benchmarks.md — Results and analysis: scaling, bottleneck breakdown, configuration sensitivity
- docs/correctness.md — Correctness properties, memory ordering model, test coverage
For clean benchmark runs:

```sh
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sudo taskset -c 0-15 nice -n -20 zig-out/bin/bench-high-perf
```

| | Single-Consumer | Multi-Consumer |
|---|---|---|
| Single-Producer | `spsc.Channel` | `spmc.Channel` |
| Multi-Producer | `mpsc.Channel` | `mpmc.MpmcChannel` |
- SPSC — Zero-overhead type alias for the core Ring. Fastest possible path.
- MPSC — Heap-allocated per-producer rings with NUMA support. Dynamic producer registration (grow at runtime). The primary use case.
- SPMC — Single ring, CAS-based multi-consumer stealing.
- MPMC — Per-producer rings with progressive work-stealing (local shard -> hottest ring -> random fallback).
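All four types are parameterized by the compile-time `Config` shown below; a non-default MPSC construction might look like this sketch (the field names come from `Config`, while the specific values and the `main` scaffolding are illustrative):

```zig
const std = @import("std");
const ringmpsc = @import("ringmpsc");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // Sketch: smaller rings, metrics on, NUMA auto-detect (the default).
    var channel = try ringmpsc.mpsc.Channel(u64, .{
        .ring_bits = 12, // 2^12 = 4K slots per ring
        .max_producers = 8,
        .enable_metrics = true,
    }).init(gpa.allocator());
    defer channel.deinit();

    const producer = try channel.register();
    _ = producer.sendOne(42);
}
```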
```zig
pub const Config = struct {
    ring_bits: u6 = 16,             // 2^16 = 64K slots per ring
    max_producers: usize = 16,
    enable_metrics: bool = false,
    track_contention: bool = false,
    prefetch_threshold: usize = 16,
    numa_aware: bool = true,
    numa_node: i8 = -1,             // -1 = auto-detect
};
```

Four properties are verified by the test suite. See docs/correctness.md for details.
- Per-Producer FIFO — Messages from a single producer are received in order
- No Data Loss — All sent messages are eventually received
- Thread Safety — No data races under concurrent access
- Determinism — Identical inputs produce identical outputs
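A per-producer FIFO check of the kind the suite performs can be sketched as follows; the tagging scheme and helper are hypothetical, only the property itself comes from the list above.

```zig
const std = @import("std");

const MAX_PRODUCERS = 16;

// Sketch: each producer sends (id << 32) | seq with seq starting at 1;
// the consumer asserts per-producer sequence numbers arrive in order.
var last_seq = [_]u64{0} ** MAX_PRODUCERS;

fn checkFifo(msg: u64) void {
    const id: usize = @intCast(msg >> 32);
    const seq = msg & 0xFFFF_FFFF;
    // Messages from one producer must be strictly sequential, even
    // though messages from different producers may interleave freely.
    std.debug.assert(seq == last_seq[id] + 1);
    last_seq[id] = seq;
}
```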
```
ringmpsc/
├── src/
│   ├── ringmpsc.zig            # Public API root module
│   ├── primitives/
│   │   ├── ring_buffer.zig     # Core SPSC ring buffer
│   │   ├── wait_strategy.zig   # Futex-based wait strategies
│   │   ├── event_notifier.zig  # eventfd for epoll/io_uring integration
│   │   ├── buffer_pool.zig     # Lock-free slab allocator for zero-copy
│   │   ├── shared_ring.zig     # Cross-process ring via memfd + MAP_SHARED
│   │   ├── backoff.zig         # Exponential backoff
│   │   ├── simd.zig            # SIMD batch operations
│   │   └── blocking.zig        # Blocking wrappers with timeout
│   ├── platform/
│   │   ├── numa.zig            # NUMA topology + allocation
│   │   └── fd_passing.zig      # SCM_RIGHTS fd send/recv over Unix socket
│   ├── mpsc/channel.zig        # MPSC channel (ring-decomposed)
│   ├── spsc/channel.zig        # SPSC channel (Ring type alias)
│   ├── mpmc/channel.zig        # MPMC channel (work-stealing)
│   ├── spmc/channel.zig        # SPMC channel (CAS consumers)
│   ├── event_loop.zig          # Thin managed consumer loop
│   └── metrics/collector.zig   # Throughput + latency metrics
├── tests/                      # Integration, stress, fuzz, blocking tests
├── benchmarks/src/             # Benchmark suite + analysis
├── examples/                   # Usage examples (basic, pipeline, logging, backpressure, shutdown, event loop, eventfd+epoll)
├── bindings/
│   ├── c/                      # C header-only implementation
│   └── rust/                   # Rust port
├── docs/
│   ├── architecture.md         # System architecture
│   ├── api.md                  # API reference
│   ├── ringmpsc.md             # Algorithm writeup
│   ├── methodology.md          # Benchmark methodology
│   ├── benchmarks.md           # Results and analysis
│   └── correctness.md          # Correctness verification
└── scripts/
    ├── verify.sh               # One-click verification (19 checks)
    └── bench-multirun.sh       # Multi-run statistics (median, stddev)
```
- C — Header-only library in `bindings/c/ringmpsc.h`
- Rust — Full port in `bindings/rust/`
| Term | Definition |
|---|---|
| MPSC | Multi-Producer Single-Consumer — many threads send, one thread receives |
| SPSC | Single-Producer Single-Consumer — one sender, one receiver (fastest possible) |
| B/s | Billion messages per second (10⁹ messages/sec) |
| M/s | Million messages per second (10⁶ messages/sec) |
| Ring decomposition | Giving each producer its own SPSC ring instead of sharing one queue |
| Batch consumption | Processing many messages with a single atomic head update (amortizes sync cost) |
| Cache-resident | Data fits in CPU cache (L1/L2/L3), not main memory — much faster access |
| False sharing | Two threads' data on the same cache line causing spurious invalidations |
| Acquire/Release | Memory ordering that ensures writes before a release-store are visible after an acquire-load |
| CAS | Compare-And-Swap — atomic read-modify-write, fails under contention |
| Futex | Fast userspace mutex — OS-level sleep/wake with zero overhead when uncontested |
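The Acquire/Release pairing from the table is what makes batch publication safe; a minimal SPSC sketch (not the library's code) shows the two sides.

```zig
const std = @import("std");

var buffer: [1024]u64 = undefined;
var tail = std.atomic.Value(usize).init(0);

// Producer: write the payload first, then publish with .release, which
// makes every earlier write visible to a matching .acquire load.
fn publish(t: usize, msg: u64) void {
    buffer[t & 1023] = msg;
    tail.store(t + 1, .release);
}

// Consumer: the .acquire load of tail guarantees the payload writes
// above are visible before buffer slots up to tail are read.
fn available(head: usize) usize {
    return tail.load(.acquire) - head;
}
```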
- LMAX Disruptor — Original batch-consumption pattern
- rigtorp/SPSCQueue — High-performance C++ SPSC queue
- crossbeam-channel — Rust concurrent channels
- moodycamel::ConcurrentQueue — C++ lock-free queue