O-POPE is a highly scalable, high-frequency outer-product engine for General Matrix Multiply (GEMM) acceleration with minimal buffering overhead. Developed as a collaboration between ETH Zurich (Integrated Systems Laboratory) and the University of Bologna, O-POPE targets the execution time and energy bottlenecks of modern machine learning floating-point workloads.
The key architectural innovation is the runtime repurposing of floating-point unit (FPU) pipeline registers as implicit data reuse and double buffers. By implementing an output-stationary outer-product dataflow on a semi-systolic mesh, the engine hides pipeline latency and decouples accumulation from memory access without relying on bulky, area-intensive input buffer arrays. In a 12 nm FinFET process, O-POPE achieves 1 GHz at 0.72 V, sustaining up to 99.97% FPU utilization with input buffer overhead below 2% for large configurations.
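The output-stationary outer-product dataflow described above can be illustrated with a small software model. This is a minimal sketch, not the RTL behavior: the function name `outer_product_gemm` and the tiling loop are assumptions made for illustration, and it models only the arithmetic ordering (each accumulator stays in place while one column of A and one row of B are broadcast per step), not pipeline timing or memory access.

```python
def outer_product_gemm(A, B, p):
    """Compute C = A @ B tile by tile as a sum of rank-1 (outer-product) updates.

    A is M x K and B is K x N, given as lists of lists.
    p is the (hypothetical) side length of the p x p PE mesh.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, p):              # walk over output tile rows
        for j0 in range(0, N, p):          # walk over output tile columns
            rows = range(i0, min(i0 + p, M))
            cols = range(j0, min(j0 + p, N))
            # Output-stationary: one accumulator per PE, held for all K steps.
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for k in range(K):
                a_col = {i: A[i][k] for i in rows}   # A element broadcast along mesh row i
                b_row = {j: B[k][j] for j in cols}   # B element broadcast along mesh column j
                for i in rows:
                    for j in cols:
                        acc[i, j] += a_col[i] * b_row[j]   # one MAC per PE per step
            for (i, j), value in acc.items():       # write the finished tile back once
                C[i][j] = value
    return C
```

Each step `k` consumes only `p` values of A and `p` values of B to feed `p*p` multiply-accumulates, which is the data reuse the outer-product formulation provides.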
Authors: Danilo Cammarata (ETH Zurich IIS)
- Repository Structure
- Integration & Dependencies
- Getting Started
- Software Support
- Architectural Parameters
- License
```
opope/
├── rtl/                      # Synthesizable SystemVerilog RTL (Solderpad 0.51)
├── sw/                       # Software HAL, driver, and example application (Apache 2.0)
│   ├── kernel/               # Bare-metal boot and linker scripts
│   └── utils/                # Utility headers
├── golden-model/             # Python golden reference models per precision (Apache 2.0)
│   ├── FP8/, FP16/, FP32/    # Single-precision golden models
│   └── FP8FP16/, FP16FP32/   # Mixed-precision golden models
├── target/sim/               # Simulation infrastructure
│   ├── src/                  # Testbenches in SystemVerilog (Solderpad 0.51) / C++ (Apache 2.0)
│   ├── vsim/                 # QuestaSim/ModelSim makefrags (Apache 2.0)
│   └── verilator/            # Verilator makefrags (Apache 2.0)
├── scripts/                  # Environment and utility scripts (Apache 2.0)
├── Bender.yml                # Dependency manifest
└── Makefile                  # Top-level build system
```
The primary RTL modules in rtl/ are:
| Module | Description |
|---|---|
| `opope_top.sv` | Top-level wrapper. Integrates the PE mesh, memory streamer, and control subsystems. Exposes an HCI memory port and an optional HWPE peripheral control port. |
| `opope_engine.sv` | Configurable p×p PE mesh. Handles row-wise broadcasting for Matrix A and column-wise broadcasting for Matrix B. |
| `opope_ctrl.sv` | Central state machine. Orchestrates config phases, execution flags, multi-context scheduling, and pipeline/memory decoupling boundaries. |
| `opope_streamer.sv` | Memory address generation. Manages multi-channel data motion for matrices A, B, and C over the TCDM interface. |
| `opope_buffers.sv` | Input FIFOs and latency tolerance logic for absorbing memory bank conflicts and aligning data streams. |
O-POPE can be integrated into any system that exposes an HCI-compatible memory interface. Two integration modes are selected at compile time:
- `TARGET_OPOPE_HWPE` (recommended): the host core configures the accelerator via a memory-mapped peripheral port (`hwpe_ctrl_intf_periph`). This is the standard path and works with any RISC-V core that can perform MMIO writes, including systems built on the open-source PULP platform.
- `TARGET_OPOPE_COMPLEX`: the accelerator is controlled via the CV32E40X eXtension Interface (XIF), enabling custom RISC-V instructions.
All hardware dependencies are managed via Bender and declared in Bender.yml:
| Dependency | Version | Purpose |
|---|---|---|
| fpnew / cvfpu | pulp-v0.1.3 | Transprecision FPU (FP8/FP16/FP32). Integer units can be substituted. |
| hci | v2.1.2 | Memory interconnect interface |
| hwpe-ctrl | v2.1.0 | Register file and peripheral control interface |
| hwpe-stream | 1.7 | Streaming data interface |
| common_cells | 1.21.0 | Shared RTL primitives |
| tech_cells_generic | 0.2.11 | Technology-independent cell wrappers |
The following tools are required. For external users, `make init` installs the RISC-V toolchain, Bender, and Verilator automatically; commercial simulators must be installed separately.
| Tool | Purpose | Install |
|---|---|---|
| Bender | RTL dependency manager | `make init` (via Cargo) |
| RISC-V GCC (`riscv32-unknown-elf-gcc`) | Compile software stimuli | `make init` (downloads prebuilt binary) |
| QuestaSim / ModelSim or Verilator | RTL simulation | QuestaSim / ModelSim: must be installed separately. Verilator: `make init` |
| Python 3 | Golden model generation | System package manager |
External users can fetch and install the open-source dependencies (RISC-V GCC, Bender, and Verilator) with:

```bash
make init
```
Before simulating, generate the software reference output for your target matrix dimensions and precision. For example, a 32×32×32 GEMM in FP16:
```bash
make golden OP=gemm M=32 N=32 K=32 fp_fmt=FP16
```
Supported formats: FP16, FP32, and mixed-precision variants FP8FP16, FP16FP32.
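To make concrete what a software reference for an FP16 GEMM has to compute, the sketch below rounds every operand and every partial sum to IEEE 754 binary16 using only the Python standard library. It is an illustration under stated assumptions, not the repository's golden model: the names `to_fp16` and `gemm_fp16_reference` are hypothetical, and rounding after each accumulation step is an assumed policy, since the hardware's exact rounding behavior is not described here.

```python
import struct

def to_fp16(x):
    """Round a number to the nearest IEEE 754 binary16 value (returned as a float)."""
    # The struct format character 'e' is the 16-bit half-precision float.
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]

def gemm_fp16_reference(A, B):
    """Reference C = A @ B with FP16 inputs and an FP16 accumulator.

    Assumption: the running sum is rounded to FP16 after every
    multiply-accumulate step. A is M x K, B is K x N (lists of lists).
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                # The product of two FP16 values is exact in double precision.
                prod = to_fp16(A[i][k]) * to_fp16(B[k][j])
                acc = to_fp16(acc + prod)
            C[i][j] = acc
    return C
```

Values outside the binary16 range make `struct.pack` raise `OverflowError`, so inputs for such a model would need to be scaled accordingly.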
O-POPE supports both QuestaSim/ModelSim and Verilator simulation flows.
Run with QuestaSim (with GUI):

```bash
make sim target=vsim gui=1
```

Run with Verilator (with waveform tracing):

```bash
make sim target=verilator gui=1
```

Run the full regression suite and collect logs:

```bash
make sim-all
```
The sw/ directory provides a complete bare-metal software stack for programming O-POPE from a RISC-V host core:
| File | Description |
|---|---|
| `sw/archi_opope.h` | Register map: base address and offsets for all control registers |
| `sw/hal_opope.h` | HAL: inline C functions for configuring, triggering, and polling the accelerator |
| `sw/utils/opope_utils.h` | Utility functions for result comparison across FP8/FP16/FP32 |
| `sw/opope.c` | Example application: runs a GEMM, sleeps on `wfi`, and checks output against the golden model |
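Comparing accelerator output against a golden model usually needs a tolerance, because accumulation order and rounding can differ slightly between hardware and reference. The sketch below shows one common way to do this for FP16 values, by measuring the distance in units in the last place (ULPs) on the raw bit patterns. It is an assumption-laden illustration: the function names and the `max_ulp` threshold are hypothetical and do not describe what `opope_utils.h` actually implements, and NaN handling is left out.

```python
import struct

def fp16_bits(x):
    """Return the 16-bit IEEE 754 binary16 pattern of x as an integer."""
    return struct.unpack("<H", struct.pack("<e", float(x)))[0]

def ordered(bits):
    """Map a sign-magnitude FP16 pattern to an integer that grows with the value."""
    magnitude = bits & 0x7FFF
    return -magnitude if bits & 0x8000 else magnitude

def count_mismatches(result, golden, max_ulp=1):
    """Count element pairs whose FP16 encodings differ by more than max_ulp ULPs."""
    errors = 0
    for r, g in zip(result, golden):
        distance = abs(ordered(fp16_bits(r)) - ordered(fp16_bits(g)))
        if distance > max_ulp:
            errors += 1
    return errors
```

A return value of zero would correspond to a passing check under the chosen tolerance.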
A typical offload sequence using the HAL:
```c
#include "archi_opope.h"
#include "hal_opope.h"

// 1. Clear state and acquire a job slot
hwpe_soft_clear();
while (hwpe_acquire_job() < 0);

// 2. Configure: matrix pointers, dimensions, and arithmetic format
opope_cfg((unsigned int)x, (unsigned int)w, (unsigned int)y,
          M_SIZE, N_SIZE, K_SIZE,
          gemm_ops, comp_fmt, mem_fmt);

// 3. Trigger execution and wait for interrupt
hwpe_trigger_job();
asm volatile("wfi" ::: "memory");
```

O-POPE can be reconfigured at compile time via top-level parameters:
| Parameter | Description | Options / Default |
|---|---|---|
| `FpFormat` | Compute precision of the accumulator of the PE array | FP16 (default), FP32 |
| `Height` / `Width` (p) | PE mesh dimensions (rows × columns) | Power of 2 (default: 8) |
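The effect of the mesh size p on a given problem can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a description of the actual scheduler: it assumes one outer-product step per cycle per output tile, partial tiles that occupy the full mesh, and no pipeline fill, drain, or memory stalls. The function name `ideal_gemm_cycles` is hypothetical.

```python
import math

def ideal_gemm_cycles(M, N, K, p=8):
    """Estimate tiles, cycles, and PE utilization for an M x N x K GEMM on a p x p mesh.

    Idealized assumptions: each p x p output tile takes K cycles
    (one rank-1 update per cycle) and tiles are processed back to back.
    """
    tiles = math.ceil(M / p) * math.ceil(N / p)   # output tiles, edge tiles padded
    cycles = tiles * K                            # K outer-product steps per tile
    macs = M * N * K                              # useful multiply-accumulates
    utilization = macs / (cycles * p * p)         # fraction of PE slots doing useful work
    return tiles, cycles, utilization
```

For the 32×32×32 example with the default p = 8, this gives 16 tiles, 512 cycles, and an ideal utilization of 1.0; dimensions that are not multiples of p lower the figure, for example 0.75 for a 12×8×4 problem.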
This repository uses two licenses, applied by directory:
Solderpad 0.51 applies to all synthesizable RTL and SystemVerilog testbenches:
- `rtl/`: all `.sv` files
- `target/sim/src/`: all `.sv` files (`opope_tb.sv`, `opope_tb_wrap.sv`, `opope_complex_tb.sv`, `tb_dummy_memory.sv`)

Apache 2.0 applies to all software, scripts, build infrastructure, and golden models:
- `sw/`: all C, assembly, and header files (except `sw/utils/tinyprintf.h`, see below)
- `golden-model/`: all Python scripts
- `scripts/`: all shell scripts and utilities
- `target/sim/src/opope_tb.cpp`: Verilator C++ testbench driver
- `target/sim/vsim/`: QuestaSim makefrags and TCL scripts
- `target/sim/verilator/`: Verilator makefrags
- `Makefile`, `Bender.yml`, `bender_common.mk`, `bender_sim.mk`, `bender_synth.mk`
- CI configuration (`.github/`, `.gitlab/`)

`sw/utils/tinyprintf.h` is a third-party file: Copyright (c) 2004, 2012 Kustaa Nyholm / SpareTimeLabs, distributed under a BSD-style license. See the file header for the full terms.