Multi-Agent CUDA Kernel Optimizations

Solutions and metrics from Cursor's multi-agent system that autonomously optimized 235 CUDA kernels for NVIDIA Blackwell B200 GPUs, achieving a 38% geomean speedup over baselines.

Read the full writeup: Speeding up GPU kernels by 38% with a multi-agent system

Repository structure

L1/ — 94 single-operator kernel problems (e.g. attention, RoPE, RMSNorm)
L2/ — 82 multi-operator fused kernel problems (e.g. full decoder layers, MoE routing)
Quant/ — 33 quantized kernel problems (FP8, NVFP4)
FlashInfer-Bench/ — 26 problems benchmarked against FlashInfer (GEMM, GQA, MoE, fused ops)
combined_metrics.csv — Per-workload results: baseline latency, SOL latency, selected latency, and SOL score
problem_level_metrics.csv — Per-problem aggregate results: SOL score and speedup vs. baseline
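As a rough illustration of how a geomean speedup could be recomputed from the per-problem file, here is a minimal Python sketch. The column name `speedup` is an assumption, not something confirmed by this README, so check the actual CSV header first. The sample rows are made-up placeholders, not real results.

```python
import csv
import io
import math

def geomean_speedup(rows, column="speedup"):
    """Geometric mean of the values in `column`.

    The column name is hypothetical; values are assumed to be positive.
    """
    logs = [math.log(float(row[column])) for row in rows if row.get(column)]
    if not logs:
        raise ValueError(f"no values found in column {column!r}")
    return math.exp(sum(logs) / len(logs))

# Tiny in-memory stand-in for problem_level_metrics.csv (placeholder values).
sample = io.StringIO("problem,speedup\na,1.0\nb,4.0\n")
result = geomean_speedup(csv.DictReader(sample))
print(round(result, 6))
```

To run this against the real file, pass `csv.DictReader(open("problem_level_metrics.csv", newline=""))` and set `column` to whatever the header actually calls the speedup field.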

Each problem directory contains src/ (the kernel solution), solution.json, and traces.jsonl.
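The layout above can be read with a few lines of Python. This is only a sketch: it assumes solution.json holds a single JSON document and traces.jsonl holds one JSON value per line (the usual JSON Lines convention), and it makes no assumptions about the fields inside either file. The helper name `load_problem` and the demo directory are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def load_problem(problem_dir):
    """Return (solution, traces, source_file_names) for one problem directory."""
    problem_dir = Path(problem_dir)
    solution = json.loads((problem_dir / "solution.json").read_text())
    traces = []
    with open(problem_dir / "traces.jsonl") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                traces.append(json.loads(line))
    sources = sorted(p.name for p in (problem_dir / "src").iterdir())
    return solution, traces, sources

# Demo on a throwaway directory with placeholder contents (not real repo data).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "example_problem"
    (demo / "src").mkdir(parents=True)
    (demo / "src" / "kernel.cu").write_text("// placeholder\n")
    (demo / "solution.json").write_text(json.dumps({"placeholder": True}))
    (demo / "traces.jsonl").write_text('{"step": 1}\n{"step": 2}\n')
    solution, traces, sources = load_problem(demo)

print(len(traces), sources)
```

For a real problem, point `load_problem` at one of the directories under L1/, L2/, Quant/, or FlashInfer-Bench/.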
