
Add multi-turn RL support for iterative kernel refinement#5

Open
nataliakokoromyti wants to merge 2 commits into ScalingIntelligence:main from nataliakokoromyti:add-multi-turn

Conversation


@nataliakokoromyti nataliakokoromyti commented Mar 25, 2026

  • Adds multi-turn RL support for iterative kernel refinement with execution feedback across up to T turns.
  • The implementation attempts to faithfully reproduce the multi-turn RL paradigm from the "Kevin: Multi-Turn RL for Generating CUDA Kernels" paper by Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, and Silas Alberti.
  • Extends the training loop, eval script, configs, and TensorBoard logging to support both single-turn and multi-turn modes.
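The refinement loop described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `multi_turn_rollout`, `Turn`, and the `generate`/`evaluate` callables are hypothetical names, assuming only that each turn feeds execution feedback back into the next prompt and stops early on a correct kernel.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One attempt within a rollout (hypothetical record type)."""
    kernel: str
    correct: bool
    feedback: str


def multi_turn_rollout(generate, evaluate, problem: str, max_turns: int) -> list:
    """Generate a kernel, run it, and feed the execution log back into
    the next prompt, for up to max_turns attempts (sketch of the paradigm)."""
    prompt = problem
    trajectory = []
    for _ in range(max_turns):
        kernel = generate(prompt)
        correct, feedback = evaluate(kernel)
        trajectory.append(Turn(kernel, correct, feedback))
        if correct:
            break  # stop early once a kernel passes
        # Next prompt carries the failed attempt and its execution feedback.
        prompt = (f"{problem}\nPrevious attempt:\n{kernel}"
                  f"\nExecution feedback:\n{feedback}")
    return trajectory
```

In RL training, each `Turn` in the trajectory would then be scored and credit-assigned; the eval script can reuse the same loop without the training step.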

@nataliakokoromyti nataliakokoromyti force-pushed the add-multi-turn branch 5 times, most recently from 7c139d6 to cc20524 on March 26, 2026 at 11:43
Adds a second evaluator backend (an out-of-process evaluator on the local GPU)
alongside the existing Modal cloud backend, plus a dispatch layer that
selects between them via eval_config.evaluator_backend ("modal" | "local").

The local backend exists because KernelBench compiles every kernel through
torch.utils.cpp_extension.load_inline, which leaks memory across compiles
and OOMs an MI350X after a few hundred problems. Each kernel now runs in
a fresh `python -c` worker so the leak dies with the worker, mirroring the
trick used by KernelBench's eval_hip.py.
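A minimal sketch of that isolation trick, assuming a worker that prints a JSON result on stdout (the helper name `run_isolated`, the result schema, and the worker protocol are illustrative, not the PR's actual code):

```python
import json
import os
import subprocess
import sys
import tempfile


def run_isolated(worker_src: str, timeout: float = 300.0) -> dict:
    """Run evaluation code in a fresh `python -c` worker so any memory
    leaked by inline compilation dies with the process (sketch)."""
    env = dict(os.environ)
    # A per-process extensions dir means stale lock files left behind by a
    # killed predecessor can never block this worker's build.
    env["TORCH_EXTENSIONS_DIR"] = tempfile.mkdtemp(prefix="torch_ext_")
    proc = subprocess.run(
        [sys.executable, "-c", worker_src],
        env=env, capture_output=True, text=True, timeout=timeout,
    )
    if proc.returncode != 0:
        # Worker crashed, was OOM-killed, or raised: surface its stderr.
        return {"ok": False, "log": proc.stderr}
    return {"ok": True, **json.loads(proc.stdout)}
```

The key property is that every evaluation gets a brand-new interpreter and a private `TORCH_EXTENSIONS_DIR`, so neither leaked allocations nor build locks survive across problems.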

Changes:
- Add EvalConfig.evaluator_backend / gpu_arch / local_timeout fields.
- New kernelbench_tinker.local.evaluator with subprocess-isolated worker
  that calls set_gpu_arch() before importing torch and uses a per-process
  TORCH_EXTENSIONS_DIR to dodge stale lock files from killed predecessors.
- New evaluator_dispatch module so call sites stay agnostic; both
  ModalKernelEvaluator and LocalKernelEvaluator expose the same async API.
- Wire dispatch through KernelBenchDatasetBuilder, kernelbench_client, and
  the eval_kernel_rl script. Add multi-level iteration to eval_kernel_rl.
- Add rl_kernelbench_hip.yaml: backend=hip, evaluator_backend=local,
  gpu_arch=[gfx950], levels=[1,2,3], multiturn enabled, dataset_src=local.
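The dispatch layer might look like the sketch below. The class and field names follow the PR description (`EvalConfig.evaluator_backend` / `gpu_arch` / `local_timeout`, `ModalKernelEvaluator`, `LocalKernelEvaluator`), but the constructor signatures and stub bodies are assumptions:

```python
from dataclasses import dataclass, field


# Stand-ins for the two backends; the real classes expose the same
# async evaluation API so call sites stay agnostic.
class ModalKernelEvaluator:
    pass


class LocalKernelEvaluator:
    def __init__(self, gpu_arch: list, timeout: float):
        self.gpu_arch = gpu_arch
        self.timeout = timeout


@dataclass
class EvalConfig:
    evaluator_backend: str = "modal"   # "modal" | "local"
    gpu_arch: list = field(default_factory=list)
    local_timeout: float = 300.0


def make_evaluator(cfg: EvalConfig):
    """Pick a backend from config so callers never branch themselves."""
    if cfg.evaluator_backend == "modal":
        return ModalKernelEvaluator()
    if cfg.evaluator_backend == "local":
        return LocalKernelEvaluator(cfg.gpu_arch, cfg.local_timeout)
    raise ValueError(f"unknown evaluator_backend: {cfg.evaluator_backend!r}")
```

Because Modal is the default, existing configs that never set `evaluator_backend` keep their current behavior.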

The Modal backend remains the default, so existing CUDA configs are
unaffected.
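The new HIP config might look like the fragment below; only the values named above are from the PR, and the key nesting is an assumption:

```yaml
# rl_kernelbench_hip.yaml (sketch; exact key layout may differ)
backend: hip
eval_config:
  evaluator_backend: local
  gpu_arch: [gfx950]
  local_timeout: 300
levels: [1, 2, 3]
multiturn: true
dataset_src: local
```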
