
Share training code between backends #626

Open
Kovbo wants to merge 15 commits into main from feat/shared-training-code

Conversation

@Kovbo (Collaborator) commented Mar 21, 2026

We have a lot of similar training code implemented differently across the local and serverless backends. This makes it hard to maintain.
We should keep the training code in ART and import it into the serverless backend.
This PR implements that change.

  • Mental Model

    There are 3 layers:

    • ART backend entrypoints: turn user-facing train() / train_sft() calls into a concrete training request.
    • Job handoff/orchestration: move that request into a worker process and stream metrics back.
    • Unsloth/Megatron worker execution: actually run forward/backward/optimizer/save loops.

    The refactor mostly changed layer 3, and a small part of layer 1.
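The three layers can be sketched roughly as follows. All names here are illustrative stand-ins, not the actual ART API:

```python
import json

def train(model, groups):
    """Layer 1 (backend entrypoint): turn a user-facing train() call
    into a concrete training request."""
    return {"kind": "rl", "model": model, "groups": groups}

def hand_off(request, job_dir):
    """Layer 2 (job handoff): serialize the request where a worker
    process will find it; the real system also streams metrics back."""
    job_file = f"{job_dir}/job.json"
    return job_file, json.dumps(request)

def worker_execute(request):
    """Layer 3 (worker execution): stand-in for the actual
    forward/backward/optimizer/save loop."""
    return {"loss": 0.0, "steps": len(request["groups"])}
```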

    Before, the same Megatron execution logic existed in two places:

    • ART’s local Megatron worker had its own inline runtime setup and RL loop in what is now .venv/lib/python3.12/site-packages/art/megatron/train.py:1.
    • The serverless Megatron worker had its own inline runtime setup plus RL and SFT loops in what is now megatron/train.py:1.

    Those two scripts both did the same kinds of work:

    • set CUDA / Triton / inductor env
    • build the Megatron provider/model/optimizer
    • patch GPT preprocess
    • load LoRA + optimizer state
    • run the RL step loop
    • save LoRA + optimizer shards
    • write metrics to a log file
    • clean up the job directory

    The serverless worker also had its own separate SFT loop.
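As a concrete illustration of the first item, both scripts carried the same environment setup inline; that kind of block is exactly what gets hoisted into a shared module. The specific variables and function name below are illustrative, not the actual ART code:

```python
import os

def setup_runtime_env(cache_dir: str) -> None:
    """Shared CUDA/Triton/inductor env setup that previously existed
    inline in both the local and serverless workers."""
    # setdefault keeps any explicit user overrides intact
    os.environ.setdefault("CUDA_DEVICE_MAX_CONNECTIONS", "1")
    os.environ.setdefault("TRITON_CACHE_DIR", f"{cache_dir}/triton")
    os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", f"{cache_dir}/inductor")
```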

    There was also smaller duplication at the ART backend layer:

    • .venv/lib/python3.12/site-packages/art/local/backend.py:501
    • .venv/lib/python3.12/site-packages/art/serverless/backend.py:193

    Both of those built RL config objects and aggregated training metrics separately.

    So before, changing RL training logic meant touching multiple files.

    How It Works Now

    Now the shared Megatron execution lives in one ART module:

    • .venv/lib/python3.12/site-packages/art/megatron/shared.py:58
    • .venv/lib/python3.12/site-packages/art/megatron/shared.py:113
    • .venv/lib/python3.12/site-packages/art/megatron/shared.py:230

    That module owns the actual training process:

    • model/optimizer context creation
    • LoRA + optimizer load/save
    • RL loop
    • SFT loop
    • per-rank gradient reduction
    • metrics logging
    • job completion cleanup

    The wrappers are now thin:

    • ART local Megatron wrapper: .venv/lib/python3.12/site-packages/art/megatron/train.py:1
    • Repo serverless worker wrapper: megatron/train.py:1

    The important design choice is that offload/reload stayed outside the shared runner:

    • Local ART wrapper does reload_to_gpu(...), calls run_megatron_rl_job(...), then offload_to_cpu(...) in finally at .venv/lib/python3.12/site-packages/art/megatron/train.py:41.
    • Serverless wrapper just calls the shared ART runner directly at megatron/train.py:42.

    So the shared code is “training logic only”, and the local-only memory management stays local-only.
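A minimal sketch of that split, with stand-ins for the memory-management hooks and the shared runner (the structure matches the description above; the bodies are placeholders):

```python
events = []

def reload_to_gpu():
    """Local-only: bring model state back onto the GPU (stand-in)."""
    events.append("reload")

def offload_to_cpu():
    """Local-only: free GPU memory (stand-in)."""
    events.append("offload")

def run_megatron_rl_job(job):
    """Shared runner: training logic only (stand-in)."""
    events.append("train")
    return {"ok": True}

def run_local_rl_job(job):
    """Local wrapper: memory management wraps the shared runner,
    so offload happens even if training raises."""
    reload_to_gpu()
    try:
        return run_megatron_rl_job(job)
    finally:
        offload_to_cpu()
```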

    At the backend layer, the shared RL config/metrics glue now lives in .venv/lib/python3.12/site-packages/art/_backend_training.py:15.
    Both .venv/lib/python3.12/site-packages/art/local/backend.py:613 and .venv/lib/python3.12/site-packages/art/serverless/backend.py:260 use it.

    Current End-to-End Flows

    Local RL with ART + Megatron:

    • LocalBackend.train() builds config and delegates to _train_model() in .venv/lib/python3.12/site-packages/art/local/backend.py:501.
    • _train_model() packs trajectory groups to disk and calls the service in .venv/lib/python3.12/site-packages/art/local/backend.py:670.
    • MegatronService.train() pauses vLLM, writes a job JSON into /tmp/megatron_training_jobs, and tails the shared log file at .venv/lib/python3.12/site-packages/art/megatron/service.py:209.
    • The local Megatron worker .venv/lib/python3.12/site-packages/art/megatron/train.py:28 polls that directory, reloads GPU state, and calls .venv/lib/python3.12/site-packages/art/megatron/shared.py:113.
    • After completion, MegatronService merges shards, creates the next checkpoint, wakes vLLM, and registers the new LoRA alias at .venv/lib/python3.12/site-packages/art/megatron/service.py:271.

    Serverless RL:

    • ServerlessBackend.train() builds config and submits a training job through the API at .venv/lib/python3.12/site-packages/art/serverless/backend.py:193.
    • _train_model() calls client.training_jobs.create(...) and polls event streams at .venv/lib/python3.12/site-packages/art/serverless/backend.py:319 using the client defined at .venv/lib/python3.12/site-packages/art/serverless/client.py:183.
    • On the server side, the repo’s MegatronTrainer.train() writes a worker job file and tails the per-job log at trainers/megatron_trainer.py:97.
    • The repo worker megatron/train.py:26 polls the same job directory and dispatches to .venv/lib/python3.12/site-packages/art/megatron/shared.py:113.

    Serverless SFT:

    • The server workflow downloads the artifact and tokenizes it into SFTBatch objects in app/temporal/workflows/training_workflows.py:825.
    • The repo’s MegatronTrainer.train_sft() writes the tokenized batches to disk and writes an SFT job file at trainers/megatron_trainer.py:149.
    • The repo worker megatron/train.py:42 dispatches to .venv/lib/python3.12/site-packages/art/megatron/shared.py:230.
    • RL backend config/metric aggregation is shared in one place.
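The tokenize-then-ship step can be sketched like this; the SFTBatch fields and the -100 label-masking convention are illustrative assumptions, not the actual schema:

```python
from dataclasses import dataclass

@dataclass
class SFTBatch:
    input_ids: list[int]
    labels: list[int]

def tokenize_examples(examples, encode):
    """Turn raw (prompt, completion) pairs into SFTBatch objects.
    Labels mask prompt tokens with -100 so only completion tokens
    contribute to the loss (a common SFT convention)."""
    batches = []
    for prompt, completion in examples:
        p, c = encode(prompt), encode(completion)
        batches.append(SFTBatch(input_ids=p + c, labels=[-100] * len(p) + c))
    return batches
```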

@Kovbo Kovbo force-pushed the feat/shared-training-code branch from c1abe2e to a1b8efc on March 21, 2026 02:03
@Kovbo Kovbo requested a review from bradhilton March 24, 2026 20:32
@Kovbo Kovbo marked this pull request as ready for review March 24, 2026 20:59
@Kovbo Kovbo requested a review from FurtherAI March 25, 2026 18:20
@FurtherAI (Collaborator) left a comment

I will propose a merge with main after #619. We'll need to collect the changes I made (mostly to train.py) and settle on a unified API.

The rest of it looks good! I'll still test the actual Megatron training and make sure the merge preserves correctness and performance. You should extend the correctness tests to SFT, though.

@Kovbo Kovbo force-pushed the feat/shared-training-code branch from b19e94c to c2039fc on March 28, 2026 01:03
@Kovbo Kovbo requested a review from FurtherAI March 30, 2026 20:17
@FurtherAI (Collaborator) left a comment

  1. The PR now signals "all done" before offloading Megatron state. The sentinel is written in shared.py, while offload happens later in train.py. On main, offload happened first in train.py, and only then did rank 0 emit completion. That can reintroduce same-node wake/OOM races.
  2. Missing adapter checkpoints no longer fail hard. The new shared loader in shared.py silently resets LoRA params instead of raising; main hard-fails on a missing adapter.
  3. The log_path API cutover is still partial. The shared job schema exposes log_path in jobs.py, and shared execution uses it in shared.py, but the local controller still tails and deletes the default path in service.py.

All the good points found by codex. I'll add to 1 that some objects, such as the packed tensors, need to be deleted before the dirs can be removed, since they are memory-mapped and thus hold the files open. So all object deletion should happen before dir removal.
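Putting points 1 and the mmap note together, a safe teardown order can be sketched like this; finish_job and the sentinel file are hypothetical names, not the actual ART code:

```python
import mmap
import shutil
from pathlib import Path

def finish_job(job_dir: Path, open_objects, sentinel: Path) -> None:
    """Teardown order matters: release memory-mapped objects first (they
    hold the underlying files open), then remove the job directory, and
    only then write the completion signal, so a controller cannot wake up
    into a half-torn-down job."""
    for obj in open_objects:
        obj.close()              # drop the mmap/file handles
    shutil.rmtree(job_dir)       # safe now that nothing holds files open
    sentinel.write_text("done")  # completion signal goes last
```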

Correctness tests were passing, which is good.

@Kovbo Kovbo force-pushed the feat/shared-training-code branch from dbc9f8a to 3a679cb on March 30, 2026 22:52
@Kovbo Kovbo requested a review from FurtherAI March 31, 2026 01:02
@Kovbo (Collaborator, Author) commented Mar 31, 2026

Good catch.

I think everything has been updated now, and I also added some additional training loop refactoring.

My Codex also flagged a few other issues, but they do not appear to be regressions in this branch. I am not sure whether they are real issues, though:

  1. /art/megatron/train.py:355 still requires grad_accumulation_sequences % dp_world_size == 0, and the public config builders still leave it at the default 1 in /art/_backend_training.py:15. So DP>1 Megatron can break via the backend config, just not because of our sharing refactor.
  2. In /art/megatron/train.py:495-540, reduced_loss is computed as sum of per-rank averages, not a global token-weighted average. That makes DP>1 logging wrong.
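On point 2, the difference between the two reductions can be shown without any distributed setup: summing per-rank averages only equals the global average when every rank sees the same token count. A pure-Python sketch (function names are illustrative):

```python
def sum_of_rank_means(rank_losses, rank_tokens):
    """What the flagged code does: sum each rank's average loss."""
    return sum(loss / tokens for loss, tokens in zip(rank_losses, rank_tokens))

def token_weighted_mean(rank_losses, rank_tokens):
    """Global token-weighted average: total loss over total tokens,
    which is what the logged value should represent."""
    return sum(rank_losses) / sum(rank_tokens)
```

With two ranks holding summed losses [10.0, 30.0] over [10, 30] tokens, the first gives 2.0 while the true global average is 1.0.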

@FurtherAI (Collaborator) commented

  1. Yeah, let's set the default grad_accumulation_sequences to None and, if None, update it to dp_world_size here. The assert is intentional though, so it stays.
  2. As for this, finalize_model_grads in finalize_model_grads_extended correctly reduces this across ranks I believe. It would be good to drop a comment there though ("# num_tokens is reduced in place across ranks").
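The fix proposed in point 1 could look roughly like this sketch (function name is hypothetical; the divisibility assert stays, per the comment above):

```python
def resolve_grad_accumulation(grad_accumulation_sequences, dp_world_size):
    """Default to one accumulation sequence per data-parallel rank when
    unset; keep the intentional divisibility assert."""
    if grad_accumulation_sequences is None:
        grad_accumulation_sequences = dp_world_size
    assert grad_accumulation_sequences % dp_world_size == 0, (
        "grad_accumulation_sequences must be divisible by dp_world_size"
    )
    return grad_accumulation_sequences
```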

The previous issues seem to be solved. Looks good

Did you add SFT to the correctness tests? Or is that why supports_sft=False?

@FurtherAI (Collaborator) commented

Changes look good, I ran the correctness tests and they are passing.
Last thing is the plan for SFT correctness tests?

@Kovbo Kovbo force-pushed the feat/shared-training-code branch 2 times, most recently from 8096ed4 to f6cd445 on April 2, 2026 02:18