Conversation
Force-pushed from c1abe2e to a1b8efc
FurtherAI left a comment:
I will propose a merge with main after #619. We'll need to collect the changes I made to train.py mostly and settle on a unified API.
The rest of it looks good! I'll still test the actual Megatron training to make sure the merge preserves correctness and performance. You should extend the correctness tests to SFT, though.
Force-pushed from b19e94c to c2039fc
- The PR now signals "all done" before offloading Megatron state. The sentinel is written in shared.py, while offload happens later in train.py. On main, offload happened first in train.py, and only then did rank 0 emit completion in train.py. That can reintroduce same-node wake/OOM races.
- Missing adapter checkpoints no longer fail hard. The new shared loader in shared.py silently resets LoRA params instead of raising; main hard-fails on a missing adapter.
- The log_path API cutover is still partial. The shared job schema exposes log_path in jobs.py, and shared execution uses it in shared.py, but the local controller still tails and deletes the default path in service.py.
All good points found by Codex. I'll add to point 1 that some objects, such as the packed tensors, need to be deleted before the dirs can be removed, since they are memory-mapped and thus hold the files open. So all object deletion should happen before dir removal.
Correctness tests were passing, which is good.
Force-pushed from dbc9f8a to 3a679cb
Good catch. I think everything has been updated now, and I also added some additional training loop refactoring. My Codex also flagged a few other issues, but they do not appear to be regressions in this branch. I am not sure whether they are real issues, though:
The previous issues seem to be solved. Looks good. Did you add SFT to the correctness tests? Or is that why …
Changes look good; I ran the correctness tests and they are passing.
Force-pushed from 8096ed4 to f6cd445
We have a lot of similar training code implemented differently across the local and serverless backends, which makes it hard to maintain.
We should keep the training code in ART and import it into the serverless backend.
This PR implements that change.
Mental Model
There are 3 layers:
The refactor mostly changed layer 3, and a small part of layer 1.
Before, the same Megatron execution logic existed in two places:
train.py:1.
Those two scripts both did the same kinds of work:
The serverless worker also had its own separate SFT loop.
There was also smaller duplication at the ART backend layer:
Both of those built RL config objects and aggregated training metrics separately.
So before, changing RL training logic meant touching multiple files.
How It Works Now
Now the shared Megatron execution lives in one ART module:
That module owns the actual training process:
The wrappers are now thin:
The important design choice is that offload/reload stayed outside the shared runner:
packages/art/megatron/train.py:41.
So the shared code is “training logic only”, and the local-only memory management stays local-only.
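That split can be sketched as a thin wrapper: the shared ART runner owns training logic only, while offload/reload stays in the local wrapper around it. All four callables below are hypothetical stand-ins for the real modules, not the actual API:

```python
def run_local_training(job, load_model, offload_model, shared_train):
    """Illustrative sketch of the local wrapper around the shared runner.

    Memory management (load/offload) is local-only; the shared runner
    never sees it.
    """
    # Local-only concern: bring weights back onto the GPU.
    model = load_model()
    try:
        # Shared ART module: pure training logic, reused by both backends.
        return shared_train(model, job)
    finally:
        # Local-only concern: free GPU memory once training is done.
        offload_model(model)
```

A serverless wrapper would call `shared_train` directly, with no offload step, which is why keeping offload outside the shared runner keeps the shared code backend-agnostic.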
At the backend layer, the shared RL config/metrics glue now lives in .venv/lib/python3.12/site-packages/art/_backend_training.py:15.
Both .venv/lib/python3.12/site-packages/art/local/backend.py:613 and .venv/lib/python3.12/site-packages/art/serverless/backend.py:260 use it.
Current End-to-End Flows
Local RL with ART + Megatron:
python3.12/site-packages/art/megatron/service.py:209.
calls .venv/lib/python3.12/site-packages/art/megatron/shared.py:113.
python3.12/site-packages/art/megatron/service.py:271.
Serverless RL:
backend.py:193.
backend.py:319 using the client defined at .venv/lib/python3.12/site-packages/art/serverless/client.py:183.
megatron_trainer.py:97.
shared.py:113.
Serverless SFT:
The server workflow downloads the artifact and tokenizes it into SFTBatch objects in app/temporal/workflows/training_workflows.py:825.
The repo’s MegatronTrainer.train_sft() writes the tokenized batches to disk and writes an SFT job file at trainers/megatron_trainer.py:149.
The repo worker megatron/train.py:42 dispatches to .venv/lib/python3.12/site-packages/art/megatron/shared.py:230.
RL backend config/metric aggregation is shared in one place.
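As a rough illustration of what "shared in one place" buys, here is the shape of metric aggregation both backends can now reuse. The averaging scheme is illustrative only, not the actual implementation in _backend_training.py:

```python
def aggregate_metrics(step_metrics: list[dict]) -> dict:
    """Illustrative sketch: average per-step training metrics by key.

    Each backend collects a list of per-step metric dicts; a single shared
    helper reduces them, so the logic exists once instead of twice.
    """
    totals: dict[str, float] = {}
    counts: dict[str, int] = {}
    for metrics in step_metrics:
        for name, value in metrics.items():
            totals[name] = totals.get(name, 0.0) + value
            counts[name] = counts.get(name, 0) + 1
    # Keys missing from some steps are averaged over the steps that
    # reported them.
    return {name: totals[name] / counts[name] for name in totals}
```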