
feat: Add dedicated merged mode to Megatron backend #636

Draft
vivekkalyan wants to merge 2 commits into feat/dedicated-mode-megatron from feat/merged-mode-megatron


Conversation


vivekkalyan (Collaborator) commented Apr 1, 2026

Summary

This adds dedicated merged mode to the Megatron backend.

In dedicated merged mode, ART keeps vLLM running on the inference GPU, trains Megatron on the trainer GPU, and updates inference weights in place through vLLM's native weight transfer APIs. This enables training models that lack LoRA support in vLLM, and faster inference when used with LocalBackend.
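The control flow described above can be sketched as follows. This is an illustrative outline only: the class and method names are hypothetical stand-ins for ART's actual trainer and vLLM interfaces, not the real API.

```python
# Illustrative sketch of one dedicated merged-mode step.
# All names here are hypothetical; only the control flow mirrors the PR:
# train on the trainer GPU, pause generation only for the weight swap,
# push merged weights into vLLM in place, then advance the served alias.

class StubTrainer:
    """Stands in for the Megatron trainer on the trainer GPU."""
    def train_one_step(self):
        pass  # Megatron forward/backward/optimizer step would run here

    def merged_hf_weights(self):
        # Would yield HF/vLLM-named tensors with LoRA deltas already merged.
        yield "model.layers.0.self_attn.q_proj.weight", [0.0]

class StubVLLM:
    """Stands in for the dedicated vLLM server on the inference GPU."""
    def __init__(self):
        self.paused = False
        self.weights = {}

    def pause_generation(self):
        self.paused = True

    def update_weight(self, name, tensor):
        self.weights[name] = tensor  # in-place update via weight transfer

    def resume_generation(self):
        self.paused = False

def merged_train_step(trainer, vllm, step: int) -> str:
    trainer.train_one_step()        # training proceeds independently
    vllm.pause_generation()         # generation pauses only for the swap
    for name, tensor in trainer.merged_hf_weights():
        vllm.update_weight(name, tensor)
    vllm.resume_generation()
    return f"model@{step + 1}"      # served alias advances per merged sync
```

The key property is that vLLM keeps serving between syncs; generation is paused only for the duration of the in-place weight swap.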

What this enables

  • Use Megatron with dedicated trainer and inference GPUs in rollout_weights_mode="merged"
  • Keep inference and training decoupled, pausing generation only during the merged weight swap
  • Advance the served model alias step by step without restarting the dedicated vLLM server
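A minimal configuration sketch for this setup, written as a plain dict: rollout_weights_mode="merged", the trainer/inference GPU split, and the base model name come from this PR, but the surrounding key names are hypothetical, not ART's real config schema.

```python
# Hypothetical config sketch; key names are illustrative, values come from
# this PR's description and validation run (trainer on GPU 0, inference on GPU 1).
dedicated_merged_config = {
    "rollout_weights_mode": "merged",  # named in this PR
    "trainer_gpu": 0,                  # Megatron trains here
    "inference_gpu": 1,                # dedicated vLLM serves here
    "base_model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
}
```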

Implementation

  • Add a Megatron service-to-trainer job protocol for:
    • initial merged sync
    • LoRA training jobs
    • merged training jobs
  • Start dedicated vLLM in merged mode with native weight transfer enabled
  • Initialize NCCL weight transfer between the Megatron trainer and vLLM
  • Convert live Megatron weights through Megatron Bridge into HF/vLLM checkpoint names, merge ART LoRA deltas into those tensors, and send them directly to vLLM
  • Update the served model name after each successful merged sync
  • Reuse a shared TCP port helper instead of depending on a backend-specific implementation
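The "merge ART LoRA deltas into those tensors" step follows the standard LoRA recipe, W_merged = W + (alpha / r) * B @ A. A toy pure-Python sketch of that arithmetic (the real code operates on torch tensors keyed by HF/vLLM checkpoint names):

```python
# Toy sketch of merging a LoRA delta into a base weight before sending it
# to vLLM. Plain nested lists keep the arithmetic visible; real tensors are
# torch tensors. Scaling by alpha / r follows the standard LoRA formulation.
def merge_lora(W, A, B, alpha: float, r: int):
    """Return W + (alpha / r) * B @ A for 2-D lists.

    W: (out_dim, in_dim) base weight
    A: (r, in_dim) LoRA down-projection
    B: (out_dim, r) LoRA up-projection
    """
    scale = alpha / r
    out_dim, in_dim = len(W), len(W[0])
    merged = [row[:] for row in W]  # copy so the base weight is untouched
    for i in range(out_dim):
        for j in range(in_dim):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged
```

Merging the delta once per sync means vLLM only ever sees plain dense weights, which is what makes this mode work for models without LoRA support in vLLM.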

Validation

  • Unit coverage:
    • tests/unit/test_megatron_dedicated.py
  • Fresh-cluster 2-GPU smoke:
    • trainer on GPU 0
    • inference on GPU 1
    • base model Qwen/Qwen3-30B-A3B-Instruct-2507
    • dedicated merged mode completed two real train steps and advanced the served model from @0 to @2
