[train] VLM SFT support on Megatron backend (Qwen3-VL) #1752

Draft
s-chundi wants to merge 1 commit into NovaSky-AI:main from s-chundi:vlm-sft-megatron

s-chundi commented Jun 4, 2026

What

Adds vision-language-model (VLM) support to the SFT trainer on the Megatron backend, starting with Qwen3-VL. The Megatron path previously supported text-only SFT; the FSDP path already had VLM support. This PR brings the Megatron side up to parity for SFT by reusing the existing VLM data infrastructure (TensorList, Experience.pixel_values/image_grid_thw, replay-buffer image fields).

The vision tower runs on the first pipeline stage; pixel_values / image_grid_thw flow through the same TrainingInputBatch -> Experience path the FSDP backend already uses, and are passed into GPTModel.forward as Qwen-style kwargs.

Changes

  • skyrl/utils/tok.py: check_is_vlm() and get_processor() helpers.
  • skyrl/train/sft_trainer.py: detect VLMs in setup(), build an HF processor, thread it through tokenize_chat_example / _tokenize_chat_last_assistant (now return_dict=True with a small _unbatch helper), emit image tensors as a TensorList in collate_sft_batch, force sequential tokenization for VLMs (the processor does not round-trip through the spawn worker pool), and disable sequence packing / microbatch-padding removal.
  • megatron_model_wrapper.py: is_vlm flag, VLM constraint asserts, and pixel_values/image_grid_thw passthrough in both forward and forward_backward_mini_batch.
  • megatron_worker.py: VLM detection off the HF config, plumb is_vlm into the wrapper, and forward image tensors through the micro-batch builders (policy forward/train + ref forward).
  • tests: test_megatron_vlm_init.py (GPU: Megatron VLM forward vs HF reference, and a 5-step SFTTrainer train loop asserting loss decreases) and VLM cases in test_sft_tokenization.py (CPU: tokenization + collation).
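
For illustration, the "VLM detection off the HF config" step could look roughly like the sketch below. This is not the PR's code: it assumes a VLM can be recognized by a non-None vision_config attribute on the HF config (Qwen3-VL configs carry one, but the actual check_is_vlm() may use a different rule), and looks_like_vlm is a hypothetical name.

```python
from types import SimpleNamespace


def looks_like_vlm(hf_config) -> bool:
    """Hypothetical stand-in for check_is_vlm().

    Assumption: a config describes a VLM if it exposes a non-None
    `vision_config` sub-config. The real helper may differ.
    """
    return getattr(hf_config, "vision_config", None) is not None


# Stand-in config objects; real code would pass an HF config instead.
text_only_cfg = SimpleNamespace(model_type="qwen3")
vlm_cfg = SimpleNamespace(model_type="qwen3_vl", vision_config=SimpleNamespace())
```

A flag derived this way would then be handed to the model wrapper as is_vlm, as the megatron_worker.py bullet describes.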

(a) Constraints carried over from the FSDP VLM path

3D RoPE and multimodal token positions mean the following are unsupported and asserted/forced off:

  • sample / microbatch packing (remove_microbatch_padding, use_sequence_packing)
  • context parallelism (CP)
  • sequence parallelism (SP)
  • mixed text+image microbatches (every sample in a batch is assumed to carry images)
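
The constraints listed above could be enforced with checks along these lines. This is a minimal sketch under stated assumptions: ParallelSettings, check_vlm_constraints, and check_homogeneous_images are hypothetical names, and only use_sequence_packing and remove_microbatch_padding are flag names taken from this PR; the other fields are illustrative.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ParallelSettings:
    """Hypothetical settings container; not the PR's config class."""
    use_sequence_packing: bool = False
    remove_microbatch_padding: bool = False
    context_parallel_size: int = 1   # illustrative field name
    sequence_parallel: bool = False  # illustrative field name


def check_vlm_constraints(is_vlm: bool, s: ParallelSettings) -> None:
    """Assert the unsupported features are off when training a VLM."""
    if not is_vlm:
        return  # text-only models are not restricted by these checks
    assert not s.use_sequence_packing, "sequence packing is unsupported for VLMs"
    assert not s.remove_microbatch_padding, "padding removal is unsupported for VLMs"
    assert s.context_parallel_size == 1, "context parallelism is unsupported for VLMs"
    assert not s.sequence_parallel, "sequence parallelism is unsupported for VLMs"


def check_homogeneous_images(num_images_per_sample: Sequence[int]) -> None:
    """Reject microbatches that mix image and text-only samples."""
    has_image = [n > 0 for n in num_images_per_sample]
    assert all(has_image) or not any(has_image), "mixed text+image microbatch"
```

Failing fast at setup time keeps an unsupported combination from surfacing later as a shape or position-id mismatch inside the forward pass.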

(b) Parity / correctness

test_megatron_vlm_forward checks the Megatron VLM forward against a plain HuggingFace AutoModelForImageTextToText forward over the same tokens + images (finite, aligned log-probs). test_vlm_train runs 5 SFT steps and asserts the loss decreases.
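
The two assertions described above can be sketched in plain Python as follows. This is an illustration, not the test code from the PR: logprobs_match and loss_decreased are hypothetical helpers, the tolerance atol=1e-2 is an assumed value, and comparing the first and last loss is an assumed reading of "loss decreases".

```python
import math
from typing import Sequence


def logprobs_match(
    megatron_lp: Sequence[float],
    reference_lp: Sequence[float],
    atol: float = 1e-2,  # assumed tolerance, not taken from the PR
) -> bool:
    """True if both sequences are finite, equal in length, and close per token."""
    if len(megatron_lp) != len(reference_lp):
        return False
    return all(
        math.isfinite(a) and math.isfinite(b) and abs(a - b) <= atol
        for a, b in zip(megatron_lp, reference_lp)
    )


def loss_decreased(losses: Sequence[float]) -> bool:
    """True if the final loss is strictly below the first (assumed criterion)."""
    return len(losses) >= 2 and losses[-1] < losses[0]
```

In a real GPU test the inputs would be per-token log-probs gathered from the Megatron and HuggingFace forwards over the same tokens and images, and the per-step losses recorded from the train loop.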

(c) Parallelism combos exercised in tests

  • TP=2, PP=1 (forward parity)
  • TP=1, PP=2 (5-step train loop)

(d) Not supported

CP, SP, sample/microbatch packing, mixed text+image batches, MoE+VLM.

Adds vision-language-model support to the Megatron SFT path, mirroring the
existing FSDP VLM data plumbing. The vision tower runs on the first pipeline
stage; image tensors (pixel_values / image_grid_thw) flow through the same
TrainingInputBatch -> Experience path the FSDP backend already uses.

Changes:
- skyrl/utils/tok.py: add check_is_vlm and get_processor helpers.
- skyrl/train/sft_trainer.py: detect VLMs in setup(), build an HF processor,
  thread it through tokenize_chat_example / _tokenize_chat_last_assistant,
  emit image tensors as a TensorList in collate_sft_batch, force sequential
  tokenization for VLMs, and disable sequence packing / microbatch padding
  removal (unsupported for VLMs).
- megatron_model_wrapper.py: is_vlm flag, VLM constraint asserts (no packing,
  CP, or SP), and pixel_values/image_grid_thw passthrough in both forward and
  forward_backward_mini_batch.
- megatron_worker.py: VLM detection off the HF config, plumb is_vlm into the
  wrapper, and forward image tensors through the micro-batch builders.
- tests: GPU init/forward + 5-step train test (test_megatron_vlm_init.py),
  CPU tokenization/collation cases (test_sft_tokenization.py).

Constraints carried over from the FSDP VLM path: no sample/microbatch packing,
no context parallelism, no sequence parallelism, and homogeneous (all-image)
batches. MoE+VLM is out of scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>