[train] VLM SFT support on Megatron backend (Qwen3-VL) by s-chundi · Pull Request #1752 · NovaSky-AI/SkyRL

s-chundi · 2026-06-04T21:09:45Z

What

Adds vision-language-model (VLM) support to the SFT trainer on the Megatron backend, starting with Qwen3-VL. The Megatron path previously supported text-only SFT; the FSDP path already had VLM support. This PR brings the Megatron side up to parity for SFT by reusing the existing VLM data infrastructure (TensorList, Experience.pixel_values/image_grid_thw, replay-buffer image fields).

The vision tower runs on the first pipeline stage; pixel_values / image_grid_thw flow through the same TrainingInputBatch → Experience path the FSDP backend already uses, and are passed into GPTModel.forward as Qwen-style kwargs.
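A minimal sketch of the kwarg passthrough described above, in plain Python. `build_forward_kwargs` and its argument layout are hypothetical names chosen for illustration; they are not the wrapper's actual signature, and the real forward call may take additional arguments.

```python
from typing import Any, Dict, Optional


def build_forward_kwargs(
    input_ids: Any,
    attention_mask: Any,
    pixel_values: Optional[Any] = None,
    image_grid_thw: Optional[Any] = None,
    is_vlm: bool = False,
) -> Dict[str, Any]:
    """Assemble keyword arguments for a model forward call (hypothetical helper).

    Image tensors are attached only when the model is flagged as a VLM,
    so the text-only path keeps its original set of arguments.
    """
    kwargs: Dict[str, Any] = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
    }
    if is_vlm:
        # Assumption: a VLM batch always carries both image fields.
        if pixel_values is None or image_grid_thw is None:
            raise ValueError("VLM forward needs pixel_values and image_grid_thw")
        kwargs["pixel_values"] = pixel_values
        kwargs["image_grid_thw"] = image_grid_thw
    return kwargs
```

Keeping the image fields out of the text-only kwargs means a text-only model never receives arguments it does not expect.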

Changes

skyrl/utils/tok.py — check_is_vlm() and get_processor() helpers.
skyrl/train/sft_trainer.py — detect VLMs in setup(), build an HF processor, thread it through tokenize_chat_example / _tokenize_chat_last_assistant (now return_dict=True with a small _unbatch helper), emit image tensors as a TensorList in collate_sft_batch, force sequential tokenization for VLMs (the processor does not round-trip through the spawn worker pool), and disable sequence packing / microbatch-padding removal.
megatron_model_wrapper.py — is_vlm flag, VLM constraint asserts, and pixel_values/image_grid_thw passthrough in both forward and forward_backward_mini_batch.
megatron_worker.py — VLM detection off the HF config, plumb is_vlm into the wrapper, and forward image tensors through the micro-batch builders (policy forward/train + ref forward).
tests — test_megatron_vlm_init.py (GPU: Megatron VLM forward vs HF reference, and a 5-step SFTTrainer train loop asserting loss decreases) and VLM cases in test_sft_tokenization.py (CPU: tokenization + collation).
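The collation change can be pictured with a small stand-in that uses plain Python lists instead of tensors. This is a sketch under stated assumptions, not the code in collate_sft_batch: `collate_sketch`, right-padding, and the pad id of 0 are all hypothetical. The point it illustrates is that token ids can be padded to a rectangle, while per-sample image data differs in its number of patch rows and is therefore kept as a list (the role TensorList plays in the PR).

```python
from typing import Any, Dict, List


def collate_sketch(examples: List[Dict[str, Any]], pad_id: int = 0) -> Dict[str, List[Any]]:
    """Pad token ids to a common length; keep image fields as per-sample lists."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch: Dict[str, List[Any]] = {
        "input_ids": [],
        "attention_mask": [],
        "pixel_values": [],
        "image_grid_thw": [],
    }
    for ex in examples:
        n = len(ex["input_ids"])
        pad = max_len - n
        # Right-padding is an assumption made for this sketch.
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * pad)
        batch["attention_mask"].append([1] * n + [0] * pad)
        # Image data is ragged across samples, so it is not stacked.
        batch["pixel_values"].append(ex["pixel_values"])
        batch["image_grid_thw"].append(ex["image_grid_thw"])
    return batch


# Two samples whose images produce different numbers of patch rows (4 vs 8).
examples = [
    {"input_ids": [1, 2, 3], "pixel_values": [[0.0] * 4 for _ in range(4)], "image_grid_thw": [[1, 2, 2]]},
    {"input_ids": [4, 5], "pixel_values": [[0.0] * 4 for _ in range(8)], "image_grid_thw": [[1, 2, 4]]},
]
batch = collate_sketch(examples)
```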

(a) Constraints carried over from the FSDP VLM path

3D RoPE and multimodal token positions mean the following are unsupported and asserted/forced off:

sample / microbatch packing (remove_microbatch_padding, use_sequence_packing)
context parallelism (CP)
sequence parallelism (SP)
mixed text+image microbatches (every sample in a batch is assumed to contain images)
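The constraints listed above could be checked along the following lines. This is an illustrative sketch only: `ParallelSettingsSketch`, its field names, and both helper functions are hypothetical and do not mirror the asserts in megatron_model_wrapper.py.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence


@dataclass
class ParallelSettingsSketch:
    """Hypothetical container for the settings a VLM run must leave disabled."""
    context_parallel_size: int = 1
    sequence_parallel: bool = False
    use_sequence_packing: bool = False
    remove_microbatch_padding: bool = False


def vlm_constraint_violations(cfg: ParallelSettingsSketch) -> List[str]:
    """Return a description of every setting that conflicts with VLM training."""
    violations: List[str] = []
    if cfg.context_parallel_size > 1:
        violations.append("context parallelism")
    if cfg.sequence_parallel:
        violations.append("sequence parallelism")
    if cfg.use_sequence_packing:
        violations.append("sequence packing")
    if cfg.remove_microbatch_padding:
        violations.append("microbatch padding removal")
    return violations


def batch_has_images_everywhere(pixel_values: Sequence[Optional[Any]]) -> bool:
    """True when every sample in the microbatch carries image data."""
    return all(p is not None for p in pixel_values)
```

Returning a list of violations rather than raising on the first one lets a caller report every conflicting setting at once.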

(b) Parity / correctness

test_megatron_vlm_forward compares the Megatron VLM forward pass against a plain HuggingFace AutoModelForImageTextToText forward pass over the same tokens and images, checking that the log-probs are finite and aligned. test_vlm_train runs 5 SFT steps and asserts that the loss decreases.
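The two checks can be sketched in plain Python as follows. The helper names, the tolerance, and the choice to compare the last loss against the first are assumptions made for illustration; the PR text does not state how the tests define either criterion.

```python
import math
from typing import Sequence


def all_finite(values: Sequence[float]) -> bool:
    """True when no value is NaN or infinite."""
    return all(math.isfinite(v) for v in values)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def logprobs_match(a: Sequence[float], b: Sequence[float], atol: float = 1e-2) -> bool:
    """Finite on both sides and within an absolute tolerance (atol is an assumed value)."""
    return all_finite(a) and all_finite(b) and max_abs_diff(a, b) <= atol


def loss_decreased(losses: Sequence[float]) -> bool:
    """Assumed criterion: the final step's loss is below the first step's loss."""
    return len(losses) >= 2 and losses[-1] < losses[0]
```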

(c) Parallelism combos exercised in tests

TP=2, PP=1 (forward parity)
TP=1, PP=2 (5-step train loop)

(d) Not supported

CP, SP, sample/microbatch packing, mixed text+image batches, MoE+VLM.

Adds vision-language-model support to the Megatron SFT path, mirroring the existing FSDP VLM data plumbing. The vision tower runs on the first pipeline stage; image tensors (pixel_values / image_grid_thw) flow through the same TrainingInputBatch -> Experience path the FSDP backend already uses.

Changes:
- skyrl/utils/tok.py: add check_is_vlm and get_processor helpers.
- skyrl/train/sft_trainer.py: detect VLMs in setup(), build an HF processor, thread it through tokenize_chat_example / _tokenize_chat_last_assistant, emit image tensors as a TensorList in collate_sft_batch, force sequential tokenization for VLMs, and disable sequence packing / microbatch padding removal (unsupported for VLMs).
- megatron_model_wrapper.py: is_vlm flag, VLM constraint asserts (no packing, CP, or SP), and pixel_values/image_grid_thw passthrough in both forward and forward_backward_mini_batch.
- megatron_worker.py: VLM detection off the HF config, plumb is_vlm into the wrapper, and forward image tensors through the micro-batch builders.
- tests: GPU init/forward + 5-step train test (test_megatron_vlm_init.py), CPU tokenization/collation cases (test_sft_tokenization.py).

Constraints carried over from the FSDP VLM path: no sample/microbatch packing, no context parallelism, no sequence parallelism, and homogeneous (all-image) batches. MoE+VLM is out of scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

s-chundi wants to merge 1 commit into NovaSky-AI:main from s-chundi:vlm-sft-megatron
