[train] VLM SFT support on Megatron backend (Qwen3-VL) #1752
Draft
s-chundi wants to merge 1 commit into
Adds vision-language-model support to the Megatron SFT path, mirroring the existing FSDP VLM data plumbing. The vision tower runs on the first pipeline stage; image tensors (pixel_values / image_grid_thw) flow through the same TrainingInputBatch -> Experience path the FSDP backend already uses.

Changes:
- skyrl/utils/tok.py: add check_is_vlm and get_processor helpers.
- skyrl/train/sft_trainer.py: detect VLMs in setup(), build an HF processor, thread it through tokenize_chat_example / _tokenize_chat_last_assistant, emit image tensors as a TensorList in collate_sft_batch, force sequential tokenization for VLMs, and disable sequence packing / microbatch padding removal (unsupported for VLMs).
- megatron_model_wrapper.py: is_vlm flag, VLM constraint asserts (no packing, CP, or SP), and pixel_values/image_grid_thw passthrough in both forward and forward_backward_mini_batch.
- megatron_worker.py: VLM detection off the HF config, plumb is_vlm into the wrapper, and forward image tensors through the micro-batch builders.
- tests: GPU init/forward + 5-step train test (test_megatron_vlm_init.py), CPU tokenization/collation cases (test_sft_tokenization.py).

Constraints carried over from the FSDP VLM path: no sample/microbatch packing, no context parallelism, no sequence parallelism, and homogeneous (all-image) batches. MoE+VLM is out of scope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Adds vision-language-model (VLM) support to the SFT trainer on the Megatron backend, starting with Qwen3-VL. The Megatron path previously supported text-only SFT; the FSDP path already had VLM support. This PR brings the Megatron side up to parity for SFT by reusing the existing VLM data infrastructure (TensorList, Experience.pixel_values / image_grid_thw, replay-buffer image fields).

The vision tower runs on the first pipeline stage; pixel_values / image_grid_thw flow through the same TrainingInputBatch → Experience path the FSDP backend already uses, and are passed into GPTModel.forward as Qwen-style kwargs.

Changes

- skyrl/utils/tok.py: check_is_vlm() and get_processor() helpers.
- skyrl/train/sft_trainer.py: detect VLMs in setup(), build an HF processor, thread it through tokenize_chat_example / _tokenize_chat_last_assistant (now return_dict=True with a small _unbatch helper), emit image tensors as a TensorList in collate_sft_batch, force sequential tokenization for VLMs (the processor does not round-trip through the spawn worker pool), and disable sequence packing / microbatch-padding removal.
- megatron_model_wrapper.py: is_vlm flag, VLM constraint asserts, and pixel_values / image_grid_thw passthrough in both forward and forward_backward_mini_batch.
- megatron_worker.py: VLM detection off the HF config, plumb is_vlm into the wrapper, and forward image tensors through the micro-batch builders (policy forward/train + ref forward).
- tests: test_megatron_vlm_init.py (GPU: Megatron VLM forward vs HF reference, and a 5-step SFTTrainer train loop asserting loss decreases) and VLM cases in test_sft_tokenization.py (CPU: tokenization + collation).

(a) Constraints carried over from the FSDP VLM path
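3D RoPE and multimodal token positions mean the following are unsupported and asserted/forced off:

- sample/microbatch packing (remove_microbatch_padding, use_sequence_packing)
- context parallelism (CP)
- sequence parallelism (SP)
- mixed text+image batches (batches must be homogeneous, all-image)

(b) Parity / correctness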
- test_megatron_vlm_forward checks the Megatron VLM forward against a plain HuggingFace AutoModelForImageTextToText forward over the same tokens + images (finite, aligned log-probs).
- test_vlm_train runs 5 SFT steps and asserts the loss decreases.

(c) Parallelism combos exercised in tests
(d) Not supported
CP, SP, sample/microbatch packing, mixed text+image batches, MoE+VLM.
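For reviewers, the unsupported combinations above can be pictured as a small pre-flight check. This is only a sketch under stated assumptions: VLMRunConfig, its field names other than use_sequence_packing and remove_microbatch_padding, and both helper functions are hypothetical and do not come from this PR. The PR itself uses asserts in megatron_model_wrapper.py; the sketch returns messages instead so that every violation is reported at once.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VLMRunConfig:
    """Hypothetical settings holder; not a class from this PR."""
    context_parallel_size: int = 1        # assumed name for the CP degree
    sequence_parallel: bool = False       # assumed name for the SP switch
    use_sequence_packing: bool = False
    remove_microbatch_padding: bool = False
    is_moe: bool = False                  # assumed name for the MoE marker


def vlm_constraint_violations(cfg: VLMRunConfig) -> List[str]:
    """Return one message per setting that is unsupported for a VLM run."""
    problems = []
    if cfg.context_parallel_size > 1:
        problems.append("context parallelism (CP) is not supported for VLMs")
    if cfg.sequence_parallel:
        problems.append("sequence parallelism (SP) is not supported for VLMs")
    if cfg.use_sequence_packing:
        problems.append("use_sequence_packing must be off for VLMs")
    if cfg.remove_microbatch_padding:
        problems.append("remove_microbatch_padding must be off for VLMs")
    if cfg.is_moe:
        problems.append("MoE+VLM is out of scope")
    return problems


def is_all_image_batch(images_per_sample: List[int]) -> bool:
    """True when every sample carries at least one image (no mixed batches)."""
    return len(images_per_sample) > 0 and all(n > 0 for n in images_per_sample)


# Example: a config that enables CP and packing reports two violations.
bad = VLMRunConfig(context_parallel_size=2, use_sequence_packing=True)
for message in vlm_constraint_violations(bad):
    print(message)
```

Returning a list rather than raising on the first failure is a design choice of the sketch, not of the PR: it lets a user fix all conflicting settings in one pass.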