feat: Abstract LLM/VLM forward-backward step by HuiyingLi · Pull Request #2228 · NVIDIA-NeMo/Automodel

HuiyingLi · 2026-05-13T19:15:48Z

Summary

Add a shared forward/backward helper for LLM and VLM fine-tuning paths.
Route LLM and VLM recipes through the shared helper while keeping recipe-specific hooks for CP prep, FP8 context, and VLM PP media staging.
Enable VLM pipeline-parallel validation by preparing validation dataloader media chunks for PP and running validation through schedule.eval.

Why

The LLM and VLM recipes had largely duplicated forward/backward control flow. VLM PP validation was previously skipped because validation used a direct model-forward path and did not prepare VLM media tensors for PP microbatches.

Validation

python -m py_compile nemo_automodel/components/training/forward_backward.py nemo_automodel/recipes/llm/train_ft.py nemo_automodel/recipes/vlm/finetune.py
python -m ruff check nemo_automodel/components/training/forward_backward.py nemo_automodel/recipes/llm/train_ft.py nemo_automodel/recipes/vlm/finetune.py
git diff --check
pytest -q tests/unit_tests/recipes/test_train_ft.py
pytest -q tests/unit_tests/recipes/test_finetune_vlm_helpers.py
Local Qwen3 VL MoE 30B EP4/PP2 smoke run with validation:
- train: loss 2.3266, num_label_tokens 930
- val: loss 2.2712, num_label_tokens 941

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

copy-pr-bot · 2026-05-13T19:15:52Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa · 2026-05-13T20:23:23Z

+    distributed_config: Any,
+    loss_fn: Callable[..., torch.Tensor] | None,
+    calculate_loss_fn: Callable[..., torch.Tensor],
+    loss_buffer: list[torch.Tensor],


I would let the caller handle this, and instead return the loss to the caller.
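
A minimal sketch of what this suggestion could look like, using plain floats in place of tensors. Only `loss_buffer` comes from the diff; `step_returning_loss`, `compute_loss`, and `run_microbatches` are hypothetical names used for illustration, not code from this PR.

```python
from typing import Callable, List, Sequence


def step_returning_loss(
    batch: Sequence[float],
    compute_loss: Callable[[Sequence[float]], float],
) -> float:
    """Hypothetical helper: compute the loss and return it.

    It takes no buffer argument, so it has no side effects on caller state.
    """
    return compute_loss(batch)


def run_microbatches(batches: List[List[float]]) -> List[float]:
    """Hypothetical caller: owns the buffer and decides how to accumulate."""
    loss_buffer: List[float] = []
    for batch in batches:
        # Stand-in loss: the mean of the batch values.
        loss = step_returning_loss(batch, lambda b: sum(b) / len(b))
        loss_buffer.append(loss)
    return loss_buffer
```

With this shape the helper's signature loses one parameter, and the recipe keeps control over where losses are stored and reduced.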

akoumpa · 2026-05-13T20:23:39Z

+    loss_fn: Callable[..., torch.Tensor] | None,
+    calculate_loss_fn: Callable[..., torch.Tensor],


can we consolidate the two, loss_fn and calculate_loss_fn?
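
One possible way to consolidate them, sketched with plain floats instead of tensors: bind `loss_fn` into `calculate_loss_fn` once with `functools.partial`, and pass the helper a single callable. This assumes `calculate_loss_fn` accepts the loss function as its first argument, which the diff does not confirm. The signatures below and the name `step_with_single_loss_callable` are hypothetical, and the sketch does not cover the case where `loss_fn` is `None`.

```python
from functools import partial
from typing import Callable, Sequence


def mean_abs_error(logits: Sequence[float], labels: Sequence[float]) -> float:
    """Stand-in for loss_fn."""
    return sum(abs(a - b) for a, b in zip(logits, labels)) / len(labels)


def calculate_loss(
    loss_fn: Callable[[Sequence[float], Sequence[float]], float],
    *,
    logits: Sequence[float],
    labels: Sequence[float],
) -> float:
    """Stand-in for calculate_loss_fn (assumed to take loss_fn first)."""
    return loss_fn(logits, labels)


def step_with_single_loss_callable(
    logits: Sequence[float],
    labels: Sequence[float],
    compute_loss: Callable[..., float],
) -> float:
    """Hypothetical helper that needs only one loss-related parameter."""
    return compute_loss(logits=logits, labels=labels)


# The recipe binds loss_fn once and hands over a single callable.
compute_loss = partial(calculate_loss, mean_abs_error)
```

The recipe would then call `step_with_single_loss_callable(logits, labels, compute_loss)`, and the helper never needs to know which loss function is in use.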

akoumpa · 2026-05-13T20:24:48Z

+    pp_enabled: bool,
+    pp: Any | None,


Is there a strong use case for pp_enabled? If not, I would do something like pp_enabled = (pp is not None) inside forward_backward_step's body.

akoumpa · 2026-05-13T20:26:52Z

+    is_train: bool,
+    pp_enabled: bool,
+    pp: Any | None,
+    dp_group_size: int,


can we derive this from device_mesh by placing a convention on mesh naming?
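
A rough sketch of the idea, assuming "this" refers to `dp_group_size` and that data-parallel mesh dimensions are named `dp_replicate` and `dp_shard`. That naming convention is an assumption for illustration, not something the diff confirms. The helper only relies on a `mesh_dim_names` attribute and a `size(mesh_dim)` method, which `torch.distributed.device_mesh.DeviceMesh` provides, but it is exercised here only against a stand-in class (`FakeMesh`, hypothetical).

```python
from math import prod
from typing import Dict, Optional, Tuple

# Assumed convention: these dimension names denote data parallelism.
DP_DIM_NAMES: Tuple[str, ...] = ("dp_replicate", "dp_shard")


def dp_group_size_from_mesh(mesh, dp_dim_names: Tuple[str, ...] = DP_DIM_NAMES) -> int:
    """Multiply the sizes of all mesh dimensions whose names mark them as DP.

    Returns 1 when the mesh has no named DP dimension.
    """
    names = mesh.mesh_dim_names or ()
    return prod(mesh.size(i) for i, name in enumerate(names) if name in dp_dim_names)


class FakeMesh:
    """Stand-in with the two members the helper uses; not a real DeviceMesh."""

    def __init__(self, dims: Dict[str, int]) -> None:
        self.mesh_dim_names: Optional[Tuple[str, ...]] = tuple(dims)
        self._sizes = tuple(dims.values())

    def size(self, mesh_dim: Optional[int] = None) -> int:
        if mesh_dim is None:
            return prod(self._sizes)
        return self._sizes[mesh_dim]
```

For example, `dp_group_size_from_mesh(FakeMesh({"pp": 2, "dp_shard": 4, "tp": 2}))` evaluates to 4, so the caller would no longer need to pass `dp_group_size` explicitly.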

akoumpa · 2026-05-13T20:28:21Z

+    return value
+
+
+def forward_backward_step(


I feel it is challenging to find the right abstraction for this function, because it tries to abstract LLM/VLM + train/eval + PP/non-PP + CP + FP8 + FSDP sync + loss calculation + modality hooks. I'm wondering what ways could be explored to simplify it without losing critical functionality.

HuiyingLi added 2 commits May 13, 2026 11:23

Abstract LLM and VLM forward backward step

163f4b3

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

Enable VLM PP validation

86b3cdb

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

HuiyingLi changed the title ~~Abstract LLM/VLM forward-backward step~~ feat: Abstract LLM/VLM forward-backward step May 13, 2026

Remove obsolete PP validation skip flag

6fcdbcd

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

akoumpa reviewed May 13, 2026

Simplify forward backward helper API

198dfc1

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

HuiyingLi wants to merge 4 commits into main from huiyingl/abstract-forward-backward-step

2 participants