fix(checkpoint): harden consolidated safetensors export by yuhezhang-ai · Pull Request #2289 · NVIDIA-NeMo/Automodel

yuhezhang-ai · 2026-05-21T16:26:13Z

What does this PR do?

This PR has two parts.

First, it fixes bugs in HF safetensors consolidation. Non-standard upstream shard filenames no longer collapse into one output file, and when the original HF shard metadata is unavailable, AutoModel now falls back to deterministic size-based shard assignment.
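
To illustrate the filename issue, here is a minimal sketch of tolerant shard-index parsing. It is not the code from this PR: the regex, the helper names `parse_shard_index` and `count_output_files`, and the non-standard example filename are all assumptions made for illustration.

```python
from __future__ import annotations

import re

# Assumed pattern: look for "<index>-of-<total>" anywhere in the filename
# instead of requiring the name to start with "model-".
_SHARD_RE = re.compile(r"(\d+)-of-(\d+)")


def parse_shard_index(filename: str) -> tuple[int, int] | None:
    """Return (index, total) parsed from a shard filename, or None if absent."""
    match = _SHARD_RE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def count_output_files(weight_map: dict[str, str]) -> int:
    """Count the distinct output shards named by an HF-style weight map."""
    shards = set()
    for filename in weight_map.values():
        parsed = parse_shard_index(filename)
        # Names that cannot be parsed stay distinct by filename rather than
        # all defaulting to the same shard, which would collapse the output.
        shards.add(parsed[0] if parsed else filename)
    return len(shards)


# Hypothetical non-standard name, used only as an example input.
print(parse_shard_index("model.safetensors-00002-of-00004.safetensors"))
```

The point of the sketch is that a parser which only recognizes one naming scheme can map every tensor to the same shard, while searching for the index pattern keeps the original file count.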

Second, it changes the recommended export workflow. Instead of writing consolidated HF weights inline during every checkpoint save, docs and example configs now prefer save_consolidated: false and running the generated model/consolidate.sh helper after training. The helper is CPU-friendly, supports optional parallelism, and calls the packaged nemo_automodel.tools.offline_hf_consolidation module; the old standalone script path remains as a compatibility wrapper.

This PR also adds warnings for inline consolidation and optional TARGET_DTYPE / --target-dtype support for uncommon dtype conversions during offline export.

Changelog

Fixes:

Fix consolidated shard index parsing so non-standard upstream filenames, including Qwen3.5-style safetensors names, preserve the expected number of output files.
Add a deterministic size-based fallback for output shard assignment when the original HF safetensors index/path is unavailable.
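
The size-based fallback can be pictured as a greedy fill over tensors in a fixed order. The sketch below is one possible approach under stated assumptions, not the PR's implementation: the function name `assign_shards_by_size`, the sort-by-name ordering, and the `max_shard_bytes` parameter are hypothetical.

```python
from __future__ import annotations


def assign_shards_by_size(
    tensor_sizes: dict[str, int], max_shard_bytes: int
) -> dict[str, int]:
    """Map each tensor name to a 1-based shard index using a greedy fill."""
    assignment: dict[str, int] = {}
    shard, used = 1, 0
    # Sorting by name makes the result independent of dict insertion order,
    # so every process computes the same assignment.
    for name in sorted(tensor_sizes):
        size = tensor_sizes[name]
        # Open a new shard when this tensor would overflow a non-empty one.
        # A tensor larger than the limit still gets a shard of its own.
        if used > 0 and used + size > max_shard_bytes:
            shard += 1
            used = 0
        assignment[name] = shard
        used += size
    return assignment


sizes = {"b.weight": 60, "a.weight": 50, "c.weight": 30}
print(assign_shards_by_size(sizes, max_shard_bytes=100))
# {'a.weight': 1, 'b.weight': 2, 'c.weight': 2}
```

Because the assignment depends only on tensor names and sizes, rerunning the export on the same checkpoint yields the same shard layout even without the original HF index.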

Offline consolidation helper, docs, and configs:

Generate a conservative consolidate.sh helper beside non-consolidated safetensors checkpoints, with CPU-friendly defaults and optional multi-process/threaded execution.
Move the offline consolidation implementation into nemo_automodel.tools.offline_hf_consolidation, keeping the old script path as a compatibility wrapper.
Update checkpoint docs and example configs to recommend save_consolidated: false during training, followed by bash <checkpoint>/model/consolidate.sh after training.
Emit warning-level and info-level log messages when save_consolidated=True runs HF export inline during checkpoint saves, because inline export may waste GPU allocation time.
Add optional offline export dtype casting via TARGET_DTYPE / --target-dtype; non-floating tensors and tensors already in the target dtype are unchanged.
Add focused unit tests for shard parsing, fallback shard assignment, offline script generation, target dtype casting, and consolidation warnings.
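
The dtype-casting rule described above (only floating tensors are cast, and tensors already in the target dtype are left alone) can be sketched as a small decision function. This is an illustration only: it works on safetensors header dtype strings rather than real tensors, and the name `resolve_output_dtype` and the `FLOATING_DTYPES` set are assumptions, not part of the PR.

```python
from __future__ import annotations

# Floating-point dtype names as they appear in safetensors headers.
# The set is illustrative and deliberately not exhaustive.
FLOATING_DTYPES = {"F64", "F32", "F16", "BF16"}


def resolve_output_dtype(source_dtype: str, target_dtype: str | None) -> str:
    """Return the dtype a tensor should have after the optional cast."""
    # No target requested: keep every tensor as it is.
    if target_dtype is None:
        return source_dtype
    # Integer, boolean, and other non-floating tensors are never cast.
    if source_dtype not in FLOATING_DTYPES:
        return source_dtype
    # Floating tensors move to the target; this is a no-op when the
    # source already matches the target.
    return target_dtype


for dtype in ("F32", "BF16", "I64", "BOOL"):
    print(dtype, "->", resolve_output_dtype(dtype, "BF16"))
```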

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Fixes #2279

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

copy-pr-bot · 2026-05-21T16:26:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

akoumpa · 2026-05-21T16:36:12Z

+        if not self._should_write_hf_metadata() or self._should_write_consolidated_safetensors() or not _is_rank_0():
+            return
+
+        script_path = os.path.join(model_dir, "consolidate.sh")


Pretty nice @yuhezhang-ai

yuhezhang-ai · 2026-05-22T19:34:56Z

/ok to test cbbca66

yuhezhang-ai · 2026-05-22T21:57:00Z

/ok to test 2b78e77

jgerh

Completed a tech pubs review of .md files, performed copyediting, and verified formatting for consistency/alignment with style guidelines. Let me know what you think.

jgerh · 2026-05-22T22:44:53Z

 Llama is a family of decoder-only transformer models developed by Meta. The 1B variant is a compact model suitable for research and edge deployment, featuring RoPE positional embeddings, grouped-query attention (GQA), and SwiGLU activations.
 :::

 :::{dropdown} Accessing gated models


Suggested change

:::{dropdown} Accessing Gated Models

jgerh · 2026-05-22T22:47:29Z

 This guide uses **SQuAD v1.1** as a running example. Swap the dataset by changing `_target_` and the dataset arguments — see [Integrate Your Own Text Dataset](dataset.md) and [Dataset Overview](../dataset-overview.md).

 :::{dropdown} About SQuAD v1.1
 The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and a span answer. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning.


Verify: answer span is the correct noun phrase in SQuAD

Suggested change

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and an answer span. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning.

jgerh · 2026-05-22T22:47:54Z


 The `--nproc-per-node=8` flag specifies the number of GPUs per node. Adjust to your case (for a single GPU, omit the `--nproc-per-node` option).

 ### Invoke the Recipe Script Directly (advanced)


Suggested change

### Invoke the Recipe Script Directly (Advanced)

jgerh · 2026-05-22T22:49:04Z

 ```

 ### Sample Output
 Running the recipe using either the `automodel` app or by directly invoking the recipe script should produce


Suggested change

Running the recipe with the `automodel` app or by invoking the recipe script directly produces the following log:

jgerh · 2026-05-22T22:51:30Z

 ### Checkpoint Contents

-Checkpoints are saved in native Hugging Face format, so no conversion is required — they work directly with Transformers, PEFT, vLLM, lm-eval-harness, and other tools in the Hugging Face ecosystem. SFT and PEFT produce different checkpoint layouts. **SFT checkpoints** contain the full model weights at `model/consolidated/` — a single, self-contained Hugging Face model directory created by gathering distributed shards into one location — and can be loaded directly. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs) — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).
+Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated `model/consolidate.sh` helper; run the helper after training to create `model/consolidated/` for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set `save_consolidated: true` when you intentionally want inline HF export at every checkpoint save. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs), save those adapter files directly under `model/`, and do not use `model/consolidate.sh` — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).


Suggested change

Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated `model/consolidate.sh` helper; run the helper after training to create `model/consolidated/` for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set `save_consolidated: true` when you intentionally want inline HF export at every checkpoint save. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs) and are saved directly under `model/`. They do not use `model/consolidate.sh` — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).

jgerh · 2026-05-22T23:52:30Z

    --seq_length 1024
 ```

 ### 2. Model Verification


Suggested change

### Verify the Model

jgerh · 2026-05-22T23:54:44Z

 - **Per-layer hidden states** — cosine similarity and max absolute difference at the last token position
 - **Final logits** — cosine similarity, max absolute difference, and top-1 token agreement

 HF Transformers loads with `device_map="auto"`. NeMo loads via torchrun using the same config and code path as training (EP, FSDP, backend).


Suggested change

The HF Transformers library loads with `device_map="auto"`. NeMo loads through torchrun using the same config and code path as training (EP, FSDP, backend).

jgerh · 2026-05-22T23:54:55Z

  RESULT: PASS — all prompts above threshold 0.99
 ```

 ### 3. Training


Suggested change

### Train

jgerh · 2026-05-22T23:55:07Z

    --nodes 2 --partition batch
 ```

 ### 4. Eval


Suggested change

### Evaluate

jgerh · 2026-05-22T23:55:27Z

    --thinking
 ```

 ### 5. Inference Quality


Suggested change

### Analyze Inference Quality

fix(checkpoint): harden consolidated safetensors export

841e3b0

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

akoumpa reviewed May 21, 2026

yuhezhang-ai added 4 commits May 21, 2026 15:21

fix(checkpoint): default HF export to offline consolidation

46ff956

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

docs(checkpoint): clarify offline HF consolidation

f1dc654

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

fix(checkpoint): support dtype casts for HF consolidation

a45c5c0

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

fix(checkpoint): clarify large consolidation warning

cbbca66

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

yuhezhang-ai marked this pull request as ready for review May 22, 2026 19:34

yuhezhang-ai requested review from HuiyingLi, ZhiyuLi-Nvidia, adil-a, athitten, hemildesai, jgerh, pthombre, snowmanwwg and zyzhou5 as code owners May 22, 2026 19:34

yuhezhang-ai added 2 commits May 22, 2026 14:53

fix(checkpoint): preserve offline consolidation compatibility

b7b826e

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

Merge remote-tracking branch 'origin/main' into yuhez/fix/checkpoint-…

2b78e77

…consolidation-hardening Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

jgerh reviewed May 23, 2026
