fix(checkpoint): harden consolidated safetensors export #2289

Open
yuhezhang-ai wants to merge 7 commits into main from yuhez/fix/checkpoint-consolidation-hardening

@yuhezhang-ai yuhezhang-ai commented May 21, 2026

What does this PR do?

This PR has two parts.

First, it fixes bugs in HF safetensors consolidation. Non-standard upstream shard filenames no longer collapse into one output file, and when the original HF shard metadata is unavailable, AutoModel now falls back to deterministic size-based shard assignment.
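To make the size-based fallback concrete, here is a minimal sketch of one way a deterministic assignment could work. It is not taken from the PR: the function name `assign_shards_by_size` and the `max_shard_bytes` limit are hypothetical, and the actual implementation may order or pack tensors differently.

```python
from typing import Dict


def assign_shards_by_size(tensor_sizes: Dict[str, int], max_shard_bytes: int) -> Dict[str, int]:
    """Map each tensor name to a 1-based shard index (illustrative sketch only)."""
    if max_shard_bytes <= 0:
        raise ValueError("max_shard_bytes must be positive")

    assignment: Dict[str, int] = {}
    shard_index = 1
    bytes_in_shard = 0
    # Iterate in sorted name order so the result does not depend on dict insertion order.
    for name in sorted(tensor_sizes):
        size = tensor_sizes[name]
        # Start a new shard when this tensor would overflow a non-empty shard.
        if bytes_in_shard > 0 and bytes_in_shard + size > max_shard_bytes:
            shard_index += 1
            bytes_in_shard = 0
        assignment[name] = shard_index
        bytes_in_shard += size
    return assignment
```

Because the loop walks names in sorted order, the same set of tensors and sizes always yields the same mapping, which is the property the fallback needs when no upstream index is available. A tensor larger than the limit still gets a shard of its own in this sketch.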

Second, it changes the recommended export workflow. Instead of writing consolidated HF weights inline during every checkpoint save, docs and example configs now prefer save_consolidated: false and running the generated model/consolidate.sh helper after training. The helper is CPU-friendly, supports optional parallelism, and calls the packaged nemo_automodel.tools.offline_hf_consolidation module; the old standalone script path remains as a compatibility wrapper.

This PR also adds inline-consolidation warnings and adds optional TARGET_DTYPE / --target-dtype support for uncommon dtype conversion cases during offline export.
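For readers who drive the post-training step from Python, the following sketch builds the helper invocation without running it. Two things here are assumptions rather than facts from the PR: that `TARGET_DTYPE` is read by `consolidate.sh` as an environment variable, and that `"bfloat16"` is an accepted spelling. The function name `build_consolidate_command` is hypothetical.

```python
import os
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Tuple


def build_consolidate_command(
    checkpoint_dir: str,
    target_dtype: Optional[str] = None,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for running the generated helper (illustrative sketch only)."""
    # The helper lives at <checkpoint>/model/consolidate.sh per the PR description.
    script = PurePosixPath(checkpoint_dir) / "model" / "consolidate.sh"
    env = dict(os.environ if base_env is None else base_env)
    if target_dtype is not None:
        # ASSUMPTION: the helper reads TARGET_DTYPE from the environment.
        env["TARGET_DTYPE"] = target_dtype
    return ["bash", str(script)], env
```

The returned pair can be passed to `subprocess.run(cmd, env=env, check=True)` once training has finished, which keeps the export off the GPU allocation.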

Changelog

Fixes:

  • Fix consolidated shard index parsing so non-standard upstream filenames, including Qwen3.5-style safetensors names, preserve the expected number of output files.
  • Add a deterministic size-based fallback for output shard assignment when the original HF safetensors index/path is unavailable.
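The shard-parsing fix can be illustrated with a small sketch. This is not the PR's code: `parse_shard_index` is a hypothetical name, and the non-standard filename in the usage note is made up for illustration rather than copied from any real model repository. The idea shown is to search for the `NNNNN-of-NNNNN` pattern anywhere in the name and to return `None` when it is absent, so the caller can fall back instead of silently mapping everything to one file.

```python
import re
from typing import Optional, Tuple

# Matches the "<index>-of-<total>" part of a sharded safetensors filename.
_SHARD_RE = re.compile(r"(\d+)-of-(\d+)")


def parse_shard_index(filename: str) -> Optional[Tuple[int, int]]:
    """Return (index, total) parsed from a shard filename, or None if unparseable."""
    match = _SHARD_RE.search(filename)
    if match is None:
        return None
    index, total = int(match.group(1)), int(match.group(2))
    # Reject inconsistent values rather than guessing a shard.
    if not 1 <= index <= total:
        return None
    return index, total
```

With this sketch, `parse_shard_index("model-00001-of-00004.safetensors")` yields `(1, 4)`, a hypothetical name such as `weights.00003-of-00010.safetensors` yields `(3, 10)`, and an unsharded `model.safetensors` yields `None`.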

Offline consolidation helper & docs & configs:

  • Generate a conservative consolidate.sh helper beside non-consolidated safetensors checkpoints, with CPU-friendly defaults and optional multi-process/threaded execution.
  • Move the offline consolidation implementation into nemo_automodel.tools.offline_hf_consolidation, keeping the old script path as a compatibility wrapper.
  • Update checkpoint docs and example configs to recommend save_consolidated: false during training, followed by bash <checkpoint>/model/consolidate.sh after training.
  • Emit warning-level and info-level log messages when save_consolidated=True runs HF export inline during checkpoint saves, because inline export may waste GPU allocation time.
  • Add optional offline export dtype casting via TARGET_DTYPE / --target-dtype; non-floating tensors and tensors already in the target dtype are unchanged.
  • Add focused unit tests for shard parsing, fallback shard assignment, offline script generation, target dtype casting, and consolidation warnings.
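The dtype-casting rule from the changelog (non-floating tensors and tensors already in the target dtype stay unchanged) can be sketched as a pure decision function. This is an illustration, not the PR's implementation: the function name `resolve_output_dtype` is hypothetical, and it works on safetensors-style dtype strings instead of real tensors so that it has no dependencies.

```python
from typing import Optional

# Floating-point dtype names as they appear in safetensors headers.
FLOATING_DTYPES = frozenset({"F64", "F32", "F16", "BF16"})


def resolve_output_dtype(current: str, target: Optional[str]) -> str:
    """Return the dtype a tensor should be written with (illustrative sketch only)."""
    if target is None:
        # No cast requested: keep the stored dtype.
        return current
    if current not in FLOATING_DTYPES:
        # Integer and boolean tensors are never cast.
        return current
    # Floating tensors move to the target; a tensor already in the target is a no-op.
    return target
```

Under these assumptions, an `F32` weight exported with a `BF16` target becomes `BF16`, while an `I64` buffer or a `BOOL` mask keeps its original dtype.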

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Fixes #2279

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

if not self._should_write_hf_metadata() or self._should_write_consolidated_safetensors() or not _is_rank_0():
    return

script_path = os.path.join(model_dir, "consolidate.sh")

Pretty nice @yuhezhang-ai

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
@yuhezhang-ai
Contributor Author

/ok to test cbbca66

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
…consolidation-hardening

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
@yuhezhang-ai
Contributor Author

/ok to test 2b78e77

@jgerh jgerh left a comment

Completed a tech pubs review of .md files, performed copyediting, and verified formatting for consistency/alignment with style guidelines. Let me know what you think.

Llama is a family of decoder-only transformer models developed by Meta. The 1B variant is a compact model suitable for research and edge deployment, featuring RoPE positional embeddings, grouped-query attention (GQA), and SwiGLU activations.
:::

:::{dropdown} Accessing gated models

Suggested change
:::{dropdown} Accessing Gated Models

This guide uses **SQuAD v1.1** as a running example. Swap the dataset by changing `_target_` and the dataset arguments — see [Integrate Your Own Text Dataset](dataset.md) and [Dataset Overview](../dataset-overview.md).

:::{dropdown} About SQuAD v1.1
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and a span answer. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning.

Verify: answer span is the correct noun phrase in SQuAD

Suggested change
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and an answer span. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning.


The `--nproc-per-node=8` flag specifies the number of GPUs per node. Adjust to your case (for a single GPU, omit the `--nproc-per-node` option).

### Invoke the Recipe Script Directly (advanced)

Suggested change
### Invoke the Recipe Script Directly (Advanced)

```

### Sample Output
Running the recipe using either the `automodel` app or by directly invoking the recipe script should produce

Suggested change
Running the recipe with the `automodel` app or by invoking the recipe script directly produces the following log:

### Checkpoint Contents

Checkpoints are saved in native Hugging Face format, so no conversion is required — they work directly with Transformers, PEFT, vLLM, lm-eval-harness, and other tools in the Hugging Face ecosystem. SFT and PEFT produce different checkpoint layouts. **SFT checkpoints** contain the full model weights at `model/consolidated/` — a single, self-contained Hugging Face model directory created by gathering distributed shards into one location — and can be loaded directly. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs) — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).
Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated `model/consolidate.sh` helper; run the helper after training to create `model/consolidated/` for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set `save_consolidated: true` when you intentionally want inline HF export at every checkpoint save. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs), save those adapter files directly under `model/`, and do not use `model/consolidate.sh` — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).

Suggested change
Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated `model/consolidate.sh` helper; run the helper after training to create `model/consolidated/` for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set `save_consolidated: true` when you intentionally want inline HF export at every checkpoint save. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs), save those adapter files directly under `model/`, and do not use `model/consolidate.sh` — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).
Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated `model/consolidate.sh` helper; run the helper after training to create `model/consolidated/` for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set `save_consolidated: true` when you intentionally want inline HF export at every checkpoint save. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs) and are saved directly under `model/`. They do not use `model/consolidate.sh` — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment).

--seq_length 1024
```

### 2. Model Verification

Suggested change
### Verify the Model

- **Per-layer hidden states** — cosine similarity and max absolute difference at the last token position
- **Final logits** — cosine similarity, max absolute difference, and top-1 token agreement

HF Transformers loads with `device_map="auto"`. NeMo loads via torchrun using the same config and code path as training (EP, FSDP, backend).

Suggested change
The HF Transformers library loads with `device_map="auto"`. NeMo loads through torchrun using the same config and code path as training (EP, FSDP, backend).

RESULT: PASS — all prompts above threshold 0.99
```

### 3. Training

Suggested change
### Train

--nodes 2 --partition batch
```

### 4. Eval

Suggested change
### Evaluate

--thinking
```

### 5. Inference Quality

Suggested change
### Analyze Inference Quality
