fix(checkpoint): harden consolidated safetensors export #2289
yuhezhang-ai wants to merge 7 commits into
Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
```python
if not self._should_write_hf_metadata() or self._should_write_consolidated_safetensors() or not _is_rank_0():
    return

script_path = os.path.join(model_dir, "consolidate.sh")
```
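Read as a predicate, the early return in the hunk above means the helper script is written only when HF metadata is being written, inline consolidation is off, and the process is rank 0. The following is a minimal sketch of that condition, not the PR's code: the function and argument names are hypothetical stand-ins for the three checks in the hunk.

```python
from itertools import product


def should_write_consolidate_script(
    write_hf_metadata: bool,
    inline_consolidated: bool,
    is_rank_0: bool,
) -> bool:
    """Hypothetical restatement of the guard: the negation of the early-return condition."""
    return write_hf_metadata and not inline_consolidated and is_rank_0


def enabled_cases():
    """Enumerate every (write_hf_metadata, inline_consolidated, is_rank_0) combination that passes."""
    return [
        case
        for case in product([False, True], repeat=3)
        if should_write_consolidate_script(*case)
    ]


if __name__ == "__main__":
    for case in enabled_cases():
        print(case)
```

Of the eight possible combinations, exactly one passes the guard, which is why the script appears only beside non-consolidated checkpoints and is written once per save.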
/ok to test cbbca66
…consolidation-hardening Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
/ok to test 2b78e77
jgerh
left a comment
Completed a tech pubs review of .md files, performed copyediting, and verified formatting for consistency/alignment with style guidelines. Let me know what you think.
| Llama is a family of decoder-only transformer models developed by Meta. The 1B variant is a compact model suitable for research and edge deployment, featuring RoPE positional embeddings, grouped-query attention (GQA), and SwiGLU activations. | ||
| ::: | ||
|
|
||
| :::{dropdown} Accessing gated models |
| :::{dropdown} Accessing Gated Models |
| This guide uses **SQuAD v1.1** as a running example. Swap the dataset by changing `_target_` and the dataset arguments — see [Integrate Your Own Text Dataset](dataset.md) and [Dataset Overview](../dataset-overview.md). | ||
|
|
||
| :::{dropdown} About SQuAD v1.1 | ||
| The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and a span answer. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning. |
Verify: answer span is the correct noun phrase in SQuAD
| The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset where each example consists of a Wikipedia passage, a question, and an answer span. SQuAD v1.1 guarantees all questions are answerable from the context, making it suitable for straightforward fine-tuning. |
|
|
||
| The `--nproc-per-node=8` flag specifies the number of GPUs per node. Adjust to your case (for a single GPU, omit the `--nproc-per-node` option). | ||
|
|
||
| ### Invoke the Recipe Script Directly (advanced) |
| ### Invoke the Recipe Script Directly (Advanced) |
| ``` | ||
|
|
||
| ### Sample Output | ||
| Running the recipe using either the `automodel` app or by directly invoking the recipe script should produce |
| Running the recipe with the `automodel` app or by invoking the recipe script directly produces the following log: |
| ### Checkpoint Contents | ||
|
|
||
| Checkpoints are saved in native Hugging Face format, so no conversion is required — they work directly with Transformers, PEFT, vLLM, lm-eval-harness, and other tools in the Hugging Face ecosystem. SFT and PEFT produce different checkpoint layouts. **SFT checkpoints** contain the full model weights at `model/consolidated/` — a single, self-contained Hugging Face model directory created by gathering distributed shards into one location — and can be loaded directly. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs) — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment). | ||
| Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated `model/consolidate.sh` helper; run the helper after training to create `model/consolidated/` for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set `save_consolidated: true` when you intentionally want inline HF export at every checkpoint save. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs), save those adapter files directly under `model/`, and do not use `model/consolidate.sh` — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment). |
| Checkpoints are saved as Hugging Face-compatible safetensors. By default, SFT writes sharded model weights plus a generated `model/consolidate.sh` helper; run the helper after training to create `model/consolidated/` for Transformers, vLLM, lm-eval-harness, and other Hugging Face ecosystem tools. You can also set `save_consolidated: true` when you intentionally want inline HF export at every checkpoint save. **PEFT checkpoints** contain only the adapter weights (~MBs instead of GBs) and are saved directly under `model/`. They do not use `model/consolidate.sh` — at inference time you must load the original base model and apply the adapter on top. This distinction affects every downstream step (inference, publishing, deployment). |
| --seq_length 1024 | ||
| ``` | ||
|
|
||
| ### 2. Model Verification |
| ### Verify the Model |
| - **Per-layer hidden states** — cosine similarity and max absolute difference at the last token position | ||
| - **Final logits** — cosine similarity, max absolute difference, and top-1 token agreement | ||
|
|
||
| HF Transformers loads with `device_map="auto"`. NeMo loads via torchrun using the same config and code path as training (EP, FSDP, backend). |
| The HF Transformers library loads with `device_map="auto"`. NeMo loads through torchrun using the same config and code path as training (EP, FSDP, backend). |
| RESULT: PASS — all prompts above threshold 0.99 | ||
| ``` | ||
|
|
||
| ### 3. Training |
| --nodes 2 --partition batch | ||
| ``` | ||
|
|
||
| ### 4. Eval |
| ### Evaluate |
| --thinking | ||
| ``` | ||
|
|
||
| ### 5. Inference Quality |
| ### Analyze Inference Quality |
What does this PR do?
This PR has two parts.
First, it fixes bugs in HF safetensors consolidation. Non-standard upstream shard filenames no longer collapse into one output file, and when the original HF shard metadata is unavailable, AutoModel now falls back to deterministic size-based shard assignment.
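To make the size-based fallback described above concrete, here is a minimal sketch of one way to assign tensors to shards deterministically. It is an illustration under stated assumptions, not the logic in this PR: the function names, the `max_shard_bytes` parameter, and the greedy fill in sorted-name order are all hypothetical. The only borrowed detail is the common Hugging Face `model-XXXXX-of-YYYYY.safetensors` filename pattern.

```python
from typing import Dict


def assign_shards_by_size(tensor_sizes: Dict[str, int], max_shard_bytes: int) -> Dict[str, int]:
    """Map each tensor name to a 1-based shard index (hypothetical helper).

    Tensors are visited in sorted-name order, so the result does not depend on
    dict insertion order or on which process computes it.
    """
    if max_shard_bytes <= 0:
        raise ValueError("max_shard_bytes must be positive")

    assignment: Dict[str, int] = {}
    shard_index = 1
    used = 0
    for name in sorted(tensor_sizes):
        size = tensor_sizes[name]
        # Start a new shard only if the current one already holds something;
        # a single oversized tensor still gets a shard of its own.
        if used > 0 and used + size > max_shard_bytes:
            shard_index += 1
            used = 0
        assignment[name] = shard_index
        used += size
    return assignment


def shard_filename(index: int, total: int) -> str:
    """Format a shard name in the usual model-XXXXX-of-YYYYY.safetensors style."""
    return f"model-{index:05d}-of-{total:05d}.safetensors"


if __name__ == "__main__":
    sizes = {"layer.1.weight": 4, "layer.0.weight": 4, "lm_head.weight": 4}
    plan = assign_shards_by_size(sizes, max_shard_bytes=8)
    total = max(plan.values())
    for name, idx in sorted(plan.items()):
        print(name, "->", shard_filename(idx, total))
```

Because each shard gets a distinct index-based name, distinct shards cannot collapse into one output file, which is the failure mode the fix targets.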
Second, it changes the recommended export workflow. Instead of writing consolidated HF weights inline during every checkpoint save, docs and example configs now prefer `save_consolidated: false` and running the generated `model/consolidate.sh` helper after training. The helper is CPU-friendly, supports optional parallelism, and calls the packaged `nemo_automodel.tools.offline_hf_consolidation` module; the old standalone script path remains as a compatibility wrapper.
This PR also adds inline-consolidation warnings and adds optional `TARGET_DTYPE`/`--target-dtype` support for uncommon dtype conversion cases during offline export.
Changelog
Fixes:
Offline consolidation helper & docs & configs:
- `consolidate.sh` helper beside non-consolidated safetensors checkpoints, with CPU-friendly defaults and optional multi-process/threaded execution.
- `nemo_automodel.tools.offline_hf_consolidation`, keeping the old script path as a compatibility wrapper.
- `save_consolidated: false` during training, followed by `bash <checkpoint>/model/consolidate.sh` after training.
- `save_consolidated=True` runs HF export inline during checkpoint saves and may waste GPU allocation time.
- `TARGET_DTYPE`/`--target-dtype`; non-floating tensors and tensors already in the target dtype are unchanged.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Fixes #2279
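The dtype rule stated in the changelog (non-floating tensors and tensors already in the target dtype are unchanged) can be sketched as a small decision function. This is a plain-Python illustration that uses dtype names as strings. The function name and the set of floating dtypes are assumptions made for the example, and it does not reflect how the packaged module is actually written.

```python
from typing import Optional

# Assumed set of floating-point dtype names, for illustration only.
FLOATING_DTYPES = frozenset({"float16", "bfloat16", "float32", "float64"})


def output_dtype(source_dtype: str, target_dtype: Optional[str]) -> str:
    """Return the dtype a tensor would be written with (hypothetical helper)."""
    if target_dtype is None:
        # No target requested: every tensor keeps its dtype.
        return source_dtype
    if source_dtype not in FLOATING_DTYPES:
        # Non-floating tensors (for example integer buffers) are left unchanged.
        return source_dtype
    if source_dtype == target_dtype:
        # Already in the target dtype: nothing to convert.
        return source_dtype
    return target_dtype


if __name__ == "__main__":
    for src in ("float32", "bfloat16", "int64"):
        print(src, "->", output_dtype(src, "bfloat16"))
```

Keeping the rule as a pure function of the two dtype names makes the unchanged cases easy to check without loading any tensors.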