
feat: add MXFP4/MXFP8 quantization support (llmc_compressor format) and related tests #1865

Draft: xin3he wants to merge 5 commits into main from xinhe/5-28


@xin3he xin3he commented May 28, 2026

Description

This pull request adds model-free quantization support for MXFP4 and MXFP8 (Microscaling floating-point formats) to the auto_round library. The changes include new quantization logic, quantization-config generation, updated scheme validation, and tests that check correct behavior and compatibility with the existing codebase.

MXFP4/MXFP8 Quantization Support:

  • Added support for MXFP4 and MXFP8 quantization schemes in model-free mode, including new logic for quantizing weights, generating output in the correct packed formats, and integrating with existing APIs.
  • Implemented _quantize_weight_mxfp function for quantizing weights into MXFP formats, utilizing existing utilities for exponent sharing and packing.
  • Updated scheme validation to allow MXFP4 and MXFP8 (with group_size=32) and improved error messages for unsupported configurations.
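
As a rough illustration of what block-wise MXFP quantization with a shared exponent involves, here is a minimal pure-Python sketch for MXFP4. It is not the `_quantize_weight_mxfp` implementation from this PR: the function names are hypothetical, it assumes the OCP Microscaling convention of one power-of-two scale per block of 32 values with E2M1 elements, and it breaks rounding ties toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign handled separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2  # exponent of the largest E2M1 value: 6 = 1.5 * 2**2


def quantize_block_mxfp4(block):
    """Quantize one block (e.g. 32 floats) to a shared exponent plus E2M1 values.

    Returns (shared_exp, elements) where each element is a value on the
    E2M1 grid and the real value is approximately element * 2**shared_exp.
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)

    # Choose the power-of-two scale so the block maximum lands in the top
    # binade of the element format, then clamp to the E8M0 exponent range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    shared_exp = max(-127, min(127, shared_exp))
    scale = 2.0 ** shared_exp

    elements = []
    for v in block:
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6.0
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
        elements.append(math.copysign(nearest, v))
    return shared_exp, elements


def dequantize_block_mxfp4(shared_exp, elements):
    """Reconstruct approximate floats from the shared exponent and elements."""
    return [e * 2.0 ** shared_exp for e in elements]
```

For example, a block whose largest magnitude is 3.0 gets a shared exponent of -1, so 3.0 is stored as the element 6.0 and reconstructed exactly.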

Configuration and Format Handling:

  • Added logic to generate quantization configuration files in the compressed-tensors/llm-compressor style for MXFP schemes, ensuring compatibility with downstream tools.

Robustness Improvements:

  • Modified shard processing to preserve original quantized tensors (such as FP8, FP4-packed) for ignored or skipped layers, preventing unwanted dequantization.
  • Added and updated tests to verify correct preservation of original quantized tensors and to ensure MXFP quantization logic produces expected output shapes and formats.
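
The preservation behavior described above can be sketched roughly as follows. This is an illustration only, not the PR's `_process_shard`: the function name, the substring matching of ignore patterns, and the `quantize_fn` callback are all assumptions made for the example.

```python
def process_shard(tensors, ignore_layers, quantize_fn):
    """Return a new shard dict, passing ignored layers through untouched.

    tensors:       dict mapping tensor name -> tensor-like object
    ignore_layers: substrings; a layer whose name contains one is skipped
                   (hypothetical matching rule)
    quantize_fn:   callable (name, tensor) -> dict of output tensors
    """
    out = {}
    for name, tensor in tensors.items():
        layer = name.rsplit(".", 1)[0]  # strip the ".weight" / ".weight_scale" suffix
        if any(pattern in layer for pattern in ignore_layers):
            # Keep the original object (e.g. FP8 or packed FP4 data and its
            # scales) instead of dequantizing and re-quantizing it.
            out[name] = tensor
        elif name.endswith(".weight"):
            out.update(quantize_fn(name, tensor))
        else:
            out[name] = tensor
    return out
```

The key design point is that the ignore check happens before any dequantization step, so an ignored layer's tensors and their companion scale tensors are copied through bit-for-bit.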

Testing Enhancements:

  • Added end-to-end and unit tests covering MXFP4/MXFP8 quantization, configuration output, and scheme validation.

Miscellaneous:

  • Updated supported and unsupported scheme lists in tests to reflect new MXFP support.
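
For readers unfamiliar with the validation rule mentioned earlier (MXFP4 and MXFP8 are accepted only with group_size=32), a check of that kind might look like the sketch below. The function and constant names are hypothetical and the error messages are invented for the example; the PR's actual validation code may differ.

```python
# Hypothetical names; bit widths per scheme are used only for illustration.
MXFP_SCHEMES = {"MXFP4": 4, "MXFP8": 8}
MXFP_GROUP_SIZE = 32


def validate_mxfp_scheme(scheme, group_size=MXFP_GROUP_SIZE):
    """Return the element bit width for a supported MXFP scheme, else raise."""
    name = scheme.upper()
    if name not in MXFP_SCHEMES:
        supported = ", ".join(sorted(MXFP_SCHEMES))
        raise ValueError(f"Unsupported scheme {scheme!r}; expected one of: {supported}")
    if group_size != MXFP_GROUP_SIZE:
        raise ValueError(
            f"{name} requires group_size={MXFP_GROUP_SIZE}, got {group_size}"
        )
    return MXFP_SCHEMES[name]
```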

Together, these changes enable model-free quantization with MXFP4 and MXFP8 while keeping the rest of the quantization pipeline compatible.

Type of Change

New feature

Related Issues

Fixes or relates to #1741

  • MXFP8 and MXFP4 accuracy was verified with Qwen/Qwen3-0.6B using the lm_eval vLLM backend.
  • To quantize DeepSeek-V4-Flash:
    auto-round /workspace/models/deepseek-ai/DeepSeek-V4-Flash --model_free --scheme MXFP4 --ignore_layers ffn.experts --output_dir /workspace/models/deepseek-ai/DeepSeek-V4-Flash-MXFP4
  • Loading DeepSeek-V4 MXFP4/MXFP8 checkpoints in CT format with vLLM is a work in progress.

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Copilot AI review requested due to automatic review settings May 28, 2026 05:51
@xin3he xin3he changed the title from "feat: add MXFP4/MXFP8 quantization support and related tests" to "feat: add MXFP4/MXFP8 quantization support (llmc_compressor format) and related tests" May 28, 2026

Copilot AI left a comment


Pull request overview

Adds MXFP4 / MXFP8 model-free (RTN) quantization to the auto_round library, routing through the existing quant_mx shared-exponent logic and QuantLinear.pack() to emit compressed-tensors / llm-compressor style outputs. Also tweaks shard processing to keep already-quantized tensors for ignored/skipped layers, and broadens model-free supported formats.
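
To make the packing step more concrete, here is a small sketch of how quantized FP4 values and shared exponents can be turned into bytes. It does not reproduce `QuantLinear.pack()`: the function names are hypothetical, the E2M1 bit layout (1 sign, 2 exponent, 1 mantissa bit) and the E8M0 bias of 127 follow the OCP Microscaling definitions, and the low-nibble-first byte order is an assumption that the real packed format may not share.

```python
import math

# Index in this list is the 3-bit magnitude code of an E2M1 value.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def encode_e2m1(value):
    """Map a value already on the E2M1 grid to its 4-bit code."""
    code = E2M1_MAGNITUDES.index(abs(value))  # raises ValueError if off-grid
    if math.copysign(1.0, value) < 0:
        code |= 0b1000  # sign bit
    return code


def pack_fp4(values):
    """Pack pairs of E2M1 values into bytes (first value in the low nibble)."""
    if len(values) % 2:
        raise ValueError("need an even number of FP4 values to pack")
    codes = [encode_e2m1(v) for v in values]
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def encode_e8m0(shared_exp):
    """Encode a shared exponent as a biased 8-bit E8M0 scale."""
    biased = shared_exp + 127
    if not 0 <= biased <= 254:  # 255 is reserved for NaN
        raise ValueError(f"shared exponent {shared_exp} out of E8M0 range")
    return biased
```

With this layout, 32 FP4 elements occupy 16 bytes plus one scale byte per block.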

Changes:

  • New _quantize_weight_mxfp() plus an MXFP branch in _quantize_single_tensor(), scheme validation, and compressed-tensors style config generation via _build_mxfp_quantization_config().
  • _process_shard() now preserves original tensors for ignored/skipped .weight layers before FP8/FP4 dequantization; SUPPORTED_FORMATS and quantize_and_save() accept llm_compressor variants.
  • New unit / end-to-end tests for MXFP4 / MXFP8 model-free flows and an ignored-FP8-preservation test.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| auto_round/compressors/model_free.py | Adds MXFP weight quantization, compressed-tensors config builder, MXFP validation branch, ignored-tensor preservation in shard processing, and expanded supported formats. |
| test/test_cpu/quantization/test_model_free.py | Adds shape/dtype, end-to-end, AutoRound API, shard, ignored-FP8 preservation, and scheme-validation tests for MXFP. |

Comment thread test/test_cpu/quantization/test_model_free.py
Comment thread auto_round/compressors/model_free.py Outdated
Comment thread auto_round/compressors/model_free.py
Comment thread auto_round/compressors/model_free.py
Comment thread auto_round/compressors/model_free.py
xin3he added 2 commits May 28, 2026 06:27
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
@xin3he xin3he marked this pull request as draft May 29, 2026 01:16
xin3he and others added 3 commits May 29, 2026 13:10
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Xin He <xin3.he@intel.com>
