Changes from all commits (43 commits)
541c855
Added support for HF modelopt state reload for vllm fakequant
kinjalpatel27 Jan 21, 2026
9c5364f
changelog update
kinjalpatel27 Jan 21, 2026
0049a48
minor
kinjalpatel27 Jan 22, 2026
9216ec7
updated for TP>1
kinjalpatel27 Jan 22, 2026
bcb48f1
minor
kinjalpatel27 Jan 22, 2026
9017e6e
updated test
kinjalpatel27 Jan 26, 2026
480c292
test fix
kinjalpatel27 Jan 26, 2026
6c96ace
minor
kinjalpatel27 Jan 26, 2026
da97612
created separate script for vllm fq export
kinjalpatel27 Jan 26, 2026
536b58a
minor
kinjalpatel27 Jan 26, 2026
80bde37
cleanup
kinjalpatel27 Jan 26, 2026
ad1994b
minor
kinjalpatel27 Feb 23, 2026
6955790
removed cleanup_for_torch_save
kinjalpatel27 Feb 23, 2026
5da52de
minor
kinjalpatel27 Feb 23, 2026
0afd73e
updated vllm fakequant to ignore absent disabled quantizer during reload
kinjalpatel27 Mar 3, 2026
b42d453
minor
kinjalpatel27 Mar 3, 2026
b1d8a1c
minor
kinjalpatel27 Mar 4, 2026
abca327
minor
kinjalpatel27 Mar 4, 2026
dacc895
minor
kinjalpatel27 Mar 4, 2026
e78af7b
fixed comments
kinjalpatel27 Mar 4, 2026
0ff98c0
Refactor and rearranged code
kinjalpatel27 Mar 6, 2026
fbd0793
minor
kinjalpatel27 Mar 6, 2026
cb50bad
minor
kinjalpatel27 Mar 6, 2026
7c36c5a
minor
kinjalpatel27 Mar 6, 2026
67e1074
minor
kinjalpatel27 Mar 6, 2026
dda5ec0
minor
kinjalpatel27 Mar 6, 2026
99c6d1c
Refactor vLLM quant: centralize dtype resolution, deduplicate attenti…
kinjalpatel27 Mar 11, 2026
4e00471
Fix weight quantizer disable before state export
kinjalpatel27 Mar 11, 2026
bd6fe04
Update test_hf_vllm_export to reflect weight-folding export behavior
kinjalpatel27 Mar 11, 2026
e9fec3c
cleanup
kinjalpatel27 Mar 11, 2026
5b6dc24
Updated export path to fold weights before export
kinjalpatel27 Mar 24, 2026
d76ca86
minor
kinjalpatel27 Mar 24, 2026
ca7020c
fixed code to enable graph compilation when reloading
kinjalpatel27 Mar 24, 2026
b458200
minor
kinjalpatel27 Mar 24, 2026
3458d15
minor
kinjalpatel27 Mar 24, 2026
72a06e5
updated megatron support
kinjalpatel27 Mar 24, 2026
fdb748a
Fridah/kinjal/vllm modelopt reload (#1068)
Fridah-nv Mar 24, 2026
a427aa3
cleanup
kinjalpatel27 Mar 24, 2026
f86ca34
cleanup
kinjalpatel27 Mar 24, 2026
586801a
fix test
kinjalpatel27 Mar 25, 2026
2170cfc
cleanup
kinjalpatel27 Mar 25, 2026
46c5670
minor
kinjalpatel27 Mar 26, 2026
ffc45dd
minor
kinjalpatel27 Mar 26, 2026
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -46,6 +46,7 @@ NVIDIA Model Optimizer Changelog
- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in ``auto_quantize`` grouping and scoring rules.
- Add support for block-granular RHT for non-power-of-2 dimensions.
- Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
- Add support for vLLM fakequant reload using ModelOpt state for HF models. See `examples/vllm_serve/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vllm_serve#load-qatptq-model-and-serve-in-vllm-wip>`_ for more details.

**Deprecations**

2 changes: 2 additions & 0 deletions examples/llm_ptq/README.md
@@ -346,6 +346,8 @@ with torch.inference_mode():
python hf_ptq.py --pyt_ckpt_path <huggingface_model_card> --qformat fp8 --export_path <quantized_ckpt_path> --trust_remote_code
```

> *To export fake-quantized models for vLLM serving (e.g., for research, or for kernels not yet supported in real quantization), use the `--vllm_fakequant_export` flag. See [vllm_serve/README.md](../vllm_serve/README.md) for details.*

### Hugging Face framework [Script](./scripts/huggingface_example.sh)

Alternatively, the framework script `huggingface_example.sh` also supports quantize and export:
29 changes: 21 additions & 8 deletions examples/llm_ptq/hf_ptq.py
@@ -53,6 +53,7 @@
from modelopt.recipe import ModelOptPTQRecipe, load_recipe
from modelopt.torch.export import (
    export_hf_checkpoint,
+    export_hf_vllm_fq_checkpoint,
    export_speculative_decoding,
    export_tensorrt_llm_checkpoint,
    get_model_type,
@@ -681,16 +682,21 @@ def export_quantized(

    # Load any missing weights from non-standard safetensors (handled in get_model for non-low-memory mode)
    # Store the MTP layer prefixes on the model for later exclusion from quantization
-    mtp_layer_prefixes, mtp_state_dict = load_mtp_weights(full_model, args.pyt_ckpt_path)
+    if args.vllm_fakequant_export:
+        export_hf_vllm_fq_checkpoint(full_model, export_dir=export_path)
+    else:
+        mtp_layer_prefixes, mtp_state_dict = load_mtp_weights(
+            full_model, args.pyt_ckpt_path
+        )

-    if mtp_layer_prefixes:
-        full_model._mtp_layer_prefixes = mtp_layer_prefixes
+        if mtp_layer_prefixes:
+            full_model._mtp_layer_prefixes = mtp_layer_prefixes

-    export_hf_checkpoint(
-        full_model,
-        export_dir=export_path,
-        extra_state_dict=mtp_state_dict,
-    )
+        export_hf_checkpoint(
+            full_model,
+            export_dir=export_path,
+            extra_state_dict=mtp_state_dict,
+        )

    # Restore default padding and export the tokenizer as well.
    if tokenizer is not None:
@@ -1218,6 +1224,13 @@ def parse_args() -> argparse.Namespace:
"Does not impact non-MOE models."
),
)
parser.add_argument(
"--vllm_fakequant_export",
default=False,
action="store_true",
help="Export as vLLM fake-quant checkpoint (produces vllm_fq_modelopt_state.pth "
"for use with vllm_serve_fakequant.py).",
)

    args = parser.parse_args()
    if args.moe_calib_experts_ratio is not None and not (0.0 < args.moe_calib_experts_ratio <= 1.0):
50 changes: 38 additions & 12 deletions examples/vllm_serve/README.md
@@ -23,9 +23,11 @@ You can either edit the `quant_config` dictionary in `vllm_serve_fakequant.py`,
|-----------------|--------------------------------------------------|---------------------|
| QUANT_DATASET | Dataset name for calibration | cnn_dailymail |
| QUANT_CALIB_SIZE| Number of samples used for calibration | 512 |
-| QUANT_CFG | Quantization format | NVFP4_DEFAULT_CFG |
-| KV_QUANT_CFG | Quantization format for KV Cache | None |
-| AMAX_FILE_PATH | Optional path to amax file (for loading amax) | None |
+| QUANT_CFG | Quantization config | None |
+| KV_QUANT_CFG | KV-cache quantization config | None |
+| QUANT_FILE_PATH | Optional path to exported quantizer state dict `quantizer_state.pth` | None |
+| MODELOPT_STATE_PATH | Optional path to exported `vllm_fq_modelopt_state.pth` (restores quantizer state and parameters) | None |
+| CALIB_BATCH_SIZE | Calibration batch size | 1 |

Set these variables in your shell or Docker environment as needed to customize calibration.
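
The two reload paths above behave differently: `MODELOPT_STATE_PATH` restores a full ModelOpt state saved from an HF export, while `QUANT_FILE_PATH` plus `QUANT_CFG` re-inserts quantizers and then loads exported quantizer tensors (MCore path). The snippet below is a minimal illustrative sketch of how a serving script *could* consume these variables; the helper name and control flow are assumptions for illustration, not the actual logic of `vllm_serve_fakequant.py`.

```python
import os

import torch

import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq


def apply_fakequant(model, calib_loop):
    """Hypothetical helper showing one way the env vars above could be wired up."""
    modelopt_state_path = os.environ.get("MODELOPT_STATE_PATH")
    quant_file_path = os.environ.get("QUANT_FILE_PATH")
    quant_cfg_name = os.environ.get("QUANT_CFG")

    if modelopt_state_path:
        # HF reload path: restore the saved ModelOpt quantizer state instead of re-calibrating.
        state = torch.load(modelopt_state_path, map_location="cpu", weights_only=False)
        mto.restore_from_modelopt_state(model, state)
    elif quant_cfg_name:
        # Fresh calibration (or MCore reload) path: insert quantizers with the named config ...
        mtq.quantize(model, getattr(mtq, quant_cfg_name), forward_loop=calib_loop)
        if quant_file_path:
            # ... then overwrite the calibrated quantizer tensors with the exported state dict.
            quantizer_state = torch.load(quant_file_path, map_location="cpu")
            model.load_state_dict(quantizer_state, strict=False)
    return model
```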

@@ -56,21 +58,45 @@ lm_eval --model local-completions --tasks gsm8k --model_args model=<model_name>,

## Load QAT/PTQ model and serve in vLLM (WIP)

-Overwrite the calibrated amax value with prepared values from either QAT/PTQ.
-Step 1: export the model with bf16 weights and amax values. To export the model:
-- For HF model use `modelopt.torch.export.export_hf_vllm_fq_checkpoint` function.
-- For MCore model use `modelopt.torch.export.export_mcore_gpt_to_hf_vllm_fq` function.
+Step 1: export the model with bf16 weights and quantizer state. To export the model:
+
+- For **HF** models, use `examples/llm_ptq/hf_ptq.py` with `--vllm_fakequant_export`:
```bash
python ../llm_ptq/hf_ptq.py \
--pyt_ckpt_path <MODEL_PATH> \
--qformat nvfp4 \
--calib_size 512 \
--export_path <EXPORT_DIR> \
--vllm_fakequant_export \
--trust_remote_code
```

This creates `<EXPORT_DIR>/vllm_fq_modelopt_state.pth` (ModelOpt quantizer state for vLLM fake-quant reload) and saves the HF-exported model under `<EXPORT_DIR>` (config/tokenizer/weights).

Note: `--pyt_ckpt_path` can point to either an HF checkpoint or a ModelOpt-saved checkpoint (e.g., a QAT/QAD checkpoint produced by `examples/llm_qat/main.py`). If the input checkpoint is already quantized, the script will **skip re-quantization** and only export artifacts for vLLM fakequant reload.

- For **MCore** models, export the model with flag `--export-vllm-fq` as described in [Megatron-LM README](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt#-nvfp4-quantization-qauntization-aware-training-and-model-export). This generates `quantizer_state.pth`, which contains quantizer tensors for vLLM reload via `QUANT_FILE_PATH`.
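
Returning to the HF path above: the same export can also be driven programmatically. A minimal sketch, mirroring roughly what `hf_ptq.py --vllm_fakequant_export` does (the model path, quantization config, and the toy calibration loop are placeholders, not the script's exact logic):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_vllm_fq_checkpoint

model_path = "<MODEL_PATH>"  # placeholder HF checkpoint or ModelOpt-saved checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)


def calib_loop(m):
    # Toy calibration loop: run a few prompts so the quantizers can collect statistics.
    # Real runs iterate over a calibration dataset (e.g. cnn_dailymail, 512 samples).
    with torch.no_grad():
        for prompt in ["Hello world.", "The quick brown fox jumps over the lazy dog."]:
            m(**tokenizer(prompt, return_tensors="pt").to(m.device))


mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calib_loop)
export_hf_vllm_fq_checkpoint(model, export_dir="<EXPORT_DIR>")  # writes vllm_fq_modelopt_state.pth + HF artifacts
```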

Step 2: use the exported artifacts when serving:

- **HF export**: pass the exported `vllm_fq_modelopt_state.pth` via `MODELOPT_STATE_PATH`

```bash
# HF
MODELOPT_STATE_PATH=<vllm_fq_modelopt_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```

-Step 2: configure <quant_amax.pth> from exported model using AMAX_FILE_PATH environment variable in step 1. For example:
+- **MCore export**: pass the exported `quantizer_state.pth` via `QUANT_FILE_PATH` and set `QUANT_CFG` to match the MCore quantization recipe

```bash
-AMAX_FILE_PATH=<vllm_amax.pth> QUANT_CFG=<quant_config> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
+# MCore
+QUANT_CFG=<quant_cfg> QUANT_FILE_PATH=<quantizer_state.pth> python vllm_serve_fakequant.py <model_path> -tp 8 --host 0.0.0.0 --port 8000
```

## Known Problems

-1. AWQ is not yet supported in vLLM.
-2. QAT checkpoint export doesn't have KV Cache quantization enabled. KV Cache fake quantization works for PTQ.
-3. Mixed precision checkpoint doesn't work currently.
+1. **MCore reload does not use `MODELOPT_STATE_PATH`**; use `QUANT_FILE_PATH` and make sure `QUANT_CFG` matches the quantization recipe used for the original MCore model (otherwise quantizer keys/config won't align).
+2. AWQ reload is not supported yet.
+3. KV cache quantization export and reload are not supported in MCore yet.
+4. **`NVFP4_KV_CFG` and `NVFP4_AFFINE_KV_CFG` require `--enforce-eager`**; these configs use a dynamic-block Triton kernel for KV-cache quantization that is incompatible with CUDA graph capture (the kernel grid is computed from Python-level tensor shapes, which get baked in at capture time). Without `--enforce-eager`, the captured grid will be wrong for different batch sizes, producing incorrect outputs.
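
To make item 4 concrete, here is a toy PyTorch-only sketch (not the actual ModelOpt/vLLM Triton kernel) of how a launch configuration computed in Python is frozen at CUDA graph capture time and silently reused on replay:

```python
import torch

BLOCK = 128


def blockwise_double(x, out, n_valid):
    # The "grid" (number of blocks) is chosen in Python from a Python-level size,
    # so this decision is made only when the code actually runs, i.e. at capture time.
    n_blocks = (n_valid + BLOCK - 1) // BLOCK
    for b in range(n_blocks):
        sl = slice(b * BLOCK, (b + 1) * BLOCK)
        out[sl] = x[sl] * 2.0


x = torch.ones(1024, device="cuda")
out = torch.zeros_like(x)

# Warm up on a side stream (recommended before capture), then capture with n_valid=256.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    blockwise_double(x, out, n_valid=256)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    blockwise_double(x, out, n_valid=256)  # n_blocks=2 is now baked into the graph

# A later "request" that logically needs n_valid=1024 still replays the captured 2-block launch.
out.zero_()
g.replay()
torch.cuda.synchronize()
print(out[:256].eq(2).all().item(), out[256:].eq(0).all().item())  # True True -> tail never processed
```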