
[BugFix][5271237][ONNX] Add Q/DQ placement for Conv->LayerNorm patterns #1117

Open

ajrasane wants to merge 2 commits into main from ajrasane/conv_layernorm

Conversation

@ajrasane
Contributor

ajrasane commented Mar 24, 2026

What does this PR do?

Type of change: BugFix

  • ModelOpt was not placing Q/DQ nodes between Conv and LayerNormalization, causing TensorRT to select slower i8f16 kernels instead of faster i8i8 kernels for Conv layers whose output feeds into LayerNorm (e.g., ConvNext models).

Changes:

  • Register LayerNormalization in ORT's QDQ registry (ort_utils.py)
  • Add find_conv_to_layernorm_nodes() to detect Conv→(copy ops)→LayerNorm patterns (graph_utils.py)
  • Add detected LayerNorm nodes to quantizable nodes list so Q/DQ pairs are inserted on Conv output (int8.py)
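
The detection pass described above can be sketched with plain-Python stand-ins. This is illustrative only: the real helpers in graph_utils.py operate on ONNX graph-surgeon nodes, and the names COPY_OPS, Node, and feeds_from_conv below are assumptions, not ModelOpt's API.

```python
# Hypothetical sketch of the Conv->(copy ops)->LayerNorm detection pass.
COPY_OPS = {"Transpose", "Reshape", "Squeeze", "Unsqueeze", "Flatten"}  # assumed set


class Node:
    """Minimal stand-in for a graph node: an op type and its data-input producer."""

    def __init__(self, op, producer=None):
        self.op = op
        self.producer = producer  # node feeding input[0] (the activation path)


def feeds_from_conv(node, max_depth=10):
    """Walk backwards along the activation edge, through copy-like ops only."""
    if node is None or max_depth <= 0:
        return False
    if node.op == "Conv":
        return True
    if node.op in COPY_OPS:
        return feeds_from_conv(node.producer, max_depth - 1)
    return False


def find_conv_to_layernorm_nodes(nodes):
    """Return LayerNormalization nodes whose activation traces back to a Conv."""
    return [
        n
        for n in nodes
        if n.op == "LayerNormalization" and feeds_from_conv(n.producer)
    ]
```

A LayerNorm fed by Conv -> Transpose matches; one fed by Relu -> Transpose does not, since the walk only passes through copy-like ops.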

Usage

  # No API change — existing quantize() call now automatically handles Conv->LayerNorm patterns
  import modelopt.onnx.quantization as moq

  moq.quantize("convnext.onnx", quantize_mode="int8")
  # Output model will now have: Conv -> Transpose -> Q -> DQ -> LayerNorm

Testing

  • Added unit test test_conv_layernorm_quantization with a ConvNext-like test model (Conv→Transpose→LayerNorm→Transpose→Conv)
  • All 237 unit tests pass (236 existing + 1 new)
  • Validated on ConvNext-tiny (opset 17): 23/23 LayerNorm nodes get Q/DQ on activation input
  • Built TRT engine in nvcr.io/nvidia/tensorrt:26.02-py3 — Conv layers after LayerNorm select i8i8 kernels

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Expanded quantization to include LayerNormalization in Q/DQ flows and detect Conv→LayerNormalization patterns for quantization.
  • Refactor

    • Improved internal graph-analysis helpers to make detection logic reusable and more reliable.
  • Tests

    • Added a model builder for Conv→LayerNorm graphs and a unit test validating end-to-end Conv→LayerNormalization quantization.

ajrasane self-assigned this Mar 24, 2026
ajrasane requested a review from a team as a code owner March 24, 2026 23:55
ajrasane requested a review from cjluo-nv March 24, 2026 23:55
@coderabbitai
Contributor

coderabbitai bot commented Mar 24, 2026

📝 Walkthrough

Identify LayerNormalization nodes whose inputs originate from Conv partitions (possibly via Transpose/Reshape/Copy), include them among quantization targets, register LayerNormalization in the QDQ registry, and add a test model + unit test to validate Conv→LayerNorm Q/DQ placement.

Changes

  • Core Quantization Logic (modelopt/onnx/quantization/graph_utils.py, modelopt/onnx/quantization/int8.py, modelopt/onnx/quantization/ort_utils.py): Added a module-level _is_following_cask_partition(node, cask_partition_nodes, max_depth=10) and find_conv_to_layernorm_nodes(...) to detect LayerNormalization nodes fed (possibly through copy-like ops) from CASK partitions; updated _find_nodes_to_quantize() to include these nodes in quantizable_nodes; registered LayerNormalization in the QDQRegistry, mapped to QDQNormalization.
  • Test Infrastructure (tests/_test_utils/onnx/lib_test_models.py): Added build_conv_layernorm_model(), which constructs an ONNX model (Conv → Transpose → LayerNormalization → Transpose → Conv), sets opset/IR versions, runs shape inference and model checking, and returns a ready onnx.ModelProto.
  • Unit Tests (tests/unit/onnx/quantization/test_qdq_rules_int8.py): Added test_conv_layernorm_quantization(), which uses the new model builder, runs quantization, asserts that the .quant.onnx output exists, and verifies that the Conv inputs and the LayerNormalization activation input are produced by DequantizeLinear nodes.

Sequence Diagram(s)

sequenceDiagram
    participant Graph as ONNX Graph
    participant Scanner as Conv→LayerNorm Scanner
    participant Quantizer as Quantization Engine
    participant QDQReg as QDQ Registry

    Graph->>Scanner: enumerate LayerNormalization nodes
    Scanner->>Graph: trace input producers backwards (through Transpose/Reshape/Copy)
    Graph-->>Scanner: report whether producer originates from CASK partition
    Scanner->>Quantizer: return matched LayerNormalization nodes
    Quantizer->>QDQReg: query handler for LayerNormalization
    QDQReg-->>Quantizer: mapped to QDQNormalization
    Quantizer->>Graph: mark/instrument Conv and LayerNorm activation paths with Q/DQ
    Graph-->>Quantizer: return quantized model

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 63.64%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly and specifically describes the main change: adding Q/DQ placement for Conv->LayerNorm patterns, the core functionality added across multiple files.
  • Security Anti-Patterns ✅ Passed: The PR introduces Conv→LayerNorm quantization handling with no security anti-patterns, unsafe function calls, hardcoded credentials, or problematic dependencies.


@github-actions
Contributor

github-actions bot commented Mar 24, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1117/

Built to branch gh-pages at 2026-03-25 19:28 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/graph_utils.py`:
- Around line 345-350: The code currently only adds outputs from the last_node
of each partition to cask_output_tensor_names which misses outputs produced by
earlier or branching nodes; update the loop over cask_fusible_partitions (and
the inner loop referencing last_node) to iterate over every node in each
partition (e.g., for node in partition) and collect node.outputs' names into
cask_output_tensor_names so all partition outputs are captured for
Conv→LayerNorm matching.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 763a51bf-433e-4e78-bdb4-d0e2466fae44

📥 Commits

Reviewing files that changed from the base of the PR and between 3431f8b and 88f2eca.

📒 Files selected for processing (5)
  • modelopt/onnx/quantization/graph_utils.py
  • modelopt/onnx/quantization/int8.py
  • modelopt/onnx/quantization/ort_utils.py
  • tests/_test_utils/onnx/lib_test_models.py
  • tests/unit/onnx/quantization/test_qdq_rules_int8.py

@codecov

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 80.64516% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.26%. Comparing base (291498b) to head (523da9d).

Files with missing lines: modelopt/onnx/quantization/graph_utils.py (patch 78.57%, 6 lines missing ⚠️)
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1117      +/-   ##
==========================================
+ Coverage   70.21%   70.26%   +0.05%     
==========================================
  Files         228      228              
  Lines       25952    25970      +18     
==========================================
+ Hits        18221    18247      +26     
+ Misses       7731     7723       -8     

☔ View full report in Codecov by Sentry.

Collaborator

@cjluo-nv left a comment


Review: Correctness & Simplicity

Overall the PR is well-motivated and the approach is sound. A few suggestions:

1. Deduplicate backward traversal logic

_is_input_from_cask_partition (new) does essentially the same backward-through-copy-ops traversal as the existing _is_following_cask_partition (line ~340), but with a different interface (tensors + output tensor name set vs. nodes + node name set).

Consider reusing _is_following_cask_partition directly. For each LayerNormalization node, get the producer of inputs[0] and call _is_following_cask_partition(producer). This would:

  • Eliminate _is_input_from_cask_partition entirely
  • Remove the cask_output_tensor_names set construction
  • Avoid the assumption that partition[-1] is the output node (see point 2)
for node in graph.nodes:
    if node.op != "LayerNormalization":
        continue
    inp_tensor = node.inputs[0]
    if inp_tensor.inputs:
        producer = inp_tensor.inputs[0]
        if _is_following_cask_partition(producer):
            conv_to_ln_nodes.append(node)

2. partition[-1] assumption

last_node = partition[-1]
for output_tensor in last_node.outputs:
    cask_output_tensor_names.add(output_tensor.name)

This assumes partitions are ordered with the output node last. If a CASK partition ever has different ordering or multiple output nodes not at the end, this would miss patterns. Reusing _is_following_cask_partition (which checks all nodes via cask_partition_nodes) avoids this assumption entirely.

3. QDQNormalization for LayerNorm — input count difference

BatchNormalization has 5 inputs while LayerNormalization has 3. Worth confirming that QDQNormalization only quantizes input[0] (activation) and doesn't attempt to access inputs that don't exist for LayerNorm.

4. Minor: recursive traversal without depth limit

_is_input_from_cask_partition recurses through all copy ops without a bound. The existing _is_following_cask_partition has the same issue, but a max-depth guard (e.g., 10) would be cheap insurance against pathological graphs for both.
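
The suggested max-depth guard can be illustrated with a minimal stand-alone sketch (dict-based stand-in nodes and the name traces_back_to_conv are illustrative, not the actual ModelOpt helpers):

```python
# Minimal illustration of a bounded backward traversal with a depth budget.
def traces_back_to_conv(node, max_depth=10):
    """Bounded backward walk: give up once the budget is exhausted."""
    if max_depth <= 0 or node is None:
        return False  # pathological or very deep chains terminate here
    if node["op"] == "Conv":
        return True
    return traces_back_to_conv(node.get("parent"), max_depth - 1)


# A chain of 50 copy ops between a Conv and the query node:
chain = {"op": "Conv", "parent": None}
for _ in range(50):
    chain = {"op": "Transpose", "parent": chain}

print(traces_back_to_conv(chain, max_depth=10))   # budget exhausted before Conv
print(traces_back_to_conv(chain, max_depth=100))  # Conv reached within budget
```

The guard trades completeness for a hard bound on work: realistic copy-op chains are short, so max_depth=10 should be ample while preventing unbounded recursion on unusual graphs.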

@ajrasane
Contributor Author

LayerNormalization is already present in is_normalization_op

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
ajrasane force-pushed the ajrasane/conv_layernorm branch from 45cf99f to 523da9d on March 25, 2026 19:24
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/onnx/quantization/graph_utils.py`:
- Around line 346-353: The helper _is_following_cask_partition incorrectly
requires every parent of a copy op to lead to a CASK partition; instead, when
encountering copy ops (e.g., Reshape, Squeeze, Unsqueeze — the copy/shape-only
ops used in activation path), only follow the data input edge (the activation
input, typically input index 0) rather than all parents. Update the logic where
get_parent_nodes(node) is used in _is_following_cask_partition so that for
parents whose op_type is a copy op you recurse only on the parent node that
supplies the data input (e.g., obtain the input node at index 0 and recurse on
that), while leaving non-copy parents unchanged; this will also fix callers such
as filter_quantizable_kgen_heads that rely on this helper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 486fadf3-a592-4f69-9c82-9ea690cc2d17

📥 Commits

Reviewing files that changed from the base of the PR and between 45cf99f and 523da9d.

📒 Files selected for processing (5)
  • modelopt/onnx/quantization/graph_utils.py
  • modelopt/onnx/quantization/int8.py
  • modelopt/onnx/quantization/ort_utils.py
  • tests/_test_utils/onnx/lib_test_models.py
  • tests/unit/onnx/quantization/test_qdq_rules_int8.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • modelopt/onnx/quantization/int8.py
  • tests/unit/onnx/quantization/test_qdq_rules_int8.py
  • tests/_test_utils/onnx/lib_test_models.py

Comment on lines +346 to +353
parent_nodes = get_parent_nodes(node)
if len(parent_nodes) == 0:
    return False

return all(
    _is_following_cask_partition(parent, cask_partition_nodes, max_depth - 1)
    for parent in parent_nodes
)


⚠️ Potential issue | 🟠 Major

Follow only the data edge through copy ops.

Line 350 currently requires every parent of a copy op to trace back to a CASK partition. That rejects valid Conv -> Reshape/Squeeze/Unsqueeze -> LayerNormalization chains when the shape/axes input is dynamic, because those auxiliary inputs are not part of the activation path. The same false negative now affects filter_quantizable_kgen_heads() too, since it shares this helper.

💡 Proposed fix
-    parent_nodes = get_parent_nodes(node)
-    if len(parent_nodes) == 0:
+    if not node.inputs or not node.inputs[0].inputs:
         return False
 
-    return all(
-        _is_following_cask_partition(parent, cask_partition_nodes, max_depth - 1)
-        for parent in parent_nodes
-    )
+    data_parent = node.inputs[0].inputs[0]
+    return _is_following_cask_partition(data_parent, cask_partition_nodes, max_depth - 1)
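
The behavioral difference can be demonstrated with stand-in nodes (names here are illustrative, not the actual modelopt helpers): a Reshape has two parents, the Conv activation and a dynamically computed shape, so requiring all parents to trace back to a Conv rejects the pattern, while following only the data edge (input 0) accepts it.

```python
# Stand-in demo of the false negative: all-parents traversal vs data-edge-only.
class N:
    def __init__(self, op, parents=()):
        self.op = op
        self.parents = list(parents)  # parents[0] feeds the data input


def via_all_parents(node, depth=10):
    """Require every parent to trace back to a Conv (the buggy behavior)."""
    if depth <= 0:
        return False
    if node.op == "Conv":
        return True
    if not node.parents:
        return False
    return all(via_all_parents(p, depth - 1) for p in node.parents)


def via_data_edge(node, depth=10):
    """Follow only the activation edge (input 0), as the fix proposes."""
    if depth <= 0:
        return False
    if node.op == "Conv":
        return True
    if not node.parents:
        return False
    return via_data_edge(node.parents[0], depth - 1)


conv = N("Conv")
shape_src = N("Shape")              # dynamic shape input, not from a Conv
reshape = N("Reshape", [conv, shape_src])

print(via_all_parents(reshape))  # the shape edge fails the check
print(via_data_edge(reshape))    # the activation path reaches the Conv
```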

Collaborator

@cjluo-nv left a comment


All review feedback has been addressed:

  • Backward traversal logic deduplicated into a single _is_following_cask_partition at module level
  • partition[-1] assumption removed — now checks all partition nodes
  • Recursion depth limit added (max_depth=10)
  • Confirmed QDQNormalization is safe for LayerNorm (LayerNormalization is in is_normalization_op, and only input[0]/output[0] are quantized)

LGTM.
