[Common] Reduced padding kernel compilation time #2827
Oleg-Goncharov wants to merge 3 commits into NVIDIA:main
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Greptile Summary
This PR removes the
Confidence Score: 5/5
Safe to merge: minimal one-line removal per kernel, with benchmark data confirming no performance regression. The change removes two
No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[multi_padding_kernel / multi_unpadding_kernel] --> B[Find tensor for block]
B --> C["outer loop: for iter in 0..n_iterations\n(n_iterations = WARP_SIZE / n_warps = 8)\nNo longer force-unrolled"]
C --> D["#pragma unroll\nfor i2 in 0..nvec"]
D --> E[Load input vector]
E --> F["#pragma unroll\nfor j2 in 0..nvec — copy to output"]
F --> G{row < num_rows?}
G -- yes --> H[Store output vector]
G -- no --> I{row < padded_num_rows?}
I -- yes --> J[Write zeros — padding kernel only]
I -- no --> K[Skip]
H --> C
J --> C
K --> C
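The per-row decision in the flowchart (store, zero-fill, or skip) can be sketched on the host in plain C++. This is a minimal illustration rather than the code in padding.cu: the function name `pad_rows_reference`, the tile height `kTileRows`, and the `-1` sentinel are all assumptions made for the example. It only shows why rows beyond `padded_num_rows` must be skipped when the work is rounded up to whole tiles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference; names are illustrative, not from padding.cu.
// Copies a row-major [num_rows x row_length] matrix into a
// [padded_num_rows x row_length] buffer and zero-fills the extra rows.
std::vector<float> pad_rows_reference(const std::vector<float>& input,
                                      std::size_t num_rows,
                                      std::size_t row_length,
                                      std::size_t padded_num_rows) {
  assert(padded_num_rows >= num_rows);
  assert(input.size() == num_rows * row_length);

  // Assumed tile height: work is rounded up to whole tiles, so some visited
  // rows can lie past the padded extent (the "Skip" branch in the flowchart).
  constexpr std::size_t kTileRows = 4;
  const std::size_t covered_rows =
      (padded_num_rows + kTileRows - 1) / kTileRows * kTileRows;

  // Sentinel fill makes it visible which elements were actually written.
  std::vector<float> output(padded_num_rows * row_length, -1.0f);

  // Corresponds to the outer loop, which the PR no longer force-unrolls.
  for (std::size_t row = 0; row < covered_rows; ++row) {
    if (row < num_rows) {
      // Real data: copy the row to the output.
      for (std::size_t col = 0; col < row_length; ++col) {
        output[row * row_length + col] = input[row * row_length + col];
      }
    } else if (row < padded_num_rows) {
      // Padding region: write zeros (padding direction only).
      for (std::size_t col = 0; col < row_length; ++col) {
        output[row * row_length + col] = 0.0f;
      }
    }
    // Otherwise the row is outside the padded tensor and nothing is written.
  }
  return output;
}
```

For a 3 x 2 input padded to 5 rows, the first three output rows hold the input, the next two are zero, and the remaining rows of the last tile are never touched.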
Reviews (3): Last reviewed commit: "Merge branch 'main' into pr_reduced_padd..."
Please benchmark the kernel before and after this change.
Description
This PR reduces the compilation time of `padding.cu` from approximately 600 seconds to 3 seconds by removing the outer-loop unroll.

Kernel performance remains effectively unchanged across different outer-loop unroll factors. The input multi-tensor consists of square tensors with dimensions {1024, 2048, 4096, 8192, 16384}. Measured kernel runtime in microseconds:
Type of change
Changes
Please list the changes introduced in this PR:
`#pragma unroll` directive.

Checklist: