Inference Profiling and Optimization Worklog

A collection of inference profiling problems that I ran into during my work. This repo contains a worklog for each problem and is meant as a guide for building intuition about how to analyze a profile systematically and find the root cause of a slowdown.

Each problem comes with a Docker container so that both the environment and the inference slowdown can be reproduced.

Problems:

Problem 1 - TorchAO Float8 (In Progress): TorchAO's Float8WeightOnlyConfig is much slower than the eager baseline and has much higher peak VRAM usage. I discovered this while working on a survey of quantization formats. The issue was found on TorchAO 0.13.0 and is resolved in versions 0.14.1 and above. Tested on RTX 4090 and RTX 5090.

Problem 2 - GemLite Autotune (Not Started): Understand the optimizations that the autotuner performs, using GemLite's Triton GEMM kernels on an RTX 4090 as an example. An opportunity to use the trace difference functionality of nsys and ncu, and to recognize the different kernel optimization routes that the autotuner might pick for different inputs and kernel parameters.

Problem 3 - TorchInductor CUDA Graph Memory (Not Started): CUDA graphs can lead to high GPU memory usage when used with quantized inference, even without input duplication (for shape alignment). This problem aims to understand the impact of CUDA graphs on GPU memory usage and execution performance.

Note: Details of the docker container and environment setup can be found in DOCKER_README.md.

Profilers used:

Torch Profiler
NVIDIA Nsight Systems
NVIDIA Nsight Compute

Common bottlenecks:

Coarse-Grained Bottlenecks: The GPU's Streaming Multiprocessors (SMs) stall (sit idle) while waiting on a dependency:
a. CPU/Host side operations required by a GPU kernel.
b. Input/Output from/to the GPU global memory.
c. Network data movement (including CPU to GPU buffer copies) in case of multi-GPU computation.
d. Launch overhead for each GPU kernel.
Fine-Grained Bottlenecks: Latency caused at the instruction level in the GPU kernel:
a. Use of atomic instructions to prevent contention at the same memory location.
b. Inter-SM communication via the global memory.
c. Quantization effects: warp divergence at the warp level, wave quantization when the last wave of thread blocks only partially fills the SMs, and tile quantization when data tiles are only partially filled. All three lower the goodput of the SMs.
d. Shared memory bank conflicts: Threads in the same warp requesting different addresses that map to the same memory bank.
e. Register spilling to the local memory.
f. L2 cache misses.

Principled approach to inference optimization:

Always work top-down, from coarse-grained profiling to fine-grained profiling, when optimizing inference workloads. Below is a list of typical bottlenecks and tricks that arise at each level of inference optimization.

Model Serving/User Request Batching Level:

Batching of user requests as per the token budget and the type of operation (prefill v/s decode).
Different dynamic batching algorithms.
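To make token-budget batching concrete, here is a minimal sketch of one possible policy. The Request class, the build_batch function and the decode-first greedy rule are all hypothetical names and choices made for illustration; they are not taken from any particular serving framework. A decode step is assumed to cost 1 token and a prefill to cost its full prompt length.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    phase: str   # "prefill" or "decode"
    tokens: int  # prompt length for a prefill, 1 for a single decode step


def build_batch(queue, token_budget):
    """Greedily pack requests into one batch without exceeding token_budget.

    Assumed policy (illustrative only): decodes are admitted before prefills
    so that running requests keep making progress.
    """
    batch, used = [], 0
    # Stable sort: decode requests first, original order otherwise preserved.
    ordered = sorted(queue, key=lambda r: r.phase != "decode")
    for req in ordered:
        if used + req.tokens <= token_budget:
            batch.append(req)
            used += req.tokens
    return batch, used


queue = [
    Request("a", "prefill", 6),
    Request("b", "decode", 1),
    Request("c", "prefill", 4),
    Request("d", "decode", 1),
]
batch, used = build_batch(queue, token_budget=8)
print([r.req_id for r in batch], used)
```

With a budget of 8 tokens, both decodes and the 6-token prefill fit, while the 4-token prefill waits for the next batch. Real schedulers typically add chunked prefill, priorities and KV-cache limits on top of a rule like this.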

Multi-GPU:

Types of parallelism within a model's forward pass given 1 input sample: EP, CP, SP, TP, PP.
Types of distributed Attention: Ring Attention, Ulysses-Attention.
Types of reductions across a network: Ring, Tree, Butterfly.
Types of network interfaces: P2P, InfiniBand, GPU-RDMA, NVLink Switch (NVLS SHARP), PCIe.
Types of communication libraries and their kernels: NCCL, NVSHMEM.
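As an illustration of the ring reduction mentioned above, the following is a pure-Python simulation of a ring all-reduce (reduce-scatter followed by all-gather). It is a sketch for building intuition, not how NCCL is implemented; the function name ring_all_reduce and the simplification of one number per chunk are assumptions made here.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over a ring of n ranks.

    buffers[r] is rank r's data, split into n chunks (one number per chunk
    for simplicity). Returns (final buffers, total chunk sends).
    """
    n = len(buffers)
    assert all(len(b) == n for b in buffers), "need exactly n chunks per rank"
    data = [list(b) for b in buffers]
    sends = 0

    # Phase 1, reduce-scatter: after n-1 steps, rank r owns the fully
    # reduced chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing values first: all ranks send simultaneously.
        outgoing = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in outgoing:
            data[(r + 1) % n][chunk] += value
            sends += 1

    # Phase 2, all-gather: circulate the reduced chunks around the ring.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, chunk, value in outgoing:
            data[(r + 1) % n][chunk] = value
            sends += 1

    return data, sends


result, sends = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result, sends)
```

Each rank sends 2(n-1) chunks of size 1/n of the buffer, so the per-rank traffic is 2(n-1)/n of the buffer size and stays nearly constant as n grows. The cost is 2(n-1) sequential steps, which is why tree and butterfly schemes can be preferable for small, latency-bound messages.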

Single GPU:

Reducing CPU launch overhead with CUDA graphs and persistent kernels.
Overlapping CPU work with GPU kernels via async launching.
Cache hits: Global memory v/s L2 cache.
Coalesced memory accesses from global memory.
Async memory load from the global memory to overlap memory load with compute.
Zero-copy movement of data from CPU to GPU.
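The effect of launch overhead and async launching can be estimated with a toy timeline model before opening a profiler. The sketch below is an assumption-laden simplification: all kernels are identical, the CPU issues launches back to back, and a CUDA graph is modeled as a single launch followed by gapless kernel execution (graph instantiation and per-node overheads are ignored). The function names and the microsecond values are hypothetical.

```python
def async_launch_timeline(kernel_us, launch_us, n_kernels):
    """Return (total_us, gpu_idle_us) for n identical kernels launched asynchronously.

    The CPU needs launch_us to issue each launch; a kernel starts once it has
    been issued and the previous kernel has finished.
    """
    cpu_time = 0.0   # time at which the CPU finishes issuing the current launch
    gpu_free = 0.0   # time at which the GPU finishes the previous kernel
    gpu_idle = 0.0
    for _ in range(n_kernels):
        cpu_time += launch_us
        start = max(cpu_time, gpu_free)
        gpu_idle += start - gpu_free
        gpu_free = start + kernel_us
    return gpu_free, gpu_idle


def graph_timeline(kernel_us, launch_us, n_kernels):
    """Same workload replayed as one graph: a single launch, then gapless kernels."""
    return launch_us + n_kernels * kernel_us


# Launch-bound case: 2 us kernels with 5 us of CPU launch cost each.
print(async_launch_timeline(kernel_us=2, launch_us=5, n_kernels=100))
print(graph_timeline(kernel_us=2, launch_us=5, n_kernels=100))
# GPU-bound case: 20 us kernels hide the launch cost almost entirely.
print(async_launch_timeline(kernel_us=20, launch_us=5, n_kernels=100))
```

When kernels are shorter than the launch cost, the GPU idles between kernels and the total time is set by the CPU; this shows up in an Nsight Systems timeline as gaps between kernels. When kernels are longer than the launch cost, async launching alone already hides the overhead.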

Cluster of Cooperative Thread Arrays (Multi-SM):

Communication latency across SMs. Avoid round trips through global memory by letting the thread blocks in a cluster directly access each other's shared memory.

Cooperative Thread Array (SM):

Shared memory bank conflicts and swizzling of data to avoid it.
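Bank conflicts and the effect of swizzling can be checked on paper with a small address calculation. The sketch below assumes 32 banks of 4-byte words, which is the shared memory layout on recent NVIDIA GPUs, and a 32x32 tile of 4-byte elements; the helper names and the XOR swizzle pattern are illustrative choices, not taken from a specific library.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, each 4 bytes wide
TILE = 32       # 32x32 tile of 4-byte elements


def conflict_degree(word_addresses):
    """Largest number of distinct words that one warp requests from a single bank.

    1 means conflict-free; k means the access is serialized into k passes.
    Threads reading the same word are served by a broadcast and do not conflict.
    """
    banks = defaultdict(set)
    for addr in word_addresses:
        banks[addr % NUM_BANKS].add(addr)
    return max(len(words) for words in banks.values())


col = 5  # every thread (one per row) reads the same column of the tile

naive = [row * TILE + col for row in range(TILE)]             # row-major layout
padded = [row * (TILE + 1) + col for row in range(TILE)]      # one padding word per row
swizzled = [row * TILE + (col ^ row) for row in range(TILE)]  # XOR swizzle of the column

print(conflict_degree(naive), conflict_degree(padded), conflict_degree(swizzled))
```

The naive column read hits a single bank 32 times, while both padding and XOR swizzling spread the same read across all 32 banks. Swizzling does so without spending extra shared memory on padding, and row reads remain conflict-free because `col ^ row` is a permutation of the columns within each row.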

Warp Scheduler level:

Warp specialization, with the dependencies between warps managed implicitly by the warp scheduler.

Warp level:

Warp divergence, wave and tile quantization.
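Wave and tile quantization can be estimated with simple arithmetic before profiling. The sketch below assumes a GEMM whose output is split into fixed-size tiles with one thread block per tile; the function names are hypothetical, and the default of 128 SMs corresponds to the RTX 4090 and should be adjusted for other devices.

```python
import math


def tile_efficiency(m, n, tile_m, tile_n):
    """Fraction of computed output elements that are useful, and the tile count.

    Partially filled edge tiles still do a full tile's worth of work.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    return (m * n) / (tiles * tile_m * tile_n), tiles


def wave_efficiency(num_tiles, num_sms=128, blocks_per_sm=1):
    """Fraction of SM slots kept busy across all waves, and the wave count.

    A wave is one full round of thread blocks resident on the GPU at once.
    """
    slots = num_sms * blocks_per_sm
    waves = math.ceil(num_tiles / slots)
    return num_tiles / (waves * slots), waves


# Tile quantization: a 257x128 output with 128x128 tiles needs 3 tiles.
print(tile_efficiency(257, 128, 128, 128))
# Wave quantization: 129 tiles on 128 SMs need a second, almost empty wave.
print(wave_efficiency(129))
print(wave_efficiency(128))
```

One extra row beyond a tile boundary, or one extra tile beyond a full wave, costs close to a whole additional tile or wave. This is why kernel runtime as a function of problem size tends to look like a staircase rather than a smooth curve.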

Thread level:

Usage/load on the CUDA cores v/s Tensor Cores v/s Special Function Units. Avoiding contention between these compute resources.
Avoiding contention for memory resources such as registers by using Tensor Cores and TMEM.

References/Resources to learn the background:

Nsight Systems Docs: https://docs.nvidia.com/nsight-systems/index.html
Nsight Compute Docs: https://docs.nvidia.com/nsight-compute/index.html
(Detailed walkthrough of the whole flow) Introduction to Kernel Performance Analysis with NVIDIA Nsight Compute: https://www.youtube.com/watch?v=fsC3QeZHM1U
