# Worklog for TorchAO's Float8WeightOnlyConfig Inference Slowdown Profiling

## The Problem
TorchAO's Float8WeightOnlyConfig suffers from a massive **decrease in inference throughput**, **low memory bandwidth utilization**, and **high peak VRAM** usage.
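For context, this is roughly how the config under test is applied. A minimal sketch, assuming a recent `torchao` release that exposes `quantize_` and `Float8WeightOnlyConfig`; the toy `nn.Linear` stack is a stand-in for the Llama checkpoint benchmarked here:

```python
import torch
from torchao.quantization import Float8WeightOnlyConfig, quantize_

# Stand-in module; the worklog benchmarks Meta-Llama-3.1-8B instead.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# Quantize linear weights to float8 in place; activations stay in bf16.
quantize_(model, Float8WeightOnlyConfig())

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    y = model(x)
```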
I discovered this problem while working on a [survey of quantization formats](https://github.com/vipulSharma18/Survey-of-Quantization-Formats?tab=readme-ov-file#benchmarking-results) and raised a TorchAO issue for it: [TorchAO GH Issue](https://github.com/pytorch/ao/issues/3288).

_Example:_ Meta-Llama-3.1-8B inference.
As a first (and possibly only) step, we use the GPT-Fast benchmark provided by TorchAO.

## Torch Memory Profile

The 0-th inference iteration is profiled with a CUDA memory snapshot. The snapshots are available at the following paths:
- `llama_benchmark/Meta-Llama-3.1-8B_None_torch_memory_profiler.pickle`
- `llama_benchmark/Meta-Llama-3.1-8B_float8dq-tensor_torch_memory_profiler.pickle`
- `llama_benchmark/Meta-Llama-3.1-8B_float8wo_torch_memory_profiler.pickle`
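For reference, a minimal sketch of one way to record such a snapshot, using PyTorch's (underscore-prefixed, semi-public) allocator history API; `run_inference` is a placeholder for the profiled iteration, and the benchmark's own hooks may differ. The resulting pickle can be inspected at https://pytorch.org/memory_viz:

```python
import torch

# Start recording allocator events, including allocation stack traces.
torch.cuda.memory._record_memory_history(max_entries=100_000)

run_inference()  # placeholder for the 0-th inference iteration

# Dump the recorded history; open the pickle at pytorch.org/memory_viz.
torch.cuda.memory._dump_snapshot("Meta-Llama-3.1-8B_float8wo_torch_memory_profiler.pickle")

# Stop recording.
torch.cuda.memory._record_memory_history(enabled=None)
```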
<div align="center">
  <img src="figures/none_whole_timeline.png" alt="Baseline Whole Timeline" width="800">
  <p><strong>Figure 1:</strong> Baseline Whole Timeline</p>
</div>

<div align="center">
  <img src="figures/float8dq_whole_timeline.png" alt="FP8 Weights and Activations DQ Whole Timeline" width="800">
  <p><strong>Figure 2:</strong> FP8 Weights and Activations DQ Whole Timeline</p>
</div>

<div align="center">
  <img src="figures/float8wo_whole_timeline.png" alt="FP8 Weights Only Static Quantization Whole Timeline" width="800">
  <p><strong>Figure 3:</strong> FP8 Weights Only Static Quantization Whole Timeline</p>
</div>

## Torch Execution Trace
### Baseline
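The traces in this section are Chrome/Perfetto-style execution traces. A minimal sketch of capturing one with `torch.profiler` (the benchmark's actual profiling invocation may differ; `run_inference` is a placeholder):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Profile both CPU-side dispatch and CUDA kernel execution.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True,
) as prof:
    run_inference()  # placeholder for one benchmark iteration

# Export a trace viewable in chrome://tracing or https://ui.perfetto.dev.
prof.export_chrome_trace("Meta-Llama-3.1-8B_float8wo_trace.json")
```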