
Commit f245cba

label figs, figs link fix, minor updates to readme
1 parent fad2270 commit f245cba

File tree

1 file changed: +15 additions, -16 deletions

torchao_float8/README.md

Lines changed: 15 additions & 16 deletions
```diff
@@ -1,9 +1,8 @@
 # Worklog for TorchAO's Float8WeightOnlyConfig Inference Slowdown Profiling
-
-## The problem:
+## The Problem:
 TorchAO's Float8WeightOnlyConfig has a massive **decrease in inference throughput**, **low memory bandwidth utilization**, and **high peak VRAM** usage.
 
-TorchAO issue raised after I discovered this problem while working on a [survey of quantization formats](https://github.com/vipulSharma18/Survey-of-Quantization-Formats): [TorchAO GH Issue](https://github.com/pytorch/ao/issues/3288).
+TorchAO issue raised after I discovered this problem while working on a [survey of quantization formats](https://github.com/vipulSharma18/Survey-of-Quantization-Formats?tab=readme-ov-file#benchmarking-results): [TorchAO GH Issue](https://github.com/pytorch/ao/issues/3288).
 
 _Example:_ Meta-Llama-3.1-8B inference.
 
@@ -31,22 +30,22 @@ As a first, and possibly only step, we use the GPT-Fast benchmark provided by To
 
 ## Torch Memory Profile
 
-The 0-th inference iteration is profiled with CUDA memory snapshot.
-
-### Baseline
-Snapshot: `llama_benchmark/Meta-Llama-3.1-8B_None_torch_memory_profiler.pickle`
+The 0-th inference iteration is profiled with CUDA memory snapshot. The snapshots are available at the following paths: `llama_benchmark/Meta-Llama-3.1-8B_None_torch_memory_profiler.pickle`, `llama_benchmark/Meta-Llama-3.1-8B_float8dq-tensor_torch_memory_profiler.pickle`, `llama_benchmark/Meta-Llama-3.1-8B_float8wo_torch_memory_profiler.pickle`.
 
-![Baseline Whole Timeline](figures/none_whole_timeline.png)
+<div align="center">
+<img src="figures/none_whole_timeline.png" alt="Baseline Whole Timeline" width="800">
+<p><strong>Figure 1:</strong> Baseline Whole Timeline</p>
+</div>
 
-### FP8 Static Weights and Dynamic Activations Quantization
-Snapshot: `llama_benchmark/Meta-Llama-3.1-8B_float8dq-tensor_torch_memory_profiler.pickle`
-
-![FP8 Weights and Activations DQ Whole Timeline](figures/none_whole_timeline.png)
-
-### FP8 Static Weights Only Quantization
-Snapshot: `llama_benchmark/Meta-Llama-3.1-8B_float8wo_torch_memory_profiler.pickle`
+<div align="center">
+<img src="figures/float8dq_whole_timeline.png" alt="FP8 Weights and Activations DQ Whole Timeline" width="800">
+<p><strong>Figure 2:</strong> FP8 Weights and Activations DQ Whole Timeline</p>
+</div>
 
-![FP8 Weights Only Static Quantization Whole Timeline](figures/none_whole_timeline.png)
+<div align="center">
+<img src="figures/float8wo_whole_timeline.png" alt="FP8 Weights Only Static Quantization Whole Timeline" width="800">
+<p><strong>Figure 3:</strong> FP8 Weights Only Static Quantization Whole Timeline</p>
+</div>
 
 ## Torch Execution Trace
 ### Baseline
```
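The README being patched reports **higher** peak VRAM under Float8WeightOnlyConfig, even though fp8 weight-only quantization should roughly halve weight memory relative to bf16. A quick back-of-envelope check makes that expectation concrete (the round 8e9 parameter count is an illustrative assumption for an ~8B model, not a figure from the worklog):

```python
# Back-of-envelope weight-memory estimate: bf16 vs fp8 weight-only.
# N_PARAMS is an assumed round count for an ~8B-parameter model.

GIB = 1024 ** 3

def weight_gib(n_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the model weights alone, in GiB."""
    return n_params * bytes_per_param / GIB

N_PARAMS = 8_000_000_000            # illustrative, not exact for Llama-3.1-8B
bf16_gib = weight_gib(N_PARAMS, 2)  # bf16: 2 bytes per parameter
fp8_gib = weight_gib(N_PARAMS, 1)   # fp8 (e4m3/e5m2): 1 byte per parameter

print(f"bf16 weights: {bf16_gib:.2f} GiB")
print(f"fp8  weights: {fp8_gib:.2f} GiB")
```

Since weight storage alone should shrink by about 2x, a peak-VRAM *increase* in the snapshots above points at extra intermediate allocations during inference (for example, transient dequantized weight copies) rather than at the stored weights themselves.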
