
Commit 97d5c87

delete outdated benchmarking scripts

Summary: #3449 is a newer version of these scripts which uses the HuggingFace model definition.
Test Plan: CI
ghstack-source-id: 9d85193
ghstack-comment-id: 3628761600
Pull-Request: #3466

1 parent 931ea73 commit 97d5c87

3 files changed: +5, -475 lines

torchao/_models/llama/README.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -1,5 +1,8 @@
 # Llama Benchmarks
 
+> [!WARNING]
+> These benchmarks are deprecated.
+
 The llama folder contains code/scripts for stable benchmarking llama models.
 
 To get model weights, go to https://huggingface.co/meta-llama/Llama-2-7b, https://huggingface.co/meta-llama/Meta-Llama-3-8B, https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
@@ -8,8 +11,8 @@ and follow the steps to gain access.
 Then from the torchao root directory use `huggingface-cli login` and follow the steps to login, then `sh ./scripts/prepare.sh` to
 download and convert the model weights
 
-once done you can execute benchmarks from the torchao/_models/llama dir with `sh benchmarks.sh`. You can perform and benchmarking or evaluation
-directly using `generate.py` or `eval.py`.
+once done you can execute benchmarks from the torchao/_models/llama dir with `sh benchmarks.sh`. You can perform benchmarking
+directly using `generate.py`.
 
 ## KV Cache Quantization - Memory Efficient Inference
 We've added some features to `model.py` compared to the original gpt-fast implementation in order to enable long context length (and necessarily memory efficient) inference. Specifically we've added kv_cache quantization and a linear_causal_mask implementation which are **able to reduce memory usage by 50-60%** at long context lengths.
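For reference, a minimal sketch of the (now-deprecated) benchmark workflow the README describes, written as a shell session. The checkpoint path and the `--kv_cache_quantization` flag are illustrative assumptions based on the scripts referenced above and may differ between torchao versions:

```sh
# Sketch of the deprecated benchmark workflow described in the README.
# Assumes access to the Llama weights has been granted on the Hugging Face Hub.

# From the torchao root directory: authenticate, then download and convert weights.
huggingface-cli login
sh ./scripts/prepare.sh

# From torchao/_models/llama: run the full benchmark suite.
cd torchao/_models/llama
sh benchmarks.sh

# Or benchmark a single configuration directly with generate.py.
# The checkpoint path and the kv-cache flag below are assumptions for illustration.
python generate.py --checkpoint_path ../../../checkpoints/meta-llama/Llama-2-7b/model.pth
python generate.py --checkpoint_path ../../../checkpoints/meta-llama/Llama-2-7b/model.pth --kv_cache_quantization
```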
