Commit ee796cc

delete outdated benchmarking scripts
Summary: #3449 is a newer version of these which uses the HuggingFace model definition.

Test Plan: CI

ghstack-source-id: 0ad33cb
ghstack-comment-id: 3628761600
Pull-Request: #3466
1 parent 78c048f commit ee796cc

3 files changed: +4, -475 lines

torchao/_models/llama/README.md

Lines changed: 4 additions & 2 deletions
```diff
@@ -1,5 +1,7 @@
 # Llama Benchmarks
 
+> :warning: **Warning:** These benchmarks are deprecated.
+
 The llama folder contains code/scripts for stable benchmarking llama models.
 
 To get model weights, go to https://huggingface.co/meta-llama/Llama-2-7b, https://huggingface.co/meta-llama/Meta-Llama-3-8B, https://huggingface.co/meta-llama/Meta-Llama-3.1-8B
@@ -8,8 +10,8 @@ and follow the steps to gain access.
 Then from the torchao root directory use `huggingface-cli login` and follow the steps to login, then `sh ./scripts/prepare.sh` to
 download and convert the model weights
 
-once done you can execute benchmarks from the torchao/_models/llama dir with `sh benchmarks.sh`. You can perform and benchmarking or evaluation
-directly using `generate.py` or `eval.py`.
+once done you can execute benchmarks from the torchao/_models/llama dir with `sh benchmarks.sh`. You can perform and benchmarking
+directly using `generate.py`.
 
 ## KV Cache Quantization - Memory Efficient Inference
 We've added some features to `model.py` compared to the original gpt-fast implementation in order to enable long context length (and necessarily memory efficient) inference. Specifically we've added kv_cache quantization and a linear_causal_mask implementation which are **able to reduce memory usage by 50-60%** at long context lengths.
```
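For context, the workflow that the now-deprecated README describes looks roughly like the shell session below. The `huggingface-cli login`, `sh ./scripts/prepare.sh`, `sh benchmarks.sh`, and `generate.py` steps come from the README text itself; the `--checkpoint_path` and `--kv_cache_quantization` flags on `generate.py` are illustrative assumptions, not verified options of the deleted scripts.

```bash
# Rough sketch of the deprecated benchmark workflow described in the README diff above.

# 1. Authenticate so the gated Llama weights can be downloaded.
huggingface-cli login

# 2. Download and convert the model weights (run from the torchao root directory).
sh ./scripts/prepare.sh

# 3. Run the benchmark suite from the llama model directory.
cd torchao/_models/llama
sh benchmarks.sh

# 4. Or run a single benchmark directly with generate.py.
#    The flag names below are hypothetical, shown only to illustrate enabling
#    the kv_cache quantization feature mentioned in the README.
python generate.py \
    --checkpoint_path checkpoints/meta-llama/Meta-Llama-3-8B/model.pth \
    --kv_cache_quantization
```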
