Perf: Some small enhancements #229
Conversation
Could you please sync your repo with the original? Yours is 16 commits behind, and it doesn't work on a MacBook due to some issues I fixed later, which aren't included in your repo. I'm mentioning this because I want to test it.
buttercrab left a comment:

LGTM
Could you fix the lint & format?
@tsdocode I know the issue is closed, but I did get to test it. I saw a ~30% increase in processing speed on my M3 MacBook Pro with 36GB of unified memory. It's a huge difference, very noticeable as well. Here are the logs from my recent run of example/simple-mac.py:

```
generate step 86: speed=6.268 tokens/s, realtime factor=0.073x
generate step 172: speed=11.486 tokens/s, realtime factor=0.134x
generate step 258: speed=11.468 tokens/s, realtime factor=0.133x
generate step 344: speed=11.472 tokens/s, realtime factor=0.133x
generate step 430: speed=11.379 tokens/s, realtime factor=0.132x
generate step 516: speed=11.226 tokens/s, realtime factor=0.131x
generate step 602: speed=11.396 tokens/s, realtime factor=0.133x
generate step 688: speed=11.337 tokens/s, realtime factor=0.132x
generate: avg steps=758.0, total duration=75.467s
```

And below are the logs of a previous run of the same script with the previous version:

```
generate: starting generation loop
generate step 86: speed=7.759 tokens/s, realtime factor=0.090x
generate step 172: speed=8.470 tokens/s, realtime factor=0.098x
generate step 258: speed=8.536 tokens/s, realtime factor=0.099x
generate step 344: speed=8.489 tokens/s, realtime factor=0.099x
generate step 430: speed=8.607 tokens/s, realtime factor=0.100x
generate step 516: speed=8.615 tokens/s, realtime factor=0.100x
generate step 602: speed=8.592 tokens/s, realtime factor=0.100x
generate step 688: speed=8.597 tokens/s, realtime factor=0.100x
generate: avg steps=747.0, total duration=91.026s
```
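For reference, the "~30%" figure can be checked directly against the steady-state throughput reported at step 688 in the two logs above:

```python
# Steady-state tokens/s taken from the step-688 lines of the two logs above
new_tps = 11.337  # after the optimization
old_tps = 8.597   # before the optimization

speedup = new_tps / old_tps
print(f"{(speedup - 1) * 100:.1f}% faster")  # prints "31.9% faster"
```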
@V12Hero Add this to your code before running generate; it fuses the QKV operation, and maybe it will help speed things up a little more:
@tsdocode is this how you would want it placed?

```python
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."

# Patch every decoder layer's self-attention to use the fused QKV projection.
for layer in model.model.decoder.layers:
    layer.self_attention.patch_fused_qkv()

# `use_torch_compile` must be `False` on macOS, because `torch.compile`
# is not supported there.
output = model.generate(text, use_torch_compile=False, verbose=True)

model.save_audio("simple.mp3", output)
```
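For readers wondering what "fused QKV" means here, a minimal NumPy sketch of the general idea (an illustration only, not Dia's actual implementation): the three separate Q, K, and V projections are replaced by a single matmul against a concatenated weight, whose output is then split in three.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))  # (seq_len, d_model)
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

# Unfused: three separate matmuls.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused: one matmul against the concatenated weight, then split.
w_qkv = np.concatenate([wq, wk, wv], axis=1)  # (d, 3*d)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)

# Results are identical; the win is one larger kernel launch instead of three.
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The fused form tends to help most on small batch sizes, where per-matmul launch overhead is a larger fraction of the step time.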
Yes
Ok, I can't see any difference. The slight dip in performance is because I'm now running a bit low on battery, but overall I'd say the performance is the same:

```
generate: starting generation loop
generate step 86: speed=9.641 tokens/s, realtime factor=0.112x
generate step 172: speed=11.205 tokens/s, realtime factor=0.130x
generate step 258: speed=11.287 tokens/s, realtime factor=0.131x
generate step 344: speed=11.277 tokens/s, realtime factor=0.131x
generate step 430: speed=11.266 tokens/s, realtime factor=0.131x
generate step 516: speed=11.194 tokens/s, realtime factor=0.130x
generate step 602: speed=11.253 tokens/s, realtime factor=0.131x
generate step 688: speed=11.281 tokens/s, realtime factor=0.131x
generate step 774: speed=11.173 tokens/s, realtime factor=0.130x
generate: avg steps=772.0, total duration=71.836s
```

Did in PR:
- Split `Attention` into 2 classes, `SelfAttention` and `CrossAttention`, for further optimization

Test scripts:

Result on A100 80GB:

Other room for speed-up: