
Conversation

@tsdocode (Contributor) commented May 27, 2025

Done in this PR:

  1. Split Attention into two classes, SelfAttention and CrossAttention, for further optimization
  2. Added a fused QKV projection: q_proj, k_proj, and v_proj now run as a single matmul (see the sketch right after this list)
  3. Enhanced RoPE: the cos/sin tables are reused for self-attention
  4. Adjusted the KV cache so max_length == max_tokens instead of always using the model's max length => less VRAM use, slightly faster (a sizing sketch follows the test script below)
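
A minimal sketch of what a fused QKV patch can look like, assuming q, k, and v share the same output dimension; the class body, shapes, and weight layout are illustrative, only the patch_fused_qkv name matches the PR:

import torch
import torch.nn as nn

class SelfAttentionSketch(nn.Module):
    """Illustrative only: replace three separate projections with one GEMM."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # Separate projections, as before the patch.
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.qkv_proj = None

    def patch_fused_qkv(self):
        # Stack the three weight matrices so one matmul yields q, k, and v.
        out_dim, in_dim = self.q_proj.weight.shape
        fused = nn.Linear(in_dim, 3 * out_dim, bias=False)
        with torch.no_grad():
            fused.weight.copy_(torch.cat(
                [self.q_proj.weight, self.k_proj.weight, self.v_proj.weight], dim=0))
        self.qkv_proj = fused.to(self.q_proj.weight.device, self.q_proj.weight.dtype)

    def project_qkv(self, x: torch.Tensor):
        # One kernel launch instead of three once the patch has been applied.
        if self.qkv_proj is not None:
            return self.qkv_proj(x).chunk(3, dim=-1)
        return self.q_proj(x), self.k_proj(x), self.v_proj(x)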

Test script:

from random import choice

import torch

from dia.model import Dia


torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.triton.unique_kernel_names = True
torch._inductor.config.fx_graph_cache = True

# debugging
torch._logging.set_logs(graph_breaks=True, recompiles=True)

model_name = "nari-labs/Dia-1.6B"
compute_dtype = "float16"

model = Dia.from_pretrained(model_name, compute_dtype=compute_dtype)


# Patch every decoder layer so q/k/v are computed by a single fused projection
for idx in range(len(model.model.decoder.layers)):
    layer = model.model.decoder.layers[idx]
    layer.self_attention.patch_fused_qkv()


test_cases = [
    "[S1] Dia is an open weights text to dialogue model.",
    "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face.",
    "[S1] torch.compile is a new feature in PyTorch that allows you to compile your model with a single line of code.",
    "[S1] torch.compile is a new feature in PyTorch that allows you to compile your model with a single line of code. [S2] It is a new feature in PyTorch that allows you to compile your model with a single line of code.",
]


use_torch_compile = True

MAX_TOKENS = 86 * 5  # ~5 seconds of audio at the ~86 tokens/s frame rate

# Warm up (lets torch.compile finish compiling before the benchmark)
for _ in range(2):
    text = choice(test_cases)
    output = model.generate(text, use_torch_compile=use_torch_compile, verbose=True, max_tokens=MAX_TOKENS)

text = choice(test_cases)

# Benchmark
for i in range(10):
    output = model.generate(text, use_torch_compile=use_torch_compile, verbose=True, max_tokens=MAX_TOKENS)
    text = choice(test_cases)
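
As a side note on item 4 above, a minimal sketch of sizing the KV cache to the requested max_tokens instead of the model's full context length; the function name and tensor layout are illustrative, not the PR's actual code:

import torch

def allocate_kv_cache(batch, num_heads, head_dim, max_tokens, model_max_len,
                      dtype=torch.float16, device="cuda"):
    # Before: the cache was always model_max_len long.
    # After: it is only as long as the generation actually needs.
    cache_len = min(max_tokens, model_max_len)
    k_cache = torch.zeros(batch, num_heads, cache_len, head_dim, dtype=dtype, device=device)
    v_cache = torch.zeros(batch, num_heads, cache_len, head_dim, dtype=dtype, device=device)
    return k_cache, v_cache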

Result on an A100 80GB:

~216 tokens/s => ~232 tokens/s (~7% faster)

Remaining room for speed-up:

  • The flame graph shows a roughly 20% gap in per-token generation time between the GPU launch (_decode_step) and the sampling phase (computing the next token and checking the stop condition), which looks like a further optimization target. A sketch of one possible direction follows.

(Flame graph screenshots attached to the PR.)
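
Not part of this PR, just a rough sketch of one way that gap could be narrowed: keep sampling and the stop check on the GPU and synchronize with the host only every few steps, so the CPU is not blocked after every token. decode_step, sample_next_token, and the tensor shapes are hypothetical placeholders, not the Dia API:

import torch

@torch.inference_mode()
def decode_loop_sketch(decode_step, sample_next_token, first_token, eos_id,
                       max_tokens, sync_every=8):
    token = first_token                        # (B,) int tensor already on the GPU
    finished = torch.zeros_like(token, dtype=torch.bool)
    generated = []
    for step in range(max_tokens):
        logits = decode_step(token)            # queues GPU work, no sync
        token = sample_next_token(logits)      # sampling stays on the GPU
        finished |= token == eos_id
        generated.append(token)
        # Pay for a device -> host transfer only every `sync_every` steps.
        if (step + 1) % sync_every == 0 and bool(finished.all()):
            break
    return torch.stack(generated, dim=-1)      # (B, steps generated)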

@V12Hero (Contributor) commented May 27, 2025

Could you please sync your repo with the original? Yours is 16 commits behind, and it doesn't work on a MacBook due to some issues I fixed later, which aren't included in your repo. I'm mentioning this because I want to test it.

@tsdocode (Contributor, Author)

I synced the latest code (screenshot attached); please help me check it.

@buttercrab (Collaborator) left a comment


LGTM

@buttercrab (Collaborator)

Could you fix the lint & format?

buttercrab merged commit cb07e05 into nari-labs:main on May 28, 2025 (1 check passed)
@V12Hero (Contributor) commented May 28, 2025

@tsdocode I know the issue is closed, but I did get to test it. I saw a ~30% increase in generation speed on my M3 MacBook Pro with 36GB of unified memory. It's a huge, very noticeable difference.

Here are the logs from my recent run of example/simple-mac.py:

generate step 86: speed=6.268 tokens/s, realtime factor=0.073x
generate step 172: speed=11.486 tokens/s, realtime factor=0.134x
generate step 258: speed=11.468 tokens/s, realtime factor=0.133x
generate step 344: speed=11.472 tokens/s, realtime factor=0.133x
generate step 430: speed=11.379 tokens/s, realtime factor=0.132x
generate step 516: speed=11.226 tokens/s, realtime factor=0.131x
generate step 602: speed=11.396 tokens/s, realtime factor=0.133x
generate step 688: speed=11.337 tokens/s, realtime factor=0.132x
generate: avg steps=758.0, total duration=75.467s

And below are the logs of a previous run for the same script with the previous version:

generate: starting generation loop
generate step 86: speed=7.759 tokens/s, realtime factor=0.090x
generate step 172: speed=8.470 tokens/s, realtime factor=0.098x
generate step 258: speed=8.536 tokens/s, realtime factor=0.099x
generate step 344: speed=8.489 tokens/s, realtime factor=0.099x
generate step 430: speed=8.607 tokens/s, realtime factor=0.100x
generate step 516: speed=8.615 tokens/s, realtime factor=0.100x
generate step 602: speed=8.592 tokens/s, realtime factor=0.100x
generate step 688: speed=8.597 tokens/s, realtime factor=0.100x
generate: avg steps=747.0, total duration=91.026s

@tsdocode (Contributor, Author)

@V12Hero Add this to your code before running generate; it patches in the fused QKV operation and may give a little more speedup:

for idx in range(len(model.model.decoder.layers)):
    layer = model.model.decoder.layers[idx]
    layer.self_attention.patch_fused_qkv()

# generate code

@V12Hero (Contributor) commented May 28, 2025

@tsdocode is this how you would want it placed?

from dia.model import Dia


model = Dia.from_pretrained("nari-labs/Dia-1.6B", compute_dtype="float16")

text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."

for idx in range(len(model.model.decoder.layers)):
    layer = model.model.decoder.layers[idx]
    layer.self_attention.patch_fused_qkv()
    
# Set `use_torch_compile=False` when using Dia on macOS,
# because torch.compile is not supported there.
output = model.generate(text, use_torch_compile=False, verbose=True)

model.save_audio("simple.mp3", output)

@tsdocode (Contributor, Author)

Yes

@V12Hero (Contributor) commented May 28, 2025

OK, I can't see any difference. The slight dip in performance is because I'm now running a bit low on battery, but overall I'd say the performance is the same.

generate: starting generation loop
generate step 86: speed=9.641 tokens/s, realtime factor=0.112x
generate step 172: speed=11.205 tokens/s, realtime factor=0.130x
generate step 258: speed=11.287 tokens/s, realtime factor=0.131x
generate step 344: speed=11.277 tokens/s, realtime factor=0.131x
generate step 430: speed=11.266 tokens/s, realtime factor=0.131x
generate step 516: speed=11.194 tokens/s, realtime factor=0.130x
generate step 602: speed=11.253 tokens/s, realtime factor=0.131x
generate step 688: speed=11.281 tokens/s, realtime factor=0.131x
generate step 774: speed=11.173 tokens/s, realtime factor=0.130x
generate: avg steps=772.0, total duration=71.836s
