
Commit 038d47c

Update docs/developer-guide/kvcache.md

Authored by nihui and Copilot
Co-authored-by: Copilot <[email protected]>

1 parent 15ff085 commit 038d47c

File tree

1 file changed: +1 −1 lines changed


docs/developer-guide/kvcache.md

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ The caching strategy is fundamentally different for self-attention and cross-attention

 #### cross-attention (static k/v)
 - **purpose:** Allows the decoder to attend to the output of the encoder (e.g., attending to audio features in speech recognition, or an input sentence in translation).
 - **cache Logic:** The k and v matrices are derived from the encoder's output, which is computed only **once** per input sequence. Therefore, the k and v for cross-attention are **static** and do not change during the decoding process. They are "cached" in the sense that they are pre-computed and reused in every decoding step.
-- **ncnn implementation:** The `MultiHeadAttention` and `SDPA` layer for cross-attention is also configured with `7=1` and cache I/O blobs. However, the implementation correctly identifies cross-attention (where the query blob is different from the key/value blobs) and reuses the `cache_k_in` and `cache_v_in` directly, without performing concatenation. This allows the static encoder k/v to be passed efficiently through the network.
+- **ncnn implementation:** The `MultiHeadAttention` and `SDPA` layers for cross-attention are also configured with `7=1` and cache I/O blobs. However, the implementation correctly identifies cross-attention (where the query blob is different from the key/value blobs) and reuses the `cache_k_in` and `cache_v_in` directly, without performing concatenation. This allows the static encoder k/v to be passed efficiently through the network.

 ## 3. ncnn kv cache memory layout
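The behavior the changed doc line describes can be illustrated outside ncnn. The following is a minimal NumPy sketch (not ncnn code; all names here are illustrative, not the library's API) of why the cross-attention k/v cache is static: k and v are projected from the encoder output once, and every decoding step reuses them unchanged, with no concatenation step.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def make_cross_kv(encoder_out, wk, wv):
    # k/v are derived from the encoder output, so they can be
    # computed once per input sequence and cached.
    return encoder_out @ wk, encoder_out @ wv

rng = np.random.default_rng(0)
d = 8
encoder_out = rng.standard_normal((5, d))        # 5 encoder positions
wk = rng.standard_normal((d, d))
wv = rng.standard_normal((d, d))

cache_k, cache_v = make_cross_kv(encoder_out, wk, wv)  # computed once

outputs = []
for step in range(3):                            # 3 decoding steps
    q = rng.standard_normal((1, d))              # current token's query
    # The cached k/v are reused directly; unlike self-attention,
    # nothing is appended to them between steps.
    outputs.append(attention(q, cache_k, cache_v))
```

Note how `cache_k` and `cache_v` keep the shape `(5, d)` for the whole decode loop, matching the doc's point that cross-attention cache blobs are passed through without growing.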
