
Commit 11fb997

sdpa kvcache (#6405)
1 parent 69652f4 commit 11fb997

8 files changed, +369 -58 lines changed


docs/developer-guide/kvcache.md

Lines changed: 38 additions & 23 deletions
@@ -1,6 +1,6 @@
 # high-performance transformer inference with mha kv cache in ncnn

-This document details the implementation and usage of the key-value (kv) cache for the `MultiHeadAttention` layer in ncnn. This feature significantly accelerates autoregressive inference for Transformer-based models, such as large language models and other encoder-decoder architectures.
+This document details the implementation and usage of the key-value (kv) cache for the `MultiHeadAttention` and `SDPA` layers in ncnn. This feature significantly accelerates autoregressive inference for Transformer-based models, such as large language models and other encoder-decoder architectures.

 ## 1. what is kv cache?

@@ -20,9 +20,9 @@ Without optimization, the model must recompute the k and v matrices for all prec
 - **reduced computation:** It eliminates redundant calculations, saving significant computational resources and energy.
 - **enables real-time applications:** The performance gain makes it feasible to deploy large Transformer models for interactive and real-time tasks.

-## 2. ncnn mha kv cache implementation
+## 2. ncnn kv cache implementation

-ncnn introduces kv cache support directly into its `MultiHeadAttention` layer. The implementation is designed to be efficient and flexible, handling both the dynamic cache of self-attention and the static k/v of cross-attention found in encoder-decoder architectures.
+ncnn introduces kv cache support directly into its `MultiHeadAttention` and `SDPA` layers. The implementation is designed to be efficient and flexible, handling both the dynamic cache of self-attention and the static k/v of cross-attention found in encoder-decoder architectures.

 ### self-attention vs. cross-attention cache logic

@@ -31,34 +31,49 @@ The caching strategy is fundamentally different for self-attention and cross-att
 #### self-attention (dynamic cache)
 - **purpose:** Allows the decoder to attend to previously generated tokens in its own sequence (e.g., the text being generated).
 - **cache Logic:** The cache is **dynamic** and grows with each generated token. In step `t`, the k and v for token `t` are computed and appended to the cache from step `t-1`.
-- **ncnn implementation:** The `MultiHeadAttention` layer for self-attention is modified to accept two additional inputs (`cache_k_in`, `cache_v_in`) and produce two corresponding outputs (`cache_k_out`, `cache_v_out`). The `7=1` parameter enables this dynamic caching behavior inside the layer.
+- **ncnn implementation:** The `MultiHeadAttention` and `SDPA` layers for self-attention are modified to accept two additional inputs (`cache_k_in`, `cache_v_in`) and produce two corresponding outputs (`cache_k_out`, `cache_v_out`). The `7=1` parameter enables this dynamic caching behavior inside the layer.

 #### cross-attention (static k/v)
 - **purpose:** Allows the decoder to attend to the output of the encoder (e.g., attending to audio features in speech recognition, or an input sentence in translation).
 - **cache Logic:** The k and v matrices are derived from the encoder's output, which is computed only **once** per input sequence. Therefore, the k and v for cross-attention are **static** and do not change during the decoding process. They are "cached" in the sense that they are pre-computed and reused in every decoding step.
-- **ncnn implementation:** The `MultiHeadAttention` layer for cross-attention is also configured with `7=1` and cache I/O blobs. However, the implementation correctly identifies cross-attention (where the query blob is different from the key/value blobs) and reuses the `cache_k_in` and `cache_v_in` directly, without performing concatenation. This allows the static encoder k/v to be passed efficiently through the network.
+- **ncnn implementation:** The `MultiHeadAttention` and `SDPA` layers for cross-attention are also configured with `7=1` and cache I/O blobs. However, the implementation correctly identifies cross-attention (where the query blob is different from the key/value blobs) and reuses the `cache_k_in` and `cache_v_in` directly, without performing concatenation. This allows the static encoder k/v to be passed efficiently through the network.

-## 3. ncnn mha kv cache memory layout
+## 3. ncnn kv cache memory layout

-The memory layout of the kv cache is a critical design choice for performance. ncnn uses a **transposed layout** for the cache blobs. The primary reason for this is to **ensure that data for each attention head is contiguous in memory, which significantly boosts gemm performance.**
+The memory layout of the kv cache is a critical design choice for performance. ncnn uses different layouts for `MultiHeadAttention` and `SDPA` to optimize for their respective calculation patterns.
+
+### `MultiHeadAttention` cache layout (Transposed)
+
+The `MultiHeadAttention` layer uses a **transposed layout** for its cache blobs. The primary reason for this is to **ensure that data for each attention head is contiguous in memory, which significantly boosts gemm performance.**

 * **input blobs (q, k, v):** These typically have a shape where height represents the sequence length.
   * `ncnn::Mat` dimensions: `(w = embed_dim, h = seq_len)`

-* **cache blobs (e.g., `k_affine`, `v_affine`):** These are stored in a **transposed** format.
+* **cache blobs (`k_cache`, `v_cache`):** These are stored in a **transposed** format.
   * `ncnn::Mat` dimensions: `(w = seq_len, h = embed_dim)`

 **the rationale:**

-1. **slicing by Head:** During the attention calculation, the code slices the `k_affine` and `v_affine` matrices along their height to isolate the data for each head (e.g., using `row_range(head_index * embed_dim_per_head, embed_dim_per_head)`).
+1. **slicing by Head:** During the attention calculation, the code slices the `k_cache` and `v_cache` matrices along their height to isolate the data for each head (e.g., using `row_range(head_index * embed_dim_per_head, embed_dim_per_head)`).
 2. **memory contiguity:** Because `ncnn::Mat` uses a row-major memory layout, this slicing operation on the transposed cache blob results in a sub-matrix where all the data for a single head is perfectly contiguous.
 3. **gemm efficiency:** Subsequent matrix multiplication operations (`q * k^T` and `Attention * v`) can then operate on these contiguous memory blocks. This maximizes CPU cache locality and the effectiveness of simd instructions, leading to a substantial increase in computational speed.

 If a non-transposed layout were used, the data for each head would be strided in memory, causing frequent cache misses and dramatically slowing down the performance-critical gemm calculations. Therefore, this transposed layout is a deliberate and crucial optimization for computation.

+### `SDPA` cache layout (Standard)
+
+The `SDPA` layer uses the **standard ncnn Mat layout**, where the sequence length is represented by the height.
+
+* **input blobs (q, k, v):** `(w = embed_dim, h = seq_len, c = num_heads)`
+* **cache blobs (`k_cache`, `v_cache`):** `(w = embed_dim, h = seq_len, c = num_heads)`
+
+**the rationale:**
+
+The `SDPA` layer's internal implementation directly concatenates the cache blobs (`past_k`, `past_v`) with the current ones (`cur_k`, `cur_v`) along the height dimension (`seq_len`). This simpler approach avoids the need for a transposed layout while still being highly efficient, as the concatenation logic is handled inside the optimized C++ implementation.
+
 ## 4. converting models to support kv cache

-To enable kv cache, you must modify the model's `.param` file to add the necessary cache inputs and outputs to all `MultiHeadAttention` layers in the decoder.
+To enable kv cache, you must modify the model's `.param` file to add the necessary cache inputs and outputs to all `MultiHeadAttention` and `SDPA` layers in the decoder.

 ### step 1: export a sequence-length-1 model
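Editor's note: as a quick reference for the layout description in the hunk above, here is a minimal C++ sketch of the two cache layouts. The standalone helper function and the concrete sizes are illustrative assumptions, not code from this commit; only the Mat shapes and the `row_range` slicing follow the documentation.

// illustrative only: shapes follow the kv cache documentation above
#include "mat.h" // ncnn::Mat, as in the ncnn source tree

void cache_layout_sketch()
{
    const int embed_dim = 512;
    const int num_heads = 8;
    const int embed_dim_per_head = embed_dim / num_heads;
    const int seq_len = 17; // tokens cached so far

    // MultiHeadAttention cache: transposed layout, (w = seq_len, h = embed_dim)
    ncnn::Mat mha_k_cache(seq_len, embed_dim);
    // slicing by head yields a contiguous block because ncnn::Mat is row-major
    ncnn::Mat head0 = mha_k_cache.row_range(0 * embed_dim_per_head, embed_dim_per_head);
    (void)head0;

    // SDPA cache: standard layout, (w = embed_dim, h = seq_len, c = num_heads)
    ncnn::Mat sdpa_k_cache(embed_dim, seq_len, num_heads);
    // appending a token adds one row per channel; the SDPA layer performs this
    // concatenation internally when kv_cache (param 7=1) is enabled
}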

@@ -68,9 +83,9 @@ First, export your model from its original framework (e.g., PyTorch) using a seq
 After exporting, a script is needed to edit the generated `.ncnn.param` file to make it cache-aware.

-#### A. Adding kv cache to All MultiHeadAttention Layers
+#### A. Adding kv cache to All MultiHeadAttention and SDPA Layers

-You must add cache inputs/outputs to **every** `MultiHeadAttention` layer in the decoder.
+You must add cache inputs/outputs to **every** `MultiHeadAttention` / `SDPA` layer in the decoder.

 - **change `input_count` and `output_count`:** Increase both by 2.
 - **add blob names:** Append new, unique blob names for `cache_k_in`, `cache_v_in`, `cache_k_out`, and `cache_v_out`.
@@ -81,7 +96,7 @@ Here is a robust Python function that automates this process:
 def add_kv_cache_to_ncnn_param(filename):
     """
     Modifies an ncnn.param file to add a kv cache mechanism to all
-    MultiHeadAttention layers and overwrites the original file.
+    MultiHeadAttention and SDPA layers and overwrites the original file.
     This handles both self-attention and cross-attention layers.
     """
     import os
@@ -98,15 +113,15 @@ def add_kv_cache_to_ncnn_param(filename):
     original_layer_count = int(header_parts[0])
     original_blob_count = int(header_parts[1])

-    mha_indices = [i for i, line in enumerate(lines) if line.strip().startswith("MultiHeadAttention")]
-    mha_count = len(mha_indices)
+    attention_indices = [i for i, line in enumerate(lines) if line.strip().startswith("MultiHeadAttention") or line.strip().startswith("SDPA")]
+    attention_count = len(attention_indices)

-    if mha_count == 0:
-        print("No 'MultiHeadAttention' layers found. The file will not be modified.")
+    if attention_count == 0:
+        print("No 'MultiHeadAttention' or 'SDPA' layers found. The file will not be modified.")
         return

-    # --- modify MultiHeadAttention layers ---
-    for i, line_index in enumerate(mha_indices):
+    # --- modify MultiHeadAttention and SDPA layers ---
+    for i, line_index in enumerate(attention_indices):
         parts = lines[line_index].strip().split()
         layer_type, layer_name, input_count_str, output_count_str = parts[:4]
         input_count, output_count = int(input_count_str), int(output_count_str)
@@ -132,15 +147,15 @@ def add_kv_cache_to_ncnn_param(filename):
     new_layer_count = original_layer_count + 1
     # each mha needs 2 new *input* blobs and produces 2 new *output* blobs.
     # the total number of unique blobs increases by 4 for each mha.
-    new_blob_count = original_blob_count + (mha_count * 4)
+    new_blob_count = original_blob_count + (attention_count * 4)
     lines[header_line_index] = f"{new_layer_count} {new_blob_count}\n"

     # find where to insert the new input layer (after existing ones)
     insert_pos = header_line_index + 1
     while insert_pos < len(lines) and lines[insert_pos].strip().startswith("Input"):
         insert_pos += 1

-    cache_blob_names = [name for i in range(mha_count) for name in (f"cache_k_in_{i}", f"cache_v_in_{i}")]
+    cache_blob_names = [name for i in range(attention_count) for name in (f"cache_k_in_{i}", f"cache_v_in_{i}")]
     input_layer_line = (
         f"{'Input':<24} {'kv_cache_in':<24} 0 {len(cache_blob_names)} "
         f"{' '.join(cache_blob_names)}\n"
@@ -150,7 +165,7 @@ def add_kv_cache_to_ncnn_param(filename):
     with open(filename, 'w', encoding='utf-8') as f:
         f.writelines(lines)

-    print(f"Successfully added kv cache to {mha_count} MultiHeadAttention layers.")
+    print(f"Successfully added kv cache to {attention_count} MultiHeadAttention / SDPA layers.")

 # usage:
 # add_kv_cache_to_ncnn_param("your_model_decoder.ncnn.param")
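Editor's note: the edit this script performs on a single attention layer line looks roughly like the hypothetical before/after below. The layer name, blob names, and the abbreviated trailing parameters (shown as ...) are made up for illustration; only the counts increased by 2, the appended cache blob names, and the added `7=1` follow the documentation. The same edit applies to `SDPA` layer lines.

before:
    MultiHeadAttention  attn_0  3 1  q0 k0 v0  attn_out0  ...
after:
    MultiHeadAttention  attn_0  5 3  q0 k0 v0 cache_k_in_0 cache_v_in_0  attn_out0 cache_k_out_0 cache_v_out_0  ... 7=1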
@@ -206,7 +221,7 @@ void find_mha_kvcache_blobs(const ncnn::Net& net, kvcache_info& info)
     for (const ncnn::Layer* layer : net.layers())
     {
         // cache-enabled mha layer has 3 outputs (out, cache_k_out, cache_v_out) instead of 1
-        if (layer->typeindex == ncnn::LayerType::MultiHeadAttention && layer->tops.size() == 3)
+        if ((layer->typeindex == ncnn::LayerType::MultiHeadAttention || layer->typeindex == ncnn::LayerType::SDPA) && layer->tops.size() == 3)
         {
             // the script adds cache_k and cache_v as the last two inputs/outputs
             int input_count = layer->bottoms.size();
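Editor's note: the hunk above is the blob-discovery helper from the same document. Below is a minimal sketch of how the discovered cache blobs might be carried across decode steps. The `kvcache_info` fields, the blob names "in0"/"out0", and the loop structure are assumptions for illustration, not the documented API.

#include "net.h" // ncnn::Net, ncnn::Extractor, ncnn::Mat
#include <vector>

// assumed shape of the struct filled by find_mha_kvcache_blobs() above
struct kvcache_info
{
    std::vector<int> cache_input_blobs;  // cache_k_in / cache_v_in blob indices
    std::vector<int> cache_output_blobs; // cache_k_out / cache_v_out blob indices
};

// hypothetical single-token decode step: feed last step's caches, collect the new ones
void decode_step(const ncnn::Net& net, const kvcache_info& info,
                 const ncnn::Mat& token_embedding, std::vector<ncnn::Mat>& caches)
{
    // caches must be pre-sized to info.cache_input_blobs.size(); empty Mats on the first step
    ncnn::Extractor ex = net.create_extractor();
    ex.input("in0", token_embedding); // "in0" is a placeholder input blob name

    for (size_t i = 0; i < info.cache_input_blobs.size(); i++)
        ex.input(info.cache_input_blobs[i], caches[i]);

    ncnn::Mat out;
    ex.extract("out0", out); // "out0" is a placeholder output blob name

    for (size_t i = 0; i < info.cache_output_blobs.size(); i++)
        ex.extract(info.cache_output_blobs[i], caches[i]); // carry over to the next step
}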

docs/developer-guide/operators.md

Lines changed: 1 addition & 0 deletions
@@ -1811,6 +1811,7 @@ for each num_head part
 | --------- | ------------- | ----- | --------- | ----------------- |
 | 5 | attn_mask | int | 0 | |
 | 6 | scale | float | 0.f | auto = 1.f / sqrt(embed_dim) |
+| 7 | kv_cache | int | 0 | |
 | 18 | int8_scale_term | int | 0 | |

 # SELU

src/layer/sdpa.cpp

Lines changed: 123 additions & 20 deletions
@@ -17,6 +17,7 @@ int SDPA::load_param(const ParamDict& pd)
 {
     attn_mask = pd.get(5, 0);
     scale = pd.get(6, 0.f);
+    kv_cache = pd.get(7, 0);
     int8_scale_term = pd.get(18, 0);

     return 0;
@@ -33,20 +34,24 @@ int SDPA::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_bl
 #endif

     const Mat& query = bottom_blobs[0];
-    const Mat& key = bottom_blobs[1];
-    const Mat& value = bottom_blobs[2];
-    const Mat& attn_mask_blob = bottom_blobs.size() == 4 ? bottom_blobs[3] : Mat();
+    const Mat& cur_key = bottom_blobs[1];
+    const Mat& cur_value = bottom_blobs[2];
+    const Mat& attn_mask_blob = attn_mask ? bottom_blobs[3] : Mat();
+    const Mat& past_key = kv_cache ? bottom_blobs[attn_mask ? 4 : 3] : Mat();
+    const Mat& past_value = kv_cache ? bottom_blobs[attn_mask ? 5 : 4] : Mat();

     const int embed_dim = query.w;
     const int src_seqlen = query.h;
     const int num_heads = query.c;
-    const int dst_seqlen = key.h;
-    const int num_group = key.c;
-    const int out_embed_dim = value.w;
-
-    // assert key.w == embed_dim
-    // assert key.h == value.h == dst_seqlen
-    // assert value.c == num_group
+    const int cur_seqlen = cur_key.h;
+    const int num_group = cur_key.c;
+    const int out_embed_dim = cur_value.w;
+    const int past_seqlen = kv_cache ? past_key.h : 0;
+    const int dst_seqlen = past_seqlen + cur_seqlen;
+
+    // assert cur_key.w == embed_dim
+    // assert cur_key.h == cur_value.h == cur_seqlen
+    // assert cur_value.c == num_group
     // assert num_heads % num_group == 0

     const float _scale = scale == 0.f ? 1.f / sqrt(embed_dim) : scale;
@@ -61,6 +66,46 @@ int SDPA::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_bl
     if (qk_cross.empty())
         return -100;

+    Mat key = cur_key;
+    if (past_seqlen > 0)
+    {
+        key.create(embed_dim, dst_seqlen, num_group, 4u, opt.blob_allocator);
+        if (key.empty())
+            return -100;
+
+        // concat
+        #pragma omp parallel for num_threads(opt.num_threads)
+        for (int q = 0; q < num_group; q++)
+        {
+            const Mat past_key_head = past_key.channel(q);
+            const Mat cur_key_head = cur_key.channel(q);
+            Mat key_head = key.channel(q);
+
+            memcpy(key_head.row(0), past_key_head, embed_dim * past_seqlen * sizeof(float));
+            memcpy(key_head.row(past_seqlen), cur_key_head, embed_dim * cur_seqlen * sizeof(float));
+        }
+    }
+
+    Mat value = cur_value;
+    if (past_seqlen > 0)
+    {
+        value.create(out_embed_dim, dst_seqlen, num_group, 4u, opt.blob_allocator);
+        if (value.empty())
+            return -100;
+
+        // concat
+        #pragma omp parallel for num_threads(opt.num_threads)
+        for (int q = 0; q < num_group; q++)
+        {
+            const Mat past_value_head = past_value.channel(q);
+            const Mat cur_value_head = cur_value.channel(q);
+            Mat value_head = value.channel(q);
+
+            memcpy(value_head.row(0), past_value_head, out_embed_dim * past_seqlen * sizeof(float));
+            memcpy(value_head.row(past_seqlen), cur_value_head, out_embed_dim * cur_seqlen * sizeof(float));
+        }
+    }
+
     #pragma omp parallel for num_threads(opt.num_threads)
     for (int q = 0; q < num_heads; q++)
     {
@@ -153,6 +198,13 @@ int SDPA::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_bl
         }
     }

+    if (kv_cache)
+    {
+        // assert top_blobs.size() == 3
+        top_blobs[1] = key;
+        top_blobs[2] = value;
+    }
+
     return 0;
 }

@@ -223,20 +275,24 @@ static void dynamic_quantize_2d_per_h(const Mat& blob, Mat& blob_int8, Mat& scal
 int SDPA::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const
 {
     const Mat& query = bottom_blobs[0];
-    const Mat& key = bottom_blobs[1];
-    const Mat& value = bottom_blobs[2];
-    const Mat& attn_mask_blob = bottom_blobs.size() == 4 ? bottom_blobs[3] : Mat();
+    const Mat& cur_key = bottom_blobs[1];
+    const Mat& cur_value = bottom_blobs[2];
+    const Mat& attn_mask_blob = attn_mask ? bottom_blobs[3] : Mat();
+    const Mat& past_key = kv_cache ? bottom_blobs[attn_mask ? 4 : 3] : Mat();
+    const Mat& past_value = kv_cache ? bottom_blobs[attn_mask ? 5 : 4] : Mat();

     const int embed_dim = query.w;
     const int src_seqlen = query.h;
     const int num_heads = query.c;
-    const int dst_seqlen = key.h;
-    const int num_group = key.c;
-    const int out_embed_dim = value.w;
-
-    // assert key.w == embed_dim
-    // assert key.h == value.h == dst_seqlen
-    // assert value.c == num_group
+    const int cur_seqlen = cur_key.h;
+    const int num_group = cur_key.c;
+    const int out_embed_dim = cur_value.w;
+    const int past_seqlen = kv_cache ? past_key.h : 0;
+    const int dst_seqlen = past_seqlen + cur_seqlen;
+
+    // assert cur_key.w == embed_dim
+    // assert cur_key.h == cur_value.h == cur_seqlen
+    // assert cur_value.c == num_group
     // assert num_heads % num_group == 0

     const float _scale = scale == 0.f ? 1.f / sqrt(embed_dim) : scale;
@@ -271,6 +327,46 @@ int SDPA::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& t
     if (query_or_qk_cross_int8_scales.empty())
         return -100;

+    Mat key = cur_key;
+    if (past_seqlen > 0)
+    {
+        key.create(embed_dim, dst_seqlen, num_group, 4u, opt.blob_allocator);
+        if (key.empty())
+            return -100;
+
+        // concat
+        #pragma omp parallel for num_threads(opt.num_threads)
+        for (int q = 0; q < num_group; q++)
+        {
+            const Mat past_key_head = past_key.channel(q);
+            const Mat cur_key_head = cur_key.channel(q);
+            Mat key_head = key.channel(q);
+
+            memcpy(key_head.row(0), past_key_head, embed_dim * past_seqlen * sizeof(float));
+            memcpy(key_head.row(past_seqlen), cur_key_head, embed_dim * cur_seqlen * sizeof(float));
+        }
+    }
+
+    Mat value = cur_value;
+    if (past_seqlen > 0)
+    {
+        value.create(out_embed_dim, dst_seqlen, num_group, 4u, opt.blob_allocator);
+        if (value.empty())
+            return -100;
+
+        // concat
+        #pragma omp parallel for num_threads(opt.num_threads)
+        for (int q = 0; q < num_group; q++)
+        {
+            const Mat past_value_head = past_value.channel(q);
+            const Mat cur_value_head = cur_value.channel(q);
+            Mat value_head = value.channel(q);
+
+            memcpy(value_head.row(0), past_value_head, out_embed_dim * past_seqlen * sizeof(float));
+            memcpy(value_head.row(past_seqlen), cur_value_head, out_embed_dim * cur_seqlen * sizeof(float));
+        }
+    }
+
     #pragma omp parallel for num_threads(opt.num_threads)
     for (int q = 0; q < num_heads; q++)
     {
@@ -389,6 +485,13 @@ int SDPA::forward_int8(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& t
         }
     }

+    if (kv_cache)
+    {
+        // assert top_blobs.size() == 3
+        top_blobs[1] = key;
+        top_blobs[2] = value;
+    }
+
     return 0;
 }
 #endif // NCNN_INT8
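Editor's note: from the indexing in forward() above, a cache-enabled SDPA layer takes q, k, v, an optional attn_mask, then past_k and past_v as inputs, and produces out, k_cache, v_cache as outputs. Below is a minimal sketch of driving the layer directly with that ordering; the concrete sizes and the test-style setup are assumptions, not part of this commit.

#include "layer.h" // ncnn::create_layer, ncnn::ParamDict, ncnn::Option, ncnn::Mat
#include <vector>

// hypothetical direct invocation of a cache-enabled SDPA layer (no attn_mask)
int run_sdpa_with_kvcache()
{
    ncnn::Layer* sdpa = ncnn::create_layer("SDPA");

    ncnn::ParamDict pd;
    pd.set(5, 0); // attn_mask = 0
    pd.set(7, 1); // kv_cache = 1
    sdpa->load_param(pd);

    ncnn::Option opt;
    sdpa->create_pipeline(opt);

    const int embed_dim = 64, num_heads = 4, past_seqlen = 16;
    ncnn::Mat q(embed_dim, 1, num_heads); // current token only
    ncnn::Mat cur_k(embed_dim, 1, num_heads);
    ncnn::Mat cur_v(embed_dim, 1, num_heads);
    ncnn::Mat past_k(embed_dim, past_seqlen, num_heads);
    ncnn::Mat past_v(embed_dim, past_seqlen, num_heads);
    q.fill(0.f);
    cur_k.fill(0.f);
    cur_v.fill(0.f);
    past_k.fill(0.f);
    past_v.fill(0.f);

    // input order without attn_mask: q, cur_k, cur_v, past_k, past_v
    std::vector<ncnn::Mat> bottoms(5);
    bottoms[0] = q;
    bottoms[1] = cur_k;
    bottoms[2] = cur_v;
    bottoms[3] = past_k;
    bottoms[4] = past_v;
    std::vector<ncnn::Mat> tops(3); // out, k_cache, v_cache

    int ret = sdpa->forward(bottoms, tops, opt);

    sdpa->destroy_pipeline(opt);
    delete sdpa;
    return ret; // tops[1]/tops[2] now hold caches with h = past_seqlen + 1
}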

src/layer/sdpa.h

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ class SDPA : public Layer
 public:
     int attn_mask;
     float scale;
+    int kv_cache;

     int int8_scale_term;
 };
