diff --git a/src/data/nav/aitransport.ts b/src/data/nav/aitransport.ts
index cb8eeb5f11..fa176195ad 100644
--- a/src/data/nav/aitransport.ts
+++ b/src/data/nav/aitransport.ts
@@ -29,6 +29,10 @@ export default {
           name: 'Message per token',
           link: '/docs/ai-transport/features/token-streaming/message-per-token',
         },
+        {
+          name: 'Token streaming limits',
+          link: '/docs/ai-transport/features/token-streaming/token-rate-limits',
+        },
       ],
     },
     {
diff --git a/src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx b/src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx
new file mode 100644
index 0000000000..7b6d66a8f4
--- /dev/null
+++ b/src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx
@@ -0,0 +1,66 @@
+---
+title: Token streaming limits
+meta_description: "Learn how token streaming interacts with Ably message limits and how to ensure your application delivers consistent performance."
+---
+
+LLM token streaming introduces bursty traffic patterns to your application, with some models outputting 150+ distinct events (i.e. tokens or response deltas) per second. Output rates can vary unpredictably over the lifetime of a response stream, and you have limited control over third-party model behaviour. Without planning, concurrent token streams across multiple channels risk triggering [rate limits](/docs/platform/pricing/limits).
+
+Ably scales as your traffic grows, and rate limits exist to protect service quality against accidental spikes and deliberate abuse. They also help protect your consumption costs if abuse does occur. On the right package for your use case, hitting a limit should be an infrequent occurrence. How you stay within limits when using AI Transport depends on which [token streaming pattern](/docs/ai-transport/features/token-streaming) you use.
+
+## Message-per-response
+
+The [message-per-response](/docs/ai-transport/features/token-streaming/message-per-response) pattern includes automatic rate limit protection. AI Transport prevents a single response stream from reaching the message rate limit by rolling up multiple appends into a single published message:
+
+1. Your agent streams tokens to the channel at the model's output rate
+2. Ably publishes the first token immediately, then automatically rolls up subsequent tokens on receipt
+3. Clients receive the same number of tokens per second, delivered in fewer messages
+
+By default, a single response stream is delivered at 25 messages per second or the model output rate, whichever is lower. This means you can publish two simultaneous response streams on the same channel or connection with any [Ably package](/docs/platform/pricing#packages), because each stream is limited to 50% of the [connection inbound message rate](/docs/platform/pricing/limits#connection). You are charged for the number of published messages, not for the number of streamed tokens.
+
+### Configuring rollup behaviour
+
+Ably joins all appends for a single response that are received during the rollup window into one published message. You can specify the rollup window for a particular connection by setting the `appendRollupWindow` transport parameter. This lets you determine how much of the connection message rate a single response stream can consume, and control your consumption costs.
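+
+The following sketch is illustrative only (the function and its parameters are not part of the Ably API). It shows how the per-stream cap summarised in the table below can be approximated as one published message per rollup window, never exceeding the model's output rate:
+
+```javascript
+// Illustrative only: approximate per-stream delivery rate implied by a given
+// appendRollupWindow, i.e. roughly one rolled-up message per window, capped at
+// the model's output rate.
+const maxStreamMessageRate = (appendRollupWindowMs, modelTokensPerSecond) =>
+  appendRollupWindowMs === 0
+    ? modelTokensPerSecond // 0ms disables rollup: delivery follows the model output rate
+    : Math.min(modelTokensPerSecond, 1000 / appendRollupWindowMs);
+
+maxStreamMessageRate(40, 150);  // 25 messages/s (the default window)
+maxStreamMessageRate(100, 150); // 10 messages/s
+```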
+
+| `appendRollupWindow` | Maximum message rate for a single response |
+|---|---|
+| 0ms | Model output rate |
+| 20ms | 50 messages/s |
+| 40ms *(default)* | 25 messages/s |
+| 100ms | 10 messages/s |
+| 500ms *(max)* | 2 messages/s |
+
+The following example demonstrates how to establish a connection to Ably with `appendRollupWindow` set to 100ms:
+
+```javascript
+const ably = new Ably.Realtime({
+  key: 'your-api-key',
+  transportParams: { appendRollupWindow: 100 }
+});
+```
+
+## Message-per-token
+
+The [message-per-token](/docs/ai-transport/features/token-streaming/message-per-token) pattern requires you to manage rate limits directly. Each token is published as a separate message, so high-speed model output can consume message allowances quickly.
+
+To stay within limits:
+
+- Calculate your headroom by comparing your model's peak output rate against your package's [connection inbound message rate](/docs/platform/pricing/limits#connection)
+- Account for concurrency by multiplying peak rates by the maximum number of simultaneous streams your application supports
+- If required, batch tokens in your agent before publishing to the SDK, reducing message count while maintaining delivery speed (a minimal sketch is included at the end of this page)
+
+If your application requires higher message rates than your current package allows, [contact Ably](/contact) to discuss options.
+
+## Next steps
+
+- Review [Ably platform limits](/docs/platform/pricing/limits) to understand rate limit thresholds for your package
+- Learn about the [message-per-response](/docs/ai-transport/features/token-streaming/message-per-response) pattern for automatic rate limit protection
+- Learn about the [message-per-token](/docs/ai-transport/features/token-streaming/message-per-token) pattern for fine-grained control
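+
+The batching approach mentioned in the message-per-token section above can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the `tokenStream` async iterable, the `'agent-response'` channel name, and the 100ms flush interval are assumptions you would replace based on your model SDK and your own headroom calculation.
+
+```javascript
+import * as Ably from 'ably';
+
+// Minimal, illustrative batching sketch for the message-per-token pattern.
+// Assumptions (not part of the Ably API): `tokenStream` is an async iterable of
+// tokens from your model SDK; the flush interval is chosen from your headroom calculation.
+const ably = new Ably.Realtime({ key: 'your-api-key' });
+const channel = ably.channels.get('agent-response');
+
+async function publishBatched(tokenStream, flushIntervalMs = 100) {
+  let buffer = [];
+
+  const flush = async () => {
+    if (buffer.length === 0) return;
+    const batch = buffer;
+    buffer = [];
+    await channel.publish('tokens', batch); // one message per flush interval
+  };
+
+  const timer = setInterval(flush, flushIntervalMs);
+  try {
+    for await (const token of tokenStream) {
+      buffer.push(token); // accumulate tokens instead of publishing each one
+    }
+  } finally {
+    clearInterval(timer);
+    await flush(); // publish any remaining tokens when the stream ends
+  }
+}
+```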