AIT-221: Document how token streaming interacts with rate limits #3092
base: AIT-129-AIT-Docs-release-branch
Conversation
src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx
> 2. As the token rate approaches a threshold percentage of the [connection inbound message rate](/docs/platform/pricing/limits#connection), Ably batches tokens together automatically
> 3. Clients receive the same number of tokens per second, delivered in fewer messages
>
> By default, a single response stream uses up to 50% of the connection inbound message rate. This allows two simultaneous response streams on the same channel or connection. [Contact Ably](/contact) to adjust this threshold if your application requires a different allocation.
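For intuition, here is a rough sketch of the batching arithmetic described in the quoted text. Apart from the 50% default threshold, all numbers are illustrative assumptions rather than documented limit values:

```typescript
// Illustrative sketch only: the connection inbound limit and token rate below
// are assumptions; only the 50% default threshold comes from the quoted text.
const connectionInboundLimit = 50; // hypothetical messages/s per connection
const threshold = 0.5;             // default: one stream may use up to 50%
const tokenRate = 150;             // tokens/s emitted by the model

// Target message rate for this stream once batching kicks in.
const messageBudget = connectionInboundLimit * threshold; // 25 messages/s

// Tokens are grouped so the same token throughput fits in fewer messages.
const tokensPerMessage = Math.ceil(tokenRate / messageBudget); // 6 tokens/message
console.log(
  `${tokenRate} tokens/s delivered as ~${messageBudget} messages/s, ` +
  `~${tokensPerMessage} tokens per message`
);
```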
I've challenged this (https://ably.atlassian.net/wiki/spaces/AI/pages/4624515073/AITDR-003+Append+batching+in+the+frontdoor?focusedCommentId=4667342849)
Should this mention that it's specifiable via transport param?
I've updated the text to match what is now in the DR. I'm deliberately not including a time value for the batching, because "we add 40ms latency to your token delivery time" sounds more negative than "we'll deliver your tokens in 25 messages/s so you don't hit your rate limits".
I wasn't sure if we'd have completed any client updates and docs for the transport param before we were ready to release these docs, so left the customer-configurable part out. Can update with this info and associated docs link if @SimonWoolf is confident that it will be done, or add it when the transport param docs go out.
> I wasn't sure if we'd have completed any client updates and docs for the transport param before we were ready to release these docs
Right now we don't have a single central 'transport params docs' page with a list of what transportParams exist -- we figured it would be a grab-bag of random unrelated things, which didn't seem useful. So we just mention them in the parts of the documentation where it's actually relevant, e.g. remainPresentFor is mentioned in https://ably.com/docs/presence-occupancy/presence#unstable-connections. And for some we just don't document them at all unless someone comes and presents us with a problem for which it's the appropriate solution, like rewindOnFailedResume.
In this case there's no bit of documentation more relevant than this page for this option, so I'd suggest just putting it here.
Right now it's named appendRollupWindow, lmk if you think a different name would be better. As with all our time options it's specified in milliseconds as an integer, and is capped at 500ms.
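If this does end up documented on that page, a minimal sketch of what setting it might look like from ably-js, assuming the param keeps the name `appendRollupWindow` (the `transportParams` client option already exists; the param name and behaviour are still subject to this thread):

```typescript
import * as Ably from 'ably';

// Sketch only: `transportParams` is an existing ably-js ClientOptions field,
// but `appendRollupWindow` is the proposed name from this thread and may change.
// The value is an integer number of milliseconds, capped at 500.
const realtime = new Ably.Realtime({
  key: 'your-api-key', // placeholder
  transportParams: { appendRollupWindow: 200 },
});
```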
src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx
> meta_description: "Learn how token streaming interacts with Ably message limits and how to ensure your application delivers consistent performance."
> ---
>
> LLM token streaming introduces bursty traffic patterns to your application, with some models outputting 150+ distinct events (i.e. tokens or response deltas) per second. Output rates can vary unpredictably over the lifetime of a response stream, and you have limited control over third-party model behaviour. Without planning, concurrent token streams across multiple channels risk triggering [rate limits](/docs/platform/pricing/limits).
> Without planning, concurrent token streams across multiple channels risk triggering rate limits
This sentence implies the danger is from concurrent token streams across multiple channels. But that's not the case: earlier in the paragraph you note that a single token stream can still give you 150 tokens per second, more than enough to hit rate limits just from the one stream. (And if you have multiple streams, for the purposes of the connection inbound rate limit it makes no difference whether they're on multiple channels or the same channel)
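To make that concrete, a single stream publishing one message per token can exceed a per-connection inbound limit on its own; the limit value below is purely hypothetical:

```typescript
// Hypothetical numbers for illustration; the real per-connection inbound
// message rate limit depends on the account package.
const tokensPerSecond = 150;      // one model response stream
const inboundLimitPerSecond = 50; // hypothetical per-connection limit

// One message per token: a single stream alone exceeds the limit, regardless
// of whether streams are spread across channels or share one channel.
console.log(tokensPerSecond > inboundLimitPerSecond); // true
```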
> Ably scales as your traffic grows, and rate limits exist to protect service quality in the case of accidental spikes or deliberate abuse. They also provide a level of protection to consumption rates if abuse does occur. On the correct package for your use case, hitting a limit is an infrequent occurrence. The approach to staying within limits when using AI Transport depends on which [token streaming pattern](/docs/ai-transport/features/token-streaming) you use.
This paragraph talks about limits scaling as your traffic grows, and the importance of being on the correct package for your use case. That applies to whole-account / quota / extensive limits. But the rest of this document is about the connection client-to-server message rate limit and the channel message rate limit, which are both local, intensive limits: they don't scale as traffic grows and don't change with your quota. So this paragraph seems like it might cause confusion in this context.
Description
Document how Ably message rate limits interact with applications that are streaming tokens, for both the message-per-token and message-per-response streaming patterns.
https://ably.atlassian.net/browse/AIT-221
Checklist