AIT-221: Document how token streaming interacts with rate limits #3092
base: AIT-129-AIT-Docs-release-branch
Conversation
src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx
> 2. As the token rate approaches a threshold percentage of the [connection inbound message rate](/docs/platform/pricing/limits#connection), Ably batches tokens together automatically
> 3. Clients receive the same number of tokens per second, delivered in fewer messages
>
> By default, a single response stream uses up to 50% of the connection inbound message rate. This allows two simultaneous response streams on the same channel or connection. [Contact Ably](/contact) to adjust this threshold if your application requires a different allocation.
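For intuition, here is a rough sketch of the batching arithmetic described in the quoted text. Apart from the 50% default threshold, all numbers are illustrative assumptions rather than documented limit values:

```typescript
// Illustrative sketch only: the connection inbound limit and token rate below
// are assumptions; only the 50% default threshold comes from the quoted text.
const connectionInboundLimit = 50; // hypothetical messages/s per connection
const threshold = 0.5;             // default: one stream may use up to 50%
const tokenRate = 150;             // tokens/s emitted by the model

// Target message rate for this stream once batching kicks in.
const messageBudget = connectionInboundLimit * threshold; // 25 messages/s

// Tokens are grouped so the same token throughput fits in fewer messages.
const tokensPerMessage = Math.ceil(tokenRate / messageBudget); // 6 tokens/message
console.log(
  `${tokenRate} tokens/s delivered as ~${messageBudget} messages/s, ` +
  `~${tokensPerMessage} tokens per message`
);
```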
I've challenged this (https://ably.atlassian.net/wiki/spaces/AI/pages/4624515073/AITDR-003+Append+batching+in+the+frontdoor?focusedCommentId=4667342849)
Should this mention that it's specifiable via transport param?
I've updated the text to match what is now in the DR. I'm deliberately not including a time value for the batching, because "we add 40ms latency to your token delivery time" sounds more negative than "we'll deliver your tokens in 25 messages/s so you don't hit your rate limits".
I wasn't sure if we'd have completed any client updates and docs for the transport param before we were ready to release these docs, so left the customer-configurable part out. Can update with this info and associated docs link if @SimonWoolf is confident that it will be done, or add it when the transport param docs go out.
> I wasn't sure if we'd have completed any client updates and docs for the transport param before we were ready to release these docs
Right now we don't have a single central 'transport params docs' page with a list of what transportParams exist -- we figured it would be a grab-bag of random unrelated things, which didn't seem useful. So we just mention them in the parts of the documentation where it's actually relevant, e.g. remainPresentFor is mentioned in https://ably.com/docs/presence-occupancy/presence#unstable-connections. And for some we just don't document them at all unless someone comes and presents us with a problem for which it's the appropriate solution, like rewindOnFailedResume.
In this case there's no bit of documentation more relevant than this page for this option, so I'd suggest just putting it here.
Right now it's named appendRollupWindow, lmk if you think a different name would be better. As with all our time options it's specified in milliseconds as an integer, and is capped at 500ms.
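If this does end up documented on that page, a minimal sketch of what setting it might look like from ably-js, assuming the param keeps the name `appendRollupWindow` (the `transportParams` client option already exists; the param name and behaviour are still subject to this thread):

```typescript
import * as Ably from 'ably';

// Sketch only: `transportParams` is an existing ably-js ClientOptions field,
// but `appendRollupWindow` is the proposed name from this thread and may change.
// The value is an integer number of milliseconds, capped at 500.
const realtime = new Ably.Realtime({
  key: 'your-api-key', // placeholder
  transportParams: { appendRollupWindow: 200 },
});
```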
src/pages/docs/ai-transport/features/token-streaming/token-rate-limits.mdx
> meta_description: "Learn how token streaming interacts with Ably message limits and how to ensure your application delivers consistent performance."
> ---
>
> LLM token streaming introduces bursty traffic patterns to your application, with some models outputting 150+ distinct events (i.e. tokens or response deltas) per second. Output rates can vary unpredictably over the lifetime of a response stream, and you have limited control over third-party model behaviour. Without planning, concurrent token streams across multiple channels risk triggering [rate limits](/docs/platform/pricing/limits).
> Without planning, concurrent token streams across multiple channels risk triggering rate limits
This sentence implies the danger is from concurrent token streams across multiple channels. But that's not the case: earlier in the paragraph you note that a single token stream can still give you 150 tokens per second, more than enough to hit rate limits just from the one stream. (And if you have multiple streams, for the purposes of the connection inbound rate limit it makes no difference whether they're on multiple channels or the same channel)
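To make that concrete, a single stream publishing one message per token can exceed a per-connection inbound limit on its own; the limit value below is purely hypothetical:

```typescript
// Hypothetical numbers for illustration; the real per-connection inbound
// message rate limit depends on the account package.
const tokensPerSecond = 150;      // one model response stream
const inboundLimitPerSecond = 50; // hypothetical per-connection limit

// One message per token: a single stream alone exceeds the limit, regardless
// of whether streams are spread across channels or share one channel.
console.log(tokensPerSecond > inboundLimitPerSecond); // true
```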
> Ably scales as your traffic grows, and rate limits exist to protect service quality in the case of accidental spikes or deliberate abuse. They also provide a level of protection to consumption rates if abuse does occur. On the correct package for your use case, hitting a limit is an infrequent occurrence. The approach to staying within limits when using AI Transport depends on which [token streaming pattern](/docs/ai-transport/features/token-streaming) you use.
This paragraph talks about limits scaling as your traffic grows, and the importance of being on the correct package for your use case. That applies to whole-account / quota / extensive limits. But the rest of this document is about the connection client-to-server message rate limit and the channel message rate limit, which are both local, intensive limits: they don't scale as traffic grows and don't change with your quota. So this paragraph seems like it might cause confusion in this context.
Description
Document how Ably message rate limits interact with applications that are streaming tokens, for both the message-per-token and message-per-response streaming patterns.
https://ably.atlassian.net/browse/AIT-221
Checklist