
Conversation

@rainbowFi (Contributor)

Description

Document how Ably message rate limits interact with applications that are streaming tokens, for both the message-per-token and message-per-response streaming patterns.

https://ably.atlassian.net/browse/AIT-221

Checklist


coderabbitai bot commented Jan 12, 2026

Important

Review skipped

Auto reviews are disabled on this repository.


@rainbowFi added the review-app (Create a Heroku review app) label on Jan 12, 2026
2. As the token rate approaches a threshold percentage of the [connection inbound message rate](/docs/platform/pricing/limits#connection), Ably batches tokens together automatically
3. Clients receive the same number of tokens per second, delivered in fewer messages

By default, a single response stream uses up to 50% of the connection inbound message rate. This allows two simultaneous response streams on the same channel or connection. [Contact Ably](/contact) to adjust this threshold if your application requires a different allocation.
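As an illustration of the message-per-token pattern this enables, the sketch below publishes each token as its own message and leaves batching to Ably. It is a minimal example against ably-js; the channel name, event name, token source, and API key are placeholders rather than anything prescribed by the text above.

```typescript
import * as Ably from 'ably';

// Minimal sketch: publish each LLM token as its own message (message-per-token).
// The automatic batching described above kicks in as the publish rate approaches
// the threshold, so the producer does not need to batch tokens itself.
const realtime = new Ably.Realtime({ key: '<YOUR_ABLY_API_KEY>' }); // placeholder key
const channel = realtime.channels.get('ai:response:123'); // placeholder channel name

async function streamTokens(tokens: AsyncIterable<string>): Promise<void> {
  for await (const token of tokens) {
    // One publish per token; the 'token' event name is illustrative only.
    await channel.publish('token', token);
  }
}
```

Awaiting each publish keeps the sketch simple and applies some producer-side backpressure; publishing without awaiting would instead let the library keep several messages in flight.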
Member

I've challenged this (https://ably.atlassian.net/wiki/spaces/AI/pages/4624515073/AITDR-003+Append+batching+in+the+frontdoor?focusedCommentId=4667342849)

Should this mention that it's specifiable via transport param?

Contributor Author

I've updated the text to match what is now in the DR. I'm deliberately not including a time value for the batching, because "we add 40ms latency to your token delivery time" sounds more negative than "we'll deliver your tokens in 25 messages/s so you don't hit your rate limits".

I wasn't sure if we'd have completed any client updates and docs for the transport param before we were ready to release these docs, so I left the customer-configurable part out. I can update with this info and an associated docs link if @SimonWoolf is confident that it will be done, or add it when the transport param docs go out.

@SimonWoolf (Member) Jan 12, 2026

> I wasn't sure if we'd have completed any client updates and docs for the transport param before we were ready to release these docs

Right now we don't have a single central 'transport params docs' page with a list of what transportParams exist -- we figured it would be a grab-bag of random unrelated things, which didn't seem useful. So we just mention them in the parts of the documentation where it's actually relevant, e.g. remainPresentFor is mentioned in https://ably.com/docs/presence-occupancy/presence#unstable-connections. And for some we just don't document them at all unless someone comes and presents us with a problem for which it's the appropriate solution, like rewindOnFailedResume.

In this case there's no bit of documentation more relevant than this page for this option, so I'd suggest just putting it here.

Right now it's named appendRollupWindow, lmk if you think a different name would be better. As with all our time options it's specified in milliseconds as an integer, and is capped at 500ms.
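To make the suggestion above concrete, here is a minimal sketch of how that option could be passed from ably-js, which accepts a transportParams map in its client options. The appendRollupWindow name, its millisecond unit, and the 500ms cap are taken from the comment above and may change before this ships; the value and API key shown are illustrative.

```typescript
import * as Ably from 'ably';

// Sketch only: 'appendRollupWindow' is the working name from the discussion
// above and is not yet documented; it may be renamed before release.
// The value is an integer number of milliseconds, capped at 500 on the server.
const realtime = new Ably.Realtime({
  key: '<YOUR_ABLY_API_KEY>', // placeholder key
  transportParams: { appendRollupWindow: 100 },
});
```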

meta_description: "Learn how token streaming interacts with Ably message limits and how to ensure your application delivers consistent performance."
---

LLM token streaming introduces bursty traffic patterns to your application, with some models outputting 150+ distinct events (i.e. tokens or response deltas) per second. Output rates can vary unpredictably over the lifetime of a response stream, and you have limited control over third-party model behaviour. Without planning, concurrent token streams across multiple channels risk triggering [rate limits](/docs/platform/pricing/limits).
Member

> Without planning, concurrent token streams across multiple channels risk triggering rate limits

This sentence implies the danger is from concurrent token streams across multiple channels. But that's not the case: earlier in the paragraph you note that a single token stream can still give you 150 tokens per second, more than enough to hit rate limits just from the one stream. (And if you have multiple streams, for the purposes of the connection inbound rate limit it makes no difference whether they're on multiple channels or the same channel)



Ably scales as your traffic grows, and rate limits exist to protect service quality in the case of accidental spikes or deliberate abuse. They also offer a degree of protection for your consumption if abuse does occur. On the correct package for your use case, hitting a limit is an infrequent occurrence. The approach to staying within limits when using AI Transport depends on which [token streaming pattern](/docs/ai-transport/features/token-streaming) you use.
Member

This paragraph talks about limits scaling as your traffic grows, and the importance of being on the correct package for your use case. That applies to whole-account / quota / extensive limits. But the rest of this document is about the connection client-to-server message rate limit and the channel message rate limit, which are both local, intensive limits: they don't scale as traffic grows and don't change with your quota. So this paragraph seems like it might cause confusion in this context.

