
Commit 3ab7a4f

docs(streaming): update streaming configuration documentation (#1542)
* docs(streaming): update streaming configuration documentation

  feature: #1538

  Revise streaming docs to clarify usage of `stream_async()`, remove the outdated global streaming config, and add CLI usage instructions. Explain the streaming requirements for output rails and the deprecation of `StreamingHandler`. Improve the examples and guidance for token usage tracking.

* not sure about these 2
1 parent 5942a94 commit 3ab7a4f

File tree

3 files changed: +38, -92 lines changed

docs/configure-rails/yaml-schema/streaming/global-streaming.md

Lines changed: 31 additions & 50 deletions
````diff
@@ -1,50 +1,33 @@
 ---
-title: Global Streaming
-description: Enable streaming mode for LLM token generation in config.yml.
+title: Streaming
+description: Using streaming mode for LLM token generation in NeMo Guardrails.
 ---
 
-# Global Streaming
+# Streaming
 
-Enable streaming mode for the main LLM generation at the top level of `config.yml`.
+NeMo Guardrails supports streaming LLM responses via the `stream_async()` method. No configuration is required to enable streaming; simply use `stream_async()` instead of `generate_async()`.
 
-## Configuration
+## Basic Usage
 
-```yaml
-streaming: True
-```
-
-## What It Does
-
-When enabled, global streaming:
+```python
+from nemoguardrails import LLMRails, RailsConfig
 
-- Sets `streaming = True` on the underlying LLM model
-- Enables `stream_usage = True` for token usage tracking
-- Allows using the `stream_async()` method on `LLMRails`
-- Makes the LLM produce tokens incrementally instead of all at once
+config = RailsConfig.from_path("./config")
+rails = LLMRails(config)
 
-## Default
+messages = [{"role": "user", "content": "Hello!"}]
 
-`False`
+async for chunk in rails.stream_async(messages=messages):
+    print(chunk, end="", flush=True)
+```
 
 ---
 
-## When to Use
-
-### Streaming Without Output Rails
-
-If you do not have output rails configured, only global streaming is needed:
-
-```yaml
-streaming: True
-```
-
-### Streaming With Output Rails
+## Streaming With Output Rails
 
-When using output rails with streaming, you must also configure [output rail streaming](output-rail-streaming.md):
+When using output rails with streaming, you must configure [output rail streaming](output-rail-streaming.md):
 
 ```yaml
-streaming: True
-
 rails:
   output:
     flows:
@@ -53,27 +36,15 @@ rails:
       enabled: True
 ```
 
----
+If output rails are configured but `rails.output.streaming.enabled` is not set to `True`, calling `stream_async()` will raise a `StreamingNotSupportedError`.
 
-## Python API Usage
+---
 
-### Simple Streaming
+## Streaming With Handler (Deprecated)
 
-```python
-from nemoguardrails import LLMRails, RailsConfig
-
-config = RailsConfig.from_path("./config")
-rails = LLMRails(config)
+> **Warning:** Using `StreamingHandler` directly is deprecated and will be removed in a future release. Use `stream_async()` instead.
 
-messages = [{"role": "user", "content": "Hello!"}]
-
-async for chunk in rails.stream_async(messages=messages):
-    print(chunk, end="", flush=True)
-```
-
-### Streaming With Handler
-
-For more control, use a `StreamingHandler`:
+For advanced use cases that require more control over token processing, you can use a `StreamingHandler` with `generate_async()`:
 
 ```python
 from nemoguardrails import LLMRails, RailsConfig
@@ -113,9 +84,19 @@ Enable streaming in the request body by setting `stream` to `true`:
 
 ---
 
+## CLI Usage
+
+Use the `--streaming` flag with the chat command:
+
+```bash
+nemoguardrails chat path/to/config --streaming
+```
+
+---
+
 ## Token Usage Tracking
 
-When streaming is enabled, NeMo Guardrails automatically enables token usage tracking by setting `stream_usage = True` for the underlying LLM model.
+When using `stream_async()`, NeMo Guardrails automatically enables token usage tracking by setting `stream_usage = True` on the underlying LLM model.
 
 Access token usage through the `log` generation option:
````
docs/configure-rails/yaml-schema/streaming/index.md

Lines changed: 5 additions & 26 deletions
````diff
@@ -1,37 +1,23 @@
 ---
 title: Streaming Configuration
-description: Configure streaming for LLM token generation and output rail processing in config.yml.
+description: Configure streaming for output rail processing in config.yml.
 ---
 
 # Streaming Configuration
 
-NeMo Guardrails supports two levels of streaming configuration:
+NeMo Guardrails supports streaming out of the box when using the `stream_async()` method. No configuration is required to enable basic streaming.
 
-1. **Global streaming** - Controls LLM token generation
-2. **Output rail streaming** - Controls how output rails process streamed tokens
-
-## Configuration Comparison
-
-| Aspect | Global `streaming` | Output Rail `streaming.enabled` |
-|--------|-------------------|--------------------------------|
-| **Scope** | LLM token generation | Output rail processing |
-| **Required for** | Any streaming | Streaming with output rails |
-| **Affects** | How LLM produces tokens | How rails process token chunks |
-| **Default** | `False` | `False` |
+When you have **output rails** configured, you need to explicitly enable streaming for them to process tokens in chunked mode.
 
 ## Quick Example
 
-When using streaming with output rails, both configurations are required:
+When using streaming with output rails:
 
 ```yaml
-# Global: Enable LLM streaming
-streaming: True
-
 rails:
   output:
     flows:
       - self check output
-    # Output rail streaming: Enable chunked processing
     streaming:
       enabled: True
       chunk_size: 200
@@ -40,18 +26,11 @@ rails:
 
 ## Streaming Configuration Details
 
-The following guides provide detailed documentation for each streaming configuration area.
+The following guides provide detailed documentation for streaming configuration.
 
 ::::{grid} 1 1 2 2
 :gutter: 3
 
-:::{grid-item-card} Global Streaming
-:link: global-streaming
-:link-type: doc
-
-Enable streaming mode for LLM token generation in config.yml.
-:::
-
 :::{grid-item-card} Output Rail Streaming
 :link: output-rail-streaming
 :link-type: doc
````
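The revised docs state a single rule: output rails must opt in to chunked streaming via `rails.output.streaming.enabled`, and `stream_async()` raises `StreamingNotSupportedError` otherwise. A hypothetical re-implementation of that check (not the library's actual code, and the function and dict layout here are illustrative assumptions) makes the rule concrete:

```python
class StreamingNotSupportedError(Exception):
    """Illustrative stand-in for the error the docs describe."""

def check_streaming_supported(config: dict) -> None:
    # Hypothetical sketch of the documented rule: when output rails are
    # configured, streaming requires rails.output.streaming.enabled: True.
    output = config.get("rails", {}).get("output", {})
    has_output_rails = bool(output.get("flows"))
    streaming_enabled = output.get("streaming", {}).get("enabled", False)
    if has_output_rails and not streaming_enabled:
        raise StreamingNotSupportedError(
            "Output rails are configured but rails.output.streaming.enabled "
            "is not True; stream_async() cannot be used."
        )

# A config matching the Quick Example above passes the check.
ok_config = {
    "rails": {
        "output": {
            "flows": ["self check output"],
            "streaming": {"enabled": True, "chunk_size": 200},
        }
    }
}
check_streaming_supported(ok_config)
```

The same check raises for a config that lists output flows but omits the `streaming` block.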

docs/run-rails/streaming.md

Lines changed: 2 additions & 16 deletions
````diff
@@ -1,20 +1,12 @@
 # Streaming
 
-If the application LLM supports streaming, you can configure NeMo Guardrails to stream tokens as well.
+If the application LLM supports streaming, NeMo Guardrails can stream tokens as well. Streaming is automatically enabled when you use the `stream_async()` method; no configuration is required.
 
 For information about configuring streaming with output guardrails, refer to the following:
 
 - For configuration, refer to [streaming output configuration](../user-guides/configuration-guide.md#streaming-output-configuration).
 - For sample Python client code, refer to [streaming output](../getting-started/5-output-rails/README.md#streaming-output).
 
-## Configuration
-
-To activate streaming on a guardrails configuration, add the following to your `config.yml`:
-
-```yaml
-streaming: True
-```
-
 ## Usage
 
 ### Chat CLI
@@ -215,13 +207,7 @@ POST /v1/chat/completions
 We also support streaming for LLMs deployed using `HuggingFacePipeline`.
 One example is provided in the [HF Pipeline Dolly](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/configs/llm/hf_pipeline_dolly/README.md) configuration.
 
-To use streaming for HF Pipeline LLMs, you first need to set the streaming flag in your `config.yml`.
-
-```yaml
-streaming: True
-```
-
-Then you need to create an `nemoguardrails.llm.providers.huggingface.AsyncTextIteratorStreamer` streamer object,
+To use streaming for HF Pipeline LLMs, you need to create a `nemoguardrails.llm.providers.huggingface.AsyncTextIteratorStreamer` streamer object,
 add it to the `kwargs` of the pipeline and to the `model_kwargs` of the `HuggingFacePipelineCompatible` object.
 
 ```python
````