diff --git a/docs/configure-rails/yaml-schema/streaming/global-streaming.md b/docs/configure-rails/yaml-schema/streaming/global-streaming.md
index 309587d94..ce72a5cb4 100644
--- a/docs/configure-rails/yaml-schema/streaming/global-streaming.md
+++ b/docs/configure-rails/yaml-schema/streaming/global-streaming.md
@@ -1,50 +1,33 @@
 ---
-title: Global Streaming
-description: Enable streaming mode for LLM token generation in config.yml.
+title: Streaming
+description: Using streaming mode for LLM token generation in NeMo Guardrails.
 ---
 
-# Global Streaming
+# Streaming
 
-Enable streaming mode for the main LLM generation at the top level of `config.yml`.
+NeMo Guardrails supports streaming LLM responses via the `stream_async()` method. No configuration is required to enable streaming; simply use `stream_async()` instead of `generate_async()`.
 
-## Configuration
+## Basic Usage
 
-```yaml
-streaming: True
-```
-
-## What It Does
-
-When enabled, global streaming:
+```python
+from nemoguardrails import LLMRails, RailsConfig
 
-- Sets `streaming = True` on the underlying LLM model
-- Enables `stream_usage = True` for token usage tracking
-- Allows using the `stream_async()` method on `LLMRails`
-- Makes the LLM produce tokens incrementally instead of all at once
+config = RailsConfig.from_path("./config")
+rails = LLMRails(config)
 
-## Default
+messages = [{"role": "user", "content": "Hello!"}]
 
-`False`
+async for chunk in rails.stream_async(messages=messages):
+    print(chunk, end="", flush=True)
+```
 
 ---
 
-## When to Use
-
-### Streaming Without Output Rails
-
-If you do not have output rails configured, only global streaming is needed:
-
-```yaml
-streaming: True
-```
-
-### Streaming With Output Rails
+## Streaming With Output Rails
 
-When using output rails with streaming, you must also configure [output rail streaming](output-rail-streaming.md):
+When using output rails with streaming, you must configure [output rail streaming](output-rail-streaming.md):
 
 ```yaml
-streaming: True
-
 rails:
   output:
     flows:
@@ -53,27 +36,15 @@ rails:
       enabled: True
 ```
 
----
+If output rails are configured but `rails.output.streaming.enabled` is not set to `True`, calling `stream_async()` will raise a `StreamingNotSupportedError`.
 
-## Python API Usage
+---
 
-### Simple Streaming
+## Streaming With Handler (Deprecated)
 
-```python
-from nemoguardrails import LLMRails, RailsConfig
-
-config = RailsConfig.from_path("./config")
-rails = LLMRails(config)
+> **Warning:** Using `StreamingHandler` directly is deprecated and will be removed in a future release. Use `stream_async()` instead.
 
-messages = [{"role": "user", "content": "Hello!"}]
-
-async for chunk in rails.stream_async(messages=messages):
-    print(chunk, end="", flush=True)
-```
-
-### Streaming With Handler
-
-For more control, use a `StreamingHandler`:
+For advanced use cases requiring more control over token processing, you can use a `StreamingHandler` with `generate_async()`:
 
 ```python
 from nemoguardrails import LLMRails, RailsConfig
@@ -113,9 +84,19 @@ Enable streaming in the request body by setting `stream` to `true`:
 
 ---
 
+## CLI Usage
+
+Use the `--streaming` flag with the chat command:
+
+```bash
+nemoguardrails chat path/to/config --streaming
+```
+
+---
+
 ## Token Usage Tracking
 
-When streaming is enabled, NeMo Guardrails automatically enables token usage tracking by setting `stream_usage = True` for the underlying LLM model.
+When using `stream_async()`, NeMo Guardrails automatically enables token usage tracking by setting `stream_usage = True` on the underlying LLM model.
 Access token usage through the `log` generation option:
diff --git a/docs/configure-rails/yaml-schema/streaming/index.md b/docs/configure-rails/yaml-schema/streaming/index.md
index 79310c15a..4896a9f85 100644
--- a/docs/configure-rails/yaml-schema/streaming/index.md
+++ b/docs/configure-rails/yaml-schema/streaming/index.md
@@ -1,37 +1,23 @@
 ---
 title: Streaming Configuration
-description: Configure streaming for LLM token generation and output rail processing in config.yml.
+description: Configure streaming for output rail processing in config.yml.
 ---
 
 # Streaming Configuration
 
-NeMo Guardrails supports two levels of streaming configuration:
+NeMo Guardrails supports streaming out of the box when using the `stream_async()` method. No configuration is required to enable basic streaming.
 
-1. **Global streaming** - Controls LLM token generation
-2. **Output rail streaming** - Controls how output rails process streamed tokens
-
-## Configuration Comparison
-
-| Aspect | Global `streaming` | Output Rail `streaming.enabled` |
-|--------|-------------------|--------------------------------|
-| **Scope** | LLM token generation | Output rail processing |
-| **Required for** | Any streaming | Streaming with output rails |
-| **Affects** | How LLM produces tokens | How rails process token chunks |
-| **Default** | `False` | `False` |
+When you have **output rails** configured, you need to explicitly enable streaming for them to process tokens in chunked mode.
 
 ## Quick Example
 
-When using streaming with output rails, both configurations are required:
+When using streaming with output rails:
 
 ```yaml
-# Global: Enable LLM streaming
-streaming: True
-
 rails:
   output:
     flows:
       - self check output
-      # Output rail streaming: Enable chunked processing
     streaming:
       enabled: True
       chunk_size: 200
@@ -40,18 +26,11 @@ rails:
 
 ## Streaming Configuration Details
 
-The following guides provide detailed documentation for each streaming configuration area.
+The following guides provide detailed documentation for streaming configuration.
 
 ::::{grid} 1 1 2 2
 :gutter: 3
 
-:::{grid-item-card} Global Streaming
-:link: global-streaming
-:link-type: doc
-
-Enable streaming mode for LLM token generation in config.yml.
-:::
-
 :::{grid-item-card} Output Rail Streaming
 :link: output-rail-streaming
 :link-type: doc
diff --git a/docs/run-rails/streaming.md b/docs/run-rails/streaming.md
index 4f54fac7a..03b9849d9 100644
--- a/docs/run-rails/streaming.md
+++ b/docs/run-rails/streaming.md
@@ -1,20 +1,12 @@
 # Streaming
 
-If the application LLM supports streaming, you can configure NeMo Guardrails to stream tokens as well.
+If the application LLM supports streaming, NeMo Guardrails can stream tokens as well. Streaming is automatically enabled when you use the `stream_async()` method; no configuration is required.
 
 For information about configuring streaming with output guardrails, refer to the following:
 
 - For configuration, refer to [streaming output configuration](../user-guides/configuration-guide.md#streaming-output-configuration).
 - For sample Python client code, refer to [streaming output](../getting-started/5-output-rails/README.md#streaming-output).
 
-## Configuration
-
-To activate streaming on a guardrails configuration, add the following to your `config.yml`:
-
-```yaml
-streaming: True
-```
-
 ## Usage
 
 ### Chat CLI
@@ -215,13 +207,7 @@ POST /v1/chat/completions
 
 We also support streaming for LLMs deployed using `HuggingFacePipeline`. One example is provided in the [HF Pipeline Dolly](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/configs/llm/hf_pipeline_dolly/README.md) configuration.
 
-To use streaming for HF Pipeline LLMs, you first need to set the streaming flag in your `config.yml`.
-
-```yaml
-streaming: True
-```
-
-Then you need to create an `nemoguardrails.llm.providers.huggingface.AsyncTextIteratorStreamer` streamer object,
+To use streaming for HF Pipeline LLMs, you need to create a `nemoguardrails.llm.providers.huggingface.AsyncTextIteratorStreamer` streamer object,
 add it to the `kwargs` of the pipeline and to the `model_kwargs` of the `HuggingFacePipelineCompatible` object.
 
 ```python
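The chunked processing that the patch's `rails.output.streaming` settings (`enabled`, `chunk_size`) enable can be illustrated with a small standalone sketch. This is a conceptual illustration only, not NeMo Guardrails' actual implementation: `token_source`, `stream_with_chunked_checks`, and the `check` callback are hypothetical names, and real output rails run guardrail flows on each chunk rather than a simple callback.

```python
import asyncio


async def token_source():
    # Stand-in for an LLM token stream; rails.stream_async() yields chunks similarly.
    for token in ["Hel", "lo", ", ", "wor", "ld", "!"]:
        yield token


async def stream_with_chunked_checks(tokens, chunk_size, check):
    """Buffer streamed tokens and run an output check on each chunk_size-character chunk."""
    buffer = ""
    async for token in tokens:
        buffer += token
        while len(buffer) >= chunk_size:
            chunk, buffer = buffer[:chunk_size], buffer[chunk_size:]
            check(chunk)  # an output rail would inspect the chunk here
            yield chunk
    if buffer:  # flush whatever remains when the stream ends
        check(buffer)
        yield buffer


async def collect():
    checked = []
    chunks = [c async for c in stream_with_chunked_checks(token_source(), 4, checked.append)]
    return chunks, checked


chunks, checked = asyncio.run(collect())
print("".join(chunks))  # Hello, world!
print(checked)
```

The design point mirrored here is that a larger `chunk_size` gives each check more context per invocation but delays how quickly validated text reaches the caller.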