This guide explains how to use the NVIDIA RAG system's summarization features, including how to enable summary generation during document ingestion and how to retrieve document summaries via the API.
When uploading documents to the vector store using the ingestion API (POST /v1/documents), you can request that a summary be generated for each document. This is controlled by the generate_summary flag in the data field of the multipart form request.
POST /v1/documents
Content-Type: multipart/form-data
- documents: [file1.pdf, file2.docx, ...]
- data: '{
"collection_name": "my_collection",
"blocking": false,
"split_options": {"chunk_size": 512, "chunk_overlap": 150},
"custom_metadata": [],
"generate_summary": true
}'
- generate_summary: Set to true to enable summary generation for each uploaded document. Summary generation always runs asynchronously in the backend after ingestion completes. The ingestion status is reported as completed regardless of whether summarization has succeeded.
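As a minimal sketch of how the data field could be assembled on the client side, the helper below serializes the same options shown in the request above. The function name build_ingestion_data is hypothetical and not part of the NVIDIA RAG API; it only produces the JSON string and does not send anything.

```python
import json


def build_ingestion_data(
    collection_name: str,
    generate_summary: bool = True,
    blocking: bool = False,
    chunk_size: int = 512,
    chunk_overlap: int = 150,
) -> str:
    """Return the JSON string for the 'data' form field (hypothetical helper)."""
    payload = {
        "collection_name": collection_name,
        "blocking": blocking,
        "split_options": {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap},
        "custom_metadata": [],
        # Ask the backend to summarize each uploaded document.
        "generate_summary": generate_summary,
    }
    return json.dumps(payload)


data_field = build_ingestion_data("my_collection")
```

The resulting string would be sent as the data part of the multipart form, next to the documents file parts, using whichever HTTP client you prefer.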
response = await ingestor.upload_documents(
collection_name="my_collection",
vdb_endpoint="http://localhost:19530",
blocking=False,
filepaths=["/path/to/file1.pdf"],
generate_summary=True
)
Once a document has been ingested with summarization enabled, you can retrieve its summary using the GET /v1/summary endpoint.
GET /v1/summary?collection_name=<collection>&file_name=<filename>&blocking=<bool>&timeout=<seconds>
- collection_name (required): Name of the collection containing the document.
- file_name (required): Name of the file for which to retrieve the summary.
- blocking (optional, default: false):
  - If true, the request will wait (up to timeout seconds) for the summary to be generated if it is not yet available.
  - If false, the request will return immediately. If the summary is not ready, a 404 response is returned.
- timeout (optional, default: 300): Maximum time to wait (in seconds) if blocking is true.
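The query parameters above can be assembled programmatically. The following is a small sketch using only the Python standard library; the helper name build_summary_url and the base URL http://localhost:8081 are assumptions for illustration, so substitute the address of your own RAG server.

```python
from urllib.parse import urlencode


def build_summary_url(
    base_url: str,
    collection_name: str,
    file_name: str,
    blocking: bool = False,
    timeout: int = 300,
) -> str:
    """Build a GET /v1/summary URL from the documented query parameters."""
    query = urlencode(
        {
            "collection_name": collection_name,
            "file_name": file_name,
            # Send the boolean in lowercase, as in the example requests.
            "blocking": "true" if blocking else "false",
            "timeout": timeout,
        }
    )
    return f"{base_url.rstrip('/')}/v1/summary?{query}"


# The base URL here is a placeholder, not a documented default.
url = build_summary_url(
    "http://localhost:8081", "my_collection", "file1.pdf", blocking=True, timeout=60
)
```

Using urlencode means that file names containing spaces or other reserved characters are escaped correctly.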
GET /v1/summary?collection_name=my_collection&file_name=file1.pdf&blocking=true&timeout=60
response = await rag.get_summary(
collection_name="my_collection",
file_name="file1.pdf",
blocking=False, # Set to True to wait for summary generation
timeout=20 # Maximum wait time in seconds if blocking is True
)
print(response)
{
"summary": "This document provides an overview of ...",
"file_name": "file1.pdf",
"collection_name": "my_collection",
"status": "SUCCESS",
"message": "Summary generated successfully."
}
{
"message": "Summary for file1.pdf not found. Set wait=true to wait for generation.",
"status": "FAILED"
}
{
"message": "Timeout waiting for summary generation for file1.pdf",
"status": "FAILED"
}
The summarization feature can be configured using the following environment variables:
The summarization feature uses specialized prompts defined in the prompt.yaml file. Two key prompts work together: document_summary_prompt for single-chunk processing and iterative_summary_prompt for multi-chunk documents.
Environment Variables:
- SUMMARY_LLM: The model name to use for summarization (default: nvidia/llama-3.3-nemotron-super-49b-v1)
- SUMMARY_LLM_SERVERURL: The server URL hosting the summarization model (default: empty, uses NVIDIA hosted API)
- SUMMARY_LLM_MAX_CHUNK_LENGTH: Maximum chunk size in characters for document processing (default: 50000)
- SUMMARY_CHUNK_OVERLAP: Overlap between chunks for iterative summarization in characters (default: 200)
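For illustration, the variables and defaults listed above could be read from Python as shown below. This is a sketch only: SummaryConfig and load_summary_config are hypothetical names, and the actual service may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SummaryConfig:
    llm: str
    server_url: str
    max_chunk_length: int
    chunk_overlap: int


def load_summary_config(env: Optional[Mapping[str, str]] = None) -> SummaryConfig:
    """Read the summarization settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return SummaryConfig(
        llm=env.get("SUMMARY_LLM", "nvidia/llama-3.3-nemotron-super-49b-v1"),
        # An empty server URL means the NVIDIA hosted API is used.
        server_url=env.get("SUMMARY_LLM_SERVERURL", ""),
        max_chunk_length=int(env.get("SUMMARY_LLM_MAX_CHUNK_LENGTH", "50000")),
        chunk_overlap=int(env.get("SUMMARY_CHUNK_OVERLAP", "200")),
    )
```

Passing a plain dictionary instead of os.environ makes the settings easy to override in tests.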
export SUMMARY_LLM="nvidia/llama-3.3-nemotron-super-49b-v1"
export SUMMARY_LLM_SERVERURL=""
export SUMMARY_LLM_MAX_CHUNK_LENGTH=50000
export SUMMARY_CHUNK_OVERLAP=200
The summarization system uses an intelligent chunking approach with different prompts for different scenarios:
- Single Chunk Processing: If a document fits within SUMMARY_LLM_MAX_CHUNK_LENGTH characters, it's processed as a single chunk.
  - Prompt used: document_summary_prompt
  - Takes the entire document content and generates a comprehensive summary in one pass
- Iterative Multi-Chunk Processing: For larger documents:
  - The document is split into chunks using SUMMARY_LLM_MAX_CHUNK_LENGTH as the maximum size
  - SUMMARY_CHUNK_OVERLAP characters are preserved between chunks for context
  - Initial chunk: document_summary_prompt is used to generate an initial summary from the first chunk
  - Subsequent chunks: iterative_summary_prompt is used to update the existing summary with new information from each additional chunk
  - The final result is a comprehensive summary of the entire document
This approach ensures that even very large documents can be summarized effectively while maintaining context across chunk boundaries. The prompt selection automatically adapts based on document size and processing stage.
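The flow described above can be sketched in a few lines of Python. This is an illustrative approximation, not the service's actual implementation: the plain character-based splitter and the two callables summarize_first and update_summary (standing in for LLM calls made with document_summary_prompt and iterative_summary_prompt) are assumptions, and the real backend may split text differently.

```python
from typing import Callable, List


def split_with_overlap(text: str, max_len: int, overlap: int) -> List[str]:
    """Split text into chunks of at most max_len characters; each chunk after
    the first starts with the last `overlap` characters of the previous one."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(text) <= max_len:
        return [text]
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def summarize_document(
    text: str,
    summarize_first: Callable[[str], str],      # stand-in for document_summary_prompt
    update_summary: Callable[[str, str], str],  # stand-in for iterative_summary_prompt
    max_len: int = 50000,
    overlap: int = 200,
) -> str:
    chunks = split_with_overlap(text, max_len, overlap)
    # A document that fits in one chunk is summarized in a single pass.
    summary = summarize_first(chunks[0])
    # Each further chunk refines the running summary.
    for chunk in chunks[1:]:
        summary = update_summary(summary, chunk)
    return summary
```

Because the two prompt steps are passed in as callables, the control flow can be exercised with simple stub functions before wiring in a real model.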
- Summarization is only available if generate_summary was set to true during document upload.
- If you request a summary for a document that was not ingested with summarization enabled, the summary will not be available.
- Use the blocking parameter to control whether your request waits for summary generation or returns immediately.
- The summary is pre-generated and stored in MinIO; repeated requests for the same document will return the same summary unless the document is re-uploaded or updated.
- For optimal performance, adjust SUMMARY_LLM_MAX_CHUNK_LENGTH based on your model's context window and available resources.
- Larger chunk sizes generally produce better summaries but require more memory and processing time.
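If you prefer non-blocking requests, one option is to poll from the client until the summary is ready. The sketch below assumes a caller-supplied fetch callable (hypothetical, not part of the API) that performs one non-blocking summary request and returns the parsed JSON body, with the status and summary fields shown in the example responses.

```python
import time
from typing import Callable, Optional


def wait_for_summary(
    fetch: Callable[[], dict],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until it reports SUCCESS or the attempts run out."""
    for attempt in range(attempts):
        response = fetch()
        if response.get("status") == "SUCCESS":
            return response.get("summary")
        # Not ready yet: pause before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

A return value of None means the summary was still unavailable after the last attempt, which is also what happens for documents ingested without generate_summary.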
For more details, refer to the OpenAPI schema and Python usage examples.