RAGnificent combines Python and Rust components to scrape websites and convert HTML content to markdown, JSON, or XML formats. It supports sitemap parsing, semantic chunking for RAG (Retrieval-Augmented Generation), and includes performance optimizations through Rust integration.
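To make the sitemap-parsing step concrete, here is a minimal sketch using only the Python standard library. It is not RAGnificent's own parser: the function name `extract_sitemap_urls` is hypothetical, and the code assumes a plain `<urlset>` sitemap in the standard sitemaps.org namespace (no sitemap index files, no gzip).

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (sitemaps.org protocol 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> URL of every <url> entry in a sitemap document.

    Hypothetical helper for illustration; not part of the RAGnificent API.
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc> https://example.com/docs </loc></url>"
    "</urlset>"
)
urls = extract_sitemap_urls(sample_sitemap)
```

The resulting URL list could then be written to a links file for batch scraping.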
Key features include HTML-to-markdown/JSON/XML conversion with support for various elements, intelligent content chunking that preserves document structure, and systematic content discovery through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.
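As an illustration of chunking that preserves document structure, the sketch below splits markdown at headings and records the heading hierarchy each chunk sits under. It is an assumption about the general technique, not the project's actual chunker: `Chunk` and `chunk_markdown_by_headings` are hypothetical names, and the code ignores edge cases such as `#` lines inside code fences.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # enclosing headings, outermost first
    content: str             # body text under the innermost heading


def chunk_markdown_by_headings(markdown: str) -> list[Chunk]:
    """Split markdown at ATX headings (#, ##, ...) and track the hierarchy.

    Hypothetical helper for illustration; not part of the RAGnificent API.
    """
    heading_re = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered text under the heading path active so far.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk([title for _, title in path], text))
        buffer.clear()

    for line in markdown.splitlines():
        match = heading_re.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before pushing.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# A\nintro\n## B\ntext b\n# C\ntext c"
chunks = chunk_markdown_by_headings(doc)
```

Keeping the heading path alongside each chunk lets a retrieval step show where a passage came from without re-reading the whole document.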
Check out the deepwiki for a granular breakdown of the repository contents, purpose and structure.
- Features - Feature overview and capabilities
- Configuration - Configuration management and environment setup
- Optimization - Performance tuning and optimization guide
git clone https://github.com/krljakob/RAGnificent.git
cd RAGnificent
# Quick setup
./build_all.sh # Unix/macOS
# or: .\build_all.ps1 # Windows
# Manual setup
uv venv && export PATH=".venv/bin:$PATH"
uv pip install -r requirements.txt && uv pip install -e .
pytest
# Basic conversion
python -m RAGnificent https://example.com -o output.md
# With RAG chunking
python -m RAGnificent https://example.com --save-chunks --chunk-dir chunks
# Multiple formats and parallel processing
python -m RAGnificent --links-file urls.txt --parallel -f json
# Python API
from RAGnificent.core.scraper import MarkdownScraper
scraper = MarkdownScraper()
html = scraper.scrape_website("https://example.com")
markdown = scraper.convert_to_markdown(html, "https://example.com")
chunks = scraper.create_chunks(markdown, "https://example.com")
# Run all tests (including benchmarks)
pytest
# Fast development testing (recommended)
./run_tests.sh fast # ~15 seconds
# or
pytest -m "not benchmark and not slow"
# Run specific test categories
./run_tests.sh unit # Unit tests only
./run_tests.sh integration # Integration tests
./run_tests.sh benchmark # Performance benchmarks
./run_tests.sh profile # Show slowest tests
# Run specific test files
pytest tests/unit/test_chunk_utils.py -v
pytest tests/rust/test_python_bindings.py -v
Test Performance: Tests are organized by speed. Fast unit tests run in ~15 seconds, while the full suite, including benchmarks, takes ~22 seconds. Benchmarks are skipped by default for rapid development cycles.
Current Status: 48 tests with comprehensive coverage across core functionality.
- RAGnificent/: Main Python package
  - core/: Core functionality (scraper, cache, config, etc.)
  - rag/: RAG-specific components (embedding, vector store, search)
  - utils/: Utility modules (chunking, sitemap parsing)
- src/: Rust source code for performance-critical operations
- tests/: Comprehensive test suite
- examples/: Demo scripts and usage examples
- docs/: Detailed documentation
cargo bench
python scripts/visualize_benchmarks.py
- Fork the repository
- Create your feature branch (git checkout -b feature/fix-rate-limiting)
- Commit your changes (git commit -m 'Fix rate limiting edge case')
- Push to the branch (git push origin feature/fix-rate-limiting)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
krljakob