RAGnificent

RAGnificent combines Python and Rust components to scrape websites and convert HTML content to markdown, JSON, or XML formats. It supports sitemap parsing, semantic chunking for RAG (Retrieval-Augmented Generation), and includes performance optimizations through Rust integration.

Key features include HTML-to-markdown/JSON/XML conversion with support for various elements, intelligent content chunking that preserves document structure, and systematic content discovery through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.
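As a rough illustration of what structure-preserving chunking can look like, the sketch below splits markdown at headings and records each chunk's heading ancestry. It is not RAGnificent's implementation: the `Chunk` class and `chunk_by_headings` function are hypothetical names, and the approach assumes ATX-style (`#`) headings and does not special-case `#` lines inside code blocks.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical chunk record; not RAGnificent's actual chunk type."""
    heading_path: list[str]  # headings from the top level down to this section
    text: str


# ATX headings only: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[Chunk]:
    """Split markdown at headings, keeping each chunk's heading ancestry."""
    chunks: list[Chunk] = []
    path: list[str] = []
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(list(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then descend.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Keeping the heading path with each chunk lets a retriever show where a passage came from even after the document has been split.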

Check out the deepwiki for a granular breakdown of the repository contents, purpose and structure.

Documentation

Features - Feature overview and capabilities
Configuration - Configuration management and environment setup
Optimization - Performance tuning and optimization guide

Installation

git clone https://github.com/krljakob/RAGnificent.git
cd RAGnificent

# Quick setup
./build_all.sh  # Unix/macOS
# or: .\build_all.ps1  # Windows

# Manual setup
uv venv && export PATH=".venv/bin:$PATH"
uv pip install -r requirements.txt && uv pip install -e .
pytest

Quick Start

# Basic conversion
python -m RAGnificent https://example.com -o output.md

# With RAG chunking
python -m RAGnificent https://example.com --save-chunks --chunk-dir chunks

# Multiple formats and parallel processing
python -m RAGnificent --links-file urls.txt --parallel -f json

# Python API
from RAGnificent.core.scraper import MarkdownScraper

scraper = MarkdownScraper()
# Fetch the page's HTML
html = scraper.scrape_website("https://example.com")
# Convert the fetched HTML to markdown
markdown = scraper.convert_to_markdown(html, "https://example.com")
# Split the markdown into chunks for RAG
chunks = scraper.create_chunks(markdown, "https://example.com")
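
For readers who want the effect of `--links-file` with `--parallel` from their own scripts, the following is a minimal sketch using only the standard library. It does not show how the CLI works internally. The helper names `read_links` and `convert_all` are hypothetical, and the file format (one URL per line, blank lines and `#` comments ignored) is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def read_links(text: str) -> list[str]:
    """Return the URLs in a links file's contents.

    Assumes one URL per line; blank lines and '#' comments are skipped.
    """
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


def convert_all(
    urls: list[str],
    convert: Callable[[str], str],
    max_workers: int = 4,
) -> tuple[dict[str, str], dict[str, Exception]]:
    """Run convert(url) for every URL on a thread pool.

    Returns (results, errors) so one failing URL does not abort the batch.
    """
    results: dict[str, str] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(convert, url) for url in urls}
    # Leaving the 'with' block waits for every submitted task to finish.
    for url, future in futures.items():
        error = future.exception()
        if error is None:
            results[url] = future.result()
        else:
            errors[url] = error
    return results, errors
```

The `convert` argument could wrap the two calls from the Python API example above, `scrape_website` followed by `convert_to_markdown`. Threads suit this workload because most of the time is spent waiting on network I/O.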

Testing

# Run all tests (including benchmarks)
pytest

# Fast development testing (recommended)
./run_tests.sh fast  # ~15 seconds
# or
pytest -m "not benchmark and not slow"

# Run specific test categories
./run_tests.sh unit         # Unit tests only
./run_tests.sh integration  # Integration tests
./run_tests.sh benchmark    # Performance benchmarks
./run_tests.sh profile      # Show slowest tests

# Run specific test files
pytest tests/unit/test_chunk_utils.py -v
pytest tests/rust/test_python_bindings.py -v

Test Performance: Tests are organized by speed. The fast selection runs in ~15 seconds, while the full suite, including benchmarks, takes ~22 seconds. The fast selection skips benchmarks and slow tests for rapid development cycles.

Current Status: 48 tests with comprehensive coverage across core functionality.

Development

Code Organization

- RAGnificent/: Main Python package
  - core/: Core functionality (scraper, cache, config, etc.)
  - rag/: RAG-specific components (embedding, vector store, search)
  - utils/: Utility modules (chunking, sitemap parsing)
- src/: Rust source code for performance-critical operations
- tests/: Comprehensive test suite
- examples/: Demo scripts and usage examples
- docs/: Detailed documentation

Running Benchmarks

cargo bench
python scripts/visualize_benchmarks.py

Contributing

1. Fork the repository
2. Create your feature branch (git checkout -b feature/fix-rate-limiting)
3. Commit your changes (git commit -m 'Fix rate limiting edge case')
4. Push to the branch (git push origin feature/fix-rate-limiting)
5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

krljakob
