A webscraper built with Rust and Python that turns webpages into markdown, JSON, or XML formats. Works well for LLM applications and RAG workflows.

krljakob/RAGnificent

RAGnificent

RAGnificent combines Python and Rust components to scrape websites and convert HTML content to markdown, JSON, or XML formats. It supports sitemap parsing, semantic chunking for RAG (Retrieval-Augmented Generation), and includes performance optimizations through Rust integration.

Key features include HTML-to-markdown/JSON/XML conversion with support for various elements, intelligent content chunking that preserves document structure, and systematic content discovery through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.
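As a rough illustration of the sitemap-based content discovery mentioned above, the sketch below pulls page URLs out of a standard sitemaps.org `urlset` document using only the Python standard library. The function name `extract_sitemap_urls` and the sample XML are assumptions made for this example; they are not taken from RAGnificent's own sitemap utilities, which may work differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document.

    Hypothetical helper for illustration only, not RAGnificent's API.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Made-up sample sitemap used only to exercise the helper.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
```

A real crawler would also need to handle sitemap index files and compressed sitemaps, which this sketch leaves out.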

See the DeepWiki page for a detailed breakdown of the repository's contents, purpose, and structure.

Documentation

  • Features - Feature overview and capabilities
  • Configuration - Configuration management and environment setup
  • Optimization - Performance tuning and optimization guide

Installation

git clone https://github.com/krljakob/RAGnificent.git
cd RAGnificent

# Quick setup
./build_all.sh  # Unix/macOS
# or: .\build_all.ps1  # Windows

# Manual setup
uv venv && export PATH=".venv/bin:$PATH"
uv pip install -r requirements.txt && uv pip install -e .
pytest

Quick Start

# Basic conversion
python -m RAGnificent https://example.com -o output.md

# With RAG chunking
python -m RAGnificent https://example.com --save-chunks --chunk-dir chunks

# Multiple formats and parallel processing
python -m RAGnificent --links-file urls.txt --parallel -f json
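To give a sense of what processing a links file in parallel can look like, here is a minimal sketch built on a thread pool. It is not the implementation behind `--links-file` or `--parallel`: the helpers `read_links` and `process_in_parallel`, the `#` comment convention, and the `worker` callback are all assumptions introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def read_links(text: str) -> list[str]:
    """Parse links-file content: one URL per line.

    Assumption: blank lines and lines starting with '#' are ignored.
    """
    links = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            links.append(line)
    return links


def process_in_parallel(
    urls: list[str],
    worker: Callable[[str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Run `worker` on every URL in a thread pool and collect results by URL."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                results[url] = f"error: {exc}"
    return results


# Stand-in worker for demonstration; a real one would fetch and convert the page.
def fake_worker(url: str) -> str:
    return f"converted {url}"


links = read_links("https://example.com\n# skipped\n\nhttps://example.org\n")
print(process_in_parallel(links, fake_worker))
```

Threads suit this workload because page fetching is mostly I/O-bound; one failing URL is recorded in the results rather than stopping the rest of the batch.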

# Python API
from RAGnificent.core.scraper import MarkdownScraper

scraper = MarkdownScraper()
html = scraper.scrape_website("https://example.com")
markdown = scraper.convert_to_markdown(html, "https://example.com")
chunks = scraper.create_chunks(markdown, "https://example.com")
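The idea of chunking that preserves document structure can be sketched as splitting markdown at heading boundaries, so each chunk keeps its section heading as context. The code below is a simplified stand-in, not the logic behind `create_chunks`: the function `chunk_markdown`, its `max_chars` parameter, and the shape of the returned dictionaries are assumptions made for illustration.

```python
import re

# Matches ATX-style markdown headings such as "# Title" or "### Details".
# Limitation: a "# comment" line inside a fenced code block also matches.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(markdown: str, source_url: str, max_chars: int = 1000) -> list[dict]:
    """Split markdown into chunks at headings.

    `max_chars` is a soft limit: a section that grows past it is cut at the
    next blank line, so paragraphs are never split in the middle.
    """
    chunks: list[dict] = []
    heading = ""
    buffer: list[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk under the current heading.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"heading": heading, "source": source_url, "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        if HEADING.match(line):
            flush()  # close the previous section before starting a new one
            heading = line.lstrip("#").strip()
        elif not line.strip() and sum(len(item) + 1 for item in buffer) >= max_chars:
            flush()  # oversized section: cut at this paragraph break
        buffer.append(line)
    flush()
    return chunks


doc = "# Title\n\nIntro text.\n\n## Usage\n\nRun it.\n"
for chunk in chunk_markdown(doc, "https://example.com"):
    print(chunk["heading"], "->", len(chunk["text"]), "chars")
```

Keeping the heading and source URL alongside each chunk lets a retrieval step show where a passage came from.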

Testing

# Run all tests (including benchmarks)
pytest

# Fast development testing (recommended)
./run_tests.sh fast  # ~15 seconds
# or
pytest -m "not benchmark and not slow"

# Run specific test categories
./run_tests.sh unit         # Unit tests only
./run_tests.sh integration  # Integration tests
./run_tests.sh benchmark    # Performance benchmarks
./run_tests.sh profile      # Show slowest tests

# Run specific test files
pytest tests/unit/test_chunk_utils.py -v
pytest tests/rust/test_python_bindings.py -v

Test Performance: Tests are organized by speed. The fast unit tests run in ~15 seconds, while the full suite, including benchmarks, takes ~22 seconds. Benchmarks are skipped by default for rapid development cycles.

Current Status: 48 tests covering core functionality.

Development

Code Organization

  • RAGnificent/: Main Python package
    • core/: Core functionality (scraper, cache, config, etc.)
    • rag/: RAG-specific components (embedding, vector store, search)
    • utils/: Utility modules (chunking, sitemap parsing)
  • src/: Rust source code for performance-critical operations
  • tests/: Comprehensive test suite
  • examples/: Demo scripts and usage examples
  • docs/: Detailed documentation

Running Benchmarks

cargo bench
python scripts/visualize_benchmarks.py

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/fix-rate-limiting)
  3. Commit your changes (git commit -m 'Fix rate limiting edge case')
  4. Push to the branch (git push origin feature/fix-rate-limiting)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

krljakob

Contributors 8