A high-performance search engine that crawls web pages, indexes content, calculates PageRank scores, and provides a modern UI for searching.

LoayAhmed304/search-engine

Hola! Can you say "The best search engine ever" with me?

🔍 Search Engine Project

A full-stack search engine built with a Java/Spring Boot backend and a Vue.js frontend

Project Overview

This is a high-performance search engine that crawls web pages, indexes content, calculates PageRank scores, and provides a modern web interface for searching. The system is designed with scalability and performance in mind, featuring multi-threaded crawling, efficient indexing, and intelligent ranking algorithms.

Video Preview

Seo-Preview-V7.mp4

✨ Features

🕷️ Web Crawler

  • Multi-threaded Architecture: Configurable thread pool (default: 20 threads)
  • High Performance: Crawls 1,000 documents in under 1 minute using 5 threads
  • Smart Batching: Prioritizes popular pages using frequency-based batching
  • Robots.txt Compliance: Respects web server policies with robust caching
  • Duplicate Detection: Content hashing prevents redundant processing
  • URL Normalization: Standardizes and filters invalid URLs
  • Compression: Stores crawled content efficiently
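The URL normalization and duplicate detection bullets above can be illustrated with a minimal sketch. The class name `CrawlFilterSketch`, the choice of SHA-256, and the exact normalization rules (lowercase scheme and host, drop the fragment, accept only http/https) are assumptions made for illustration, not the project's actual crawler code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: normalizes URLs and skips page bodies that were already seen.
final class CrawlFilterSketch {
    // Thread-safe set, since several crawler threads may report pages at once.
    private final Set<String> seenHashes = ConcurrentHashMap.newKeySet();

    /** Returns a canonical form of the URL, or empty if the URL should be skipped. */
    static Optional<String> normalize(String raw) {
        try {
            URI uri = new URI(raw.trim()).normalize(); // resolves "." and ".." segments
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) return Optional.empty();
            scheme = scheme.toLowerCase(Locale.ROOT);
            if (!scheme.equals("http") && !scheme.equals("https")) return Optional.empty();
            String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
            // Rebuild without user info or fragment; keep port and query string.
            URI clean = new URI(scheme, null, host.toLowerCase(Locale.ROOT),
                    uri.getPort(), path, uri.getQuery(), null);
            return Optional.of(clean.toString());
        } catch (URISyntaxException e) {
            return Optional.empty(); // malformed input is filtered out
        }
    }

    /** Hex-encoded SHA-256 digest of the page body. */
    static String sha256(String content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandatory on every JVM
        }
    }

    /** True the first time a given body is seen, false for later duplicates. */
    boolean isNewContent(String body) {
        return seenHashes.add(sha256(body));
    }

    public static void main(String[] args) {
        CrawlFilterSketch filter = new CrawlFilterSketch();
        System.out.println(normalize("HTTP://Example.com/a/../b#top")); // Optional[http://example.com/b]
        System.out.println(filter.isNewContent("<p>hello</p>"));        // true
        System.out.println(filter.isNewContent("<p>hello</p>"));        // false
    }
}
```

Hashing the body rather than the URL means two different URLs that serve identical content are processed only once.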

Optimization Techniques

  • Compresses documents before storing them and decompresses them on read, so far less data sits in the database and operations run faster.
  • The RobotsHandler implements a domain-based cache that maps hostnames to parsed robots.txt rules, so each domain's rules are fetched only once no matter how many URLs from that domain are crawled.
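Both optimizations can be sketched in a few lines. The example below assumes GZIP as the compression format and models parsed robots.txt rules as a plain list of disallowed path prefixes. The class names `CrawlerStorageSketch` and `RobotsCacheSketch` and the `fetchRules` function are hypothetical and do not come from the project's RobotsHandler.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical storage helper: GZIP is an assumed format, the project may use another.
final class CrawlerStorageSketch {
    static byte[] compress(String html) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray(); // read only after the GZIP stream is closed
    }

    static String decompress(byte[] stored) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(stored))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html>" + "hello ".repeat(500) + "</html>";
        byte[] stored = compress(html);
        System.out.println(stored.length < html.length());   // true for repetitive input
        System.out.println(decompress(stored).equals(html)); // true

        RobotsCacheSketch robots = new RobotsCacheSketch(host -> List.of("/private"));
        System.out.println(robots.isAllowed("example.com", "/private/a")); // false
        System.out.println(robots.isAllowed("example.com", "/public"));    // true
    }
}

// Hypothetical per-host cache: rules for a host are loaded at most once.
final class RobotsCacheSketch {
    private final Map<String, List<String>> disallowByHost = new ConcurrentHashMap<>();
    private final Function<String, List<String>> fetchRules; // host -> disallowed path prefixes

    RobotsCacheSketch(Function<String, List<String>> fetchRules) {
        this.fetchRules = fetchRules;
    }

    boolean isAllowed(String host, String path) {
        // computeIfAbsent runs fetchRules only when the host has no cached entry yet.
        List<String> disallowed = disallowByHost.computeIfAbsent(host, fetchRules);
        // An empty Disallow value means "allow everything", so empty prefixes are ignored.
        return disallowed.stream().noneMatch(p -> !p.isEmpty() && path.startsWith(p));
    }
}
```

In a real crawler `fetchRules` would download and parse `/robots.txt` for the host; passing it in as a function keeps the caching logic easy to test without network access.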

🧾 Indexer

Transforms HTML documents into inverted indices for fast search

  • Advanced Tokenization: Intelligent text processing and cleanup
  • Stop Word Filtering: Removes common words for better relevance
  • Stemming Support: Reduces words to their root forms
  • Field Extraction: Processes titles, headers, and content separately
  • Efficient Storage: Optimized database operations
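The indexing steps listed above (tokenize, drop stop words, stem, record positions) can be sketched as a small in-memory inverted index. Everything here is an illustrative assumption: the class name `InvertedIndexSketch`, the tiny stop-word list, and the crude suffix-stripping `stem` method, which stands in for a real stemmer and is not the one the project uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory index; the project stores its index in MongoDB.
final class InvertedIndexSketch {
    // Tiny illustrative stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    // term -> (document id -> positions of the term in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Crude suffix stripping, a stand-in for a real stemming algorithm. */
    static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /** Lowercases, splits on non-alphanumerics, drops stop words, then stems. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) continue;
            tokens.add(stem(raw));
        }
        return tokens;
    }

    void addDocument(String docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Documents containing the term, with the positions where it occurs. */
    Map<String, List<Integer>> lookup(String term) {
        return postings.getOrDefault(stem(term.toLowerCase(Locale.ROOT)), Map.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument("d1", "Search engines index pages");
        System.out.println(tokenize("The crawler is crawling pages")); // [crawler, crawl, page]
        System.out.println(index.lookup("engine"));                    // {d1=[1]}
    }
}
```

Running queries through the same `tokenize` method as documents is what keeps lookups consistent with what was indexed.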

📊 Ranking System

Ranks pages based on their PageRank, TF, and IDF scores

  • TF-IDF scoring for term relevance per page
  • Normalized PageRank influence for domain authority
    • The PageRank computation takes ~10 ms on 6,000 documents
  • Structural field boosts for <title> and <h1> tag matches
  • Penalty applied for missing <h1> tags
  • Score capping to avoid overinflation
  • Computes PageRank as an offline process for all crawled URLs
  • Optimized database operations for fetching and saving ranks
    • Target: < 200 ms for 6,000 documents
    • Ranking logic runs in 8–50 ms, depending on the query
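As a rough illustration of how these signals might fit together, the sketch below computes PageRank by power iteration and then combines TF-IDF, a normalized PageRank value, field boosts, the missing-`<h1>` penalty, and a score cap. The README names the signals but not their weights, so the damping factor, boost values, penalty, and cap used here are placeholders, and `RankingSketch` is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical ranking helper; all numeric weights below are illustrative placeholders.
final class RankingSketch {
    /** Power-iteration PageRank; outLinks.get(i) lists the pages that page i links to. */
    static double[] pageRank(List<int[]> outLinks, double damping, int iterations) {
        int n = outLinks.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages with no outgoing links
            for (int i = 0; i < n; i++) {
                int[] targets = outLinks.get(i);
                if (targets.length == 0) {
                    dangling += rank[i];
                    continue;
                }
                double share = rank[i] / targets.length;
                for (int t : targets) next[t] += share;
            }
            for (int i = 0; i < n; i++) {
                // Dangling rank is spread evenly so the ranks keep summing to 1.
                next[i] = (1 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    /** Term frequency times inverse document frequency for one term in one page. */
    static double tfIdf(int termCount, int docLength, int totalDocs, int docsWithTerm) {
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    /** Combines the signals; normalizedPageRank is assumed to lie in [0, 1]. */
    static double score(double tfIdf, double normalizedPageRank,
                        boolean inTitle, boolean inH1, boolean pageHasH1) {
        double s = tfIdf + 0.3 * normalizedPageRank;
        if (inTitle) s += 0.5;     // boost for a <title> match
        if (inH1) s += 0.25;       // boost for an <h1> match
        if (!pageHasH1) s -= 0.1;  // penalty for a page without <h1>
        return Math.min(s, 2.0);   // cap to avoid overinflated scores
    }

    public static void main(String[] args) {
        // Pages 0 and 1 both link to page 2; page 2 has no outgoing links.
        double[] ranks = pageRank(List.of(new int[]{2}, new int[]{2}, new int[]{}), 0.85, 20);
        System.out.println(Arrays.toString(ranks));
        System.out.println(score(tfIdf(3, 100, 6000, 40), 0.8, true, false, true));
    }
}
```

Because PageRank depends only on the link graph and not on the query, it can be computed offline and stored, leaving only the cheap `score` combination for query time.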

🔍 Query Processing & Phrase Searching

Processes user queries, supports phrase search, and generates result snippets.

  • Unified Tokenization: Applies the same cleanup, stemming, and stop-word removal as the indexer to ensure consistency
  • Exact Phrase Search: Supports precise phrase matching, even when stop words are present
  • Multi-threading: Speeds up snippet generation and phrase matching
  • Fast Response Times:
    • General query: 0.01 – 0.2 seconds
    • Phrase search: < 0.3 seconds

Optimization Techniques

  • Snippet generation is triggered only when the corresponding result page is requested
  • Uses token position lookup from the inverted index, which avoids full document scans (no regex!)
  • Early filtering (before stemming) to narrow down the result set for phrase-search queries
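The position-lookup idea can be sketched as follows: given the positions of each phrase term inside one document, a phrase matches when the terms appear at consecutive positions in order. The class `PhraseMatchSketch` and its input shape (a map from term to sorted position list) are assumptions for illustration, not the project's actual data model.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical phrase matcher working on positional postings for a single document.
final class PhraseMatchSketch {
    /**
     * positionsByTerm maps each term to its positions in one document, sorted ascending.
     * Returns true if the phrase terms occur at consecutive positions in phrase order.
     */
    static boolean containsPhrase(List<String> phrase, Map<String, List<Integer>> positionsByTerm) {
        if (phrase.isEmpty()) return false;
        List<Integer> starts = positionsByTerm.get(phrase.get(0));
        if (starts == null) return false;
        for (int start : starts) {
            boolean match = true;
            for (int offset = 1; offset < phrase.size() && match; offset++) {
                List<Integer> positions = positionsByTerm.get(phrase.get(offset));
                // Binary search relies on the position lists being sorted.
                match = positions != null
                        && Collections.binarySearch(positions, Integer.valueOf(start + offset)) >= 0;
            }
            if (match) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> doc = Map.of(
                "search", List.of(0, 7),
                "engine", List.of(1, 4));
        System.out.println(containsPhrase(List.of("search", "engine"), doc)); // true
        System.out.println(containsPhrase(List.of("engine", "search"), doc)); // false
    }
}
```

Only the position lists are touched, so the document text itself never has to be loaded or scanned to decide whether the phrase occurs.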

Flow

📝 Development Guidelines

  • Follow the conventions outlined in Project-Guidelines.md

🛠️ How to Run

Prerequisites

  • MongoDB must be running locally
  • Maven should be installed and available in your terminal
  • Ensure your application.properties file is properly configured
  • Clone the repository and navigate to the project root directory

Backend Setup

  1. Open a terminal and navigate to the backend directory:

    cd engine
  2. Run the Spring Boot application:

    mvn spring-boot:run

To start each module in order:

  1. Run the Crawler (default thread count is 20):

    make crawl THREADS=10
  2. Run the Indexer:

    make index
  3. Run the PageRank module:

    make pagerank
  4. Run the Ranker (with an optional query):

    make rank QUERY="your search terms"

Frontend Setup

  1. Open a new terminal and go to the frontend directory:

    cd client
  2. Install dependencies:

    npm install
  3. Start the development server:

    npm run dev
  4. Open your browser and visit: http://localhost:5173

👥 Contributors

Habiba Ayman

Tasneem Mohamed

Loay Ahmed

Helana Nady
