A full-stack search engine built with a Java/Spring Boot backend and a Vue.js frontend.
This is a high-performance search engine that crawls web pages, indexes content, calculates PageRank scores, and provides a modern web interface for searching. The system is designed with scalability and performance in mind, featuring multi-threaded crawling, efficient indexing, and intelligent ranking algorithms.
- Multi-threaded Architecture: Configurable thread pool (default: 20 threads)
- High Performance: Crawls 1000 documents in under 1 minute using 5 threads
- Smart Batching: Prioritizes popular pages using frequency-based batching
- Robots.txt Compliance: Respects web server policies with robust caching
- Duplicate Detection: Content hashing prevents redundant processing
- URL Normalization: Standardizes and filters invalid URLs
- Compression: Stores crawled content efficiently
- Compresses documents before storing them and decompresses them on read, which keeps the data held in the database much smaller and speeds up database operations.
- The RobotsHandler implements a domain-based cache that maps each hostname to its parsed robots.txt rules, so a domain's rules are fetched only once regardless of how many URLs from that domain are crawled.
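The per-domain robots.txt cache described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual RobotsHandler: the class name `RobotsCacheSketch`, the nested `RobotsRules` type, and the injected `fetcher` function are all hypothetical, and the rules are reduced to a list of disallowed path prefixes.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of a hostname -> robots.txt rules cache.
class RobotsCacheSketch {

    // Simplified parsed rules: only the Disallow path prefixes for our crawler.
    static class RobotsRules {
        private final List<String> disallowedPrefixes;

        RobotsRules(List<String> disallowedPrefixes) {
            this.disallowedPrefixes = disallowedPrefixes;
        }

        boolean allows(String path) {
            return disallowedPrefixes.stream().noneMatch(path::startsWith);
        }
    }

    // One entry per hostname; safe to share between crawler threads.
    private final Map<String, RobotsRules> cache = new ConcurrentHashMap<>();

    // Assumed to download and parse robots.txt for a host (injected for testability).
    private final Function<String, RobotsRules> fetcher;

    RobotsCacheSketch(Function<String, RobotsRules> fetcher) {
        this.fetcher = fetcher;
    }

    boolean isAllowed(String url) {
        URI uri = URI.create(url);
        if (uri.getHost() == null) {
            return false; // no usable hostname, so the URL is skipped
        }
        String host = uri.getHost().toLowerCase(Locale.ROOT);
        String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
        // computeIfAbsent invokes the fetcher at most once per hostname.
        RobotsRules rules = cache.computeIfAbsent(host, fetcher);
        return rules.allows(path);
    }
}
```

With this shape, checking many URLs from the same host triggers a single fetch; a real implementation would also need to handle fetch failures and cache expiry, which the sketch leaves out.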
Transforms HTML documents into inverted indices for fast search
- Advanced Tokenization: Intelligent text processing and cleanup
- Stop Word Filtering: Removes common words for better relevance
- Stemming Support: Reduces words to their root forms
- Field Extraction: Processes titles, headers, and content separately
- Efficient Storage: Optimized database operations
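To make the indexing steps above concrete, here is a small in-memory sketch of tokenization, stop-word filtering, stemming, and a positional inverted index. Everything in it is an assumption for illustration: the class name `InvertedIndexSketch`, the tiny stop-word list, and especially the `stem` method, which only strips a trailing "s" and stands in for a real stemmer. The project's actual indexer and its database storage may look quite different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory positional inverted index.
class InvertedIndexSketch {

    // Illustrative stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    // term -> (document id -> token positions where the term occurs)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    // Lowercases the text and splits it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Placeholder stemmer: strips one trailing "s" from longer words.
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    void addDocument(String docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            String token = tokens.get(pos);
            if (STOP_WORDS.contains(token)) {
                continue; // stop words are skipped but still consume a position
            }
            postings.computeIfAbsent(stem(token), t -> new HashMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Returns document id -> positions for the stemmed form of the given word.
    Map<String, List<Integer>> lookup(String word) {
        return postings.getOrDefault(stem(word.toLowerCase(Locale.ROOT)), Map.of());
    }
}
```

Positions are counted before stop words are dropped, so the stored offsets still reflect where each term sat in the original text.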
Ranks pages based on their PageRank, TF, and IDF scores
- TF-IDF scoring for term relevance per page
- Normalized PageRank influence for domain authority
- Uses the PageRank algorithm, which takes ~10 ms on 6,000 documents
- Structural field boosts for <title> and <h1> tag matches
- Penalty applied for missing <h1> tags
- Score capping to avoid overinflation
- Computes PageRank as an offline process for all crawled URLs
- Optimized database operations for fetching and saving ranks; target: under 200 ms for 6,000 documents
- Ranking logic runs in 8 to 50 ms depending on the query
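One way the ranking signals listed above (TF-IDF, normalized PageRank, title and h1 boosts, the missing-h1 penalty, and score capping) could be combined is sketched below. The formula and every constant in it are assumptions chosen for illustration; the README does not state the actual weights or how the ranker combines them.

```java
// Hypothetical scoring sketch; all constants are illustrative, not the project's values.
class RankScoreSketch {

    static final double TITLE_BOOST = 1.5;        // term appears in <title>
    static final double H1_BOOST = 1.2;           // term appears in an <h1>
    static final double MISSING_H1_PENALTY = 0.9; // page has no <h1> at all
    static final double PAGERANK_WEIGHT = 0.3;    // influence of normalized PageRank
    static final double MAX_SCORE = 10.0;         // cap to avoid overinflated scores

    // Classic TF-IDF: term frequency in the document times log inverse document frequency.
    static double tfIdf(int termCount, int docLength, int totalDocs, int docsWithTerm) {
        if (docLength == 0 || docsWithTerm == 0) {
            return 0.0;
        }
        double tf = (double) termCount / docLength;
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return tf * idf;
    }

    // Scales a raw PageRank value into [0, 1] relative to the largest rank seen.
    static double normalizePageRank(double pageRank, double maxPageRank) {
        return maxPageRank > 0 ? pageRank / maxPageRank : 0.0;
    }

    static double score(double tfIdf, double normalizedPageRank,
                        boolean inTitle, boolean inH1, boolean pageHasH1) {
        double relevance = tfIdf;
        if (inTitle) {
            relevance *= TITLE_BOOST;
        }
        if (inH1) {
            relevance *= H1_BOOST;
        }
        if (!pageHasH1) {
            relevance *= MISSING_H1_PENALTY;
        }
        // PageRank scales relevance up instead of adding to it, so a page
        // with no matching terms still scores zero.
        double combined = relevance * (1.0 + PAGERANK_WEIGHT * normalizedPageRank);
        return Math.min(combined, MAX_SCORE);
    }
}
```

Multiplying by the PageRank term is only one design option; an additive blend would also let popular pages surface for weak matches, which may or may not be desirable.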
Processes user queries, supports phrase search, and generates result snippets.
- Unified Tokenization: Applies the same cleanup, stemming, and stop-word removal as the indexer to ensure consistency
- Exact Phrase Search: Supports precise phrase matching, even when stop words are present
- Multi-threading: Speeds up snippet generation and phrase matching
- Fast Response Times:
- General query: 0.01 to 0.2 seconds
- Phrase search: < 0.3 seconds
- Snippet generation is triggered only when the corresponding result page is requested
- Uses token position lookup from the inverted index, which avoids full document scans (no regex!)
- Early filtering (before stemming) to narrow down the result set for phrase-search queries
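The position-lookup approach to phrase search can be illustrated with the sketch below. It assumes that the inverted index stores, per document, the token positions of each term, and that positions were counted over the full token stream (stop words included), so a phrase such as "state of the art" reduces to the terms "state" at offset 0 and "art" at offset 3. The class name `PhraseMatchSketch` and the method signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical phrase matcher working only on position lists from an inverted index.
class PhraseMatchSketch {

    /**
     * @param positionsPerTerm positions of each indexed phrase term within one document
     * @param offsets          offset of each term from the start of the phrase
     *                         (gaps are left where stop words were removed)
     * @return true if the document contains the terms at exactly those relative offsets
     */
    static boolean containsPhrase(List<List<Integer>> positionsPerTerm, int[] offsets) {
        if (positionsPerTerm.isEmpty() || positionsPerTerm.size() != offsets.length) {
            return false;
        }
        // Hash sets give constant-time membership checks for candidate positions.
        List<Set<Integer>> positionSets = new ArrayList<>();
        for (List<Integer> positions : positionsPerTerm) {
            positionSets.add(new HashSet<>(positions));
        }
        // Try every occurrence of the first term as a possible phrase start.
        for (int firstPos : positionsPerTerm.get(0)) {
            int phraseStart = firstPos - offsets[0];
            boolean matches = true;
            for (int i = 1; i < offsets.length && matches; i++) {
                matches = positionSets.get(i).contains(phraseStart + offsets[i]);
            }
            if (matches) {
                return true;
            }
        }
        return false;
    }
}
```

Because only position lists are compared, the document text itself never has to be loaded or scanned to confirm a phrase match.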
- Follow the conventions outlined in Project-Guidelines.md
- MongoDB must be running locally
- Maven should be installed and available in your terminal
- Ensure your application.properties file is properly configured
- Clone the repository and navigate to the project root directory
- Open a terminal and navigate to the backend directory:
    cd engine
- Run Spring Boot:
    mvn spring-boot:run
To start each module in order:
- Run the Crawler (default thread count is 20):
    make crawl THREADS=10
- Run the Indexer:
    make index
- Run the PageRank module:
    make pagerank
- Run the Ranker (with an optional query):
    make rank QUERY="your search terms"

- Open a new terminal and go to the frontend directory:
    cd client
- Install dependencies:
    npm install
- Start the development server:
    npm run dev
- Open your browser and visit: http://localhost:5173
- Habiba Ayman
- Tasneem Mohamed
- Loay Ahmed
- Helana Nady

