The project will be structured with these core components:

1. **Crawler** - Main component that orchestrates the crawling process
2. **RequestHandler** - Handles HTTP requests with proper headers, retries, and timeouts
3. **ResponseParser** - Parses HTML responses to extract links
4. **Cache** - Stores visited URLs and their responses
5. **LinkFilter** - Filters links based on domain/subdomain rules
6. **TaskManager** - Manages parallel execution of crawling tasks

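A rough sketch of how these components could fit together is shown below; the class and method names are illustrative assumptions, not a finalized API.

```python
# Illustrative skeletons only; names and signatures are assumptions.
class RequestHandler:
    async def fetch(self, url: str) -> str:
        """GET a URL with proper headers, retries, and timeouts."""
        raise NotImplementedError


class ResponseParser:
    def extract_links(self, html: str, base_url: str) -> list[str]:
        """Return the absolute URLs referenced by a page."""
        raise NotImplementedError


class Cache:
    def get(self, url: str) -> str | None: ...
    def put(self, url: str, body: str) -> None: ...


class LinkFilter:
    def allow(self, url: str) -> bool: ...


class TaskManager:
    async def run(self, tasks) -> None:
        """Execute crawl coroutines concurrently."""
        raise NotImplementedError


class Crawler:
    """Orchestrates fetching, parsing, filtering, caching, and scheduling."""

    def __init__(self, handler, parser, cache, link_filter, task_manager):
        self.handler = handler
        self.parser = parser
        self.cache = cache
        self.link_filter = link_filter
        self.task_manager = task_manager
```
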
### 2. Caching Strategy

For the caching requirement:

- **In-memory cache**: Fast but limited by available RAM
- **File-based cache**: Persistent but slower
- **Database cache**: Structured and persistent, but requires setup

We'll start with a simple in-memory cache using Python's built-in `dict` for development, then expand to a persistent solution like SQLite for production use.

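As a minimal sketch, the development cache can be a thin wrapper around a `dict`; the `get`/`put` interface is an assumption, chosen so a persistent backend such as SQLite could implement the same methods later.

```python
class InMemoryCache:
    """Development cache: visited URLs and their responses in a plain dict."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get(self, url: str) -> str | None:
        return self._store.get(url)

    def put(self, url: str, body: str) -> None:
        self._store[url] = body

    def __contains__(self, url: str) -> bool:
        return url in self._store
```
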
### 3. Concurrency Model

To leverage all available processors:

- **Threading**: Good for I/O-bound operations like web requests
- **Multiprocessing**: Better for CPU-bound tasks
- **Async I/O**: Excellent for many concurrent I/O operations

We'll use `asyncio` with `aiohttp` to make concurrent requests, as web crawling is primarily I/O-bound.

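A sketch of the concurrent fetch path under that model; the timeout value, user-agent string, and function names are placeholders, not settled choices.

```python
import asyncio

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str | None:
    """Fetch one page, returning None on request errors or timeouts."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
            resp.raise_for_status()
            return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None


async def fetch_all(urls: list[str]) -> list[str | None]:
    """Fetch many pages concurrently with a shared session."""
    headers = {"User-Agent": "example-crawler/0.1"}  # placeholder UA string
    async with aiohttp.ClientSession(headers=headers) as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))


# Example: asyncio.run(fetch_all(["https://example.com"]))
```

Reusing a single `ClientSession` lets aiohttp pool connections across requests instead of opening a new connection per URL.
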
### 4. URL Handling and Filtering

For domain/subdomain filtering:

- Use `urllib.parse` to extract and compare domains
- Implement a configurable rule system (allow/deny lists)
- Handle relative URLs properly by converting them to absolute

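A sketch of these rules with `urllib.parse`; the allow-list semantics (exact domain or any subdomain) are an assumption.

```python
from urllib.parse import urljoin, urlparse


def to_absolute(base_url: str, href: str) -> str:
    """Resolve a possibly relative link against the page it came from."""
    return urljoin(base_url, href)


def in_allowed_domain(url: str, allowed: set[str]) -> bool:
    """Allow a URL if its host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


# Examples:
# to_absolute("https://example.com/a/", "../b.html") -> "https://example.com/b.html"
# in_allowed_domain("https://docs.example.com/x", {"example.com"}) -> True
```
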
### 5. Depth Management

For recursion depth:

- Track depth as a parameter passed to each recursive call
- Implement a max depth check before proceeding with crawling
- Consider breadth-first vs. depth-first strategies

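A sketch of breadth-first crawling with per-URL depth tracking; `fetch_links` is a hypothetical stand-in for the real fetch-and-parse step.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_bfs(start_url: str, max_depth: int,
              fetch_links: Callable[[str], Iterable[str]]) -> set[str]:
    """Visit pages breadth-first, never expanding beyond max_depth."""
    visited = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand links found at the depth limit
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return visited
```

Breadth-first makes the depth limit straightforward to enforce, because every URL is queued together with the depth at which it was discovered.
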
### 6. Error Handling & Politeness

Additional considerations:

- Robust error handling for network issues and malformed HTML
- Rate limiting to avoid overwhelming servers
- Respect for `robots.txt` rules
- User-agent identification

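A sketch of the politeness checks using the standard-library `urllib.robotparser` plus a simple delay; the user-agent string and delay value are placeholders.

```python
import asyncio
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler/0.1"  # placeholder identification string


def robots_allows(base_url: str, url: str) -> bool:
    """Check the site's robots.txt before fetching a URL."""
    parser = RobotFileParser(urljoin(base_url, "/robots.txt"))
    parser.read()  # blocking; a real crawler would cache this per host
    return parser.can_fetch(USER_AGENT, url)


async def polite_delay(seconds: float = 1.0) -> None:
    """Simple rate limiting: pause between requests to the same host."""
    await asyncio.sleep(seconds)
```
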
### 7. Data Storage

For storing the crawled data:

- Define a clear structure for storing URLs and their associated content
- Consider what metadata to keep (status code, headers, timestamps)

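One possible record layout for crawled pages carrying that metadata; the field names and SQLite schema are assumptions.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    status: int
    content_type: str
    fetched_at: str  # ISO 8601 timestamp
    body: str


def init_store(path: str = "crawl.db") -> sqlite3.Connection:
    """Create the pages table if it does not already exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url TEXT PRIMARY KEY,
               status INTEGER,
               content_type TEXT,
               fetched_at TEXT,
               body TEXT
           )"""
    )
    return conn
```
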
If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:
```bash
./venv/bin/python generate_test_site.py
```
For more details on the test environment, see the [README-test-environment.md](README-test-environment.md) file.