
Commit 278503f

Merge pull request #3 from johnburbridge/refactor-docs
Refactor docs
2 parents c204e16 + 78c39ad commit 278503f

File tree: 5 files changed (+275, -146 lines)

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -71,3 +71,5 @@ htmlcov/
coverage.xml
*.cover
/example-site/*
+/cache/*
+/results/*

README-test-environment.md

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
# Web Scraper Test Environment

+[← Back to README](README.md)
+
This directory contains a complete local test environment for testing the web scraper against a controlled website with a known structure.

## Generated Test Site

README.md

Lines changed: 10 additions & 146 deletions
@@ -3,84 +3,19 @@
[![Python Tests](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml/badge.svg)](https://github.com/johnburbridge/scraper/actions/workflows/python-package.yml)
[![Coverage](https://codecov.io/gh/johnburbridge/scraper/branch/main/graph/badge.svg)](https://codecov.io/gh/johnburbridge/scraper)

-## Objectives
-* Given a URL, recursively crawl its links
-* Store the response
-* Parse the response extracting new links
-* Visit each link and repeat the operations above
-* Cache the results to avoid duplicative requests
-* Optionally, specify the maximum recursion depth
-* Optionally, specify whether to allow requests to other subdomains or domains
-* Optimize the process to leverage all available processors
+A flexible web crawler that recursively crawls websites, respects robots.txt, and provides various output options.

-## Design
+## Documentation

-### 1. Architecture Components
+- [Project Overview and Features](docs/project.md)
+- [Development Guide](docs/develop.md)
+- [Test Environment Documentation](README-test-environment.md)

-The project will be structured with these core components:
-
-1. **Crawler** - Main component that orchestrates the crawling process
-2. **RequestHandler** - Handles HTTP requests with proper headers, retries, and timeouts
-3. **ResponseParser** - Parses HTML responses to extract links
-4. **Cache** - Stores visited URLs and their responses
-5. **LinkFilter** - Filters links based on domain/subdomain rules
-6. **TaskManager** - Manages parallel execution of crawling tasks
-
-### 2. Caching Strategy
-
-For the caching requirement:
-
-- **In-memory cache**: Fast but limited by available RAM
-- **File-based cache**: Persistent but slower
-- **Database cache**: Structured and persistent, but requires setup
-
-We'll start with a simple in-memory cache using Python's built-in `dict` for development, then expand to a persistent solution like SQLite for production use.
-
-### 3. Concurrency Model
-
-For optimizing to leverage all available processors:
-
-- **Threading**: Good for I/O bound operations like web requests
-- **Multiprocessing**: Better for CPU-bound tasks
-- **Async I/O**: Excellent for many concurrent I/O operations
-
-We'll use `asyncio` with `aiohttp` for making concurrent requests, as web scraping is primarily I/O bound.
-
-### 4. URL Handling and Filtering
-
-For domain/subdomain filtering:
-- Use `urllib.parse` to extract and compare domains
-- Implement a configurable rule system (allow/deny lists)
-- Handle relative URLs properly by converting them to absolute
-
-### 5. Depth Management
-
-For recursion depth:
-- Track depth as a parameter passed to each recursive call
-- Implement a max depth check before proceeding with crawling
-- Consider breadth-first vs. depth-first strategies
-
-### 6. Error Handling & Politeness
-
-Additional considerations:
-- Robust error handling for network issues and malformed HTML
-- Rate limiting to avoid overwhelming servers
-- Respect for `robots.txt` rules
-- User-agent identification
-
-### 7. Data Storage
-
-For storing the crawled data:
-- Define a clear structure for storing URLs and their associated content
-- Consider what metadata to keep (status code, headers, timestamps)
-
-## User Guide
-
-### Installation
+## Installation

1. Clone the repository:
```bash
-git clone https://github.com/your-username/scraper.git
+git clone https://github.com/johnburbridge/scraper.git
cd scraper
```

@@ -95,7 +30,7 @@ source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

-### Basic Usage
+## Basic Usage

To start crawling a website:

@@ -105,7 +40,7 @@ python main.py https://example.com

This will crawl the website with default settings (depth of 3, respecting robots.txt, not following external links).

-### Command Line Options
+## Command Line Options

The scraper supports the following command-line arguments:

@@ -128,7 +63,7 @@ The scraper supports the following command-line arguments:
| `--max-subsitemaps MAX_SUBSITEMAPS` | Maximum number of sub-sitemaps to process (default: 5) |
| `--sitemap-timeout SITEMAP_TIMEOUT` | Timeout in seconds for sitemap processing (default: 30) |

-### Examples
+## Examples

#### Crawl with a specific depth limit:
```bash
@@ -159,74 +94,3 @@ python main.py https://example.com --depth 4 --concurrency 20 --ignore-robots
```bash
python main.py https://example.com --delay 1.0
```
-
-## Testing
-
-The project includes a local testing environment based on Docker that generates a controlled website structure for development and testing purposes.
-
-### Test Environment Features
-
-- 400+ HTML pages in a hierarchical structure
-- Maximum depth of 5 levels
-- Navigation links between pages at different levels
-- Proper `robots.txt` and `sitemap.xml` files
-- Random metadata on pages for testing extraction
-
-### Setting Up the Test Environment
-
-1. Make sure Docker and Docker Compose are installed and running.
-
-2. Generate the test site (if not already done):
-```bash
-./venv/bin/python generate_test_site.py
-```
-
-3. Start the Nginx server:
-```bash
-docker-compose up -d
-```
-
-4. The test site will be available at http://localhost:8080
-
-### Running Tests Against the Test Environment
-
-#### Basic crawl:
-```bash
-python main.py http://localhost:8080 --depth 2
-```
-
-#### Test with sitemap parsing:
-```bash
-python main.py http://localhost:8080 --use-sitemap
-```
-
-#### Test robots.txt handling:
-```bash
-# Default behavior respects robots.txt
-python main.py http://localhost:8080 --depth 4
-
-# Ignore robots.txt to crawl all pages
-python main.py http://localhost:8080 --depth 4 --ignore-robots
-```
-
-#### Save the crawled results:
-```bash
-python main.py http://localhost:8080 --output-dir test_results
-```
-
-### Stopping the Test Environment
-
-To stop the Docker container:
-```bash
-docker-compose down
-```
-
-### Regenerating the Test Site
-
-If you need to regenerate the test site with different characteristics, modify the configuration variables at the top of the `generate_test_site.py` file and run:
-
-```bash
-./venv/bin/python generate_test_site.py
-```
-
-For more details on the test environment, see the [README-test-environment.md](README-test-environment.md) file.

docs/develop.md

Lines changed: 185 additions & 0 deletions
@@ -0,0 +1,185 @@
+# Development Guide
+
+[← Back to README](../README.md)
+
+This guide provides instructions for setting up a development environment, running tests, and contributing to the scraper project.
+
+## Setting Up a Development Environment
+
+### Prerequisites
+
+- Python 3.11 or higher
+- Docker and Docker Compose (for integration testing)
+- Git
+
+### Initial Setup
+
+1. Clone the repository:
+```bash
+git clone https://github.com/johnburbridge/scraper.git
+cd scraper
+```
+
+2. Create and activate a virtual environment:
+```bash
+python -m venv venv
+source venv/bin/activate # On Windows: venv\Scripts\activate
+```
+
+3. Install development dependencies:
+```bash
+pip install -r requirements-dev.txt
+pip install -r requirements.txt
+```
+
+## Running Tests
+
+### Unit Tests
+
+To run all unit tests:
+```bash
+pytest
+```
+
+To run tests with coverage reporting:
+```bash
+pytest --cov=scraper --cov-report=term-missing
+```
+
+To run a specific test file:
+```bash
+pytest tests/test_crawler.py
+```
+
+### Integration Tests
+
+The project includes a Docker-based test environment that generates a controlled website for testing.
+
+1. Generate the test site:
+```bash
+python generate_test_site.py
+```
+
+2. Start the test environment:
+```bash
+docker-compose up -d
+```
+
+3. Run the scraper against the test site:
+```bash
+python main.py http://localhost:8080 --depth 2
+```
+
+4. Stop the test environment when done:
+```bash
+docker-compose down
+```
+
+### Alternative Test Server
+
+If Docker is unavailable, you can use the Python-based test server:
+
+```bash
+python serve_test_site.py
+```
+
+This will start a local HTTP server on port 8080 serving the same test site.
+
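The `serve_test_site.py` script referenced above is not shown in this diff. As a rough sketch under stated assumptions, a static test server of this kind can be built on the standard library alone; the directory name `example-site` (taken from the updated `.gitignore`) and the port are assumptions, not confirmed details of the actual script:

```python
# Minimal sketch of a static test server -- NOT the repository's serve_test_site.py.
# Assumes the generated pages live in ./example-site and should be served on port 8080.
import functools
import http.server

SITE_DIR = "example-site"  # assumed output directory of generate_test_site.py
PORT = 8080


def main() -> None:
    # SimpleHTTPRequestHandler serves files relative to `directory` (Python 3.7+).
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=SITE_DIR
    )
    with http.server.ThreadingHTTPServer(("", PORT), handler) as server:
        print(f"Serving {SITE_DIR}/ at http://localhost:{PORT} (Ctrl+C to stop)")
        server.serve_forever()


if __name__ == "__main__":
    main()
```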
+## Code Quality Tools
+
+### Linting
+
+To check code quality with flake8:
+```bash
+flake8 scraper tests
+```
+
+### Type Checking
+
+To run type checking with mypy:
+```bash
+mypy scraper
+```
+
+### Code Formatting
+
+To format code with black:
+```bash
+black scraper tests
+```
+
+## Debugging
+
+### Verbose Output
+
+To enable verbose logging:
+```bash
+python main.py https://example.com -v
+```
+
+### Profiling
+
+To profile the crawler's performance:
+```bash
+python -m cProfile -o crawler.prof main.py https://example.com --depth 1
+python -c "import pstats; p = pstats.Stats('crawler.prof'); p.sort_stats('cumtime').print_stats(30)"
+```
+
+## Test Coverage
+
+Current test coverage is monitored through CI and displayed as a badge in the README. To increase coverage:
+
+1. Check current coverage gaps:
+```bash
+pytest --cov=scraper --cov-report=term-missing
+```
+
+2. Target untested functions or code paths with new tests
+3. Verify coverage improvement after adding tests
+
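As an illustration of step 2, a coverage-driven test typically isolates one uncovered branch and pins it down with a parametrized case. The sketch below is self-contained and illustrative only: `normalize_url` is a stand-in helper defined inline, not a function from the `scraper` package.

```python
# Illustrative coverage-focused test; normalize_url is a stand-in, not scraper code.
from urllib.parse import urljoin, urlparse

import pytest


def normalize_url(base: str, href: str) -> str | None:
    """Resolve href against base and keep only http(s) URLs."""
    absolute = urljoin(base, href)
    return absolute if urlparse(absolute).scheme in ("http", "https") else None


@pytest.mark.parametrize(
    "href,expected",
    [
        ("/about", "http://localhost:8080/about"),  # relative-path branch
        ("mailto:dev@example.com", None),           # non-HTTP scheme branch
    ],
)
def test_normalize_url_branches(href: str, expected: str | None) -> None:
    assert normalize_url("http://localhost:8080/", href) == expected
```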
+## Project Structure
+
+```
+scraper/                    # Main package directory
+├── __init__.py             # Package initialization
+├── cache_manager.py        # Cache implementation
+├── callbacks.py            # Callback functions for crawled pages
+├── crawler.py              # Main crawler class
+├── request_handler.py      # HTTP request/response handling
+├── response_parser.py      # HTML parsing and link extraction
+├── robots_parser.py        # robots.txt parsing and checking
+└── sitemap_parser.py       # sitemap.xml parsing
+
+tests/                      # Test suite
+├── __init__.py
+├── conftest.py             # pytest fixtures
+├── test_cache.py           # Tests for cache_manager.py
+├── test_crawler.py         # Tests for crawler.py
+├── test_request_handler.py
+├── test_response_parser.py
+├── test_robots_parser.py
+└── test_sitemap_parser.py
+
+docs/                       # Documentation
+├── project.md              # Project overview and features
+└── develop.md              # Development guide
+
+.github/workflows/          # CI configuration
+```
+
+## Contributing
+
+### Pull Request Process
+
+1. Create a new branch for your feature or bugfix
+2. Implement your changes with appropriate tests
+3. Ensure all tests pass and coverage doesn't decrease
+4. Submit a pull request with a clear description of the changes
+
+### Coding Standards
+
+- Follow PEP 8 style guidelines
+- Include docstrings for all functions, classes, and modules
+- Add type hints to function signatures
+- Keep functions focused on a single responsibility
+- Write tests for all new functionality
