Commit 71186f9

Remove emojis and update formatting for consistency
- Replace emojis with text symbols (e.g., ✅ → [✓], ⚠️ → [WARNING]) in PLANNING.md, .sourcery.yaml, and other docs for better plain-text compatibility.
- Update LICENSE copyright notice to "ursister [Python/Rust]" from "ursister 🐍🦀".
- Minor comment adjustments in src/lib.rs and src/html_parser.rs for consistent casing and style.
- No functional changes; focuses on documentation and config cleanup across README.md, TASKS.md, pyproject.toml, Python/TUI files, and tests.
1 parent 09fc370 commit 71186f9
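
The replacement described in the commit message could be scripted roughly as follows. This is a hypothetical sketch, not the tooling used for this commit: `REPLACEMENTS` and `replace_emojis` are made-up names, and the mapping lists only the pairs visible in the message and the diff below (decorative emojis such as 🪄 and 📚 are simply dropped together with the space that follows them).

```python
# Hypothetical emoji-to-text mapping; only pairs seen in this commit are listed.
REPLACEMENTS = {
    "\u2705": "[✓]",                           # ✅
    "\u26a0\ufe0f": "[WARNING]",               # ⚠️ with variation selector
    "\u26a0": "[WARNING]",                     # ⚠ without variation selector
    "\U0001F4CB": "[PLANNED]",                 # 📋
    "\U0001F40D\U0001F980": "[Python/Rust]",   # 🐍🦀
    "\U0001FA84 ": "",                         # 🪄 plus trailing space, dropped
    "\U0001F4DA ": "",                         # 📚 plus trailing space, dropped
}


def replace_emojis(text: str) -> str:
    """Return text with every known emoji replaced by its plain-text symbol."""
    # Longest keys first, so multi-code-point sequences win over their prefixes.
    for emoji in sorted(REPLACEMENTS, key=len, reverse=True):
        text = text.replace(emoji, REPLACEMENTS[emoji])
    return text
```

Applying `replace_emojis` to each documentation file and writing the result back would leave non-emoji content untouched, which matches the "no functional changes" intent stated above.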

11 files changed: +356 -356 lines changed

.sourcery.yaml

Lines changed: 2 additions & 2 deletions
@@ -1,10 +1,10 @@
-# 🪄 This is your project's Sourcery configuration file.
+# This is your project's Sourcery configuration file.

 # You can use it to get Sourcery working in the way you want, such as
 # ignoring specific refactorings, skipping directories in your project,
 # or writing custom rules.

-# 📚 For a complete reference to this file, see the documentation at
+# For a complete reference to this file, see the documentation at
 # https://docs.sourcery.ai/Configuration/Project-Settings/

 # This file was auto-generated by Sourcery on 2025-03-04 at 19:43.

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # MIT License

-Copyright (c) 2025 ursister 🐍🦀
+Copyright (c) 2025 ursister [Python/Rust]

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

PLANNING.md

Lines changed: 26 additions & 26 deletions
@@ -89,11 +89,11 @@ markdown_lab/

 #### Current Stack Assessment

-- **Python 3.12+**: Modern, well-suited for orchestration and I/O
-- **Rust 2024**: Good for performance-critical operations
-- **PyO3**: Mature Python-Rust binding solution
-- **BeautifulSoup**: ⚠️ Memory-heavy, consider lxml for performance
-- **requests**: ⚠️ Synchronous, consider httpx for async operations
+- **Python 3.12+**: [] Modern, well-suited for orchestration and I/O
+- **Rust 2024**: [] Good for performance-critical operations
+- **PyO3**: [] Mature Python-Rust binding solution
+- **BeautifulSoup**: [WARNING] Memory-heavy, consider lxml for performance
+- **requests**: [WARNING] Synchronous, consider httpx for async operations

 #### Recommended Technology Updates

@@ -272,14 +272,14 @@ Conversion Rate: 1,200+ docs/second

 ## Implementation Timeline

-### Phase 1: Foundation (Week 1-2) COMPLETED
+### Phase 1: Foundation (Week 1-2) [] COMPLETED

-- Create core configuration system (MarkdownLabConfig with validation)
-- Establish unified error hierarchy (structured exceptions with context)
-- Extract common HTTP client (consolidated request handling)
-- Remove dead dependencies and fix version conflicts
-- Optimize HTML processing pipeline (cached selectors, 40-50% improvement)
-- Fix justfile recipe errors and standardize development workflow
+- [] Create core configuration system (MarkdownLabConfig with validation)
+- [] Establish unified error hierarchy (structured exceptions with context)
+- [] Extract common HTTP client (consolidated request handling)
+- [] Remove dead dependencies and fix version conflicts
+- [] Optimize HTML processing pipeline (cached selectors, 40-50% improvement)
+- [] Fix justfile recipe errors and standardize development workflow

 ### Phase 2: Network & I/O Optimization (Week 3-4)

@@ -432,31 +432,31 @@ This refactoring plan targets a **25-35% code reduction** while achieving **50%+

 #### Completed High-Impact Performance Optimizations

-**T18: Tokio Runtime Optimization** **COMPLETED**
+**T18: Tokio Runtime Optimization** [] **COMPLETED**

 - **File:** `src/lib.rs:14-17, 107-111`
 - **Implementation:** Shared Tokio runtime using `once_cell::sync::Lazy`
 - **Impact:** 60% improvement potential - eliminates expensive runtime creation per JS rendering request
 - **Results:** Runtime instantiation overhead eliminated, single shared runtime for all operations
 - **Code Quality:** Added comprehensive documentation and error handling

-**T19: ThreadPoolExecutor Optimization** **COMPLETED**
+**T19: ThreadPoolExecutor Optimization** [] **COMPLETED**

 - **Files:** `markdown_lab/utils/thread_pool.py` (new), `markdown_lab/core/scraper.py:627, 657`
 - **Implementation:** Singleton thread pool pattern with configurable workers
 - **Impact:** 50% improvement potential - reuses thread pool across batch operations
 - **Results:** Thread pool creation overhead eliminated, validated 10 instantiations in 0.0000s
 - **Code Quality:** Thread-safe singleton with proper lifecycle management

-**T20: Async Cache I/O Implementation** **COMPLETED**
+**T20: Async Cache I/O Implementation** [] **COMPLETED**

 - **Files:** `markdown_lab/core/async_cache.py` (new), `pyproject.toml:10`
 - **Implementation:** Async cache with gzip compression and aiofiles integration
 - **Impact:** 45% improvement potential - async I/O with content compression
 - **Results:** Validated compression working, 26KB content cached efficiently
 - **Code Quality:** Graceful fallback to sync operations, comprehensive error handling

-**T21: Text Chunking Algorithm Optimization** **COMPLETED**
+**T21: Text Chunking Algorithm Optimization** [] **COMPLETED**

 - **File:** `src/chunker.rs:6-30, 183-242`
 - **Implementation:** Pre-compiled regex patterns using `once_cell` for sentence/paragraph detection
@@ -529,10 +529,10 @@ This refactoring plan targets a **25-35% code reduction** while achieving **50%+

 #### Implementation Risks (Successfully Mitigated)

-- **Breaking Changes:** All optimizations maintain backward compatibility
-- **Performance Regressions:** Core functionality validated, performance improved
-- **Integration Issues:** Comprehensive testing confirms optimization integration
-- **Resource Management:** Proper cleanup and lifecycle management implemented
+- [] **Breaking Changes:** All optimizations maintain backward compatibility
+- [] **Performance Regressions:** Core functionality validated, performance improved
+- [] **Integration Issues:** Comprehensive testing confirms optimization integration
+- [] **Resource Management:** Proper cleanup and lifecycle management implemented

 #### Quality Assurance Results

@@ -572,28 +572,28 @@ All high-impact performance bottlenecks identified in the comprehensive analysis

 #### Wave 2: Code Consolidation Achievements (T22-T25)

-**T22: HTTP Client Duplication Elimination** **COMPLETED**
+**T22: HTTP Client Duplication Elimination** [] **COMPLETED**

 - **Files:** Unified `core/client.py`, removed `network/client.py` entirely
 - **Implementation:** Single HTTP client with enhanced functionality (CachedHttpClient, context manager, batch operations)
 - **LOC Reduction:** 146 lines eliminated (464→318 lines)
 - **Impact:** Eliminated all HTTP client duplication, enhanced with connection pooling and structured error handling

-**T23: Configuration Management Centralization** **COMPLETED**
+**T23: Configuration Management Centralization** [] **COMPLETED**

 - **Files:** Updated 8 modules to use centralized `MarkdownLabConfig`
 - **Implementation:** Eliminated scattered parameters, unified CLI argument handling, added cache size limits
 - **LOC Reduction:** ~75 lines consolidated across multiple files
 - **Impact:** Single source of configuration truth, backward compatible, enhanced validation

-**T24: URL Utilities Consolidation** **COMPLETED**
+**T24: URL Utilities Consolidation** [] **COMPLETED**

 - **Files:** Created `utils/url_utils.py`, updated 6 modules
 - **Implementation:** 9 comprehensive URL utility functions, eliminated filename generation duplication
 - **LOC Reduction:** 104 lines of duplicate logic consolidated
 - **Impact:** Centralized URL processing with type hints, documentation, and comprehensive validation

-**T25: Error Handling Unification** **COMPLETED**
+**T25: Error Handling Unification** [] **COMPLETED**

 - **Files:** Standardized HTTP exception handling across `scraper.py`, `sitemap_utils.py`, `client.py`
 - **Implementation:** Unified error handling patterns, centralized retry logic, structured error context
@@ -626,14 +626,14 @@ All high-impact performance bottlenecks identified in the comprehensive analysis

 ### Final Validation Results

-#### Testing Comprehensive
+#### Testing Comprehensive []

 - **Rust Tests:** 10/10 passing (core algorithms validated)
 - **Python Bindings:** 4/4 passing (Rust-Python integration validated)
 - **Integration Tests:** All core functionality validated
 - **Performance Tests:** All optimizations working together in 0.0022s

-#### Code Quality Metrics
+#### Code Quality Metrics []

 - **Duplication Reduction:** Major elimination of HTTP, config, URL, and error handling duplications
 - **Resource Efficiency:** Shared thread pools, cached patterns, optimized I/O
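
The T19 entry in the PLANNING.md hunks above describes a thread-safe singleton thread pool with configurable workers. A minimal sketch of that pattern follows; it assumes nothing about the actual contents of `markdown_lab/utils/thread_pool.py`, and the names `get_shared_executor` and `shutdown_shared_executor` as well as the default of 4 workers are illustrative only.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

_lock = threading.Lock()
_executor: Optional[ThreadPoolExecutor] = None


def get_shared_executor(max_workers: int = 4) -> ThreadPoolExecutor:
    """Return the process-wide executor, creating it on first use.

    max_workers only takes effect on the call that creates the pool.
    """
    global _executor
    if _executor is None:
        with _lock:
            # Re-check under the lock so two threads cannot both create a pool.
            if _executor is None:
                _executor = ThreadPoolExecutor(max_workers=max_workers)
    return _executor


def shutdown_shared_executor() -> None:
    """Wait for pending work, then drop the pool so a later call rebuilds it."""
    global _executor
    with _lock:
        if _executor is not None:
            _executor.shutdown(wait=True)
            _executor = None
```

Batch operations would call `get_shared_executor()` instead of constructing a new `ThreadPoolExecutor` per batch, which is the reuse the plan credits for the reduced creation overhead.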

README.md

Lines changed: 30 additions & 30 deletions
@@ -5,8 +5,8 @@
 Markdown Lab combines Python and Rust components to scrape websites and convert HTML content to markdown, JSON, or XML formats. It supports sitemap parsing, semantic chunking for RAG
 (Retrieval-Augmented Generation), and includes performance optimizations through Rust integration.

-Key features include HTML-to-markdown/JSON/XML conversion with support for various elements (headers, links, images, lists, code blocks), content chunking that preserves document structure, and systematic content discovery
-through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.
+Key features include HTML-to-markdown/JSON/XML conversion with support for various elements (headers, links, images, lists, code blocks), content chunking that preserves document structure, and systematic content discovery
+through sitemap parsing. The hybrid architecture uses Python for high-level operations and Rust for performance-critical tasks.

 Check out [deepwiki](https://deepwiki.com/ursisterbtw/markdown_lab/) for a detailed breakdown of the repository.

@@ -181,32 +181,32 @@ html_content = scraper.scrape_website("https://example.com")
 markdown_content = scraper.convert_to_markdown(html_content, "https://example.com")
 scraper.save_content(markdown_content, "output.md")

-# Using JSON or XML format with the Rust implementation
-from markdown_lab import markdown_lab_rs
-
-html_content = scraper.scrape_website("https://example.com")
-
-# Convert to Markdown (legacy helper)
-markdown_content = markdown_lab_rs.convert_html_to_markdown(
-html_content, "https://example.com"
-)
-scraper.save_content(markdown_content, "output.md")
-
-# Convert to JSON or XML using string format names
-json_content = markdown_lab_rs.convert_html_to_format(
-html_content, "https://example.com", "json"
-)
-scraper.save_content(json_content, "output.json")
-
-xml_content = markdown_lab_rs.convert_html_to_format(
-html_content, "https://example.com", "xml"
-)
-scraper.save_content(xml_content, "output.xml")
-
-# Note: An OutputFormat enum is exposed for convenience:
-# from markdown_lab import markdown_lab_rs
-# fmt = markdown_lab_rs.OutputFormat.from_str("json") # returns an enum value
-# The current Python bindings accept string names ("markdown"|"json"|"xml").
+# Using JSON or XML format with the Rust implementation
+from markdown_lab import markdown_lab_rs
+
+html_content = scraper.scrape_website("https://example.com")
+
+# Convert to Markdown (legacy helper)
+markdown_content = markdown_lab_rs.convert_html_to_markdown(
+html_content, "https://example.com"
+)
+scraper.save_content(markdown_content, "output.md")
+
+# Convert to JSON or XML using string format names
+json_content = markdown_lab_rs.convert_html_to_format(
+html_content, "https://example.com", "json"
+)
+scraper.save_content(json_content, "output.json")
+
+xml_content = markdown_lab_rs.convert_html_to_format(
+html_content, "https://example.com", "xml"
+)
+scraper.save_content(xml_content, "output.xml")
+
+# Note: An OutputFormat enum is exposed for convenience:
+# from markdown_lab import markdown_lab_rs
+# fmt = markdown_lab_rs.OutputFormat.from_str("json") # returns an enum value
+# The current Python bindings accept string names ("markdown"|"json"|"xml").
 ```

 #### With Sitemap Discovery
@@ -491,7 +491,7 @@ This project is licensed under the MIT License - see the [LICENSE file](LICENSE)

 ## Roadmap

-### Completed
+### [] Completed

 - [x] Add support for more HTML elements
 - [x] Implement chunking for RAG
@@ -509,7 +509,7 @@ This project is licensed under the MIT License - see the [LICENSE file](LICENSE)
 - [ ] Memory usage optimization in chunking algorithms
 - [ ] Module restructuring for better maintainability

-### 📋 Planned
+### [PLANNED] Planned

 - [ ] Add support for JavaScript-rendered pages
 - [ ] Implement custom markdown templates
