intelli3text is the text-processing backbone of the broader intelli3 project (a classification-oriented research/engineering effort).
It ingests texts from Web/PDF/DOCX/TXT, performs cleaning and multilingual normalization (PT/EN/ES), applies paragraph-level language identification (LID), and produces an auditable PDF report (raw → cleaned → normalized), ready for downstream classification tasks.
This work is part of my Master’s research, advised by Sidgley Camargo de Andrade and co-advised by Clodis Boscarioli.
- Docs: https://jeffersonspeck.github.io/intelli3text/
- PyPI: https://pypi.org/project/intelli3text/
- Repository: https://github.com/jeffersonspeck/intelli3text
- Acquire: extract main content from the web (and read local PDF/DOCX/TXT).
- Clean: remove boilerplate, linebreak artifacts, and markup noise.
- Detect language (per paragraph): fastText LID (`lid.176.ftz`) for robust PT/EN/ES routing.
- Normalize: spaCy-based normalization pipeline for stable, comparable text.
- Export: generate an auditable PDF and structured outputs for classification pipelines in intelli3.
- Frictionless install. `pip install intelli3text` declares and enforces `fasttext>=0.9.2`. On first run, models are auto-downloaded (fastText LID and spaCy) and then cached/embedded for offline operation.
- Reproducible by default. Pinned binaries and install-time model bootstrap minimize OS/WSL/environment drift.
- Paragraph granularity. LID and normalization operate per-paragraph, improving quality on mixed-language sources.
- Auditable outputs. PDF report includes raw → cleaned → normalized views to support inspection and research traceability.
- Usage Manual
- Why this project?
- Key features
- Requirements
- Installation
- Quick start (CLI)
- CLI examples
- Python usage (API)
- Language identification (LID)
- spaCy models & normalization
- Cleaning pipeline
- PDF export
- Cache, auto-downloads & offline mode
- Architecture & Design Patterns
- Design Science Research (DSR)
- Binary compatibility (NumPy/Thinc/spaCy)
- Performance tips
- Extensibility
- Troubleshooting
- Publishing to PyPI
- Roadmap
- License
- How to cite
In research and production, common needs include:
- Ingest text from heterogeneous sources (web, PDFs, DOCX, TXT);
- Clean and normalize the content;
- Lemmatize and remove stopwords;
- Detect language accurately, including bilingual documents;
- Export results with traceability (PDF that shows normalized, cleaned, and raw text).
intelli3text is built to be plug-and-play: pip install and go — no native toolchains, no manual compiles, no painful environment setup.
- Ingestion: URL (HTML), PDF (`pdfminer.six`), DOCX (`python-docx`), TXT.
- Cleaning: Unicode fixes (`ftfy`), noise removal (`clean-text`), PDF-specific line-break & hyphenation heuristics.
- Paragraph-level LID: fastText LID (176 languages) with tolerant fallback.
- spaCy normalization: lemmatized tokens without stopwords/punctuation; PT/EN/ES.
- PDF export: summary, global normalized text, per-paragraph table, and sections for cleaned/normalized/raw text.
- Auto-download on first run: `lid.176.bin` (fastText LID) and spaCy models for PT/EN/ES (`lg` → `md` → `sm`) with offline fallback.
- CLI & Python API: use from shell or embed in code.
- Python 3.9+
- Internet only on first run (to download models). After that, it works offline.
- To avoid binary mismatches, the package pins compatible versions of `numpy`, `thinc`, and `spacy`.
```bash
pip install intelli3text
# or from a local repo:
# pip install .
```

No extra scripts. On first execution, required models are fetched to a local cache automatically.
```bash
intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --export-pdf output.pdf
```

Output:
- JSON to `stdout` with `language_global`, `cleaned`, `normalized`, and a list of `paragraphs`.
- A PDF report at `output.pdf`.
- Local PDF:

  ```bash
  intelli3text "./my_paper.pdf" --export-pdf report.pdf
  ```

- Choose spaCy model size:

  ```bash
  intelli3text "URL" --nlp-size md   # options: lg (default) | md | sm
  ```

- Select cleaners:

  ```bash
  intelli3text "URL" --cleaners ftfy,clean_text,pdf_breaks
  ```

- Save JSON to file:

  ```bash
  intelli3text "URL" --json-out result.json
  ```

- Use CLD3 as primary (if installed as extra):

  ```bash
  pip install intelli3text[cld3]
  intelli3text "URL" --lid-primary cld3 --lid-fallback none
  ```
Full CLI reference: see Docs → CLI on the website: https://jeffersonspeck.github.io/intelli3text/
```python
from intelli3text import PipelineBuilder, Intelli3Config

cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    lid_primary="fasttext",   # or "cld3" if you installed the extra
    lid_fallback=None,        # or "cld3"
    nlp_model_pref="lg",      # "lg" | "md" | "sm"
    export={"pdf": {"path": "output.pdf", "include_global_normalized": True}},
)

pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://pt.wikipedia.org/wiki/Howard_Gardner")
print(res["language_global"], len(res["paragraphs"]))
print(res["paragraphs"][0]["language"], res["paragraphs"][0]["normalized"][:200])
```

More samples (including safe-to-import examples): Docs → Examples.
- Primary: fastText LID (`lid.176.bin`), auto-downloaded on first use.
- Tolerant: if `fasttext` is unavailable, the pipeline won’t crash — it returns `"pt"` with confidence `0.0` as a safe fallback.
- Accuracy: detection is per paragraph; `language_global` is the most frequent label.
- Optional: `pycld3` via the extra:

  ```bash
  pip install intelli3text[cld3]
  # CLI: --lid-primary cld3 --lid-fallback none
  ```
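The majority rule for `language_global` can be sketched as follows (illustrative only; the function name is not part of the library’s API):

```python
from collections import Counter

def global_language(paragraph_langs):
    # The global language is the most frequent per-paragraph label;
    # "pt" mirrors the documented safe default when nothing was detected.
    if not paragraph_langs:
        return "pt"
    return Counter(paragraph_langs).most_common(1)[0][0]
```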
- Size preference: `lg` → `md` → `sm`. If the model is missing, the library tries to download it.
- Offline: falls back to `spacy.blank(<lang>)` with a `sentencizer` (no crash).
- Normalization includes:
  - tokenization;
  - dropping stopwords/punctuation/whitespace;
  - lemmatization (when the model has a lexicon);
  - joining lemmas.
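A minimal sketch of these normalization steps, assuming spaCy is installed (the function name and model string are illustrative, not the library’s `SpacyNormalizer` API):

```python
import spacy

def normalize(text: str, lang: str = "pt") -> str:
    # Try a trained model first; fall back to a blank pipeline with a
    # sentencizer when no model is installed (the documented offline behavior).
    try:
        nlp = spacy.load(f"{lang}_core_news_sm")
    except OSError:
        nlp = spacy.blank(lang)
        nlp.add_pipe("sentencizer")
    doc = nlp(text)
    lemmas = [
        tok.lemma_ or tok.text  # blank pipelines have no lemmatizer
        for tok in doc
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]
    return " ".join(lemmas)
```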
Default order (`--cleaners ftfy,clean_text,pdf_breaks`):
- `ftfy`: fixes Unicode glitches.
- `clean-text`: removes URLs/emails/phones; keeps numbers/punctuation by default.
- `pdf_breaks`: PDF heuristics (de-hyphenation; merging artificial breaks; collapsing multiple newlines).
You can customize the list/order via CLI or API.
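The `pdf_breaks` heuristics can be approximated with a few regular expressions (a sketch of the described behavior, not the shipped cleaner):

```python
import re

def pdf_breaks(text: str) -> str:
    # 1. De-hyphenate words split across line breaks: "norma-\nlization" -> "normalization".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 2. Merge artificial single line breaks inside a paragraph into spaces.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # 3. Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text
```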
The report includes:
- Summary (global language, total paragraphs),
- Global Normalized Text (optional),
- Per-paragraph table (language, confidence, normalized preview),
- Per-paragraph sections showing:
  - normalized,
  - cleaned,
  - raw.
Library: ReportLab.
- Default cache directory: `~/.cache/intelli3text/`. Override via env var: `INTELLI3TEXT_CACHE_DIR=/your/custom/path`.
- Auto-download on first use:
  - `lid.176.bin` (fastText LID);
  - spaCy models for PT/EN/ES in order `lg` → `md` → `sm`.
- Offline behavior:
  - LID returns the fallback `"pt", 0.0` if fastText is unavailable;
  - spaCy uses `blank()` (functional, but without full lexical features).
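The cache-directory resolution amounts to something like this (a hypothetical helper mirroring the documented behavior; the real `utils.py` may differ):

```python
import os
from pathlib import Path

def cache_dir() -> Path:
    # INTELLI3TEXT_CACHE_DIR overrides the default ~/.cache/intelli3text/.
    override = os.environ.get("INTELLI3TEXT_CACHE_DIR")
    path = Path(override) if override else Path.home() / ".cache" / "intelli3text"
    path.mkdir(parents=True, exist_ok=True)
    return path
```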
Applied patterns:
- Builder: `PipelineBuilder` composes extractors, cleaners, LID, normalizer, and exporters from declarative config.
- Strategy:
  - Extractors (Web/PDF/DOCX/TXT) implement `IExtractor`.
  - Cleaners implement `ICleaner`, chained via `CleanerChain`.
  - Language detectors implement a simple interface (`FastTextLID`, `CLD3LID`).
  - The normalizer implements `INormalizer` (`SpacyNormalizer` here).
  - Exporters implement `IExporter` (`PDFExporter` here).
- Factory/Registry: lazy loading of spaCy models by language/size with fallbacks.
- Facade: the CLI and `Pipeline.process()` offer a simple entry point.
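The Strategy + Chain of Responsibility combination for cleaners can be illustrated like this (method names and the sample cleaners are assumptions, not the exact `ICleaner`/`CleanerChain` API):

```python
from typing import Iterable, Protocol

class ICleaner(Protocol):
    def clean(self, text: str) -> str: ...

class CleanerChain:
    """Runs each cleaner strategy in order over the text."""
    def __init__(self, cleaners: Iterable[ICleaner]):
        self.cleaners = list(cleaners)

    def clean(self, text: str) -> str:
        for cleaner in self.cleaners:
            text = cleaner.clean(text)
        return text

class StripWhitespace:
    def clean(self, text: str) -> str:
        return text.strip()

class CollapseSpaces:
    def clean(self, text: str) -> str:
        return " ".join(text.split())
```

Usage: `CleanerChain([StripWhitespace(), CollapseSpaces()]).clean("  a   b ")` yields `"a b"`.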
Package layout (summary):

```
src/intelli3text/
  __init__.py
  __main__.py           # CLI
  config.py             # Intelli3Config (parameters)
  utils.py              # cache/download helpers
  builder.py            # PipelineBuilder (Builder)
  pipeline.py           # Pipeline (Facade)
  extractors/           # Strategy
    base.py
    web_trafilatura.py
    file_pdfminer.py
    file_docx.py
    file_text.py
  cleaners/             # Strategy + Chain of Responsibility
    base.py
    chain.py
    unicode_ftfy.py
    clean_text.py
    pdf_linebreaks.py
  lid/                  # Strategy
    base.py
    fasttext_lid.py
    # (optional) cld3_lid.py
  nlp/
    base.py
    registry.py         # Factory/Registry (spaCy models + fallback)
    spacy_normalizer.py # Strategy
  export/
    base.py
    pdf_reportlab.py    # Strategy
```
Artifact. A production-oriented NLP pipeline for ingestion, cleaning, paragraph-level language identification (LID), normalization, and PDF export, designed for reproducibility (binary pins, install-time model bootstrap) and trivial installation. This aligns with DSR’s emphasis on building useful artifacts that extend human and organizational capabilities.
Problem. Heterogeneous sources (Web/PDF/DOCX/TXT), bilingual/multilingual content, and environment friction (native deps, wheels, OS/WSL divergences) often break reproducibility and degrade text quality via boilerplate/noise. Prior work highlights the importance of robust boilerplate removal and main-content extraction for downstream NLP quality.
Design.
- Acquisition & cleaning: Web extraction via Trafilatura (main text, comments, metadata) plus jusText-style boilerplate filtering; both are well-studied choices for reliable textual corpora.
- Language ID: fastText LID model (recognizes 176 languages) with install-time download/embedding to remove runtime network dependency.
- Normalization: spaCy pipeline (industrial-strength NLP; v2+ with Bloom embeddings/CNNs) with pinned versions for deterministic behavior across environments.
- Reproducibility: strict dependency pinning and build hooks; the artifact is packaged with the LID model to guarantee availability at install time, consistent with DSR guidance on rigor and verifiability.

Demonstration. Command-line interface and Python API across Web/PDF/DOCX/TXT; LID for PT/EN/ES using fastText; auditable PDF report that shows raw, cleaned, and normalized views.
Evaluation.
- Technical robustness: empirical tests across user-site installs, WSL, and Windows; deterministic packaging validated by install-time model embedding. (Engineering claim; methodology aligned with DSR evaluation guidance.)
- Quality: LID confidence/coverage supported by the fastText 176-language models; cleaning quality supported by established extractors (Trafilatura/jusText).
Contributions.
- Engineering: Builder/Strategy/Factory patterns to decouple extractors, cleaners, LID, and normalizers for reuse. (Standard software-engineering patterns applied to the artifact.)
- DSR grounding: follows Hevner et al.’s design-science guidelines (relevance, rigor, design evaluation) and Peffers et al.’s DSRM (problem identification → artifact design → evaluation → communication).
Notes on verification:
- DSR foundations are confirmed via Hevner et al. (MISQ, 2004) and the DSRM of Peffers et al. (2007).
- The Trafilatura demo paper (ACL 2021) and its docs confirm main-content extraction with comments/metadata.
- jusText’s origins and efficacy for boilerplate removal are documented in Pomikálek’s thesis.
- The fastText LID page confirms the 176-language models (`lid.176.*`).
- spaCy v2’s architecture (Bloom embeddings/CNNs) is documented by Honnibal & Montani.
To avoid the classic `numpy.dtype size changed` error:

- We pin compatible versions in `pyproject.toml`.
- If you already had other global packages and hit this error:

  ```bash
  pip uninstall -y spacy thinc numpy
  pip cache purge
  pip install --user --no-cache-dir "numpy==1.26.4" "thinc==8.2.4" "spacy==3.7.4"
  pip install --user --no-cache-dir intelli3text   # or -e . from the local repo
  ```

Tip: always use the same Python that runs `intelli3text` (check `head -1 ~/.local/bin/intelli3text`).
- Paragraph length: controlled by `paragraph_min_chars` (default 30) and `lid_min_chars` (default 60).
- LID sample cap: very long texts are truncated (~2k chars) to speed up detection without hurting accuracy much.
- spaCy model size: `sm` is lighter; `lg` gives better quality (default).
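Putting those parameters together, the paragraph filtering and LID sample cap behave roughly like this (`paragraph_min_chars` and `lid_min_chars` come from the docs; `lid_sample_cap` is a hypothetical name for the ~2k truncation):

```python
def paragraphs_for_lid(text, paragraph_min_chars=30, lid_min_chars=60,
                       lid_sample_cap=2000):
    # Keep paragraphs long enough to matter, then only run LID on those
    # that clear lid_min_chars, truncated to the sample cap for speed.
    paras = [p.strip() for p in text.split("\n\n")]
    paras = [p for p in paras if len(p) >= paragraph_min_chars]
    return [p[:lid_sample_cap] for p in paras if len(p) >= lid_min_chars]
```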
- New sources: implement `IExtractor` and register it in `PipelineBuilder`.
- New cleaners: implement `ICleaner` and map it in `NAME2CLEANER`.
- New LIDs: implement the interface under `lid/base.py`.
- New exporters: implement `IExporter` (e.g., JSONL/CSV/HTML) and expose the option in CLI/Builder.
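As an example of the exporter extension point, a JSONL exporter might look like this (a sketch following the `IExporter` idea; the actual interface in `export/base.py` may differ):

```python
import json

class JsonlExporter:
    """Writes one JSON object per paragraph to a .jsonl file."""
    def __init__(self, path: str):
        self.path = path

    def export(self, result: dict) -> None:
        with open(self.path, "w", encoding="utf-8") as f:
            for para in result.get("paragraphs", []):
                f.write(json.dumps(para, ensure_ascii=False) + "\n")
```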
- Trafilatura `unidecode` warning: already handled — we depend on `Unidecode`.
- No Internet on first run:
  - LID: fallback `"pt", 0.0`;
  - spaCy: `spacy.blank(<lang>)`;
  - later, with Internet, run again to fetch the full models.
- `ModuleNotFoundError: fasttext`:
  - we depend on `fasttext-wheel` (prebuilt wheels);
  - reinstall with `pip install fasttext-wheel`.
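The tolerant fallback described above boils down to a guarded import (function name and shape are illustrative, not the library’s `FastTextLID` API):

```python
def lid_predict(text, model=None):
    # If fasttext (or its model) is unavailable, return the documented
    # default ("pt", 0.0) instead of crashing the pipeline.
    try:
        import fasttext  # noqa: F401  (provided by fasttext-wheel)
    except ImportError:
        return "pt", 0.0
    if model is None:
        return "pt", 0.0
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", ""), float(probs[0])
```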
More tips and parameter-by-parameter guidance: https://jeffersonspeck.github.io/intelli3text/
- Exporters: HTML/Markdown with paragraph navigation.
- Quality metrics (lexical density, diversity, etc.).
- More languages via custom spaCy models.
- Optional normalization using Stanza.
MIT — you’re free to use, modify and distribute.
Note: the original upstream licenses of third-party models and libraries still apply.
Speck, J. (2025). intelli3text: ingestion, cleaning, paragraph-level LID and spaCy normalization with PDF export. GitHub: https://github.com/jeffersonspeck/intelli3text