
intelli3text

intelli3text is the text-processing backbone of the broader intelli3 project (a classification-oriented research/engineering effort).
It ingests text from the web and from PDF/DOCX/TXT files, performs cleaning and multilingual normalization (PT/EN/ES), applies paragraph-level language identification (LID), and produces an auditable PDF report (raw → cleaned → normalized), ready for downstream classification tasks.

This work is part of my Master’s research, advised by Sidgley Camargo de Andrade (advisor) and Clodis Boscarioli (co-advisor).

What this module does (in the intelli3 ecosystem)

  • Acquire: extract main content from the web (and read local PDF/DOCX/TXT).
  • Clean: remove boilerplate, linebreak artifacts, and markup noise.
  • Detect language (per paragraph): fastText LID (lid.176.bin) for robust PT/EN/ES routing.
  • Normalize: spaCy-based normalization pipeline for stable, comparable text.
  • Export: generate an auditable PDF and structured outputs for classification pipelines in intelli3.

How it works (design choices)

  • Frictionless install. pip install intelli3text declares and enforces fasttext>=0.9.2.
    On first run, models are auto-downloaded (fastText LID and spaCy) and then cached/embedded for offline operation.
  • Reproducible by default. Pinned binaries and install-time model bootstrap minimize OS/WSL/environment drift.
  • Paragraph granularity. LID and normalization operate per-paragraph, improving quality on mixed-language sources.
  • Auditable outputs. PDF report includes raw → cleaned → normalized views to support inspection and research traceability.

Why this project?

In research and production, common needs include:

  1. Ingest text from heterogeneous sources (web, PDFs, DOCX, TXT);
  2. Clean and normalize the content;
  3. Lemmatize and remove stopwords;
  4. Detect language accurately, including bilingual documents;
  5. Export results with traceability (PDF that shows normalized, cleaned, and raw text).

intelli3text is built to be plug-and-play: pip install and go — no native toolchains, no manual compiles, no painful environment setup.


Key features

  • Ingestion: URL (HTML), PDF (pdfminer.six), DOCX (python-docx), TXT.
  • Cleaning: Unicode fixes (ftfy), noise removal (clean-text), PDF-specific line-break & hyphenation heuristics.
  • Paragraph-level LID: fastText LID (176 languages) with tolerant fallback.
  • spaCy normalization: lemmatized tokens without stopwords/punctuation; PT/EN/ES.
  • PDF export: summary, global normalized text, per-paragraph table and sections for cleaned/normalized/raw text.
  • Auto-download on first run:
    • lid.176.bin (fastText LID);
    • spaCy models for PT/EN/ES (lg→md→sm) with offline fallback.
  • CLI & Python API: use from shell or embed in code.

Requirements

  • Python 3.9+
  • Internet only on first run (to download models). After that, it works offline.
  • To avoid binary mismatches, the package pins compatible versions of numpy, thinc, and spacy.

Installation

pip install intelli3text
# or from a local repo:
# pip install .

No extra scripts. On first execution, required models are fetched to a local cache automatically.


Quick start (CLI)

intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --export-pdf output.pdf

Output:

  • JSON to stdout with language_global, cleaned, normalized, and a list of paragraphs.
  • A PDF report at output.pdf.
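The JSON structure can be sketched as follows (an illustrative Python dict built from the field names above; the per-paragraph `raw`/`cleaned` keys are assumptions based on the report's raw → cleaned → normalized views, so treat the exact schema as approximate):

```python
# Illustrative shape of the JSON written to stdout (not the exact schema).
result = {
    "language_global": "pt",            # most frequent paragraph-level language
    "cleaned": "...full cleaned text...",
    "normalized": "...full normalized text...",
    "paragraphs": [
        {
            "language": "pt",           # per-paragraph LID label
            "confidence": 0.99,         # LID confidence score
            "raw": "...original paragraph...",           # assumed key
            "cleaned": "...after the cleaner chain...",  # assumed key
            "normalized": "...lemmas joined...",
        },
    ],
}

print(result["language_global"], len(result["paragraphs"]))  # pt 1
```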

CLI examples

  • Local PDF:

    intelli3text "./my_paper.pdf" --export-pdf report.pdf
  • Choose spaCy model size:

    intelli3text "URL" --nlp-size md
    # options: lg (default) | md | sm
  • Select cleaners:

    intelli3text "URL" --cleaners ftfy,clean_text,pdf_breaks
  • Save JSON to file:

    intelli3text "URL" --json-out result.json
  • Use CLD3 as primary (if installed as extra):

    pip install intelli3text[cld3]
    intelli3text "URL" --lid-primary cld3 --lid-fallback none

Full CLI reference: see Docs → CLI on the website: https://jeffersonspeck.github.io/intelli3text/


Python usage (API)

from intelli3text import PipelineBuilder, Intelli3Config

cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    lid_primary="fasttext",         # or "cld3" if you installed the extra
    lid_fallback=None,              # or "cld3"
    nlp_model_pref="lg",            # "lg" | "md" | "sm"
    export={"pdf": {"path": "output.pdf", "include_global_normalized": True}},
)

pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://pt.wikipedia.org/wiki/Howard_Gardner")

print(res["language_global"], len(res["paragraphs"]))
print(res["paragraphs"][0]["language"], res["paragraphs"][0]["normalized"][:200])

More samples (including safe-to-import examples): Docs → Examples.


Language identification (LID)

  • Primary: fastText LID (lid.176.bin) auto-downloaded on first use.

  • Tolerant: if fasttext is unavailable, the pipeline won’t crash — it returns "pt" with confidence 0.0 as a safe fallback.

  • Accuracy: detection is per paragraph; language_global is the most frequent.

  • Optional: pycld3 via extra:

    pip install intelli3text[cld3]
    # CLI: --lid-primary cld3 --lid-fallback none
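Since language_global is defined as the most frequent per-paragraph label, the aggregation can be sketched in a few lines (illustrative only, not the library's actual implementation):

```python
from collections import Counter

# Hypothetical per-paragraph LID labels for a mostly-Portuguese document.
paragraph_langs = ["pt", "pt", "en", "pt", "es"]

# language_global is simply the most frequent paragraph-level label.
language_global = Counter(paragraph_langs).most_common(1)[0][0]
print(language_global)  # pt
```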

spaCy models & normalization

  • Size preference: lg → md → sm.

  • If the model is missing, the library tries to download it.

  • Offline: falls back to spacy.blank(<lang>) with a sentencizer (no crash).

  • Normalization includes:

    • tokenization;
    • dropping stopwords/punctuation/whitespace;
    • lemmatization (when the model has a lexicon);
    • joining lemmas.
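The four steps above can be sketched in plain Python (a toy stand-in, not the actual spaCy pipeline: the hard-coded stopword set and lemma table stand in for the loaded model's vocabulary and lemmatizer):

```python
import string

# Toy stand-ins for spaCy's stopword list and lemma lookup.
STOPWORDS = {"the", "a", "of", "and"}
LEMMAS = {"studies": "study", "languages": "language"}

def normalize(text: str) -> str:
    # 1) tokenization (spaCy tokenizes properly; split() is a stand-in)
    tokens = text.split()
    out = []
    for tok in tokens:
        word = tok.strip(string.punctuation).lower()
        # 2) drop stopwords / punctuation / whitespace
        if not word or word in STOPWORDS:
            continue
        # 3) lemmatize when a lemma is known
        out.append(LEMMAS.get(word, word))
    # 4) join lemmas into the normalized string
    return " ".join(out)

print(normalize("The studies of languages."))  # study language
```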

Cleaning pipeline

Default order (--cleaners ftfy,clean_text,pdf_breaks):

  1. FTFY: fixes Unicode glitches.
  2. clean-text: removes URLs/emails/phones; keeps numbers/punctuation by default.
  3. pdf_breaks: PDF heuristics (de-hyphenation; merge artificial breaks; collapse multiple newlines).

You can customize the list/order via CLI or API.
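Conceptually, the chain is just function composition in a fixed order; a minimal sketch with toy stand-ins for the real ftfy / clean-text / pdf_breaks steps:

```python
from typing import Callable, List

Cleaner = Callable[[str], str]

def fix_unicode(text: str) -> str:
    # toy stand-in for the ftfy step (repairs one mojibake sequence)
    return text.replace("Ã©", "é")

def collapse_newlines(text: str) -> str:
    # toy stand-in for the pdf_breaks step (collapse multiple newlines)
    while "\n\n\n" in text:
        text = text.replace("\n\n\n", "\n\n")
    return text

def run_chain(text: str, cleaners: List[Cleaner]) -> str:
    # each cleaner receives the previous cleaner's output, in list order
    for clean in cleaners:
        text = clean(text)
    return text

print(run_chain("cafÃ©\n\n\n\nfim", [fix_unicode, collapse_newlines]))
```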


PDF export

The report includes:

  • Summary (global language, total paragraphs),

  • Global Normalized Text (optional),

  • Per-paragraph table (language, confidence, normalized preview),

  • Per-paragraph sections showing:

    • normalized,
    • cleaned,
    • raw.

Library: ReportLab.


Cache, auto-downloads & offline mode

  • Default cache directory: ~/.cache/intelli3text/ Override via env var: INTELLI3TEXT_CACHE_DIR=/your/custom/path

  • Auto-download on first use:

    • lid.176.bin (fastText LID),
    • spaCy models PT/EN/ES in order lg→md→sm.
  • Offline behavior:

    • LID returns fallback "pt", 0.0 if fastText is unavailable;
    • spaCy uses blank() (functional, but without full lexical features).
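Resolving the cache directory with the env-var override can be sketched as follows (assumed logic; the library's actual resolution may differ in detail):

```python
import os
from pathlib import Path

def cache_dir() -> Path:
    # INTELLI3TEXT_CACHE_DIR, when set, overrides ~/.cache/intelli3text/
    override = os.environ.get("INTELLI3TEXT_CACHE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "intelli3text"

os.environ["INTELLI3TEXT_CACHE_DIR"] = "/tmp/i3t-cache"
print(cache_dir())  # /tmp/i3t-cache
```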

Architecture & Design Patterns

Applied patterns:

  • Builder: PipelineBuilder composes extractors, cleaners, LID, normalizer, and exporters from declarative config.

  • Strategy:

    • Extractors (Web/PDF/DOCX/TXT) implement IExtractor.
    • Cleaners implement ICleaner, chained via CleanerChain.
    • Language Detectors implement a simple interface (FastTextLID, CLD3LID).
    • Normalizer implements INormalizer (SpacyNormalizer here).
    • Exporters implement IExporter (PDFExporter here).
  • Factory/Registry: lazy loading of spaCy models by lang/size with fallbacks.

  • Facade: CLI and Pipeline.process() offer a simple entry point.
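The Strategy interfaces above can be sketched with typing.Protocol (the interface name ICleaner comes from this README; the clean() method name is an assumption):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ICleaner(Protocol):
    """Strategy interface: every cleaner exposes one method (name assumed)."""
    def clean(self, text: str) -> str: ...

class LowercaseCleaner:
    # Toy concrete strategy; real ones wrap ftfy, clean-text, PDF heuristics.
    def clean(self, text: str) -> str:
        return text.lower()

def run(cleaner: ICleaner, text: str) -> str:
    # the pipeline depends only on the interface, never the concrete class
    return cleaner.clean(text)

print(run(LowercaseCleaner(), "Olá Mundo"))  # olá mundo
```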

Package layout (summary)

src/intelli3text/
  __init__.py
  __main__.py            # CLI
  config.py              # Intelli3Config (parameters)
  utils.py               # cache/download helpers
  builder.py             # PipelineBuilder (Builder)
  pipeline.py            # Pipeline (Facade)

  extractors/            # Strategy
    base.py
    web_trafilatura.py
    file_pdfminer.py
    file_docx.py
    file_text.py

  cleaners/              # Strategy + Chain of Responsibility
    base.py
    chain.py
    unicode_ftfy.py
    clean_text.py
    pdf_linebreaks.py

  lid/                   # Strategy
    base.py
    fasttext_lid.py
    # (optional) cld3_lid.py

  nlp/
    base.py
    registry.py          # Factory/Registry (spaCy models + fallback)
    spacy_normalizer.py  # Strategy

  export/
    base.py
    pdf_reportlab.py     # Strategy

Design Science Research (DSR)

Artifact. A production-oriented NLP pipeline for ingestion, cleaning, paragraph-level language identification (LID), normalization, and PDF export, designed for reproducibility (binary pins, install-time model bootstrap) and trivial installation. This aligns with DSR’s emphasis on building useful artifacts that extend human and organizational capabilities.

Problem. Heterogeneous sources (Web/PDF/DOCX/TXT), bilingual/multilingual content, and environment friction (native deps, wheels, OS/WSL divergences) often break reproducibility and degrade text quality via boilerplate/noise. Prior work highlights the importance of robust boilerplate removal and main-content extraction for downstream NLP quality.

Design.

  • Acquisition & cleaning: Web extraction via Trafilatura (main text, comments, metadata) plus jusText-style boilerplate filtering; both are well-studied choices for reliable textual corpora.
  • Language ID: fastText LID model (recognizes 176 languages) with install-time download/embedding to remove runtime network dependency.
  • Normalization: spaCy pipeline (industrial-strength NLP; v2+ with Bloom embeddings/CNNs) with pinned versions for deterministic behavior across environments.
  • Reproducibility: strict dependency pinning and build hooks; artifact packaged with the LID model to guarantee availability at install time, consistent with DSR guidance on rigor and verifiability.

Demonstration. Command-line interface and Python API across Web/PDF/DOCX/TXT; LID for PT/EN/ES using fastText; auditable PDF report that shows raw, cleaned, and normalized views.

Evaluation.

  • Technical robustness: empirical tests across user-site installs, WSL, and Windows; deterministic packaging validated by install-time model embedding. (Engineering claim; methodology aligned with DSR evaluation guidance.)
  • Quality: LID confidence/coverage supported by the fastText 176-language models; cleaning quality supported by established extractors (Trafilatura/jusText).

Contributions.

  • Engineering: Builder/Strategy/Factory patterns to decouple extractors, cleaners, LID, and normalizers for reuse. (Standard software-engineering patterns applied to the artifact.)
  • DSR grounding: Follows Hevner et al.’s design-science guidelines (relevance, rigor, design evaluation) and Peffers et al.’s DSRM (problem identification → artifact design → evaluation → communication).

Notes on verification:

  • DSR foundations are confirmed via MISQ (Hevner et al., 2004) and the DSRM (Peffers et al., 2007).
  • The Trafilatura demo paper (ACL 2021) and its documentation confirm main-content extraction with comments/metadata.
  • jusText’s origins and efficacy for boilerplate removal are documented in Pomikálek’s thesis.
  • The fastText LID page confirms the 176-language models (lid.176.*).
  • The spaCy v2 architecture (Bloom embeddings/CNNs) is documented by Honnibal & Montani.

Binary compatibility (NumPy/Thinc/spaCy)

To avoid the classic numpy.dtype size changed error:

  • We pin compatible versions in pyproject.toml.

  • If you already had other global packages and hit this error:

    1. pip uninstall -y spacy thinc numpy
    2. pip cache purge
    3. pip install --user --no-cache-dir "numpy==1.26.4" "thinc==8.2.4" "spacy==3.7.4"
    4. pip install --user --no-cache-dir intelli3text (or -e . from the local repo)

Tip: always use the same Python that runs intelli3text (check head -1 ~/.local/bin/intelli3text).


Performance tips

  • Paragraph length: controlled by paragraph_min_chars (default 30) and lid_min_chars (default 60).
  • LID sample cap: very long texts are truncated to roughly 2,000 characters before detection, speeding things up with little loss of accuracy.
  • spaCy model size: sm is lighter; lg gives better quality (default).
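Putting two of these knobs together as code (the None-means-skip behavior for short paragraphs is an assumption on my part, not documented above):

```python
from typing import Optional

LID_MIN_CHARS = 60      # below this, skip per-paragraph LID (default above)
LID_SAMPLE_CAP = 2000   # ~2k chars fed to the detector on long paragraphs

def lid_input(paragraph: str) -> Optional[str]:
    # Too short for reliable detection: return None so the caller can
    # fall back to the document-level language (assumed behavior).
    if len(paragraph) < LID_MIN_CHARS:
        return None
    # Long texts are truncated before detection to speed things up.
    return paragraph[:LID_SAMPLE_CAP]

print(lid_input("short"), len(lid_input("x" * 10_000) or ""))
```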

Extensibility

  • New sources: implement IExtractor and register in PipelineBuilder.
  • New cleaners: implement ICleaner and map it in NAME2CLEANER.
  • New LIDs: implement the interface under lid/base.py.
  • Exporters: implement IExporter (e.g., JSONL/CSV/HTML), expose option in CLI/Builder.
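As an illustration of the exporter extension point, a JSONL exporter could look roughly like this (IExporter is the interface named above, but the export() method name and the result-dict layout are assumptions):

```python
import json
import tempfile
from pathlib import Path

class JsonlExporter:
    """Hypothetical IExporter: writes one JSON object per paragraph."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def export(self, result: dict) -> None:
        # method name assumed; the real interface lives in export/base.py
        with self.path.open("w", encoding="utf-8") as fh:
            for para in result["paragraphs"]:
                fh.write(json.dumps(para, ensure_ascii=False) + "\n")

out = Path(tempfile.gettempdir()) / "i3t_demo.jsonl"
JsonlExporter(out).export(
    {"paragraphs": [{"language": "pt", "normalized": "texto exemplo"}]}
)
print(out.read_text(encoding="utf-8").strip())
```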

Troubleshooting

  • Trafilatura ‘unidecode’ warning: already handled — we depend on Unidecode.

  • No Internet on first run:

    • LID: fallback "pt", 0.0.
    • spaCy: spacy.blank(<lang>).
    • Later, with Internet, run again to fetch full models.
  • ModuleNotFoundError: fasttext:

    • We depend on fasttext-wheel (prebuilt wheels).
    • Reinstall: pip install fasttext-wheel.

More tips and parameter-by-parameter guidance: https://jeffersonspeck.github.io/intelli3text/


Roadmap

  • Exporters: HTML/Markdown with paragraph navigation.
  • Quality metrics (lexical density, diversity, etc.).
  • More languages via custom spaCy models.
  • Optional normalization using Stanza.

License

MIT — you’re free to use, modify and distribute.

Note: the original upstream licenses of third-party models and libraries still apply.


How to cite

Speck, J. (2025). intelli3text: ingestion, cleaning, paragraph-level LID and spaCy normalization with PDF export. GitHub: https://github.com/jeffersonspeck/intelli3text
