@VaaishnaviS
Fixes #4676

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in a uniquely-named feature branch and have no merge conflicts 📁
  • Updated documentation pages (if applicable)
  • Updated CHANGELOG.rst (if applicable)

I have implemented a whitelist-based "Safe Gate" for gibberish detection.

The Problem: The normalize() function was stripping non-alphanumeric characters such as ©, the parentheses in (c), and @. As a result, the Markov chain model saw only fragments of legal strings and assigned them high "gibberish" scores.

The Fix: Added a COPYRIGHT_INDICATORS list that is checked before normalization occurs. If a match is found, the string is immediately flagged as "not gibberish," bypassing the math model entirely. For legal text this is more robust, because the result no longer depends on how the model scores the stripped string, and faster, because the scoring step is skipped.
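
For reviewers, here is a minimal sketch of the gate described above. It is not the code in this PR: the entries in COPYRIGHT_INDICATORS, the simplified normalize() stand-in, and the `has_copyright_indicator`, `is_gibberish`, `score_func` and `threshold` names are assumptions made for illustration only.

```python
import re

# Assumed entries: the PR names this list but its contents are not shown here.
COPYRIGHT_INDICATORS = ("©", "(c)", "@", "copyright")


def normalize(text):
    """Simplified stand-in: lowercase, then drop everything except a-z, 0-9 and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower())


def has_copyright_indicator(text):
    """Check the raw, un-normalized text so that ©, (c) and @ are still present."""
    lowered = text.lower()
    return any(indicator in lowered for indicator in COPYRIGHT_INDICATORS)


def is_gibberish(text, score_func, threshold=0.5):
    """
    Hypothetical entry point. `score_func` stands in for the Markov chain
    scorer and is assumed to return a higher value for more gibberish-like text.
    """
    # Safe gate: runs before normalization and skips the model on a match.
    if has_copyright_indicator(text):
        return False
    return score_func(normalize(text)) > threshold
```

With this ordering, a string such as `"(c) Foo@bar"` never reaches the scorer, while the stand-in normalize() alone would reduce it to `"c foobar"`.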

Signed-off-by: VaaishnaviS [email protected]

@VaaishnaviS force-pushed the enhance-gibberish-detection-indicators branch from 64a53cc to 2ebca1f on January 14, 2026 08:00
Development

Successfully merging this pull request may close these issues.

Copyright detection regression after implementing gibberish detection
