Skip to content

make readability kotlin#1964

Open
Tlaster wants to merge 4 commits intomasterfrom
feature/native-readability
Open

make readability kotlin#1964
Tlaster wants to merge 4 commits intomasterfrom
feature/native-readability

Conversation

@Tlaster
Copy link
Copy Markdown
Contributor

@Tlaster Tlaster commented Apr 4, 2026

No description provided.

@Tlaster Tlaster marked this pull request as ready for review April 6, 2026 11:49
@Tlaster Tlaster requested a review from Copilot April 6, 2026 11:49
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds/updates the Kotlin Multiplatform port of Mozilla Readability by introducing core public data models/options, a readerability pre-check, and expanding the conformance fixture set (HTML + expected metadata) used by tests.

Changes:

  • Added Kotlin implementations for regex constants, article models/options, and isProbablyReaderable heuristic.
  • Added extensive test fixtures (source.html, expected.html, expected-metadata.json) across many real-world pages.
  • Added KMP + Android build configuration for the readability module.

Reviewed changes

Copilot reviewed 94 out of 413 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
readability/src/commonTest/resources/test-pages/firefox-nightly-blog/expected.html Adds expected readability output fixture for Firefox Nightly blog page.
readability/src/commonTest/resources/test-pages/firefox-nightly-blog/expected-metadata.json Adds expected extracted metadata fixture for Firefox Nightly blog page.
readability/src/commonTest/resources/test-pages/engadget/expected.html Adds expected readability output fixture for Engadget page.
readability/src/commonTest/resources/test-pages/engadget/expected-metadata.json Adds expected extracted metadata fixture for Engadget page.
readability/src/commonTest/resources/test-pages/embedded-videos/source.html Adds source HTML fixture for embedded video handling.
readability/src/commonTest/resources/test-pages/embedded-videos/expected.html Adds expected readability output fixture for embedded video handling.
readability/src/commonTest/resources/test-pages/embedded-videos/expected-metadata.json Adds expected extracted metadata fixture for embedded video handling.
readability/src/commonTest/resources/test-pages/ehow-2/expected.html Adds expected readability output fixture for eHow page (variant 2).
readability/src/commonTest/resources/test-pages/ehow-2/expected-metadata.json Adds expected extracted metadata fixture for eHow page (variant 2).
readability/src/commonTest/resources/test-pages/ehow-1/expected.html Adds expected readability output fixture for eHow page (variant 1).
readability/src/commonTest/resources/test-pages/ehow-1/expected-metadata.json Adds expected extracted metadata fixture for eHow page (variant 1).
readability/src/commonTest/resources/test-pages/ebb-org/expected.html Adds expected readability output fixture for ebb.org page.
readability/src/commonTest/resources/test-pages/ebb-org/expected-metadata.json Adds expected extracted metadata fixture for ebb.org page.
readability/src/commonTest/resources/test-pages/dropbox-blog/expected-metadata.json Adds expected extracted metadata fixture for Dropbox blog page.
readability/src/commonTest/resources/test-pages/dev418/source.html Adds source HTML fixture for dev418 image/figure list cases.
readability/src/commonTest/resources/test-pages/dev418/expected.html Adds expected readability output fixture for dev418 image/figure list cases.
readability/src/commonTest/resources/test-pages/dev418/expected-metadata.json Adds expected extracted metadata fixture for dev418.
readability/src/commonTest/resources/test-pages/data-url-image/expected-metadata.json Adds expected extracted metadata fixture for data URL image page.
readability/src/commonTest/resources/test-pages/daringfireball-1/source.html Adds source HTML fixture for Daring Fireball page.
readability/src/commonTest/resources/test-pages/daringfireball-1/expected.html Adds expected readability output fixture for Daring Fireball page.
readability/src/commonTest/resources/test-pages/daringfireball-1/expected-metadata.json Adds expected extracted metadata fixture for Daring Fireball page.
readability/src/commonTest/resources/test-pages/comment-inside-script-parsing/source.html Adds source HTML fixture for parsing comments inside <script>.
readability/src/commonTest/resources/test-pages/comment-inside-script-parsing/expected.html Adds expected readability output for comment-inside-script parsing case.
readability/src/commonTest/resources/test-pages/comment-inside-script-parsing/expected-metadata.json Adds expected extracted metadata for comment-inside-script parsing case.
readability/src/commonTest/resources/test-pages/cnn/expected.html Adds expected readability output fixture for CNN page.
readability/src/commonTest/resources/test-pages/cnn/expected-metadata.json Adds expected extracted metadata fixture for CNN page.
readability/src/commonTest/resources/test-pages/cnet/expected.html Adds expected readability output fixture for CNET page.
readability/src/commonTest/resources/test-pages/cnet/expected-metadata.json Adds expected extracted metadata fixture for CNET page.
readability/src/commonTest/resources/test-pages/cnet-svg-classes/expected-metadata.json Adds expected extracted metadata fixture for CNET SVG-classes page.
readability/src/commonTest/resources/test-pages/clean-links/expected-metadata.json Adds expected extracted metadata fixture for clean-links case.
readability/src/commonTest/resources/test-pages/citylab-1/expected.html Adds expected readability output fixture for CityLab page.
readability/src/commonTest/resources/test-pages/citylab-1/expected-metadata.json Adds expected extracted metadata fixture for CityLab page.
readability/src/commonTest/resources/test-pages/buzzfeed-1/expected.html Adds expected readability output fixture for BuzzFeed page.
readability/src/commonTest/resources/test-pages/buzzfeed-1/expected-metadata.json Adds expected extracted metadata fixture for BuzzFeed page.
readability/src/commonTest/resources/test-pages/bug-1255978/expected.html Adds expected readability output fixture for regression case bug-1255978.
readability/src/commonTest/resources/test-pages/bug-1255978/expected-metadata.json Adds expected extracted metadata fixture for regression case bug-1255978.
readability/src/commonTest/resources/test-pages/breitbart/expected.html Adds expected readability output fixture for Breitbart page.
readability/src/commonTest/resources/test-pages/breitbart/expected-metadata.json Adds expected extracted metadata fixture for Breitbart page.
readability/src/commonTest/resources/test-pages/blogger/expected.html Adds expected readability output fixture for Blogger page.
readability/src/commonTest/resources/test-pages/blogger/expected-metadata.json Adds expected extracted metadata fixture for Blogger page.
readability/src/commonTest/resources/test-pages/bbc-1/expected.html Adds expected readability output fixture for BBC page.
readability/src/commonTest/resources/test-pages/bbc-1/expected-metadata.json Adds expected extracted metadata fixture for BBC page.
readability/src/commonTest/resources/test-pages/basic-tags-cleaning/source.html Adds source HTML fixture for basic tag cleaning.
readability/src/commonTest/resources/test-pages/basic-tags-cleaning/expected.html Adds expected readability output fixture for basic tag cleaning.
readability/src/commonTest/resources/test-pages/basic-tags-cleaning/expected-metadata.json Adds expected extracted metadata fixture for basic tag cleaning.
readability/src/commonTest/resources/test-pages/base-url/source.html Adds source HTML fixture for base URL normalization.
readability/src/commonTest/resources/test-pages/base-url/expected.html Adds expected readability output fixture for base URL normalization.
readability/src/commonTest/resources/test-pages/base-url/expected-metadata.json Adds expected extracted metadata fixture for base URL normalization.
readability/src/commonTest/resources/test-pages/base-url-base-element/source.html Adds source HTML fixture for <base> URL handling.
readability/src/commonTest/resources/test-pages/base-url-base-element/expected.html Adds expected readability output fixture for <base> URL handling.
readability/src/commonTest/resources/test-pages/base-url-base-element/expected-metadata.json Adds expected extracted metadata fixture for <base> URL handling.
readability/src/commonTest/resources/test-pages/base-url-base-element-relative/source.html Adds source HTML fixture for relative <base> handling.
readability/src/commonTest/resources/test-pages/base-url-base-element-relative/expected.html Adds expected readability output fixture for relative <base> handling.
readability/src/commonTest/resources/test-pages/base-url-base-element-relative/expected-metadata.json Adds expected extracted metadata fixture for relative <base> handling.
readability/src/commonTest/resources/test-pages/article-author-tag/expected.html Adds expected readability output fixture for article author tag handling.
readability/src/commonTest/resources/test-pages/article-author-tag/expected-metadata.json Adds expected extracted metadata fixture for article author tag handling.
readability/src/commonTest/resources/test-pages/ars-1/expected.html Adds expected readability output fixture for Ars Technica page.
readability/src/commonTest/resources/test-pages/ars-1/expected-metadata.json Adds expected extracted metadata fixture for Ars Technica page.
readability/src/commonTest/resources/test-pages/archive-of-our-own/expected.html Adds expected readability output fixture for AO3 page.
readability/src/commonTest/resources/test-pages/archive-of-our-own/expected-metadata.json Adds expected extracted metadata fixture for AO3 page.
readability/src/commonTest/resources/test-pages/aktualne/expected.html Adds expected readability output fixture for aktualne.cz page.
readability/src/commonTest/resources/test-pages/aktualne/expected-metadata.json Adds expected extracted metadata fixture for aktualne.cz page.
readability/src/commonTest/resources/test-pages/aclu/expected.html Adds expected readability output fixture for ACLU page.
readability/src/commonTest/resources/test-pages/aclu/expected-metadata.json Adds expected extracted metadata fixture for ACLU page.
readability/src/commonTest/resources/test-pages/005-unescape-html-entities/source.html Adds source HTML fixture for HTML entity unescape edge cases.
readability/src/commonTest/resources/test-pages/005-unescape-html-entities/expected.html Adds expected readability output fixture for HTML entity unescape edge cases.
readability/src/commonTest/resources/test-pages/005-unescape-html-entities/expected-metadata.json Adds expected extracted metadata fixture for HTML entity unescape edge cases.
readability/src/commonTest/resources/test-pages/004-metadata-space-separated-properties/source.html Adds source HTML fixture for space-separated meta properties parsing.
readability/src/commonTest/resources/test-pages/004-metadata-space-separated-properties/expected.html Adds expected readability output fixture for space-separated meta properties parsing.
readability/src/commonTest/resources/test-pages/004-metadata-space-separated-properties/expected-metadata.json Adds expected extracted metadata fixture for space-separated meta properties parsing.
readability/src/commonTest/resources/test-pages/003-metadata-preferred/source.html Adds source HTML fixture for preferred metadata selection precedence.
readability/src/commonTest/resources/test-pages/003-metadata-preferred/expected.html Adds expected readability output fixture for preferred metadata selection precedence.
readability/src/commonTest/resources/test-pages/003-metadata-preferred/expected-metadata.json Adds expected extracted metadata fixture for preferred metadata selection precedence.
readability/src/commonTest/resources/test-pages/002/expected-metadata.json Adds expected extracted metadata fixture for test page 002.
readability/src/commonTest/resources/test-pages/001/source.html Adds source HTML fixture for test page 001.
readability/src/commonTest/resources/test-pages/001/expected.html Adds expected readability output fixture for test page 001.
readability/src/commonTest/resources/test-pages/001/expected-metadata.json Adds expected extracted metadata fixture for test page 001.
readability/src/commonMain/kotlin/dev/dimension/flare/readability/RegExps.kt Introduces centralized regex definitions used by the Readability algorithm.
readability/src/commonMain/kotlin/dev/dimension/flare/readability/ReadabilityOptions.kt Adds public options model for configuring Readability parsing behavior.
readability/src/commonMain/kotlin/dev/dimension/flare/readability/IsProbablyReaderable.kt Adds readerability heuristic + visibility check helpers.
readability/src/commonMain/kotlin/dev/dimension/flare/readability/ArticleMetadata.kt Adds internal metadata holder used during parsing.
readability/src/commonMain/kotlin/dev/dimension/flare/readability/Article.kt Adds public article result data model.
readability/build.gradle.kts Configures the readability module as a Kotlin Multiplatform library with dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +7 to +16
kotlin {
jvmToolchain(libs.versions.java.get().toInt())
explicitApi()
applyDefaultHierarchyTemplate()
androidLibrary {
compileSdk = libs.versions.compileSdk.get().toInt()
namespace = "dev.dimension.flare.readability"
minSdk = libs.versions.minSdk.get().toInt()
}
jvm()
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Kotlin Multiplatform DSL here appears to be missing an Android target declaration (androidTarget()), and compileSdk/namespace/minSdk are typically configured in the Android Gradle Plugin android {} block (not under kotlin {}). As written, this is likely to fail Gradle configuration or produce no Android target; move Android SDK/namespace config to android {} and add androidTarget() (or the correct Android target DSL your Kotlin version supports).

Copilot uses AI. Check for mistakes.

// See: https://schema.org/Article
val jsonLdArticleTypes =
Regex("""^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$""")
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This regex is not correctly anchored for the full alternation: ^ only applies to Article and $ only applies to APIReference. That can cause false positives (e.g., matching strings with extra prefixes/suffixes). Wrap the alternation in a group and anchor the whole expression (e.g., ^(...|...)$).

Suggested change
Regex("""^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$""")
Regex("""^(Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference)$""")

Copilot uses AI. Check for mistakes.
Comment on lines +45 to +47
* Disable JSON-LD metadata extraction.
*/
public val disableJSONLD: Boolean = false,
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public API property name disableJSONLD is not idiomatic Kotlin for acronyms (it should typically be disableJsonLd). Consider renaming for consistency/readability; if this is a compatibility concern with upstream Readability.js option names, consider providing a deprecated alias to avoid an API break.

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +29
val style = node.attr("style")
if (style.isNotEmpty() && Regex("display\\s*:\\s*none", RegexOption.IGNORE_CASE).containsMatchIn(style)) {
return false
}
Copy link

Copilot AI Apr 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a new Regex instance on every isNodeVisible call. Since this is likely executed many times during scoring, consider hoisting the regex to a private val (or into RegExps) so it’s compiled once.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants