Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Pull request overview
Adds/updates the Kotlin Multiplatform port of Mozilla Readability by introducing core public data models/options, a readerability pre-check, and expanding the conformance fixture set (HTML + expected metadata) used by tests.
Changes:
- Added Kotlin implementations for regex constants, article models/options, and
isProbablyReaderableheuristic. - Added extensive test fixtures (
source.html,expected.html,expected-metadata.json) across many real-world pages. - Added KMP + Android build configuration for the
readabilitymodule.
Reviewed changes
Copilot reviewed 94 out of 413 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| readability/src/commonTest/resources/test-pages/firefox-nightly-blog/expected.html | Adds expected readability output fixture for Firefox Nightly blog page. |
| readability/src/commonTest/resources/test-pages/firefox-nightly-blog/expected-metadata.json | Adds expected extracted metadata fixture for Firefox Nightly blog page. |
| readability/src/commonTest/resources/test-pages/engadget/expected.html | Adds expected readability output fixture for Engadget page. |
| readability/src/commonTest/resources/test-pages/engadget/expected-metadata.json | Adds expected extracted metadata fixture for Engadget page. |
| readability/src/commonTest/resources/test-pages/embedded-videos/source.html | Adds source HTML fixture for embedded video handling. |
| readability/src/commonTest/resources/test-pages/embedded-videos/expected.html | Adds expected readability output fixture for embedded video handling. |
| readability/src/commonTest/resources/test-pages/embedded-videos/expected-metadata.json | Adds expected extracted metadata fixture for embedded video handling. |
| readability/src/commonTest/resources/test-pages/ehow-2/expected.html | Adds expected readability output fixture for eHow page (variant 2). |
| readability/src/commonTest/resources/test-pages/ehow-2/expected-metadata.json | Adds expected extracted metadata fixture for eHow page (variant 2). |
| readability/src/commonTest/resources/test-pages/ehow-1/expected.html | Adds expected readability output fixture for eHow page (variant 1). |
| readability/src/commonTest/resources/test-pages/ehow-1/expected-metadata.json | Adds expected extracted metadata fixture for eHow page (variant 1). |
| readability/src/commonTest/resources/test-pages/ebb-org/expected.html | Adds expected readability output fixture for ebb.org page. |
| readability/src/commonTest/resources/test-pages/ebb-org/expected-metadata.json | Adds expected extracted metadata fixture for ebb.org page. |
| readability/src/commonTest/resources/test-pages/dropbox-blog/expected-metadata.json | Adds expected extracted metadata fixture for Dropbox blog page. |
| readability/src/commonTest/resources/test-pages/dev418/source.html | Adds source HTML fixture for dev418 image/figure list cases. |
| readability/src/commonTest/resources/test-pages/dev418/expected.html | Adds expected readability output fixture for dev418 image/figure list cases. |
| readability/src/commonTest/resources/test-pages/dev418/expected-metadata.json | Adds expected extracted metadata fixture for dev418. |
| readability/src/commonTest/resources/test-pages/data-url-image/expected-metadata.json | Adds expected extracted metadata fixture for data URL image page. |
| readability/src/commonTest/resources/test-pages/daringfireball-1/source.html | Adds source HTML fixture for Daring Fireball page. |
| readability/src/commonTest/resources/test-pages/daringfireball-1/expected.html | Adds expected readability output fixture for Daring Fireball page. |
| readability/src/commonTest/resources/test-pages/daringfireball-1/expected-metadata.json | Adds expected extracted metadata fixture for Daring Fireball page. |
| readability/src/commonTest/resources/test-pages/comment-inside-script-parsing/source.html | Adds source HTML fixture for parsing comments inside <script>. |
| readability/src/commonTest/resources/test-pages/comment-inside-script-parsing/expected.html | Adds expected readability output for comment-inside-script parsing case. |
| readability/src/commonTest/resources/test-pages/comment-inside-script-parsing/expected-metadata.json | Adds expected extracted metadata for comment-inside-script parsing case. |
| readability/src/commonTest/resources/test-pages/cnn/expected.html | Adds expected readability output fixture for CNN page. |
| readability/src/commonTest/resources/test-pages/cnn/expected-metadata.json | Adds expected extracted metadata fixture for CNN page. |
| readability/src/commonTest/resources/test-pages/cnet/expected.html | Adds expected readability output fixture for CNET page. |
| readability/src/commonTest/resources/test-pages/cnet/expected-metadata.json | Adds expected extracted metadata fixture for CNET page. |
| readability/src/commonTest/resources/test-pages/cnet-svg-classes/expected-metadata.json | Adds expected extracted metadata fixture for CNET SVG-classes page. |
| readability/src/commonTest/resources/test-pages/clean-links/expected-metadata.json | Adds expected extracted metadata fixture for clean-links case. |
| readability/src/commonTest/resources/test-pages/citylab-1/expected.html | Adds expected readability output fixture for CityLab page. |
| readability/src/commonTest/resources/test-pages/citylab-1/expected-metadata.json | Adds expected extracted metadata fixture for CityLab page. |
| readability/src/commonTest/resources/test-pages/buzzfeed-1/expected.html | Adds expected readability output fixture for BuzzFeed page. |
| readability/src/commonTest/resources/test-pages/buzzfeed-1/expected-metadata.json | Adds expected extracted metadata fixture for BuzzFeed page. |
| readability/src/commonTest/resources/test-pages/bug-1255978/expected.html | Adds expected readability output fixture for regression case bug-1255978. |
| readability/src/commonTest/resources/test-pages/bug-1255978/expected-metadata.json | Adds expected extracted metadata fixture for regression case bug-1255978. |
| readability/src/commonTest/resources/test-pages/breitbart/expected.html | Adds expected readability output fixture for Breitbart page. |
| readability/src/commonTest/resources/test-pages/breitbart/expected-metadata.json | Adds expected extracted metadata fixture for Breitbart page. |
| readability/src/commonTest/resources/test-pages/blogger/expected.html | Adds expected readability output fixture for Blogger page. |
| readability/src/commonTest/resources/test-pages/blogger/expected-metadata.json | Adds expected extracted metadata fixture for Blogger page. |
| readability/src/commonTest/resources/test-pages/bbc-1/expected.html | Adds expected readability output fixture for BBC page. |
| readability/src/commonTest/resources/test-pages/bbc-1/expected-metadata.json | Adds expected extracted metadata fixture for BBC page. |
| readability/src/commonTest/resources/test-pages/basic-tags-cleaning/source.html | Adds source HTML fixture for basic tag cleaning. |
| readability/src/commonTest/resources/test-pages/basic-tags-cleaning/expected.html | Adds expected readability output fixture for basic tag cleaning. |
| readability/src/commonTest/resources/test-pages/basic-tags-cleaning/expected-metadata.json | Adds expected extracted metadata fixture for basic tag cleaning. |
| readability/src/commonTest/resources/test-pages/base-url/source.html | Adds source HTML fixture for base URL normalization. |
| readability/src/commonTest/resources/test-pages/base-url/expected.html | Adds expected readability output fixture for base URL normalization. |
| readability/src/commonTest/resources/test-pages/base-url/expected-metadata.json | Adds expected extracted metadata fixture for base URL normalization. |
| readability/src/commonTest/resources/test-pages/base-url-base-element/source.html | Adds source HTML fixture for <base> URL handling. |
| readability/src/commonTest/resources/test-pages/base-url-base-element/expected.html | Adds expected readability output fixture for <base> URL handling. |
| readability/src/commonTest/resources/test-pages/base-url-base-element/expected-metadata.json | Adds expected extracted metadata fixture for <base> URL handling. |
| readability/src/commonTest/resources/test-pages/base-url-base-element-relative/source.html | Adds source HTML fixture for relative <base> handling. |
| readability/src/commonTest/resources/test-pages/base-url-base-element-relative/expected.html | Adds expected readability output fixture for relative <base> handling. |
| readability/src/commonTest/resources/test-pages/base-url-base-element-relative/expected-metadata.json | Adds expected extracted metadata fixture for relative <base> handling. |
| readability/src/commonTest/resources/test-pages/article-author-tag/expected.html | Adds expected readability output fixture for article author tag handling. |
| readability/src/commonTest/resources/test-pages/article-author-tag/expected-metadata.json | Adds expected extracted metadata fixture for article author tag handling. |
| readability/src/commonTest/resources/test-pages/ars-1/expected.html | Adds expected readability output fixture for Ars Technica page. |
| readability/src/commonTest/resources/test-pages/ars-1/expected-metadata.json | Adds expected extracted metadata fixture for Ars Technica page. |
| readability/src/commonTest/resources/test-pages/archive-of-our-own/expected.html | Adds expected readability output fixture for AO3 page. |
| readability/src/commonTest/resources/test-pages/archive-of-our-own/expected-metadata.json | Adds expected extracted metadata fixture for AO3 page. |
| readability/src/commonTest/resources/test-pages/aktualne/expected.html | Adds expected readability output fixture for aktualne.cz page. |
| readability/src/commonTest/resources/test-pages/aktualne/expected-metadata.json | Adds expected extracted metadata fixture for aktualne.cz page. |
| readability/src/commonTest/resources/test-pages/aclu/expected.html | Adds expected readability output fixture for ACLU page. |
| readability/src/commonTest/resources/test-pages/aclu/expected-metadata.json | Adds expected extracted metadata fixture for ACLU page. |
| readability/src/commonTest/resources/test-pages/005-unescape-html-entities/source.html | Adds source HTML fixture for HTML entity unescape edge cases. |
| readability/src/commonTest/resources/test-pages/005-unescape-html-entities/expected.html | Adds expected readability output fixture for HTML entity unescape edge cases. |
| readability/src/commonTest/resources/test-pages/005-unescape-html-entities/expected-metadata.json | Adds expected extracted metadata fixture for HTML entity unescape edge cases. |
| readability/src/commonTest/resources/test-pages/004-metadata-space-separated-properties/source.html | Adds source HTML fixture for space-separated meta properties parsing. |
| readability/src/commonTest/resources/test-pages/004-metadata-space-separated-properties/expected.html | Adds expected readability output fixture for space-separated meta properties parsing. |
| readability/src/commonTest/resources/test-pages/004-metadata-space-separated-properties/expected-metadata.json | Adds expected extracted metadata fixture for space-separated meta properties parsing. |
| readability/src/commonTest/resources/test-pages/003-metadata-preferred/source.html | Adds source HTML fixture for preferred metadata selection precedence. |
| readability/src/commonTest/resources/test-pages/003-metadata-preferred/expected.html | Adds expected readability output fixture for preferred metadata selection precedence. |
| readability/src/commonTest/resources/test-pages/003-metadata-preferred/expected-metadata.json | Adds expected extracted metadata fixture for preferred metadata selection precedence. |
| readability/src/commonTest/resources/test-pages/002/expected-metadata.json | Adds expected extracted metadata fixture for test page 002. |
| readability/src/commonTest/resources/test-pages/001/source.html | Adds source HTML fixture for test page 001. |
| readability/src/commonTest/resources/test-pages/001/expected.html | Adds expected readability output fixture for test page 001. |
| readability/src/commonTest/resources/test-pages/001/expected-metadata.json | Adds expected extracted metadata fixture for test page 001. |
| readability/src/commonMain/kotlin/dev/dimension/flare/readability/RegExps.kt | Introduces centralized regex definitions used by the Readability algorithm. |
| readability/src/commonMain/kotlin/dev/dimension/flare/readability/ReadabilityOptions.kt | Adds public options model for configuring Readability parsing behavior. |
| readability/src/commonMain/kotlin/dev/dimension/flare/readability/IsProbablyReaderable.kt | Adds readerability heuristic + visibility check helpers. |
| readability/src/commonMain/kotlin/dev/dimension/flare/readability/ArticleMetadata.kt | Adds internal metadata holder used during parsing. |
| readability/src/commonMain/kotlin/dev/dimension/flare/readability/Article.kt | Adds public article result data model. |
| readability/build.gradle.kts | Configures the readability module as a Kotlin Multiplatform library with dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| kotlin { | ||
| jvmToolchain(libs.versions.java.get().toInt()) | ||
| explicitApi() | ||
| applyDefaultHierarchyTemplate() | ||
| androidLibrary { | ||
| compileSdk = libs.versions.compileSdk.get().toInt() | ||
| namespace = "dev.dimension.flare.readability" | ||
| minSdk = libs.versions.minSdk.get().toInt() | ||
| } | ||
| jvm() |
There was a problem hiding this comment.
The Kotlin Multiplatform DSL here appears to be missing an Android target declaration (androidTarget()), and compileSdk/namespace/minSdk are typically configured in the Android Gradle Plugin android {} block (not under kotlin {}). As written, this is likely to fail Gradle configuration or produce no Android target; move Android SDK/namespace config to android {} and add androidTarget() (or the correct Android target DSL your Kotlin version supports).
|
|
||
| // See: https://schema.org/Article | ||
| val jsonLdArticleTypes = | ||
| Regex("""^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$""") |
There was a problem hiding this comment.
This regex is not correctly anchored for the full alternation: ^ only applies to Article and $ only applies to APIReference. That can cause false positives (e.g., matching strings with extra prefixes/suffixes). Wrap the alternation in a group and anchor the whole expression (e.g., ^(...|...)$).
| Regex("""^Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference$""") | |
| Regex("""^(Article|AdvertiserContentArticle|NewsArticle|AnalysisNewsArticle|AskPublicNewsArticle|BackgroundNewsArticle|OpinionNewsArticle|ReportageNewsArticle|ReviewNewsArticle|Report|SatiricalArticle|ScholarlyArticle|MedicalScholarlyArticle|SocialMediaPosting|BlogPosting|LiveBlogPosting|DiscussionForumPosting|TechArticle|APIReference)$""") |
| * Disable JSON-LD metadata extraction. | ||
| */ | ||
| public val disableJSONLD: Boolean = false, |
There was a problem hiding this comment.
Public API property name disableJSONLD is not idiomatic Kotlin for acronyms (it should typically be disableJsonLd). Consider renaming for consistency/readability; if this is a compatibility concern with upstream Readability.js option names, consider providing a deprecated alias to avoid an API break.
| val style = node.attr("style") | ||
| if (style.isNotEmpty() && Regex("display\\s*:\\s*none", RegexOption.IGNORE_CASE).containsMatchIn(style)) { | ||
| return false | ||
| } |
There was a problem hiding this comment.
This creates a new Regex instance on every isNodeVisible call. Since this is likely executed many times during scoring, consider hoisting the regex to a private val (or into RegExps) so it’s compiled once.
No description provided.