fix: translate Swedish content in non-Swedish articles, use content-based titles #451
🔍 Lighthouse Performance Audit
📥 Download full Lighthouse report Budget Compliance: Performance budgets enforced via |
Now all PRs for the issues above have been merged. Analyse all issues and the current state, and complete any remaining tasks by creating GitHub issues and assigning them to Copilot. Analyse different areas for Copilot to improve.
@copilot
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…based titles
- Enhance translateSwedishContent to translate bare <span lang="sv"> spans
for non-Swedish articles (not just data-translate spans)
- Use generateContentTitle() in generate-news-enhanced.ts for all 3 article
types instead of hardcoded static titles ("Battle Lines This Week")
- Fix renderMotionEntry and other renderers to always wrap Swedish titles in
svSpan() so they get translated by translateSwedishContent post-processor
- Batch-fix 91 old 2026-02-22 articles removing data-translate markers and
translating Swedish text
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
// For non-Swedish articles: translate ALL <span lang="sv"> spans in one pass
return html.replace(ANY_SV_SPAN_REGEX, (_match: string, attrs: string, inner: string): string => {
  // Remove data-translate marker if present, keep lang="sv" for accessibility
  const cleanedAttrs = attrs.replace(/\s*data-translate=(?:"true"|'true')/, '').trim();
  const translatedInner = translatePhrase(inner, targetLang);
The new ANY_SV_SPAN_REGEX at line 718 correctly matches all <span lang="sv"> elements, but the translatePhrase function it calls (line 737) uses longest-prefix matching that produces broken hybrid translations. When dictionary contains "Interpellations" but the span contains "Interpellationssvar", prefix matching returns translation + "svar", creating text like "质询ssvar" (Chinese + Swedish suffix). All Swedish compound words in spans need complete dictionary entries.
Fixed in 528faf1. translatePhrase now has a word boundary check that prevents prefix matching from splitting compound Swedish words. The character after a prefix match must be a word boundary ([\s.,;:!?\-–—/()0-9]) or end of string — letters like "ssvar" in "Interpellationssvar" no longer trigger a prefix match on "interpellation".
Additionally added complete dictionary entries for "interpellationssvar", "sammanträde", and "återrapportering" so these compound words get exact-match translations.
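A minimal sketch of the word boundary check described in this reply, assuming a flat lowercase dictionary. The name `translatePhraseSketch`, the `Map`-based dictionary, and the sample entries are illustrative assumptions and not the repository's actual code; only the boundary character class is taken from the reply above.

```typescript
// Characters that may follow a prefix match (character class quoted in the reply above).
const BOUNDARY_AFTER_PREFIX = /[\s.,;:!?\-–—\/()0-9]/;

// Hypothetical stand-in for translatePhrase: exact match first, then the
// longest dictionary prefix that ends at a word boundary.
function translatePhraseSketch(text: string, dict: Map<string, string>): string {
  const lower = text.toLowerCase();

  // 1. An exact dictionary hit always wins.
  const exact = dict.get(lower);
  if (exact !== undefined) return exact;

  // 2. Try prefixes, longest key first.
  const entries = Array.from(dict.entries()).sort((a, b) => b[0].length - a[0].length);
  for (const [key, translation] of entries) {
    if (!lower.startsWith(key)) continue;
    const next = text.charAt(key.length); // "" at end of string
    // Accept the prefix only if it is followed by a boundary character,
    // so "interpellation" cannot match inside "Interpellationssvar".
    if (next === "" || BOUNDARY_AFTER_PREFIX.test(next)) {
      return translation + text.slice(key.length);
    }
  }

  // 3. No safe match: leave the Swedish text untouched rather than build a hybrid.
  return text;
}

// Illustrative entries only.
const prefixOnlyDict = new Map<string, string>([["interpellation", "质询"]]);
const completeDict = new Map<string, string>([
  ["interpellation", "质询"],
  ["interpellationssvar", "质询答复"],
]);
```

With only the prefix entry, the compound word is returned unchanged instead of becoming "质询ssvar"; with the complete entry it resolves through the exact-match path. The sketch assumes lowercasing does not change string length, which holds for Swedish text.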
scripts/data-transformers.ts
Outdated
const looksSwedish = /[åäöÅÄÖ]/.test(titleText) || /^med anledning av /i.test(titleText);
const titleHtml = (motion.titel || looksSwedish)
The Swedish detection, which relies on the regexes /[åäöÅÄÖ]/ and /^med anledning av /i, is too simplistic and will miss many Swedish titles: any Swedish text that contains neither these characters nor this phrase goes undetected. A more robust approach would check whether motion.titel is populated (indicating a Swedish API source) or use language detection. As written, Swedish text could be rendered without the svSpan() wrapper, which prevents translation.
Fixed in 528faf1. Replaced the fragile regex /[åäöÅÄÖ]/ with a simple X.titel presence check. Since titel is only populated by the Swedish Riksdag API, its presence reliably indicates Swedish content — no character-level heuristics needed. Applied to all 6 rendering locations (events, documents, reports, propositions, motions, generic).
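The presence check could look roughly like the sketch below. The interface `RiksdagItemSketch`, the helpers `escapeHtmlSketch` and `svSpanSketch`, and the fallback `title` field are assumptions made for illustration; the real renderers and the real svSpan() may differ.

```typescript
// Hypothetical item shape: `titel` is assumed to be set only for Riksdag API data.
interface RiksdagItemSketch {
  titel?: string; // Swedish title from the API
  title?: string; // assumed fallback title from another source
}

function escapeHtmlSketch(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Stand-in for svSpan(): marks text as Swedish so a post-processor can find it.
function svSpanSketch(text: string): string {
  return `<span lang="sv">${escapeHtmlSketch(text)}</span>`;
}

// Presence of `titel` decides the wrapping; no character-level heuristics.
function renderTitleSketch(item: RiksdagItemSketch): string {
  if (item.titel) return svSpanSketch(item.titel);
  return escapeHtmlSketch(item.title ?? "");
}
```

A title such as "Interpellationssvar" contains no å, ä, or ö, so the old regex would have missed it, while the presence check wraps it regardless of its characters.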
news/2026-02-22-month-ahead-zh.html
Outdated
<div class="document-entry">
<h4><a href="https://riksdagen.se/sv/dokument-och-lagar/dokument/HDC120260324ip/" target="_blank" rel="noopener noreferrer"><span data-translate="true" lang="sv">Interpellationssvar</span></a></h4>
<h4><a href="https://riksdagen.se/sv/dokument-och-lagar/dokument/HDC120260324ip/" target="_blank" rel="noopener noreferrer"><span lang="sv">质询ssvar</span></a></h4>
The translated text contains partial translations with Swedish suffixes still attached. For example, "质询ssvar" (Chinese + "ssvar") and "استجوابssvar" (Arabic + "ssvar") are broken, whereas "interpellasjonssvar" appears correct. The translation system splits compound Swedish words but translates only part of them, leaving Swedish morphemes such as "ssvar" or "s sammanträde" appended to translated committee names. This creates broken hybrid text that is neither Swedish nor the target language.
Fixed in 528faf1 + 6696585. All 91 old 2026-02-22 articles have been repaired:
- "质询ssvar" → "质询答复" (ZH), "ردود الاستجواب" (AR), etc.
- "Committee on Civil Affairss meeting" → "Committee on Civil Affairs meeting"
- "国防委员会s 会议" → "国防委员会 会议"
The root cause (prefix matching splitting compound words) is fixed in translatePhrase so new articles won't have this issue.
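The stray "s" in strings such as "Committee on Civil Affairss meeting" comes from the Swedish possessive suffix. A rough sketch of segment-wise translation with possessive stripping follows; the function name, the whitespace splitting, and the dictionary entries are assumptions for illustration and do not reproduce the repository's implementation.

```typescript
// Hypothetical helper: translate each whitespace-separated segment on its own,
// retrying without a trailing possessive "s" when the full word is unknown.
function translateSegmentsSketch(phrase: string, dict: Map<string, string>): string {
  return phrase
    .split(/\s+/)
    .filter((word) => word.length > 0)
    .map((word) => {
      const lower = word.toLowerCase();

      // Whole-word hit.
      const exact = dict.get(lower);
      if (exact !== undefined) return exact;

      // Retry with the possessive "s" removed ("civilutskottets" -> "civilutskottet").
      if (lower.length > 1 && lower.endsWith("s")) {
        const stem = dict.get(lower.slice(0, -1));
        if (stem !== undefined) return stem;
      }

      // Unknown segment: keep the original word.
      return word;
    })
    .join(" ");
}

// Illustrative entries only.
const committeeDict = new Map<string, string>([
  ["civilutskottet", "Committee on Civil Affairs"],
  ["sammanträde", "meeting"],
]);
```

The possessive is dropped before lookup, so no leftover "s" is appended to the translated committee name. This sketch does not reorder words for the target language.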
@copilot apply changes based on the comments in this thread
…ee meetings
- Add word boundary check in translatePhrase prefix matching to prevent splitting compound words (e.g., "Interpellationssvar" no longer becomes "质询ssvar")
- Add multi-segment translation with Swedish possessive 's' stripping for phrases like "Arbetsmarknadsutskottets sammanträde"
- Add dictionary entries for "interpellationssvar", "sammanträde", "återrapportering"
- Simplify Swedish detection to use X.titel presence (API source indicator) instead of fragile regex character checks
- Fix 91 old articles with broken hybrid translations
- Add 2 new tests for compound word protection
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
All 3 review comments addressed in commits 528faf1 and 6696585:
/**
 * Matches any `<span …lang="sv"…>…</span>` span — both bare and with data-translate.
 * Used for non-Swedish articles to translate ALL Swedish text in a single pass.
 */
const ANY_SV_SPAN_REGEX =
  /<span\s+((?=[^>]*lang="sv")[^>]*)>([\s\S]*?)<\/span>/g;
The new ANY_SV_SPAN_REGEX/logic translates all <span …lang="sv"…> spans for non-Swedish targets, but the surrounding documentation in this file still frames the post-processing step as only handling data-translate="true" markers. Please align the nearby JSDoc/comments with the updated behavior to avoid future regressions based on outdated assumptions.
Updated in d071edc. The translateSwedishContent JSDoc now documents the dual-path behavior (Swedish → strip markers only; non-Swedish → translate ALL <span lang="sv"> spans). The TRANSLATABLE_SV_SPAN_REGEX JSDoc clarifies it's only used for Swedish articles, with a cross-reference to ANY_SV_SPAN_REGEX for non-Swedish.
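The dual-path behavior can be illustrated with the simplified sketch below. It reuses the span regex and the marker-stripping regex quoted in this thread, but folds both paths into one pass for brevity, whereas the reply says the real code uses a separate TRANSLATABLE_SV_SPAN_REGEX for Swedish articles. The function name and the injected `translate` callback are assumptions.

```typescript
// Span regex as quoted in the diff above; marker regex as quoted in the earlier snippet.
const ANY_SV_SPAN_SKETCH = /<span\s+((?=[^>]*lang="sv")[^>]*)>([\s\S]*?)<\/span>/g;
const TRANSLATE_MARKER_SKETCH = /\s*data-translate=(?:"true"|'true')/;

// Hypothetical stand-in for translateSwedishContent.
// Swedish target: strip the data-translate marker and keep the text.
// Other targets: translate the inner text of every lang="sv" span.
function translateSwedishContentSketch(
  html: string,
  targetLang: string,
  translate: (text: string, lang: string) => string,
): string {
  return html.replace(
    ANY_SV_SPAN_SKETCH,
    (_match: string, attrs: string, inner: string): string => {
      // Drop the marker but keep lang="sv" on the span.
      const cleanedAttrs = attrs.replace(TRANSLATE_MARKER_SKETCH, "").trim();
      const text = targetLang === "sv" ? inner : translate(inner, targetLang);
      return `<span ${cleanedAttrs}>${text}</span>`;
    },
  );
}

// Illustrative translator: one known word, everything else passes through.
const demoTranslate = (text: string, _lang: string): string =>
  text === "Interpellationssvar" ? "质询答复" : text;

const demoHtml =
  '<h4><span data-translate="true" lang="sv">Interpellationssvar</span> <span lang="en">x</span></h4>';
```

Spans without `lang="sv"` are left alone because the lookahead cannot cross the closing `>` of the opening tag. The lazy inner match stops at the first `</span>`, so nested spans inside a Swedish span are not handled by this sketch.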
@copilot apply changes based on the comments in this thread
…hing behavior
Update function and regex documentation to reflect that non-Swedish articles now process ALL <span lang="sv"> spans (not just data-translate ones). Clarify that TRANSLATABLE_SV_SPAN_REGEX is only used for Swedish articles, while ANY_SV_SPAN_REGEX handles non-Swedish translation.
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Done in d071edc — aligned all JSDoc/comments in |
- translatePhrase prefix matching to require word boundaries (prevents "质询ssvar" hybrids)
- X.titel presence instead of fragile regex