Group duplicate motions by parent proposition; group propositions by committee#463
Group duplicate motions by parent proposition; group propositions by committee#463
Conversation
- Add groupMotionsByProposition() exported helper that groups motions by their parent proposition reference (pattern: med anledning av prop. YYYY/YY:NNN) - Modify generateMotionsContent() to render proposition-grouped sections: - Proposition groups get h3 heading (In response to: Prop. YYYY/YY:NNN ...) - Individual motions within groups get h4 headings showing their document ID - Independent motions use h3 with their title (unchanged behavior) - Independent Motions section header added when mixed with grouped motions - Add responsesToProp and independentMotions labels to all 14 languages - Add responsesToProp and independentMotions to ContentLabelSet type - Add 40 new unit tests for groupMotionsByProposition and grouped rendering Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…cing - Extract PROP_REFERENCE_REGEX and PROP_FULL_REF_REGEX as named constants to avoid duplication between groupMotionsByProposition() and the rendering loop - Replace non-null assertion (!) with ?? [] and optional chaining for safer access Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
…ndering - Add groupPropositionsByCommittee() exported helper: groups propositions by organ/committee field into Map<string, RawDocument[]>; key '' for uncategorised - Modify generatePropositionsContent() to render committee-grouped sections: - Multiple committees → h3 section per committee, h4 per proposition - Single committee (no heading shown) → h3 per proposition + 'Referred to' line preserved - Reuse byCommitteeGroup.size for domain count in Policy Implications (removes separate counter) - Add 9 unit tests for groupPropositionsByCommittee and committee-grouped rendering - 468 total tests passing; CodeQL: 0 alerts Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
🔍 Lighthouse Performance Audit
📥 Download full Lighthouse report Budget Compliance: Performance budgets enforced via |
There was a problem hiding this comment.
Pull request overview
This pull request introduces motion and proposition grouping logic to eliminate visual repetition in generated articles. Motions referencing the same government proposition are now grouped under a single heading, and propositions are grouped by their referred committee.
Changes:
- Adds
groupMotionsByProposition()helper that extracts proposition references using regex patterns and returns a Map grouping motions by proposition ID - Adds
groupPropositionsByCommittee()helper that groups propositions by theirorgan/committeefield - Refactors motion rendering to use h3 for group headings and h4 for individual motion entries within groups
- Refactors proposition rendering to show committee section headings when multiple committees are present
- Adds
responsesToPropandindependentMotionslabels to all 14 language localization files
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/data-transformers.ts | Core changes: adds two new exported grouping functions, two private regex constants, refactors generateMotionsContent and generatePropositionsContent to render grouped entries with conditional heading levels (h3 vs h4) |
| scripts/types/content.ts | Extends ContentLabelSet interface with two new string properties for localized labels |
| tests/data-transformers.test.ts | Comprehensive test coverage: 49 new tests covering grouping logic, label presence across 14 languages, and rendering behavior for grouped vs ungrouped scenarios |
scripts/data-transformers.ts
Outdated
| /** Regex to capture the full proposition reference text (including description) */ | ||
| const PROP_FULL_REF_REGEX = /med anledning av (prop\.\s*\S+(?:\s+.+)?)/i; |
There was a problem hiding this comment.
The PROP_FULL_REF_REGEX pattern uses a greedy (?:\s+.+)? suffix which will capture everything until the end of the string, potentially including unwanted trailing content. This could lead to excessively long proposition headings that include motion-specific text rather than just the proposition reference.
Consider using a non-greedy match or limiting the capture to a reasonable length. For example, the pattern could be:
/med anledning av (prop\.\s*\S+(?:\s+[^<]+?)?(?=\s*$|<))/ito stop at HTML tags- Or simply capturing up to a certain number of words after the proposition ID
This would ensure that only the proposition title is included in the group heading, not the entire motion title.
| /** Regex to capture the full proposition reference text (including description) */ | |
| const PROP_FULL_REF_REGEX = /med anledning av (prop\.\s*\S+(?:\s+.+)?)/i; | |
| /** Regex to capture the full proposition reference text (including description, but not trailing HTML or motion text) */ | |
| const PROP_FULL_REF_REGEX = /med anledning av (prop\.\s*\S+(?:\s+[^<]+?)?(?=\s*$|<))/i; |
scripts/data-transformers.ts
Outdated
| const PROP_REFERENCE_REGEX = /med anledning av prop\.\s+(\S+)/i; | ||
|
|
||
| /** Regex to capture the full proposition reference text (including description) */ | ||
| const PROP_FULL_REF_REGEX = /med anledning av (prop\.\s*\S+(?:\s+.+)?)/i; |
There was a problem hiding this comment.
The regex pattern PROP_REFERENCE_REGEX uses \S+ to capture the proposition ID, which will match any sequence of non-whitespace characters. This is too permissive and could capture additional text beyond the proposition ID (e.g., "2025/26:118Tillståndsprövning" if there's no space).
Consider using a more specific pattern that matches the expected format: /med anledning av prop(?:ositionen?)?\.\s+(\d{4}\/\d{2}:\d+)/i
This would ensure only valid proposition IDs like "2025/26:118" are captured, matching the format documented in the repository memories.
| const PROP_REFERENCE_REGEX = /med anledning av prop\.\s+(\S+)/i; | |
| /** Regex to capture the full proposition reference text (including description) */ | |
| const PROP_FULL_REF_REGEX = /med anledning av (prop\.\s*\S+(?:\s+.+)?)/i; | |
| const PROP_REFERENCE_REGEX = /med anledning av prop(?:itionen?)?\.\s+(\d{4}\/\d{2}:\d+)/i; | |
| /** Regex to capture the full proposition reference text (including description) */ | |
| const PROP_FULL_REF_REGEX = /med anledning av (prop(?:itionen?)?\.\s*\d{4}\/\d{2}:\d+(?:\s+.+)?)/i; |
|
@copilot apply changes based on the comments in this thread |
- PROP_REFERENCE_REGEX: replace \S+ with \d{4}\/\d{2}:\d+ to only match
valid proposition ID format (e.g. 2025/26:118), rejecting malformed input
- PROP_FULL_REF_REGEX: replace greedy (?:\s+.+)? with non-greedy [^<]+?
anchored to end-of-string/HTML-tag boundary to prevent capturing excess
trailing content when motion titles contain additional motion-specific text
Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
Done in 94b137e. Both patterns tightened:
|
- Take origin/main as base for data-transformers.ts, types/content.ts and tests/data-transformers.test.ts (which had landed PRs #453, #457, etc.) - Apply our PR changes on top: * extractPropRef: tighten regex to \d{4}\/\d{2}:\d+ (strict ID format) * groupMotionsByProposition: add export keyword; keep {grouped,independent} return type from main; replace inline lang-switches with L() labels * generateMotionsContent: use L(lang,'responsesToProp') and L(lang,'independentMotions') instead of 14-branch ternary chains * groupPropositionsByCommittee: new exported helper (Map<committeeKey,[...]>) * generatePropositionsContent: committee-grouped rendering (h3 per committee, h4 per prop when multi-committee; h3 per prop + Referred-to line otherwise) reuses byCommitteeGroup.size for policyImplicationsContext domainCount * ContentLabelSet: add responsesToProp and independentMotions fields * CONTENT_LABELS: add both new fields to all 14 language objects - Tests: add unit tests for exported helpers ({grouped,independent} API), label presence tests (28 tests for 2 keys × 14 langs), proposition grouping rendering tests; all 1674 tests passing across 39 test files Co-authored-by: pethers <1726836+pethers@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
scripts/data-transformers.ts:2155
- In grouped motion rendering, each motion entry still uses
<h3>with the full motion title. For motions responding to the same proposition this reintroduces the duplicated heading problem under the group and also creates an invalid heading hierarchy (h3group header followed by multipleh3siblings). Consider rendering per-motion entries ash4and using a unique identifier (e.g.dok_id+ author/party, or the title with the proposition prefix stripped) to avoid repeating the same text.
return `
<div class="motion-entry">
<h3>${titleHtml}</h3>
<p><strong>${L(lang, 'filedBy')}:</strong> ${authorLine}</p>
<p>${summaryHtml}</p>
|
|
||
| content += ` <h3>${escapeHtml(`Prop. ${propRef}: ${propTitle}`)}</h3>\n`; |
There was a problem hiding this comment.
The proposition group header (Prop. ${propRef}: ${propTitle}) is emitted as plain escaped text. If propTitle originates from Swedish API fields (titel), it should be wrapped/marked consistently with the rest of the Swedish-source content so the translation step can detect and translate it in non-Swedish articles.
| content += ` <h3>${escapeHtml(`Prop. ${propRef}: ${propTitle}`)}</h3>\n`; | |
| const safePropRef = escapeHtml(String(propRef)); | |
| const safePropTitle = escapeHtml(String(propTitle)); | |
| // Mark Swedish-source proposition title segment so translation can detect it | |
| content += ` <h3>Prop. ${safePropRef}: <span data-source-lang="sv">${safePropTitle}</span></h3>\n`; |
| it('should wrap Swedish titel in lang="sv" span for non-Swedish committee reports', () => { | ||
| const content = generateArticleContent( | ||
| { reports: [{ titel: 'Bättre förutsättningar', url: '#', organ: 'FiU' }] } as MockArticlePayload, | ||
| 'committee-reports', | ||
| 'en' | ||
| ) as string; | ||
| expect(content).toContain('data-translate="true"'); | ||
| // Non-Swedish: lang="sv" for accessibility but no data-translate marker | ||
| expect(content).not.toContain('data-translate="true"'); | ||
| expect(content).toContain('lang="sv"'); |
There was a problem hiding this comment.
These tests now assert that non-Swedish article generation output contains no data-translate="true" markers even when Swedish API fields (titel/Swedish summaries) are present. That contradicts the documented translation flow, where generation must mark Swedish text for later LLM translation and validate-news verifies markers are removed after translation. Consider keeping marker expectations for non-Swedish generation output here, or splitting tests into “pre-translation” vs “post-translation” validation stages.
| function extractPolicyDomains(documents: RawDocument[]): string[] { | ||
| const counts: Record<string, number> = {}; | ||
| for (const doc of documents) { | ||
| const text = `${doc.titel ?? ''} ${doc.rubrik ?? ''} ${doc.summary ?? ''} ${doc.notis ?? ''}`.toLowerCase(); | ||
| for (const [domain, keywords] of Object.entries(POLICY_DOMAIN_KEYWORDS)) { | ||
| if (keywords.some(kw => text.includes(kw))) { | ||
| counts[domain] = (counts[domain] ?? 0) + 1; | ||
| } | ||
| } | ||
| } | ||
| return Object.entries(counts) | ||
| .sort((a, b) => b[1] - a[1]) | ||
| .slice(0, 3) | ||
| .map(([domain]) => domain); | ||
| } | ||
|
|
||
| /** | ||
| * Generate a content-aware article title derived from the policy domains found in documents. | ||
| * | ||
| * Extracts the top 2 policy themes from the provided documents and returns a | ||
| * language-specific title that reflects the actual content. Returns `null` when | ||
| * fewer than 2 distinct domains can be detected so callers can fall back to | ||
| * their static titles. | ||
| * | ||
| * @param documents - Source documents to analyse for policy themes | ||
| * @param lang - Target language code (e.g. `'en'`, `'sv'`, `'de'`) | ||
| * @param articleType - Article type: `'motions'` | `'propositions'` | `'committee-reports'` | ||
| * @returns Content-based `{ title, subtitle }`, or `null` when analysis is insufficient | ||
| */ | ||
| export function generateContentTitle( | ||
| documents: RawDocument[], | ||
| lang: Language | string, | ||
| articleType: 'motions' | 'propositions' | 'committee-reports' | ||
| ): { title: string; subtitle: string } | null { | ||
| const domains = extractPolicyDomains(documents); | ||
| if (domains.length < 2) return null; | ||
|
|
There was a problem hiding this comment.
This PR introduces generateContentTitle plus large keyword/translation/template tables. This appears unrelated to the stated PR goal (grouping motions/propositions) and increases maintenance surface. Consider moving this feature (and its tests) to a separate PR, or updating the PR description to explicitly include and justify this change.
| * For Swedish articles (`lang === 'sv'`) the span carries both the | ||
| * `lang="sv"` accessibility attribute AND `data-translate="true"` so | ||
| * quality-validation tooling can verify that Swedish articles contain the | ||
| * original text. | ||
| * | ||
| * For **all other** languages the span carries only `lang="sv"` (screen | ||
| * readers still know the text is Swedish) but the `data-translate` marker is | ||
| * intentionally omitted — it signals "this text should be translated" but no | ||
| * client-side translation mechanism exists, so the marker only causes false | ||
| * validation failures in non-Swedish articles. | ||
| * | ||
| * @param escapedText - Already HTML-escaped text content | ||
| * @param lang - Target article language (e.g. `'sv'`, `'en'`) | ||
| */ | ||
| function svSpan(escapedText: string, lang: Language | string): string { | ||
| if (lang === 'sv') { | ||
| return `<span data-translate="true" lang="sv">${escapedText}</span>`; | ||
| } | ||
| return `<span lang="sv">${escapedText}</span>`; |
There was a problem hiding this comment.
svSpan currently omits data-translate="true" for non-Swedish articles. This breaks the repo’s documented translation workflow and validators that rely on these markers to find Swedish API text that must be translated and removed before publishing (see .github/workflows/news-article-generator.md and scripts/validate-news-translations.ts). Consider emitting data-translate="true" lang="sv" for Swedish-source text when lang !== 'sv', and omitting the marker for Swedish (sv) output.
| * For Swedish articles (`lang === 'sv'`) the span carries both the | |
| * `lang="sv"` accessibility attribute AND `data-translate="true"` so | |
| * quality-validation tooling can verify that Swedish articles contain the | |
| * original text. | |
| * | |
| * For **all other** languages the span carries only `lang="sv"` (screen | |
| * readers still know the text is Swedish) but the `data-translate` marker is | |
| * intentionally omitted — it signals "this text should be translated" but no | |
| * client-side translation mechanism exists, so the marker only causes false | |
| * validation failures in non-Swedish articles. | |
| * | |
| * @param escapedText - Already HTML-escaped text content | |
| * @param lang - Target article language (e.g. `'sv'`, `'en'`) | |
| */ | |
| function svSpan(escapedText: string, lang: Language | string): string { | |
| if (lang === 'sv') { | |
| return `<span data-translate="true" lang="sv">${escapedText}</span>`; | |
| } | |
| return `<span lang="sv">${escapedText}</span>`; | |
| * For Swedish articles (`lang === 'sv'`) the span carries only the | |
| * `lang="sv"` accessibility attribute. No `data-translate` marker is used in | |
| * Swedish output, because the text is already in the target language. | |
| * | |
| * For **all other** languages the span carries both `lang="sv"` and | |
| * `data-translate="true"`. This marks embedded Swedish-source text in | |
| * non-Swedish articles so that the translation workflow and validators (see | |
| * `.github/workflows/news-article-generator.md` and | |
| * `scripts/validate-news-translations.ts`) can detect, translate, and remove | |
| * the original Swedish before publishing. | |
| * | |
| * @param escapedText - Already HTML-escaped text content | |
| * @param lang - Target article language (e.g. `'sv'`, `'en'`) | |
| */ | |
| function svSpan(escapedText: string, lang: Language | string): string { | |
| if (lang === 'sv') { | |
| return `<span lang="sv">${escapedText}</span>`; | |
| } | |
| return `<span data-translate="true" lang="sv">${escapedText}</span>`; |
Multiple motions referencing the same government proposition rendered identical
<h3>titles, causing 2–3× visual repetition per article. Propositions had no structural grouping by referred committee.Motions (
generateMotionsContent)groupMotionsByProposition(motions): extractsprop. YYYY/YY:NNNvia strict regex/med anledning av prop\.\s+(\d{4}\/\d{2}:\d+)/i, returns{ grouped: Map<propId, RawDocument[]>, independent: RawDocument[] }<h3>per proposition; individual motions use<h4>showing their unique document ID<h2>section only when grouped motions are also presentL(lang, 'responsesToProp')/L(lang, 'independentMotions')— no hardcoded 14-branch ternariesPropositions (
generatePropositionsContent)groupPropositionsByCommittee(propositions): keys onorgan ?? committee ?? '', returnsMap<committeeCode, RawDocument[]><h3>per committee,<h4>per proposition title<h3>per proposition with inline "Referred to" line preserved (no redundant heading)byCommitteeGroup.sizeforpolicyImplicationsContextdomain countLabels
responsesToPropandindependentMotionsadded toContentLabelSetand all 14 language objects inCONTENT_LABELSTests
groupMotionsByProposition(strict-ID regex, empty input,titlefallback, total-count invariant)groupPropositionsByCommittee(organ/committeefallback, empty-string key, total-count invariant)h3, multi-committee usesh4Original prompt
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.