Skip to content

Conversation

@AllanOXDi
Copy link
Member

@AllanOXDi AllanOXDi commented Dec 4, 2025

Summary

This PR improves how the Studio editor handles copy-paste from Microsoft Word so that pasted content more closely matches the formatting that Studio itself supports.

References

#4325

Reviewer guidance

Copy and paste from these sample docs MS Word 2019. LibreOffice Writer v25.8.1.1. Google Doc to exercise editor and verify if:

  • Nested lists show correct indentation
  • Bullet sizes are consistent and appropriately sized
  • First-level bullets are discs
  • Second-level bullets are circles
  • Third-level bullets are squares
  • Strike-through does NOT extend into nested list items
  • Nested lists retain proper indentation even when strike-through is used

Copy link
Member

@nucleogenesis nucleogenesis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll spin it up tomorrow but overall the code looks great overall. Left a suggestion and a couple of questions in the meantime.

}
} catch (err) {
editor.value.chain().focus().insertContent(clipboardAccessFailed$()).run();
return handlePasteNoFormat();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this call here necessary if the one 2 lines down is outside of the if block altogether (or vice-versa?)

Copy link
Member Author

@AllanOXDi AllanOXDi Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! The inner return handlePasteNoFormat() was redundant, as the outer fallback already covers both cases. Thanks

return `$$${latex || ''}$$`;
};

export function sanitizePastedHTML(html) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is specific to MS word then maybe putting that in the name is a good idea a la sanitizeMSWordHTML or something?

Copy link
Member Author

@AllanOXDi AllanOXDi Dec 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the sanitizer initially targeted MS Word, but we’ve expanded it to handle copy-paste issues from Google Docs and LibreOffice as well (e.g., strike-through bleed, nested list normalization). And since it now applies to all external HTML paste sources, renaming it to sanitizeMSWordHTML would be misleading. Keeping the more general name sanitizePastedHTML better reflects its broader use and sounds good to me.

const items = list.querySelectorAll(':scope > li');
items.forEach(item => {
const nestedLists = Array.from(item.children).filter(
child => child.tagName === 'UL' || child.tagName === 'OL',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these uppercase? Is that coming from the pasted text?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uppercase comes from the DOM API itself. Browsers normalize element.tagName to uppercase for all HTML elements, regardless of how they appear in the pasted HTML. https://developer.mozilla.org/en-US/docs/Web/API/Element/tagName.
https://html.spec.whatwg.org/multipage/dom.html#htmlelement

HTML elements have an uppercase local name.

So I think checking against 'UL' and 'OL' is the correct and standard way to identify list elements in sanitized HTML

@radinamatic
Copy link
Member

radinamatic commented Dec 5, 2025

Looking forward to checking this out on unstable! 🎉

Copy link
Member

@nucleogenesis nucleogenesis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@radinamatic could you give this image a look and share any thoughts you have? Do you think this is worth getting onto unstable? I'm not using Microsoft Word but I am seeing inconsistencies between filetypes and where I open and copy the contents so I'm not sure we could ever really enumerate all of the various formats we might run into.

I opened the various testing files @AllanOXDi shared in the reviewer guidance and made examples of them below in the question & answer boxes of an exercise for demonstration purposes. I put which filetype it was and where I opened it (and then copied it from) as the first line in each box -- then below that is the pasted text from that filetype/program combination.

I did do the sort of "happy paths" where files were opened in places they're most compatible but also mixed and matched things like opened in a LibreOffice file in Google Docs and DOCX in LibreOffice and so forth -- although I'm not sure the acceptance criteria.

Image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants