-
Notifications
You must be signed in to change notification settings - Fork 256
Improve support for copy-paste from Microsoft Word #5595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
nucleogenesis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll spin it up tomorrow but overall the code looks great overall. Left a suggestion and a couple of questions in the meantime.
| } | ||
| } catch (err) { | ||
| editor.value.chain().focus().insertContent(clipboardAccessFailed$()).run(); | ||
| return handlePasteNoFormat(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this call here necessary if the one 2 lines down is outside of the if block altogether (or vice-versa?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! The inner return handlePasteNoFormat() was redundant, as the outer fallback already covers both cases. Thanks
| return `$$${latex || ''}$$`; | ||
| }; | ||
|
|
||
| export function sanitizePastedHTML(html) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is specific to MS word then maybe putting that in the name is a good idea a la sanitizeMSWordHTML or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the sanitizer initially targeted MS Word, but we’ve expanded it to handle copy-paste issues from Google Docs and LibreOffice as well (e.g., strike-through bleed, nested list normalization). And since it now applies to all external HTML paste sources, renaming it to sanitizeMSWordHTML would be misleading. Keeping the more general name sanitizePastedHTML better reflects its broader use and sounds good to me.
| const items = list.querySelectorAll(':scope > li'); | ||
| items.forEach(item => { | ||
| const nestedLists = Array.from(item.children).filter( | ||
| child => child.tagName === 'UL' || child.tagName === 'OL', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these uppercase? Is that coming from the pasted text?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The uppercase comes from the DOM API itself. Browsers normalize element.tagName to uppercase for all HTML elements, regardless of how they appear in the pasted HTML. https://developer.mozilla.org/en-US/docs/Web/API/Element/tagName.
https://html.spec.whatwg.org/multipage/dom.html#htmlelement
HTML elements have an uppercase local name.
So I think checking against 'UL' and 'OL' is the correct and standard way to identify list elements in sanitized HTML
|
Looking forward to checking this out on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@radinamatic could you give this image a look and share any thoughts you have? Do you think this is worth getting onto unstable? I'm not using Microsoft Word but I am seeing inconsistencies between filetypes and where I open and copy the contents so I'm not sure we could ever really enumerate all of the various formats we might run into.
I opened the various testing files @AllanOXDi shared in the reviewer guidance and made examples of them below in the question & answer boxes of an exercise for demonstration purposes. I put which filetype it was and where I opened it (and then copied it from) as the first line in each box -- then below that is the pasted text from that filetype/program combination.
I did do the sort of "happy paths" where files were opened in places they're most compatible but also mixed and matched things like opened in a LibreOffice file in Google Docs and DOCX in LibreOffice and so forth -- although I'm not sure the acceptance criteria.
Summary
This PR improves how the Studio editor handles copy-paste from Microsoft Word so that pasted content more closely matches the formatting that Studio itself supports.
References
#4325
Reviewer guidance
Copy and paste from these sample docs MS Word 2019. LibreOffice Writer v25.8.1.1. Google Doc to exercise editor and verify if: