spaCy Not Recognizing Biological Terms with Special Symbols #13890
Unanswered
valerivankov asked this question in Help: Coding & Implementations
Hi everyone,
I am using spaCy to process texts containing human medical and biological data. However, I have noticed that it splits off symbols and punctuation marks that are part of a biomedical term as separate tokens instead of keeping the term whole. For instance, 'poly(A)' is split into 'poly(A' and ')', 'H(2)O(2)' is split into 'H(2)O(2' and ')', and in 'pre mRNA (U)GCAUG consensus sequences' the '(U)GCAUG' becomes 'U)GCAUG'.
Here are additional examples of this issue:
Interestingly, some of these terms involve parentheses, while others involve '-' or '+'.
I understand that spaCy treats parentheses at the start and end of a word as prefixes and suffixes, respectively, which makes sense when the parenthesis is not part of the word itself, as in ‘111)’ or ‘[Photosynthesis’. How can we customize spaCy's tokenizer so that it keeps words containing symbols and punctuation marks, such as the ones above, as “whole words”?
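For reference, here is a minimal sketch of one possible direction, not a confirmed solution: spaCy's tokenizer checks a `token_match` callable before it strips prefixes and suffixes, so a regular expression that recognizes these terms could protect them from being split. The `BIO_TERM` pattern and the `token_match` wrapper below are hypothetical names made up for illustration, and the pattern only covers balanced parentheses attached to letters or digits. Terms with '-' or '+' would need the pattern extended.

```python
import re

import spacy

# Hypothetical pattern (an assumption, not part of spaCy): runs of letters and
# digits combined with balanced "(...)" groups. At least one run must sit
# outside the parentheses, so a plain parenthetical like "(RNA)" is still split.
ALNUM = r"[A-Za-z0-9]+"
GROUP = rf"\({ALNUM}\)"
BIO_TERM = re.compile(
    rf"(?:{ALNUM}(?:{GROUP}(?:{ALNUM})?)+"             # poly(A), H(2)O(2)
    rf"|(?:{GROUP})+{ALNUM}(?:{GROUP}(?:{ALNUM})?)*)"  # (U)GCAUG
)

# A blank English pipeline has only the tokenizer, so no model download is needed.
nlp = spacy.blank("en")

# Keep whatever token_match the language defaults already define (may be None).
default_match = nlp.tokenizer.token_match


def token_match(text):
    """Return a truthy value when `text` should be kept as a single token."""
    if BIO_TERM.fullmatch(text):
        return True
    return default_match(text) if default_match else None


nlp.tokenizer.token_match = token_match

doc = nlp("poly(A) tails and H(2)O(2) affect (U)GCAUG motifs (see 111)")
tokens = [t.text for t in doc]
print(tokens)
```

Because the tokenizer tries `token_match` again after each prefix or suffix it strips, standalone cases such as ‘(see’ or ‘111)’ should still lose their parentheses, while the protected terms stay intact. Whether this pattern is too broad or too narrow for a real corpus is something that would need checking against actual data.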
Thanks!
cc: @k-blenman