If I understand the logic correctly, whatever is in the 'fields' entry of 'item' is converted to a string with json.dumps, cleaned, and split into its component words, which are returned as a list.
panosc-search-scoring/app/ml/preprocessItemsText.py
Lines 96 to 119 in 5d35342

    def preprocessItemText(item):
        """
        extract the meaningful fields from the item (which is passed in as a pandas dataframe row)
        Convert them in a string, using json.dumps
        and run all the preprocess steps as highlighted in the PaNOSC search scoring report
        """
        # check if input item is a string
        # if it is not, we assume that it is a panda dataframe row
        outstring = item if isinstance(item,str) else json.dumps(item['fields'])
        outstring = outstring.lower()
        outstring = removePunctuation(outstring,punctuation_symbols)
        outstring = removeStopWords(outstring)
        outstring = removeApostrophy(outstring)
        outstring = removeUnneededSpaces(outstring)
        outstring = convertSentence2Numbers(outstring)
        outstring = removeStopWords(outstring)
        outstring = stemmatize(outstring,stemmer)
        outstring = removePunctuation(outstring,punctuation_symbols)
        outstring = removeUnneededSpaces(outstring)
        outstring = removeShortWords(outstring)
        return outstring.split(' ')

If this is correct (I am not sure I have understood it properly), I don't see the value of allowing item['fields'] to be a dictionary rather than simply restricting it to a list.
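
To make the question concrete, here is a minimal sketch of what I think happens to a dictionary versus a list. It does not use the repository's helpers: `simple_tokens` is a hypothetical stand-in that only serializes with `json.dumps`, lowercases, strips punctuation and splits, and the sample field names and values are made up. Under those assumptions, the only difference a dictionary seems to make is that its keys end up among the returned words.

```python
import json
import string


def simple_tokens(fields):
    """Hypothetical, simplified stand-in for preprocessItemText (no stop words, no stemming)."""
    # Same first step as the original: strings pass through, anything else is serialized
    text = fields if isinstance(fields, str) else json.dumps(fields)
    text = text.lower()
    # Replace every ASCII punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Split on any run of whitespace
    return text.split()


# Made-up sample data, for illustration only
as_dict = {"title": "Neutron diffraction", "technique": "powder"}
as_list = ["Neutron diffraction", "powder"]

print(simple_tokens(as_dict))  # keys and values both appear as words
print(simple_tokens(as_list))  # only the values appear
```

If the keys are meant to contribute to the scoring terms, then the dictionary form does carry extra information; if not, a list would give the same words without them.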