item['fields'] in json format #22

@minottic

If I understand the logic correctly, whatever is in item['fields'] is converted to a string, cleaned, and split into its component words, which are returned as a list.

def preprocessItemText(item):
    """
    extract the meaningful fields from the item (which is passed in as a pandas dataframe row)
    Convert them in a string, using json.dumps
    and run all the preprocess steps as highlighted in the PaNOSC search scoring report
    """
    # check if input item is a string
    # if it is not, we assume that it is a panda dataframe row
    outstring = item if isinstance(item,str) else json.dumps(item['fields'])
    outstring = outstring.lower()
    outstring = removePunctuation(outstring,punctuation_symbols)
    outstring = removeStopWords(outstring)
    outstring = removeApostrophy(outstring)
    outstring = removeUnneededSpaces(outstring)
    outstring = convertSentence2Numbers(outstring)
    outstring = removeStopWords(outstring)
    outstring = stemmatize(outstring,stemmer)
    outstring = removePunctuation(outstring,punctuation_symbols)
    outstring = removeUnneededSpaces(outstring)
    outstring = removeShortWords(outstring)
    return outstring.split(' ')

If this is correct (I'm not sure I've understood it fully), I don't see the value of allowing item['fields'] to be a dictionary rather than simply restricting it to a list.
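
To make the question concrete, here is a minimal sketch of what json.dumps does to each shape. It does not use the project's helpers: `tokenize_fields` is a hypothetical, simplified stand-in for preprocessItemText (lowercase, strip punctuation, split), and the field names and values are invented for illustration. Under those assumptions, the only visible difference is that the dictionary keys end up in the word list as well.

```python
import json
import re


def tokenize_fields(fields):
    """Hypothetical, simplified stand-in for preprocessItemText.

    Serializes non-string input with json.dumps, lowercases it,
    replaces punctuation with spaces, and splits on whitespace.
    Stop-word removal, stemming, etc. are deliberately left out.
    """
    text = fields if isinstance(fields, str) else json.dumps(fields)
    text = text.lower()
    # Everything that is not a letter, digit, or whitespace becomes a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


# Invented example data, not taken from the project
as_dict = {"title": "Neutron Diffraction", "technique": "Powder"}
as_list = ["Neutron Diffraction", "Powder"]

print(tokenize_fields(as_dict))
# ['title', 'neutron', 'diffraction', 'technique', 'powder']
print(tokenize_fields(as_list))
# ['neutron', 'diffraction', 'powder']
```

Whether the extra key tokens are wanted in the real pipeline depends on how the later steps (stop words, stemming, short-word removal) treat them, which this sketch does not model.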
