item['fields'] in json format

If I understand the logic correctly, whatever is in `item['fields']` is converted to a string, cleaned, and the resulting words are returned as a list:
https://github.com/panosc-eu/panosc-search-scoring/blob/5d35342e5148676ffdfb43765ec19aaadae82a4f/app/ml/preprocessItemsText.py#L96-L119

If this is correct (I am not sure I have understood it fully), I don't see the value of allowing `item['fields']` to be a dictionary rather than simply restricting it to a list.



	def preprocessItemText(item):
	    """
	    extract the meaningful fields from the item (which is passed in as a pandas dataframe row)
	    Convert them in a string, using json.dumps
	    and run all the preprocess steps as highlighted in the PaNOSC search scoring report
	    """

	    # check if input item is a string
	    # if it is not, we assume that it is a panda dataframe row
	    outstring = item if isinstance(item,str) else json.dumps(item['fields'])

	    outstring = outstring.lower()
	    outstring = removePunctuation(outstring,punctuation_symbols)
	    outstring = removeStopWords(outstring)
	    outstring = removeApostrophy(outstring)
	    outstring = removeUnneededSpaces(outstring)
	    outstring = convertSentence2Numbers(outstring)
	    outstring = removeStopWords(outstring)
	    outstring = stemmatize(outstring,stemmer)
	    outstring = removePunctuation(outstring,punctuation_symbols)
	    outstring = removeUnneededSpaces(outstring)
	    outstring = removeShortWords(outstring)

	    return outstring.split(' ')
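
To make the question concrete, here is a minimal sketch of what I believe happens to a dictionary versus a list. `simple_tokens` is a hypothetical stand-in I wrote for illustration, not the repository's function: it only serializes with `json.dumps`, lowercases, strips punctuation and splits, and skips the stop-word, stemming, number-conversion and short-word steps. The field names and values are made up.

```python
import json
import re


def simple_tokens(fields):
    """Hypothetical, simplified stand-in for preprocessItemText.

    Serializes non-string input with json.dumps, lowercases it,
    replaces every character that is not a letter, digit or whitespace
    with a space, and splits on whitespace.
    """
    text = fields if isinstance(fields, str) else json.dumps(fields)
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return text.split()


# Made-up example values, once as a dictionary and once as a list.
as_dict = {"title": "Neutron diffraction", "technique": "powder"}
as_list = ["Neutron diffraction", "powder"]

print(simple_tokens(as_dict))
# ['title', 'neutron', 'diffraction', 'technique', 'powder']
print(simple_tokens(as_list))
# ['neutron', 'diffraction', 'powder']
```

Under these assumptions, the only difference I can see is that with a dictionary the keys are serialized too, so they end up in the word list next to the values. The structure itself is flattened either way. The real pipeline may drop some of those key words again in its later steps.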
