
Research Paper Summarization using Contribution Sentences

Dataset

NLPContributionGraph was introduced as Task 11 at SemEval 2021. The task is defined on a dataset of Natural Language Processing (NLP) scholarly articles whose contributions are structured to be integrable within knowledge graph infrastructures such as the Open Research Knowledge Graph.

Prerequisite

Clone the repository and install requirements.txt in a Python >= 3.7.0 environment:

git clone https://github.com/GouthamVicky/ResearchPaperSummarization  # clone the repository
cd ResearchPaperSummarization/prerequisite
pip install -r requirements.txt  # install dependencies

Refer to the official Docker documentation for installation steps.

To start the Kafka producer, consumer, and Grobid containers:

cd prerequisite
sudo docker-compose up -d

Proposed Solution

In the training dataset, for every research paper the raw text was extracted from the PDF using Grobid and passed to Stanza, which produces formatted text as a plain-text file; the contribution sentences from each paper were annotated and stored in a separate text file.

Our task is to build a model that classifies the contribution sentences in a paper and generates a summary from those sentences.

Required Dataset

  • [articlename].pdf # scholarly article PDF

  • [articlename]-Grobid-out.txt # plain-text output from the Grobid parser

  • [articlename]-Stanza-out.txt # plain-text preprocessed output from Stanza

  • sentences.txt # annotated contribution sentences for the article

Part 1 - Model Training and Evaluation

  • Model Training Notebook - contains the training and evaluation code for building the model on the NLPContributionGraph dataset
  • Model Evaluation - used for evaluating the summary generated by combining the classified contribution sentences
  • Research paper used for evaluating the trained model

Model Card

Performs sentence classification to determine whether a given sentence from a research paper is a contribution sentence or not.

Model Description

  • Model type: text-classification
  • Language(s) (NLP): EN
  • Finetuned from model: allenai/scibert_scivocab_uncased

Use the code below to get started with the model.

from transformers import pipeline, BertTokenizer, BertForSequenceClassification

# Load the fine-tuned SciBERT classifier and its tokenizer from the Hugging Face Hub
model = BertForSequenceClassification.from_pretrained('Goutham-Vignesh/ContributionSentClassification-scibert')
tokenizer = BertTokenizer.from_pretrained('Goutham-Vignesh/ContributionSentClassification-scibert')

# Wrap the model and tokenizer in a text-classification pipeline
text_classification = pipeline('text-classification', model=model, tokenizer=tokenizer)
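
A minimal usage sketch (the example sentence is illustrative, and the exact label names returned depend on the model's config):

# Classify a candidate sentence from a paper
result = text_classification('We propose a novel attention-based architecture for review score prediction.')
print(result)  # e.g. [{'label': 'LABEL_1', 'score': 0.97}] -- label mapping depends on the model config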

Part 2 - Message Broker-Based System Using Kafka

Producer Pipeline

  • The producer publishes a message (the PDF text and the abstract of the paper) on the topic ContributeSentences to the consumer
  • The pdfFiles directory contains the PDF files to be passed to the consumer
  • The Grobid parser is used to extract the abstract and paragraph text from each PDF, using BeautifulSoup on the Grobid XML output (a minimal sketch follows this list)
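
A minimal producer sketch, assuming the kafka-python client and a broker on localhost:9092; extract_with_grobid is a hypothetical helper standing in for the Grobid + BeautifulSoup parsing step:

import json
from kafka import KafkaProducer

# Serialize messages as JSON so the consumer can decode them symmetrically
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

abstract, body_text = extract_with_grobid('pdfFiles/P19-1106.pdf')  # hypothetical helper
message = {'filename': 'P19-1106.pdf', 'Abstract': abstract, 'text': body_text}

producer.send('ContributeSentences', message)  # publish on the ContributeSentences topic
producer.flush()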

Consumer Pipeline

  • The consumer, which hosts the ML model, extracts the contribution statements and generates a summary, along with various ROUGE scores computed against the abstract of the paper (a minimal sketch follows)
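
A minimal consumer sketch, assuming the kafka-python client and the rouge-score package; summarize_contributions is a hypothetical helper that joins the sentences the classifier labels as contribution sentences:

import json
from kafka import KafkaConsumer
from rouge_score import rouge_scorer

consumer = KafkaConsumer(
    'ContributeSentences',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer=True)

for record in consumer:
    paper = record.value
    summary = summarize_contributions(paper['text'])  # hypothetical helper around the classifier
    scores = scorer.score(paper['Abstract'], summary)  # compare generated summary against the abstract
    print(paper['filename'], scores)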

Usage

  • Place the PDF files inside the pdfFiles directory

  • Open a terminal and run the following commands to initialize the producer/publisher

cd MessageBrokerSystem/producer/
python producerkafka.py
  • Open a new terminal and run the following commands to start the consumer pipeline, which hosts the ML model
cd MessageBrokerSystem/consumer/
python consumerkafka.py
  • Sample output of the consumer
   {
     'filename': 'P19-1106.pdf', 
     'Abstract': 'Automatically validating a research artefact is one of the frontiers in Artificial Intelligence (AI) that directly brings it close to competing with human intellect and intuition. Although criticized sometimes, the existing peer review system still stands as the benchmark of research validation. The present-day peer review process is not straightforward and demands profound domain knowledge, expertise, and intelligence of human reviewer(s), which is somewhat elusive with the current state of AI. However, the peer review texts, which contains rich sentiment information of the reviewer, reflecting his/her overall attitude towards the research in the paper, could be a valuable entity to predict the acceptance or rejection of the manuscript under consideration. Here in this work, we investigate the role of reviewers sentiments embedded within peer review texts to predict the peer review outcome. Our proposed deep neural architecture takes into account three channels of information: the paper, the corresponding reviews, and the review polarity to predict the overall recommendation score as well as the final decision. We achieve significant performance improvement over the baselines (∼ 29% error reduction) proposed in a recently released dataset of peer reviews. An AI of this kind could assist the editors/program chairs as an additional layer of confidence in the final decision making, especially when non-responding/missing reviewers are frequent in present day peer review.', 
     'generatedSummary': 'Automatically validating a research artefact is one of the frontiers in Artificial Intelligence ( AI ) that directly brings it close to competing with human intellect and intuition .Automatically validating a research artefact is one of the frontiers in Artificial Intelligence ( AI ) that directly brings it close to competing with human intellect and intuition .The evaluation shows that our proposed model successfully outperforms the earlier reported results in PeerRead .Finally , we fuse the extracted review sentiment feature and joint paper + review representation together to generate the overall recommendation score ( Decision - Level Fusion ) using the affine transformation asWe employ a grid search for hyperparameter optimization .For Task 1 , F is 256 , l is 5 . ReLU is the non-linear function g( ) , learning rate is 0.007 .We train the model with SGD optimizer , set momentum as 0.9 and batch size as 32 .We keep dropout at 0.5 .Again we train the model with Adam Optimizer , keep the batch size as 64 and use 0.7 as the dropout rate to prevent overfitting .With only using review + sentiment information , we are still able to outperform Kang et al. ( 2018 ) by a margin of 11 % in terms of RMSE .However , we also find that our approach with only Review + Sentiment performs inferior to the Paper + Review method in Kang et al. ( 2018 ) for ACL 2017 .Transformation based Models of Video SequencesAct are the output activations from the final layer of MLP Senti which are augmented to the decision layer for final recommendation score prediction .',
     'rouge_scores': {'rouge1': 0.4008438818565401, 'rouge2': 0.14830508474576273, 'rougeL': 0.24050632911392406, 'rougeLsum': 0.24050632911392406}
   }
  • Sample output of the message broker system (screencast demo): Screencast.from.2023-03-04.23-52-11.webm

The model was trained using Google Colab Pro, and the Kafka message system was implemented and tested on an NVIDIA GeForce RTX 2060 machine.

A GPU is recommended for faster inference in the Kafka message system.
