Long Document Summarization

Problem Statement

Develop a customized LLM that can generate a summary of a given research document.

Proposed Solution

  • Preprocess the article's context
  • Generate an extractive summary using Sentence Transformers
  • Generate a prompt dataset
  • Fine-tune the Falcon-1B model
  • Evaluate the model output
  • Serve the fine-tuned LLM using vLLM
  • Containerize the inference pipeline and deploy the API endpoint with a Streamlit-integrated UI

arXiv dataset for summarization

Dataset for summarization of long documents.
Hugging Face Link - ccdv/arxiv-summarization

Data Fields

  • id: paper id
  • article: a string containing the body of the paper
  • abstract: a string containing the abstract of the paper

Part 1 - Preprocessing Steps

  • Tokenization: The input document is tokenized into sentences using NLTK's sentence tokenizer. This step divides the document into manageable segments.

  • Generate Contextual Embeddings: Sentence Transformers are used to create a contextual embedding for each sentence. paraphrase-MiniLM-L3-v2 is used for its speed and efficiency, as indicated in the Sentence Transformers documentation.

  • Sentence Clustering: Sentences are clustered using K-means, which groups similar sentences together and helps identify the important information in the article.

  • Representative Sentence Selection: The sentence closest to the centroid of each cluster is chosen as the representative sentence for that group. This ensures that the most informative sentence in each topic cluster is included in the summary.

  • Create Extractive Summary: The extractive summary is generated by combining these representative sentences.

  • Generating the Prompt for Fine-tuning: The extractive summary obtained in the previous step is combined with the document's abstract. This combined text serves as the prompt for the language model. A minimal sketch of the whole pipeline follows this list.
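
A minimal sketch of this preprocessing pipeline, assuming nltk, sentence-transformers, and scikit-learn; the cluster count and the commented prompt template are illustrative assumptions, not taken from the repository.

import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

nltk.download("punkt", quiet=True)

def extractive_summary(article: str, n_clusters: int = 10) -> str:
    sentences = sent_tokenize(article)                        # 1. split into sentences
    model = SentenceTransformer("paraphrase-MiniLM-L3-v2")    # 2. contextual embeddings
    embeddings = model.encode(sentences)
    k = min(n_clusters, len(sentences))
    kmeans = KMeans(n_clusters=k, n_init=10).fit(embeddings)  # 3. cluster the sentences
    picked = []
    for c in range(k):                                        # 4. sentence nearest each centroid
        idx = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[idx] - kmeans.cluster_centers_[c], axis=1)
        picked.append(idx[np.argmin(dists)])
    return " ".join(sentences[i] for i in sorted(picked))     # 5. combine in document order

# 6. assumed prompt template pairing the extractive summary with the abstract:
# prompt = f"Summarize:\n{extractive_summary(article)}\nSummary:\n{abstract}"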

Part 2 - Fine-tuning the Model

  • Download the Falcon-1B model. Paper: https://arxiv.org/abs/2306.01116

  • Using bitsandbytes, load the model in 4-bit with the NF4 quantization technique. Paper: https://arxiv.org/pdf/2305.14314.pdf

  • Import PEFT and pass the LoRA config to fine-tune the Falcon model on the generated prompts. With QLoRA, the base model's parameters are frozen and only the low-rank adapter weights are trained, which reduces GPU memory consumption and shortens training time. A minimal setup sketch follows this list.

  • Fine-tune the model and save the adapter/checkpoint to the Hugging Face Hub.

  • Merge the LoRA adapter and push the model to the Hugging Face Hub.
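
A minimal setup sketch, assuming the transformers, bitsandbytes, and peft libraries; the checkpoint id (tiiuae/falcon-rw-1b) and the LoRA hyperparameters are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                   # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-rw-1b",                       # assumed 1B Falcon checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)   # freeze base weights for k-bit training
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative LoRA hyperparameters
    target_modules=["query_key_value"],          # Falcon's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)       # only adapter parameters remain trainable
model.print_trainable_parameters()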

Train/Loss Graph

(Figure: training loss curve)

Fine-tuning Notebook - Link

Model Card

Model Card Link - GouthamVignesh/falcon-arxiv-long-summary-1B (LoRA adapter & safetensors)

Model Card Link - GouthamVignesh/falcon-long-summary-ckpt (checkpoint bin file)

Model Description

Fine-tuned Falcon-1B model for the document summarization task

Part 3 - Evaluation

  • Load the dataset and initialize lists.
  • Initialize the Sentence Transformer model.
  • Calculate BLEU, ROUGE, and semantic similarity scores.
  • Iterate over and evaluate each example.
  • Calculate the average BLEU, ROUGE, and semantic similarity scores.
  • Combine the scores into a hybrid score (a minimal sketch follows this list).
  • Interpret the results.
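
A minimal scoring sketch, assuming the nltk, rouge-score, and sentence-transformers libraries; the equal weighting of the three metrics in the hybrid score is an assumption.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

st_model = SentenceTransformer("paraphrase-MiniLM-L3-v2")
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
smooth = SmoothingFunction().method1

def hybrid_score(reference: str, generated: str) -> float:
    bleu = sentence_bleu([reference.split()], generated.split(), smoothing_function=smooth)
    rouge_l = scorer.score(reference, generated)["rougeL"].fmeasure
    emb = st_model.encode([reference, generated], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()   # semantic similarity
    return (bleu + rouge_l + similarity) / 3.0         # assumed equal-weight average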

Part 4 - Inference Comparison

  • Inference comparison between the Hugging Face generation pipeline and vLLM inference (a vLLM usage sketch follows the table below)
  • Inference Google Colab Notebook Link
Metric                        Hugging Face Pipeline   vLLM Pipeline
Total Time Taken              10.04 seconds           15.21 seconds
Tokens Generated              324                     1024
Tokens Generated Per Second   32.26 tokens/second     67.34 tokens/second
GPU Memory Usage              4587 MB                 33140 MB
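
A usage sketch for the vLLM side of the comparison, assuming the merged model from the model card above can be loaded directly; the sampling parameters are illustrative.

from vllm import LLM, SamplingParams

llm = LLM(model="GouthamVignesh/falcon-arxiv-long-summary-1B", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=1024)
outputs = llm.generate(["Summarize the following paper: ..."], params)  # prompt is illustrative
print(outputs[0].outputs[0].text)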

(Figure: inference comparison chart)

Part 5 - Streamlit Model Deployment

  • Build and run the Docker image to run the Streamlit application
git clone https://github.com/GouthamVicky/LLM-LongDoc-Summary.git 
cd inference
sudo docker build -t falconapp .
docker run --gpus all -p 8000:8000 -p 8501:8501 falconapp
  • To run without Docker (a sample API request follows these commands)
git clone https://github.com/GouthamVicky/LLM-LongDoc-Summary.git 
cd inference
pip install -r requirements.txt
uvicorn fastapi_app:app --port 8000 & streamlit run streamlit_app.py
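
A usage sketch for the FastAPI service once it is running; the /summarize route and the JSON payload shape are assumptions, not taken from fastapi_app.py.

import requests

resp = requests.post(
    "http://localhost:8000/summarize",       # assumed route name
    json={"text": "full article text ..."},  # assumed payload shape
    timeout=300,
)
print(resp.json())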

Sample Demo

Screen.Recording.2023-11-08.at.5.25.13.PM.1.mp4

Future Improvements

  • Exclude the extractive-summary process from the preprocessing step and fine-tune the model with an extended context window, using rotary embeddings and Flash Attention to improve the handling of longer documents. Reference - https://arxiv.org/pdf/2306.15595.pdf
  • LongLoRA can be used to extend the context length during fine-tuning while maintaining high performance and low complexity. The core idea behind LongLoRA is Shift Short Attention (S2-Attn), which splits the content into smaller groups; S2-Attn shifts tokens by half of the group size, ensuring smooth information exchange between adjacent groups (a toy sketch of the shift follows this list). Reference - https://github.com/dvlab-research/LongLoRA
  • Fine-tune a model with AWQ support to enhance inference speed on resource-constrained edge platforms. Reference - https://github.com/mit-han-lab/llm-awq#awq-model-zoo
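
A toy sketch of the S2-Attn token shift described above, assuming PyTorch; the tensor layout is illustrative, and this shows only the shift step, not a full attention implementation.

import torch

def s2_shift(x: torch.Tensor, group_size: int) -> torch.Tensor:
    # x: (batch, num_heads, seq_len, head_dim); shift half of the heads by half
    # the group size so adjacent token groups can exchange information
    shifted = x.clone()
    half_heads = x.size(1) // 2
    shifted[:, half_heads:] = torch.roll(x[:, half_heads:], shifts=-(group_size // 2), dims=2)
    return shifted  # attention is then computed within groups of `group_size` tokens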

About

PEFT & LoRA fine-tuning of a Falcon LLM that can generate a summary of a given research document.
