Develop a customized LLM that can generate a summary of a given research document.
- Preprocess the article's context
- Generate an extractive summary using Sentence Transformers
- Generate a prompt dataset
- Fine-tune the Falcon Model (1B)
- Evaluate the Model output
- Serve the fine-tuned LLM using vLLM
- Containerize the inference pipeline and deploy the API endpoint with an integrated Streamlit UI
Dataset for summarization of long documents.
Huggingface Link - ccdv/arxiv-summarization
- id: paper id
- article: a string containing the body of the paper
- abstract: a string containing the abstract of the paper
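A minimal sketch of loading the dataset with the `datasets` library; the `document` configuration name is an assumption based on the dataset card, and the field names are those listed above:

```python
from datasets import load_dataset

# Load the arXiv summarization dataset (the "document" config is an assumption;
# check the dataset card for the available configurations).
dataset = load_dataset("ccdv/arxiv-summarization", "document", split="train")

sample = dataset[0]
print(sample["id"])              # paper id
print(sample["article"][:300])   # body of the paper
print(sample["abstract"][:300])  # abstract of the paper
```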
- Tokenization: The input document is tokenized into sentences using NLTK's sentence tokenizer. This step divides the document into manageable segments.
- Generate Contextual Embeddings: Sentence Transformers are used to create contextual embeddings for each sentence. paraphrase-MiniLM-L3-v2 is used for its speed and efficiency, as indicated in the Sentence Transformers documentation.
- Sentence Clustering: Sentences are clustered using K-means, which groups similar sentences together and helps identify the important information in the article.
- Representative Sentence Selection: The sentence closest to the centroid of each cluster is chosen as the representative sentence for that group. This ensures that the most informative sentence in each topic cluster is included in the summary.
- Create Extractive Summary: The extractive summary is generated by combining these representative sentences.
- Generating the Prompt for Fine-tuning: The extractive summary obtained in the previous step is combined with the document's abstract. This combined text serves as the prompt for the language model (a minimal sketch of these preprocessing steps is shown below).
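A minimal sketch of the preprocessing steps above (NLTK sentence tokenization, Sentence Transformers embeddings, K-means clustering, centroid-nearest selection, prompt assembly). The number of clusters and the prompt template are illustrative assumptions, not necessarily the values used in the project:

```python
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

nltk.download("punkt", quiet=True)
encoder = SentenceTransformer("paraphrase-MiniLM-L3-v2")

def extractive_summary(article: str, num_clusters: int = 10) -> str:
    # 1. Tokenize the document into sentences.
    sentences = sent_tokenize(article)
    num_clusters = min(num_clusters, len(sentences))

    # 2. Contextual embeddings for every sentence.
    embeddings = encoder.encode(sentences)

    # 3. Cluster sentences into topical groups with K-means.
    kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(embeddings)

    # 4. Pick the sentence closest to each cluster centroid.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)

    # 5. Combine the representative sentences (kept here in document order).
    return " ".join(sentences[i] for i in sorted(set(closest)))

def build_prompt(article: str, abstract: str) -> str:
    # Illustrative prompt template: extractive summary as input, abstract as target.
    summary = extractive_summary(article)
    return f"Summarize the following text:\n{summary}\n\nSummary:\n{abstract}"
```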
- Download the Falcon (1B) model. Model source paper: https://arxiv.org/abs/2306.01116
- Using bitsandbytes, load the model in 4-bit with the NF4 quantization technique. Paper: https://arxiv.org/pdf/2305.14314.pdf
- Import PEFT and pass the LoRA config to fine-tune the Falcon model on the generated prompts. With QLoRA, the base model parameters are frozen and only a small set of adapter parameters is trained, which reduces GPU memory consumption and shortens training time (a minimal sketch follows this list).
- Fine-tune the model and save the adapter/checkpoint to the Hugging Face Hub
- Merge the LoRA adapter and push the merged model to the Hugging Face Hub.
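A minimal sketch of the 4-bit NF4 loading and LoRA setup described above, using transformers, bitsandbytes, and peft; the LoRA hyperparameters are illustrative, not necessarily the ones used for the published checkpoints:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/falcon-rw-1b"

# 4-bit NF4 quantization via bitsandbytes (QLoRA-style loading).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # may be unnecessary on recent transformers versions
)

# Freeze the quantized base weights and attach trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],  # Falcon attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapter weights are trainable
```

After training, the adapter can be uploaded with `model.push_to_hub(...)`, and `merge_and_unload()` folds it back into the base weights before pushing the merged model.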
Finetuning Notebook - Link
Model Card Link - GouthamVignesh/falcon-arxiv-long-summary-1B (Lora adapter & safetensor)
Model Card Link - GouthamVignesh/falcon-long-summary-ckpt (checkpoint bin file)
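A minimal sketch of loading the published adapter for inference with peft (repository names taken from the model cards above; the prompt format is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "tiiuae/falcon-rw-1b"
adapter_id = "GouthamVignesh/falcon-arxiv-long-summary-1B"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach the fine-tuned LoRA adapter and optionally merge it into the base weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()

prompt = "Summarize the following text:\n<extractive summary here>\n\nSummary:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```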
Fine-tuned Falcon-1B model for the document summarization task
- Model type: CausalLM
- Language(s) (NLP): EN
- Finetuned from model: tiiuae/falcon-rw-1b
- Load Dataset and Initialize Lists.
- Initialize Sentence Transformer Model.
- Calculate BLEU, ROUGE, and Semantic Similarity Scores (see the sketch after this list).
- Iterate and Evaluate Each Example.
- Calculate Average BLEU, ROUGE, and Semantic Similarity Scores.
- Combine Scores into a Hybrid Score.
- Interpret the Results.
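A minimal sketch of the evaluation loop above, assuming the Hugging Face `evaluate` library for BLEU/ROUGE and Sentence Transformers cosine similarity for the semantic score; the equal-weight hybrid combination is an illustrative assumption:

```python
import evaluate
from sentence_transformers import SentenceTransformer, util

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
embedder = SentenceTransformer("paraphrase-MiniLM-L3-v2")

def score_example(prediction: str, reference: str) -> dict:
    # Per-example BLEU, ROUGE-L, and semantic similarity between generated and reference summaries.
    bleu_score = bleu.compute(predictions=[prediction], references=[[reference]])["bleu"]
    rouge_score = rouge.compute(predictions=[prediction], references=[reference])["rougeL"]
    emb = embedder.encode([prediction, reference], convert_to_tensor=True)
    semantic = util.cos_sim(emb[0], emb[1]).item()
    return {"bleu": bleu_score, "rougeL": rouge_score, "semantic": semantic}

def hybrid_score(per_example_scores: list) -> float:
    # Average each metric over all examples, then combine with equal weights (illustrative).
    avg = {k: sum(s[k] for s in per_example_scores) / len(per_example_scores)
           for k in per_example_scores[0]}
    return (avg["bleu"] + avg["rougeL"] + avg["semantic"]) / 3
```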
- Inference comparison between the Hugging Face generation pipeline and vLLM inference (see the sketch after the table below)
- Inference Google Colab Notebook Link
| Metric | Hugging Face Pipeline | vLLM Pipeline |
|---|---|---|
| Total Time Taken | 10.04 seconds | 15.21 seconds |
| Tokens Generated | 324 | 1024 |
| Tokens Generated Per Second | 32.26 tokens/second | 67.34 tokens/second |
| GPU Memory Usage (MB) | 4587 | 33140 |
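A minimal sketch of the two inference paths compared above; the merged-model repository id, prompt, and sampling parameters are assumptions:

```python
from transformers import pipeline
from vllm import LLM, SamplingParams

model_id = "GouthamVignesh/falcon-arxiv-long-summary-1B"  # assumed merged-model repo
prompt = "Summarize the following text:\n<extractive summary here>\n\nSummary:\n"

# Hugging Face generation pipeline
hf_pipe = pipeline("text-generation", model=model_id, device_map="auto")
hf_text = hf_pipe(prompt, max_new_tokens=324)[0]["generated_text"]

# vLLM inference
llm = LLM(model=model_id, dtype="bfloat16", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=1024)
vllm_text = llm.generate([prompt], params)[0].outputs[0].text
```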
- Build and run the Docker image to run the Streamlit application:
git clone https://github.com/GouthamVicky/LLM-LongDoc-Summary.git
cd inference
sudo docker build -t falconapp .
docker run --gpus all -p 8000:8000 -p 8501:8501 falconapp
- To run without Docker:
git clone https://github.com/GouthamVicky/LLM-LongDoc-Summary.git
cd inference
pip install -r requirements.txt
uvicorn fastapi_app:app --port 8000 & streamlit run streamlit_app.py
- Navigate to http://localhost:8501/
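An illustrative sketch of what the FastAPI endpoint behind the Streamlit UI might look like; the `/summarize` route, request fields, and model id are assumptions, and the actual `fastapi_app.py` in the repository may differ:

```python
# fastapi_app.py (illustrative sketch, not the repository's actual implementation)
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="GouthamVignesh/falcon-arxiv-long-summary-1B")  # assumed model id

class SummaryRequest(BaseModel):
    text: str

@app.post("/summarize")
def summarize(req: SummaryRequest):
    prompt = f"Summarize the following text:\n{req.text}\n\nSummary:\n"
    params = SamplingParams(temperature=0.7, max_tokens=512)
    summary = llm.generate([prompt], params)[0].outputs[0].text
    return {"summary": summary}
```

The Streamlit app would then POST the document text to http://localhost:8000/summarize and display the returned summary.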
Demo video: Screen.Recording.2023-11-08.at.5.25.13.PM.1.mp4
- Exclude the extractive summarization step from preprocessing and fine-tune the model with an extended context window using rotary embeddings and Flash Attention to improve the handling of longer documents. Reference - https://arxiv.org/pdf/2306.15595.pdf
- LongLoRA can be used to extend the context length during fine-tuning while maintaining high performance and low complexity. The core idea behind LongLoRA is Shift Short Attention (S2-Attn), which splits the context into smaller groups. S2-Attn shifts tokens by half of the group size, ensuring smooth information exchange between adjacent groups. Reference - https://github.com/dvlab-research/LongLoRA
- Fine-tune a model that has AWQ support to enhance inference speed on resource-constrained edge platforms. Reference - https://github.com/mit-han-lab/llm-awq#awq-model-zoo

