
TitleQuill: Unified Framework for Titles and Keywords Generation using Pre-Trained Model



Description

This repository contains the implementation of TitleQuill, a novel approach that reinterprets keyword extraction and title generation as two forms of summarization. The project fine-tunes the Flan-T5 model with two distinct strategies: simultaneous training on both tasks, and separate per-task training with combined losses. Following the T5 idea, both tasks are framed as text-to-text transformations, so a single model can serve both. The repository includes scripts for data preparation, model training, and evaluation, along with pre-trained model checkpoints and instructions for reproducing the experiments.
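As a rough sketch of this text-to-text framing: each abstract is turned into two prompts, one per task, so the same seq2seq model can answer both. The task prefixes below are illustrative assumptions, not the exact templates used in the repository.

```python
# Minimal sketch of the T5-style text-to-text framing.
# NOTE: the "generate title:" / "extract keywords:" prefixes are
# hypothetical; the real prompt templates live in the repo's configs.

def build_inputs(abstract: str) -> dict[str, str]:
    """Frame title generation and keyword extraction as seq2seq inputs."""
    return {
        "title": f"generate title: {abstract}",
        "keywords": f"extract keywords: {abstract}",
    }

inputs = build_inputs("We propose a model for computing word vectors.")
print(inputs["title"])    # one input string per task, same model
```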

Some examples

| Original Title | Generated Title | Generated Keywords |
| --- | --- | --- |
| Efficient Estimation of Word Representations in Vector Space | A novel model for computing continuous vector representations of words from large data sets | neural networks |
| Adam: A Method for Stochastic Optimization | An algorithm for first-order gradient-based optimization of stochastic objective functions | convex optimization, gradient-based optimization, optimal convergence, first-order optimization |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | A new language representation model for deep bidirectional representations from transformers | bidirectional representation, transformers, language representation, language representation model |
| Improving Language Understanding by Generative Pre-Training | Generative pre-training of a language model for natural language understanding | task-agnostic model, natural language, fine-tuning, task-aware input transformations |

Deployment

Only CPU inference is supported in the Docker container.

1. Build the Docker image

   ```shell
   docker build -t titlequill .
   ```

2. Run the Docker container

   ```shell
   docker run -p 8501:8501 titlequill
   ```

3. Open a browser and go to http://localhost:8501

Development

Installation

```shell
# [OPTIONAL] Create conda environment
conda create -n myenv python=3.11
conda activate myenv

# Install requirements
# CPU
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
# CUDA
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
```

Download data

The following command downloads the dataset used in the project. The script also post-processes the files, changing their extension from .txt to .jsonl and adjusting their names accordingly.

```shell
python src/datamodule/download.py
```

Command Line Arguments for download.py

- `--data_dir`: Directory to save the dataset
- `--url`: URL to download the dataset
- `--old_ext_postproc`: Old extension of the files to post-process
- `--new_ext_postproc`: New extension of the files to post-process
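The post-processing can be pictured as a recursive extension rename. The snippet below is a standard-library approximation of what the `--old_ext_postproc`/`--new_ext_postproc` options describe, not the actual code of download.py:

```python
from pathlib import Path

def postprocess_extensions(data_dir: str, old_ext: str = ".txt",
                           new_ext: str = ".jsonl") -> list[Path]:
    """Rename every *old_ext file under data_dir to use new_ext."""
    renamed = []
    for path in sorted(Path(data_dir).rglob(f"*{old_ext}")):
        target = path.with_suffix(new_ext)  # swap only the extension
        path.rename(target)
        renamed.append(target)
    return renamed
```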

Data statistics

The statistics about the dataset can be obtained using the following command:

```shell
python src/datamodule/stats.py
```

Command Line Arguments for stats.py

- `--data_dir`: Directory of the dataset
- `--out_dir`: Directory to save the plots

How to run

All the scripts can be configured by modifying the configuration files in the configs/ directory, written in YAML format. Script parameters can also be overridden from the command line.

The configuration files are validated using Hydra, a powerful configuration management tool. For more information about Hydra, please refer to the official documentation.
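Conceptually, a Hydra override like `model.lr=3e-5` walks a dotted path into the nested YAML config and replaces the leaf value. The sketch below mimics that merge with the standard library only; the keys shown are hypothetical, not TitleQuill's actual config schema, and Hydra itself also converts value types, which this sketch skips:

```python
def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply Hydra-style 'section.key=value' overrides to a nested dict."""
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        keys = dotted_key.split(".")
        node = config
        for key in keys[:-1]:               # walk/create intermediate nodes
            node = node.setdefault(key, {})
        node[keys[-1]] = value              # values stay strings here
    return config

cfg = {"model": {"lr": "1e-4"}, "trainer": {"epochs": "5"}}
apply_overrides(cfg, ["model.lr=3e-5", "trainer.epochs=10"])
```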

Training TitleQuill

```shell
python src/run_titlequill.py
```

Other scripts

- Baseline

  ```shell
  python src/run_baseline.py
  ```

- Qwen2

  ```shell
  python src/run_qwen2.py
  ```

- TextRank

  ```shell
  python src/run_textrank.py
  ```
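As background on the TextRank baseline: it scores words with a PageRank-style iteration over a word co-occurrence graph and keeps the top-ranked ones as keywords. The following is a generic standard-library sketch of that idea, not the repository's run_textrank.py:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=30, top_k=5):
    """Rank words by PageRank over a co-occurrence graph (TextRank idea)."""
    graph = defaultdict(set)
    for i in range(len(words)):                     # link words that co-occur
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):                          # PageRank power iteration
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(graph[n]) for n in graph[w]
            )
            for w in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```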

Demo

The project includes a Streamlit GUI for the TitleQuill model. To run the GUI, execute the following command:

```shell
# Activate your environment
streamlit run src/app.py
```

Model weights

The model weights can be obtained from this link. Please place the entire folder containing the model and tokenizer in the output/ directory.

References

@misc{flan_t5,
      title={Scaling Instruction-Finetuned Language Models}, 
      author={Hyung Won Chung and Le Hou and Shayne Longpre and Barret Zoph and Yi Tay and William Fedus and Yunxuan Li and Xuezhi Wang and Mostafa Dehghani and Siddhartha Brahma and Albert Webson and Shixiang Shane Gu and Zhuyun Dai and Mirac Suzgun and Xinyun Chen and Aakanksha Chowdhery and Alex Castro-Ros and Marie Pellat and Kevin Robinson and Dasha Valter and Sharan Narang and Gaurav Mishra and Adams Yu and Vincent Zhao and Yanping Huang and Andrew Dai and Hongkun Yu and Slav Petrov and Ed H. Chi and Jeff Dean and Jacob Devlin and Adam Roberts and Denny Zhou and Quoc V. Le and Jason Wei},
      year={2022},
      eprint={2210.11416},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2210.11416}, 
}

@inproceedings{text_rank,
    title = "{T}ext{R}ank: Bringing Order into Text",
    author = "Mihalcea, Rada  and
      Tarau, Paul",
    editor = "Lin, Dekang  and
      Wu, Dekai",
    booktitle = "Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing",
    month = jul,
    year = "2004",
    address = "Barcelona, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W04-3252",
    pages = "404--411",
}


@misc{qwen2,
      title={Qwen Technical Report}, 
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      year={2023},
      eprint={2309.16609},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2309.16609}, 
}
