
TitleQuill: Unified Framework for Titles and Keywords Generation using Pre-Trained Model



Description

This repository contains the implementation of TitleQuill, a novel approach that reinterprets keyword extraction and title generation as two forms of summarization. The project fine-tunes the Flan-T5 model with two distinct strategies: simultaneous training on both tasks, and separate per-task training with combined losses. Following the T5 idea, both tasks are framed as text-to-text transformations, so a single model can serve both. The repository includes scripts for data preparation, model training, and evaluation, along with pre-trained model checkpoints and instructions for reproducing the experiments.
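As a rough sketch of this text-to-text framing: each abstract is turned into two prompts, one per task, so the same seq2seq model can answer both. The task prefixes below are illustrative assumptions, not the exact templates used in the repository.

```python
# Minimal sketch of the T5-style text-to-text framing.
# NOTE: the "generate title:" / "extract keywords:" prefixes are
# hypothetical; the real prompt templates live in the repo's configs.

def build_inputs(abstract: str) -> dict[str, str]:
    """Frame title generation and keyword extraction as seq2seq inputs."""
    return {
        "title": f"generate title: {abstract}",
        "keywords": f"extract keywords: {abstract}",
    }

inputs = build_inputs("We propose a model for computing word vectors.")
print(inputs["title"])    # one input string per task, same model
```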

Some examples

| Original Title | Generated Title | Generated Keywords |
| --- | --- | --- |
| Efficient Estimation of Word Representations in Vector Space | A novel model for computing continuous vector representations of words from large data sets | neural networks |
| Adam: A Method for Stochastic Optimization | An algorithm for first-order gradient-based optimization of stochastic objective functions | convex optimization, gradient-based optimization, optimal convergence, first-order optimization |
| BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | A new language representation model for deep bidirectional representations from transformers | bidirectional representation, transformers, language representation, language representation model |
| Improving Language Understanding by Generative Pre-Training | Generative pre-training of a language model for natural language understanding | task-agnostic model, natural language, fine-tuning, task-aware input transformations |

Deployment

Only CPU inference is supported in the Docker container.

1. Build the Docker image

   ```shell
   docker build -t titlequill .
   ```

2. Run the Docker container

   ```shell
   docker run -p 8501:8501 titlequill
   ```

3. Open a browser and go to http://localhost:8501

Development

Installation

```shell
# [OPTIONAL] Create conda environment
conda create -n myenv python=3.11
conda activate myenv

# Install requirements
# CPU
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu
# CUDA
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu118
```

Download data

The following command downloads the dataset used in the project. The script also post-processes the files, changing their extension from .txt to .jsonl and adjusting their names accordingly.

```shell
python src/datamodule/download.py
```

Command Line Arguments for download.py

- `--data_dir`: Directory to save the dataset
- `--url`: URL to download the dataset
- `--old_ext_postproc`: Old extension of the files to post-process
- `--new_ext_postproc`: New extension of the files to post-process
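The post-processing can be pictured as a recursive extension rename. The snippet below is a standard-library approximation of what the `--old_ext_postproc`/`--new_ext_postproc` options describe, not the actual code of download.py:

```python
from pathlib import Path

def postprocess_extensions(data_dir: str, old_ext: str = ".txt",
                           new_ext: str = ".jsonl") -> list[Path]:
    """Rename every *old_ext file under data_dir to use new_ext."""
    renamed = []
    for path in sorted(Path(data_dir).rglob(f"*{old_ext}")):
        target = path.with_suffix(new_ext)  # swap only the extension
        path.rename(target)
        renamed.append(target)
    return renamed
```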

Data statistics

The statistics about the dataset can be obtained using the following command:

```shell
python src/datamodule/stats.py
```

Command Line Arguments for stats.py

- `--data_dir`: Directory of the dataset
- `--out_dir`: Directory to save the plots

How to run

All the scripts can be configured by modifying the configuration files in the configs/ directory, written in YAML format. Script parameters can also be overridden from the command line.

The configuration files are validated using Hydra, a powerful configuration management tool. For more information about Hydra, please refer to the official documentation.
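Conceptually, a Hydra override like `model.lr=3e-5` walks a dotted path into the nested YAML config and replaces the leaf value. The sketch below mimics that merge with the standard library only; the keys shown are hypothetical, not TitleQuill's actual config schema, and Hydra itself also converts value types, which this sketch skips:

```python
def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Apply Hydra-style 'section.key=value' overrides to a nested dict."""
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        keys = dotted_key.split(".")
        node = config
        for key in keys[:-1]:               # walk/create intermediate nodes
            node = node.setdefault(key, {})
        node[keys[-1]] = value              # values stay strings here
    return config

cfg = {"model": {"lr": "1e-4"}, "trainer": {"epochs": "5"}}
apply_overrides(cfg, ["model.lr=3e-5", "trainer.epochs=10"])
```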

Training TitleQuill

```shell
python src/run_titlequill.py
```

Other scripts

- Baseline

  ```shell
  python src/run_baseline.py
  ```

- Qwen2

  ```shell
  python src/run_qwen2.py
  ```

- TextRank

  ```shell
  python src/run_textrank.py
  ```
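As background on the TextRank baseline: it scores words with a PageRank-style iteration over a word co-occurrence graph and keeps the top-ranked ones as keywords. The following is a generic standard-library sketch of that idea, not the repository's run_textrank.py:

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=30, top_k=5):
    """Rank words by PageRank over a co-occurrence graph (TextRank idea)."""
    graph = defaultdict(set)
    for i in range(len(words)):                     # link words that co-occur
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):                          # PageRank power iteration
        scores = {
            w: (1 - damping) + damping * sum(
                scores[n] / len(graph[n]) for n in graph[w]
            )
            for w in graph
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```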

Demo

The project includes a Streamlit GUI for the TitleQuill model. To run the GUI, execute the following command:

```shell
# Activate your environment
streamlit run src/app.py
```

Model weights

The model weights can be obtained from this link. Please place the entire folder containing the model and tokenizer in the output/ directory.

References

@misc{flan_t5,
      title={Scaling Instruction-Finetuned Language Models}, 
      author={Hyung Won Chung and Le Hou and Shayne Longpre and Barret Zoph and Yi Tay and William Fedus and Yunxuan Li and Xuezhi Wang and Mostafa Dehghani and Siddhartha Brahma and Albert Webson and Shixiang Shane Gu and Zhuyun Dai and Mirac Suzgun and Xinyun Chen and Aakanksha Chowdhery and Alex Castro-Ros and Marie Pellat and Kevin Robinson and Dasha Valter and Sharan Narang and Gaurav Mishra and Adams Yu and Vincent Zhao and Yanping Huang and Andrew Dai and Hongkun Yu and Slav Petrov and Ed H. Chi and Jeff Dean and Jacob Devlin and Adam Roberts and Denny Zhou and Quoc V. Le and Jason Wei},
      year={2022},
      eprint={2210.11416},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2210.11416}, 
}

@inproceedings{text_rank,
    title = "{T}ext{R}ank: Bringing Order into Text",
    author = "Mihalcea, Rada  and
      Tarau, Paul",
    editor = "Lin, Dekang  and
      Wu, Dekai",
    booktitle = "Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing",
    month = jul,
    year = "2004",
    address = "Barcelona, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W04-3252",
    pages = "404--411",
}


@misc{qwen2,
      title={Qwen Technical Report}, 
      author={Jinze Bai and Shuai Bai and Yunfei Chu and Zeyu Cui and Kai Dang and Xiaodong Deng and Yang Fan and Wenbin Ge and Yu Han and Fei Huang and Binyuan Hui and Luo Ji and Mei Li and Junyang Lin and Runji Lin and Dayiheng Liu and Gao Liu and Chengqiang Lu and Keming Lu and Jianxin Ma and Rui Men and Xingzhang Ren and Xuancheng Ren and Chuanqi Tan and Sinan Tan and Jianhong Tu and Peng Wang and Shijie Wang and Wei Wang and Shengguang Wu and Benfeng Xu and Jin Xu and An Yang and Hao Yang and Jian Yang and Shusheng Yang and Yang Yao and Bowen Yu and Hongyi Yuan and Zheng Yuan and Jianwei Zhang and Xingxuan Zhang and Yichang Zhang and Zhenru Zhang and Chang Zhou and Jingren Zhou and Xiaohuan Zhou and Tianhang Zhu},
      year={2023},
      eprint={2309.16609},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2309.16609}, 
}
