
PhenoGPT2

PhenoGPT2 is an LLM-based phenotype recognition system designed for accurate extraction and Human Phenotype Ontology (HPO) normalization of clinically meaningful phenotypic information from real-world medical narratives. The model is fine-tuned on synthetically generated clinical corpora, curated HPO resources, and de-identified clinical notes from MIMIC-IV to improve phenotype detection, contextual attribution, and ontology alignment.

The framework performs HPO normalization and evidence-aware phenotype validation, retaining only phenotypes confirmed for the patient while filtering out negated, uncertain, family-history, or literature-referenced mentions. For transparency, the model also extracts supporting text spans for each phenotype, enabling direct user verification. The pipeline additionally extracts demographic variables (sex, age, and ethnicity) and phenotype onset information, exporting all results in Phenopacket-JSON format for ready-to-use downstream workflows.

PhenoGPT2 is distributed under the MIT License by Wang Genomics Lab.

Contents ✨

  • Installation 🎯
  • Model Download 📥
  • Data Input Guide 🚀
  • JSON-formatted answer 📦
  • Inference 🤖
  • Pretraining & Fine-tuning 💻
  • Developers 🧠
  • Citations 📜

Installation 🎯

  1. Clone this repository and navigate to the PhenoGPT2 folder
git clone https://github.com/WGLab/PhenoGPT2.git
cd PhenoGPT2
  2. Install system/conda dependencies.
conda env create -f environment.yml -y
conda activate phenogpt2
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
conda install -c "nvidia/label/cuda-12.8" cuda-toolkit -y
pip install --upgrade pip
pip install -r requirements.txt 
python -m spacy download en_core_web_sm
python -m ipykernel install --user --name=phenogpt2 ## this is needed if you want to run PhenoGPT2_Codebook.ipynb
  3. Install Flash-Attention (optional, if your system supports it)
## Flash-Attention speeds up inference and lowers GPU memory usage, but it is not supported on ARM-based systems.
## Make sure to load the CUDA module properly before installing flash-attn
#module load CUDA/12.1.1  # try the pip install on the following line first; if it fails, load the CUDA module and retry
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
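
After installation, a quick sanity check can confirm that the environment is usable. This is a minimal sketch, assuming the phenogpt2 conda environment above is active; the flash-attn import only succeeds if you installed the optional wheel.

import torch

print(torch.__version__)                  # expected: 2.8.0, as pinned above
print(torch.cuda.is_available())          # should print True on a CUDA-capable machine
try:
    import flash_attn                     # optional; only present if the wheel above was installed
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; use --attn_implementation eager at inference time")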

Model Download 📥

  • PhenoGPT2 is built upon the Qwen3 8B and Llama 3.1 8B models, so please apply for access first (for Llama).
  • ⚡⚡ The Qwen3-based PhenoGPT2 performs best. 🏆🏆
  • OPTIONAL: You can download the HPO Aware Pretrain model first if you want to fine-tune on your own extraction/normalization data or use the LoRA variants.
  • Then, download either PhenoGPT2-Short or PhenoGPT2-EHR (full parameters) for inference.
  • If you plan to extract phenotypes from images, also download PhenoGPT2-Vision.
  • ATTENTION: PhenoGPT2 is in testing. To access the model weights, please contact us.
  • LLaVA-Med delivers the best performance, but its installation requires manual modifications to the original code, which can be complex. Please contact us if you wish to use the LLaVA-Med version. Otherwise, the fine-tuned Llama 3.2 11B Vision-Instruct offers seamless integration.
Model | Module | Base Model | 🤗 Huggingface Hub
HPO Aware Pretrain | Text | Llama 3.1 8B | Not released yet
HPO Aware Pretrain | Text | Qwen3 8B | Not released yet
PhenoGPT2-Short | Text | Llama 3.1 8B | Not released yet
PhenoGPT2-Short | Text | Qwen3 8B | Not released yet
PhenoGPT2-EHR (main) | Text | Llama 3.1 8B | Not released yet
PhenoGPT2-EHR (main) | Text | Qwen3 8B | Not released yet
PhenoGPT2-Vision | Vision | LLaVA-Med/Llama | Not released yet
PhenoGPT2-Vision (default) | Vision | Qwen3-VL-30B-A3B-Instruct | Not released yet
PhenoGPT2-Vision | Vision | Llama 3.2 11B Vision-Instruct | Not released yet
  • If you plan to fine-tune or pretrain the models from scratch, make sure to download the original base model weights from the Meta and LLaVA-Med repos.
  • Save all models in the ./models directory.
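
The inference script below loads the models for you, but if you want to inspect a downloaded text checkpoint directly, a minimal sketch could look like the following. This assumes the weights are standard Hugging Face causal-LM checkpoints saved under ./models/phenogpt2/ (the path used in the example command below); adjust the path to wherever you saved the model.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./models/phenogpt2/"         # assumed local path to the downloaded weights
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="auto",                   # keep the dtype stored in the checkpoint
    device_map="auto",                    # requires the accelerate package; places weights on available GPUs
)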

Data Input Guide 🚀

  • Input files (for inference) should be a dictionary (key: patient ID, value: patient metadata) or a list of such dictionaries, saved with a JSON or PICKLE extension. Each patient dictionary should have the following format:
{
  "pid1": {
    "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
    "image": NaN,
    "pid": "pid1"
  },
  "pid2": {
    "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
    "image": "image_pid2.png",
    "pid": "pid2"
  }
}
  • Please see ./data/example for reference. An example of assembling such an input file is shown below.
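
As an example, an input file matching the format above can be assembled and saved in Python. This is a minimal sketch; the file name my_patients.json is just a placeholder.

import json

patients = {
    "pid1": {
        "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
        "image": None,                    # no image for this patient
        "pid": "pid1",
    },
    "pid2": {
        "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
        "image": "image_pid2.png",        # image file for the vision module
        "pid": "pid2",
    },
}

with open("my_patients.json", "w") as f:  # pass this file to run_inference.sh via -i
    json.dump(patients, f, indent=2)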

JSON-formatted answer 📦

  • Ideally, the output files include the raw results in phenogpt2_repX.json:
{
  "pid1": {
    "text": {
      "demographics": {
        "age": "1-year-old",
        "sex": "male",
        "ethnicity": "Korean"
      },
      "phenotypes": {
        "persistent fever": {"HPO_ID": "HP:0033399", "onset": "unknown"},
        "shortness of breath": {"HPO_ID": "HP:0002094", "onset": "unknown"},
        "brachycephaly": {"HPO_ID": "HP:0000248", "onset": "5 months old"}
      },
      "filtered_phenotypes": {
        "persistent fever": {"HPO_ID": "HP:0033399", "onset": "unknown"},
        "brachycephaly": {"HPO_ID": "HP:0000248", "onset": "5 months old"}
      },
      "negation_analysis": {
        "demographics": {
          "age": {"evidence": "supporting texts", "correct": true/false},
          "sex": {"evidence": "supporting texts", "correct": true/false},
          "ethnicity": {"evidence": "supporting texts", "correct": true/false}
        },
        "phenotypes": {
          "persistent fever": {"evidence": "supporting texts", "correct": true/false, "type": "patient"},
          "shortness of breath": {"evidence": "supporting texts", "correct": true/false, "type": "family"},
          "brachycephaly": {"evidence": "supporting texts", "correct": true/false, "type": "patient"}
        }
      },
      "pid": "pid1"
    },
    "image": {}
  },
  ...
}

WARNING

However, due to the nature of LLMs, the generated output sometimes does not conform to valid JSON. In that case you will receive an "error_response" entry in the answer instead of the demographics and phenotypes fields, which most likely means the JSON could not be parsed because of repetitive outputs or unexpected strings. We suggest checking these cases manually or rerunning with a modified note (for example, denoise the note first).
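
A small post-processing sketch for the raw output is shown below. It assumes the structure illustrated above; the file name phenogpt2_rep0.json and the exact placement of "error_response" are assumptions, so adapt the keys to your actual output.

import json

with open("phenogpt2_rep0.json") as f:    # example output file name
    results = json.load(f)

for pid, record in results.items():
    text_result = record.get("text", {})
    if "error_response" in text_result:   # generation could not be parsed into valid JSON
        print(f"{pid}: needs manual review or a rerun with a denoised note")
        continue
    for name, info in text_result.get("filtered_phenotypes", {}).items():
        print(pid, name, info.get("HPO_ID"), info.get("onset"))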

Inference 🤖

If you simply want to run PhenoGPT2 on your local machine for inference, the fine-tuned models are saved in the models directory. Make sure to compile your input data as described above before running the inference.

Please note that the first run may take some time as it needs to load all the models. Subsequent runs will be significantly faster.

Please use the following command (together with your scheduler system, e.g., SLURM):

bash run_inference.sh -i ./data/example/text_examples.json \
         -o ./example_testing \
         -model_dir ./models/phenogpt2/ \
         -negation_model YOUR_QWEN_MODEL \
         -attn_implementation flash_attention_2 \
         -batch_size 7 \
         -chunk_batch_size 7 \
         -index 0 \
         -negation \
         -text_only \
         -wc 0

Required Arguments ⚙️

Argument | Description
-i, --input | Required. Path to your input data. This can be a .json or .pkl file, or a folder containing .txt or image files.
-o, --output | Required. Path to the output directory where results will be saved. The directory will be created if it does not exist.

Optional Arguments

Argument | Description
-model_dir, --model_dir | Path to the base model directory (e.g., a pretrained LLaVA or LLaMA3 model). If not provided, defaults will be used.
-lora, --lora | Enable this flag if your model is LoRA-adapted.
-index, --index | Identifier string for saving outputs. Useful for tracking multiple runs.
-batch_size, --batch_size | Default = 7. How many samples are processed simultaneously in a batch. Decrease if you do not have enough GPU memory; otherwise, increase to run faster and more efficiently.
-chunk_batch_size, --chunk_batch_size | Default = 7. How many chunks per batch are processed simultaneously. Only needed when chunking the notes (when wc != 0). Decrease if you do not have enough GPU memory; otherwise, increase to run faster and more efficiently. For an A100-40GB, a size of 10 still works but almost hits the memory limit.
-negation, --negation | By default, negation filtering is disabled. Use this flag to enable it.
-negation_model, --negation_model | By default, the negation model is Qwen/Qwen3-4B-Instruct-2507. You can try other Qwen models if needed. We found Qwen very useful for negation detection (with a detailed prompt).
--attn_implementation | 'eager' is used by default. Options: ['flash_attention_2', 'eager', 'sdpa']. We recommend Flash Attention 2, as it speeds up inference and lowers memory usage. Note: FlashAttention2 may not be supported on arm64/aarch64 platforms.
--text_only | Use only the text module of the model, ignoring visual inputs.
--vision_only | Use only the vision module, ignoring text inputs.
-vision, --vision | Choose the vision model. Options: llava-med or llama-vision (default). This is used together with the text module; if you only want vision, use --vision_only instead.
-wc, --wc | Word count per chunk. Use this to split long text into smaller chunks (default is 0, meaning no splitting; see the illustrative sketch below). We recommend using either the full length (no splitting) or 300/384 words per chunk (improving recall), depending on your task. Decrease chunk_batch_size if you encounter OOM errors.
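
For intuition only, splitting by word count as controlled by -wc corresponds roughly to the sketch below. This illustrates the idea and is not the repository's actual chunking implementation.

def chunk_by_word_count(note: str, wc: int):
    """Split a note into consecutive chunks of roughly `wc` words (illustrative only)."""
    if wc <= 0:                           # wc = 0 means no splitting, as in the default
        return [note]
    words = note.split()
    return [" ".join(words[i:i + wc]) for i in range(0, len(words), wc)]

long_clinical_note = "..."                # your note text
chunks = chunk_by_word_count(long_clinical_note, 300)   # e.g., 300-word chunks to improve recall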

Pretraining & Fine-tuning 💻

You can reproduce the PhenoGPT2 models with your own datasets or other foundation models.

Text Module

  1. Pretrain your model on synthetic data compiled from the HPO database to obtain the HPO Aware Pretrained Model.
  2. Then, fine-tune the HPO Aware Pretrained Model on synthetic training and validation data compiled from MIMIC-IV and PhenoPackets.
  3. If you want to access the data, please send a message to the developers.
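
As a rough sketch of the fine-tuning step, LoRA adaptation with Hugging Face PEFT could look like the following. The path, hyperparameters, and target modules here are generic assumptions, not the exact training configuration used for PhenoGPT2.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "./models/hpo_aware_pretrain"      # assumed path to your HPO Aware Pretrained Model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                    # generic LoRA hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections in Llama/Qwen blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # sanity check: only adapter weights are trainable
# ...continue with your usual supervised fine-tuning loop on the train/validation data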

Vision Module

  1. If you want to fine-tune the LLaVA-Med model, we recommend following the instructions in the LLaVA GitHub repository, but swapping the weights for LLaVA-Med.
  2. Otherwise, you can use our phenogpt2_vision_training.py to fine-tune LLaMA Vision (or other models with a similar architecture).

Developers 🧠

Quan Minh Nguyen - Bioengineering PhD student at the University of Pennsylvania (qmn103@seas.upenn.edu)

Dr. Kai Wang - Professor of Pathology and Laboratory Medicine at the University of Pennsylvania and Children's Hospital of Philadelphia (wangk@chop.edu)

Citations 📜

The publication is in preparation. Thank you for your interest! In the meantime, please cite our GitHub repository if you use PhenoGPT2.

@misc{nguyen2026phenogpt2,
  author       = {Quan Minh Nguyen and Kai Wang},
  title        = {PhenoGPT2},
  year         = {2026},
  howpublished = {\url{https://github.com/WGLab/PhenoGPT2}},
  note         = {Accessed: 2026-02-11}
}
