
PhenoGPT2

PhenoGPT2 is an LLM-based phenotype recognition system designed for accurate extraction and Human Phenotype Ontology (HPO) normalization of clinically meaningful phenotypic information from real-world medical narratives. The model is fine-tuned on synthetically generated clinical corpora, curated HPO resources, and de-identified clinical notes from MIMIC-IV to improve phenotype detection, contextual attribution, and ontology alignment.

The framework performs HPO normalization and evidence-aware phenotype validation, retaining only phenotypes confirmed for the patient while filtering out negated, uncertain, family-history, or literature-referenced mentions. For transparency, the model also extracts supporting text spans for each phenotype, enabling direct user verification. The pipeline additionally extracts demographic variables (sex, age, and ethnicity) and phenotype onset information, exporting all results in Phenopacket-JSON format for ready-to-use downstream workflows.

PhenoGPT2 is distributed under the MIT License by Wang Genomics Lab.

Contents ✨

  • Installation 🎯
  • Model Download 📥
  • Data Input Guide 🚀
  • JSON-formatted answer 📦
  • Inference 🤖
  • Pretraining & Fine-tuning 💻
  • Developers 🧠
  • Citations 📜

Installation 🎯

  1. Clone this repository and navigate to the PhenoGPT2 folder
git clone https://github.com/WGLab/PhenoGPT2.git
cd PhenoGPT2
  2. Install system/conda dependencies.
conda env create -f environment.yml -y
conda activate phenogpt2
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
conda install -c "nvidia/label/cuda-12.8" cuda-toolkit -y
pip install --upgrade pip
pip install -r requirements.txt 
python -m spacy download en_core_web_sm
python -m ipykernel install --user --name=phenogpt2 ## this is needed if you want to run PhenoGPT2_Codebook.ipynb
  3. Install Flash-Attention (optional, if your system supports it)
## Flash-Attention speeds up inference and lowers GPU memory usage, but it is not supported on ARM-based systems.
## Make sure to load the CUDA module properly before installing flash-attn
#module load CUDA/12.1.1  # try the pip install on the following line first; if it fails, load the CUDA module and retry
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.8cxx11abiTRUE-cp311-cp311-linux_x86_64.whl
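
After installation, a quick sanity check can confirm that the environment is usable. This is a minimal sketch, assuming the phenogpt2 conda environment above is active; the flash-attn import only succeeds if you installed the optional wheel.

import torch

print(torch.__version__)                  # expected: 2.8.0, as pinned above
print(torch.cuda.is_available())          # should print True on a CUDA-capable machine
try:
    import flash_attn                     # optional; only present if the wheel above was installed
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; use --attn_implementation eager at inference time")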

Model Download 📥

  • PhenoGPT2 is built upon the Qwen3 8B and Llama 3.1 8B models, so please apply for access first (for Llama).
  • ⚡⚡ The Qwen3-based PhenoGPT2 performs best. 🏆🏆
  • OPTIONAL: You can download the HPO Aware Pretrain model first if you want to fine-tune on your own extraction/normalization data or use the LoRA variants.
  • Then, download either PhenoGPT2-Short or PhenoGPT2-EHR (full parameters) for inference.
  • If you plan to extract phenotypes from images, also download PhenoGPT2-Vision.
  • ATTENTION: PhenoGPT2 is in testing. To access the model weights, please contact us.
  • LLaVA-Med delivers the best performance, but its installation requires manual modifications to the original code, which can be complex. Please contact us if you wish to use the LLaVA-Med version. Otherwise, the fine-tuned Llama 3.2 11B Vision-Instruct offers seamless integration.
Model | Module | Base Model | 🤗 Huggingface Hub
HPO Aware Pretrain | Text | Llama 3.1 8B | Not released yet
HPO Aware Pretrain | Text | Qwen3 8B | Not released yet
PhenoGPT2-Short | Text | Llama 3.1 8B | Not released yet
PhenoGPT2-Short | Text | Qwen3 8B | Not released yet
PhenoGPT2-EHR (main) | Text | Llama 3.1 8B | Not released yet
PhenoGPT2-EHR (main) | Text | Qwen3 8B | Not released yet
PhenoGPT2-Vision | Vision | LLaVA-Med/Llama | Not released yet
PhenoGPT2-Vision (default) | Vision | Qwen3-VL-30B-A3B-Instruct | Not released yet
PhenoGPT2-Vision | Vision | Llama 3.2 11B Vision-Instruct | Not released yet
  • If you plan to fine-tune or pretrain the models from scratch, make sure to download the original base model weights from the Meta and LLaVA-Med repos.
  • Save all models in the ./models directory.
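
The inference script below loads the models for you, but if you want to inspect a downloaded text checkpoint directly, a minimal sketch could look like the following. This assumes the weights are standard Hugging Face causal-LM checkpoints saved under ./models/phenogpt2/ (the path used in the example command below); adjust the path to wherever you saved the model.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./models/phenogpt2/"         # assumed local path to the downloaded weights
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype="auto",                   # keep the dtype stored in the checkpoint
    device_map="auto",                    # requires the accelerate package; places weights on available GPUs
)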

Data Input Guide 🚀

  • Input files (for inference) should be a dictionary (key: patient ID, value: patient metadata) or a list of such dictionaries, saved with a JSON or PICKLE extension. Each patient dictionary should have the following format:
{
  "pid1": {
    "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
    "image": NaN,
    "pid": "pid1"
  },
  "pid2": {
    "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
    "image": "image_pid2.png",
    "pid": "pid2"
  }
}
  • Please see ./data/example for reference. An example of assembling such an input file is shown below.
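
As an example, an input file matching the format above can be assembled and saved in Python. This is a minimal sketch; the file name my_patients.json is just a placeholder.

import json

patients = {
    "pid1": {
        "clinical_note": "A 1-year-old Korean child presents with persistent fever and shortness of breath. He was found with brachycephaly at 5 months old",
        "image": None,                    # no image for this patient
        "pid": "pid1",
    },
    "pid2": {
        "clinical_note": "Subject reports chest pain radiating to the left arm. Elevated troponin levels...",
        "image": "image_pid2.png",        # image file for the vision module
        "pid": "pid2",
    },
}

with open("my_patients.json", "w") as f:  # pass this file to run_inference.sh via -i
    json.dump(patients, f, indent=2)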

JSON-formatted answer 📦

  • Ideally, the output files include the raw results in phenogpt2_repX.json:
{
  "pid1": {
    "text": {
      "demographics": {
        "age": "1-year-old",
        "sex": "male",
        "ethnicity": "Korean"
      },
      "phenotypes": {
        "persistent fever": {"HPO_ID": "HP:0033399", "onset": "unknown"},
        "shortness of breath": {"HPO_ID": "HP:0002094", "onset": "unknown"},
        "brachycephaly": {"HPO_ID": "HP:0000248", "onset": "5 months old"}
      },
      "filtered_phenotypes": {
        "persistent fever": {"HPO_ID": "HP:0033399", "onset": "unknown"},
        "brachycephaly": {"HPO_ID": "HP:0000248", "onset": "5 months old"}
      },
      "negation_analysis": {
        "demographics": {
          "age": {"evidence": "supporting texts", "correct": true/false},
          "sex": {"evidence": "supporting texts", "correct": true/false},
          "ethnicity": {"evidence": "supporting texts", "correct": true/false}
        },
        "phenotypes": {
          "persistent fever": {"evidence": "supporting texts", "correct": true/false, "type": "patient"},
          "shortness of breath": {"evidence": "supporting texts", "correct": true/false, "type": "family"},
          "brachycephaly": {"evidence": "supporting texts", "correct": true/false, "type": "patient"}
        }
      },
      "pid": "pid1"
    },
    "image": {}
  },
  ...
}

WARNING

However, due to the nature of LLMs, the generated output sometimes does not conform to valid JSON. In that case you will receive an "error_response" entry in the answer instead of the demographics and phenotypes fields, which most likely means the JSON could not be parsed because of repetitive outputs or unexpected strings. We suggest checking these cases manually or rerunning with a modified note (for example, denoise the note first).
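
A small post-processing sketch for the raw output is shown below. It assumes the structure illustrated above; the file name phenogpt2_rep0.json and the exact placement of "error_response" are assumptions, so adapt the keys to your actual output.

import json

with open("phenogpt2_rep0.json") as f:    # example output file name
    results = json.load(f)

for pid, record in results.items():
    text_result = record.get("text", {})
    if "error_response" in text_result:   # generation could not be parsed into valid JSON
        print(f"{pid}: needs manual review or a rerun with a denoised note")
        continue
    for name, info in text_result.get("filtered_phenotypes", {}).items():
        print(pid, name, info.get("HPO_ID"), info.get("onset"))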

Inference 🤖

If you simply want to run PhenoGPT2 on your local machine for inference, the fine-tuned models are saved in the models directory. Make sure to compile your input data as described above before running the inference.

Please note that the first run may take some time as it needs to load all the models. Subsequent runs will be significantly faster.

Please use the following command (together with your scheduler system, e.g., SLURM):

bash run_inference.sh -i ./data/example/text_examples.json \
         -o ./example_testing \
         -model_dir ./models/phenogpt2/ \
         -negation_model YOUR_QWEN_MODEL \
         -attn_implementation flash_attention_2 \
         -batch_size 7 \
         -chunk_batch_size 7 \
         -index 0 \
         -negation \
         -text_only \
         -wc 0

Required Arguments ⚙️

Argument | Description
-i, --input | Required. Path to your input data. This can be a .json or .pkl file, or a folder containing .txt or image files.
-o, --output | Required. Path to the output directory where results will be saved. The directory will be created if it does not exist.

Optional Arguments

Argument | Description
-model_dir, --model_dir | Path to the base model directory (e.g., a pretrained LLaVA or LLaMA3 model). If not provided, defaults will be used.
-lora, --lora | Enable this flag if your model is LoRA-adapted.
-index, --index | Identifier string for saving outputs. Useful for tracking multiple runs.
-batch_size, --batch_size | Default = 7. How many samples are processed simultaneously in a batch. Decrease if you do not have enough GPU memory; otherwise, increase to run faster and more efficiently.
-chunk_batch_size, --chunk_batch_size | Default = 7. How many chunks per batch are processed simultaneously. Only needed when chunking the notes (when wc != 0). Decrease if you do not have enough GPU memory; otherwise, increase to run faster and more efficiently. For an A100-40GB, a size of 10 still works but almost hits the memory limit.
-negation, --negation | By default, negation filtering is disabled. Use this flag to enable it.
-negation_model, --negation_model | By default, the negation model is Qwen/Qwen3-4B-Instruct-2507. You can try other Qwen models if needed. We found Qwen very useful for negation detection (with a detailed prompt).
--attn_implementation | 'eager' is used by default. Options: ['flash_attention_2', 'eager', 'sdpa']. We recommend Flash Attention 2, as it speeds up inference and lowers memory usage. Note: FlashAttention2 may not be supported on arm64/aarch64 platforms.
--text_only | Use only the text module of the model, ignoring visual inputs.
--vision_only | Use only the vision module, ignoring text inputs.
-vision, --vision | Choose the vision model. Options: llava-med or llama-vision (default). This is used together with the text module; if you only want vision, use --vision_only instead.
-wc, --wc | Word count per chunk. Use this to split long text into smaller chunks (default is 0, meaning no splitting; see the illustrative sketch below). We recommend using either the full length (no splitting) or 300/384 words per chunk (improving recall), depending on your task. Decrease chunk_batch_size if you encounter OOM errors.
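
For intuition only, splitting by word count as controlled by -wc corresponds roughly to the sketch below. This illustrates the idea and is not the repository's actual chunking implementation.

def chunk_by_word_count(note: str, wc: int):
    """Split a note into consecutive chunks of roughly `wc` words (illustrative only)."""
    if wc <= 0:                           # wc = 0 means no splitting, as in the default
        return [note]
    words = note.split()
    return [" ".join(words[i:i + wc]) for i in range(0, len(words), wc)]

long_clinical_note = "..."                # your note text
chunks = chunk_by_word_count(long_clinical_note, 300)   # e.g., 300-word chunks to improve recall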

Pretraining & Fine-tuning 💻

You can reproduce the PhenoGPT2 models with your own datasets or other foundation models.

Text Module

  1. Pretrain your model on synthetic data compiled from the HPO database to obtain the HPO Aware Pretrained Model.
  2. Then, fine-tune the HPO Aware Pretrained Model on synthetic training and validation data compiled from MIMIC-IV and PhenoPackets.
  3. If you want to access the data, please send a message to the developers.
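
As a rough sketch of the fine-tuning step, LoRA adaptation with Hugging Face PEFT could look like the following. The path, hyperparameters, and target modules here are generic assumptions, not the exact training configuration used for PhenoGPT2.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "./models/hpo_aware_pretrain"      # assumed path to your HPO Aware Pretrained Model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto", device_map="auto")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                    # generic LoRA hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections in Llama/Qwen blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # sanity check: only adapter weights are trainable
# ...continue with your usual supervised fine-tuning loop on the train/validation data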

Vision Module

  1. If you want to fine-tune the LLaVA-Med model, we recommend following the instructions in the LLaVA GitHub repository, but swapping the weights for LLaVA-Med.
  2. Otherwise, you can use our phenogpt2_vision_training.py to fine-tune LLaMA Vision (or other models with a similar architecture).

Developers 🧠

Quan Minh Nguyen - Bioengineering PhD student at the University of Pennsylvania (qmn103@seas.upenn.edu)

Dr. Kai Wang - Professor of Pathology and Laboratory Medicine at the University of Pennsylvania and Children's Hospital of Philadelphia (wangk@chop.edu)

Citations 📜

The publication is in preparation. Thank you for your interest! In the meantime, please cite our GitHub repository if you use PhenoGPT2.

@misc{nguyen2026phenogpt2,
  author       = {Quan Minh Nguyen and Kai Wang},
  title        = {PhenoGPT2},
  year         = {2026},
  howpublished = {\url{https://github.com/WGLab/PhenoGPT2}},
  note         = {Accessed: 2026-02-11}
}
