PNNL-Predictive-Phenomics/biological_relationship_extraction

Biological Relationship Extraction

This repository contains all code and results from the "Protein-Protein Interaction Networks Derived from Classical and Machine Learning-Based Natural Language Processing Tools" publication.

Please cite Degnan et al. 2024 when using code from this repository.

Link to publication: https://pubs.acs.org/doi/full/10.1021/acs.jproteome.4c00535

Repository Structure

| Folder | Subfolder | Description |
| --- | --- | --- |
| algorithms | --- | Scripts to run all algorithms |
| algorithms | BERT_training | Code to train BERT models, adapted from Lee et al. 2022 |
| benchmarks | --- | Contains all output files for the 3 main studies in this publication |
| benchmarks | benchmark1 | Results from the study with the GPGP and BioRED datasets |
| benchmarks | benchmark1/.../raw_output/ | Raw results from the tools without any processing |
| benchmarks | benchmark1/.../processed_output/ | Cleaned "raw output" with truth annotations, after processing by clean_relationships |
| benchmarks | benchmark2 | Results from the C. elegans interactome from UniProt study |
| benchmarks | benchmark2/full_vs_title_abstract | Mini-study comparing algorithm performance on "full text" versus titles and abstracts only |
| benchmarks | benchmark2/pdf_vs_clean | Mini-study comparing algorithm performance on two "full text" methods: PDFs or "clean text" |
| benchmarks | benchmark2/complete_results | Results of using each tool to reconstruct the UniProt C. elegans interactome network |
| benchmarks | benchmark2/.../extracted_relationships | Raw output of each tool |
| benchmarks | benchmark2/.../binary_relationships | Raw output from extracted_relationships, cleaned to unique protein-protein interactions using clean_relationships |
| benchmarks | benchmark2/.../networks | PNG of each network from each study |
| benchmarks | benchmark3 | Results from the E. coli PubMed query. Folder structure follows the extracted_relationships, binary_relationships, and networks folders from benchmark2 |
| data | --- | Contains all input files for training and running the NLP tools |
| data | benchmark1 | Holds the training data for the BERT datasets, as well as the GPGP and BioRED testing datasets |
| data | benchmark1/training | Training datasets from Su & Vijay 2022 |
| data | benchmark1/testing | Testing datasets, including the in-house GPGP dataset and BioRED |
| data | benchmark2 | Contains the C. elegans interactome and synonyms from UniProt. Also contains CSVs of PubMed IDs and whether they were "clean text", PDF, or title and abstract |
| data | benchmark3 | Contains the E. coli synonyms from UniProt. Also contains CSVs of PubMed IDs and whether they were "clean text", PDF, or title and abstract |
| plots | --- | Holds scripts for building network plots |
| processing | --- | Scripts for various tasks |
| processing | calculate_metrics | Calculates true positive rates and related metrics. For network metrics, use build_networks.R in the plots folder |
| processing | clean_relationships | Script for converting tool outputs to unique protein-protein interactions |
| processing | format_BERT | Script for formatting inputs for BERT models |
| processing | format_LLMs | Script for formatting inputs for BioGPT, SOLAR, and Gemini |
| processing | pull_papers | Script for extracting papers as either "clean text", PDF, or titles and abstracts |
| processing | synonym_table | Script for building a synonym table for mapping protein IDs to their common names from UniProt |
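
To illustrate what the clean_relationships and calculate_metrics steps described above involve, here is a minimal Python sketch. It is not the repository's code: the function names `unique_interactions` and `score`, the lowercasing of protein names, and the dropping of self-pairs are all assumptions made for this example. It collapses raw tool hits into unique unordered protein pairs, then compares them with a truth set.

```python
from __future__ import annotations

from typing import Iterable


def unique_interactions(raw: Iterable[tuple[str, str]]) -> set[frozenset[str]]:
    """Collapse raw (protein_a, protein_b) hits into unique unordered pairs.

    Hypothetical helper: names are stripped and lowercased so that
    ("DAF-2", "daf-16") and ("daf-16", "DAF-2") count as one interaction.
    """
    pairs: set[frozenset[str]] = set()
    for a, b in raw:
        a, b = a.strip().lower(), b.strip().lower()
        # Assumption for this sketch: skip empty names and self-pairs.
        if a and b and a != b:
            pairs.add(frozenset((a, b)))
    return pairs


def score(predicted: set[frozenset[str]], truth: set[frozenset[str]]) -> dict:
    """Count true/false positives and false negatives against a truth set."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall}


if __name__ == "__main__":
    raw_hits = [("DAF-2", "daf-16"), ("daf-16", "DAF-2 "), ("daf-2", "age-1")]
    truth = unique_interactions([("daf-2", "daf-16"), ("akt-1", "daf-16")])
    print(score(unique_interactions(raw_hits), truth))
```

Storing each pair as a `frozenset` makes the interaction direction-agnostic, which matches the idea of unique binary interactions. A real pipeline would likely also map synonyms to a canonical ID first, as the synonym_table script suggests.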

Algorithm Scripts & Names

| Script Name | Algorithm Name in Publication |
| --- | --- |
| co_occurrence.R | Sentence CoOccurrence, Relational Term |
| fixed_term.R | Fixed Term |
| pubmedmineR_and_cosine.R | pubmed.mineR & cosine |
| TRIPS.py | TRIPS |
| REACH.py | REACH |
| BERT.py | PubMedBERT & BioBERT |
| BioGPT.py | BioGPT |
| SOLAR.py | SOLAR |
| Gemini.py | Gemini |
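
For readers unfamiliar with the simplest approach listed above, sentence co-occurrence, the general idea can be sketched in a few lines of Python. This is an illustration of the technique only, not a port of co_occurrence.R: the function name `sentence_cooccurrence`, the naive regex sentence splitter, and the case-insensitive whole-name matching are assumptions made for this example.

```python
import re
from itertools import combinations


def sentence_cooccurrence(text: str, proteins: list[str]) -> set[tuple[str, str]]:
    """Return protein pairs that are mentioned together in one sentence.

    Hypothetical sketch: sentences are split on ., ! or ? followed by
    whitespace, and a protein counts as mentioned when its name appears
    as a whole token (not embedded in a longer word or hyphenated name).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text)
    patterns = {
        p: re.compile(r"(?<![\w-])" + re.escape(p) + r"(?![\w-])", re.IGNORECASE)
        for p in proteins
    }
    pairs: set[tuple[str, str]] = set()
    for sentence in sentences:
        # Sort so each pair has one canonical ordering.
        found = sorted(p for p, pat in patterns.items() if pat.search(sentence))
        pairs.update(combinations(found, 2))
    return pairs


if __name__ == "__main__":
    abstract = (
        "DAF-2 signals through AGE-1. "
        "DAF-16 is a transcription factor. "
        "AKT-1 phosphorylates DAF-16."
    )
    names = ["DAF-2", "AGE-1", "DAF-16", "AKT-1"]
    for pair in sorted(sentence_cooccurrence(abstract, names)):
        print(pair)
```

The regex splitter will mis-split on abbreviations such as "C. elegans", so a production pipeline would normally use a dedicated sentence tokenizer instead.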
