Biological Relationship Extraction

This repository contains all code and results from the "Protein-Protein Interaction Networks Derived from Classical and Machine Learning-Based Natural Language Processing Tools" publication.

Please cite Degnan et al. 2024 when using code from this repository.

Link to publication: https://pubs.acs.org/doi/full/10.1021/acs.jproteome.4c00535

Repository Structure

Folder	Subfolder	Description
algorithms	---	Scripts to run all algorithms
algorithms	BERT_training	Code to train BERT models, adapted from Lee et al. 2022
benchmarks	---	Contains all output files for the 3 main studies in this publication
benchmarks	benchmark1	Results from the study with the GPGP and BioRED datasets
benchmarks	benchmark1/.../raw_output/	Raw results from the tools without any processing
benchmarks	benchmark1/.../processed_output/	Cleaned "raw output" with truth annotations, following processing by clean_relationships
benchmarks	benchmark2	Results from the C. elegans interactome from UniProt study
benchmarks	benchmark2/full_vs_title_abstract	Mini-study to determine algorithm performance of "full text" versus titles & abstracts only
benchmarks	benchmark2/pdf_vs_clean	Mini-study to determine algorithm performance of two "full text" methods: PDFs or "clean text"
benchmarks	benchmark2/complete_results	Results of using each tool to reconstruct the UniProt C. elegans interactome network
benchmarks	benchmark2/.../extracted_relationships	Raw output of each tool
benchmarks	benchmark2/.../binary_relationships	Cleaned raw output ("extracted relationships"), reduced to unique protein-protein interactions using clean_relationships
benchmarks	benchmark2/.../networks	PNG of each network from each study
benchmarks	benchmark3	Results from the E. coli PubMed query. Folder structure follows the extracted relationships, binary relationships, and networks folders from benchmark 2
data	---	Contains all input files for training and running the NLP tools
data	benchmark1	Holds the training data for the BERT models, as well as the GPGP and BioRED testing datasets
data	benchmark1/training	Training datasets from Su & Vijay 2022
data	benchmark1/testing	Testing datasets, including the in-house GPGP dataset and BioRED
data	benchmark2	Contains the C. elegans interactome and synonyms from UniProt. Also contains CSV files of PubMed IDs and whether they were "clean text", PDF, or title and abstract
data	benchmark3	Contains the E. coli synonyms from UniProt. Also contains CSV files of PubMed IDs and whether they were "clean text", PDF, or title and abstract
plots	---	Holds scripts for building network plots
processing	---	Scripts for various tasks
processing	calculate_metrics	Calculate true positive rates, etc. For network metrics, use build_networks.R in the plots folder
processing	clean_relationships	Script for converting tool outputs to unique protein-protein interactions
processing	format_BERT	Script for formatting inputs for BERT models
processing	format_LLMs	Script for formatting inputs for BioGPT, SOLAR, and Gemini
processing	pull_papers	Script for extracting papers as either "clean text", PDF, or titles and abstracts
processing	synonym_table	Script for building a synonym table for mapping protein IDs to their common names from UniProt
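
To illustrate the kind of processing described for clean_relationships, synonym_table, and calculate_metrics, here is a minimal Python sketch. It is not the repository's implementation: the function names, the toy protein names, and the placeholder IDs (ID1, ID2, ID3) are all hypothetical, and the actual scripts may handle matching and scoring differently. The sketch assumes each raw tool output is a pair of protein names, that a synonym table maps lowercase names to a canonical ID, and that an interaction is undirected.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Pair = FrozenSet[str]  # an undirected interaction between two canonical IDs


def to_unique_pairs(raw: Iterable[Tuple[str, str]],
                    synonyms: Dict[str, str]) -> Set[Pair]:
    """Map names to canonical IDs and collapse to unique unordered pairs."""
    pairs: Set[Pair] = set()
    for name_a, name_b in raw:
        id_a = synonyms.get(name_a.strip().lower())
        id_b = synonyms.get(name_b.strip().lower())
        # Skip names missing from the synonym table and self-interactions.
        if id_a is None or id_b is None or id_a == id_b:
            continue
        pairs.add(frozenset((id_a, id_b)))
    return pairs


def precision_recall(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Precision and recall of predicted pairs against a truth set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data, for illustration only.
synonyms = {"prota": "ID1", "prota alias": "ID1", "protb": "ID2", "protc": "ID3"}
raw = [
    ("ProtA", "ProtB"),
    ("protB", "PROTA"),        # same interaction, reversed order and case
    ("ProtA alias", "ProtC"),  # synonym resolves to ID1
    ("ProtA", "Unknown"),      # dropped: name not in the synonym table
]
truth = {frozenset(("ID1", "ID2")), frozenset(("ID2", "ID3"))}

pairs = to_unique_pairs(raw, synonyms)
precision, recall = precision_recall(pairs, truth)
```

Storing each interaction as a frozenset makes the pair order-independent, so "ProtA, ProtB" and "ProtB, ProtA" count once.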

Algorithm Scripts & Names

Script Name	Algorithm Name in Publication
co_occurrence.R	Sentence CoOccurrence, Relational Term
fixed_term.R	Fixed Term
pubmedmineR_and_cosine.R	pubmed.mineR & cosine
TRIPS.py	TRIPS
REACH.py	REACH
BERT.py	PubMedBERT & BioBERT
BioGPT.py	BioGPT
SOLAR.py	SOLAR
Gemini.py	Gemini
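
As a rough illustration of the idea behind sentence co-occurrence, the following Python sketch reports two terms as related when they appear in the same sentence. It is an assumption-laden sketch, not a port of co_occurrence.R: the function name, the naive punctuation-based sentence splitting, the whole-word case-insensitive matching, and the toy text are all hypothetical choices made here, and the relational-term variant is not covered.

```python
import re
from itertools import combinations
from typing import Iterable, Set, Tuple


def sentence_cooccurrence(text: str, terms: Iterable[str]) -> Set[Tuple[str, str]]:
    """Return alphabetically ordered term pairs that share at least one sentence."""
    # Naive sentence splitting on ., !, or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Whole-word, case-insensitive matching for each term.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    pairs: Set[Tuple[str, str]] = set()
    for sentence in sentences:
        found = sorted(t for t, p in patterns.items() if p.search(sentence))
        pairs.update(combinations(found, 2))
    return pairs


# Hypothetical toy text, for illustration only.
text = "ProtA binds ProtB in vitro. ProtC was not detected. ProtB interacts with ProtC."
cooccurring = sentence_cooccurrence(text, ["ProtA", "ProtB", "ProtC"])
```

In this toy text, ProtA and ProtC never share a sentence, so that pair is not reported.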

PNNL-Predictive-Phenomics/biological_relationship_extraction
