🧬 Graph Neural Networks for Molecular Property Prediction

A hands-on tutorial that implements Graph Convolutional Networks (GCNs) for molecular property prediction with PyTorch Geometric. The project demonstrates how to predict the water solubility of chemical compounds directly from their molecular structure.

🚀 Overview

This project provides a complete implementation of Graph Neural Networks for molecular property prediction, specifically focusing on predicting water solubility using the ESOL (Estimated SOLubility) dataset. The notebook covers everything from data preprocessing to model training and evaluation.

🎯 Key Features

  • Complete GNN Pipeline: From SMILES strings to molecular graphs to predictions
  • ESOL Dataset: 1,128 compounds with water solubility data from MoleculeNet
  • Graph Convolutional Networks: Multi-layer GCN implementation using PyTorch Geometric
  • Molecular Visualization: Interactive molecule rendering with RDKit
  • Performance Analysis: Training loss visualization and model evaluation

🧪 What You'll Learn

  • Converting SMILES representations to molecular graphs
  • Understanding molecular fingerprints and graph-based representations
  • Implementing Graph Convolutional Networks for regression tasks
  • Working with chemical datasets and molecular descriptors
  • Visualizing molecules and training progress

📊 Dataset

We use the ESOL (Estimated SOLubility) dataset from MoleculeNet:

  • Size: 1,128 chemical compounds
  • Task: Regression (water solubility prediction)
  • Features: 9 node features per atom
  • Target: Log solubility (log of the solubility in mol/L)
  • Format: SMILES strings converted to molecular graphs

"ESOL is a small dataset consisting of water solubility data for 1128 compounds. The dataset has been used to train models that estimate solubility directly from chemical structures (as encoded in SMILES strings)."

🏗️ Model Architecture

Our Graph Convolutional Network consists of:

GCN(
  (initial_conv): GCNConv(9, 64)
  (conv1): GCNConv(64, 64)
  (conv2): GCNConv(64, 64)
  (conv3): GCNConv(64, 64)
  (out): Linear(192, 1)
)
  • Input: 9-dimensional node features (atomic properties)
  • Hidden Layers: 4 GCN layers with 64 hidden units each
  • Global Pooling: Combines mean, max, and add pooling
  • Output: Single regression value (solubility prediction)
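
The input width of `Linear(192, 1)` is consistent with concatenating the three graph-level readouts: mean, max, and add pooling over 64-dimensional node embeddings give 3 × 64 = 192 features. The dependency-free sketch below illustrates that arithmetic on plain Python lists. The function name `readout` and the toy embeddings are illustrative assumptions, not code from the notebook, which would typically rely on PyTorch Geometric's pooling operators.

```python
from typing import List


def readout(node_embeddings: List[List[float]]) -> List[float]:
    """Concatenate mean, max, and add (sum) pooling over the nodes of one graph."""
    num_nodes = len(node_embeddings)
    columns = list(zip(*node_embeddings))  # one tuple per feature dimension
    add_pool = [sum(col) for col in columns]
    mean_pool = [total / num_nodes for total in add_pool]
    max_pool = [max(col) for col in columns]
    return mean_pool + max_pool + add_pool


# Toy graph: 5 atoms, each with a 64-dimensional embedding (values are arbitrary).
hidden_size = 64
embeddings = [[float(atom + dim) for dim in range(hidden_size)] for atom in range(5)]

graph_vector = readout(embeddings)
print(len(graph_vector))  # 192 = 3 * 64, the input width of the final Linear layer
```

Because every pooling operator collapses the node dimension, the resulting vector has the same length for a molecule with 5 atoms or 50, which is what lets a single linear layer serve graphs of any size.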

🛠️ Installation & Setup

Prerequisites

  • Python 3.7+
  • CUDA-compatible GPU (recommended)
  • Google Colab or Jupyter Notebook environment

Required Libraries

# Core ML libraries
pip install torch==1.6.0
pip install torchvision==0.7.0

# PyTorch Geometric and dependencies
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv
pip install torch-geometric

# Chemistry libraries
pip install rdkit-pypi

# Visualization and utilities
pip install matplotlib seaborn pandas numpy

Quick Start

  1. Clone the repository:

    git clone https://github.com/erfan-nourbakhsh/GCN-PytorchGeometric.git
    cd GCN-PytorchGeometric
  2. Open the notebook:

    jupyter notebook "GCN-Pytorch Geometric.ipynb"
  3. Run all cells to reproduce the complete pipeline

📁 Project Structure

GCN-PytorchGeometric/
├── GCN-Pytorch Geometric.ipynb    # Main tutorial notebook
├── README.md                      # This file
└── data/                         # Generated during execution
    └── ESOL/                     # ESOL dataset files

🔬 Technical Deep Dive

Why Graph Neural Networks for Molecules?

Traditional approaches using SMILES strings directly as input have limitations:

  • Grammar Dependency: Models focus on SMILES syntax rather than molecular structure
  • Non-Unique Representation: Same molecule can have multiple valid SMILES strings
  • Limited Chemical Understanding: String-based models must infer which atoms are bonded from token order, so connectivity is only implicit

Graph Neural Networks address these issues by:

  • Representing molecules as graphs (atoms = nodes, bonds = edges)
  • Being invariant to atom ordering and notation variations
  • Capturing molecular structure and chemical relationships directly
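
To make the invariance point concrete, here is a minimal sketch in plain Python. It assumes three hand-written atom orderings of ethanol (corresponding to the SMILES strings `CCO`, `OCC`, and `C(O)C`) and a toy `embed` function that performs one neighbour-sum step followed by sum pooling. Both the function and the squaring nonlinearity are illustrative stand-ins, not the notebook's GCN.

```python
from typing import List, Tuple

Bond = Tuple[int, int]


def embed(atom_features: List[int], bonds: List[Bond]) -> int:
    """Toy graph embedding: one message-passing step, a nonlinearity, then sum pooling."""
    updated = list(atom_features)
    for a, b in bonds:  # bonds are undirected, so messages flow both ways
        updated[a] += atom_features[b]
        updated[b] += atom_features[a]
    # Squaring stands in for a learned nonlinear transform; summing is "add" pooling.
    return sum(value * value for value in updated)


# Node feature = atomic number (C = 6, O = 8); hydrogens are left implicit.
cco = embed([6, 6, 8], [(0, 1), (1, 2)])    # atom order as written in "CCO"
occ = embed([8, 6, 6], [(0, 1), (1, 2)])    # atom order as written in "OCC"
c_o_c = embed([6, 8, 6], [(0, 1), (0, 2)])  # atom order as written in "C(O)C"

print(cco, occ, c_o_c)  # 740 740 740
```

The three strings differ character by character, yet the graph embedding is identical, because neighbour aggregation and sum pooling do not depend on how the atoms are numbered.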

Graph Representation

Each molecule is converted to a graph where:

  • Nodes: Atoms with features (atomic number, hybridization, etc.)
  • Edges: Chemical bonds with attributes (bond type, stereochemistry)
  • Graph-level target: Molecular property (solubility)
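
As a rough illustration of this layout, the sketch below encodes ethanol by hand and converts its bond list into the two-row source/target (COO) edge format that PyTorch Geometric uses for `edge_index`, where each undirected bond appears once per direction. The helper `to_coo`, the single-feature node encoding, and the placeholder target are assumptions made for this example. The dataset itself provides 9 features per atom.

```python
from typing import List, Tuple

# Hand-written graph for ethanol (SMILES "CCO"); hydrogens are left implicit.
node_features: List[List[int]] = [[6], [6], [8]]  # one feature per atom: atomic number
bonds: List[Tuple[int, int, str]] = [(0, 1, "SINGLE"), (1, 2, "SINGLE")]
target = 0.0  # placeholder graph-level label, NOT a measured solubility


def to_coo(bond_list: List[Tuple[int, int, str]]) -> Tuple[List[List[int]], List[str]]:
    """Return (edge_index, edge_attr) with every bond listed in both directions."""
    sources: List[int] = []
    targets: List[int] = []
    attrs: List[str] = []
    for a, b, bond_type in bond_list:
        sources += [a, b]
        targets += [b, a]
        attrs += [bond_type, bond_type]
    return [sources, targets], attrs


edge_index, edge_attr = to_coo(bonds)
print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(edge_attr)   # ['SINGLE', 'SINGLE', 'SINGLE', 'SINGLE']
```

In a PyTorch Geometric workflow these lists would become tensors on a `Data` object (`x`, `edge_index`, `edge_attr`, `y`).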

📈 Results & Performance

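The notebook trains the model on the ESOL dataset. No benchmark numbers are quoted in this README; the points below summarize the setup and the observed training behavior: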

  • Training: Steady decrease in RMSE loss
  • Architecture: 4-layer GCN with global pooling
  • Features: 9-dimensional atomic descriptors
  • Prediction: Direct molecular property regression
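
For reference, RMSE is the square root of the mean squared difference between predictions and targets. The sketch below computes it in plain Python; the function name `rmse` and the sample values are made up for illustration and are not results from the notebook.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, for illustration only.
example_predictions = [-1.0, -2.0, -4.0]
example_targets = [-1.0, -3.0, -2.0]

# Errors are 0, 1, and -2, so the result is sqrt((0 + 1 + 4) / 3).
print(round(rmse(example_predictions, example_targets), 4))  # 1.291
```

Because RMSE is expressed in the same units as the target, a value of 1.0 here would mean predictions are off by roughly one unit of the solubility scale on average.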

🎨 Visualizations

The notebook includes:

  • Molecular Structures: Interactive 2D molecule rendering
  • Training Progress: Loss curves and convergence analysis
  • Dataset Exploration: Feature distributions and molecular diversity
  • Graph Properties: Node and edge statistics

🀝 Contributing

Contributions are welcome! Here's how you can help:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Ideas for Contributions

  • Add more molecular datasets (BACE, Tox21, etc.)
  • Implement different GNN architectures (GraphSAGE, GAT, etc.)
  • Add molecular descriptor calculations
  • Improve visualization capabilities
  • Add model interpretability features

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • PyTorch Geometric Team for the excellent graph neural network library
  • RDKit Community for comprehensive cheminformatics tools
  • MoleculeNet for providing standardized molecular datasets
  • Original Research that made this implementation possible

💬 Support

If you find this project helpful:

  • ⭐ Star the repository
  • 🐛 Report issues or bugs
  • 💡 Suggest new features
  • 📖 Share with the community

Ready to dive into molecular machine learning? Open the notebook and start exploring the fascinating world of Graph Neural Networks! 🧬✨
