A comprehensive hands-on tutorial implementing Graph Convolutional Networks (GCNs) for molecular property prediction using PyTorch Geometric. This project demonstrates how to predict water solubility of chemical compounds directly from their molecular structure using state-of-the-art graph neural networks.
This project provides a complete implementation of Graph Neural Networks for molecular property prediction, specifically focusing on predicting water solubility using the ESOL (Estimated SOLubility) dataset. The notebook covers everything from data preprocessing to model training and evaluation.
- Complete GNN Pipeline: From SMILES strings to molecular graphs to predictions
- ESOL Dataset: 1,128 compounds with water solubility data from MoleculeNet
- Graph Convolutional Networks: Multi-layer GCN implementation using PyTorch Geometric
- Molecular Visualization: Interactive molecule rendering with RDKit
- Performance Analysis: Training loss visualization and model evaluation
- How to convert SMILES representations to molecular graphs
- Understanding molecular fingerprints and graph-based representations
- Implementing Graph Convolutional Networks for regression tasks
- Working with chemical datasets and molecular descriptors
- Visualizing molecules and training progress
We use the ESOL (Estimated SOLubility) dataset from MoleculeNet:
- Size: 1,128 chemical compounds
- Task: Regression (water solubility prediction)
- Features: 9 node features per atom
- Target: Log solubility, i.e. log10 of the aqueous solubility in mol/L
- Format: SMILES strings converted to molecular graphs
"ESOL is a small dataset consisting of water solubility data for 1128 compounds. The dataset has been used to train models that estimate solubility directly from chemical structures (as encoded in SMILES strings)."
Our Graph Convolutional Network consists of:
GCN(
  (initial_conv): GCNConv(9, 64)
  (conv1): GCNConv(64, 64)
  (conv2): GCNConv(64, 64)
  (conv3): GCNConv(64, 64)
  (out): Linear(192, 1)
)
- Input: 9-dimensional node features (atomic properties)
- Hidden Layers: 4 GCN layers with 64 hidden units each
- Global Pooling: Combines mean, max, and add pooling
- Output: Single regression value (solubility prediction)
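The input size of the final linear layer (192) is consistent with concatenating three pooled vectors of 64 values each. The following is a minimal pure-Python sketch of such a readout, not the notebook's code: the function name `global_readout` and the concatenation order (mean, max, add) are assumptions made for illustration, and a toy hidden size of 4 stands in for 64.

```python
from typing import List


def global_readout(node_embeddings: List[List[float]]) -> List[float]:
    """Concatenate mean, max, and add (sum) pooling over the nodes of one graph.

    The concatenation order is an illustrative assumption.
    """
    num_nodes = len(node_embeddings)
    # Transpose so that each tuple holds one feature dimension across all nodes.
    columns = list(zip(*node_embeddings))
    add_pool = [sum(col) for col in columns]
    mean_pool = [total / num_nodes for total in add_pool]
    max_pool = [max(col) for col in columns]
    return mean_pool + max_pool + add_pool


# Toy molecule: 3 atoms with hidden size 4 (the model above uses 64).
h = [
    [1.0, 0.0, 2.0, -1.0],
    [3.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 1.0, 1.0],
]
graph_vector = global_readout(h)
print(len(graph_vector))  # 3 * hidden size
```

Whatever the number of atoms, the readout has a fixed length of three times the hidden size, which is what lets a single linear layer map it to one solubility value.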
- Python 3.7+
- CUDA-compatible GPU (recommended)
- Google Colab or Jupyter Notebook environment
# Core ML libraries
pip install torch==1.6.0
pip install torchvision==0.7.0
# PyTorch Geometric and dependencies
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv
pip install torch-geometric
# Chemistry libraries
pip install rdkit-pypi
# Visualization and utilities
pip install matplotlib seaborn pandas numpy
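After installing, it can help to confirm that every package is importable before opening the notebook. The sketch below uses only the standard library; the helper name `check_modules` is hypothetical, and the import names are listed explicitly because some differ from the pip package names (for example `rdkit-pypi` is imported as `rdkit`).

```python
import importlib.util
from typing import Dict, Iterable

# Import names, which are not always the same as the pip package names.
REQUIRED_MODULES = (
    "torch", "torchvision",
    "torch_scatter", "torch_sparse", "torch_cluster", "torch_spline_conv",
    "torch_geometric",
    "rdkit",
    "matplotlib", "seaborn", "pandas", "numpy",
)


def check_modules(names: Iterable[str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to True if it can be located, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:20s} {'ok' if found else 'MISSING'}")
```

This only checks that the modules can be found; it does not verify that the installed versions are compatible with each other.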
- Clone the repository:
  git clone https://github.com/erfan-nourbakhsh/GCN-PytorchGeometric.git
  cd GCN-PytorchGeometric
- Open the notebook:
  jupyter notebook "GCN-Pytorch Geometric.ipynb"
- Run all cells to reproduce the complete pipeline
GCN-PytorchGeometric/
├── GCN-Pytorch Geometric.ipynb # Main tutorial notebook
├── README.md # This file
└── data/ # Generated during execution
    └── ESOL/ # ESOL dataset files
Traditional approaches using SMILES strings directly as input have limitations:
- Grammar Dependency: Models focus on SMILES syntax rather than molecular structure
- Non-Unique Representation: Same molecule can have multiple valid SMILES strings
- Limited Chemical Understanding: String-based models miss spatial relationships
Graph Neural Networks solve these issues by:
- Representing molecules as graphs (atoms = nodes, bonds = edges)
- Being invariant to atom ordering and notation variations
- Capturing molecular structure and chemical relationships directly
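The ordering invariance can be illustrated with a small pure-Python sketch. It applies one weight-free, GCN-style aggregation step (self-loops plus symmetric degree normalisation) followed by a sum readout to ethanol written in two atom orders, corresponding to the SMILES strings `CCO` and `OCC`. The function names and the single feature (atomic number) are assumptions for illustration, not the notebook's code.

```python
import math
from typing import List, Tuple


def gcn_aggregate(features: List[float], bonds: List[Tuple[int, int]]) -> List[float]:
    """One weight-free GCN-style step on scalar node features.

    Adds a self-loop to every node, then sums neighbour features scaled by
    1 / sqrt(deg_i * deg_j), where degrees include the self-loop.
    """
    n = len(features)
    neighbours = [{i} for i in range(n)]  # self-loops
    for a, b in bonds:                    # bonds are undirected
        neighbours[a].add(b)
        neighbours[b].add(a)
    deg = [len(nb) for nb in neighbours]
    return [
        sum(features[j] / math.sqrt(deg[i] * deg[j]) for j in sorted(neighbours[i]))
        for i in range(n)
    ]


def readout(features: List[float], bonds: List[Tuple[int, int]]) -> float:
    """Sum pooling over the aggregated node values."""
    return sum(gcn_aggregate(features, bonds))


# Ethanol heavy atoms in two orders; the feature is the atomic number.
r1 = readout([6.0, 6.0, 8.0], [(0, 1), (1, 2)])  # order C, C, O  ("CCO")
r2 = readout([8.0, 6.0, 6.0], [(0, 1), (1, 2)])  # order O, C, C  ("OCC")

print("CCO" == "OCC")        # the strings differ
print(math.isclose(r1, r2))  # the graph readouts agree
```

The per-atom values are simply permuted between the two orderings, so any symmetric readout (sum, mean, max) gives the same graph-level result up to floating-point rounding.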
Each molecule is converted to a graph where:
- Nodes: Atoms with features (atomic number, hybridization, etc.)
- Edges: Chemical bonds with attributes (bond type, stereochemistry)
- Graph-level target: Molecular property (solubility)
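As a rough sketch of this layout, the snippet below builds a graph for ethanol in plain Python. The `MolGraph` class and `build_graph` helper are hypothetical; the field names mirror the `x`, `edge_index`, `edge_attr`, and `y` attributes of a PyTorch Geometric data object, with each bond stored once per direction. Only two illustrative node features are used (atomic number and heavy-atom degree) rather than the dataset's nine, and the target is a placeholder, not a measured solubility.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MolGraph:
    x: List[List[float]]          # node features, one row per atom
    edge_index: List[List[int]]   # [sources, targets]; each bond appears in both directions
    edge_attr: List[List[float]]  # one row per directed edge (here: bond order)
    y: float                      # graph-level target


def build_graph(
    atomic_numbers: Sequence[int],
    bonds: Sequence[Tuple[int, int, float]],
    target: float,
) -> MolGraph:
    """Build a graph from atoms and (atom_i, atom_j, bond_order) triples."""
    degree = [0] * len(atomic_numbers)
    src: List[int] = []
    dst: List[int] = []
    edge_attr: List[List[float]] = []
    for i, j, order in bonds:
        degree[i] += 1
        degree[j] += 1
        # Store the undirected bond as two directed edges.
        src += [i, j]
        dst += [j, i]
        edge_attr += [[order], [order]]
    x = [[float(z), float(d)] for z, d in zip(atomic_numbers, degree)]
    return MolGraph(x=x, edge_index=[src, dst], edge_attr=edge_attr, y=target)


# Ethanol (heavy atoms only): C-C-O with two single bonds.
# target=0.0 is a placeholder value, not real data.
ethanol = build_graph([6, 6, 8], [(0, 1, 1.0), (1, 2, 1.0)], target=0.0)
print(ethanol.x)
print(ethanol.edge_index)
```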
The model achieves competitive performance on the ESOL dataset:
- Training: Steady decrease in RMSE loss
- Architecture: 4-layer GCN with global pooling
- Features: 9-dimensional atomic descriptors
- Prediction: Direct molecular property regression
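For reference, the RMSE mentioned above is the square root of the mean squared difference between predictions and targets. A minimal sketch follows; the function name `rmse` and the example numbers are made up for illustration and are not results from the notebook.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root-mean-square error between two equally long, non-empty sequences."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("predictions and targets must be non-empty and equally long")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, chosen only to exercise the function.
example_predictions = [-1.0, -2.5, -4.0]
example_targets = [-1.0, -2.0, -3.0]
print(rmse(example_predictions, example_targets))
```

Because the target is a log solubility, the RMSE is expressed in the same log units, so it reads as a typical error in orders of magnitude of solubility.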
The notebook includes:
- Molecular Structures: Interactive 2D molecule rendering
- Training Progress: Loss curves and convergence analysis
- Dataset Exploration: Feature distributions and molecular diversity
- Graph Properties: Node and edge statistics
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- Add more molecular datasets (BACE, Tox21, etc.)
- Implement different GNN architectures (GraphSAGE, GAT, etc.)
- Add molecular descriptor calculations
- Improve visualization capabilities
- Add model interpretability features
- Semi-Supervised Classification with Graph Convolutional Networks
- MoleculeNet: A Benchmark for Molecular Machine Learning
- Geometric Deep Learning on Graphs and Manifolds
This project is licensed under the MIT License - see the LICENSE file for details.
- PyTorch Geometric Team for the excellent graph neural network library
- RDKit Community for comprehensive cheminformatics tools
- MoleculeNet for providing standardized molecular datasets
- Original Research that made this implementation possible
If you find this project helpful:
- Star the repository
- Report issues or bugs
- Suggest new features
- Share with the community
Ready to dive into molecular machine learning? Open the notebook and start exploring the fascinating world of Graph Neural Networks! 🧬✨