
Fix #15: Resolve memory usage, error handling, and checkpointing issues in JetNet GNN pipeline #18

Open
Apprentice2907 wants to merge 2 commits into ML4SCI:main from Apprentice2907:Fiexed-JetNet-GNN-Diffusion-Pipeline

@Apprentice2907

Overview

This PR addresses all issues mentioned in #15 by improving memory efficiency, adding checkpoint functionality, enhancing error tracking, and making hyperparameters configurable.

Issues Fixed

Fixes #15

Changes Made

1. Memory Optimization

  • Reduced the default max_jets from 50,000 to 3,200
  • Created a configurable Config class for easy adjustment
  • Memory usage for loaded jets reduced by ~15x (50,000 / 3,200 ≈ 15.6), assuming it scales linearly with max_jets

2. Enhanced Error Handling

  • Added detailed error type tracking in collect_graph_and_targets()
  • Errors now categorized by type (e.g., no_valid_particles, graph_construction_failed)
  • Full details are printed for the first 5 errors to aid debugging
  • Summary breakdown shows count per error type
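
The error tracking described above could be sketched roughly as follows. This is not the code from this PR: the function name collect_graphs, the NoValidParticles exception, and the toy_builder stand-in are all hypothetical, and only the two category names and the "first 5 errors" limit come from the description.

```python
from collections import Counter


class NoValidParticles(ValueError):
    """Hypothetical marker exception for jets with nothing to build from."""


def collect_graphs(jets, build_graph, max_detailed=5):
    """Build one graph per jet and count failures by category."""
    graphs = []
    error_counts = Counter()
    detailed = []  # full details are kept only for the first few failures

    def record(kind, idx, exc):
        error_counts[kind] += 1
        if len(detailed) < max_detailed:
            detailed.append(f"jet {idx}: {kind}: {exc!r}")

    for idx, jet in enumerate(jets):
        try:
            if len(jet) == 0:
                raise NoValidParticles("empty jet")
            graphs.append(build_graph(jet))
        except NoValidParticles as exc:
            record("no_valid_particles", idx, exc)
        except Exception as exc:  # broad on purpose: any builder failure is counted
            record("graph_construction_failed", idx, exc)

    return graphs, error_counts, detailed


def toy_builder(jet):
    """Stand-in for the real graph builder (assumption for this sketch)."""
    if any(p < 0 for p in jet):
        raise RuntimeError("negative input")
    return {"nodes": list(jet)}


graphs, error_counts, detailed = collect_graphs(
    [[1.0, 2.0], [], [-1.0], [3.0]], toy_builder
)
for line in detailed:
    print(line)
for kind, count in error_counts.most_common():
    print(f"{kind}: {count}")
```

Counting by category keeps the summary short even when thousands of jets fail, while the capped detail list avoids flooding the log.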

3. Checkpoint System

  • Autoencoder: Saves checkpoints every 50 epochs
  • Diffusion Model: Saves checkpoints every 50 epochs
  • Automatic resume from the last checkpoint if training is interrupted
  • Embeddings saved after extraction for reuse
  • Final models and outputs saved automatically

Checkpoint files:

  • checkpoints/autoencoder_epoch_*.pt
  • checkpoints/diffusion_epoch_*.pt
  • checkpoints/embeddings.npy
  • checkpoints/*_final.pth
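
One way the resume step could locate the last checkpoint, given the file naming above, is sketched below. The helper names latest_checkpoint and epochs_to_run are hypothetical and not taken from code.py. The sketch only covers finding the file and choosing the starting epoch, and leaves the actual loading of model and optimizer state to the pipeline.

```python
import re
from pathlib import Path


def latest_checkpoint(checkpoint_dir, prefix):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    pattern = re.compile(rf"^{re.escape(prefix)}_epoch_(\d+)\.pt$")
    best = None
    for path in Path(checkpoint_dir).glob(f"{prefix}_epoch_*.pt"):
        match = pattern.match(path.name)
        if match is None:
            continue  # skip names like prefix_epoch_final.pt
        epoch = int(match.group(1))  # compare as integers, not strings
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return best


def epochs_to_run(total_epochs, checkpoint_dir, prefix):
    """Epochs still to train, continuing after the last saved one."""
    found = latest_checkpoint(checkpoint_dir, prefix)
    start = 0 if found is None else found[0]
    return range(start + 1, total_epochs + 1)
```

Parsing the epoch as an integer matters here: sorting the file names as strings would rank epoch_50 above epoch_100.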

4. Configurable Hyperparameters

All hyperparameters moved to Config class:

  • GNN architecture (dims, K, layers)
  • Decoder architecture (hidden dims)
  • Training parameters (learning rates, epochs, batch size)
  • Checkpointing settings

Example:

config = Config()  # assuming Config takes no required arguments
config.max_jets = 5000
config.batch_size = 16
config.autoencoder_epochs = 200
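
For readers who have not opened code.py, a Config class of this kind might look like the minimal sketch below. Only max_jets, batch_size, autoencoder_epochs, the 3,200 default, and the 50-epoch checkpoint interval come from this description. The remaining field names, the other default values, and the validate() helper are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Config:
    max_jets: int = 3200             # default stated in this PR
    batch_size: int = 32             # placeholder default (assumption)
    autoencoder_epochs: int = 100    # placeholder default (assumption)
    checkpoint_every: int = 50       # interval from this PR; field name assumed
    checkpoint_dir: str = "checkpoints"

    def validate(self):
        """Reject non-positive values before training starts."""
        for name in ("max_jets", "batch_size", "autoencoder_epochs", "checkpoint_every"):
            value = getattr(self, name)
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
        return self


config = Config()
config.max_jets = 5000
config.batch_size = 16
config.autoencoder_epochs = 200
config.validate()
```

Calling validate() after the overrides catches a bad value such as batch_size = 0 up front rather than partway through a long training run.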

Testing

  • Code runs without errors
  • Memory usage significantly reduced
  • Checkpoints save and load correctly
  • Error logging provides detailed breakdown
  • Configuration changes take effect

Files Changed

  • code.py - Main pipeline with all fixes applied

Additional Notes

  • No breaking changes to existing functionality
  • Backward compatible with existing workflow
  • All original features preserved

Development

Successfully merging this pull request may close these issues.

JetNet GNN + Diffusion Pipeline: Improvements & Bugs
