Set Default Behavior to Stop Training Upon Convergence #16

Merged
michaelmckinsey1 merged 11 commits into LBANN:main from michaelmckinsey1:procruns
Feb 19, 2026

@michaelmckinsey1 michaelmckinsey1 commented Feb 6, 2026

  • Set the benchmark's default behavior to stop training once the target Dice score is reached (default 0.95), rather than after a fixed number of epochs.
  • Create a testing config that runs for only 10 epochs.
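
The stopping rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from this PR: `train_until_converged`, `run_epoch`, `target_dice`, and `max_epochs` are hypothetical names, and `run_epoch` is assumed to return the epoch-level validation Dice score.

```python
from typing import Callable, Optional


def train_until_converged(
    run_epoch: Callable[[int], float],
    target_dice: float = 0.95,
    max_epochs: Optional[int] = None,
) -> int:
    """Run epochs until the validation score exceeds target_dice.

    run_epoch is a hypothetical callback that trains and validates one
    epoch and returns its validation Dice score. max_epochs=None means
    "no epoch cap"; a testing config could pass max_epochs=10 instead.
    Returns the number of epochs that were run.
    """
    epoch = 0
    while max_epochs is None or epoch < max_epochs:
        val_score = run_epoch(epoch)
        epoch += 1
        if val_score > target_dice:
            print(
                f"val_score of {val_score} is > threshold of {target_dice}. "
                "Stopping."
            )
            break
    return epoch


# Usage with made-up scores standing in for real training.
fake_scores = [0.50, 0.80, 0.96, 0.99]
epochs_run = train_until_converged(lambda e: fake_scores[e])
capped_run = train_until_converged(lambda e: fake_scores[e], max_epochs=2)
```

With the made-up scores, the uncapped run stops after the third epoch (the first score above 0.95), while the capped run stops at the epoch limit without converging.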

@michaelmckinsey1 michaelmckinsey1 changed the title from "Enable Checkpoint Interval and Set Default Behavior to Stop Training Upon Convergence" to "Set Default Behavior to Stop Training Upon Convergence" Feb 6, 2026
f"val_score of {val_score} is > threshold of 0.95. Benchmark run complete. Wrapping up..."
)
return 0
dice_score_train = dice_sum

I think we should actually use val_score here in place of dice_sum. dice_sum is the per-batch dice score, whereas val_score is averaged over all batches in an epoch. So just replace this line with dice_score_train = val_score and then this PR is good to go.
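
To illustrate the distinction drawn in this comment, here is a small sketch with made-up numbers. It assumes the per-batch value holds the score of the most recent batch, which is an assumption for illustration; `summarize_epoch` and `batch_dice` are hypothetical names, not identifiers from the PR.

```python
from statistics import fmean


def summarize_epoch(batch_dice):
    """Return (last per-batch score, epoch-averaged score).

    The first value mimics a per-batch quantity (assumed here to be the
    most recent batch's score); the second mimics a score averaged over
    all batches in the epoch.
    """
    last_batch_score = batch_dice[-1]
    val_score = fmean(batch_dice)
    return last_batch_score, val_score


# Made-up per-batch Dice scores for one epoch.
batch_dice = [0.90, 0.92, 0.99]
last_batch_score, val_score = summarize_epoch(batch_dice)

# One strong batch clears the 0.95 threshold even though the epoch
# average does not, so the two values can disagree about convergence.
last_exceeds = last_batch_score > 0.95
avg_exceeds = val_score > 0.95
```

Recording the averaged value keeps the reported training score consistent with the quantity that the convergence check actually compares against the threshold.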

@michaelmckinsey1 michaelmckinsey1 merged commit a1f1bef into LBANN:main Feb 19, 2026
1 check passed

2 participants