This shows up as the top kernel in the "flagship" config (except that it uses 0M instead of 1M microphysics):
The profile above is from results/nsys/baseline.nsys-rep, at commit a0bb502 of this repo.
I've tried a few modifications, which are summarized here. So far, none have sped up the simulation; they have only improved the registers-per-thread and occupancy metrics.