Expected behaviour
Analysis of a Universe with on-the-fly transformations scales well (reasonably) across cores.
Actual behaviour
Scaling performance is poor, even with only two cores.
Code
import MDAnalysis as mda
from MDAnalysis import transformations as trans
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['LONG_TRAJ'])  # 9000 frames
fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

n_jobs = [1, 2, 4, 8, 16, 32, 64]
rmsd = parallel_rmsd(u.atoms, u.atoms)
for nj in n_jobs:
    rmsd.run(n_blocks=nj, n_jobs=nj)  # timed (e.g. with %timeit) for each nj
Reason
Some Transformations include numpy.dot, which is itself multi-threaded, so the cores become oversubscribed when it runs inside multiple parallel workers.
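A minimal sketch of one mitigation: pin numpy's BLAS thread pools to one thread per process via environment variables. Which variables matter depends on the BLAS numpy was built against; the three below cover the common backends, and they must be set before numpy is first imported (with pmda/dask, in every worker process as well):

```python
import os

# Must be set before numpy (and its BLAS) is first imported; with
# pmda/dask they need to be set in each worker process too.
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL

import numpy as np

# numpy.dot now runs single-threaded in this process, so n_jobs dask
# workers no longer multiply into n_jobs * n_cores threads.
a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
c = np.dot(a, b)
print(c.shape)
```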
Possible solution
- Limit the number of threads numpy uses (https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads), which is surprisingly faster even for serial (single-core) performance.
- Use cupy (https://cupy.dev/) to leverage the GPU, replacing only the numpy.dot operation of the Transformation.
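For the cupy route, only the matrix multiply inside the transformation's hot path needs to change. A hedged sketch with a hypothetical fit_rotation helper standing in for the Transformation's dot product (it falls back to numpy when CuPy or a CUDA GPU is unavailable, so it stays runnable):

```python
import numpy as np

try:
    import cupy as cp  # assumption: CuPy and a working CUDA GPU are present
    xp = cp
except ImportError:
    xp = np  # no CuPy installed: fall back to numpy so the sketch still runs


def fit_rotation(coords, rot_matrix):
    # Hypothetical stand-in for the dot product inside a Transformation:
    # only this matrix multiply is offloaded; everything else stays numpy.
    c = xp.asarray(coords)       # host -> device copy when xp is cupy
    r = xp.asarray(rot_matrix)
    out = c @ r.T
    # Move the result back to host memory if it lives on the GPU.
    return out.get() if xp is not np else out


coords_in = np.random.rand(1000, 3)
rotated = fit_rotation(coords_in, np.eye(3))  # identity leaves coords unchanged
print(rotated.shape)
```

Note that the host/device copies have their own cost, so this only pays off when the multiplied arrays are large enough.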
Benchmarking result


- Benchmarking system:
  - AMD EPYC 7551 32-Core Processor
  - RTX 2080 Ti
  - CephFS file system
Current version of MDAnalysis:
(run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0-dev
(run python -c "import pmda; print(pmda.__version__)")
(run python -c "import dask; print(dask.__version__)")