Expected behaviour
Analysis of a Universe with on-the-fly transformations scales well (reasonably) across cores.
Actual behaviour
Scaling performance is poor, even with only two cores.
Code
import MDAnalysis as mda
from MDAnalysis import transformations as trans
from pmda.rms.rmsd import RMSD as parallel_rmsd

u = mda.Universe(files['PDB'], files['LONG_TRAJ'])  # 9000 frames
fit_trans = trans.fit_rot_trans(u.atoms, u.atoms)
u.trajectory.add_transformations(fit_trans)

n_jobs = [1, 2, 4, 8, 16, 32, 64]
rmsd = parallel_rmsd(u.atoms, u.atoms)
for nj in n_jobs:
    rmsd.run(n_blocks=nj, n_jobs=nj)  # timed (e.g. with %timeit) for each nj
Reason
Some Transformations include numpy.dot, which is itself multi-threaded, so the cores become oversubscribed when it runs inside multiple parallel workers.
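A minimal sketch of one mitigation: pin numpy's BLAS thread pools to one thread per process via environment variables. Which variables matter depends on the BLAS numpy was built against; the three below cover the common backends, and they must be set before numpy is first imported (with pmda/dask, in every worker process as well):

```python
import os

# Must be set before numpy (and its BLAS) is first imported; with
# pmda/dask they need to be set in each worker process too.
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-based BLAS builds
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL

import numpy as np

# numpy.dot now runs single-threaded in this process, so n_jobs dask
# workers no longer multiply into n_jobs * n_cores threads.
a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
c = np.dot(a, b)
print(c.shape)
```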
Possible solution
- Limit the number of threads numpy uses (https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads), which is surprisingly faster even for serial (single-core) performance.
- Use cupy (https://cupy.dev/) to leverage the GPU, replacing only the numpy.dot operation of the Transformation.
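For the cupy route, only the matrix multiply inside the transformation's hot path needs to change. A hedged sketch with a hypothetical fit_rotation helper standing in for the Transformation's dot product (it falls back to numpy when CuPy or a CUDA GPU is unavailable, so it stays runnable):

```python
import numpy as np

try:
    import cupy as cp  # assumption: CuPy and a working CUDA GPU are present
    xp = cp
except ImportError:
    xp = np  # no CuPy installed: fall back to numpy so the sketch still runs


def fit_rotation(coords, rot_matrix):
    # Hypothetical stand-in for the dot product inside a Transformation:
    # only this matrix multiply is offloaded; everything else stays numpy.
    c = xp.asarray(coords)       # host -> device copy when xp is cupy
    r = xp.asarray(rot_matrix)
    out = c @ r.T
    # Move the result back to host memory if it lives on the GPU.
    return out.get() if xp is not np else out


coords_in = np.random.rand(1000, 3)
rotated = fit_rotation(coords_in, np.eye(3))  # identity leaves coords unchanged
print(rotated.shape)
```

Note that the host/device copies have their own cost, so this only pays off when the multiplied arrays are large enough.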
Benchmarking result


- Benchmarking system:
  - AMD EPYC 7551 32-Core Processor
  - RTX 2080 Ti
  - CephFS file system
Current version of MDAnalysis:
(run python -c "import MDAnalysis as mda; print(mda.__version__)") 2.0.0-dev
(run python -c "import pmda; print(pmda.__version__)")
(run python -c "import dask; print(dask.__version__)")