Submitting jobs to PBS on glacier.westgrid.ca
---------------------------------------------
Glacier is an IBM eServer BladeCenter HS20 with 60 chassis, 14 blades
per chassis, for a total of 840 nodes. Each node consists of dual
3.0GHz processors with 2 to 4GB RAM, running RedHat 9.0.
Machines are identified as 'iceM_N', where M is the chassis number
from 1 to 60 and N is the blade number, from 1 to 14.
Within each chassis, blades are connected through a Foundry FastIron
1500 plus a Foundry FastIron 800, linked by an 8 GigE trunk. Traffic
between chassis goes over 4 GigE uplinks.
Since the connections between processors in separate chassis are slow
compared to the connections between processors in a single chassis,
one might as well abort jobs whose nodes do not end up concentrated in
a single chassis.
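As a rough sketch (assuming the iceM_N naming described above), a job
script could inspect $PBS_NODEFILE and bail out early when its nodes
span more than one chassis:
----
#!/bin/sh
# Count the distinct chassis numbers (the M in iceM_N) among our nodes.
chassis=`sed 's/^ice\([0-9]*\)_.*/\1/' $PBS_NODEFILE | sort -u | wc -l`
if [ $chassis -gt 1 ]; then
    echo "Job spans $chassis chassis; aborting." 1>&2
    exit 1
fi
----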
One can request specific blades using a line like:
#PBS -l nodes=ice10_1:ppn=2+ice10_2:ppn=2
- Environment variables are not automatically propagated to the MPI
hosts, so it is necessary to set variables like OMP_NUM_THREADS in
~/.login or ~/.bashrc instead.
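For example (using 2 threads to match the two processors per node),
one might add to ~/.bashrc:
export OMP_NUM_THREADS=2
or the csh equivalent in ~/.login:
setenv OMP_NUM_THREADS 2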
- mpirun must be told which nodes to use. Use the -machinefile switch
and pass it the name of the node file, which is made available by
PBS in the environment variable $PBS_NODEFILE. For example:
mpirun -np 2 -machinefile $PBS_NODEFILE executable [arguments]
- Resources are specified using #PBS -l xxx entries. In particular,
the nodes=N:ppn=M entry must be chosen carefully. Generally the best
performance is obtained when one MPI process per node is used, and
OpenMP is used on each node to utilize the available processors.
One would assume that the processors should be requested like this:
#PBS -l nodes=2:ppn=2
Two nodes are requested, and both processors on each node should
be used. The node file generated by PBS is incorrect though, and if
the processors are requested in this manner, MPI will start both
processes on one node and the second will not be used at all.
It is possible to avoid this problem by re-writing the nodes file to
the correct format. The following script reads the existing node
file from standard input and writes a corrected one to standard
output:
----
#!/usr/bin/env python
import sys

# The PBS node file (read on standard input) lists each node once per
# requested processor; count the occurrences of each node ...
nodes = {}
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    nodes[line] = nodes.get(line, 0) + 1

# ... and write one "node:count" line per node, the format that
# mpirun's -machinefile switch expects.
for n in nodes:
    sys.stdout.write(n + ":" + str(nodes[n]) + "\n")
----
Save this somewhere like ~/bin/rejigger_nodes.py.
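For example, assuming the usual PBS layout where each node appears
once per requested processor, nodes=2:ppn=2 might produce a node file
like the left-hand column, which the script rewrites as shown on the
right:
----
ice10_1            ice10_1:2
ice10_1     -->    ice10_2:2
ice10_2
ice10_2
----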
Do not simply use a line like #PBS -l nodes=2:ppn=1 when in fact you
intend to use both processors on each node. If the resource manager
is not informed that your job is using both processors, then it may
schedule another job on the apparently unused processor. Both
processes will have to fight for CPU time, and both will be slower
as a result.
- Sample PBS.sh file:
----
#!/bin/bash
#PBS -S /bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l mem=2gb
#PBS -l walltime=2:00:00
#PBS -M mhughe@uvic.ca
#PBS -m bea
#PBS -N pg_ref_30
#PBS -W x="QOS:parallel"
cd /global/scratch/username/dir/
cat $PBS_NODEFILE | ~/bin/rejigger_nodes.py > nodes.txt
mpirun -np 2 -machinefile nodes.txt /path/to/program args > output.txt 2>&1
----
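The script would then be submitted with qsub and the job monitored
with qstat:
qsub PBS.sh
qstat -u username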
It may be necessary to use the -B flag to Phred. The MPICH
implementation used on Glacier has a bug that seems to manifest itself
when using lots of non-blocking communication (as Phred does).
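In the sample script above, that would mean adding -B to the program's
own arguments, something like:
mpirun -np 2 -machinefile nodes.txt /path/to/program -B args > output.txt 2>&1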
** IMPORTANT **
Clean up output files before re-running jobs. None of the files Phred
writes to should exist before it is run. It may also be a good idea to
get rid of the output of rejigger_nodes.py, any PI* files, and any
*.{e,o}* files that exist before restarting things.
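As a sketch (the exact names depend on the job and its output files),
the cleanup might look like:
----
rm -f nodes.txt output.txt    # rejigger_nodes.py output and program output
rm -f PI*                     # stale MPICH scratch files
rm -f *.e[0-9]* *.o[0-9]*     # PBS stderr/stdout from earlier runs
----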