GridMPI extends YAMPII, a cluster MPI implementation, with communication support for the Grid environment. GridMPI includes YAMPII, and thus can be used in clusters as well as in the Grid. The YAMPII protocol is used for intra-cluster communication, and protocols for the Grid are used for inter-cluster communication. Currently, IMPI (Interoperable MPI) is supported for inter-cluster communication. As an intra-cluster transport, GridMPI can use TCP/IP, a platform-supplied Vendor MPI, or the PM protocol on some cluster systems. As an inter-cluster transport, GridMPI uses only TCP/IP.
GridMPI follows the IMPI (Interoperable MPI) standard for inter-cluster communication. In the following explanation, some terminology from the IMPI standard is used.
An MPI application using IMPI consists of multiple clients and one IMPI server.
A client is one MPI job, consisting of some number of MPI processes normally started by mpirun. A client typically corresponds to a cluster. Clients are numbered sequentially from 0 to the number of clients minus one. The IMPI standard limits the maximum number of clients to 32.
An IMPI server is a process that accepts contacts from the clients. The server listens on a TCP/IP port and waits for connections from the clients. It acts as an information exchange for the clients, which need to know the IP address/port pairs of the other clients. The clients connect to each other after obtaining this information from the server.
After the information exchange, the IMPI server does nothing but wait until all clients join at MPI_Finalize. One server is needed for each run of an MPI application.
An IMPI server is invoked by specifying the number of clients (M) to use.
$ impi-server -server M
When invoked, the IMPI server prints to stdout the IP address/port pair on which it is listening. The clients must specify this address/port pair at startup.
An IMPI server normally exits when the MPI application exits. Thus, an IMPI server needs to be invoked each time before an application is started.
A client is started with a client number and the IMPI address/port pair.
$ mpirun -client K addr:port -np N ./a.out
K ranges from 0 to the number of clients minus one. Lower process ranks are assigned to the clients invoked with lower client numbers K. addr:port is the address/port pair that the IMPI server printed out.
The total number of processes (NPROCS) in an MPI application is the sum of the N values of all clients. For example, if client 0 is started with -np 2 and client 1 with -np 3, NPROCS is 5, ranks 0-1 belong to client 0, and ranks 2-4 belong to client 1. The number of processes in a client is specified by the -np N argument of mpirun and by the configuration file, as in normal invocations. The configuration file of GridMPI/YAMPII is named mpi_conf by default.
The figure below depicts the process structure of GridMPI when two processes are started in each of two clusters.
$ export IMPI_AUTH_NONE=0
$ impi-server -server 2 &
$ mpirun -client 0 addr:port -np 2 ./a.out &
$ mpirun -client 1 addr:port -np 2 ./a.out
mpirun -client K ... starts a single client; each client is started by a separate mpirun invocation.
                      IMPI Protocol
        +---------+===================+---------+
        |         |                   |         |
  +-----|---------|-----+       +-----|---------|-----+       +--------+
  | +-------+ +-------+ |       | +-------+ +-------+ |       |  IMPI  |
  | | rank0 | | rank1 | |       | | rank2 | | rank3 | |       | Server |
  | +-------+ +-------+ |       | +-------+ +-------+ |       +--------+
  |     |         |     |       |     |         |     |
  |     +=========+     |       |     +=========+     |
  |   YAMPII Protocol   |       |   YAMPII Protocol   |
  +---------------------+       +---------------------+
    mpirun -client 0              mpirun -client 1
To use GridMPI on PC clusters, follow the steps below:
(1) Set the environment variables.
$MPIROOT needs to be set to the installation directory. The commands, include files, and libraries of GridMPI are installed under the directories $MPIROOT/bin, $MPIROOT/include, and $MPIROOT/lib.
Add the $MPIROOT setting to .profile, .cshrc, etc., and add $MPIROOT/bin to the PATH. The examples below assume /opt/gridmpi as MPIROOT.
(For sh/bash)
$ MPIROOT=/opt/gridmpi; export MPIROOT
$ PATH="$MPIROOT/bin:$PATH"; export PATH
(For csh/tcsh)
% setenv MPIROOT /opt/gridmpi
% set path=($MPIROOT/bin $path)
MPI processes in a cluster are started with rsh (remote shell) by default, so startup fails when the cluster environment does not support rsh. Set the environment variable _YAMPI_RSH to use ssh instead. When using ssh, either the key must have no passphrase, or ssh-agent shall be used. See [FAQ].
(For sh/bash)
$ _YAMPI_RSH="ssh -x"; export _YAMPI_RSH
(For csh/tcsh)
% setenv _YAMPI_RSH "ssh -x"
(2) Check the installation.
Check the contents of the directory.
$MPIROOT/bin: mpirun, mpicc, mpif77, mpif90, ...
$MPIROOT/include: mpi.h, mpif.h, mpi-1.h, mpi-2.h, mpic++.h
$MPIROOT/lib: libmpi.a
Check the command paths.
$ which mpicc
$ which mpirun
(3) Compile the application.
$ mpicc mpiprog.c
The default compilers are the ones found at configuration time. They can be changed with the environment variables _YAMPI_CC, _YAMPI_CXX, _YAMPI_F77, and _YAMPI_F90.
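For a quick check of the toolchain, a minimal MPI program such as the following can be compiled and run (a sketch only; the file name mpiprog.c is just the placeholder used in the command above):

/* mpiprog.c: minimal test program (illustrative). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);                  /* initialize the MPI library */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes (NPROCS) */
    printf("rank %d of %d\n", rank, nprocs);
    MPI_Finalize();                          /* shut down the MPI library */
    return 0;
}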
(4) Create a configuration file.
mpirun reads the configuration from a file (mpi_conf in the current directory by default). mpi_conf holds a list of host names, one host per line. It is an error if the number of hosts is less than the number of processes specified by the -np argument of mpirun.
Contents of mpi_conf:
localhost
localhost
localhost
localhost
mpirun understands some of the extensions of the MPICH configuration file format, including a command argument for non-SPMD (Single Program, Multiple Data) execution.
(5) Start a program in a single cluster.
$ mpirun -np 4 ./a.out
(6) Start a program in multiple clusters.
$ export IMPI_AUTH_NONE=0                       ...(*1)
$ impi-server -server 2 &                       ...(*2)
$ mpirun -client 0 addr:port -np 2 ./a.out &    ...(*3)
$ mpirun -client 1 addr:port -np 2 ./a.out      ...(*4)
(*1) Setting IMPI_AUTH_NONE specifies that no authentication is used. Both impi-server and mpirun need the same setting.
(*2) Start the IMPI server. The server prints an IP address/port pair to stdout; pass it to mpirun in the next steps.
(*3, *4) Start the MPI processes. Normally, the two mpirun invocations are issued on different clusters.
To use GridMPI with IBM-MPI as the Vendor MPI, follow the steps below:
(1) Set the environment variables.
$MPIROOT needs to be set to the installation directory. The commands, include files, and libraries of GridMPI are installed under the directories $MPIROOT/bin, $MPIROOT/include, and $MPIROOT/lib.
Add the $MPIROOT setting to .profile, .cshrc, etc., and add $MPIROOT/bin to the PATH. The examples below assume /opt/gridmpi as MPIROOT.
(For sh/bash)
$ MPIROOT=/opt/gridmpi; export MPIROOT
$ PATH="$MPIROOT/bin:$PATH"; export PATH
(For csh/tcsh)
% setenv MPIROOT /opt/gridmpi
% set path=($MPIROOT/bin $path)
Step (1) is similar to the step in Using GridMPI on PC Clusters.
(2) Check the installation.
Check the contents of the directory.
$MPIROOT/bin: mpirun, mpicc, mpif77, mpif90, ...
$MPIROOT/include: mpi.h, mpif.h, mpi-1.h, mpi-2.h, mpic++.h
$MPIROOT/lib: libmpi.a (or libmpi32.a or libmpi64.a)
Check the command paths.
$ which mpicc
$ which mpirun
Check that xlc_r, xlC_r, xlf_r, and xlf90_r are in the PATH. Also check that the directory /usr/lpp/ppe.poe exists.
(3) Compile the application.
$ mpicc mpiprog.c        (32bit default, or configured without --with-binmode)
$ mpicc -q32 mpiprog.c   (for 32bit)
$ mpicc -q64 mpiprog.c   (for 64bit)
The default compilers are the ones found at configuration time. They can be changed with the environment variables _YAMPI_CC, _YAMPI_CXX, _YAMPI_F77, and _YAMPI_F90.
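For example, to select the thread-safe IBM C compiler explicitly (an illustrative setting; xlc_r is the compiler checked in step (2) above):

(For sh/bash)
$ _YAMPI_CC=xlc_r; export _YAMPI_CC
(For csh/tcsh)
% setenv _YAMPI_CC xlc_r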
(4) Create configuration files.
Contents of host.list1:
node00
node00
Contents of host.list2:
node01
node01
Contents of llfile:
#@job_type=parallel
#@resources=ConsumableCpus(2)
#@queue
(5) Start a program in a single cluster.
$ mpirun -np 4 ./a.out -llfile llfile
(6) Start a program in multiple clusters.
The following runs two MPI jobs with two processes each.
$ export IMPI_AUTH_NONE=0                                                   ...(*1)
$ impi-server -server 2 &                                                   ...(*2)
$ mpirun -client 0 addr:port -np 2 -c host.list1 ./a.out -llfile llfile &   ...(*3)
$ mpirun -client 1 addr:port -np 2 -c host.list2 ./a.out -llfile llfile     ...(*4)
(*1) Setting IMPI_AUTH_NONE specifies that no authentication is used. Both impi-server and mpirun need the same setting.
(*2) Start the IMPI server. The server prints an IP address/port pair to stdout; pass it to mpirun in the next steps.
(*3, *4) Start the MPI processes. Normally, the two mpirun invocations are issued on different clusters.
NOTE: -llfile llfile is not necessary when LoadLeveler is not used.
mpirun internally calls the poe command of IBM-MPI, and the -c option of mpirun is translated into the -hostfile option of poe.
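For example, a single-cluster run can name the hosts explicitly (an illustrative invocation using host.list1 from step (4); poe receives the file through its -hostfile option):

$ mpirun -np 2 -c host.list1 ./a.out -llfile llfile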
The Hitachi f90 compiler is set to the aggressive optimization level -Os as the site default. Some programs fail due to this aggressive optimization.
GridMPI uses IBM-MPI as the Vendor MPI, and mpirun calls the poe command of IBM-MPI internally. mpirun passes the arguments after the binary to the poe command unmodified; they are parsed and consumed by poe at its startup. The following example shows passing the -shared_memory option to poe.
$ mpirun -np N ./a.out -shared_memory yes
Some useful options of POE:
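For example (an illustrative, non-exhaustive list; each poe option corresponds to an MP_* environment variable such as MP_SHARED_MEMORY, and the IBM Parallel Environment documentation is the authoritative reference):

-procs N             number of processes
-hostfile FILE       host file to use
-shared_memory yes   use shared memory for intra-node communication
-euilib {ip|us}      communication subsystem (IP or User Space)
-labelio yes         label stdout/stderr lines with the task rank
-infolevel N         verbosity of diagnostic messages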
To use GridMPI with Fujitsu MPI as the Vendor MPI, follow the steps below:
(1) Set the environment variables.
$MPIROOT needs to be set to the installation directory. The commands, include files, and libraries of GridMPI are installed under the directories $MPIROOT/bin, $MPIROOT/include, and $MPIROOT/lib.
Add the $MPIROOT setting to .profile, .cshrc, etc., and add $MPIROOT/bin to the PATH. The examples below assume /opt/gridmpi as MPIROOT.
(For sh/bash)
$ MPIROOT=/opt/gridmpi; export MPIROOT
$ PATH="$MPIROOT/bin:/opt/FSUNaprun/bin:$PATH"; export PATH
(For csh/tcsh)
% setenv MPIROOT /opt/gridmpi
% set path=($MPIROOT/bin /opt/FSUNaprun/bin $path)
(2) Check the installation.
Check the contents of the directory.
$MPIROOT/bin: mpirun, mpicc, mpif77, mpif90, ...
$MPIROOT/include: mpi.h, mpif.h, mpi-1.h, mpi-2.h, mpic++.h
$MPIROOT/lib: libmpi.so, libmpi_frt.a, libmpi_gmpi.so
              (or libmpi32.so, libmpi_frt32.a, libmpi_gmpi32.so)
              (or libmpi64.so, libmpi_frt64.a, libmpi_gmpi64.so)
Check the command paths.
$ which mpicc
$ which mpirun
Check that c99, FCC, frt, and f90 are in the PATH. Also check that /opt/FJSVmpi2/bin/mpiexec exists.
(3) Compile the application.
$ mpicc mpiprog.c             (32bit default, or configured without --with-binmode)
$ mpicc -q32 mpiprog.c        (for 32bit)
$ mpicc -q64 -KV9 mpiprog.c   (for 64bit)
The default compilers are the ones found at configuration time. They can be changed with the environment variables _YAMPI_CC, _YAMPI_CXX, _YAMPI_F77, and _YAMPI_F90.
(4) (Create configuration files.) Configuration files are not needed with Fujitsu MPI; the global setting of the node is used.
(5) Start a program in a single cluster.
$ mpirun -np 4 ./a.out
(6) Start a program in multiple clusters.
The following runs two MPI jobs with two processes each.
$ export IMPI_AUTH_NONE=0                       ...(*1)
$ impi-server -server 2 &                       ...(*2)
$ mpirun -client 0 addr:port -np 2 ./a.out &    ...(*3)
$ mpirun -client 1 addr:port -np 2 ./a.out      ...(*4)
(*1) Setting IMPI_AUTH_NONE specifies that no authentication is used. Both impi-server and mpirun need the same setting.
(*2) Start the IMPI server. The server prints an IP address/port pair to stdout; pass it to mpirun in the next steps.
(*3, *4) Start the MPI processes. Normally, the two mpirun invocations are issued on different clusters.
In the Fujitsu MPI environment, the GridMPI runtime calls mpiexec (/opt/FJSVmpi2/bin/mpiexec) to start MPI processes. Options to mpirun are translated and passed to the Fujitsu runtime: -np becomes -n, and -c becomes -nl.
mpirun converts a host-list file passed to the -c option into a node list acceptable to the -nl option of Fujitsu mpiexec. The contents of the host-list file are matched against the Fujitsu MPI configuration file, and each hostname is converted to a node number. This is performed by the makenodelist.fjmpi.sh script in $MPIROOT/bin. Note that the file specified by -c consists of one host per line (no comments allowed), which differs from the format of the configuration file for clusters.
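For example, a host-list file for -c might look like the following (an illustrative file; node00 and node01 are placeholder host names that must appear in the Fujitsu MPI configuration file):

node00
node00
node01
node01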
When configured with Fujitsu MPI, mpirun also accepts the -nl option, which is passed to mpiexec unmodified, for example: -nl 0,0,0,0,0,0,0,0,0,0,...,0. Note that the number of nodes specified by the -nl option must be one more than the value passed to the -np option.
The gridmpirun script is a simple frontend that starts an impi-server and then starts the MPI processes via rsh/ssh. gridmpirun starts the impi-server on the local host, and then calls mpirun using rsh or ssh as specified by the configuration file (impi_conf by default).
The configuration file of gridmpirun can be specified by the -machinefile option.
(1) Create a gridmpirun configuration file.
Contents of impi_conf configuration file:
-np 2 -c host.list1
-np 2 -c host.list2
Contents of llfile:
#@job_type=parallel
#@resources=ConsumableCpus(2)
#@queue
(2) Start an MPI application.
$ gridmpirun -np 4 -machinefile impi_conf ./a.out -llfile llfile