Contents:
See the descriptions of the Relay: Overview of IMPI Relay and Using Relay.
Obsolete rationale: Global addresses were required because the GridMPI implementors judged that relaying/forwarding of messages would hurt performance with the technology of the time. Also, the relaying/forwarding topology was too restricted to allow experimenting with a variety of collective algorithms.
Inside a cluster, having multiple interfaces is not a problem, because the cluster MPI (YAMPI) uses the interfaces specified in a configuration file, where an interface is selected by the given hostname.
IMPI_PORT_RANGE=start:end
IMPI_SERVER_PORT_RANGE=start:end
start and end are inclusive.
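For example, a minimal sketch in sh syntax, assuming the site firewall permits ports 20000 through 20100 (the port numbers are hypothetical; pick ports allowed at your site):
IMPI_PORT_RANGE=20000:20100; export IMPI_PORT_RANGE
IMPI_SERVER_PORT_RANGE=20000:20100; export IMPI_SERVER_PORT_RANGE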
CC=gcc CXX=g++ F77=g77 F90=g77 ./configure    (for sh/ksh/bash)
env CC=gcc CXX=g++ F77=g77 F90=g77 ./configure    (for csh/tcsh)
See Installation Procedure, which includes some examples as "Configuration Templates".
The table below lists the related environment variables.
Fujitsu Compiler against GCC-Compiled GridMPI

# For Solaris/SPARC
_YAMPI_CC='c99'
_YAMPI_CXX='FCC'
_YAMPI_F77='frt'
_YAMPI_F90='f90'
_YAMPI_EXTCOPT=' -Knouse_rodata -mt -D_REENTRANT'
_YAMPI_EXTFOPT=' -Ar -f2004,1321 -mt -D_REENTRANT'
_YAMPI_EXTLIBS=' -L/usr/local/gcc-3.4.3/lib/ -lgcc_s \
    -lnsl -lsocket -lpthread -ldl'
_YAMPI_CC64FLAG='-KV9'
_YAMPI_LD64FLAG='-64'

# For Linux/IA32
_YAMPI_CC='fcc'
_YAMPI_CXX='FCC'
_YAMPI_F77='frt'
_YAMPI_F90='f90'
_YAMPI_EXTCOPT=' -Knouse_rodata -mt -D_REENTRANT'
_YAMPI_EXTFOPT=' -Ar -f2004,1321 -mt -D_REENTRANT'
_YAMPI_EXTLIBS=' /usr/FCC/lib/libgcccompat.a \
    -lnsl -lpthread -ldl'

Note that -D_REENTRANT and -lpthread would not be necessary if the Fujitsu compiler supported -mt.
You can store the settings in $HOME/.yampirc, which mpicc reads and executes as shell commands.
Note that the settings above may change depending on how GridMPI is configured. For example, if thread support is disabled when configuring GridMPI (with --disable-threads), options such as -lpthread should be omitted.
To make a full core dump on AIX (not only the stack, but also the data), set the _YAMPI_AIX_FULLCORE environment variable.
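For example (a minimal sketch; the value 1 is an assumption, as simply setting the variable may be what matters):
_YAMPI_AIX_FULLCORE=1; export _YAMPI_AIX_FULLCORE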
IMPI specifies the following: IMPI_CLIENT_SIZE, IMPI_CLIENT_COLOR, IMPI_HOST_SIZE, and IMPI_HOST_COLOR; they are defined in the IMPI specification.
GridMPI adds the following: YAMPI_PSP_MAXRATE and YAMPI_PSP_MATB. These are related to the PSPacer IP packet pacing module; see pspacer-2.1 for PSPacer. Briefly, YAMPI_PSP_MAXRATE takes the bandwidth of the network for inter-cluster communication (i.e., the physically available bandwidth). YAMPI_PSP_MATB tells PSPacer the share of the bandwidth allowed to this local node (i.e., the limit of the bandwidth).
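For example, a minimal sketch for a node on a 1 Gbps inter-cluster link shared equally by 8 local nodes (the figures are hypothetical, and the unit, bits per second, is an assumption; check the PSPacer documentation for the exact format it expects):
YAMPI_PSP_MAXRATE=1000000000; export YAMPI_PSP_MAXRATE
YAMPI_PSP_MATB=125000000; export YAMPI_PSP_MATB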
/tmp/yampiport_uid is a Unix domain socket. It is removed after initialization, but may remain if initialization fails.
/tmp/yampimem.uid.key is a file used for mapping shared memory. It is removed after initialization, but may remain if initialization fails.
/tmp/mpifork.uid is a fatal-error log. It contains messages that would otherwise be lost when printing to stdout/stderr fails (due to mpirun being killed). It is normally empty.
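If initialization fails and these files remain, they can be removed by hand. A minimal sketch (the wildcard patterns are an assumption; the actual names contain your numeric uid):
rm -f /tmp/yampiport_* /tmp/yampimem.* /tmp/mpifork.*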
To checkpoint an MPI job, press Ctrl-\, which sends a SIGQUIT signal. When checkpointing is done, ckpt.*.out and ckpt.*.out.img files are generated.
$ mpirun -np 2 ./a.out [Ctrl-\]
To restart the MPI job, pass -restart to mpirun.
$ mpirun -restart -np 2 ./a.out
NOTE (IA32): To use checkpointing with threads enabled, the kernel module proc_ckpt.ko must be installed. This module is not compiled by default, so you need to compile it manually in the checkpoint/src/module directory.
For more information, see the Checkpoint/Restart Implementation Status page.
% mpirun -np 4 ./a.out
mpifork[0000]: mpifork: Command not found.
FIX: mpirun is a script that calls the mpifork command. mpifork spreads itself via the remote shell onto the remote nodes before starting the application. The line above means that the remote shell failed to exec mpifork. Add the directory containing mpifork to the PATH in an rc-file that the remote shell reads.
There are also the environment variables _YAMPI_MPIFORK and _YAMPI_MPIRUN_USE_MPIROOT_BIN, which control this behavior.
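For example, a minimal sketch for bash, assuming GridMPI is installed under /usr/local/gridmpi (the installation path is hypothetical); put it in an rc-file read by non-interactive remote shells, such as ~/.bashrc:
PATH=/usr/local/gridmpi/bin:$PATH; export PATH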
HINT: Try mpirun -np n hostname; mpirun can start ordinary commands. Or try mpifork -v -np n hostname; the option -v makes mpifork print trace output, and -vv or -vvv makes it more verbose.
$ gridmpirun gridmpirun.conf
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
FIX: If you are using ssh-agent to avoid typing a passphrase, you need to allow agent forwarding for ssh. Add the following lines to "~/.ssh/config" or "/etc/ssh/ssh_config":
Host *
    ForwardAgent yes

To verify the setting, issue the following command:

ssh localhost ssh localhost date
RATIONALE: This happens because mpirun forks processes as a tree via rsh/ssh, and thus the agent's credentials need to be forwarded to intermediate nodes. The name of the forker program is mpifork. Directly calling mpifork -v -np n hostname may print helpful messages.
FIX: Add /opt/FSUNaprun/bin to the PATH.
[xx] YAMPI: fatal error (0x340f): TCPActiveOpen: read() failed on (aa.bb.cc.dd;xxxx): Connection reset by peer.
IOT Trap
FIX: Increase the listen backlog on the compute nodes. The default value is 128. To increase it to 10000, issue the following commands:
# echo 10000 > /proc/sys/net/core/somaxconn
export _YAMPI_SOMAXCONN=10000
IEEE-conforming behavior depends on compiler options. Use the following:
Other processors: Ultra-SPARC follows the IEEE. x86_64 uses the SSE registers by default and follows the IEEE.
GCC/IA64 (version 3) has no control over the precision and cannot be made IEEE conforming. Also note that Intel CC/IA64 has the option -no-IPF-fma, but it alone does not suffice at -O3.
IBM XLC optimizes aggressively, and -qfloat=nomaf alone works only at -O (-O2); -qstrict is also needed at -O3 or above (for XLC Version 6).
CG from the NPB (NAS Parallel Benchmarks) fails without these options (verification fails).
Although x86_64 runs in 64-bit mode by default, it works with small address offsets, so GCC and Intel CC on x86_64 need the option -mcmodel=medium to use 64-bit addressing for data accesses.
With GCC 3.x, GridMPI must be recompiled and reinstalled to use this option; otherwise linking against the MPI library fails. This is not needed with Intel CC or GCC 4.x. Disable PSPacer support (--without-libpsp) if compiling "libpsp.c" fails in PIC mode. Use the line below for reconfiguration.
# With GCC 3.x
CFLAGS="-mcmodel=medium" ./configure --without-libpsp    (for sh/ksh/bash)
env CFLAGS="-mcmodel=medium" ./configure --without-libpsp    (for csh/tcsh)
The CG benchmark does not converge in a heterogeneous environment, with any combination of Intel IA32, IA64, IBM Power, and Sun SPARC. This is because some processors compute with more precision than the IEEE floating-point formats specify. Also, some aggressive optimizations need to be disabled. See Pitfalls in Heterogeneous Environment.
The LU benchmark uses datatypes incorrectly: integers are exchanged as double floats. It is a simple mistake. The fix is the following: faq.lu.diff.txt
The FT benchmark fails with GNU g77 due to a duplicate declaration generated by "sys/setparams.c". It is a simple mistake. The fix is the following: faq.ft.diff.txt
The MG benchmark uses ambiguous message tags for NPROCS≥16. Tags are assigned per pair of dimension and direction, but they do not uniquely determine the processes when the mesh structure collapses at the lowest level. The fix is one of the following: faq.mg.diff.bar.txt or faq.mg.diff.tag.txt (the first adds extra barriers, the second uses properly unique tags).
CLASS=D benchmarks are too large for 32-bit addressing even on 64-bit machines, when the benchmarks are run with a relatively small number of processes. GCC and Intel CC on x86_64 need the option -mcmodel=medium to use 64-bit addressing for data accesses. See Compiling for Large Data.
Threads must be enabled to run the OpenMP code. However, the NPB code calls MPI_Init (which does not enable threads) instead of MPI_Init_thread, so the benchmarks would run without threads properly enabled. GridMPI has an option to make MPI_Init behave as though MPI_Init_thread were called; the feature is enabled by setting the environment variable _YAMPI_THREADS to a non-zero value.
_YAMPI_THREADS=1; export _YAMPI_THREADS
MEMO: Compiling the benchmarks needs -openmp, and static linking is needed when using GCC. The environment variable OMP_NUM_THREADS specifies the number of threads per process.
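A minimal sketch of running a hybrid benchmark with 2 processes and 4 threads per process (the executable name bt.A.2 is hypothetical):
_YAMPI_THREADS=1; export _YAMPI_THREADS
OMP_NUM_THREADS=4; export OMP_NUM_THREADS
mpirun -np 2 ./bt.A.2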
The Performance Test Suites (mpptest/perftest) from MPICH at ANL fail with a timeout. This is because the loop count becomes too large with GridMPI/YAMPI: the count is calculated from the value returned by MPI_Wtick to obtain a reasonable precision, and while MPI_Wtick of MPICH returns 1e-6, GridMPI/YAMPI returns a value like 1e-2 (the inverse of HZ). The easiest fix is to replace the call to MPI_Wtick with a small constant in "mpptest.c".
#if 0
    wtick = MPI_Wtick();
#else
    wtick = 1e-6;
#endif
See http://www-unix.mcs.anl.gov/mpi/mpich1/download.html for downloading the Performance Test Suites.
GridMPI follows the IMPI specification exactly, and uses the wire protocol and the collective algorithms defined there. Some parameters can be controlled by environment variables. Note that the default parameters of TCP/IP and of the wire protocol are not suitable for high-latency links.
_YAMPI_RSIZE=1024
It specifies the message size at which the protocol switches inside a cluster. _YAMPI_RSIZE makes MPI_Send switch to MPI_Rsend when the message size is equal to or larger than this value. GridMPI/YAMPI uses the eager protocol for MPI_Send and the rendezvous protocol for MPI_Rsend. Note that the rendezvous protocol handshakes and starts sending only when both the sender and the receiver are ready, which may avoid one copy into a temporary buffer.
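For example, a minimal sketch that switches to MPI_Rsend (rendezvous) for messages of 8192 bytes or larger (the threshold is hypothetical; tune it for your network):
_YAMPI_RSIZE=8192; export _YAMPI_RSIZE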
_YAMPI_SOCBUF=65536
IMPI_SOCBUF=20000000
These are the buffer sizes in bytes passed to setsockopt (for both send and receive), used for YAMPI and IMPI sockets, respectively. _YAMPI_SOCBUF controls the YAMPI/TCP connections, and IMPI_SOCBUF controls the IMPI connections. When _YAMPI_SOCBUF is specified but IMPI_SOCBUF is not, _YAMPI_SOCBUF is used for both. The default is 64*1024 bytes.
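For example, a minimal sketch that sizes the IMPI socket buffer roughly to the bandwidth-delay product of a 1 Gbps inter-cluster link with 20 ms RTT (about 125 MB/s * 0.02 s = 2.5 MB; the figures are hypothetical):
IMPI_SOCBUF=2500000; export IMPI_SOCBUF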
IMPI_C_DATALEN=2147483647
IMPI sends messages as chunks (IMPI packets). IMPI_C_DATALEN is the maximum chunk size in bytes. IMPI uses a rendezvous protocol when an MPI message is larger than this value. Rendezvous means handshaking between the sender and the receiver, so latency affects performance very much. Rendezvous can effectively be disabled by making IMPI_C_DATALEN infinitely large (2147483647=0x7fffffff). The default is 64*1024 bytes.
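For example, a minimal sketch that effectively disables rendezvous over the inter-cluster connection, as described above:
IMPI_C_DATALEN=2147483647; export IMPI_C_DATALEN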
IMPI_H_HIWATER=4000
IMPI_H_ACKMARK=10
IMPI also uses flow control of chunks. These variables specify the number of chunks that may be sent before ACKs are received. IMPI_H_HIWATER specifies the maximum number of outstanding chunks. The defaults are IMPI_H_ACKMARK=10 and IMPI_H_HIWATER=20. These values do not matter much when IMPI_C_DATALEN is set infinitely large.
IMPI_COLL_XSIZE IMPI_COLL_MAXLINEAR
IMPI switches the collective algorithms depending on the message size and the number of processes. The decision is made by these variables.
The parameters here prefixed with "IMPI_" are defined in the IMPI specification, so please check the specification, too. The full list of environment variables controlling GridMPI is shown in Environment Variables.
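For example, a minimal sketch (the values are hypothetical; IMPI_COLL_XSIZE is assumed here to be a message-size threshold in bytes and IMPI_COLL_MAXLINEAR a process-count threshold, as their names suggest; check the IMPI specification for the exact semantics):
IMPI_COLL_XSIZE=1024; export IMPI_COLL_XSIZE
IMPI_COLL_MAXLINEAR=4; export IMPI_COLL_MAXLINEAR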
GridMPI needs some settings to run under Condor, a workload management system for high-throughput computing (http://www.cs.wisc.edu/condor/). The following environment variables are necessary at minimum.
_YAMPI_RSH=$CONDOR_SSH
_YAMPI_MPIRUN_SPREAD=0
_YAMPI_MPIRUN_CHDIR=0
_YAMPI_RSH is needed because Condor provides its own SSH script, and using it is required.
_YAMPI_MPIRUN_SPREAD=0 disables forking as a tree (it sets the spreading factor at the root to infinity). Condor places a configuration file on the starting node, and the file is read for each invocation of SSH. This means SSH can only be invoked on the starting node, so forking as a tree cannot be used.
_YAMPI_MPIRUN_CHDIR=0 disables chdir on the nodes invoked via SSH. GridMPI normally tries to chdir to the directory where mpirun was invoked. However, Condor assigns a different working directory to each node and chdirs in its SSH script, so GridMPI need not chdir by itself.
See faq.condor.txt, the modified script "mp1script" for GridMPI.
The options -Wf,-pvctl,loopcnt=2147483647 or -Wf,-pvctl,vwork=stack to f90/sxf90 may work (note that 2147483647=0x7fffffff). Here, -Wf is a prefix to pass complex options to the compiler.

**** 96 Loop count is greater than that assumed by the compiler: loop-count=274625 eln=2342
PROG=input ELN=2366(40004064c) TASKID=1
Called from main_ ELN=275(40000654c)