FAQ, Tips, and Troubleshooting

Contents:

Frequently Asked Questions

Configuration

Do all hosts need global IP addresses?
No. GridMPI 2.x and later provides the IMPI Relay. Earlier versions required that each host in the cluster have a global IP address and be IP reachable.

See the description of the Relay: Overview of IMPI Relay and Using Relay.

Obsolete rationale: Global addresses were required because the GridMPI implementors judged that relaying/forwarding of messages would hurt performance too much with the technology of the time. Also, a relaying/forwarding topology was considered too restrictive to allow a variety of collective algorithms for experiments.

How do I select one of multiple network interfaces?
GridMPI uses the default network interface for global communication (normally the interface associated with the hostname). When a host has multiple network interfaces, one can be selected with the environment variable IMPI_NETWORK, which takes a network address such as "163.220.2.0".
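
For example, to select the interface on the 163.220.2.0 network (the address is taken from the example above; substitute your own):
IMPI_NETWORK=163.220.2.0; export IMPI_NETWORK (for sh/ksh/bash)
setenv IMPI_NETWORK 163.220.2.0 (for csh/tcsh)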

Inside a cluster, having multiple interfaces is not a problem, because the cluster-local MPI (YAMPI) uses the interfaces specified in the configuration file, where each interface is selected by the given hostname.

Do I need to specify a network interface for the IMPI server?
No. The IMPI server listens on an "any-address" (wildcard) port, so selecting a network interface is not necessary. Note that IMPI_NETWORK [FAQ] is for the clients (MPI processes), not for the IMPI server. The address printed by the IMPI server can simply be ignored, or it can be changed with the command-line option -host hostname to the impi-server command. Passing -host only changes the printed address to the one specified.
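
For example (only the printed address changes; any other impi-server options your setup requires stay the same, and the hostname here is hypothetical):
impi-server -host front.example.org
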
How do I limit a port range?
Ports used for communication can be limited to a specified range by two environment variables: IMPI_PORT_RANGE and IMPI_SERVER_PORT_RANGE. IMPI_PORT_RANGE specifies the range of ports on which to listen for IMPI connections. Similarly, IMPI_SERVER_PORT_RANGE specifies the range of ports on which the IMPI server listens. See Environment Variables.
IMPI_PORT_RANGE=start:end
IMPI_SERVER_PORT_RANGE=start:end

start and end are inclusive.
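
For example, to confine both to the range 20000-20100 (an arbitrary range chosen for illustration):
IMPI_PORT_RANGE=20000:20100; export IMPI_PORT_RANGE
IMPI_SERVER_PORT_RANGE=20000:20100; export IMPI_SERVER_PORT_RANGE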

Can I change CC in compiling GridMPI?
Yes. To compile GridMPI with a non-default C/C++ compiler, set the environment variables (CC and others) and invoke "configure". Specify the Fortran and C++ compilers too, because the compiler drivers use the ones found at configuration time. Use CFLAGS to pass options to CC. For example, the following line suffices to build with GCC.
CC=gcc CXX=g++ F77=g77 F90=g77 ./configure (for sh/ksh/bash)
env CC=gcc CXX=g++ F77=g77 F90=g77 ./configure (for csh/tcsh)

See Installation Procedure, which includes some examples as "Configuration Templates".

Can I change a compiler for application programs?
Yes. Application programs can be compiled with a compiler different from the one used to build the GridMPI library. The compilers are selected by the environment variables _YAMPI_CC, _YAMPI_CXX, _YAMPI_F77, and _YAMPI_F90. The default compilers are set at configuration time.

The example below lists the related environment variables.

Fujitsu Compiler against GCC-Compiled GridMPI
# For Solaris/SPARC
_YAMPI_CC='c99'
_YAMPI_CXX='FCC'
_YAMPI_F77='frt'
_YAMPI_F90='f90'
_YAMPI_EXTCOPT=' -Knouse_rodata -mt -D_REENTRANT'
_YAMPI_EXTFOPT=' -Ar -f2004,1321 -mt -D_REENTRANT'
_YAMPI_EXTLIBS=' -L/usr/local/gcc-3.4.3/lib/ -lgcc_s \
	 -lnsl -lsocket -lpthread -ldl'
_YAMPI_CC64FLAG='-KV9'
_YAMPI_LD64FLAG='-64'
# For Linux/IA32
_YAMPI_CC='fcc'
_YAMPI_CXX='FCC'
_YAMPI_F77='frt'
_YAMPI_F90='f90'
_YAMPI_EXTCOPT=' -Knouse_rodata -mt -D_REENTRANT'
_YAMPI_EXTFOPT=' -Ar -f2004,1321 -mt -D_REENTRANT'
_YAMPI_EXTLIBS=' /usr/FCC/lib/libgcccompat.a \
	 -lnsl -lpthread -ldl'

Note that -D_REENTRANT and -lpthread would not be necessary if the Fujitsu compiler supported -mt.

You can store these settings in $HOME/.yampirc, which mpicc reads and executes as shell commands.

Note that the settings above may need to change depending on the configuration of GridMPI. For example, if thread support is disabled when configuring GridMPI (with --disable-threads), options such as -lpthread should be omitted.

Can I stop core dumps on abort?
The behavior on abort is controlled by the environment variable _YAMPI_DUMPCORE. Setting _YAMPI_DUMPCORE=0 calls exit(3C) in an aborting situation, and setting _YAMPI_DUMPCORE=1 calls abort(3C). GridMPI dumps core by default, because an abort is an irregular condition. Setting _YAMPI_ABORT_ON_CLOSE=0/1/2 may also help suppress core dumps. See Environment Variables for the full list of environment variables.
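
For example, to exit without a core dump on abort:
_YAMPI_DUMPCORE=0; export _YAMPI_DUMPCORE
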
Can I change the naming of core files on IBM AIX?
The naming of core files (such as appending the PID and date) can be changed by setting the environment variable "CORE_NAMING" on AIX 5.2, or by using the "chcore" command on AIX 5.3. (Setting "CORE_NAMING=true" is enough to generate a separate core file for each process.)

To get a full core dump on AIX (not only the stack but also the data), set the "_YAMPI_AIX_FULLCORE" environment variable.
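
For example (setting _YAMPI_AIX_FULLCORE to 1 is an assumption here; the text above only says the variable must be set):
CORE_NAMING=true; export CORE_NAMING
_YAMPI_AIX_FULLCORE=1; export _YAMPI_AIX_FULLCORE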

Can I get the configuration of wide-area networks?
No, but you can retrieve the number of clusters and the number of processes in each cluster. IMPI defines two attribute keys describing the cluster configuration on MPI_COMM_WORLD (IMPI_CLIENT_SIZE and IMPI_CLIENT_COLOR). They indicate which cluster the process (myrank) belongs to. Counting the number of processes in each cluster is then simple (just do an allreduce). See faq.clustercolor.c.txt or faq.clustercolorf.f.txt.
What are the new attribute keys?
GridMPI adds new predefined attribute keys in the header files ("mpi.h" and "mpif.h"). Four are defined in the IMPI specification, and two are GridMPI extensions.

The IMPI specification defines the following: IMPI_CLIENT_SIZE, IMPI_CLIENT_COLOR, IMPI_HOST_SIZE, and IMPI_HOST_COLOR.

GridMPI adds the following: YAMPI_PSP_MAXRATE and YAMPI_PSP_MATB. These are related to the PSPacer IP packet-pacing module; see pspacer-2.1 for PSPacer. Briefly, YAMPI_PSP_MAXRATE holds the bandwidth of the network used for inter-cluster communication (i.e., the physically available bandwidth), and YAMPI_PSP_MATB tells PSPacer the share of that bandwidth allotted to the local node (i.e., its bandwidth limit).

Some files are created in /tmp. What are they?
GridMPI may leave some files in /tmp. All of them can be safely removed, even while MPI processes are running.

/tmp/yampiport_uid is a Unix domain socket. It is removed after initialization, but may remain if initialization fails. /tmp/yampimem.uid.key is a file used for mapping shared memory; it is likewise removed after initialization but may remain if initialization fails. /tmp/mpifork.uid is a fatal-error log: it holds messages that would otherwise be lost when printing to stdout/stderr fails (for example, because mpirun was killed). It is normally empty.

How do I checkpoint and restart an MPI job?
Checkpointing is enabled by setting the environment variable _YAMPI_CKPT to 1. If the Linux kernel version is 2.6.x, the sysctl parameter kernel.randomized_va_space should be set to 0.
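
For example (writing under /proc requires root privileges; the "#" denotes a root prompt):
# echo 0 > /proc/sys/kernel/randomized_va_space
_YAMPI_CKPT=1; export _YAMPI_CKPT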

To checkpoint an MPI job, press Ctrl-\, which sends a SIGQUIT signal. When checkpointing is done, ckpt.*.out and ckpt.*.out.img files are generated.

$ mpirun -np 2 ./a.out
[Ctrl-\]

To restart the MPI job, pass -restart to mpirun.

$ mpirun -restart -np 2 ./a.out

NOTE (IA32): To use checkpointing with threads enabled, the kernel module proc_ckpt.ko must be installed. This module is not built by default, so you need to build it manually in the checkpoint/src/module directory.

For more information, see the Checkpoint/Restart Implementation Status page.

Troubleshooting

MPIRUN Says "mpifork: Command not found".
CASE: The remote shell fails to start mpifork on a remote node.
% mpirun -np 4 ./a.out
mpifork[0000]: mpifork: Command not found.

FIX: mpirun is a script that calls the mpifork command. mpifork spreads itself via the remote shell to the remote nodes before starting the application. The message above means the remote shell failed to exec mpifork. Add the path setting to an rc-file that the remote shell reads.
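
For example, assuming GridMPI is installed under /usr/local/gridmpi (a hypothetical path; use your actual installation prefix), add a line like the following to the rc-file of the remote shell (e.g. ~/.bashrc):
PATH=/usr/local/gridmpi/bin:$PATH; export PATH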

There are a few environment variables, _YAMPI_MPIFORK and _YAMPI_MPIRUN_USE_MPIROOT_BIN, that control this behavior.

HINT: Try mpirun -np n hostname; mpirun can start ordinary commands. Or try mpifork -v -np n hostname. The option -v makes mpifork print trace output; use -vv or -vvv for more verbosity.

SSH Says "Permission denied".
CASE: ssh seems to fail to start processes.
$ gridmpirun gridmpirun.conf
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).

FIX: If you are using ssh-agent to avoid typing a passphrase, you need to allow forwarding of the agent (the ssh secret). Add the following lines to "~/.ssh/config" or "/etc/ssh/ssh_config":

Host *
   ForwardAgent yes
To verify the setting, issue the following command:
ssh localhost ssh localhost date

RATIONALE: This happens because mpirun forks processes as a tree via rsh/ssh, so the secret needs to be forwarded. The forker program is named mpifork. Calling mpifork -v -np n hostname directly may print helpful messages.

mpirun on Fujitsu Solaris8/SPARC64V stops with a message "/opt/FJSVmpi2/bin/mpiexec[15]: aplpg: not found".
CASE: GridMPI on Fujitsu Solaris8/SPARC64V uses Fujitsu MPI as the underlying transport (known as Vendor MPI). mpirun invokes mpiexec of Fujitsu MPI, and mpiexec in turn invokes the aplpg and aprun commands, both of which must be found in the PATH.

FIX: Add /opt/FSUNaprun/bin to the PATH.
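
For example:
PATH=/opt/FSUNaprun/bin:$PATH; export PATH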

When I launch a large MPI job, mpirun stops with a message "TCPActiveOpen: read() failed on (aa.bb.cc.dd;xxxx): Connection reset by peer."
CASE: mpirun seems to fail to start processes.
[xx] YAMPI: fatal error (0x340f): TCPActiveOpen: read() failed on (aa.bb.cc.dd;xxxx):
Connection reset by peer.
IOT Trap

FIX: Increase the listen backlog on the compute nodes. The default value is 128. To increase it to 10000, issue the following commands:

# echo 10000 > /proc/sys/net/core/somaxconn

export _YAMPI_SOMAXCONN=10000

Tips

Pitfalls in Heterogeneous Environment

Machines have different precision in floating point.
Communicating floating-point data may cause precision mismatches, because some processors diverge from the IEEE floating-point format. Intel IA32 (32-bit) has extra precision due to its 80-bit register format. Intel IA64, IBM Power, Fujitsu SPARC64V, and others have extra precision in the fused multiply-add (fma) operation. Thus precision errors occur in a heterogeneous environment.

IEEE-conforming behavior is selected by compiler options. Notes on specific compilers follow:

Other processors: Ultra-SPARC follows the IEEE. x86_64 uses the SSE registers by default and follows the IEEE.

GCC/IA64 (version 3) has no control over the precision and cannot be made IEEE conforming. Also note that Intel CC/IA64 has the option -no-IPF-fma, but it alone does not suffice at -O3.

IBM XLC optimizes aggressively, and -qfloat=nomaf alone works only at -O (-O2); -qstrict is also needed at -O3 or above (for XLC Version 6).

CG from the NPB (NAS Parallel Benchmarks) fails verification without these options.

Long integer types are packed in 64bits.
GridMPI packs long data as 64 bits in the external32 format, whereas the MPI-2 standard specifies that it be packed as 32 bits. This non-standard default was chosen because sending/receiving long data (MPI_LONG and MPI_UNSIGNED_LONG) may lose bits on 64-bit machines. The standard behavior can be selected by setting the environment variable _YAMPI_COMMON_PACK_SIZE.

Compiling for Large Data

Since x86_64 works with small address offsets by default even though it runs in 64-bit mode, GCC and Intel CC on x86_64 need the option -mcmodel=medium to use 64-bit addressing for data accesses.

With GCC 3.x, GridMPI must be recompiled/reinstalled with this option; otherwise linking against the MPI library fails. This is not needed with Intel CC or GCC 4.x. Disable PSPacer support (--without-libpsp) when you encounter a failure compiling "libpsp.c" in PIC mode. Use the lines below for reconfiguration.

# With GCC 3.x
CFLAGS="-mcmodel=medium" ./configure --without-libpsp (for sh/ksh/bash)
env CFLAGS="-mcmodel=medium" ./configure --without-libpsp (for csh/tcsh)

Running NPB (NAS Parallel Benchmarks)

CG Benchmark (NPB2.3/NPB2.4/NPB3.2)

The CG benchmark does not converge in a heterogeneous environment with any combination of Intel IA32, IA64, IBM Power, and Sun SPARC. This is due to the floating-point precision of processors that carry more precision than specified by the IEEE format. Also, some aggressive optimizations need to be disabled. See Pitfalls in Heterogeneous Environment.

LU Benchmark (NPB2.3/NPB2.4/NPB3.2)

The LU benchmark misuses datatypes: integers are exchanged as double floats. It is a simple mistake. The fix is the following: faq.lu.diff.txt

FT Benchmark (NPB3.2)

The FT benchmark fails with GNU g77 due to a duplicate declaration generated by "sys/setparams.c". It is a simple mistake. The fix is the following: faq.ft.diff.txt

MG Benchmark (NPB2.3/NPB2.4/NPB3.2)

The MG benchmark uses ambiguous message tags for NPROCS≥16. Tags are assigned per pair of dimension and direction, but they do not uniquely identify the processes when the mesh structure collapses at the lowest level. The fix is one of the following: faq.mg.diff.bar.txt or faq.mg.diff.tag.txt (the first adds extra barriers, the second uses properly fixed tags).

Running CLASS=D Benchmarks

CLASS=D benchmarks are too large for 32-bit addressing even on 64-bit machines when they are run with a relatively small number of processes. GCC and Intel CC on x86_64 need the option -mcmodel=medium to use 64-bit addressing for data accesses. See Compiling for Large Data.

Running Open MP (NPB-MZ) Benchmarks

Running the OpenMP code requires thread support to be enabled. However, the NPB code uses MPI_Init rather than MPI_Init_thread, so the benchmarks would run without threads being properly enabled. GridMPI has an option to make MPI_Init behave as though MPI_Init_thread had been called. The feature is enabled by setting the environment variable _YAMPI_THREADS to a non-zero value.

_YAMPI_THREADS=1; export _YAMPI_THREADS

MEMO: Compiling the benchmarks needs -openmp, and static linking is needed when using GCC. The environment variable OMP_NUM_THREADS specifies the number of threads per process.
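
For example, to run with four OpenMP threads per process (four is an arbitrary choice):
OMP_NUM_THREADS=4; export OMP_NUM_THREADS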

Running the MPICH Performance Test

The Performance Test Suite (mpptest/perftest) from MPICH at ANL fails by timing out. This is because the loop count ends up too large with GridMPI/YAMPI: it is calculated from the value returned by MPI_Wtick to obtain a reasonable precision. MPI_Wtick of MPICH returns 1e-6, but GridMPI/YAMPI returns a value like 1e-2 (the inverse of HZ). The easiest fix is to replace the call to MPI_Wtick with a small constant in "mpptest.c":

#if 0
    wtick = MPI_Wtick();
#else
    wtick = 1e-6;
#endif

See http://www-unix.mcs.anl.gov/mpi/mpich1/download.html for downloading the Performance Test Suites.

Performance Parameters

GridMPI follows the IMPI specification exactly, using the wire protocol and the collective algorithms defined there. Some parameters can be controlled by environment variables. Note that the default parameters of TCP/IP and of the wire protocol are not suitable for large-latency links.

Protocol Switch

_YAMPI_RSIZE=1024

This specifies the message size at which the protocol switches inside a cluster. _YAMPI_RSIZE makes MPI_Send switch to MPI_Rsend when the message size is equal to or larger than this value. GridMPI/YAMPI uses the eager protocol in MPI_Send and the rendezvous protocol in MPI_Rsend. Note that the rendezvous protocol uses handshaking and starts sending only when both the sender and the receiver are ready; it may avoid one copy into a temporary buffer.

Socket Buffer

_YAMPI_SOCBUF=65536
IMPI_SOCBUF=20000000

These are the numbers of bytes passed to setsockopt (for both send and receive buffers). _YAMPI_SOCBUF controls the YAMPI/TCP connections, and IMPI_SOCBUF controls the IMPI connections. When _YAMPI_SOCBUF is specified but IMPI_SOCBUF is not, _YAMPI_SOCBUF is used for both. The default is 64*1024 bytes.
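
For example, keeping the sample values shown above:
_YAMPI_SOCBUF=65536; export _YAMPI_SOCBUF
IMPI_SOCBUF=20000000; export IMPI_SOCBUF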

IMPI Wire Protocol

IMPI_C_DATALEN=2147483647

IMPI sends messages as chunks (IMPI packets). IMPI_C_DATALEN is the maximum chunk size in bytes. IMPI uses a rendezvous protocol when an MPI message is larger than this value. Rendezvous means handshaking between the sender and the receiver, so latency strongly affects performance. Rendezvous can effectively be disabled by making IMPI_C_DATALEN as large as possible (2147483647 = 0x7fffffff). The default is 64*1024 bytes.

IMPI_H_HIWATER=4000
IMPI_H_ACKMARK=10

IMPI also uses flow control of chunks. These variables specify the number of chunks sent before ACKs are received. IMPI_H_HIWATER specifies the maximum number of outstanding chunks. The defaults are IMPI_H_ACKMARK=10 and IMPI_H_HIWATER=20. These values matter little when IMPI_C_DATALEN is set infinitely large.

IMPI Collective Algorithms

IMPI_COLL_XSIZE
IMPI_COLL_MAXLINEAR

IMPI switches the collective algorithms depending on the message size and the number of processes; the decision is controlled by these variables.

The parameters prefixed by "IMPI_" are defined in the IMPI specification, so please check the specification, too. The full list of environment variables controlling GridMPI is given in Environment Variables.

Using with Condor

GridMPI needs some settings to run under Condor, a workload management system for high-throughput computing (http://www.cs.wisc.edu/condor/). At least the following environment variables are necessary.

_YAMPI_RSH=$CONDOR_SSH
_YAMPI_MPIRUN_SPREAD=0
_YAMPI_MPIRUN_CHDIR=0

_YAMPI_RSH is needed because Condor uses its own SSH script, and using that script is mandatory.

_YAMPI_MPIRUN_SPREAD=0 disables forking as a tree (it sets the spreading factor at the root to infinity). Condor places a configuration file on the starting node and reads it on each invocation of SSH. This means SSH can only be invoked from the starting node, so forking as a tree cannot be used.

_YAMPI_MPIRUN_CHDIR=0 disables chdir on nodes invoked via SSH. GridMPI normally tries to chdir to the directory where mpirun was invoked, but Condor assigns a different working directory to each node and chdirs in its SSH script. Thus GridMPI need not chdir by itself.

See faq.condor.txt, a modified "mp1script" for GridMPI.

Odds on Specific Platforms

Compiler Options to IBM AIX and Hitachi SR11000

Why is -qstaticinline passed to the IBM XL compilers by mpicc?
It suppresses the warning "WARNING: Duplicate symbol" when linking C++ programs. GridMPI defines inline methods in the C++ binding, which generate many warnings at link time. The XL C++ compiler emits an associated external definition for each inline method (an ISO-specified behavior). See the IBM support document.
Why is -parallel=0 passed to the Hitachi f90 compiler by mpicc?
It disables auto-parallelization, which is enabled by default (the default can be set per site). The aggressive setting makes many programs fail.

Compiler Options to Fujitsu Solaris/SPARC64

Why is -Knouse_rodata or -Ar passed to the Fujitsu compilers by mpicc?
-Knouse_rodata (in C) or -Ar (in Fortran) prevents constants from being placed in a read-only area. They are necessary to use Fujitsu MPI and should always be specified.
Why is -f2004,1321 passed to the Fujitsu Fortran compiler by mpicc?
It merely suppresses many warnings in Fortran.

Compiler Options to NEC SX

What options can I use when a vectorized program aborts on the limit of the number of loop iterations, with a message like the following:
**** 96 Loop count is greater than that assumed by the compiler:
loop-count=274625 eln=2342 PROG=input ELN=2366(40004064c) TASKID=1
Called from main_ ELN=275(40000654c)
The option -Wf,-pvctl,loopcnt=2147483647 or -Wf,-pvctl,vwork=stack to f90/sxf90 may work (note that 2147483647 = 0x7fffffff). Here, -Wf is a prefix for passing detailed options to the compiler.

Compiler Options to SPARC

Why is -xmemalign=8s passed to Sun CC (SPARC) by mpicc?
-xmemalign=8s makes doubles aligned on an eight-byte boundary. Optimization (with -fast) can make the MPI library and user programs incompatible when they are compiled with different optimization levels; the option is added to avoid this incompatibility. -mno-unaligned-doubles is passed to GCC.

Other Problems (on Compilers)


($Date: 2008/04/22 07:43:14 $)