Current Release: GridMPI-2.1.3
FAQ, Tips, and Troubleshooting
Contents:
Frequently Asked Questions
Configuration
-
Do all hosts need global IP addresses?
- Yes. Each host in the cluster should have a global IP address and
be IP reachable.
RATIONALE: The GridMPI implementors judged that relaying/forwarding
of messages would hurt performance with current technology. Also, a
relaying/forwarding topology is too restrictive to allow a variety of
collective algorithms for experiments.
-
How do I select one from multiple network interfaces?
- GridMPI uses the default network interface for global communication
by default. When a host has multiple network interfaces, an interface
can be selected with the environment variable IMPI_NETWORK, which
takes a network address such as "163.220.2.0".
Inside a cluster, having multiple interfaces is not a problem,
because the cluster MPI (YAMPII) uses the interfaces specified in a
configuration file, where an interface is selected by the given hostname.
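For example, assuming that environment variables given on the launch
command line are propagated to the processes (the network address and
the pi program are only illustrative):
IMPI_NETWORK=163.220.2.0 gridmpirun -np 4 ./pi (for sh/bash)
env IMPI_NETWORK=163.220.2.0 gridmpirun -np 4 ./pi (for csh/tcsh)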
-
Can I change CC when compiling GridMPI?
- To compile GridMPI with a non-default C/C++ compiler, specify the
environment variables (CC and others) and invoke "configure".
Specify the Fortran and C++ compilers, too, because the compiler
driver uses the ones found at configuration time. Use CFLAGS to
pass options to CC (see the last example below). For example, the
following suffices to use GCC.
CC=gcc CXX=g++ F77=g77 F90=g77 ./configure (for sh/bash)
env CC=gcc CXX=g++ F77=g77 F90=g77 ./configure (for csh/tcsh)
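To pass options through CFLAGS as well, the same pattern applies; the
flags below are only illustrative:
CC=gcc CFLAGS="-O2 -g" CXX=g++ F77=g77 F90=g77 ./configure (for sh/bash)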
-
Can I stop core dumps on abort?
- The behavior on abort is controllable by the environment
variable _YAMPI_DUMPCORE. Setting _YAMPI_DUMPCORE=0
calls exit (3c) when aborting, and setting
_YAMPI_DUMPCORE=1 calls abort (3c). GridMPI dumps
cores by default, because an abort is an irregular condition. Also,
setting _YAMPI_ABORT_ON_CLOSE=0/1/2 may help to suppress
core dumps. See the full list of environment variables.
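For example, to run without core dumps (a sketch; it assumes that
environment variables given on the launch command line reach the MPI
processes, and the pi program is only illustrative):
_YAMPI_DUMPCORE=0 gridmpirun -np 4 ./pi (for sh/bash)
env _YAMPI_DUMPCORE=0 gridmpirun -np 4 ./pi (for csh/tcsh)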
-
Can I change the naming of core files on IBM AIX?
- The naming of core files (such as appending the PID and date) can be
changed by setting the environment variable "CORE_NAMING" in AIX 5.2, or
by using the "chcore" command in AIX 5.3.
To make a full core dump (not only the stack, but also the data) in AIX,
set the "_YAMPI_AIX_FULLCORE" environment variable.
-
Can I get the configuration of wide-area networks?
- No. But you can retrieve the number of clusters and the number of
procs in each cluster. IMPI defines two attributes of the cluster
configuration in MPI_COMM_WORLD (IMPI_CLIENT_SIZE and
IMPI_CLIENT_COLOR). They indicate which cluster the proc (myrank)
belongs to. Counting the number of procs in each cluster is simple
(just do an allreduce).
See faq.clustercolor.c.txt
or faq.clustercolorf.f.txt.
Troubleshooting
-
SSH Says "Permission denied".
- CASE: ssh seems to fail to start processes.
$ gridmpirun -np 4 ./pi
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
Permission denied (publickey,keyboard-interactive).
FIX: If you are using ssh-agent to avoid typing a
passphrase, then you need to allow agent forwarding for ssh. Add
the following lines in "~/.ssh/config" or "/etc/ssh/ssh_config":
Host *
ForwardAgent yes
To verify the setting, issue the following command:
ssh localhost ssh localhost date
RATIONALE: This happens because mpirun forks
processes as a tree via rsh/ssh, and thus the credentials need to be
forwarded along the tree. The name of the forker program is mpifork.
Directly calling mpifork -v -np n hostname may print
helpful messages.
-
mpirun on Fujitsu Solaris8/SPARC64V stops with a message
"/opt/FJSVmpi2/bin/mpiexec[15]: aplpg: not found".
- CASE: GridMPI on Fujitsu Solaris8/SPARC64V uses Fujitsu MPI
as an underlying transport (known as Vendor MPI). mpirun
invokes mpiexec of Fujitsu MPI, and mpiexec subsequently
invokes the aplpg and aprun commands, both of which must
be found in the PATH.
FIX: Add /opt/FSUNaprun/bin to the PATH.
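For example (sh/bash):
PATH=/opt/FSUNaprun/bin:$PATH; export PATH   # make aplpg and aprun visible to mpiexec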
Tips
Pitfalls in Heterogeneous Environment
- Machines have different precision in floating point.
- Communicating floating-point data may cause a mismatch in precision,
because some processors deviate from the IEEE floating-point format. Intel
IA32 (32-bit) has extra precision due to its 80-bit register format.
IBM Power has extra precision in the multiply-add (fma) operation.
Fujitsu SPARC64V also has extra precision in the multiply-add
operation. Thus, precision errors occur in a heterogeneous
environment.
IEEE-conforming behavior is selected by compiler options; use the
following (an example mpicc invocation is shown after this item):
- Intel IA32 (with GCC): -msse2 -mfpmath=sse
- IBM Power (with XLC): -qfloat=nomaf -qstrict
- Fujitsu SPARC64V (with Fujitsu compilers): -Kfast_GP=0
Other processors: Ultra-SPARC follows the IEEE. x86_64 uses the
SSE registers by default and follows the IEEE. IA64 has the
multiply-add operation.
IBM XLC optimizes aggressively, and -qfloat=nomaf alone
works only at -O (-O2); -qstrict is also
needed at -O3 or above (for XLC Version 6).
CG from the NPB (NAS Parallel Benchmarks) fails without these
options (verification fails).
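For example, a sketch of compiling on Intel IA32 with GCC through
mpicc (the source file name is hypothetical):
mpicc -O2 -msse2 -mfpmath=sse -c cg.c   # keep arithmetic in SSE registers, avoiding the 80-bit x87 precision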
- Long integer types are packed in 64 bits.
- GridMPI packs long data in 64 bits in the external32
format, whereas the MPI-2 standard specifies it should be packed
in 32 bits. The non-standard default is chosen because
sending/receiving long data (MPI_LONG and
MPI_UNSIGNED_LONG) may lose bits on 64-bit machines. The
standard behavior is selected by setting the environment variable
_YAMPI_COMMON_PACK_SIZE.
Running NPB (NAS Parallel Benchmarks)
CG Benchmark (NPB2.3/NPB2.4/NPB3.2)
CG does not converge in a heterogeneous environment with any
combination of Intel IA32, IBM Power, and Sun SPARC. This is because
these processors provide more precision than the IEEE floating-point
format specifies. Also, some aggressive optimizations need
to be disabled.
See Pitfalls in Heterogeneous Environment.
LU Benchmark (NPB2.3/NPB2.4/NPB3.2)
LU misuses datatypes. It is a simple mistake: integers are
exchanged as double floats. The fix is the following:
faq.lu.diff.txt
FT Benchmark (NPB3.2)
Compiling with GNU g77 fails due to a duplicate declaration generated
by "sys/setparams.c". It is a simple mistake. The fix is the following:
faq.ft.diff.txt
MG Benchmark (NPB2.3/NPB2.4/NPB3.2)
MG uses ambiguous message tags for NPROCS≥16. Tags are assigned
per pair of dimension and direction, but they do not uniquely
determine the processes when the mesh structure collapses at the
lowest level. The fix is the following:
faq.mg.diff.bar.txt or
faq.mg.diff.tag.txt
Oddities on Specific Platforms
Compiler Options for IBM AIX and Hitachi SR11000
- Why is -qstaticinline passed to the IBM XL compilers in
mpicc?
- It suppresses the warning "WARNING: Duplicate symbol" while linking
C++ programs. GridMPI defines inline methods in the C++ binding,
which generate many warnings at link time. The XL C++ compiler emits an
associated external definition for each inline method (this is
ISO-specified behavior). See the
IBM support document.
- Why is -parallel=0 passed to the Hitachi f90 compiler in
mpicc?
- It disables auto-parallelization, which is enabled by default
(the default can be set per site). The aggressive setting makes many
programs fail.
Compiler Options for Fujitsu Solaris/SPARC64
- Why is -Knouse_rodata or -Ar passed to the
Fujitsu compilers in mpicc?
- -Knouse_rodata (in C) or -Ar (in Fortran) disables
placing constants in a read-only area. They are necessary for using
Fujitsu MPI and should always be specified.
- Why is -f2004,1321 passed to the Fujitsu Fortran compiler in
mpicc?
- It just suppresses many warnings in Fortran.
Compiler Options for NEC SX
- What options can I use when a vectorized program aborts on the
loop-count limit with a message like the following:
**** 96 Loop count is greater than that assumed by the compiler:
loop-count=274625 eln=2342 PROG=input ELN=2366(40004064c) TASKID=1
Called from main_ ELN=275(40000654c)
- Passing the options -Wf,-pvctl,loopcnt=2147483647 or
-Wf,-pvctl,vwork=stack to f90/sxf90 may work (note that
2147483647=0x7fffffff). Here, -Wf is a prefix for passing complex
options to the compiler.
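For example (a sketch; the source file name is hypothetical):
sxf90 -Wf,-pvctl,loopcnt=2147483647 -c input.f   # raise the loop count assumed by the vectorizer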
($Date: 2006-08-22 03:12:11 $)