This FAQ is for Open MPI v4.x and earlier.
If you are looking for documentation for Open MPI v5.x and later, please visit docs.open-mpi.org.
Table of contents:
- What is the vader BTL?
- What is the sm BTL?
- How do I specify use of sm for MPI messages?
- How does the sm BTL work?
- Why does my MPI job no longer start when there are too many processes on
one node?
- How do I know what MCA parameters are available for tuning MPI performance?
- How can I tune these parameters to improve performance?
- Where is the file that sm will mmap in?
- Why am I seeing incredibly poor performance with the sm BTL?
- Can I use SysV instead of mmap?
- How much shared memory will my job use?
- How much shared memory do I need?
- How can I decrease my shared-memory usage?
1. What is the vader BTL?
The vader BTL is a low-latency, high-bandwidth
mechanism for transferring data between two processes via shared memory.
This BTL can only be used between processes executing on the same node.
Beginning with the v1.8 series, the vader BTL replaces the sm BTL unless
the local system lacks the required support or the user specifically requests
the latter be used. At this time, vader requires CMA support, which is typically
found only in more recent kernels. Thus, systems based on older kernels may default
to the slower sm BTL.
2. What is the sm BTL?
The sm BTL (shared-memory Byte Transfer Layer) is a low-latency, high-bandwidth
mechanism for transferring data between two processes via shared memory.
This BTL can only be used between processes executing on the same node.
The sm BTL has high exclusivity. That is, if one process can reach another
process via sm, then no other BTL will be considered for that connection.
Note that in Open MPI v1.3.2, the sm BTL's so-called "FIFOs" were reimplemented
and the sizing of the shared-memory area was changed. So, much of this FAQ
distinguishes between releases up to Open MPI v1.3.1 and releases starting with
Open MPI v1.3.2.
3. How do I specify use of sm for MPI messages?
Typically, it is unnecessary to do so; OMPI will use the best BTL available
for each communication.
Nevertheless, you may use the MCA parameter btl. You should also specify the
self BTL for communications between a process and itself. Furthermore, if not all
processes in your job will run on a single node, then you also need
to specify a BTL for internode communications. For example:
shell$ mpirun --mca btl self,sm,tcp -np 16 ./a.out
4. How does the sm BTL work?
A point-to-point user message is broken up by the PML into fragments.
The sm BTL only has to transfer individual fragments. The steps are:
- The sender pulls a shared-memory fragment out of one of its free lists.
Each process has one free list for smaller (e.g., 4Kbyte) eager
fragments and another free list for larger (e.g., 32Kbyte) max fragments.
- The sender packs the user-message fragment into this shared-memory
fragment, including any header information.
- The sender posts a pointer to this shared fragment into the
appropriate FIFO (first-in-first-out) queue of the receiver.
- The receiver polls its FIFO(s). When it finds a new fragment
pointer, it unpacks data out of the shared-memory fragment and notifies
the sender that the shared fragment is ready for reuse (to be
returned to the sender's free list).
On each node where an MPI job has two or more processes running, the job creates
a file that each process mmaps into its address space. Shared-memory
resources that the job needs — such as FIFOs and fragment free lists — are
allocated from this shared-memory area.
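To make the posting/polling protocol above concrete, here is a minimal
single-sender, single-receiver FIFO sketch in C. It is illustrative only,
not Open MPI's actual implementation: all names are hypothetical, and real
code would use memory barriers rather than volatile and would place the
structure in the mmaped region shared by both processes.

#include <stddef.h>

#define FIFO_SIZE 4096                /* entries; must be a power of two */

typedef struct {
    volatile size_t head;             /* next slot the sender writes     */
    volatile size_t tail;             /* next slot the receiver reads    */
    void *volatile queue[FIFO_SIZE];  /* posted fragment pointers        */
} sm_fifo_t;

/* Sender side: post a fragment pointer; returns 0 if the FIFO is full. */
int fifo_post(sm_fifo_t *f, void *frag)
{
    size_t next = (f->head + 1) & (FIFO_SIZE - 1);
    if (next == f->tail)
        return 0;                     /* congested: caller must retry    */
    f->queue[f->head] = frag;
    f->head = next;
    return 1;
}

/* Receiver side: poll for a new fragment; NULL if the FIFO is empty. */
void *fifo_poll(sm_fifo_t *f)
{
    void *frag;
    if (f->tail == f->head)
        return NULL;
    frag = f->queue[f->tail];
    f->tail = (f->tail + 1) & (FIFO_SIZE - 1);
    return frag;
}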
5. Why does my MPI job no longer start when there are too many processes on one node?
If you are using Open MPI v1.3.1 or earlier, it is possible that the shared-memory
area set aside for your job was not created large enough. Make sure you're running
in 64-bit mode (compiled with -m64) and set the MCA parameter mpool_sm_max_size
to be very large — even several Gbytes. Exactly how large is discussed further
below.
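For example, a hypothetical invocation might look like this (the value is
in bytes; 2 Gbytes here is purely illustrative):

shell$ mpirun --mca mpool_sm_max_size 2147483648 -np 128 ./a.out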
Regardless of which OMPI release you're using, make sure that there is sufficient
space for a large file to back the shared memory — typically in /tmp .
6. How do I know what MCA parameters are available for tuning MPI performance?
The ompi_info command can display all the parameters available for the
sm BTL and sm mpool:
shell$ ompi_info --param btl sm
shell$ ompi_info --param mpool sm
7. How can I tune these parameters to improve performance?
Mostly, the default values of the MCA parameters have already been chosen
to give good performance. Improving performance further is a bit of an art;
sometimes it is a matter of trading off performance for memory. The main
parameters are listed below, followed by an example invocation.
- btl_sm_eager_limit:
If message data plus header information fits within this limit, the
message is sent "eagerly" — that is, a sender attempts to write
its entire message to shared buffers without waiting for a receiver to
be ready. Above this size, a sender will only write the first part of
a message, then wait for the receiver to acknowledge its readiness before
continuing. Eager sends can improve performance by decoupling
senders from receivers.
- btl_sm_max_send_size:
Large messages are sent in fragments of this size. Larger fragments
can lead to greater efficiency, though they could perhaps also
inhibit pipelining between sender and receiver.
- btl_sm_num_fifos:
Starting in Open MPI v1.3.2, this is the number of FIFOs per receiving
process. By default, there is only one FIFO per process.
Conceivably, if many senders are all sending to the same process and
contending for a single FIFO, there will be congestion. If there are
many FIFOs, however, the receiver must poll more FIFOs to find
incoming messages. Therefore, you might try increasing this
parameter slightly if you have many processes (at least dozens)
all sending to the same process. For example, if 100 senders are all
contending for a single FIFO for a particular receiver, it may suffice
to increase btl_sm_num_fifos from 1 to 2.
- btl_sm_fifo_size:
Starting in Open MPI v1.3.2, FIFOs no longer grow dynamically. If you
believe a FIFO is getting congested because a process falls far
behind in reading incoming message fragments, increase this size
manually.
- btl_sm_free_list_num:
This is the initial number of fragments on each (eager and max) free
list. The free lists can grow in response to resource congestion, but
you can increase this parameter to pre-reserve space for more
fragments.
- mpool_sm_min_size:
You can reserve headroom for the shared-memory area to grow by
increasing this parameter.
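As an example, a hypothetical tuned run might look like this (the values
shown are illustrative, not recommendations):

shell$ mpirun --mca btl_sm_num_fifos 2 --mca btl_sm_eager_limit 8192 --mca btl self,sm -np 100 ./a.out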
8. Where is the file that sm will mmap in?
The file will be in the OMPI session directory, which is typically
something like /tmp/openmpi-sessions-myusername@mynodename/*.
The file itself will be named shared_mem_pool.mynodename. For example,
the full path could be
/tmp/openmpi-sessions-myusername@node0_0/1543/1/shared_mem_pool.node0.
To place the session directory in a non-default location, use the MCA parameter
orte_tmpdir_base.
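For example, assuming a node-local directory /local/scratch exists (the
path is hypothetical):

shell$ mpirun --mca orte_tmpdir_base /local/scratch -np 16 ./a.out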
9. Why am I seeing incredibly poor performance with the sm BTL?
The most common problem with the shared memory BTL is when the
Open MPI session directory is placed on a network filesystem (e.g., if
/tmp is not on a local disk). This is because the shared memory BTL
places a memory-mapped file in the Open MPI session directory (see this entry for more details). If the
session directory is located on a network filesystem, the shared
memory BTL latency will be extremely high.
Try not mounting /tmp as a network filesystem, and/or moving the Open
MPI session directory to a local filesystem.
Some users have reported success and possible performance
optimizations with having /tmp mounted as a "tmpfs" filesystem
(i.e., a RAM-based filesystem). However, before configuring
your system this way, be aware of a few items:
- Open MPI writes a few small metadata files into /tmp and may therefore
consume some extra memory that could otherwise have been used for
application instruction or data state.
- If you use the "filem" system in Open MPI for moving executables
between nodes, these files are stored under /tmp.
- Open MPI's checkpoint / restart files can also be saved under /tmp.
- If the Open MPI job is terminated abnormally, there are some
circumstances where files (including memory-mapped shared memory
files) can be left in /tmp. This can happen, for example, when a
resource manager forcibly kills an Open MPI job and does not give it
the chance to clean up its /tmp files and directories.
Some users have reported success with configuring their resource
manager to run a script between jobs to forcibly empty the /tmp
directory.
10. Can I use SysV instead of mmap?
In the v1.3 and v1.4 Open MPI series, shared memory is established
via mmap. In future releases, there may be an option for using SysV
shared memory.
11. How much shared memory will my job use?
Your job will create a shared-memory area on each node where
it has two or more processes. This area is fixed in size for the
lifetime of your job. Shared-memory allocations (for FIFOs and
fragment free lists) are made from this area. Here, we look at the
size of that shared-memory area.
If you want just one hard number, then go with approximately 128
Mbytes per node per job, shared by all the job's processes on that
node. That is, an OMPI job will need more than a few Mbytes per node,
but typically less than a few Gbytes.
Better yet, read on.
Up through Open MPI v1.3.1, the shared-memory file was basically
sized as follows:
nbytes = n * mpool_sm_per_peer_size
if ( nbytes < mpool_sm_min_size ) nbytes = mpool_sm_min_size
if ( nbytes > mpool_sm_max_size ) nbytes = mpool_sm_max_size
where n is the number of processes in the job running on that
particular node and the mpool_sm_* are MCA parameters. For small
n , this size is typically excessive. For large n (e.g., 128 MPI
processes on the same node), this size may not be sufficient for the
job to start.
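Expressed in C, the rule above amounts to the following sketch (the
function name is ours, not OMPI's; the arguments correspond to n and the
mpool_sm_* MCA parameters):

#include <stddef.h>

/* Size of the shared-memory backing file for OMPI <= v1.3.1:
 * n * per_peer_size, clamped to [min_size, max_size]. */
size_t sm_file_size(size_t n, size_t per_peer_size,
                    size_t min_size, size_t max_size)
{
    size_t nbytes = n * per_peer_size;   /* n = processes on the node */
    if (nbytes < min_size) nbytes = min_size;
    if (nbytes > max_size) nbytes = max_size;
    return nbytes;
}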
Starting in OMPI v1.3.2, a more sophisticated formula was introduced to
model more closely how much memory was actually needed. That formula
is somewhat complicated and subject to change. It guarantees that
there will be at least enough shared memory for the program to start
up and run. See this FAQ item for
how much is needed. Alternatively, the motivated user can examine the
OMPI source code to see the formula used — for example, here is the
formula in OMPI commit 463f11f.
OMPI v1.3.2 also uses the MCA parameter mpool_sm_min_size to set a
minimum size — e.g., so that there is not only enough shared memory
for the job to start, but additionally headroom for further
shared-memory allocations (e.g., of more eager or max fragments).
Once the shared-memory area is established, it will not grow further
during the course of the MPI job's run.
12. How much shared memory do I need?
In most cases, OMPI will start your job with sufficient shared
memory.
Nevertheless, if OMPI doesn't get you enough shared memory (e.g.,
you're using OMPI v1.3.1 or earlier with roughly 128 processes or more
on a single node) or you want to trim shared-memory consumption, you
may want to know how much shared memory is really needed.
As we saw earlier, the shared
memory area contains:
- FIFOs
- eager fragments
- max fragments
In general, you need only enough shared memory for the FIFOs and
fragments that are allocated during MPI_Init().
Beyond that, you might want additional shared memory for performance
reasons, so that FIFOs and fragment lists can grow if your program's
message traffic encounters resource congestion. Even if there is no
room to grow, however, your correctly written MPI program should still
run to completion in the face of congestion; performance simply degrades
somewhat. Note that while shared-memory resources can grow after
MPI_Init(), they cannot shrink.
So, how much shared memory is needed during MPI_Init()?
You need approximately the total of the following (a worked example in C
follows the lists below):
- FIFOs:
  - (≤ Open MPI v1.3.1): 3 × n × n × pagesize
  - (≥ Open MPI v1.3.2): n × btl_sm_num_fifos × btl_sm_fifo_size × sizeof(void *)
- eager fragments: n × ( 2 × n + btl_sm_free_list_inc ) × btl_sm_eager_limit
- max fragments: n × btl_sm_free_list_num × btl_sm_max_send_size
where:
- n is the number of MPI processes in your job on the node
- pagesize is the OS page size (4 Kbytes for Linux, 8 Kbytes for Solaris)
- btl_sm_* are MCA parameters
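Here is a small C program that evaluates the v1.3.2-style totals above.
The parameter values are illustrative placeholders, not the library
defaults; substitute your actual MCA settings:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    /* Illustrative values; substitute your actual MCA settings. */
    size_t n             = 16;     /* MPI processes on the node    */
    size_t num_fifos     = 1;      /* btl_sm_num_fifos             */
    size_t fifo_size     = 4096;   /* btl_sm_fifo_size (entries)   */
    size_t free_list_inc = 64;     /* btl_sm_free_list_inc         */
    size_t eager_limit   = 4096;   /* btl_sm_eager_limit (bytes)   */
    size_t free_list_num = 8;      /* btl_sm_free_list_num         */
    size_t max_send      = 32768;  /* btl_sm_max_send_size (bytes) */

    size_t fifos = n * num_fifos * fifo_size * sizeof(void *);
    size_t eager = n * (2 * n + free_list_inc) * eager_limit;
    size_t max   = n * free_list_num * max_send;

    printf("FIFOs: %zu, eager: %zu, max: %zu, total: %zu bytes\n",
           fifos, eager, max, fifos + eager + max);
    return 0;
}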
13. How can I decrease my shared-memory usage?
There are two parts to this question.
First, how does one reduce the size of the mmap file? The answer is:
- Up to Open MPI v1.3.1: reduce mpool_sm_per_peer_size, mpool_sm_min_size,
and mpool_sm_max_size
- Starting with Open MPI v1.3.2: reduce mpool_sm_min_size
Second, how does one reduce how much shared memory is needed? (Just
making the mmap file smaller doesn't help if your job then fails to
start.) The answers are:
- For small values of n — that is, for few processes per node —
shared-memory usage during MPI_Init() is predominantly for max free lists.
So, you can reduce the MCA parameter btl_sm_max_send_size. Alternatively,
you could reduce btl_sm_free_list_num, but it is already pretty small by
default.
- For large values of n — that is, for many processes per node — there are
two cases:
- Up to Open MPI v1.3.1: Shared-memory usage is dominated by the
FIFOs, which consume a certain number of pages. Usage is
high and cannot be reduced much via MCA parameter tuning.
- Starting with Open MPI v1.3.2: Shared-memory usage is dominated
by the eager free lists. So, you can reduce the MCA parameter
btl_sm_eager_limit.
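For instance, a hypothetical large-node run with a reduced eager limit
(the value is illustrative):

shell$ mpirun --mca btl_sm_eager_limit 2048 -np 128 ./a.out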