Thus far in the workshop, we have assumed that all processors are available for all communication events. This is the effect of using the MPI_COMM_WORLD communicator, which includes all processors, for message passing activity.
To limit communication to a subset of the processors, the programmer can create a group, and associate a communicator with that group. The new communicator can be used in point-to-point or collective communication routines. Both groups and communicators are MPI objects (stored in system space) accessed by handles (returned from or passed to MPI routines).
A group is an ordered set of processes. Each process in a group is associated with a unique integer rank. Rank values start at zero and go to N-1, where N is the number of processes in the group.
A communicator is a mechanism for creating a self-contained communication "universe". A message sent with a given communicator is only eligible to be received by a process specifying the same communicator.
A communicator incorporates an instance of a group, and also includes contexts. A context can be thought of as a unique tag that is allocated by the system. For a message to be received, it must match on context and on message tag. A communicator includes two separate contexts: one for point-to-point communication and one for collective communication. These operations can therefore occur simultaneously within one communicator, without the point-to-point receiving the collective message, or vice versa.
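The matching rule can be pictured with a small model in plain C. This is only a sketch: the `envelope` structure and `envelope_matches` function are hypothetical names invented for illustration, since a real MPI implementation keeps the context hidden inside the communicator handle. The source-rank check is included because a receive that names a specific sender must also match on source.

```c
#include <stdbool.h>

/* Hypothetical model of a message envelope (not an MPI data type). */
struct envelope {
    int context;  /* allocated by the system, never chosen by the user */
    int source;   /* rank of the sender in the communicator's group */
    int tag;      /* message tag chosen by the programmer */
};

/* A posted receive accepts a message only if context, source, and tag
   all agree. Equal tags are not enough when the contexts differ. */
bool envelope_matches(const struct envelope *msg,
                      const struct envelope *posted)
{
    return msg->context == posted->context
        && msg->source  == posted->source
        && msg->tag     == posted->tag;
}
```

In this model, a message sent with tag 99 in one context is never delivered to a receive posted with tag 99 in another context, which is the property that keeps point-to-point and collective traffic apart.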
Communicators and contexts are particularly important for libraries. There is no way to ensure that a library developer will choose a message tag that is not used by the application programmer. If the library creates its own communicator, it will be guaranteed a unique context and no interference from other communication events.
There are two additional issues related to communicators which are beyond the scope of this talk. First, a communicator can cache attribute information, to be carried over between subsequent subroutine calls. MPI virtual topologies are implemented using cached attributes. Second, in addition to intra-communicators (discussed in this talk), inter-communicators can be created to allow point-to-point communication between groups.
MPI includes routines for accessing information on groups or communicators, for creating new groups or communicators from existing ones, and for deleting groups or communicators. A list follows. Please see the man pages or the MPI standard for information on syntax. The routines illustrated later in this module include MPI_Comm_group, MPI_Group_incl, MPI_Group_rank, MPI_Comm_rank, MPI_Comm_create, and MPI_Comm_split.
Communicator creation routines are collective. They require all processes in the input communicator to participate, and may require communication among processes. MPI_Comm_free is also collective over the communicator being freed. All other group and communicator routines listed here are local. As will be discussed later, it often makes sense to have all members of an input group call a group creation routine, if a communicator will later be created for that group.
| Group Accessors | |
| MPI_Group_size | returns number of processes in group |
| MPI_Group_rank | returns rank of calling process in group |
| MPI_Group_translate_ranks | translates ranks of processes in one group to those in another group |
| MPI_Group_compare | compares group members and group order |
| Group Constructors | |
| MPI_Comm_group | returns the group associated with a communicator |
| MPI_Group_union | creates a group by combining two groups |
| MPI_Group_intersection | creates a group from the intersection of two groups |
| MPI_Group_difference | creates a group from the difference between two groups |
| MPI_Group_incl | creates a group from listed members of an existing group |
| MPI_Group_excl | creates a group excluding listed members of an existing group |
| MPI_Group_range_incl | creates a group according to (first rank, last rank, stride) triplets |
| MPI_Group_range_excl | creates a group by deleting according to (first rank, last rank, stride) triplets |
| Group Destructors | |
| MPI_Group_free | marks a group for deallocation |
| Communicator Accessors | |
| MPI_Comm_size | returns number of processes in communicator's group |
| MPI_Comm_rank | returns rank of calling process in communicator's group |
| MPI_Comm_compare | compares two communicators |
| Communicator Constructors | |
| MPI_Comm_dup | duplicates a communicator |
| MPI_Comm_create | creates a new communicator for a group |
| MPI_Comm_split | splits a communicator into multiple, non-overlapping communicators |
| Communicator Destructors | |
| MPI_Comm_free | marks a communicator for deallocation |
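Because a group is an ordered set, the set-style constructors also fix the order of the result. The plain C sketch below models groups as arrays of ranks to show the ordering the MPI standard describes for union, intersection, and difference. The function names are hypothetical and no MPI calls are made; each function returns the size of the resulting group.

```c
/* Return 1 if rank r appears in g[0..n-1], else 0. */
int group_contains(const int *g, int n, int r)
{
    for (int i = 0; i < n; i++)
        if (g[i] == r) return 1;
    return 0;
}

/* Union: every member of a in a's order, then members of b not in a. */
int group_union(const int *a, int na, const int *b, int nb, int *out)
{
    int n = 0;
    for (int i = 0; i < na; i++) out[n++] = a[i];
    for (int i = 0; i < nb; i++)
        if (!group_contains(a, na, b[i])) out[n++] = b[i];
    return n;
}

/* Intersection: members of a that are also in b, in a's order. */
int group_intersection(const int *a, int na, const int *b, int nb, int *out)
{
    int n = 0;
    for (int i = 0; i < na; i++)
        if (group_contains(b, nb, a[i])) out[n++] = a[i];
    return n;
}

/* Difference: members of a that are not in b, in a's order. */
int group_difference(const int *a, int na, const int *b, int nb, int *out)
{
    int n = 0;
    for (int i = 0; i < na; i++)
        if (!group_contains(b, nb, a[i])) out[n++] = a[i];
    return n;
}
```

A consequence of these rules is that union is not symmetric: swapping the two input groups can yield the same members with different ranks.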
In forest dynamics models, leaf-area profile growth is dictated by the amount of available light. In some models, diffuse light at a given grid point depends on data at grid points in cardinal directions (N,E,S,W). If one views the processors laid out in a 2-dimensional grid, one would like to exchange data among processors sharing the same row and processors sharing the same column. This can be accomplished by using groups of processors.
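Suppose we have 12 processors laid out in a 3x4 grid. The processors are numbered down each column, matching the row and column arithmetic in the code that follows, so that processor 0 is in the top-left corner and processor 11 is in the bottom-right:
0  3  6  9
1  4  7  10
2  5  8  11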
Since data must be exchanged among rows and columns of processors, we must create a group for each row and a group for each column. In our example, 7 different groups must thus be generated. However, a particular processor will belong to only 2 of these groups -- one row group and one column group.
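The membership of the 7 groups can be worked out with simple index arithmetic. The sketch below is plain C with no MPI calls, and it assumes the column-by-column numbering implied by the module's code (row = rank mod NROW, column = rank / NROW). The function names are hypothetical, and rows and columns are counted from zero here.

```c
#define NROW 3
#define NCOL 4

/* Row index (0-based) of a rank when ranks run down each column first. */
int grid_row(int rank) { return rank % NROW; }

/* Column index (0-based) of a rank. */
int grid_col(int rank) { return rank / NROW; }

/* Fill members[0..NCOL-1] with the ranks in one row, left to right. */
void row_members(int row, int members[NCOL])
{
    for (int c = 0; c < NCOL; c++)
        members[c] = row + c * NROW;
}

/* Fill members[0..NROW-1] with the ranks in one column, top to bottom. */
void col_members(int col, int members[NROW])
{
    for (int r = 0; r < NROW; r++)
        members[r] = col * NROW + r;
}
```

Under this numbering, processor 7 sits in row 1 and column 2, so it belongs to the row group {1, 4, 7, 10} and the column group {6, 7, 8}, and to none of the other five groups.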
This module will demonstrate two methods for creating the groups for the forest dynamics model. The first method is the most generally applicable, but not the most compact.
It should first be noted, however, that MPI does not provide a mechanism to build groups from scratch; they can only be developed from previously defined ones. This leads to an obvious question: How can one build the first group? You must start with the base group of all processes, which is associated with the MPI_COMM_WORLD communicator.
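The first method has three steps: obtain the base group of MPI_COMM_WORLD with MPI_COMM_GROUP, select the members of each row or column group from the base group with MPI_GROUP_INCL, and create a communicator for each new group with MPI_COMM_CREATE.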
The following snippet of Fortran code shows how to set up groups for rows and columns.
      subroutine set_group(row_comm,col_comm)
      include 'mpif.h'
      parameter (NROW=3, NCOL=4)
      integer row_list(NCOL), col_list(NROW), base_grp, grp
      integer temp_comm, row_comm, col_comm
c
c ------------------------------------------------
c Get base group from MPI_COMM_WORLD communicator
c ------------------------------------------------
c
      call MPI_COMM_GROUP(MPI_COMM_WORLD,base_grp,ierr)
c
c -------------------------------------------------------------
c Establish the row and column to which this processor belongs
c -------------------------------------------------------------
c
      call MPI_COMM_RANK(MPI_COMM_WORLD,irank,ierr)
      irow = MOD(irank,NROW) + 1
      icol = INT(irank/NROW) + 1
c
c -------------------------
c Build row groups
c -------------------------
c
      row_list(1) = 0
      do i=2,NCOL
         row_list(i) = row_list(i-1) + NROW
      end do
      do i=1,NROW
         call MPI_GROUP_INCL(base_grp,NCOL,row_list,grp,ierr)
         call MPI_COMM_CREATE(MPI_COMM_WORLD,grp,temp_comm,ierr)
         if (irow .eq. i) row_comm=temp_comm
         do j=1,NCOL
            row_list(j) = row_list(j)+1
         end do
      end do
c
c -------------------------
c Build column groups
c -------------------------
c
      do i=1,NROW
         col_list(i) = i-1
      end do
      do i=1,NCOL
         call MPI_GROUP_INCL(base_grp,NROW,col_list,grp,ierr)
         call MPI_COMM_CREATE(MPI_COMM_WORLD,grp,temp_comm,ierr)
         if (icol .eq. i) col_comm=temp_comm
         do j=1,NROW
            col_list(j) = col_list(j) + NROW
         end do
      end do
      return
      end
You might wonder why every processor is involved in the creation of
all groups, even though it belongs to only 2 of them. This is because
the MPI standard states that the MPI_COMM_CREATE call must be executed
by all processes in the input communicator (in our case,
MPI_COMM_WORLD), and that all processes must pass the
same value for the group argument (grp), even if they do
not belong to the new group.
The group communicators row_comm and
col_comm can be used in a number of ways. Returning to
the forest dynamics model again, each processor calculates the maximum
tree height for trees in its domain. This information must be
exchanged with all other processors sharing the same row and column,
with the result that each processor can then calculate the maximum
tree height in the entire row and column. If maxht is
the local maximum tree height, then the following snippet of code
illustrates how one can determine the row and column maximum:
      include 'mpif.h'
      parameter (NROW=3, NCOL=4)
      integer row_hgt(NCOL), col_hgt(NROW), max_row, max_col
      call MPI_ALLGATHER(maxht,1,MPI_INTEGER,row_hgt,1,MPI_INTEGER,
     1                   row_comm,ierr)
      call MPI_ALLGATHER(maxht,1,MPI_INTEGER,col_hgt,1,MPI_INTEGER,
     1                   col_comm,ierr)
      max_row = row_hgt(1)
      do i=2,NCOL
         if (row_hgt(i) .gt. max_row) max_row=row_hgt(i)
      end do
      max_col = col_hgt(1)
      do i=2,NROW
         if (col_hgt(i) .gt. max_col) max_col=col_hgt(i)
      end do
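To check what each processor should end up with, the gather-then-scan pattern can be reproduced serially. The sketch below is plain C with no MPI calls: it takes an array holding every processor's local maximum, picks out the values that the row or column allgather would deliver to one processor, and scans them. The function names are hypothetical, and the column-by-column rank numbering is the one assumed by the module's code.

```c
#define NROW 3
#define NCOL 4
#define NPROC (NROW * NCOL)

/* Largest value in vals[0..n-1]; n must be at least 1. */
int max_of(const int *vals, int n)
{
    int best = vals[0];
    for (int i = 1; i < n; i++)
        if (vals[i] > best) best = vals[i];
    return best;
}

/* Value a processor would hold in max_row after the row allgather. */
int row_max(const int maxht[NPROC], int rank)
{
    int row_hgt[NCOL];
    int row = rank % NROW;
    for (int c = 0; c < NCOL; c++)
        row_hgt[c] = maxht[row + c * NROW];  /* one entry per row member */
    return max_of(row_hgt, NCOL);
}

/* Value a processor would hold in max_col after the column allgather. */
int col_max(const int maxht[NPROC], int rank)
{
    int col_hgt[NROW];
    int col = rank / NROW;
    for (int r = 0; r < NROW; r++)
        col_hgt[r] = maxht[col * NROW + r];  /* one entry per column member */
    return max_of(col_hgt, NROW);
}
```

If every processor's local maximum is set equal to its own rank, processor 4 obtains a row maximum of 10 and a column maximum of 5, which gives a quick sanity check for a parallel run.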
For collective communication, all processes in the specified
communicator must participate. Processes not in the communicator must
not call the routine. In this example, all processes belong to a
valid row_comm and col_comm, so all call
MPI_ALLGATHER. Some codes will require making the
collective communication call conditional upon membership in the
communicator's group. In this situation, all processes can call
MPI_GROUP_RANK to learn their rank in the group. Those whose rank
equals MPI_UNDEFINED should not call the collective communication
routine.
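The membership test can be modelled in a few lines of plain C. This is a sketch only: `GROUP_UNDEFINED` is a hypothetical stand-in for MPI_UNDEFINED, the group is represented as an array of MPI_COMM_WORLD ranks, and `group_rank` imitates what MPI_GROUP_RANK reports without calling MPI.

```c
#define GROUP_UNDEFINED (-1)  /* hypothetical stand-in for MPI_UNDEFINED */

/* Rank of world_rank within the group, i.e. its position in the ordered
   member list, or GROUP_UNDEFINED if it is not a member. */
int group_rank(const int *members, int n, int world_rank)
{
    for (int i = 0; i < n; i++)
        if (members[i] == world_rank) return i;
    return GROUP_UNDEFINED;
}

/* Nonzero if the process belongs to the group and may therefore take
   part in a collective call on that group's communicator. */
int may_call_collective(const int *members, int n, int world_rank)
{
    return group_rank(members, n, world_rank) != GROUP_UNDEFINED;
}
```

In a real code the same guard would wrap the collective call itself, so that non-members skip it entirely.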
An alternate approach is to use MPI_COMM_SPLIT, which partitions one communicator into multiple, non-overlapping communicators. All manipulation of groups is done behind the scenes. This approach is less generally applicable than using MPI_COMM_CREATE, but produces more compact code for regular decomposition problems, such as the forest dynamics problem.
In the subroutine below, the first call to MPI_COMM_SPLIT divides MPI_COMM_WORLD into three communicators corresponding to the three rows in the 3x4 grid. One communicator includes all processes for which color = 1 (row 1), another includes all processes for which color = 2 (row 2), and a third includes all processes for which color = 3 (row 3). These are the color values of the Fortran version; the C version counts rows from zero, so its colors are 0, 1, and 2. Within the new communicators, processes are ranked in order of increasing key value (in this case, according to increasing rank in the original communicator MPI_COMM_WORLD). The second call splits MPI_COMM_WORLD in the same way into four column communicators.
C version:
#include <stdio.h>
#include "mpi.h"

#define NROW 3
#define NCOL 4

void set_groups(MPI_Comm *, MPI_Comm *);

void set_groups(MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int irow, icol, color, key, rank_in_world;

    /* ------------------------------------------------------------
       Establish the row and column to which this processor belongs
       ------------------------------------------------------------ */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_in_world);
    irow = (rank_in_world % NROW);
    icol = (rank_in_world / NROW);

    /* -------------------------
       Build row communicators
       ------------------------- */
    color = irow;
    key = rank_in_world;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, row_comm);

    /* --------------------------
       Build column communicators
       -------------------------- */
    color = icol;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, col_comm);
}

int main(int argc, char **argv)
{
    MPI_Comm row_comm, col_comm;
    int row_hgt[NCOL], col_hgt[NROW], max_row, max_col;
    int rank_in_world, maxht, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_in_world);

    /* ---------------------------------------------------------------
       A real simulation would calculate a meaningful maxht here
       --------------------------------------------------------------- */
    maxht = rank_in_world;

    set_groups(&row_comm, &col_comm);

    MPI_Allgather(&maxht, 1, MPI_INT, row_hgt, 1, MPI_INT, row_comm);
    MPI_Allgather(&maxht, 1, MPI_INT, col_hgt, 1, MPI_INT, col_comm);

    /* Scan the gathered values for the row and column maxima */
    max_row = row_hgt[0];
    for (i = 1; i < NCOL; i++)
        if (row_hgt[i] > max_row) max_row = row_hgt[i];

    max_col = col_hgt[0];
    for (i = 1; i < NROW; i++)
        if (col_hgt[i] > max_col) max_col = col_hgt[i];

    printf("%d, %d, %d\n", rank_in_world, max_row, max_col);

    /* Release the communicators and shut down MPI */
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}
Fortran version:
      subroutine set_group(row_comm,col_comm)
      include 'mpif.h'
      parameter (NROW=3, NCOL=4)
      integer row_comm, col_comm, color, key
c -------------------------------------------------------------
c Establish the row and column to which this processor belongs
c -------------------------------------------------------------
c
      call MPI_COMM_RANK(MPI_COMM_WORLD,irank,ierr)
      irow = MOD(irank,NROW) + 1
      icol = INT(irank/NROW) + 1
c
c -------------------------
c Build row communicators
c -------------------------
c
      color = irow
      key = irank
      call MPI_COMM_SPLIT (MPI_COMM_WORLD, color, key, row_comm, ierr)
c
c ---------------------------
c Build column communicators
c ---------------------------
c
      color = icol
      call MPI_COMM_SPLIT (MPI_COMM_WORLD, color, key, col_comm, ierr)
      return
      end
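The color and key rule described above can be modelled serially to predict the rank a process receives in its new communicator. The sketch below is plain C with no MPI calls, and its function names are hypothetical. It assumes every process supplies a valid color (the MPI_UNDEFINED color is not modelled), and it breaks ties between equal keys by rank in the original communicator, as the MPI standard specifies.

```c
/* Number of processes that end up in the same new communicator as `me`,
   i.e. those that passed the same color. */
int split_size(const int *color, int nproc, int me)
{
    int n = 0;
    for (int p = 0; p < nproc; p++)
        if (color[p] == color[me]) n++;
    return n;
}

/* Rank of process `me` in its new communicator: the number of processes
   with the same color that sort ahead of it, ordered by key and then by
   rank in the original communicator. */
int split_rank(const int *color, const int *key, int nproc, int me)
{
    int new_rank = 0;
    for (int p = 0; p < nproc; p++) {
        if (p == me || color[p] != color[me]) continue;
        if (key[p] < key[me] || (key[p] == key[me] && p < me))
            new_rank++;
    }
    return new_rank;
}
```

With color set to the row index and key set to the original rank, processor 7 of the 12 lands in a communicator of size 4 with new rank 2. Passing the same key on every process gives the same ordering because of the tie-breaking rule, while a decreasing key reverses it.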