// ---------------------------------------------------------------------
//
// Copyright (c) 2017-2022 The Regents of the University of Michigan and DFT-FE
// authors.
//
// This file is part of the DFT-FE code.
//
// The DFT-FE code is free software; you can use it, redistribute
// it, and/or modify it under the terms of the GNU Lesser General
// Public License as published by the Free Software Foundation; either
// version 2.1 of the License, or (at your option) any later version.
// The full text of the license can be found in the file LICENSE at
// the top level of the DFT-FE distribution.
//
// ---------------------------------------------------------------------
//
/*
* @author Bikash Kanungo, Sambit Das
*/
#ifndef dftfeMPIRequestersNBX_h
#define dftfeMPIRequestersNBX_h
#include <mpi.h>
#include <TypeConfig.h>
#include <MPIRequestersBase.h>
#include <vector>
#include <set>
#include <memory>
namespace dftfe
{
namespace utils
{
namespace mpi
{
class MPIRequestersNBX : public MPIRequestersBase
{
/*
* @brief Implements the Non-blocking Consensus (NBX) algorithm as
* described in the following paper to determine the list of requesting
* processors for the current processor
* @article{hoefler2010scalable,
* title={Scalable communication protocols for dynamic sparse data
* exchange}, author={Hoefler, Torsten and Siebert, Christian and
* Lumsdaine, Andrew}, journal={ACM Sigplan Notices}, volume={45},
* number={5},
* pages={159--168},
* year={2010},
* publisher={ACM New York, NY, USA}
* }
*/
/*
* The following is a brief description of the typical use case
* situation. Each processor has a list of target processors to which it
* wants to send a message (think of it as a message to another
* processor to request some data that is owned by the other processor).
* Similarly, other processors might be requesting data owned by the
* current processor. However, the current processor has no a priori
* knowledge of which processors will be requesting data from it. The
* challenge is to use the current processor's list of target
* processors to determine the current processor's requesting
* processors. In other words, we have to use one-way communication
* information to figure out its dual (the reverse direction).
*
* Perhaps a more concrete example might help. Let's say we have a
* vector/array which is distributed across a set of processors.
* Each processor owns a part of the vector. The ownership is
* exclusive, i.e., a processor is the sole owner of its part of the
* vector. In practice, this means that the processor owns a set of
* indices of the vector. Additionally, the sets of owned indices are
* disjoint across processors, and the union of the sets across all
* the processors gives the full index set of the distributed
* vector. However, each processor also needs information on a set of
* non-owned indices (hereafter termed ghost indices) based on the
* needs of the application. For each ghost index, the current
* processor can easily determine the owning processor. These
* processors are termed target processors, to which the current
* processor has to send a request to access the ghost data.
* Similarly, a ghost index on some other processor might be owned by
* this processor. In that case, the other processor will send a
* request to the current processor to access some of its data (data
* which is ghost to the other processor but owned by the current
* processor). But the current processor has no a priori knowledge of
* which processors will be requesting data from it. Knowing this in
* advance helps the current processor prepare for those requests.
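*
* As a toy illustration (hypothetical ranks and indices), consider
* a 9-entry vector on 3 processors, with P0 owning indices {0,1,2},
* P1 owning {3,4,5}, and P2 owning {6,7,8}. If P0 needs ghost
* indices {4,7}, its target processors are {P1, P2}, and if P2
* needs ghost index {1}, its sole target is P0. The requesting
* processors, which NBX computes, are then {P2} for P0, {P0} for
* P1, and {P0} for P2.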
*
* In cases of sparse communication, that is, where each processor
* only needs to communicate with a small subset of the total number
* of processors, the NBX algorithm determines the list of
* requesting processors with O(log(P)) complexity (where P is the
* number of processors). The algorithm works as follows:
*
* 1. The current processor sends a nonblocking synchronous message
* (i.e., MPI_Issend) to each of its target processors. Remember
* that the current processor already knows its target processors.
* Also, note that the status of a nonblocking synchronous send
* turns to "completed" only when the message has been received by
* the receiving processor. Let's call this operation the
* "local-send", as we are sending requests to target processors
* that are locally known by the current processor.
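*
* For illustration, a minimal sketch of the "local-send" step
* (hypothetical free-standing code with assumed names targetIDs,
* tag, and comm; the class below keeps these buffers and requests
* as members):
*
* @code
* std::vector<int>         sendBufs(targetIDs.size(), 0);
* std::vector<MPI_Request> sendReqs(targetIDs.size());
* for (std::size_t i = 0; i < targetIDs.size(); ++i)
*   MPI_Issend(&sendBufs[i], 1, MPI_INT, targetIDs[i],
*              tag, comm, &sendReqs[i]);
* @endcode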
*
* 2. The current processor keeps performing a nonblocking probe for
* incoming messages (i.e., MPI_Iprobe). MPI_Iprobe checks whether
* there is an incoming message matching a given source and tag,
* where the source is the rank of the processor sending the message
* and the tag is an MPI tag associated with the exchange. It does
* not initiate any receive operation; it only verifies whether
* there is something to be received. For our purpose, we use the
* wildcards MPI_ANY_SOURCE and MPI_ANY_TAG, as we just want to know
* whether there is an incoming message. In the event that there is
* an incoming message (i.e., MPI_Iprobe's flag is true), we can
* extract the source processor from the status handle of MPI_Iprobe
* and append it to a list that stores the requesting processor IDs.
* Additionally, when there is an incoming message, we call a
* nonblocking receive (i.e., MPI_Irecv) to initiate the actual
* reception of the incoming message. The receive, in turn, will
* complete the source processor's MPI_Issend through which the
* incoming message was sent to the current processor. Thus, we
* achieve two things here: we detected a requesting processor,
* and we signaled the requesting processor that we have received
* its message. But this is only the job half-done. How do we tell
* the current processor to stop probing for incoming messages? And
* how do we inform all the processors involved that all the
* incoming messages across all the processors have been received?
* This kind of problem is called a Consensus Problem
* (https://en.wikipedia.org/wiki/Consensus_(computer_science)).
* The way to reach consensus in NBX is a two-step process:
* (a) the current processor checks whether all of its "local-send"
* operations (see #1 above) have been received, i.e., whether the
* status handles of all its MPI_Issend calls have turned to
* completed. If all the "local-send" operations have completed,
* the current processor initiates a nonblocking barrier (i.e.,
* MPI_Ibarrier). This informs the network that the current
* processor has witnessed its part of an event (in this case the
* event is the completion of all its "local-send" operations).
* (b) the above only informs the network that all "local-send"
* operations of the current processor have been received. But the
* current processor can still have incoming messages to be
* received. Hence, the current processor keeps probing and
* receiving incoming messages until the nonblocking barrier
* (MPI_Ibarrier) mentioned above in (a) has been invoked by all
* the processors. This can be checked from the status handle of
* the MPI_Ibarrier, which completes only when all processors call
* it. Once the status of the MPI_Ibarrier turns to completed, we
* know for sure that all the "local-send" operations of all the
* processors have been received and that there are no more
* incoming messages on any processor to be received. Thus, we can
* now safely terminate the nonblocking probe on all processors, as
* sketched below.
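*
* A minimal sketch of the probe-and-consensus loop (continuing the
* hypothetical names from the sketch above, plus a std::set<int>
* requesters; the class below factors these pieces into the member
* functions declared further down):
*
* @code
* bool        barrierActive = false;
* MPI_Request barrierReq;
* while (true)
*   {
*     // probe for an incoming message and, if one is found,
*     // receive it; a blocking MPI_Recv is safe here because the
*     // probe guarantees a matching message has already arrived
*     int        flag = 0;
*     MPI_Status status;
*     MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
*     if (flag)
*       {
*         requesters.insert(status.MPI_SOURCE);
*         int token;
*         MPI_Recv(&token, 1, MPI_INT, status.MPI_SOURCE,
*                  status.MPI_TAG, comm, MPI_STATUS_IGNORE);
*       }
*     if (!barrierActive)
*       {
*         // once all local-sends have been received remotely,
*         // signal it by joining the nonblocking barrier
*         int allSent = 0;
*         MPI_Testall(static_cast<int>(sendReqs.size()),
*                     sendReqs.data(), &allSent,
*                     MPI_STATUSES_IGNORE);
*         if (allSent)
*           {
*             MPI_Ibarrier(comm, &barrierReq);
*             barrierActive = true;
*           }
*       }
*     else
*       {
*         // the barrier completes only after every processor has
*         // joined it, i.e., all messages everywhere are received
*         int done = 0;
*         MPI_Test(&barrierReq, &done, MPI_STATUS_IGNORE);
*         if (done)
*           break;
*       }
*   }
* @endcode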
*
*
*
* @note Since we are only interested in knowing the requesting
* processors for the current processor, we only need token
* MPI sends and receives (e.g., just a single integer across)
* instead of large chunks of data. To that end, we hardcode all
* the send and receive buffers to be of integer type.
*/
public:
MPIRequestersNBX(const std::vector<size_type> &targetIDs,
const MPI_Comm & comm);
//
// default constructor for serial (without MPI) compilation
//
MPIRequestersNBX() = default;
std::vector<size_type>
getRequestingRankIds() override;
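/*
 * Example usage (hypothetical target ranks; size_type is assumed
 * to come from TypeConfig.h):
 *
 * @code
 * // ranks from which this processor wants to request data
 * std::vector<dftfe::size_type> targetIDs = {1, 3};
 * dftfe::utils::mpi::MPIRequestersNBX nbx(targetIDs,
 *                                         MPI_COMM_WORLD);
 * // ranks that have sent requests to this processor
 * const std::vector<dftfe::size_type> requesterIds =
 *   nbx.getRequestingRankIds();
 * @endcode
 */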
private:
/**
* List of processes this processor wants to send requests to.
*/
std::vector<size_type> d_targetIDs;
/**
* Buffers for sending requests.
*/
std::vector<int> d_sendBuffers;
/**
* Requests for sending requests.
*/
std::vector<MPI_Request> d_sendRequests;
/**
* Buffers for receiving requests.
* We use a vector of pointers because that
* guarantees that the buffers themselves
* are never moved around in memory, even if the vector is
* resized and consequently its elements (the pointers) are moved
* around.
*/
std::vector<std::unique_ptr<int>> d_recvBuffers;
/**
* Requests for receiving requests.
*/
std::vector<std::unique_ptr<MPI_Request>> d_recvRequests;
//
// request for barrier
//
MPI_Request d_barrierRequest;
//
// MPI communicator
//
const MPI_Comm &d_comm;
/**
* List of processes who have made a request to this process.
*/
std::set<size_type> d_requestingProcesses;
int d_numProcessors;
int d_myRank;
/**
* Check whether all of the messages sent from the current processor
* to other processors have been received.
*/
bool
haveAllLocalSendReceived();
/**
* Signal to all other processors that all of this processor's
* messages sent to other processors have been received.
* This is done using a nonblocking barrier (i.e., MPI_Ibarrier).
*/
void
signalLocalSendCompletion();
/**
* Check whether all of the incoming messages from other processors to
* the current processor have been received.
*/
bool
haveAllIncomingMsgsReceived();
/**
* Probe for an incoming message and, if there is one, receive it.
*/
void
probeAndReceiveIncomingMsg();
/**
* Start sending messages to all the target processors.
*/
void
startLocalSend();
/**
* After all processors have received all the incoming messages,
* the MPI data structures can be freed and the received messages
* can be processed.
*/
void
finish();
};
} // end of namespace mpi
} // end of namespace utils
} // end of namespace dftfe
#endif // dftfeMPIRequestersNBX_h