In the multi-processing execution model, all accesses to non-local elements of distributed arrays are implemented by means of message-passing. If a processor has to access non-local data owned by a remote processor, that data has to be transferred by message-passing communication from the remote processor into temporary variables on the accessing processor. As a consequence, each processor has to determine both the data it must send to other processors and the data it must receive from them. Since all cross-processor dependences are satisfied by message-passing in this way, no explicit synchronization is required.
In the multi-threaded execution model, accesses to non-local data require synchronization, usually realized by a barrier synchronization among all processors.
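The send/receive analysis of the message-passing model can be sketched as follows (a minimal Python sketch, assuming a simple block distribution; `owner` and `receive_sets` are illustrative names, not part of any HPF compiler):

```python
def owner(i, n, nprocs):
    # owner of global element i of an array of length n,
    # assuming a block distribution with equal-sized blocks
    block = (n + nprocs - 1) // nprocs
    return i // block

def receive_sets(read_indices, n, nprocs, me):
    # for the global indices read by processor 'me', determine
    # which elements must be received from which remote owner
    recv = {}
    for i in read_indices:
        o = owner(i, n, nprocs)
        if o != me:
            recv.setdefault(o, []).append(i)
    return recv

# processor 0 of 4 reads elements 0, 30 and 60 of a 100-element array:
# element 0 is local, 30 is owned by processor 1, 60 by processor 2
```

The matching send sets follow symmetrically: each owner must learn which of its elements the other processors will read.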
For INDEPENDENT loops the HPF compiler must usually insert communication for all data that is non-local to the processor that executes the corresponding iteration of the loop.
integer, dimension (N) :: IND1, IND2
!hpf$ distribute (block) :: IND1, IND2
real, dimension (M) :: A, B
!hpf$ distribute (block) :: A, B
!adp$ shared :: A, B    ! ADAPTOR specific directive on machines
                        ! that support a global address space
...
!hpf$ independent, on home (IND1(I))
do I = 1, N
   A(IND1(I)) = B(IND2(I))
end do
If the array A is shared, the compiler generates barrier synchronization before and after the independent loop instead of communication for updating the values of A.
If the array B is shared, no compiler-generated communication will be necessary for accessing the values of B. Instead of communication, synchronization is added.
Note: If both arrays are shared, the barrier synchronizations are of course merged, so that only a single barrier is executed before and a single barrier after the loop. Synchronization at both points is necessary: the barrier before the loop guarantees that all earlier writes to the shared arrays are visible, and the barrier after the loop guarantees that no processor reads the updated values before all iterations have completed.
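For illustration, the barrier scheme can be mimicked with threads operating on a shared array (a hedged Python sketch of the multi-threaded model; the arrays and the cyclic partitioning of iterations are made up for the example):

```python
import threading

N, NPROCS = 8, 2
A = [0.0] * N
B = [float(i) for i in range(N)]
IND1 = list(range(N))                 # indirection arrays of the example
IND2 = [N - 1 - i for i in range(N)]
barrier = threading.Barrier(NPROCS)

def worker(me):
    barrier.wait()                    # barrier BEFORE the loop:
                                      # earlier writes to B must be visible
    for i in range(me, N, NPROCS):    # this thread's share of the iterations
        A[IND1[i]] = B[IND2[i]]       # may write non-local elements of A
    barrier.wait()                    # barrier AFTER the loop:
                                      # nobody reads A before all writes are done

threads = [threading.Thread(target=worker, args=(p,)) for p in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No message-passing appears here; the barriers alone order the accesses to the shared arrays.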
Array statements and parallel loops might contain implicit communication. The general strategy is to extract the communication out of the parallel loop so that only local, independent computations remain. During this transformation the HPF compiler might create temporary arrays.
integer, parameter :: N=100, NA=200, NB=50
real, dimension (N)  :: VAL
real, dimension (NA) :: A
real, dimension (NB) :: B
!hpf$ distribute (block) :: VAL, A, B
integer IND1(N), IND2(N)
!hpf$ align (I) with VAL(I) :: IND1, IND2

forall (I=1:N) A(IND1(I)) = B(IND2(I)) + VAL(I)   ! on home VAL(I)
The complex FORALL statement will be handled in the following way:
real TMP1(N), TMP2(N)
!hpf$ align (I) with VAL(I) :: TMP1, TMP2

forall (I=1:N) TMP1(I) = B(IND2(I))         ! pre-fetch
forall (I=1:N) TMP2(I) = TMP1(I) + VAL(I)   ! local
forall (I=1:N) A(IND1(I)) = TMP2(I)         ! post-store
Before the parallel loop, which itself is now a local one, the processors collect the data that will be needed within the parallel loop. This is also called pre-fetching.
After the parallel loop, the processors will send the non-locally computed data to the owners of that data. This is also called post-store. Due to the owner-computes rule, this is not necessary in most situations, but in case of indirect addressing or more complex parallel loops this kind of communication must be generated.
Some situations can be identified where it is not easy or not possible to extract the communication completely outside of the parallel loop.
Original loop:

!hpf$ independent
do I = 1, N
   X(I) = ...
   Y(I) = f(X(I+k))
end do

Loop with extracted communication:

TMP(1:N) = X(k+1:k+N)
!hpf$ independent, on home(Y(I)), resident
do I = 1, N
   X(I) = ...
   Y(I) = f(TMP(I))
end do
But extracting the communication as it is done here produces wrong code if the value of k is 0. In this case there is a loop-independent true dependence, since the value of X needed in an iteration is computed during that same iteration.
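The dependence problem can be reproduced with a small Python model of the two loop versions (`compute` stands for the elided right-hand side and `f` for the function f; all names are illustrative):

```python
def original(X, k, f, compute):
    # reference semantics: X(I) is written, then X(I+k) is read
    X = X[:]
    n = len(X) - k
    Y = [0.0] * n
    for i in range(n):
        X[i] = compute(i)
        Y[i] = f(X[i + k])
    return Y

def transformed(X, k, f, compute):
    # extracted communication: TMP(1:N) = X(k+1:k+N) before the loop
    X = X[:]
    n = len(X) - k
    TMP = X[k:k + n]
    Y = [0.0] * n
    for i in range(n):
        X[i] = compute(i)
        Y[i] = f(TMP[i])
    return Y

X0 = [1.0] * 6
f = lambda v: 2.0 * v
compute = lambda i: float(i + 10)
# for k >= 1 both versions agree, since X(I+k) is never written before
# it is read; for k = 0 the pre-fetched copy misses the values computed
# in the same iteration
```

Running both versions shows the agreement for k = 1 and the divergence for k = 0.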
!hpf$ independent
do I = 1, N
   X(I) = ...
   call SUB (I, X(I), Y(I))
end do
pure subroutine F(X1)
   real X1
   integer P
   P = G(X1)
   X1 = X1 * A(P)
end subroutine F
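The difficulty with such calls can be seen in a Python analogue of the pure subroutine (A, G and F are modeled after the example; the modulo index computation is invented for the sketch): the index P into the distributed array is computed inside the routine, so the elements that must be communicated cannot be determined before the loop executes.

```python
A = [1.0, 2.0, 3.0, 4.0]     # stands for the distributed array A

def G(x):
    # index computation: the result is known only at run time
    return int(x) % len(A)

def F(x1):
    # analogue of the pure subroutine: reads A(P) with P = G(X1)
    p = G(x1)
    return x1 * A[p]

# which element of A is accessed depends on the data itself:
# F(5.0) reads A[1], F(7.0) reads A[3]
```

Since the accessed element depends on the argument value, a compile-time pre-fetch of the required elements of A is not possible.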
The extraction of communication is absolutely necessary because every processor that has to send or to receive data must be involved in it.
This problem no longer exists if one-sided communication is available. In this case, a processor can access non-local data directly during the parallel execution, and only synchronization before and after the parallel loop is necessary.