In the multi-processing execution model, all accesses to non-local elements of distributed arrays are implemented by means of message-passing. If a processor has to access non-local data owned by a remote processor, that data has to be transferred by message-passing communication from the remote processor into temporary variables on the accessing processor. As a consequence, each processor has to determine both the data it must send to other processors and the data it must receive from them. Since all cross-processor dependences are satisfied by message-passing in this way, no explicit synchronization is required.
In the multi-threaded execution model, accesses to non-local data require synchronization, usually realized by a barrier synchronization among all processors.
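The send/receive analysis of the message-passing model can be sketched as follows (a minimal Python sketch, assuming a simple block distribution; `owner` and `receive_sets` are illustrative names, not part of any HPF compiler):

```python
def owner(i, n, nprocs):
    # owner of global element i of an array of length n,
    # assuming a block distribution with equal-sized blocks
    block = (n + nprocs - 1) // nprocs
    return i // block

def receive_sets(read_indices, n, nprocs, me):
    # for the global indices read by processor 'me', determine
    # which elements must be received from which remote owner
    recv = {}
    for i in read_indices:
        o = owner(i, n, nprocs)
        if o != me:
            recv.setdefault(o, []).append(i)
    return recv

# processor 0 of 4 reads elements 0, 30 and 60 of a 100-element array:
# element 0 is local, 30 is owned by processor 1, 60 by processor 2
```

The matching send sets follow symmetrically: each owner must learn which of its elements the other processors will read.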
For INDEPENDENT loops the HPF compiler must usually insert communication for all data that is non-local to the processor that executes the corresponding iteration of the loop.
integer, dimension (N) :: IND1, IND2
!hpf$ distribute (block) :: IND1, IND2
real, dimension (M) :: A, B
!hpf$ distribute (block) :: A, B
!adp$ shared :: A, B    ! ADAPTOR specific directive on machines
                        ! that support a global address space
...
!hpf$ independent, on home (IND1(I))
do I = 1, N
   A(IND1(I)) = B(IND2(I))
end do
If the array A is shared, the compiler generates barrier synchronization before and after the independent loop instead of communication for updating the values of A.
If the array B is shared, no compiler-generated communication will be necessary for accessing the values of B. Instead of communication, synchronization is added.
Note: If both arrays are shared, the barrier synchronizations are of course merged, so that only a single barrier is executed before and a single barrier after the loop. Synchronization at both points is necessary: the barrier before the loop guarantees that all earlier writes to the shared arrays are visible, and the barrier after the loop guarantees that no processor reads the updated values before all iterations have completed.
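For illustration, the barrier scheme can be mimicked with threads operating on a shared array (a hedged Python sketch of the multi-threaded model; the arrays and the cyclic partitioning of iterations are made up for the example):

```python
import threading

N, NPROCS = 8, 2
A = [0.0] * N
B = [float(i) for i in range(N)]
IND1 = list(range(N))                 # indirection arrays of the example
IND2 = [N - 1 - i for i in range(N)]
barrier = threading.Barrier(NPROCS)

def worker(me):
    barrier.wait()                    # barrier BEFORE the loop:
                                      # earlier writes to B must be visible
    for i in range(me, N, NPROCS):    # this thread's share of the iterations
        A[IND1[i]] = B[IND2[i]]       # may write non-local elements of A
    barrier.wait()                    # barrier AFTER the loop:
                                      # nobody reads A before all writes are done

threads = [threading.Thread(target=worker, args=(p,)) for p in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No message-passing appears here; the barriers alone order the accesses to the shared arrays.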
Array statements and parallel loops might contain implicit communication. The general strategy is to extract the communication out of the parallel loop so that only local, independent computations remain. During this transformation the HPF compiler might create temporary arrays.
integer, parameter :: N=100, NA=200, NB=50
real, dimension (N)  :: VAL
real, dimension (NA) :: A
real, dimension (NB) :: B
!hpf$ distribute (block) :: VAL, A, B
integer IND1(N), IND2(N)
!hpf$ align (I) with VAL(I) :: IND1, IND2

forall (I=1:N) A(IND1(I)) = B(IND2(I)) + VAL(I)   ! on home VAL(I)
The complex FORALL statement will be handled in the following way:
real TMP1(N), TMP2(N)
!hpf$ align (I) with VAL(I) :: TMP1, TMP2

forall (I=1:N) TMP1(I) = B(IND2(I))         ! pre-fetch
forall (I=1:N) TMP2(I) = TMP1(I) + VAL(I)   ! local
forall (I=1:N) A(IND1(I)) = TMP2(I)         ! post-store
Before the parallel loop, which itself is now a local one, the processors collect the data that will be needed within the parallel loop. This is also called pre-fetching.
After the parallel loop, the processors will send the non-locally computed data to the owners of that data. This is also called post-store. Due to the owner-computes rule, this is not necessary in most situations, but in case of indirect addressing or more complex parallel loops this kind of communication must be generated.
Some situations can be identified where it is not easy or not possible to extract the communication completely outside of the parallel loop.
Original loop:

!hpf$ independent
do I = 1, N
   X(I) = ...
   Y(I) = f(X(I+k))
end do

Loop with extracted communication:

TMP(1:N) = X(k+1:k+N)
!hpf$ independent, on home(Y(I)), resident
do I = 1, N
   X(I) = ...
   Y(I) = f(TMP(I))
end do
But extracting the communication as it is done here produces wrong code if the value of k is 0. In this case there is a loop-independent true dependence, since the value of X needed in an iteration is computed during that same iteration.
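The dependence problem can be reproduced with a small Python model of the two loop versions (`compute` stands for the elided right-hand side and `f` for the function f; all names are illustrative):

```python
def original(X, k, f, compute):
    # reference semantics: X(I) is written, then X(I+k) is read
    X = X[:]
    n = len(X) - k
    Y = [0.0] * n
    for i in range(n):
        X[i] = compute(i)
        Y[i] = f(X[i + k])
    return Y

def transformed(X, k, f, compute):
    # extracted communication: TMP(1:N) = X(k+1:k+N) before the loop
    X = X[:]
    n = len(X) - k
    TMP = X[k:k + n]
    Y = [0.0] * n
    for i in range(n):
        X[i] = compute(i)
        Y[i] = f(TMP[i])
    return Y

X0 = [1.0] * 6
f = lambda v: 2.0 * v
compute = lambda i: float(i + 10)
# for k >= 1 both versions agree, since X(I+k) is never written before
# it is read; for k = 0 the pre-fetched copy misses the values computed
# in the same iteration
```

Running both versions shows the agreement for k = 1 and the divergence for k = 0.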
!hpf$ independent
do I = 1, N
   X(I) = ...
   call SUB (I, X(I), Y(I))
end do
pure subroutine F(X1)
   real X1
   integer P
   P = G(X1)
   X1 = X1 * A(P)
end subroutine F
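The difficulty with such calls can be seen in a Python analogue of the pure subroutine (A, G and F are modeled after the example; the modulo index computation is invented for the sketch): the index P into the distributed array is computed inside the routine, so the elements that must be communicated cannot be determined before the loop executes.

```python
A = [1.0, 2.0, 3.0, 4.0]     # stands for the distributed array A

def G(x):
    # index computation: the result is known only at run time
    return int(x) % len(A)

def F(x1):
    # analogue of the pure subroutine: reads A(P) with P = G(X1)
    p = G(x1)
    return x1 * A[p]

# which element of A is accessed depends on the data itself:
# F(5.0) reads A[1], F(7.0) reads A[3]
```

Since the accessed element depends on the argument value, a compile-time pre-fetch of the required elements of A is not possible.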
The extraction of communication is absolutely necessary because every processor that has to send or to receive data must be involved in it.
This problem no longer exists if one-sided communication is available. In this case, a processor can access non-local data directly during the parallel execution, and only synchronization before and after the parallel loop is necessary.