next up previous contents index
Next: 5 Communication and Synchronization Up: ADAPTOR HPF Programmers Guide Previous: 3 Home of Computations   Contents   Index


4 Local Computations

A local computation is a computation that works only on data mapped to the same processor and therefore involves neither communication nor synchronization. The more operations are performed between operands that reside on the same processor, the more efficient the program will be.

This section explains when data-parallel statements and parallel loops are local and therefore very efficient for parallel execution.

Attention: It is very important that the compiler can already recognize at compile time which operands reside on the same processor. Otherwise it has to generate additional run-time queries about data locality, which can be very expensive.

4.1 Local Array Assignments

In the following, none of the statements causes any communication. It is assumed that the home of the computations is always given by the owner-computes rule. Data on the left-hand side and on the right-hand side always reside on the same processor.

      real, dimension (N,N)    ::  A, A1
      integer, dimension (N,N) :: IND
!hpf$ distribute (*,block) :: A, A1, IND
      real G

      A = A1                   ! same distribution
      A = G                    ! G is replicated on all processors
      A(1,1:N) = A1(2,1:N)     ! same slice in distributed dimension
      A(I,J) = A1(IND(I,J),J)  ! same value in distributed dimension

Internally, ADAPTOR translates array statements into parallel loops, namely INDEPENDENT loops.
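For illustration, the first assignment A = A1 corresponds roughly to the following loop nest (a sketch of this translation, not the exact code that ADAPTOR generates):

!hpf$ independent
      do J = 1, N
         do I = 1, N
            A(I,J) = A1(I,J)    ! both operands owned by the same processor
         end do
      end do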

4.2 Local FORALL Statements

As with array statements, FORALL statements are identified as local statements at compile time if the data is aligned or mapped to the same processor.

      real, dimension (N,N) :: A, B
!hpf$ distribute (*,block) :: A, B
      ...
      forall (I=1:N, J=1:N) A(I,J) = 1.0 / real(I+J-1)
      forall (I=1:N, J=1:N, A(I,J) .ne. 0.0) B(I,J) = 1.0 / A(I,J)


4.3 Local Independent Loops

The INDEPENDENT directive specifies only that the iterations can be executed independently. It does not imply that the iterations can be mapped to processors without any communication. But if every iteration works only on data mapped to the same processor, no communication is necessary. The home of an iteration is then automatically the processor to which the data belongs.

      real, dimension (N,N)    ::  A, A1
!hpf$ distribute (*,block) :: A, A1 

!hpf$ independent    ! on home (A(1,I))
      do I = 1, N
         A(1,I) = A1(2,I) + A(1,I)
      end do

Note: Without the INDEPENDENT directive, ADAPTOR generates code in which all processors run through all iterations but skip the assignment in the body when the data is not local. This locality test causes overhead. The overhead does not arise when the INDEPENDENT directive is used, as in this case a local range for the loop iterations is determined.
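Without the directive, the generated code corresponds roughly to the following sketch; the locality test is shown here as a hypothetical function IS_LOCAL, which is not an actual ADAPTOR routine:

      do I = 1, N                       ! all processors run all iterations
         if (IS_LOCAL (A(1,I))) then    ! overhead: locality test per iteration
            A(1,I) = A1(2,I) + A(1,I)
         end if
      end do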

The following example demonstrates the implementation of a matrix-vector multiplication $Y = Y + A*X$ that is completely local. The vector Y and the rows of A are block distributed, and the alignment of X causes X to be replicated on all processors. Thus, no communication is generated for the outer loop.

      real, dimension (N,N) :: A
      real, dimension (N)   :: X, Y
!hpf$ distribute (block,*)  :: A
!hpf$ align X(J) with A(*,J)
!hpf$ align Y(I) with A(I,*)
      ...
!hpf$ independent    ! on home (A(I,:))
      do I = 1, N
         do J = 1, N
            Y(I)  = Y(I) + A(I,J) * X(J)
         end do
      end do


4.4 Independent Loops and NEW Directive

The NEW directive guarantees that the corresponding variable does not cause any dependence. It can be thought of as giving every iteration its own incarnation of the variable. In this way, the variable is resident.

      real, dimension (N,N)    ::  A, A1
!hpf$ distribute (*,block) :: A, A1 
      real :: S

!hpf$ independent, new (S)
      do I = 1, N
          S = A1 (2,I) + A(1,I)
          A(1,I) = S
      end do


4.5 Independent Loops and RESIDENT Directive

As already mentioned, it is very important to know at compile time that data is resident. ADAPTOR does its best to verify this. In some situations, however, the user might have to assert this property with the RESIDENT directive.

Attention: The RESIDENT directive can only be used together with the ON HOME clause.

      real, dimension (N) :: A
      real, dimension (NZ) :: VALS
      integer, dimension (N) :: OFS, NK
!hpf$ distribute (block) :: A, OFS, NK
!hpf$ distribute (gen_block(IA)) :: VALS   ! IA holds one block size per processor
...
!hpf$ independent, on home (A(I)), resident (VALS)
      do I = 1, N
         do K = 1, NK(I)
            A(I) = A(I) + VALS (OFS(I)+K)
         end do
      end do

In this example, the compiler knows that the values of NK(I) and OFS(I) are on the same processor as the value of A(I). But it cannot assume that the values of VALS(OFS(I)+K) for $1 \le K \le NK(I)$ are on the same processor as A(I). If the user has defined a general block distribution of VALS that implies this property, the resident property can be asserted by specifying the clause RESIDENT(VALS) for the INDEPENDENT loop. As the resident property only holds if iteration I of the loop is executed by the owner of A(I), the ON HOME clause is mandatory.
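A matching general block distribution can be constructed by summing NK over the rows that each processor owns. The following sketch assumes NP processors and a block size NB for the distribution of A; the names NP and NB are illustrative and not part of the example above:

      integer, dimension (NP) :: IA
      ...
      ! processor P owns rows (P-1)*NB+1 .. min(P*NB,N) of A;
      ! give it exactly the VALS entries of these rows
      do P = 1, NP
         IA(P) = sum (NK ((P-1)*NB+1 : min (P*NB, N)))
      end do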

4.6 Importance of Alignment for Local Computations

With the ALIGN directive, arrays can be mapped in such a way that certain computations do not cause any communication.

      real, dimension (0:2*N)   :: A
      real, dimension (0:N)     :: B
      real, dimension (1:2*N-1) :: A1
!hpf$ align B(I)  with A(2*I)
!hpf$ align A1(I) with A(I)
      ...
      forall (I=0:N) B(I) = A(2*I)       ! local computation
      forall (I=1:2*N-1) A1(I) = A(I)    ! local computation

In this example, ADAPTOR knows at compile time that no communication is necessary. This would not be the case without the ALIGN directives.

If two arrays are distributed in the same way, a local computation can only be identified if the sizes of the arrays are known at compile time. For allocatable arrays, the sizes might not be known at compile time.

      real, allocatable :: A(:,:), B(:,:)
!hpf$ distribute (block, block) :: A, B
      ...
      forall (I=1:N,J=1:N) A(I,J) = B(I,J)

This FORALL statement cannot be identified as a local computation because ADAPTOR must assume that the arrays A and B have different sizes. With the ALIGN directive, the computation can be specified as a local one.

      real, allocatable :: A(:,:), B(:,:)
!hpf$ distribute (block, block) :: A
!hpf$ align B(I,J) with A(I,J)
      ...
      forall (I=1:N,J=1:N) A(I,J) = B(I,J)


4.7 PURE Procedures

Pure procedures have no side effects and cannot cause any dependencies if their arguments do not cause dependencies. Therefore they can be used within parallel loops.

      pure real function F (X1, X2)
      real, intent (in) :: X1, X2
      F = (X1 - 1) * (X2 + 1)
      end function F

      integer :: N, M
      real, dimension (N,M) :: A, RA
!hpf$ distribute A (block, block)
      forall (I=1:N, J=1:M)
         A(I,J) = F(A(I,J), RA(I,J))
      end forall

The compiler has to generate communication if the arguments are not resident. But be careful: a write access to a replicated variable, which has an incarnation on every processor, might cause inconsistencies.

      pure subroutine S (X, Y)
      real, intent (out) :: X
      real, intent (in)  :: Y
         X = Y
      end subroutine S

      real, dimension (N) :: A, RA
!hpf$ distribute (block) :: A, RA
      ...
!hpf$ independent, resident, on home (A(I))
      do I = 1, N
         A(I) = 1.0 / float (I)
         call  S(RA(I), A(I))
      end do

Note: The RESIDENT directive is not used correctly here if the variable RA is referenced afterwards.

Attention also has to be paid to the data accessed within a PURE routine. Access to local data does not cause problems in a pure subprogram. But the use of a global array within a pure subprogram requires special attention:

      pure subroutine P(I)
      integer :: I
      common /YOM/ A
      real, dimension (100) :: A
!hpf$ distribute A(block)
      real :: X
      X = A(I) + 1.0      ! A(I) local or remote read, allowed
      A(I) = X            ! A(I) local or remote write, not allowed in PURE
      end subroutine P

ADAPTOR assumes that within a subroutine every access is resident on the active processor or on the active processor set.

4.8 Coupling of the ON and RESIDENT Directives

Unfortunately, compilers are conservative and may also introduce synchronization or communication where it is not really necessary. The RESIDENT directive tells the compiler that only local data is accessed and no communication has to be generated. This guarantees that only the specified processors are involved and that the code can definitely be skipped by the other processors.

!hpf$   on (PROCS(1:2)), resident
        call TASK1 (A1,N)
!hpf$   on (PROCS(3:4)), resident
        call TASK2 (A2,N)

The RESIDENT directive is very useful for task parallelism where subroutines are called. It gives the compiler the important information that within the routine only resident data is accessed.

4.9 Parallelism with the ON Directive

The ON directive on its own already introduces task parallelism in a natural way. If the data is available on the specified processor set and no communication is required, the statement can be skipped by all other processors. Code blocks mapped to disjoint processor sets will be executed in parallel.

        real, dimension (N) :: A1, A2
!hpf$   processors PROCS(4)
!hpf$   distribute A1 (block) onto PROCS(1:2)
!hpf$   distribute A2 (block) onto PROCS(3:4)
        ...
!hpf$   on (PROCS(1:2))
        call TASK1 (A1,N)
!hpf$   on (PROCS(3:4))
        call TASK2 (A2,N)

The code blocks might not be executed simultaneously if communication is involved. This is the case if one of the code blocks uses data that has no incarnation on the specified processor subset, or if it defines data that also has incarnations on other processors.
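For example, if the first code block also accesses data mapped to the other processor subset, communication is required and the two tasks can no longer run fully in parallel (a sketch):

!hpf$   on (PROCS(1:2))
        A1(1) = A2(1)      ! A2 has no incarnation on PROCS(1:2),
                           ! so communication with PROCS(3:4) is needed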


4.10 Local Procedures

A local routine is called on every processor of the active processor set. Within the routine, every processor sees only the local part of the actual arguments.

      program WORK
      real, dimension (N,N) :: A
!hpf$ distribute A(*,block)
      ...
      call SUB (A(:,:))
      ...
      end program WORK

      extrinsic (HPF_LOCAL) subroutine SUB (A)
      real, dimension (:,:) :: A
      ...
      print *, lbound(A,1), ubound(A,1)
      print *, lbound(A,2), ubound(A,2)
      ...
      end subroutine SUB

As a local routine is called independently on all processors, it provides a high degree of parallelism. The HPF compiler does not have to generate any communication within the subroutine. The user may call message passing routines within a local routine, but in this case the user is responsible for any communication. Only local data can be accessed within the local routine. Local routines can also be used to provide an HPF interface to implementation-specific parallel libraries.

The processors execute a local routine completely asynchronously, so they might branch into different sections of the code. Therefore, it is not possible to have any global operation within a local routine, such as a redistribution or a call to a global routine. This implies corresponding restrictions for local routines.

ADAPTOR supports local routines as proposed in the HPF standard. The HPF_LOCAL_LIBRARY is available.
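As an illustration, a local routine can query the global mapping via the HPF_LOCAL_LIBRARY; the sketch below uses the standard library routines MY_PROCESSOR and GLOBAL_SIZE:

      extrinsic (HPF_LOCAL) subroutine INFO (A)
      use HPF_LOCAL_LIBRARY
      real, dimension (:,:) :: A
      ! local extent versus global extent of the first dimension
      print *, 'processor ', MY_PROCESSOR()
      print *, size (A,1), ' of ', GLOBAL_SIZE (A,1), ' rows local'
      end subroutine INFO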


4.11 Private Variables

A private variable is a variable that has an incarnation on every processor, and every processor can modify this variable for its own purposes. Replicated variables also have an incarnation on every processor, but the compiler ensures that all processors always have the same value; all incarnations of a replicated variable must be consistent. This is not the case for private variables, whose use requires no communication at all.

Within the context of HPF, private variables exist, for example, as variables declared NEW within INDEPENDENT loops or as local variables of PURE and HPF_LOCAL procedures.

Computations that work only on private variables will not imply any communication. The computations are always local.


Thomas Brandes 2004-03-18