A local computation is a computation that operates only on data mapped to the same processor and that involves no communication or synchronization. The more operations that are performed between operands residing on the same processor, the more efficient the program will be.
This section explains when data parallel statements and parallel loops are local and will therefore be very efficient in parallel execution.
Attention: It is very important that the compiler can recognize at compile time which operands reside on the same processor. Otherwise it has to generate additional run-time queries about data locality, which can be very expensive.
In the following examples, none of the statements causes any communication. It is assumed that the home of a computation is always given by the owner computes rule; data on the left hand side and on the right hand side then always reside on the same processor.
real, dimension (N,N)    :: A, A1
integer, dimension (N,N) :: IND
!hpf$ distribute (*,block) :: A, A1, IND
real G

A = A1                    ! same distribution
A = G                     ! G is replicated on all processors
A(1,1:N) = A1(2,1:N)      ! same slice in distributed dimension
A(I,J) = A1(IND(I,J),J)   ! same value in distributed dimension
Internally, ADAPTOR translates array statements to parallel loops, namely INDEPENDENT loops.
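For illustration, the first array assignment above corresponds roughly to the following loop nest (a sketch of the internal translation; the code actually generated by ADAPTOR will differ in detail):

```fortran
! Sketch: the array statement  A = A1  is treated internally like
!hpf$ independent
do J = 1, N
   do I = 1, N
      A(I,J) = A1(I,J)
   end do
end do
```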
Similar to array statements, FORALL statements are identified as local statements at compile time if the data is aligned or mapped to the same processor.
real, dimension (N,N) :: A, B
!hpf$ distribute (*,block) :: A, B
...
forall (I=1:N, J=1:N) A(I,J) = 1.0 / real(I+J-1)
forall (I=1:N, J=1:N, A(I,J) .ne. 0.0) B(I,J) = 1.0 / A(I,J)
The INDEPENDENT directive specifies only that the iterations can be executed independently. It does not imply that the iterations can be mapped to processors without any communication. But if every iteration works on data mapped to the same processor, no communication is necessary. The home of an iteration is then automatically the processor to which the data belongs.
real, dimension (N,N) :: A, A1
!hpf$ distribute (*,block) :: A, A1
!hpf$ independent   ! on home (A(1,I))
do I = 1, N
   A(1,I) = A1(2,I) + A(1,I)
end do
Note: Without the INDEPENDENT directive, ADAPTOR generates code where all processors run through all iterations but skip the assignment in the body when the data is not local. This test for locality causes an overhead. The overhead is avoided when the INDEPENDENT directive is used, as in this case a local range for the loop iterations is determined.
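The difference can be sketched as follows; owned_by_me, my_lb, and my_ub are hypothetical names standing for the compiler-internal locality test and the local iteration bounds:

```fortran
! Without INDEPENDENT: every processor runs all iterations
do I = 1, N
   if (owned_by_me(A(1,I))) then   ! run-time locality test (overhead)
      A(1,I) = A1(2,I) + A(1,I)
   end if
end do

! With INDEPENDENT: the loop is restricted to the local range
do I = my_lb, my_ub                ! no run-time test needed
   A(1,I) = A1(2,I) + A(1,I)
end do
```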
The following example demonstrates the implementation of a matrix-vector multiplication that is completely local. The vector Y and the rows of A are block distributed, and the alignment of X causes X to be replicated on all processors. Thus, no communication is generated for the outer loop.
real, dimension (N,N) :: A
real, dimension (N)   :: X, Y
!hpf$ distribute (block,*) :: A
!hpf$ align X(J) with A(*,J)
!hpf$ align Y(I) with A(I,*)
...
!hpf$ independent   ! on home (A(I,:))
do I = 1, N
   do J = 1, N
      Y(I) = Y(I) + A(I,J) * X(J)
   end do
end do
The NEW directive guarantees that the corresponding variable does not cause any dependence. It can be thought of as giving every iteration its own incarnation of the variable. In this way, the variable is resident.
real, dimension (N,N) :: A, A1
!hpf$ distribute (*,block) :: A, A1
real :: S
!hpf$ independent, new (S)
do I = 1, N
   S = A1(2,I) + A(1,I)
   A(1,I) = S
end do
As already mentioned, it is very important to know at compile time that data is resident. ADAPTOR does its best to verify this. In some situations, however, the user may have to assert this property with the RESIDENT directive.
Attention: The RESIDENT directive can only be used together with the ON HOME clause.
real, dimension (N)    :: A
real, dimension (NZ)   :: VALS
integer, dimension (N) :: OFS, NK
!hpf$ distribute (block) :: A, OFS, NK
!hpf$ distribute (gen_block(IA)) :: VALS
...
!hpf$ independent, on home (A(I)), resident (VALS)
do I = 1, N
   do K = 1, NK(I)
      A(I) = A(I) + VALS(OFS(I)+K)
   end do
end do
In this example, the compiler knows that the values of NK(I) and OFS(I) are on the same processor as the value of A(I). But it cannot assume that the values of VALS(OFS(I)+K) are on the same processor as A(I). If the user has defined a general block distribution of VALS that implies this property, he can assert it by specifying the clause RESIDENT(VALS) for the INDEPENDENT loop. As the resident property only holds if iteration I of the loop is executed by the owner of A(I), the ON HOME clause is mandatory.
By means of the ALIGN directive, arrays can be mapped in such a way that certain computations do not cause any communication.
real, dimension (0:2*N)   :: A
real, dimension (0:N)     :: B
real, dimension (1:2*N-1) :: A1
!hpf$ align B(I) with A(2*I)
!hpf$ align A1(I) with A(I)
...
forall (I=0:N) B(I) = A(2*I)      ! local computation
forall (I=1:2*N-1) A1(I) = A(I)   ! local computation
In this example, ADAPTOR knows at compile time that no communication is necessary. This would not be the case without the ALIGN directives.
If two arrays are distributed in the same way, a local computation can only be identified if the sizes of the arrays are known at compile time. For allocatable arrays, the sizes might not be known at compile time.
real, allocatable :: A(:,:), B(:,:)
!hpf$ distribute (block, block) :: A, B
...
forall (I=1:N, J=1:N) A(I,J) = B(I,J)
This FORALL statement cannot be identified as a local computation because ADAPTOR must assume that the arrays A and B have different sizes. With the ALIGN directive, the computation can be specified as a local one.
real, allocatable :: A(:,:), B(:,:)
!hpf$ align B(I,J) with A(I,J)
...
forall (I=1:N, J=1:N) A(I,J) = B(I,J)
Pure procedures have no side effects and cannot cause any dependences if their arguments do not cause dependences. Therefore they can be used within parallel loops.
pure real function F (X1, X2)
   real :: X1, X2
   F = (X1 - 1) * (X2 + 1)
end function F

real, dimension (N,M) :: A, RA
integer :: N, M
!hpf$ distribute A (block, block)

forall (I=1:N, J=1:M)
   A(I,J) = F(A(I,J), RA(I,J))
end forall
The compiler has to generate communication if the arguments are not resident. But be careful: a write access to a replicated variable that has incarnations on several processors might cause inconsistencies.
pure subroutine S (X, Y)
   real :: X, Y
   X = Y
end subroutine S

real, dimension (N) :: A, RA
!hpf$ distribute (block) :: A, RA
...
!hpf$ independent, resident, on home (A(I))
do I = 1, N
   A(I) = 1.0 / float(I)
   call S(RA(I), A(I))
end do
Note: The RESIDENT directive is used incorrectly here if the variable RA is used afterwards.
Attention also has to be paid to the data accessed within a PURE routine. Access to local data does not cause problems in a pure subprogram. But the use of a global array within a pure subprogram needs special care:
pure subroutine P (I)
   integer :: I
   common /YOM/ A
   real, dimension (100) :: A
!hpf$ distribute A(block)
   real :: X
   X = A(I) + 1.0   ! A(I) local or remote read, allowed
   A(I) = X         ! A(I) local or remote write, not allowed in PURE
end subroutine P
ADAPTOR assumes that within a subroutine every access is resident on the active processor or on the active processor set.
Unfortunately, compilers are conservative and may also introduce synchronization or communication where it is not really necessary. The RESIDENT directive tells the compiler that only local data is accessed and no communication has to be generated. This guarantees that only the specified processors are involved and that the code can definitely be skipped by the other processors.
!hpf$ on (PROCS(1:2)), resident
      call TASK1 (A1,N)
!hpf$ on (PROCS(3:4)), resident
      call TASK2 (A2,N)
The RESIDENT directive is very useful for task parallelism where subroutines are called. It gives the compiler the important information that within the routine only resident data is accessed.
The ON directive on its own already introduces task parallelism in a natural way. If the data is available on the specified processor set and no communication is required, the statement can be skipped by all other processors. Code blocks mapped to disjoint processor sets will be executed in parallel.
real, dimension (N) :: A1, A2
!hpf$ processors PROCS(4)
!hpf$ distribute A1 (block) onto PROCS(1:2)
!hpf$ distribute A2 (block) onto PROCS(3:4)
...
!hpf$ on (PROCS(1:2))
      call TASK1 (A1,N)
!hpf$ on (PROCS(3:4))
      call TASK2 (A2,N)
The code blocks might not be executed simultaneously if communication is involved. This will be the case if one of the code blocks uses data that has no incarnation on the specified processor subset, or if it defines data that also has incarnations on other processors.
A local routine is called on every processor in the active processor set. Every processor sees in the routine only the local part of the actual arguments.
program WORK
real, dimension (N,N) :: A
!hpf$ distribute A(*,block)
...
call SUB (A(:,:))
...
end program WORK

extrinsic (HPF_LOCAL) &
subroutine SUB (A)
real, dimension (:,:) :: A
...
print *, lbound(A,1), ubound(A,1)
print *, lbound(A,2), ubound(A,2)
...
end subroutine SUB
As a local routine is executed independently by all processors, it provides a high degree of parallelism. The HPF compiler does not have to generate any communication within the subroutine. Nevertheless, the user may call message passing routines within a local routine; in this case the user is responsible for the communication. Only local data can be accessed within the local routine. Local routines can also be used to provide an HPF interface to implementation-specific parallel libraries.
The processors execute a local routine completely asynchronously, so they may branch into different sections of the code. Therefore no global operations are possible within a local routine: in particular, data cannot be redistributed and global routines cannot be called.
ADAPTOR supports local routines as proposed in the HPF standard. The HPF_LOCAL_LIBRARY is available.
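As a sketch, a local routine can relate its local part to the global array via inquiry routines of the standard HPF_LOCAL_LIBRARY; GLOBAL_SIZE and MY_PROCESSOR are assumed here as defined in the HPF specification:

```fortran
extrinsic (HPF_LOCAL) subroutine INFO (A)
use HPF_LOCAL_LIBRARY
real, dimension (:,:) :: A
! size(A,2) is the number of local columns on this processor,
! GLOBAL_SIZE(A,2) the number of columns of the global array
print *, 'processor', MY_PROCESSOR(), 'holds', &
         size(A,2), 'of', GLOBAL_SIZE(A,2), 'columns'
end subroutine INFO
```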
A private variable is a variable that has an incarnation on every processor, where every processor can modify this variable for its own purposes. Replicated variables also have an incarnation on every processor, but there the compiler takes responsibility that all processors always have the same value: all incarnations of a replicated variable must be consistent. This is not the case for private variables; their use requires no communication at all.
Within the context of HPF, private variables will exist in the following situations:
!hpf$ independent, new (S)
do I = 1, N
   S = ...                    ! no consistency for S required
   X(I) = X(I) + X(I) * S
end do
The compiler does not have to make sure that after termination of the loop all processors have the same value of S.
pure function ITERATE (I, J)
   real X, Y     ! private variables X, Y
   integer K     ! private variable K
   X = I * 0.01
   Y = J * 0.02
   K = 3
   ...
end function ITERATE
Computations that work only on private variables do not imply any communication; they are always local.