Many users complain about the overhead introduced by the HPF directives and by the necessary modifications of the original program. Compiled HPF programs often run much slower on a single node than the original Fortran program.
This overhead is of course critical for the acceptance of HPF. Within the ADAPTOR project we paid much attention to this problem and analysed the overhead very carefully. The typical sources of overhead are described in the following sections.
At this point it should be noted that ADAPTOR also allows the compilation of an HPF program for a single node (gmdhpf -1 <program.hpf>). This version of the program contains no communication at all. It can be used to measure the overhead that comes from array operations and other code modifications; any overhead observed here will also appear when the program is compiled to an SPMD program. Tuning the program at this level is simpler and already eliminates many performance problems.
Most Fortran programmers know that on nearly all machines it is best to let the innermost loop run over the first array index, because Fortran stores arrays in column-major order.
      real, dimension (N,N) :: A, B
      integer I, J
      do J = 1, N
         do I = 1, N
            A(I,J) = A(I,J) + B(I,J)
         end do
      end do
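The reason lies in the column-major storage: element A(I,J) of an N x N array sits at linear offset (I-1) + (J-1)*N, so running I in the innermost loop touches consecutive memory locations, while running J innermost jumps by stride N. A small Python sketch of these offsets (illustrative only, using the standard column-major formula):

```python
def column_major_offset(i, j, n):
    """Linear offset of A(i,j) in an n x n Fortran array
    (1-based indices, column-major storage)."""
    return (i - 1) + (j - 1) * n

n = 4
# innermost loop over the first index I: consecutive offsets (stride 1)
inner_i = [column_major_offset(i, 1, n) for i in range(1, n + 1)]
# innermost loop over the second index J: offsets with stride n
inner_j = [column_major_offset(1, j, n) for j in range(1, n + 1)]
print(inner_i)  # [0, 1, 2, 3]
print(inner_j)  # [0, 4, 8, 12]
```

The stride-1 pattern of the first variant is what makes the loop order of the previous example cache-friendly.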
ADAPTOR takes care of this for array operations.
      A = A + B
! will be translated to
      do J = 1, N
         do I = 1, N
            A(I,J) = A(I,J) + B(I,J)
         end do
      end do
But ADAPTOR does not change the order of nested loops in FORALL statements. So the following loop results in poor performance:
      forall (I=1:N, J=1:N) A(I,J) = A(I,J) + B(I,J)
! will be translated to
      do I = 1, N
         do J = 1, N
            A(I,J) = A(I,J) + B(I,J)
         end do
      end do
ADAPTOR does not do any loop interchange, so users should be very careful about the loop ordering. Performance problems of this kind can already be identified when compiling HPF programs for a single node.
Due to the semantics of array operations and of the FORALL statement/construct, it might be necessary to create a temporary array to store intermediate results.
!hpf$ independent
      A(I,2:N) = A(K,1:N-1)
!hpf$ independent
      forall (J=2:N) A(I,J) = A(K,J-1)
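The need for the temporary follows from the semantics: the complete right-hand side is evaluated before any element is assigned, so earlier assignments must not overwrite values still needed on the right-hand side. A minimal Python sketch of this copy-in/copy-out behaviour for a one-dimensional shift (plain lists standing in for the Fortran array; the function names are illustrative):

```python
def forall_shift(a):
    """Semantics of  A(2:N) = A(1:N-1):
    evaluate the whole right-hand side into a temporary
    before assigning any element."""
    n = len(a)
    tmp = [a[j - 1] for j in range(1, n)]  # RHS evaluated first
    for j in range(1, n):
        a[j] = tmp[j - 1]
    return a

def naive_shift(a):
    """Element-by-element loop without a temporary:
    a[j-1] may already be overwritten, so the result differs."""
    for j in range(1, len(a)):
        a[j] = a[j - 1]
    return a

print(forall_shift([1, 2, 3, 4]))  # [1, 1, 2, 3]
print(naive_shift([1, 2, 3, 4]))   # [1, 1, 1, 1]
```

The two results differ, which is exactly why the compiler must insert a temporary unless it can prove (or is told via a directive) that no such overlap exists.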
Any serial loop over a distributed array gives very poor performance. The statements in the loop body are executed by all processors, which can imply a broadcast of every single element used in the statements.
      real, dimension (N,N) :: A, B
!hpf$ distribute (block,block) :: A, B
      ...
      do I = 2, N-1
         do J = 2, N-1
            A(I,J) = f(B(I,J),B(I,J-1),B(I,J+1),B(I-1,J),B(I+1,J))
         end do
      end do
ADAPTOR would take this code and generate the following (pseudo-) SPMD program:
      do I = 2, N-1
         do J = 2, N-1
            X1 = broadcast (B(I,J-1))
            X2 = broadcast (B(I,J+1))
            X3 = broadcast (B(I-1,J))
            X4 = broadcast (B(I+1,J))
            if (is_local(A(I,J))) then
               A(I,J) = f(B(I,J),X1,X2,X3,X4)
            end if
         end do
      end do
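The cost of this scheme can be made concrete with a trivial Python count (purely illustrative): every processor executes every iteration, and each of the four neighbour elements costs one broadcast, so the number of messages grows with the full iteration space and does not shrink as processors are added:

```python
def count_broadcasts(n):
    """Broadcasts issued by the serialized loop above:
    four neighbour elements are broadcast in every iteration
    of the (n-2) x (n-2) iteration space, regardless of the
    number of processors."""
    return 4 * (n - 2) * (n - 2)

print(count_broadcasts(10))  # 256
```

With the INDEPENDENT directive, by contrast, each processor only needs the shadow region of its own block, which requires a small, constant number of neighbour exchanges.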
The best way to increase the performance is to insert the INDEPENDENT directive or to switch on the automatic parallelization.
A similar problem arises for loops that implement a reduction. It is very important to specify the INDEPENDENT and REDUCTION directives for such a loop.
      real, dimension (N) :: A, B
      real :: S
!hpf$ distribute (block) :: A, B
      ...
!hpf$ independent, reduction (S)
      do I = 1, N
         S = S + A(I) * B(I)
      end do
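What the REDUCTION directive licenses can be sketched in Python (a hypothetical illustration, not ADAPTOR's actual generated code): each processor accumulates a partial sum over its local block of the distributed arrays, and the partial results are combined in a single global step afterwards:

```python
def distributed_dot(a, b, nprocs=4):
    """Sketch of a reduction over block-distributed arrays:
    each 'processor' sums its own block, then the partial
    sums are combined (the global reduction step)."""
    n = len(a)
    block = (n + nprocs - 1) // nprocs  # block size per processor
    partial = []
    for p in range(nprocs):
        lo, hi = p * block, min((p + 1) * block, n)
        partial.append(sum(a[i] * b[i] for i in range(lo, hi)))
    return sum(partial)  # combine the partial results

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 1.0, 1.0, 1.0, 1.0]
print(distributed_dot(a, b))  # 15.0
```

Without the directive, the compiler would have to preserve the serial order of the additions and could not split the sum this way.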
Nevertheless, there are some situations where ADAPTOR will serialize the loop and produce programs with very low performance:
!hpf$ independent
      do I = 1, N
         if (A(I) .gt. 0.0) then
            X = A(I)
            A(I) = X * (X - 1.0)
         end if
      end do
Here X is a shared scalar that is assigned inside the loop but not declared as NEW; since its value after the loop must be that of the last assigning iteration, ADAPTOR serializes the loop. In such situations, the use of the NEW directive will increase the performance dramatically. If a value of X is really needed after the loop, the program should use the REDUCTION directive:
      X = 0.0
!hpf$ independent, new(Y), reduction(X)
      do I = 1, N
         Y = A(I)
         if (Y .gt. 0.0) then
            X = max (X, Y)
            A(I) = Y * (Y - 1.0)
         end if
      end do
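The effect of NEW is privatization: every iteration works on its own copy of the scalar, so the iterations no longer conflict and may be executed in any order. A small sequential Python sketch of these semantics (names are illustrative):

```python
def update_with_private(a):
    """Each iteration uses its own private copy of the scalar
    (the effect of  new(Y)  in the INDEPENDENT directive),
    combined with a max-reduction on x (reduction(X))."""
    x = 0.0                   # reduction variable
    for i in range(len(a)):
        y = a[i]              # y is private to the iteration
        if y > 0.0:
            x = max(x, y)     # order-independent reduction
            a[i] = y * (y - 1.0)
    return a, x

a, x = update_with_private([2.0, -1.0, 3.0])
print(a)  # [2.0, -1.0, 6.0]
print(x)  # 3.0
```

Because y is private and max is associative and commutative, the same result is obtained no matter how the iterations are distributed over processors.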
!hpf$ distribute (block) :: A
!hpf$ independent, new(K)
      do I = 2, N
         K = I - 1
         A(I) = A(I) + A(K)
      end do
In such situations it is very helpful for ADAPTOR if the index K is replaced directly by the expression I-1 (forward substitution), so that the regular access pattern of A(K) becomes visible to the compiler. One of the next versions of ADAPTOR will do this automatically.
ADAPTOR might also generate communication that is quite inefficient. Here are the most important optimization issues:
Remapping takes place if there is a mismatch in the distribution. Reallocation becomes necessary if the layout has changed, e.g. if an actual argument does not have as many shadow areas as expected for the dummy argument.
The highest potential for unnecessary remappings and reallocations exists at subroutine boundaries.