Many users complain about the overhead introduced by the HPF directives and by the necessary modifications of the original program. Compiled HPF programs often run much slower on a single node than the original Fortran program.
This overhead is of course critical for the acceptance of HPF. Within the ADAPTOR project we paid much attention to this problem and analysed the overhead very carefully. The typical sources of overhead are described in the following sections.
At this point it should be noted that ADAPTOR also allows the compilation of an HPF program for a single node (gmdhpf -1 <program.hpf>). This version of the program contains no communication at all. It can be used to measure the overhead that comes from array operations and other code modifications; any overhead observed here will also appear when the program is compiled to an SPMD program. Tuning the program at this level is simpler and already eliminates many performance problems.
Most Fortran programmers know that on nearly all machines it is best to let the innermost loop run over the first array index, because Fortran stores arrays in column-major order.
      real, dimension (N,N) :: A, B
      integer I, J
      do J = 1, N
         do I = 1, N
            A(I,J) = A(I,J) + B(I,J)
         end do
      end do
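The reason lies in the column-major storage: element A(I,J) of an N x N array sits at linear offset (I-1) + (J-1)*N, so running I in the innermost loop touches consecutive memory locations, while running J innermost jumps by stride N. A small Python sketch of these offsets (illustrative only, using the standard column-major formula):

```python
def column_major_offset(i, j, n):
    """Linear offset of A(i,j) in an n x n Fortran array
    (1-based indices, column-major storage)."""
    return (i - 1) + (j - 1) * n

n = 4
# innermost loop over the first index I: consecutive offsets (stride 1)
inner_i = [column_major_offset(i, 1, n) for i in range(1, n + 1)]
# innermost loop over the second index J: offsets with stride n
inner_j = [column_major_offset(1, j, n) for j in range(1, n + 1)]
print(inner_i)  # [0, 1, 2, 3]
print(inner_j)  # [0, 4, 8, 12]
```

The stride-1 pattern of the first variant is what makes the loop order of the previous example cache-friendly.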
ADAPTOR takes care of this for array operations.
      A = A + B
! will be translated to
      do J = 1, N
         do I = 1, N
            A(I,J) = A(I,J) + B(I,J)
         end do
      end do
But ADAPTOR does not change the order of nested loops in FORALL statements. So the following loop results in poor performance:
      forall (I=1:N, J=1:N) A(I,J) = A(I,J) + B(I,J)
! will be translated to
      do I = 1, N
         do J = 1, N
            A(I,J) = A(I,J) + B(I,J)
         end do
      end do
ADAPTOR does not do any loop interchange, so users should be very careful about the loop ordering. Performance problems of this kind can already be identified when compiling HPF programs for a single node.
Due to the semantics of array operations and of the FORALL statement/construct, it might be necessary to create a temporary array to store intermediate results.
!hpf$ independent
      A(I,2:N) = A(K,1:N-1)
!hpf$ independent
      forall (J=2:N) A(I,J) = A(K,J-1)
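The need for the temporary follows from the semantics: the complete right-hand side is evaluated before any element is assigned, so earlier assignments must not overwrite values still needed on the right-hand side. A minimal Python sketch of this copy-in/copy-out behaviour for a one-dimensional shift (plain lists standing in for the Fortran array; the function names are illustrative):

```python
def forall_shift(a):
    """Semantics of  A(2:N) = A(1:N-1):
    evaluate the whole right-hand side into a temporary
    before assigning any element."""
    n = len(a)
    tmp = [a[j - 1] for j in range(1, n)]  # RHS evaluated first
    for j in range(1, n):
        a[j] = tmp[j - 1]
    return a

def naive_shift(a):
    """Element-by-element loop without a temporary:
    a[j-1] may already be overwritten, so the result differs."""
    for j in range(1, len(a)):
        a[j] = a[j - 1]
    return a

print(forall_shift([1, 2, 3, 4]))  # [1, 1, 2, 3]
print(naive_shift([1, 2, 3, 4]))   # [1, 1, 1, 1]
```

The two results differ, which is exactly why the compiler must insert a temporary unless it can prove (or is told via a directive) that no such overlap exists.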
Any serial loop over a distributed array gives very poor performance. The statements in the loop body are executed by all processors, which can imply a broadcast of every single element used in the statements.
      real, dimension (N,N) :: A, B
!hpf$ distribute (block,block) :: A, B
      ...
      do I = 2, N-1
         do J = 2, N-1
            A(I,J) = f(B(I,J),B(I,J-1),B(I,J+1),B(I-1,J),B(I+1,J))
         end do
      end do
ADAPTOR would take this code and generate the following (pseudo-) SPMD program:
      do I = 2, N-1
         do J = 2, N-1
            X1 = broadcast (B(I,J-1))
            X2 = broadcast (B(I,J+1))
            X3 = broadcast (B(I-1,J))
            X4 = broadcast (B(I+1,J))
            if (is_local(A(I,J))) then
               A(I,J) = f(B(I,J),X1,X2,X3,X4)
            end if
         end do
      end do
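The cost of this scheme can be made concrete with a trivial Python count (purely illustrative): every processor executes every iteration, and each of the four neighbour elements costs one broadcast, so the number of messages grows with the full iteration space and does not shrink as processors are added:

```python
def count_broadcasts(n):
    """Broadcasts issued by the serialized loop above:
    four neighbour elements are broadcast in every iteration
    of the (n-2) x (n-2) iteration space, regardless of the
    number of processors."""
    return 4 * (n - 2) * (n - 2)

print(count_broadcasts(10))  # 256
```

With the INDEPENDENT directive, by contrast, each processor only needs the shadow region of its own block, which requires a small, constant number of neighbour exchanges.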
The best way to increase the performance is to insert the INDEPENDENT directive or to switch on the automatic parallelization.
A similar problem arises for loops that implement a reduction. It is very important to specify the INDEPENDENT and REDUCTION directives for such a loop.
      real, dimension (N) :: A, B
      real :: S
!hpf$ distribute (block) :: A, B
      ...
!hpf$ independent, reduction (S)
      do I = 1, N
         S = S + A(I) * B(I)
      end do
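What the REDUCTION directive licenses can be sketched in Python (a hypothetical illustration, not ADAPTOR's actual generated code): each processor accumulates a partial sum over its local block of the distributed arrays, and the partial results are combined in a single global step afterwards:

```python
def distributed_dot(a, b, nprocs=4):
    """Sketch of a reduction over block-distributed arrays:
    each 'processor' sums its own block, then the partial
    sums are combined (the global reduction step)."""
    n = len(a)
    block = (n + nprocs - 1) // nprocs  # block size per processor
    partial = []
    for p in range(nprocs):
        lo, hi = p * block, min((p + 1) * block, n)
        partial.append(sum(a[i] * b[i] for i in range(lo, hi)))
    return sum(partial)  # combine the partial results

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 1.0, 1.0, 1.0, 1.0]
print(distributed_dot(a, b))  # 15.0
```

Without the directive, the compiler would have to preserve the serial order of the additions and could not split the sum this way.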
Nevertheless, there are some situations where ADAPTOR will serialize the loop and produce programs with very low performance:
!hpf$ independent
      do I = 1, N
         if (A(I) .gt. 0.0) then
            X = A(I)
            A(I) = X * (X - 1.0)
         end if
      end do
Here X is a shared scalar that is assigned inside the loop but not declared as NEW; since its value after the loop must be that of the last assigning iteration, ADAPTOR serializes the loop. In such situations, the use of the NEW directive will increase the performance dramatically. If a value of X is really needed after the loop, the program should use the REDUCTION directive:
      X = 0.0
!hpf$ independent, new(Y), reduction(X)
      do I = 1, N
         Y = A(I)
         if (Y .gt. 0.0) then
            X = max (X, Y)
            A(I) = Y * (Y - 1.0)
         end if
      end do
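The effect of NEW is privatization: every iteration works on its own copy of the scalar, so the iterations no longer conflict and may be executed in any order. A small sequential Python sketch of these semantics (names are illustrative):

```python
def update_with_private(a):
    """Each iteration uses its own private copy of the scalar
    (the effect of  new(Y)  in the INDEPENDENT directive),
    combined with a max-reduction on x (reduction(X))."""
    x = 0.0                   # reduction variable
    for i in range(len(a)):
        y = a[i]              # y is private to the iteration
        if y > 0.0:
            x = max(x, y)     # order-independent reduction
            a[i] = y * (y - 1.0)
    return a, x

a, x = update_with_private([2.0, -1.0, 3.0])
print(a)  # [2.0, -1.0, 6.0]
print(x)  # 3.0
```

Because y is private and max is associative and commutative, the same result is obtained no matter how the iterations are distributed over processors.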
!hpf$ distribute (block) :: A
!hpf$ independent, new(K)
      do I = 2, N
         K = I - 1
         A(I) = A(I) + A(K)
      end do
In such situations it is very helpful for ADAPTOR if the index K is replaced directly by the expression I-1 (forward substitution), so that the regular access pattern of A(K) becomes visible to the compiler. One of the next versions of ADAPTOR will do this automatically.
ADAPTOR might also generate communication that is quite inefficient. Here are the most important optimization issues:
Remapping takes place if there is a mismatch in the distribution. Reallocation becomes necessary if the layout has changed, e.g. if an actual argument does not have as many shadow areas as expected for the dummy argument.
The highest potential for unnecessary remappings and reallocations exists at subroutine boundaries.