next up previous contents index
Next: 12 Compiling Subprograms for Up: ADAPTOR HPF Programmers Guide Previous: 10 Explicit and Implicit   Contents   Index

Subsections


11 Pitfalls

11.1 Identifying Overhead of HPF Programs

Many users complain about the overhead introduced by the HPF directives and by the necessary modifications of the original program. The compiled HPF programs run often on a single node much slower than the original Fortran program.

This overhead is of course very crucial for the acceptance of HPF. Within the ADAPTOR project, we paid much attention to this problem and analysed this overhead very carefully. Some of the typical overheads are described in the following sections.

At this point it should be noted that ADAPTOR also allows the compilation of an HPF program for a single node (gmdhpf -1 <program.hpf>). This version of the program contains no communication at all. It can be used to verify the overhead that comes from array operations and other code modifications. Any overhead here will also appear when the program is compiled to an SPMD program. Tuning of the program at this level is simpler and eliminates already a lot of performance problems.

11.2 Loop Ordering

Most Fortran programmers know that nearly on all machines it is better to run the innermost loop for the first array index.

       real, dimension (N,N) :: A, B
       integer I, J
       do J = 1, N
          do I = 1, N
             A(I,J) = A(I,J) + B(I,J)
          end do
       end do

ADAPTOR takes care of this for array operations.

       A = A + B

!      will be translated to

       do J = 1, N
          do I = 1, N
             A(I,J) = A(I,J) + B(I,J)
          end do
       end do

But ADAPTOR does not change the order of nested loops in FORALL statements. So the following loop results in poor performance:

       forall (I=1:N, J=1:N) A(I,J) = A(I,J) + B(I,J)

!      will be translated to

       do I = 1, N
          do J = 1, N
             A(I,J) = A(I,J) + B(I,J)
          end do
       end do

ADAPTOR does not any loop interchanging, so users should be very careful about the loop ordering. Performance problems of this kind can already be identified when compiling HPF programs for a single node.

11.3 Avoiding Temporary Arrays in Array Operations

Due to the semantic of array operations and of the FORALL statement/construct, it might be necessary to create a temporary array to store intermediate results.

!hpf$ independent
      A(I,2:N) = A(K,1:N-1)

!hpf$ independent
      forall (J=2:N) A(I,J) = A(K,J-1)

11.4 Serial Loops

Any serial loop over a distributed array gives very poor code performance. The innermost statements will be executed by all processors that can imply a broadcast of single elements that are used in the statements.

      real, dimension (N,N) :: A, B
!hpf$ distribute (block,block) :: A, B
      ...
      do I = 2, N-1
         do J = 2, N-1
            A(I,J) = f(B(I,J),B(I,J-1),B(I,J+1),B(I-1,J),B(I+1,J))
         end do
      end do

ADAPTOR would take this code and generate the following (pseudo-) SPMD program:

      do I = 2, N-1
         do J = 2, N-1
            X1 = broadcast (B(I,J-1))
            X2 = broadcast (B(I,J+1))
            X3 = broadcast (B(I-1,J))
            X4 = broadcast (B(I+1,J))
            if (is_local (A (I,J)) then
               A(I,J) = f(B(I,J),X1,X2,X3,X4)
            end if
         end do
      end do

The best way to increase the performance is to insert the INDEPENDENT directive or to switch on the automatic parallelization.

A similar problem is given for loops that implement a reduction. It is very important to specify the INDEPENDENT and the REDUCTION directive for such a loop.

      real, dimension (N) :: A, B
      real :: S
!hpf$ distribute (block) :: A, B
      ...
!hpf$ independent, reduction (S)
      do I = 1, N
         S = S + A(I) * B(I)
      end do

11.5 When Independent Loops Become Serial

But nevertheless, there are some situations where ADAPTOR will serialize the loop and produce programs with a very low performance:

11.6 Efficient Communication

ADAPTOR might generate communication that is not very efficient at all. Here are the most important optimization issues:

11.7 Avoiding Remappings and Reallocations

Remapping takes place if there is mismatch for the distribution. Reallocation becomes necessary if the layout has changed, e.g. if an actual argument has not as many shadow as expected for the dummy argument.

The highest potential for unnecessary remappings and reallocations exists at subroutine boundaries.


next up previous contents index
Next: 12 Compiling Subprograms for Up: ADAPTOR HPF Programmers Guide Previous: 10 Explicit and Implicit   Contents   Index
Thomas Brandes 2004-03-18