Next: 10 Explicit and Implicit Up: ADAPTOR HPF Programmers Guide Previous: 8 Unstructured Communication Contents Index

Subsections

9 Shadow Edges and Halos

9.1 Shadow Edges

Many scientific application contain a lot of so-called stencil operations where for the update of one element only values of the direct neighborhood is needed. Shadow edges (or overlap areas) that can contain these values guarantee that for the corresponding array statements or parallel loops it is not necessary to create temporary data. Instead of the movement of data to the temporary it is only necessary to update the overlap area.

The following HPF example program shows the benefit of shadow edges.

      real, dimension (N,N) :: F,DF
!hpf$ distribute (block, block) :: F
!hpf$ align DF with F
      ...
      DF (2:N,1:N-1) = F (2:N,2:N) + F (1:N-1,1:N-1)
! equivalent to
      forall (J=1:N-1,I=2:N) DF(I,J) = F(I,J+1) + F(I-1,J)

The array assignment requires data movement. Though the data movement contains structured communication, a straight-forward translation would generate the following code:

      real*4 F (1:N,1:N)
      real*4 DF (1:N,1:N)
      real*4, allocatable :: TMP1 (:,:), TMP2 (:,:)
!hpf$ distribute F (block, block)
!hpf$ align with F(I,J) :: DF(I,J), TMP1(I,J), TMP2(I,J)
      ...
      allocate (TMP1(1:N,1:N), TMP2(1:N,1:N)
      TMP1(2:N,1:N-1) = F(2:N,2:N)       ! movement
      TMP2(2:N,1:N-1) = F(1:N-1,1:N-1)   ! movement
      DF(2:N,1:N-1) = TMP1(2:N,1:N-1)+TMP2(2:N,1:N-1)  ! local
      deallocate (TMP2, TMP1)

This solution requires two temporary arrays. The two array movements imply some communication between the neighbored processors. Nevertheless a lot of local data will be copied as most neighbored data is on the processor itself. By utilising shadow edges the following code will be generated:

      real*4 F (1:N,1:N) shadow (1:0,0:1)
      real*4 DF (1:N,1:N)
!hpf$ distribute F (block, block)
!hpf$ align DF (I_1,I_2) with F(I_1,I_2)
      SHADOW Update F BY [0:0,0:1]
      SHADOW Update F BY [1:0,0:0]
      DF(2:N,1:N-1) = F(2:N,2:N)+F(1:N-1,1:N-1)  ! local

Though the updating of the shadow edges area requires the same amount of communication as the assignments to the temporaries, the benefit is due to the fact that no copying of local data is required. Furthermore, the need of memory for the overlap area is less than for the temporary arrays.

**Figure 5:** Data partitioning with shadow edges
$\includegraphics[height=60mm]{ovarray.eps}$

Shadow areas are detected automatically. Nevertheless, the user has still the possibility to specify a certain size for the shadow area. The shadow area will not change the semantic of the program but can increase the performance dramatically.

      real, dimension (N,N):: A, B
!hpf$ shadow :: B(1:1,1:1)

ADAPTOR will use overlap areas also for aligned arrays.

      subroutine RESTR (NPC, NPF, FC, UF, FF)
      integer :: NPC, NPF
      double precision, dimension (NPC, NPC) :: FC
      double precision, dimension (NPF, NPF) :: FF, UF
CHPF$ distribute FF (block, block)
CHPF$ align UF (I,J) with FF (I,J)
CHPF$ align FC (I,J) with FF (2*I-1,2*J-1)
      integer :: NC
      ...
      NC = NPC - 1
      FC (2 : NC, 2 : NC) =
     1      2 * (FF (3 : 2 * NC - 1 : 2, 3 : 2 * NC - 1 : 2) -
     2      4.0D0 * UF (3 : 2 * NC - 1 : 2, 3 : 2 * NC - 1 : 2) +
     3              UF (3 : 2 * NC - 1 : 2, 2 : 2 * NC - 2 : 2) +
     4              UF (2 : 2 * NC - 2 : 2, 3 : 2 * NC - 1 : 2) +
     5              UF (4 : 2 * NC : 2, 3 : 2 * NC - 1 : 2) +
     6              UF (3 : 2 * NC - 1 : 2, 4 : 2 * NC : 2))
      END

9.2 Automatic Creation of Shadow Edges

ADAPTOR creates automatically shadow areas for distributed arrays if it is appropriate. Especially for stencil operations, shadow areas increase the performance of the program dramatically.

      real, dimension (N,N) :: A, B
!hpf$ distribute (block, block) :: A, B
      ...
!hpf$ independent
      do I = 2, N-1
!hpf$ independent
         do J = 2, N-1
            A(I,J) = f(B(I,J),B(I,J-1),B(I,J+1),B(I-1,J),B(I+1,J))
         end do
      end do

So ADAPTOR will insert the following directive automatically:

!hpf$ shadow B(1:1,1:1)

Unfortunately, there are still some situations that cause performance problems.

If the array B is an use associated array and the corresponding module is not in the same compilation unit (separate compilation), ADAPTOR cannot insert a shadow area. A corresponding warning message should be considered and the SHADOW directive should be inserted by hand directly in the source file of the module.
If the array B is a dummy array, ADAPTOR inserts the shadow. But if the actual argument has not enough shadow, a copy-in and copy-out operation will be necessary that decreases the performance. ADAPTOR propagates the SHADOW directives from dummy arguments to actual argument, but can do this only within one compilation unit.

ADAPTOR gives warnings in corresponding situations and the user should act on them to get better performance.

9.3 Halos

A halo does not specify only the shadow edges but also its related communication schedule and the translation function from global indexes to local indexes into the shadow.

Let X be a distributed array that is indirectly addressed with an integer array L, also distributed among the processors. We call the array X the halo array and the array L the halo indexes.

      real, dimension (N)       :: X
      real, dimension (M)       :: Y
      integer, dimension (3,M)  :: L
!hpf$ distribute (block) :: X, Y
!hpf$ align L(*,I) with Y(I)
      ...
!hpf$ independent, on home (Y(J))
      do J = 1, M
         Y(J) = f(X(L(1,J)),X(L(2,J)),X(L(3,J))
      end do

Figure 6 gives the idea of a halo and how it is computed. Every processor looks at its local part of the array L and determines which of these indexes are non-local for the array X. The number of the non-local indexes gives the size of the shadow area that would be needed for the array X. For every non-local index there will be a corresponding index pointing to the shadow area. Every element in the shadow area corresponds to an element on another processor and the communication schedule reflects this relation

**Figure 6:** Shadow area and data exchange
$\includegraphics[height=80mm]{calc_shadow.eps}$

Next: 10 Explicit and Implicit Up: ADAPTOR HPF Programmers Guide Previous: 8 Unstructured Communication Contents Index

Thomas Brandes 2004-03-18