3 Explicitly Blocked Computations

The merging of blocked computations is the most important issue for an efficient execution of the HPF program, as otherwise effective cache utilization is not possible. Every synchronization point forces the emulation of the abstract processors to restart, so data already in the cache cannot be reused.

As shown in Section 2.5, the compiler has to be very conservative about the merging of blocked computations to guarantee correctness. Therefore a new directive has been introduced that enables blocking explicitly.


3.1 The PARALLEL Directive

The following example shows the general idea of the PARALLEL directive.

      integer, parameter :: N = 16 * 1024 * 1024
      double precision, dimension (N) :: V
      double precision                :: W
!hpf$ distribute V(block)

The following three statements are all independent computations and will be blocked.

      V = (/ (I, I=1,N) /)
      V = (V - 0.5) * W
      V = 4.0D0 / (1.0D0 + V * V)

Though in this situation the compiler might be able to verify that all computations can be merged, it is also possible to specify the blocked execution explicitly.

!hpf$ parallel
      V = (/ (I, I=1,N) /)
      V = (V - 0.5) * W
      V = 4.0D0 / (1.0D0 + V * V)
!hpf$ end parallel
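
A minimal sketch of the blocked execution that the directive permits (the names NP, LB and UB are illustrative, assuming NP abstract processors and N divisible by NP): every block of V passes through all three statements before the emulation moves on to the next block, so the block can be reused from the cache.

      integer :: P, LB, UB
!     Serial emulation of the abstract processors, one block at a time.
      do P = 0, NP - 1
         LB = P * (N / NP) + 1
         UB = (P + 1) * (N / NP)
!        All three statements are applied to this block before the
!        next block is touched, so V(LB:UB) stays in the cache.
         V(LB:UB) = (/ (I, I = LB, UB) /)
         V(LB:UB) = (V(LB:UB) - 0.5) * W
         V(LB:UB) = 4.0D0 / (1.0D0 + V(LB:UB) * V(LB:UB))
      end do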

The syntax of the PARALLEL directive is similar to that of the other HPF directives; it comes in two flavors: a single-statement form and a multi-statement form.

The single-statement form:

!hpf$ parallel
      <stmt>

The multi-statement form:

!hpf$ parallel begin
      <block>
!hpf$ end parallel

The directive indicates that the code in the PARALLEL region should be executed in parallel mode. It also asserts that the blocked execution is safe, i.e., all data dependences are respected if the code of one abstract processor is executed completely before that of the next.

Note: The PARALLEL directive is very similar to the OpenMP PARALLEL directive. In contrast to OpenMP, where the body of a parallel region is executed by all threads and worksharing must be specified explicitly, the worksharing in an HPF parallel region is given implicitly by the data distribution.
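
For comparison, a hypothetical OpenMP fragment (not taken from the paper) for one of the statements above: the worksharing has to be stated explicitly with a DO construct, whereas the HPF region derives it from the distribution of V.

!$omp parallel
!$omp do
      do I = 1, N
         V(I) = 4.0D0 / (1.0D0 + V(I) * V(I))
      end do
!$omp end do
!$omp end parallel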


3.2 Work Sharing in Parallel Regions

The distribution of work within a parallel region is done in the same way as for the DM execution model in HPF programs. This has no impact on distributed variables, but it has certain consequences for non-mapped variables.

      real, dimension (N) :: V, X
!hpf$ distribute (block)  :: V, X
      real                :: S
      ...
!hpf$ parallel 
      do I = 1, N-1
         S = V(I) / X(I)
         V(I+1:N) = V(I+1:N) * S
      end do
!hpf$ end parallel
      print *, S

In the DM execution model, every process has its own private incarnation of the variable S (replicated). The assignment to S is executed by all processors.

For the HPF blocking execution model, we keep the same strategy, although there is only a single incarnation of the variable S. This is no problem, as the serial emulation of the abstract processors guarantees that there are no race conditions and that the replicated assignment is safe.
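
A sketch of this emulation for the loop above, assuming illustrative block bounds LB(P) and UB(P) for the distribution of V and X: every emulated processor executes the scalar assignment, but only the owned part of the array section is updated, so the single incarnation of S is written and read strictly in sequence.

      do P = 1, NP
         do I = 1, N - 1
            S = V(I) / X(I)     ! replicated assignment on every processor
!           Owner-computes rule: update only the owned part of V(I+1:N);
!           the section is empty once I+1 exceeds UB(P), so values of S
!           computed for iterations beyond this block are never used.
            V(max(I+1, LB(P)):UB(P)) = V(max(I+1, LB(P)):UB(P)) * S
         end do
      end do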

This is not true, however, for the SM execution model. If the abstract processors are threads, the use of the shared variable S causes race conditions that might produce wrong results. In such a situation it is recommended to use the NEW clause for the variable S (see Section 3.3), which for the SM execution model has the same effect as the PRIVATE clause in OpenMP.


3.3 The NEW Directive

!hpf$ parallel, new (S)
      do I = 1, N-1
         S = V(I) / X(I)
         V(I+1:N) = V(I+1:N) * S
      end do
!hpf$ end parallel

The HPF NEW directive makes the following assertions: the variable S may be treated as a new object by each abstract processor, its value on entry to the parallel region is not needed, and its value on exit from the region is undefined.

In this sense, it has exactly the same semantics as the PRIVATE clause of OpenMP.


3.4 The ON Directive

The HPF ON directive allows the user to explicitly control the distribution of computations among the abstract processors. For the HPF blocking execution model, it can be used to avoid replicated computations.

      real, dimension (N) :: V, X
!hpf$ distribute (block)  :: V
      ...
!hpf$ parallel 
!hpf$ independent, on home (V(I))
      do I = 1, N
         X(I) = V(I)
      end do
!hpf$ end parallel

As the variable X is replicated, the default work sharing would imply that every processor assigns all the values to it. Since there is only a single incarnation of X, the better solution is to distribute the assignment among the abstract processors according to the ownership of the variable V.
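
A sketch of the effect, again with illustrative block bounds LB(P) and UB(P) for the distribution of V: the ON HOME clause restricts each emulated abstract processor to the iterations whose element V(I) it owns, so every X(I) is assigned exactly once instead of once per processor.

      do P = 1, NP
!        Only the iterations owned via V(I) are executed by processor P.
         do I = LB(P), UB(P)
            X(I) = V(I)
         end do
      end do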


3.5 Synchronization in Parallel Regions

For the HPF blocking execution model, synchronization becomes redundant due to the serial emulation of the abstract processors. This is not necessarily true for the SM execution model.

!hpf$ parallel, new (S)
      do I = 1, N-1
!hpf$ barrier
         S = V(I) / X(I)
         V(I+1:N) = V(I+1:N) * S
      end do
!hpf$ end parallel

For the shared memory execution model, barrier synchronization is needed before the assignment to the scalar variable S, as the abstract processors access non-local data V(I) and X(I); the barrier guarantees that the updates of V from the previous iteration are complete before V(I) is read.
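
A hypothetical OpenMP rendering of this region for the SM execution model (the chunk computation with NT, P, LB and UB is illustrative): PRIVATE(S) plays the role of NEW(S), and the barrier guarantees that all updates of V from iteration I-1 are finished on every thread before any thread reads V(I).

      use omp_lib
      integer :: P, NT, LB, UB
!$omp parallel private (S, I, P, NT, LB, UB)
      NT = omp_get_num_threads()
      P  = omp_get_thread_num()
      LB = P * N / NT + 1                 ! this thread's block of V
      UB = (P + 1) * N / NT
      do I = 1, N - 1
!$omp    barrier
!        All updates from iteration I-1 are complete; V(I) may be read.
         S = V(I) / X(I)
         V(max(I+1, LB):UB) = V(max(I+1, LB):UB) * S
      end do
!$omp end parallel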

