
2 The HPF Execution Model for Blocking

2.1 Serial and Parallel Execution

In contrast to the distributed memory execution model, an HPF execution model for blocking cannot work in SPMD mode. As soon as communication or synchronization is required, the emulation of one abstract processor has to be suspended and the other abstract processors have to be emulated until the needed data is available.

Therefore the blocking execution model has to be based on a master-slave model (see Figure 3). Generally speaking, the program is executed by one processor. Blocking, i.e. running over a loop of abstract processors, is only done for regions that have been identified for blocked execution. These are independent computations that do not require any communication.

Figure 3: HPF blocking model based on master-slave paradigm.
[figure: master_slave.eps]

In the master mode, all the work is done by one processor. In the blocking mode, work distribution is the same as in the usual HPF execution model: each processor works on its own data.

Compiling HPF for distributed memory machines implies that the compiler has to insert communication for all non-local accesses. For the blocking execution model this is not necessary, as all data is shared among all abstract processors. However, the compiler has to synchronize the computations so that dependences are respected. In the blocking mode, this synchronization is implicitly given at the end of a blocked execution.


2.2 Layout of Data

The data mapping defines how the data of the program is mapped to the abstract processors. This mapping defines ownership: every processor owns a certain number of elements, its local section. In contrast to the data mapping, the data layout defines how and where memory is allocated for the data objects and how this data is accessible to other processors.

On distributed memory machines, the default strategy is that every processor only allocates memory for its local parts (local layout). This local data is then accessible to other processors only via explicit message passing (distributed layout), or possibly via one-sided communication (remote layout). On shared memory machines, the default strategy is that the whole array is allocated once as it has been defined (global layout) in a global address space that is shared by all processors (shared layout).

A shared memory, or better a global address space, provides the opportunity to allocate an array contiguously for all processors instead of partitioning it across the local memories. All processors can address the data in the same way and no communication is necessary. One way of allocation is to keep the Fortran layout so that global addresses need not be translated to local addresses (see Figure 4). This is especially useful for applications that make assumptions about the sequence and storage layout of arrays. In certain situations, the global layout might decrease the cache performance due to false sharing. Then a reshaping of the data might be more efficient, where all data belonging to one processor is stored contiguously (see Figure 5). The penalty of this layout is that global addresses must be translated to local addresses, similar to the shrunk layout.

Figure 4: Shared global layout of a mapped array.
      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (cyclic(3)) onto P
!adp$ layout A (shared [global])
[figure: layout_gshared.eps]

Figure 5: Shared local layout of a mapped array.
      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (cyclic(3)) onto P
!adp$ layout A (shared local)
[figure: layout_lshared.eps]

For the blocking execution model, mapped arrays will always be allocated in a shared global layout. The shared local layout is not supported yet.
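For illustration, the following sketch (the helper routine and its name are hypothetical, not part of the compiler or runtime) shows the kind of global-to-local index translation that the shared local layout of Figure 5 would require, and that the shared global layout of Figure 4 avoids:

      subroutine global_to_local (I, K, NP, P, LI)
!     hypothetical helper: translate a global index into owner and
!     local index for a CYCLIC(K) distribution onto NP processors
      integer, intent(in)  :: I       ! global index (1-based)
      integer, intent(in)  :: K, NP   ! block size and number of processors
      integer, intent(out) :: P       ! owning processor (0-based)
      integer, intent(out) :: LI      ! local index on processor P (1-based)
      integer :: B                    ! global block number (0-based)
      B  = (I - 1) / K
      P  = mod (B, NP)
      LI = (B / NP) * K + mod (I - 1, K) + 1
      end subroutine global_to_local

For the example of Figure 5 (A(23), cyclic(3), 3 processors), global element 10 would map to local element 4 on processor 0.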


2.3 Blocking of Independent Computations

Let V be a one-dimensional array that is distributed among a set of abstract processors (the number of abstract processors can be specified at run time).

      integer, parameter :: N = 16 * 1024 * 1024
      double precision, dimension (N) :: V
!hpf$ distribute V(block)

Serial operations or serial loops that work on a distributed array are not considered for blocking; they are executed in the usual way. One can also think of this as only the first abstract processor executing the operation.

      do I = 1, N    ! serial execution
         V(I) = I    ! on master
      end do

Independent computations over distributed arrays, such as FORALL statements and INDEPENDENT loops, are considered for blocked execution.

Blocking implies a new loop that runs over all abstract processors:

      call DALIB_cb_start (NCB)    ! NCB: number of blocks (abstract processors)
      do ICB=1,NCB                 ! loop over all abstract processors
         call DALIB_cb_set (ICB)   ! emulate abstract processor ICB
         ! local bounds V_START1:V_STOP1 of V (descriptor V_DSP) in dimension 1
         call DALIB_array_my_slice (V_DSP,1,1,N,V_START1,V_STOP1)
         do I=V_START1,V_STOP1
            V(I) = I
         end do
      end do


2.4 Reductions

With the HPF REDUCTION directive it is possible to specify that a variable within independent computations is used as a reduction variable.

      GSUM = sum (V)
!hpf$ independent, reduction (GSUM)
      do I = 1, N
         GSUM = GSUM + V(I)
      end do

Reductions do not cause any problems for the blocked execution of computations. Due to the emulation mode, it is not necessary to generate a temporary private copy of the reduction variable.

      do ICB=1,NCB
         call DALIB_cb_set (ICB)
         call DALIB_array_my_slice (V_DSP,1,1,N,V_START1,V_STOP1)
         do I0=V_START1,V_STOP1
            GSUM = GSUM+V(I0)      ! GSUM is accumulated directly,
         end do                    ! no private copy per block is needed
      end do


2.5 Merging of Blocked Computations

Independent computations can be executed in parallel and therefore they can be executed in the blocking mode. Two independent computations must be blocked separately if a synchronization point is needed between them.

Figure 6: Synchronization point between independent computations.
[figure: synchronization.eps]

A synchronization point implies that the computation of all blocks in the first phase must be finished before the computations of the next phase can start. Regarding cache utilization, this is a bad situation, as the data of a processor might no longer be in the cache when its next computation starts.

Nevertheless, synchronization points are not really necessary in many situations, and in such cases the computations can be merged.

      real, dimension (N) :: A, B
      ...
      forall (I=1:N) A(I) = f(I)
      forall (I=1:N) B(I) = g(A(I))

Figure 7: Merging of Local Computations
[figure: blocking1.eps]

This example does not require any communication as every processor works on its local data. Therefore blocking of the whole region is possible.
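A merged blocked execution of both statements might look like the following sketch; it reuses the runtime calls shown in Section 2.3 (the descriptor and bounds variables A_DSP, A_START1, A_STOP1 are assumed to follow the naming pattern used there, and A and B are assumed to have the same mapping):

      call DALIB_cb_start (NCB)
      do ICB=1,NCB
         call DALIB_cb_set (ICB)
         call DALIB_array_my_slice (A_DSP,1,1,N,A_START1,A_STOP1)
         do I=A_START1,A_STOP1
            A(I) = f(I)            ! A(I) is computed and consumed
            B(I) = g(A(I))         ! while it is still in the cache
         end do
      end do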

      real, dimension (N) :: A, B
      ...
      forall (I=1:N) A(I) = f(I)
      forall (I=2:N) B(I) = g(A(I-1))

In this example every abstract processor needs data from its left neighbor. Communication (or synchronization) is necessary. Blocking of both computations is still possible if the execution order of the abstract processors respects the dependences.

Figure 8: Blocking with Non-Local Computations
[figure: blocking2.eps]
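If the abstract processors are emulated in ascending order, the elements of the left neighbor have already been computed when a block starts, so the two statements can still be merged; a hedged sketch along the lines of Section 2.3:

      call DALIB_cb_start (NCB)
      do ICB=1,NCB                  ! ascending order respects the
         call DALIB_cb_set (ICB)    ! dependence on the left neighbor
         call DALIB_array_my_slice (A_DSP,1,1,N,A_START1,A_STOP1)
         do I=A_START1,A_STOP1
            A(I) = f(I)
         end do
         do I=max(A_START1,2),A_STOP1
            B(I) = g(A(I-1))        ! A(I-1) of the left neighbor is
         end do                     ! already available
      end do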

The following example shows a situation where the two independent computations cannot be blocked.

      real, dimension (N) :: A, B
      ...
      forall (I=1:N)   A(I) = f(I)
      forall (I=2:N-1) B(I) = g(A(I-1),A(I+1))

Every abstract processor needs data from both its left and its right neighbor. Communication (or synchronization) is necessary, and blocking of the whole region is no longer possible.

Figure 9: Blocking is no longer possible.
[figure: blocking3.eps]
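In this case the two statements have to be blocked separately, so that the first forall is completely finished before the second one starts; a sketch of the resulting structure (again using the runtime calls of Section 2.3, with the details of the interface simplified):

      call DALIB_cb_start (NCB)
      do ICB=1,NCB                  ! first computation over all blocks
         call DALIB_cb_set (ICB)
         call DALIB_array_my_slice (A_DSP,1,1,N,A_START1,A_STOP1)
         do I=A_START1,A_STOP1
            A(I) = f(I)
         end do
      end do                        ! implicit synchronization point
      do ICB=1,NCB                  ! second computation over all blocks
         call DALIB_cb_set (ICB)
         call DALIB_array_my_slice (B_DSP,1,1,N,B_START1,B_STOP1)
         do I=max(B_START1,2),min(B_STOP1,N-1)
            B(I) = g(A(I-1),A(I+1))
         end do
      end do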


2.6 Remapping

HPF supports implicit remapping at subroutine boundaries and explicit remapping via HPF directives.

      real, dimension (N,N) :: A
!hpf$ distribute (block,*) :: A
      ...
      call SUB (A, N)
      ...
      subroutine SUB (A, N)
      real, dimension (N,N) :: A
!hpf$ distribute (*,block) :: A
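Explicit remapping uses the HPF DYNAMIC and REDISTRIBUTE directives; a minimal sketch (not taken from the example code above) of switching the distribution of A at run time:

      real, dimension (N,N) :: A
!hpf$ dynamic :: A
!hpf$ distribute (block,*) :: A
      ...
!hpf$ redistribute (*,block) :: A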

Remapping might be very useful as it allows switching between different blocking techniques. As the data layout of all arrays is globally shared, remapping does not imply any memory reorganization of the data.

Remapping must always be done in the master mode and never in the blocking mode.


Thomas Brandes 2004-03-18