

F. Layout of Data

F..1 Overview of Data Layout

The data mapping defines how the data of the program is mapped to the abstract (physical) processors. This mapping defines ownership: every processor owns a certain number of elements, its local section.

In contrast to the data mapping, the data layout defines how and where memory is allocated for the data objects and how this data is accessible to other processors.

On distributed memory machines, the default strategy is that every processor allocates memory only for its local part (local layout). This local data is then accessible to other processors only via explicit message passing (distributed layout), but it might also be accessible to other processors via one-sided communication (remote layout). On shared memory machines, the default strategy is that the whole array is allocated once, as it has been defined (global layout), in a global address space that is shared by all processors (shared layout).

The SHADOW directive of HPF is another example of a directive that is not used for mapping the data. The SHADOW directive is mainly used with the local layout to allocate additional memory (shadow) that keeps non-local values on each processor. This additional memory avoids the introduction of temporary arrays for these non-local values.

ADAPTOR supports some compiler-specific layout directives that can be used to influence the layout.

Note: The layout of an alignee is never defined implicitly via the ALIGN directive.
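As a hedged sketch of this note (the arrays and mappings are only illustrative): aligning B with A does not give B the layout of A; if a non-default layout is wanted for B, it has to be specified separately:

      real, dimension (100) :: A, B
!hpf$ distribute A (block)
!hpf$ align B(i) with A(i)
!adp$ layout A (global)      ! layout directive applies to A only
!adp$ layout B (global)      ! B does not inherit the layout of A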

F..2 Default Layout

F..2.1 Default Layout on Distributed Memory Machines

By default, all serial data is replicated among the active processors. Only if the data has been declared as shared does it have a single incarnation, owned by the first processor (host) but shared by all processors.

By default, there is a local layout for all dimensions.

For the shadow, no fixed size is assumed; ADAPTOR makes its own choice about the shadow size depending on the use of the data. Shared arrays have no shadow. Non-shrunk dimensions have no shadow.

F..2.2 Default Layout on Shared Memory Machines

By default, all serial data is owned by the HOST thread and shared among all threads. All arrays without explicit replication have a shared layout.

F..3 Layout of Dimensions

F..3.1 Layout Directive

      real, dimension (M,N) :: A, B, C
!adp$ layout A (dim_layout, ..., dim_layout)
!adp$ layout (dim_layout, ..., dim_layout) :: B, C

      dim_layout ::=  serial | local | global | shared [global] | 
                      shared local | * |

A serial layout can only be specified for serial dimensions. A missing layout for a dimension stands for the default layout; the * stands for an underspecified dimension layout.
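A minimal sketch (the array and its mapping are made up for illustration): different dimension layouts can be combined within one directive, and a * entry leaves the layout of that dimension open:

      real, dimension (N, N) :: U
!hpf$ distribute U (block, block)
!adp$ layout U (local, *)    ! local layout for the first dimension,
                             ! underspecified layout for the second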

Consider the following example:

      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (cyclic(3)) onto P

The HPF mapping directives define for each element an owner (see also Figure 12).

Figure 12: HPF: mapping of a distributed dimension

F..3.2 Local Layout of a Dimension

The usual choice is a local layout: data objects for which the user has specified a distribution by means of the HPF mapping directives (DISTRIBUTE, ALIGN) are allocated in a partitioned manner, such that each abstract processor only allocates those parts of a data object that are owned by it. The part of a distributed array owned by a processor is referred to as its local section. Since the size of the local section of a distributed array on a particular processor usually cannot be determined at compile time, a dynamic allocation strategy has to be adopted. With such a strategy, each abstract processor participating in the execution of the parallel program computes the size of the local section at runtime and dynamically allocates a corresponding memory area. The SPMD program generated by an HPF compiler is parameterized accordingly.

Figure 13: Local layout of a distributed dimension.

      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (cyclic(3)) onto P
!adp$ layout A (local)

All accesses to mapped data objects have to be translated from the original global address space into the local address spaces of the processors participating in the execution of the program.
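The following is a minimal sketch of the address arithmetic involved in such a translation, not the code that ADAPTOR actually generates; the processor number and the index are chosen arbitrarily:

      program cyclic_layout_sketch
!     Rough sketch of the address arithmetic behind the local layout
!     of A(23) distributed cyclic(3) onto 3 abstract processors.
      integer, parameter :: N = 23, NP = 3, BS = 3
      integer :: p, i, b, owner, ilocal, nlocal

      p = 1                              ! this abstract processor (0 .. NP-1)
      nlocal = 0                         ! size of the local section of p
      do i = 1, N
         b = (i - 1) / BS                ! global block number of element i
         if (mod(b, NP) == p) nlocal = nlocal + 1
      end do
      print *, 'local section size of processor', p, ':', nlocal

      i      = 10                        ! translate global index i
      b      = (i - 1) / BS
      owner  = mod(b, NP)                ! abstract processor that owns A(i)
      ilocal = (b / NP) * BS + mod(i - 1, BS) + 1
      print *, 'A(', i, ') is local element', ilocal, 'on processor', owner
      end program cyclic_layout_sketch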

F..3.3 Global Layout of a Dimension

In contrast to the local layout, every abstract processor allocates the full array even if it will only be responsible for a certain part of the array later.

Figure 14: Global layout of a distributed dimension.

      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (cyclic(3)) onto P
!adp$ layout A (global)

One main advantage of this layout is that global addresses no longer have to be translated to local addresses. Furthermore, all abstract processors have enough memory to keep copies of non-local values when they are needed; this avoids the use of temporary buffers for communication. In other words, all abstract processors have a full shadow region for all non-local data. Nevertheless, this layout wastes memory: it is no longer possible to scale the problem size with the number of processors.

F..3.4 Shared Layout of a Dimension

A shared memory, or more precisely a global address space, provides the opportunity to allocate the array contiguously for all processors instead of in a partitioned way in the local memories. All processors can address the data in the same way and no communication is necessary. One way of allocation is to keep the Fortran layout, so that global addresses need not be translated to local addresses (see Figure 15). This is especially useful for applications that make assumptions about the sequence and storage layout of arrays. In certain situations, the global layout might decrease the cache performance due to false sharing. Then a reshaping of the data might be more efficient, where all data belonging to one processor is stored contiguously (see Figure 16). But the penalty of this layout is that global addresses must be translated to local addresses, similar to the local (shrunk) layout.

Figure 15: Shared global layout of a distributed dimension.

      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (cyclic(3)) onto P
!adp$ layout A (shared [global])

Figure 16: Shared local layout of a distributed dimension.

      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (cyclic(3)) onto P
!adp$ layout A (shared local)

F..4 Layout of Arrays

F..4.1 Shared Layout

Shared dimensions on their own do not guarantee that every element can be accessed by all processors.

!hpf$ processors P (2,2)
      real, dimension (N, N) :: A, B
!hpf$ distribute (block, block) :: A, B
!adp$ layout A (shared, shared)
!adp$ shared B

The shared directive guarantees that the array B is shared among all (active) processors. It implies a shared dimension layout for all distributed dimensions.

A shared array must not be replicated.

F..4.2 Global Layout

!hpf$ processors P (2,2)
      real, dimension (N, N) :: A, B
!adp$ layout A (global, global)  ! global layout on P
!adp$ global B                   ! global layout on all processors

The global directive guarantees that the array B will be allocated on all (active) processors with the full global size. It implies a global dimension layout for all dimensions.

F..5 Shadow Edges

Many scientific applications contain a lot of so-called stencil operations, where the update of one element needs only values from its direct neighborhood. Shadow edges (or overlap areas) that can hold these values guarantee that no temporary data has to be created for the corresponding array statements or parallel loops. Instead of moving data to a temporary, it is only necessary to update the overlap area.
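A hedged illustration (sizes, names and the distribution are only examples): a three-point relaxation reads only the direct neighbours of each element, so a shadow width of one on each side is sufficient and no temporary array is needed for the non-local values:

      real, dimension (1000) :: A, B
      integer :: i
!hpf$ distribute (block) :: A, B
!hpf$ shadow A (1:1)         ! one shadow element on each side

!     only direct neighbours of A are read; updating the shadow of A
!     replaces the movement of these values into a temporary array
!hpf$ independent
      do i = 2, 999
         B(i) = 0.5 * A(i) + 0.25 * (A(i-1) + A(i+1))
      end do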

Shadow areas are detected automatically. Nevertheless, the user still has the possibility to specify a certain size for the shadow area. The shadow area does not change the semantics of the program but can increase the performance dramatically.

F..5.1 SHADOW Directive

The shadow directive can be used to specify shadow edges explicitly:

!hpf$ shadow A (dim_shadow, dim_shadow, ..., dim_shadow)

     dim_shadow   ::=  shadow_width : shadow_width  |  shadow_width
     shadow_width ::=  [*] int_expr [=] |

Consider the following example:

      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (block) onto P

The HPF mapping directives define for each element an owner (see also Figure 17).

Figure 17: HPF: mapping of a block distributed dimension

F..5.2 Local Shadow of a Dimension

Figure 18: Local shadow for a block distributed dimension.

      real, dimension (23) :: A
!hpf$ processors P(3)
!hpf$ distribute A (block) onto P
!hpf$ shadow A (1:1)

Local shadows are only allocated for block distributed dimensions with a local layout. It should be noted that for a block distributed dimension the local shared layout is the same as the global shared layout. A global layout does not require the local shadow, as the memory is either already available or not necessary at all.

F..5.3 Global Shadow of a Dimension

A global shadow is allocated for every layout of the dimension. For the serial or shared layout, additional memory is also available. This memory can be used for circular shift operations.

      real, dimension (23) :: A
!adp$ replicated :: A
!hpf$ shadow A (*1:*1)
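A possible use, sketched under the assumptions above (the shift itself is only illustrative): with a global shadow of width one, the values wrapped around by a circular shift can be kept in the shadow elements instead of a separately allocated temporary:

      real, dimension (23) :: A, B
!adp$ replicated :: A, B
!hpf$ shadow A (*1:*1)

!     circular shift by one element; the global shadow of A provides
!     the memory for the wrapped-around boundary values
      B = cshift (A, 1)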

F..5.4 Exact Shadow

ADAPTOR might choose an appropriate shadow on its own. This can be avoided by specifying fixed shadow sizes.

      real, dimension (23) :: A, B
!adp$ replicated :: A, B
!hpf$ shadow A (0=:0=)
!hpf$ shadow B (*1=:*1=)


F..6 Sequence and Storage Association

!hpf$ sequence :: S

If an array is distributed, the user cannot make any assumptions about sequence or storage association for this array.

Arrays in common blocks can also be distributed like other arrays. If a common block contains a distributed array, sequence association will not be guaranteed. Such a common block is called nonsequential.

The following rules must apply for a nonsequential common block:

For replicated arrays, sequence association is guaranteed. Sequence association can be specified explicitly by the SEQUENCE directive for a COMMON block.

If sequence association is specified, all arrays in the common block have to be replicated; if no layout directive is specified, they will be replicated by default.

      common   /DATA/ A(100), B(100,N)
!hpf$ sequence /DATA/

If no other layout is given for A and B, they become replicated arrays.

