next up previous contents index
Next: 4 Local Computations Up: ADAPTOR HPF Programmers Guide Previous: 2 Execution Model of   Contents   Index

3 Home of Computations and Work Distribution

Parallel execution is achieved by distributing the computations among the processors. This task, referred to as work distribution, is usually performed automatically by the compiler based on the user-specified data distribution. The default work distribution follows the owner computes rule: each processor executes only those assignments whose left-hand-side elements of a distributed array it owns.
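The owner computes rule can be pictured with a small simulation (Python used here purely for illustration; the function names and the block distribution are assumptions, not ADAPTOR code). Each "processor" scans all assignments but executes only those whose left-hand-side element it owns:

```python
# Sketch of the owner computes rule for A(i) = B(i) + C(i)
# under a block distribution (illustration only; names are hypothetical).

def block_owner(i, n, nprocs):
    """Return the processor that owns element i (0-based) of a
    block-distributed array of length n."""
    block = (n + nprocs - 1) // nprocs   # ceiling division: block size
    return i // block

def owner_computes(a, b, c, nprocs):
    """Each processor executes only the assignments it owns."""
    n = len(a)
    for p in range(nprocs):              # conceptually runs in parallel
        for i in range(n):
            if block_owner(i, n, nprocs) == p:
                a[i] = b[i] + c[i]
    return a

A = [0] * 8
B = list(range(8))
C = [1] * 8
print(owner_computes(A, B, C, nprocs=4))   # each processor updates its 2 owned elements
```

In a real HPF compilation the outer loop over processors disappears: every processor runs the same program and evaluates only its own ownership test.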

Alternatively, the work distribution can be controlled explicitly by the user by means of the ON directive and its HOME clause.

Since the mapping of the data determines both the work distribution of the computations and the necessary communication, it is very important to pay attention to this mapping first.

Note: ADAPTOR can generate an intermediate file that contains all information about the home of computations chosen by the HPF compiler.

3.1 Importance of Work Distribution

Two issues are the most important factors for the work distribution:

The home of computations should be chosen in such a way that high data locality is achieved.

3.2 Default Work Distribution in ADAPTOR

The main criteria for the load distribution are:

For optimization reasons there may be exceptions to these rules.


3.3 The ON Directive

The ON directive allows the user to control explicitly the distribution of computations among the processors of a parallel machine.

!hpf$ on home (A(I))
      A(I) = B(I) + C(I)

    !hpf$ processors PROCS(4)
          real, dimension (N) :: A
    !hpf$ distribute A(block) onto PROCS(3:4)
          ...
    !hpf$ on (PROCS(1:2))
            call SUB1()
    !hpf$ on home (A) begin
            call SUB2()
            call SUB3()
    !hpf$ end on

In the HOME clause, the user can specify a processor array, a processor subset, an array or template, or a section of an array or template.

The ON directive restricts the active processor set for a computation to the processors named in the home, or to the processors that own at least one element of the specified array or template. Note that the ON directive only advises the compiler to use the corresponding processors for the ON statement or block. In most situations, ADAPTOR accepts the user's home specification, but it informs the user whenever it overrides this advice.
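The active set selected by an ON HOME clause can be sketched as follows (an illustrative Python helper with assumed names, not the compiler's implementation): for a block-distributed home, the active processors are exactly those that own at least one element of the named section.

```python
def block_owner(i, n, nprocs):
    """Owner of element i (0-based) under a block distribution of length n."""
    block = (n + nprocs - 1) // nprocs
    return i // block

def active_processors(section, n, nprocs):
    """Processors owning at least one element of the given index section.
    'section' is an iterable of 0-based indices, e.g. range(0, 4)."""
    return sorted({block_owner(i, n, nprocs) for i in section})

# ON HOME (A(1:4)) for A of length 16 distributed blockwise onto 4 processors:
print(active_processors(range(0, 4), n=16, nprocs=4))   # only processor 0 owns A(1:4)

# ON HOME (A) activates all owners:
print(active_processors(range(0, 16), n=16, nprocs=4))  # all four processors
```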

All HOME specifications proposed in the HPF 2.0 standard can be used within ADAPTOR. In contrast to the HPF 2.0 standard, ADAPTOR also allows vector subscripts for processor subsets. This gives more flexibility in mapping data to certain processors and in selecting the processors that execute a task.

    !hpf$ processors P (6)
          integer, dimension (4) :: IND = (/ 1, 3, 4, 6 /)
          real, dimension (N)    :: A1, A2
    !hpf$ distribute A1 (block) onto P
    !hpf$ distribute A2 (block) onto P(IND)
          ...
    !hpf$ on (P(IND))
             call TASK (A2, N)
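The effect of the vector subscript P(IND) above can be viewed as an indirection table: element i of A2 lands on the processor whose number is IND applied to the block owner of i. A small sketch (hypothetical Python with 0-based indices, so IND = (/ 1, 3, 4, 6 /) becomes [0, 2, 3, 5]):

```python
def block_owner(i, n, nprocs):
    """Owner position within the subset, under a block distribution."""
    block = (n + nprocs - 1) // nprocs
    return i // block

def owner_via_subset(i, n, ind):
    """Owner of element i when a length-n array is block-distributed
    onto the processor subset selected by the vector subscript 'ind'."""
    return ind[block_owner(i, n, len(ind))]

IND = [0, 2, 3, 5]        # 0-based version of (/ 1, 3, 4, 6 /)
N = 8
print([owner_via_subset(i, N, IND) for i in range(N)])
# blocks of 2 land on processors 0, 2, 3, 5; processors 1 and 4 hold no data
```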

3.4 Active Processors

An active processor is one that executes an HPF statement or block. Usually, an HPF program begins execution with all processors active, but execution can be restricted to a subset of processors or to a single processor in the following cases:

3.5 Restrictions for the ON Directive

Certain statements cannot be executed by a given processor subset, e.g.:

3.6 Execution of Subroutines

Usually, every user subroutine and every user function is entered by all processors. These are the exceptions:

From the caller's standpoint, an invocation of a local procedure from a "global" HPF program has the same semantics as an invocation of a regular procedure.


3.7 Execution of I/O Statements

ADAPTOR does not yet support parallel I/O. Currently, I/O statements are translated in such a way that the I/O operations are executed by one dedicated processor. In the following, this processor is called the master processor.

Because only one processor executes I/O statements, some inconveniences should be observed.

Attention: Since the full array is allocated on a single node, memory problems may arise with large arrays. In such a case it is recommended to read single columns or rows of the matrix.
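The translation of an I/O statement on a distributed array can be pictured as follows (a simplified simulation with assumed helper names, not the code ADAPTOR generates): for reading, the master processor reads the full array and scatters the blocks to their owners; for writing, the blocks are gathered on the master first.

```python
def scatter_blocks(data, nprocs):
    """Master-I/O sketch: one process holds the full array (as read from
    file) and distributes contiguous blocks to the owning processes."""
    n = len(data)
    block = (n + nprocs - 1) // nprocs
    return [data[p * block:(p + 1) * block] for p in range(nprocs)]

def gather_blocks(blocks):
    """Inverse operation used for output: collect the blocks on the master."""
    return [x for b in blocks for x in b]

full = list(range(8))                 # array as read by the master processor
local = scatter_blocks(full, 4)       # local parts after the scatter
assert gather_blocks(local) == full   # a write gathers them back
print(local)                          # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

This also makes the memory caveat above concrete: `full` exists in its entirety on the master node before the scatter.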


3.8 Serial Procedures

A serial routine is called by only a single processor. The data of the actual arguments is mapped automatically to the processor executing the routine. Serial routines are not primarily intended for parallelism but to guarantee that certain operations (e.g. for steering external devices) are really executed only once.

      program WORK
      real, dimension (N,N) :: A
!hpf$ distribute A(*,block)
      ...
      call SUB (A(1,:),N)
      do J = 1, N
         call SUB (A(:,J),N)
      end do

      extrinsic (HPF_SERIAL) subroutine SUB (A, N)
      real, dimension (:) :: A
      integer, intent(in) :: N
      ...
      end subroutine SUB

In the example, the first call of the subroutine SUB is executed by a single processor; in ADAPTOR this is the first processor. The data of the distributed row is automatically gathered onto this processor and sent back after the call.

The second call can be executed directly by the processor that owns column J. This avoids any communication, and all other processors can skip the call. The execution of the loop thus results in parallel execution.
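The parallel execution of the loop can be sketched as follows (an illustrative Python simulation with assumed names): under the (*,block) distribution each column J resides entirely on one processor, so that processor executes the serial call while all others skip it.

```python
def block_owner(j, n, nprocs):
    """Owner of column j (0-based) under a block distribution of n columns."""
    block = (n + nprocs - 1) // nprocs
    return j // block

def run_loop(n, nprocs):
    """Record which processor executes the serial call for each column J."""
    executed_by = []
    for j in range(n):                     # do J = 1, N
        p = block_owner(j, n, nprocs)      # owner of column J
        executed_by.append(p)              # only p calls SUB; others skip it
    return executed_by

print(run_loop(8, 4))   # [0, 0, 1, 1, 2, 2, 3, 3]: columns handled by their owners
```

Since different iterations are executed by different processors without communication, the iterations effectively run concurrently.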


Thomas Brandes 2004-03-18