
Subsections

8.1 Counters for Communication
8.2 Communication Statistics
8.3 Tracing of Communication

8 Profiling of Communication

Communication is an important issue for the optimization of parallel programs. ADAPTOR supports profiling of communication for parallel HPF programs compiled for distributed memory architectures by ADAPTOR itself.

Note: ADAPTOR does not support profiling of communication for MPI programs; this can be done by other tools such as VAMPIR. Since ADAPTOR generates MPI programs from data-parallel HPF programs, such tools for profiling MPI communication can also be applied to the generated MPI programs.

In the following example program, the distributed array A is assigned to the array B, which is replicated (the default for array B). This replication requires communication, as every processor needs the values of all non-local data.

      program TEST
      integer, parameter :: N = 100
      integer, dimension (N) :: A, B
!hpf$ distribute (block) :: A
      B = A
      end program TEST
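
For comparison, the following minimal variant (an illustration only, not part of the original example) distributes B block-wise as well; with identical distributions the assignment B = A involves only local data and should not require any replication communication:

      program TEST_LOCAL
      integer, parameter :: N = 100
      integer, dimension (N) :: A, B
!hpf$ distribute (block) :: A, B
!     both arrays use the same block distribution, so B = A is local
      B = A
      end program TEST_LOCAL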

The compilation does not need any special flag to enable profiling of communication.

    adaptor -hpf_mpi -o example example.hpf


8.1 Counters for Communication

Currently, ADAPTOR supports the following counters within its HPF runtime system:

SEND_CALLS - number of send operations of a processor
RECV_CALLS - number of receive operations of a processor
SEND_BYTES - number of bytes sent by a processor
RECV_BYTES - number of bytes received by a processor

These counters can be enabled to get information about the communication in the program.

    mpirun -np 4 example -pm=pm_config


Table 9: Counting communication.
proc   SEND_CALLS   RECV_CALLS   SEND_BYTES   RECV_BYTES
   1            3            3          300          300
   2            3            3          300          300
   3            3            3          300          300
   4            3            3          300          300
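
The numbers in Table 9 can be checked against the example: with N = 100 integers of 4 bytes distributed block-wise over np = 4 processors, each processor owns 25 elements, and the replication is carried out in np - 1 = 3 steps (see Section 8.2), each transferring one such block:

\[
  \mathrm{SEND\_BYTES} = \mathrm{RECV\_BYTES} = 3 \times 25 \times 4\ \mathrm{bytes} = 300\ \mathrm{bytes}
\]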


Table 10 shows a more complex example of counting communication for the individual regions of a program.


Table 10: Counting communication for regions (all values are net values of the region, i.e. exclusive of nested regions).

region        calls  walltime   SEND    SEND    SEND    RECV    RECV    RECV
                     [s]        CALLS   [MB]    [MB/s]  CALLS   [MB]    [MB/s]
HYDFLO            1     3.648       2   0.000    0.000   2305   0.018    0.005
INITIAL           1     0.037      52   0.861   23.464     25   0.387   10.535
GHOST5          101     0.286     121  10.635   37.244    101   8.877   31.088
SEEDRANF          1     0.000       0       -        -      0       -        -
MPPRANS           1     0.088       0       -        -      0       -        -
RANFK             1     0.000       0       -        -      0       -        -
IRANFEVEN         1     0.000       0       -        -      0       -        -
RANFKBINARY       1     0.000       0       -        -      0       -        -
IRANFODD         48     0.000       0       -        -      0       -        -
RANFATOK          1     0.000       0       -        -      0       -        -
RANFMODMULT  125075     0.329       0       -        -      0       -        -
MPPRANF         360     2.121       0       -        -      0       -        -
GHOST3           10     0.061      20   1.144   18.689     10   0.572    9.344
FLUX             20     4.175     140   3.516    0.842    140   3.516    0.842
HYDRO            20     0.559      30   2.637    4.718     30   2.637    4.718
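
The [MB/s] columns relate the transferred volume to the net walltime of the region; as a check, for the send traffic of the region GHOST5 (numbers taken from the table above):

\[
  \frac{10.635\ \mathrm{MB}}{0.286\ \mathrm{s}} \approx 37.2\ \mathrm{MB/s}
\]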


8.2 Communication Statistics

    mpirun -np 2 executable -comm

The flag -comm will generate a communication statistics file at the end of the program. In contrast to the communication counters (see Section 8.1), the communication statistics give more detailed information about how the processors interact with each other, but do not relate this information to the different subprograms (regions).

Running the example program with the flag -comm prints the following communication statistics at the end of the run:

    mpirun -np 2 example -comm

   Communication volume summary for program example (2 procs):
    Number of  Bytes sent by all procs:
        to       all         1         2
       all       400       200       200
         1       200         0       200
         2       200       200         0

The replication of the array A implies that processor 1 sends its local part to processor 2 and vice versa, since each processor needs the non-local data. The size of this non-local part is $200 = 50 \times 4$ bytes (50 integer elements of 4 bytes each).

For a run with four processors:

    export COMMSTAT=1
    mpirun -np 4 example -comm


Table 11: Communication statistics (bytes sent; rows: sending processor, columns: receiving processor).

from \ to     all       1       2       3       4
      all    1200     300     300     300     300
        1     300       0     300       0       0
        2     300       0       0     300       0
        3     300       0       0       0     300
        4     300     300       0       0       0


The communication statistics of the program running on four processors show that every processor sends $300 = 3 \times 25 \times 4$ bytes to its right neighbor. By looking at the more detailed information in the file example.commstat, one can see that the replication is actually done in three ($np-1$) steps.

Communication Summary for program example (4 procs):
 total number of sends by all procs:
     to    all      1      2      3      4
    all     12      3      3      3      3
      1      3      0      3      0      0
      2      3      0      0      3      0
      3      3      0      0      0      3
      4      3      3      0      0      0
...
Communication statistics of proc # 1:

 SEND  total     total       avg       min       max
  PID    num     Bytes     Bytes     Bytes     Bytes
  all      3       300       100       100       100
    1      0         0         0         0         0
    2      3       300       100       100       100
    3      0         0         0         0         0
    4      0         0         0         0         0
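
These per-processor numbers follow a simple pattern for the block-wise replication (a worked check, assuming 4-byte integers as in the example): in each of the $np-1$ steps a processor sends a block of $N/np$ elements, so

\[
  \text{bytes per send} = \frac{N}{np} \times 4 = 25 \times 4 = 100, \qquad
  \text{total bytes sent} = (np-1) \times 100 = 300 .
\]

For the run with two processors the same reasoning gives $1 \times 50 \times 4 = 200$ bytes, in agreement with the summary shown before.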

The communication statistics print a summary for the whole program and do not give any information related to the regions of the program. Nevertheless, it is possible to use the communication counters (see Section 8.1) to obtain information related to the regions.

8.3 Tracing of Communication

Vampir provides the possibility to visualize MPI communication. This can also be used for HPF programs compiled for distributed memory architectures using MPI.

    adaptor -vt -hpf_mpi -o example example.hpf

Important: The flag -vt should be set before the flag -hpf_mpi.

Even when the application is run without any profiling by the ADAPTOR runtime, the profiling with VAMPIR is still performed.

    mpirun -np 4 example

But enabling the tracing guarantees that regions of the source code can be identified in the VAMPIR visualization.

    mpirun -np 4 example -trace

In summary, if the (parallel) program is linked with -trace=vt and executed with -trace, a tracefile <exec>.stf will be generated that can afterwards be visualized with VAMPIR. In any case, the tracefile contains information about all profiled regions of the program (timestamps for every entry and exit of a profiled region as well as source handles). Furthermore, if performance monitoring is enabled (flag -pm at runtime), the tracefile will also contain the performance counter values, which can be visualized with VAMPIR.

VampirTrace allows its own sampling of performance counters. To use this functionality, the program should be linked with -trace=vtsample. VampirTrace then reads the (hardware) performance counters as specified in its configuration file (the environment variable VT_CONFIG is used to specify this configuration file). The performance counters are read at every trace event. As the flag -trace generates VampirTrace calls for every entry and exit of a region, performance counter values related to the regions of the program will be available. Using the ADAPTOR runtime flag -pm is still possible, but then performance counter values will appear twice. The difference between the two approaches is that the ADAPTOR runtime (flag -pm) reads the counters only at entry and exit of profiled regions, whereas VampirTrace (link flag -trace=vtsample) reads them itself at every trace event as specified via VT_CONFIG.

Currently, ADAPTOR supports performance counters for the communication volume (number of bytes sent and received for every process) and can give a communication summary at the end of the program (number of bytes exchanged between the different processors). But ADAPTOR does not generate any trace call for the communication.

Nevertheless, it is possible to use VampirTrace for tracing the MPI communication generated for HPF programs (MPI must have been selected as communication model for the distributed memory model by the ADAPTOR link flag -dm=mpi). This is possible because linking with the MPI library only has to be replaced by linking with the corresponding VampirTrace library. Therefore, compiling and linking a parallel HPF program with the following ADAPTOR command enables the generation of a VampirTrace file for the MPI communication (no ADAPTOR runtime flag has to be set).

adaptor -hpf_mpi -vt <program>.hpf

It is possible to combine the MPI visualization with the ADAPTOR supported trace capabilities for VampirTrace. The link step is as follows:

adaptor -hpf_mpi -trace=vt <program>.hpf

If now the runtime flag -trace is used, trace calls will not only be generated for the MPI calls (done automatically by VampirTrace) but also for the entering and leaving of regions (done by the ADAPTOR runtime system using the VampirTrace interface).


Thomas Brandes 2004-03-19