Profiling helps to understand the dynamic behavior of an application, to
detect performance problems, and to find out where it can be made faster.
ADAPTOR [ADA04] supports profiling in two ways: by instrumenting Fortran
programs, so that profile data can be related back to the source code, and
by providing runtime support to gather profiling information.
Performance events can be counted and accumulated for regions of the
source program, or trace files can be generated and visualized afterwards.
Figure 1 shows how ADAPTOR provides information about
the call graph of a given program. The source-to-source translation
(fadapt -call) on its own provides the static call graph
shown in part a).
Counting subroutine calls at runtime (hydflo -pm) shows how many times
each subroutine has been called, as in part b). Furthermore, a full trace
file provides detailed information about the call graph at runtime
(part c, visualized with Vampir).
Figure 1:
Counting of Subroutine Calls.
HYDFLO
. INITIAL
. . GHOST5
. SEEDRANF
. . MPPRANS
. . . RANFK
. . . . IRANFEVEN
. . . RANFKBINARY
. . . . IRANFODD
. . . RANFATOK
. . . . RANFMODMULT
. . . RANFMODMULT (+)
. MPPRANF
. GHOST3
. FLUX
. GHOST5 (+)
. HYDRO
a) Static Call Graph.
Region       |  Calls
-------------+-------
HYDFLO       |      1
INITIAL      |      1
GHOST5       |    101
SEEDRANF     |      1
MPPRANS      |      1
RANFK        |      1
IRANFEVEN    |      1
RANFKBINARY  |      1
IRANFODD     |     48
RANFATOK     |      1
RANFMODMULT  | 125075
MPPRANF      |    360
GHOST3       |     10
FLUX         |     20
HYDRO        |     20
b) Counting Calls.
c) Visualization of Runtime Call Graph.
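The call counting of Figure 1b can be illustrated with a minimal Python sketch. It is not ADAPTOR's implementation: the decorator `region` is a hypothetical stand-in for the instrumentation inserted at subprogram entry, only four of the routines are modeled, and the loop counts are assumptions chosen so that the totals match the figure.

```python
import functools
from collections import Counter

# Accumulated number of calls per region (subprogram).
call_counts = Counter()

def region(func):
    """Hypothetical instrumentation hook: count every entry into a routine."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__.upper()] += 1
        return func(*args, **kwargs)
    return wrapper

@region
def ghost5():
    pass

@region
def initial():
    ghost5()              # INITIAL calls GHOST5, as in the static call graph

@region
def flux():
    pass

@region
def hydflo(steps, exchanges_per_step):
    initial()
    for _ in range(steps):
        flux()
        for _ in range(exchanges_per_step):   # assumed loop structure
            ghost5()

hydflo(steps=20, exchanges_per_step=5)

for name, count in call_counts.items():
    print(f"{name:10s} {count:6d}")
```

The static call graph only says that HYDFLO and INITIAL may call GHOST5; the counts show how often that actually happened in one run.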
Table 1 shows the result of event counting for an example program
capacity that calls three subprograms. The instrumentation inserts
runtime calls at the boundaries of the regions (subprograms); at runtime,
performance counter values are read there and related back to the regions
of the code. In this example the three subprograms execute nearly the same
number of instructions, but the third one (CASE3) takes about three times
as long. The counters show that this is due to a high number of Level-2
cache misses.
Table 1: Example of performance monitoring results for cache misses.
         | wall_time | Total        | Load         | Level2       | Level1
Region   | (gross)   | Instructions | Instructions | Cache Misses | Cache Misses
         | ms        | M            | M            | M            | M
---------+-----------+--------------+--------------+--------------+-------------
CAPACITY |   2379.80 |     1241.591 |      769.713 |       25.890 |      344.337
CASE1    |    479.80 |      418.502 |      257.502 |        0.000 |        0.005
CASE2    |    479.10 |      411.320 |      256.189 |        0.000 |       96.257
CASE3    |   1420.10 |      410.024 |      256.010 |       25.890 |      248.073
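Derived metrics make the comparison between the regions in Table 1 easier to read. The following sketch takes the counter values for the three subprograms from the table and computes time per instruction and cache misses per load instruction; the metric names are our own choice and not part of ADAPTOR's output.

```python
# Counter values copied from Table 1: wall time in ms, counters in millions.
# Columns: wall_ms, instructions, loads, L2 misses, L1 misses
TABLE1 = {
    "CASE1": (479.80, 418.502, 257.502, 0.000, 0.005),
    "CASE2": (479.10, 411.320, 256.189, 0.000, 96.257),
    "CASE3": (1420.10, 410.024, 256.010, 25.890, 248.073),
}

def derived_metrics(wall_ms, instr_m, loads_m, l2_miss_m, l1_miss_m):
    """Normalize raw counters so that regions can be compared directly."""
    return {
        # ms per million instructions equals ns per instruction
        "ns_per_instruction": wall_ms / instr_m,
        "l1_misses_per_load": l1_miss_m / loads_m,
        "l2_misses_per_load": l2_miss_m / loads_m,
    }

metrics = {name: derived_metrics(*row) for name, row in TABLE1.items()}
slowest = max(metrics, key=lambda name: metrics[name]["ns_per_instruction"])

for name, m in metrics.items():
    print(f"{name}: {m['ns_per_instruction']:.2f} ns/instr, "
          f"L1 {m['l1_misses_per_load']:.3f}, L2 {m['l2_misses_per_load']:.3f} "
          f"misses per load")
print("slowest region:", slowest)
```

CASE2 already misses the Level-1 cache frequently without being slower than CASE1, so the Level-2 misses are what sets CASE3 apart.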
Figure 2 shows the result of tracing the floating point instructions
within a Finite Element application running 350 iterations. The timeline
reveals that performance is lower for the iterations between 150 and 220.
Event counting, where values are accumulated over the whole region, cannot
identify this problem.
Figure 2: Tracing of floating point instructions within a FEM application.
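The difference between accumulated counting and tracing can be sketched as follows. The per-iteration rates are invented for illustration (they are not the measurements behind Figure 2); only the iteration count and the slow interval are taken from the text.

```python
from statistics import median

ITERATIONS = 350

# Hypothetical floating point rate per iteration (arbitrary units).
# The dip between iterations 150 and 220 mimics the effect described above.
rates = [60.0 if 150 <= i <= 220 else 100.0 for i in range(1, ITERATIONS + 1)]

# Event counting: only one accumulated value per region survives.
total = sum(rates)
mean_rate = total / ITERATIONS

# Tracing: every iteration is kept, so slow phases can be located.
typical = median(rates)
slow = [i for i, r in enumerate(rates, start=1) if r < 0.8 * typical]

print(f"accumulated: total={total:.0f}, mean rate={mean_rate:.1f}")
print(f"trace: slow iterations {slow[0]}..{slow[-1]} ({len(slow)} of {ITERATIONS})")
```

The accumulated mean only looks slightly lower than the typical rate, whereas the trace shows exactly where the loss occurs.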
Figure 3 shows the result of data profiling for a region
within an application program. During the execution of the region, the
data addresses that cause a Level-2 cache miss are sampled. Afterwards the
sampled addresses are mapped back to the arrays of the region. The
percentage distribution of the cache misses among the data structures is
visualized as a pie chart.
Figure 3: Distribution of Level-2 cache misses among the data structures of a subroutine.
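Mapping sampled addresses back to arrays amounts to a range lookup. The sketch below assumes a hypothetical memory layout (array names, start addresses and sizes are invented) and a handful of made-up sample addresses; it shows one possible way to do the mapping, not how ADAPTOR does it internally.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical arrays of the region: (name, start address, size in bytes).
arrays = sorted(
    [("A", 0x1000, 0x800), ("B", 0x2000, 0x800), ("C", 0x3000, 0x400)],
    key=lambda a: a[1],
)
starts = [start for _, start, _ in arrays]

def owner(addr):
    """Return the name of the array containing addr, or '<other>'."""
    i = bisect_right(starts, addr) - 1     # last array starting at or before addr
    if i >= 0:
        name, start, size = arrays[i]
        if addr < start + size:
            return name
    return "<other>"

# Made-up sampled addresses that caused a Level-2 cache miss.
samples = [0x1008, 0x1100, 0x2010, 0x1200, 0x3008, 0x1400, 0x2800, 0x17F8]

misses = Counter(owner(addr) for addr in samples)
shares = {name: 100.0 * n / len(samples) for name, n in misses.items()}

for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {pct:5.1f} %")
```

The resulting percentages are the values a pie chart like the one in Figure 3 would display.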
The following issues have been considered when designing the profiling
support of the ADAPTOR system:
- Profiling is supported for all kinds of Fortran programs; this includes
serial programs as well as parallel programs using MPI, HPF and OpenMP.
- HPF and OpenMP programs can be compiled directly by ADAPTOR
(see [Bra03a] and [Bra03b]). However, instrumentation can also be
applied to OpenMP Fortran programs that are compiled by any other
OpenMP compiler.
- Due to the instrumentation, recompilation is required.
With the compiler driver adaptor, the Makefile of an existing program
can easily be changed to instrument, compile and link the program for
profiling (unfortunately not as easily as with a single flag like -g).
- The instrumented program runs in the same way as before, with only
very small overhead. Even when profiling is enabled, the overhead
remains small, although it depends on what is actually measured at
runtime.
- The runtime system supports event counting (accumulation for the regions) and
tracing (generation of a trace file containing trace records for the events).
Event counting generates a simple text file containing the counter values,
while tracing requires a visualization tool like Vampir.
- The decision about which events are counted or traced is deferred entirely
to runtime. Recompilation is not necessary when other performance events
are to be measured. It only becomes necessary when more or less
instrumentation for regions or data structures is needed.
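The separation between compile-time instrumentation and runtime event selection can be sketched like this. Everything here is an illustrative assumption: the class `Profiler`, the event names, and the environment variable `PROFILE_EVENTS` are hypothetical and do not correspond to ADAPTOR's actual interface, and Python timers stand in for hardware performance counters.

```python
import os
import time
from collections import defaultdict

# Hypothetical event sources; real hardware counters would be read instead.
EVENT_SOURCES = {
    "wall_time": time.perf_counter,
    "cpu_time": time.process_time,
}

class Profiler:
    """Accumulates the selected events per region (event counting)."""

    def __init__(self, event_names):
        self.events = {name: EVENT_SOURCES[name] for name in event_names}
        self.totals = defaultdict(lambda: defaultdict(float))
        self._stack = []

    def enter(self, region):
        # Read all selected events at region entry.
        start = {name: read() for name, read in self.events.items()}
        self._stack.append((region, start))

    def leave(self, region):
        name, start = self._stack.pop()
        assert name == region, "region enter/leave calls must be nested"
        for event, read in self.events.items():
            self.totals[region][event] += read() - start[event]

# Events are chosen when the program starts (hypothetical variable name).
selected = os.environ.get("PROFILE_EVENTS", "wall_time").split(",")
profiler = Profiler(selected)

def compute():
    profiler.enter("COMPUTE")      # fixed instrumentation
    result = sum(i * i for i in range(10000))
    profiler.leave("COMPUTE")      # fixed instrumentation
    return result

compute()
for region, values in profiler.totals.items():
    print(region, dict(values))
```

The calls to `enter` and `leave` never change; only the set of events read inside them does, which is why switching events does not require recompiling the instrumented code.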
This guide describes the features of the ADAPTOR system for profiling
and explains how to use it on the different architectures.
Thomas Brandes
2004-03-19