Next: 2 Related Work Up: ADAPTOR Profiling Guide Previous: Contents   Contents

1 Introduction

Profiling helps to understand the dynamic behavior of an application, to detect performance problems, and to find out where it can be made faster. ADAPTOR [ADA04] supports profiling by instrumenting Fortran programs, so that profile data can be related back to the source code, and by offering runtime support to gather profiling information. Performance events can be counted and accumulated for regions of the source program, or trace files can be generated for later visualization.

Figure 1 shows how ADAPTOR provides information about the call graph of a given program. The source-to-source translation (fadapt -call) on its own provides the static call graph (a). Counting subroutine calls at runtime (hydflo -pm) shows how many times each subroutine has been called (b). Furthermore, a full trace file provides detailed information about the call graph at runtime, which can be visualized with Vampir (c).

Figure 1: Counting of Subroutine Calls.
HYDFLO
 . INITIAL
 .  . GHOST5
 . SEEDRANF
 .  . MPPRANS
 .  .  . RANFK
 .  .  .  . IRANFEVEN
 .  .  . RANFKBINARY
 .  .  .  . IRANFODD
 .  .  . RANFATOK
 .  .  .  . RANFMODMULT
 .  .  . RANFMODMULT (+)
 . MPPRANF
 . GHOST3
 . FLUX
 . GHOST5 (+)
 . HYDRO
a) Static Call Graph.
Region Calls
HYDFLO 1
INITIAL 1
GHOST5 101
SEEDRANF 1
MPPRANS 1
RANFK 1
IRANFEVEN 1
RANFKBINARY 1
IRANFODD 48
RANFATOK 1
RANFMODMULT 125075
MPPRANF 360
GHOST3 10
FLUX 20
HYDRO 20
b) Counting Calls.
\includegraphics[height=53mm]{CALL.eps} c) Visualization of Runtime Call Graph.
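
The call counting of panel b) can be pictured with a minimal sketch. The Python below is not ADAPTOR's instrumentation, which operates on Fortran source; the `region` decorator, the `call_counts` table and the toy routines are hypothetical names chosen for illustration. It only shows the idea that entering an instrumented region increments a per-region counter.

```python
from collections import Counter
from functools import wraps

# Hypothetical per-region call counter (not ADAPTOR's runtime data structure).
call_counts = Counter()

def region(func):
    """Wrap a routine so that every call increments its region counter."""
    name = func.__name__.upper()

    @wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[name] += 1
        return func(*args, **kwargs)

    return wrapper

# Toy routines standing in for instrumented subprograms.
@region
def ghost3():
    pass

@region
def flux():
    ghost3()

@region
def main_program():
    for _ in range(3):
        flux()

main_program()
for name, calls in call_counts.items():
    print(f"{name:<14}{calls}")
```

In this toy program the main region is entered once, while FLUX and GHOST3 are each entered three times, which is the kind of per-region table shown in panel b).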

Table 1 shows the profiling result of event counting for an example program, capacity, that calls three subprograms. The instrumentation generates runtime calls for the regions (subprograms); at runtime, performance counter values are read and related back to these regions of the code. For the example program, the results show that the poor performance of the third subroutine is due to a high rate of Level-2 cache misses.


Table 1: Example of performance monitoring results for cache misses.
Region     wall_time (brutto)   Total Instructions   Load Instructions   Level2 Cache Misses   Level1 Cache Misses
           [ms]                 [M]                  [M]                 [M]                   [M]
CAPACITY   2379.80              1241.591             769.713             25.890                344.337
CASE1       479.80               418.502             257.502              0.000                  0.005
CASE2       479.10               411.320             256.189              0.000                 96.257
CASE3      1420.10               410.024             256.010             25.890                248.073
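
To make the comparison between the three subprograms concrete, the counter values of Table 1 can be turned into miss rates. The sketch below copies the numbers from the table; normalizing by load instructions is an assumption made here for illustration (stores can also cause misses), and the names `table1` and `miss_rates` are hypothetical.

```python
# Values copied from Table 1 (counts in millions).
table1 = {
    "CASE1": {"load": 257.502, "l2_miss": 0.000, "l1_miss": 0.005},
    "CASE2": {"load": 256.189, "l2_miss": 0.000, "l1_miss": 96.257},
    "CASE3": {"load": 256.010, "l2_miss": 25.890, "l1_miss": 248.073},
}

def miss_rates(row):
    """Return (L1, L2) misses per load instruction for one region.

    Assumption: load instructions are used as the denominator.
    """
    return row["l1_miss"] / row["load"], row["l2_miss"] / row["load"]

rates = {name: miss_rates(row) for name, row in table1.items()}
for name, (l1, l2) in rates.items():
    print(f"{name}: L1 {l1:.1%}, L2 {l2:.1%} misses per load")
```

Under this normalization, roughly one load in ten misses the Level-2 cache in CASE3, whereas CASE1 and CASE2 show no measurable Level-2 misses although all three regions execute a similar number of instructions.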


Figure 2 shows the profiling result of tracing the floating point instructions within a Finite Element application running 350 iterations. The timeline makes it possible to see that performance is lower for the iterations between 150 and 220. Event counting for regions, where values are accumulated, does not identify this problem.

Figure 2: Tracing of floating point instructions within a FEM application.
\includegraphics[height=84mm]{vampir_FEM_fp_ins.eps}
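
Why an accumulated counter hides such a phase can be seen in a small sketch. The per-iteration rates below are synthetic values in arbitrary units, not data from Figure 2; only the iteration count and the slow window are taken from the text, and the 80% threshold is an arbitrary choice for illustration.

```python
# Synthetic per-iteration rates (arbitrary units), NOT data from Figure 2.
ITERATIONS = 350
SLOW_START, SLOW_END = 150, 220  # slow window as described in the text

trace = [
    40.0 if SLOW_START <= i < SLOW_END else 100.0
    for i in range(ITERATIONS)
]

# Accumulated view: a single number for the whole region.
mean_rate = sum(trace) / len(trace)

# Trace view: iterations clearly below the overall mean
# (the 0.8 factor is an arbitrary illustrative threshold).
slow = [i for i, rate in enumerate(trace) if rate < 0.8 * mean_rate]

print(f"accumulated mean rate: {mean_rate:.1f}")
print(f"slow iterations: {slow[0]}..{slow[-1]}")
```

The accumulated value is a single average that gives no hint of when the slowdown occurs, while the per-iteration trace locates the affected window directly.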

Figure 3 shows the result of data profiling for a region within an application program. During the execution of the region, the data addresses that cause a Level-2 cache miss are sampled. Afterwards, the data addresses are mapped back to the arrays of the region. The percentage distribution of the cache misses among the data structures is visualized as a pie chart.

Figure 3: Distribution of Level-2 cache misses among the data structures of a subroutine.
\includegraphics[height=64mm]{DATA.eps}
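
The mapping step described above, from sampled addresses back to arrays, can be sketched as follows. This is not ADAPTOR's implementation; the array names, address ranges and sampled addresses are made up for illustration, and a sorted lookup with `bisect` is just one simple way to do the mapping.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical arrays of a region: (name, start address, size in bytes).
arrays = [("A", 0x1000, 0x800), ("B", 0x2000, 0x400), ("C", 0x3000, 0x1000)]
arrays.sort(key=lambda a: a[1])
starts = [start for _, start, _ in arrays]

def owner(address):
    """Return the name of the array containing address, or None."""
    idx = bisect_right(starts, address) - 1
    if idx >= 0:
        name, start, size = arrays[idx]
        if address < start + size:
            return name
    return None

# Made-up sampled miss addresses, for illustration only.
samples = [0x1004, 0x1100, 0x2010, 0x3000, 0x3FF8, 0x3400, 0x1808, 0x3ABC]

hits = Counter(owner(addr) for addr in samples)
total = sum(hits.values())
shares = {name: 100.0 * count / total for name, count in hits.items()}

for name, share in shares.items():
    label = name or "(unmapped)"
    print(f"{label:<12}{share:5.1f} %")
```

Samples that fall outside every known array are kept in a separate unmapped bucket so that the percentages still add up to 100.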

The following issues have been considered when designing the profile support of the ADAPTOR system:

This guide describes the profiling features of the ADAPTOR system and explains how to use them on the different architectures.


Thomas Brandes 2004-03-19