
1 Introduction

Profiling helps to understand the dynamic behavior of an application, to detect performance problems, and to find out where the application can be made faster. ADAPTOR [ADA04] supports profiling in two ways: it instruments Fortran programs at the source level, so that profile data can be related back to the source code, and it offers runtime support for gathering profiling information. Performance events can be counted and accumulated for regions of the source program, or trace files can be generated and visualized afterwards.

Figure 1 shows how ADAPTOR provides information about the call graph of a given program. The source-to-source translation (fadapt -call) on its own yields the static call graph (part a). Counting subroutine calls at runtime (hydflo -pm) shows how many times each subroutine has been called (part b). Furthermore, a full trace file provides detailed information about the call graph at runtime, which can be visualized with Vampir (part c).

HYDFLO
 . INITIAL
 .  . GHOST5
 . SEEDRANF
 .  . MPPRANS
 .  .  . RANFK
 .  .  .  . IRANFEVEN
 .  .  . RANFKBINARY
 .  .  .  . IRANFODD
 .  .  . RANFATOK
 .  .  .  . RANFMODMULT
 .  .  . RANFMODMULT (+)
 . MPPRANF
 . GHOST3
 . FLUX
 . GHOST5 (+)
 . HYDRO

a) Static Call Graph.

Region	Calls
`HYDFLO`	1
`INITIAL`	1
`GHOST5`	101
`SEEDRANF`	1
`MPPRANS`	1
`RANFK`	1
`IRANFEVEN`	1
`RANFKBINARY`	1
`IRANFODD`	48
`RANFATOK`	1
`RANFMODMULT`	125075
`MPPRANF`	360
`GHOST3`	10
`FLUX`	20
`HYDRO`	20

b) Counting Calls.

[Image: CALL.eps]

c) Visualization of Runtime Call Graph.
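The relationship between parts a) and b) of Figure 1 can be illustrated with a small Python sketch that turns the indented listing into caller/callee edges. It is not part of ADAPTOR; the function name `parse_call_graph` is hypothetical, and the sketch assumes that the nesting depth is given by the number of leading dots and that `(+)` marks a routine whose subtree has already been listed.

```python
# Static call graph listing from Figure 1a, copied verbatim.
LISTING = """\
HYDFLO
 . INITIAL
 .  . GHOST5
 . SEEDRANF
 .  . MPPRANS
 .  .  . RANFK
 .  .  .  . IRANFEVEN
 .  .  . RANFKBINARY
 .  .  .  . IRANFODD
 .  .  . RANFATOK
 .  .  .  . RANFMODMULT
 .  .  . RANFMODMULT (+)
 . MPPRANF
 . GHOST3
 . FLUX
 . GHOST5 (+)
 . HYDRO
"""


def parse_call_graph(listing):
    """Return (caller, callee) edges from a dot-indented call graph listing."""
    edges = []
    stack = []  # stack[d] holds the most recent routine seen at depth d
    for line in listing.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        depth = tokens.count(".")   # assumption: one dot per nesting level
        name = tokens[depth]        # first token after the dots; "(+)" is ignored
        del stack[depth:]           # drop routines that are deeper or at the same level
        if stack:
            edges.append((stack[-1], name))
        stack.append(name)
    return edges


edges = parse_call_graph(LISTING)
routines = {name for edge in edges for name in edge}
callers_of_ghost5 = sorted(caller for caller, callee in edges if callee == "GHOST5")
```

The set `routines` contains the same region names as the call count table in part b), and `callers_of_ghost5` shows that `GHOST5` is reached from two different call sites.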

Table 1 shows the result of event counting for an example program, capacity, that calls three subprograms. The instrumentation generates runtime calls for the regions (subprograms); at runtime, performance counter values are read and related back to the regions of the code. For the example program, the results show that the poor performance of the third subroutine is due to a high rate of Level-2 cache misses.

Region	wall_time (gross) [ms]	Total Instructions [M]	Load Instructions [M]	Level-2 Cache Misses [M]	Level-1 Cache Misses [M]
`CAPACITY`	2379.80	1241.591	769.713	25.890	344.337
`CASE1`	479.80	418.502	257.502	0.000	0.005
`CASE2`	479.10	411.320	256.189	0.000	96.257
`CASE3`	1420.10	410.024	256.010	25.890	248.073
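The conclusion drawn from Table 1 can be checked with a little arithmetic. The following Python sketch is only an illustration: the dictionary keys and the function name `derived_metrics` are made up here, and relating cache misses to load instructions is an assumption about which ratio is most informative.

```python
# Counter values from Table 1 (wall time in ms, all counts in millions).
TABLE1 = {
    "CAPACITY": {"wall_ms": 2379.80, "instr": 1241.591, "loads": 769.713,
                 "l2_miss": 25.890, "l1_miss": 344.337},
    "CASE1": {"wall_ms": 479.80, "instr": 418.502, "loads": 257.502,
              "l2_miss": 0.000, "l1_miss": 0.005},
    "CASE2": {"wall_ms": 479.10, "instr": 411.320, "loads": 256.189,
              "l2_miss": 0.000, "l1_miss": 96.257},
    "CASE3": {"wall_ms": 1420.10, "instr": 410.024, "loads": 256.010,
              "l2_miss": 25.890, "l1_miss": 248.073},
}


def derived_metrics(row):
    """Relate the raw counter values of one region to each other."""
    return {
        # Millions of instructions per millisecond equals 1e9 instructions per second.
        "ginstr_per_s": row["instr"] / row["wall_ms"],
        "l1_miss_per_load": row["l1_miss"] / row["loads"],
        "l2_miss_per_load": row["l2_miss"] / row["loads"],
    }


metrics = {region: derived_metrics(row) for region, row in TABLE1.items()}

# The three subprograms execute almost the same number of instructions,
# so the slowest one is the region with the lowest instruction rate.
cases = ["CASE1", "CASE2", "CASE3"]
slowest = min(cases, key=lambda region: metrics[region]["ginstr_per_s"])

for region in cases:
    m = metrics[region]
    print(f"{region}: {m['ginstr_per_s']:.3f} Ginstr/s, "
          f"L1 misses/load {m['l1_miss_per_load']:.4f}, "
          f"L2 misses/load {m['l2_miss_per_load']:.4f}")
```

`CASE2` already has many Level-1 misses but runs as fast as `CASE1`; only `CASE3`, with roughly one Level-2 miss per ten load instructions, is slower by about a factor of three.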

Figure 2 shows the result of tracing the floating point instructions within a Finite Element application running 350 iterations. The timeline makes it possible to see that performance is lower for the iterations between 150 and 220. Event counting, where values are accumulated for whole regions, does not reveal this problem.

**Figure 2:** Tracing of floating point instructions within a FEM application.
[Image: vampir_FEM_fp_ins.eps]
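Why accumulation hides such a phase can be shown with synthetic data. The numbers in the following Python sketch are invented for illustration and are not measurements from the FEM application; the function name `slow_iterations` and the 80% threshold are likewise assumptions.

```python
from statistics import median


def slow_iterations(rates, threshold=0.8):
    """Return the iterations whose rate is below threshold * median rate."""
    typical = median(rates)
    return [i for i, rate in enumerate(rates) if rate < threshold * typical]


# Hypothetical floating point rate per iteration (arbitrary units):
# 350 iterations, with a slower phase in iterations 150 to 219.
rates = [100.0] * 350
for i in range(150, 220):
    rates[i] = 60.0

# Event counting: one accumulated value for the whole region.
accumulated = sum(rates)

# Tracing: one value per iteration, so the slow phase can be located.
slow = slow_iterations(rates)
if slow:
    print(f"accumulated: {accumulated}, slow phase: {slow[0]}..{slow[-1]}")
```

The accumulated value is a single number that says nothing about when the work was slow, whereas the per-iteration values identify the affected range directly.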

Figure 3 shows the result of data profiling for a region within an application program. During the execution of the region, the data addresses that cause a Level-2 cache miss are sampled. Afterwards, the sampled addresses are mapped back to the arrays of the region. The percentage distribution of the cache misses among the data structures is visualized as a pie chart.

**Figure 3:** Distribution of Level-2 cache misses among the data structures of a subroutine.
[Image: DATA.eps]
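The mapping step can be sketched in Python as a lookup of each sampled address in a sorted list of array address ranges. This is a minimal sketch and not ADAPTOR's implementation; the array names, base addresses, sizes, and sampled addresses below are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def attribute_misses(arrays, sampled_addresses):
    """Map sampled miss addresses to arrays and return the share per array in percent.

    arrays: list of (name, start_address, size_in_bytes), non-overlapping.
    """
    ranges = sorted((start, start + size, name) for name, start, size in arrays)
    starts = [r[0] for r in ranges]
    hits = Counter()
    for addr in sampled_addresses:
        i = bisect_right(starts, addr) - 1  # last array starting at or below addr
        if i >= 0 and addr < ranges[i][1]:
            hits[ranges[i][2]] += 1
        else:
            hits["<unknown>"] += 1          # address outside all known arrays
    total = sum(hits.values())
    return {name: 100.0 * count / total for name, count in hits.items()}


# Hypothetical arrays of a region and hypothetical sampled miss addresses.
arrays = [("A", 0x1000, 0x1000), ("B", 0x2000, 0x800), ("C", 0x3000, 0x1000)]
samples = [0x1000, 0x1008, 0x1010, 0x1800, 0x1FF8,  # inside A
           0x2000, 0x2010, 0x27F8,                  # inside B
           0x3000,                                  # inside C
           0x2900]                                  # gap between B and C
shares = attribute_misses(arrays, samples)
```

The resulting `shares` dictionary holds the kind of percentage distribution that is drawn as a pie chart in Figure 3.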

The following issues have been considered when designing the profiling support of the ADAPTOR system:

- Profiling is supported for all kinds of Fortran programs: serial programs as well as parallel programs using MPI, HPF, and OpenMP.
- HPF and OpenMP programs can be compiled directly by ADAPTOR (see [Bra03a] and [Bra03b]). Instrumentation can also be applied to OpenMP Fortran programs that are compiled by any other OpenMP compiler.
- Because of the instrumentation, recompilation is required. With the compiler driver adaptor, an existing Makefile can easily be changed to instrument, compile, and link a program for profiling (unfortunately not as easily as by adding a single flag like -g).
- The instrumented program runs in the same way as before, with only very small overhead. Even when profiling is enabled, the overhead remains very small, although it depends on what is actually measured at runtime.
- The runtime system supports event counting (accumulation for the regions) and tracing (generation of a trace file containing trace records for the events). Event counting generates a simple text file containing the counter values, while tracing requires a visualization tool like Vampir.
- The decision about which events are counted or traced is moved completely to runtime. Recompilation is not necessary to measure other performance events; it becomes necessary only if more or less instrumentation for regions or data structures is needed.
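The split between compile-time instrumentation and runtime event selection can be illustrated with a small Python sketch. It does not show ADAPTOR's runtime interface: the class `RegionProfiler`, the event names, and the environment variable `PROFILE_EVENTS` are all hypothetical, and software timers stand in for hardware performance counters.

```python
import os
import time
from collections import defaultdict
from contextlib import contextmanager

# Stand-ins for performance counters; a real runtime would read hardware
# counters here. The event names are assumptions made for this sketch.
EVENT_SOURCES = {
    "wall_time": time.perf_counter,
    "cpu_time": time.process_time,
}


class RegionProfiler:
    """Accumulates inclusive event values and call counts per region."""

    def __init__(self, events, sources=EVENT_SOURCES):
        self.readers = {name: sources[name] for name in events}
        self.calls = defaultdict(int)
        self.totals = defaultdict(lambda: defaultdict(float))

    @contextmanager
    def region(self, name):
        # Region entry: read every selected counter once.
        start = {event: read() for event, read in self.readers.items()}
        try:
            yield
        finally:
            # Region exit: add the difference to the totals of the region.
            self.calls[name] += 1
            for event, read in self.readers.items():
                self.totals[name][event] += read() - start[event]


def profiler_from_env(var="PROFILE_EVENTS"):
    """Select the events at runtime from a (hypothetical) environment variable."""
    selected = [e for e in os.environ.get(var, "").split(",") if e]
    return RegionProfiler(selected)


# Deterministic demonstration with a fake counter that advances on every read.
_ticks = iter(range(1000))
demo = RegionProfiler(["ticks"], sources={"ticks": lambda: next(_ticks)})
with demo.region("OUTER"):
    for _ in range(3):
        with demo.region("INNER"):
            pass
```

The `with` blocks play the role of the instrumentation and stay unchanged, while the list of events passed to the profiler can differ from run to run. Values accumulated for `OUTER` include those of the nested `INNER` calls.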

This guide describes the features of the ADAPTOR system for profiling and explains how to use it on the different architectures.


Thomas Brandes 2004-03-19