6 Counting Performance Events For Regions

6.1 The Idea of Event Counting

Event counting generalizes the measurement of elapsed time. Instead of counting only clock ticks, it allows other performance events (e.g. cache misses, floating point instructions) to be counted, provided that the corresponding counters are enabled during the whole execution of the program. Certain events (e.g. ticks of the wall-time clock) are counted by the operating system; others (e.g. the number of calls of a region) are counted by the ADAPTOR runtime system. Most processors also allow processor-specific events to be counted by enabling hardware performance counters.

At runtime, the ADAPTOR runtime system reads the values of enabled counters at every entry and exit of a region and accumulates the values over the whole execution of the program.

A configuration file (e.g. pm_config) specifies the events that are counted and where the counter values are accumulated for the regions. Before the program starts, the runtime system also makes sure that the corresponding event counters are enabled.

6.2 Event Counting with ADAPTOR

Event counting is enabled by the following steps:

The program must be instrumented by fadapt so that runtime calls are generated whenever a region or subprogram is entered or left (see Section 3).
Performance counters provided by the ADAPTOR runtime itself (walltime ticks, communication counters) do not require any special link flags. If PAPI or PCL performance counters are to be used, the corresponding libraries and the ADAPTOR runtime interfaces must be linked (see Section 5.2).
```
    adaptor ... [-pm=papi | -pm=pcl | -pm=perfctr]
```
At runtime the (default) performance counters are read if the flag -pm is given or if the environment variable PM is set. At the end of the program a summary of the counter values for the different regions is printed.
```
    a.out -pm             ! measures default performance counters
```

The events to be counted during program execution can be specified in a configuration file:

    a.out -pm=file      ! measures performance counters specified in file
    export PM=file
    a.out

6.3 Brutto and Netto Counting

The runtime system supports brutto and netto counting for regions, i.e. inclusive and exclusive counting. Brutto counting accumulates the counter values from the start to the end of a region, including all events counted while other regions entered from it are active. Netto counting suspends the accumulation for a region as long as another region entered from it is active.

Table 6 shows an example of how the counter values are accumulated for brutto and netto counting. Brutto counting accumulates the number of events for all regions in the stack of called regions, while netto counting accumulates the number of events only for the region that is on top of the stack.

region calls	counter value	brutto counting for `CAPACITY`	netto counting for `CAPACITY`
start `CAPACITY`	1000	0	0
start `SUB1`	1200	200	200
end `SUB1`	1300	300	200
start `SUB2`	1500	500	400
end `SUB2`	1800	800	400
end `CAPACITY`	2100	1100	700
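
The accumulation in Table 6 can be reproduced with a small simulation. The following is a minimal Python sketch, not ADAPTOR's actual implementation: the class `RegionCounter` and its methods are hypothetical names chosen for illustration, and a single counter is assumed.

```python
from collections import defaultdict


class RegionCounter:
    """Accumulates one counter per region, inclusively (brutto) and exclusively (netto)."""

    def __init__(self):
        self.stack = []                 # names of the regions currently active
        self.last = None                # counter value read at the previous event
        self.brutto = defaultdict(int)
        self.netto = defaultdict(int)

    def _charge(self, value):
        # Attribute the events counted since the previous entry/exit event.
        if self.last is not None and self.stack:
            delta = value - self.last
            for name in set(self.stack):          # brutto: every region on the stack
                self.brutto[name] += delta
            self.netto[self.stack[-1]] += delta   # netto: only the top of the stack
        self.last = value

    def enter(self, name, value):
        self._charge(value)
        self.stack.append(name)

    def leave(self, name, value):
        self._charge(value)
        assert self.stack.pop() == name


# Entry/exit events and counter values taken from Table 6.
trace = [("enter", "CAPACITY", 1000), ("enter", "SUB1", 1200), ("leave", "SUB1", 1300),
         ("enter", "SUB2", 1500), ("leave", "SUB2", 1800), ("leave", "CAPACITY", 2100)]

rc = RegionCounter()
for kind, name, value in trace:
    getattr(rc, kind)(name, value)

print(rc.brutto["CAPACITY"], rc.netto["CAPACITY"])   # 1100 700
```

The final values for `CAPACITY` agree with the last row of the table; the sketch also yields 100 and 300 events for `SUB1` and `SUB2`, whose brutto and netto values coincide because they enter no further regions.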

6.4 Composed Counters

It is possible to specify a new counter whose value is composed of the values of two known counters.

new_counter = counter1 op counter2

The entry itself only defines a composed counter, but does not enable it. Enabling of the new counter is done in the same way as enabling any other counter.

The typical use of composed counters is to define ratios between other counters. For example, the number of cache misses does not give much information on its own, but related to the total number of load instructions, the resulting cache miss rate is a better indicator of a performance problem.

L1_MRATE  = L1_MISS / LD_INS
L2_MRATE  = L2_MISS / L2_READ
TLB_MRATE = DTLB_MISS / LD_INS

New counters can also be defined as the sum or the difference of two counter values.

L2_READ1 = L2_HIT + L2_MISS
BUS_PF   = BUS - BUS_NP

Composed counters can only be defined from two known counters, and composed counters cannot be used to define other composed counters. This has proven to be a rather serious restriction in the following cases:

The cache miss ratio is the number of cache misses (L2_MISS) divided by the number of cache accesses. If there is no counter for the cache accesses but only one for the cache hits (L2_HIT), the ratio would have to be specified as follows, which needs more than one operator:
```
L2_MISS_RATIO = L2_MISS / (L2_MISS + L2_HIT)
```
For this situation there is the following workaround:
```
L2_MISS_RATIO = L2_MISS # L2_HIT
CN  = C1 # C2  ! stands for  C1 / (C1 + C2)
```
The bus prefetch ratio is the number of prefetch accesses divided by the number of bus accesses (BUS). If there is no counter for the prefetch accesses but only one for the non-prefetch accesses (BUS_NP), the ratio would have to be specified as follows, which again needs more than one operator:
```
BUS_PREFETCH_RATIO =  (BUS - BUS_NP) / BUS
```
For this situation there is the following workaround:
```
BUS_PREFETCH_RATIO =  BUS_NP ~  BUS
DN  = D1 ~ D2  !  stands for (D2 - D1) / D2
```

If a composed counter is enabled, the runtime system enables the counters that define the composed counter.

Region	walltime (ms)	Level-2 Cache Hits (M)	Level-2 Cache Misses (M)	Level-2 Cache Miss Rate (%)
`MG`	13034.071	997.914	72.918	6.809
`ZERO3`	1108.649	5.392	5.306	49.594
`ZRAN3`	2256.180	22.118	6.733	23.338
`VRANLC`	915.841	6.125	0.135	2.152
`COMM3`	271.549	3.294	2.656	44.640
`NORM2U3`	411.851	6.613	1.576	19.242
`RESID`	5576.151	578.691	34.209	5.582
`MG3P`	6742.085	640.729	43.987	6.424
`RPRJ3`	841.641	103.109	5.745	5.278
`PSINV`	2215.255	255.308	11.073	4.157
`INTERP`	1096.870	29.069	10.320	26.200
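
The operators described in this section can be summarized in a few lines of Python. This is only an illustrative sketch of the arithmetic, not the ADAPTOR parser: the helper names `parse_composed` and `evaluate` are hypothetical, and the entry is assumed to have its three operands separated by blanks.

```python
import operator

# Operators of a composed-counter entry "NEW = C1 op C2" as described above.
OPS = {
    "/": operator.truediv,
    "+": operator.add,
    "-": operator.sub,
    "#": lambda c1, c2: c1 / (c1 + c2),   # C1 # C2 stands for C1 / (C1 + C2)
    "~": lambda d1, d2: (d2 - d1) / d2,   # D1 ~ D2 stands for (D2 - D1) / D2
}


def parse_composed(entry):
    """Split 'NEW = C1 op C2' into (new, c1, op, c2)."""
    new, rhs = (s.strip() for s in entry.split("=", 1))
    c1, op, c2 = rhs.split()
    return new, c1, op, c2


def evaluate(entry, counters):
    """Return the name and value of the composed counter for given counter values."""
    new, c1, op, c2 = parse_composed(entry)
    return new, OPS[op](counters[c1], counters[c2])


# Level-2 cache hits and misses (in millions) of region MG from the table above.
mg = {"L2_HIT": 997.914, "L2_MISS": 72.918}
name, ratio = evaluate("L2_MISS_RATIO = L2_MISS # L2_HIT", mg)
print(name, round(100 * ratio, 3))   # L2_MISS_RATIO 6.809
```

The result matches the miss rate listed for `MG`, which suggests that the last column of the table was produced with the `#` operator.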

6.5 Derived Counter

ADAPTOR can also rate counter values by the walltime and by the CPU time, i.e. express them as events per second.

According to the terminology of composed counters, the number of load operations per second would be defined as follows:

LOADS = LD_INS / WALL_TIME

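This kind of definition is not possible because WALL_TIME itself is handled internally as a composed counter (WALL_TICKS / TICK_RATE), and composed counters cannot be used to define other composed counters. The following definition, with a single quote after the counter name, gives the correct value: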

LOADS = LD_INS '

Note: The definition of FLOPS as FLOPS = FP_INS ' is given by default.

Region	walltime (ms)	Level-2 Cache Misses (M)	Level-2 Cache Misses / s (M)
`MG`	13034.071	72.918	5.594
`ZERO3`	1108.649	5.306	4.786
`ZRAN3`	2256.180	6.733	2.984
`VRANLC`	915.841	0.135	0.147
`COMM3`	271.549	2.656	9.782
`NORM2U3`	411.851	1.576	3.825
`RESID`	5576.151	34.209	6.135
`MG3P`	6742.085	43.987	6.524
`RPRJ3`	841.641	5.745	6.826
`PSINV`	2215.255	11.073	4.999
`INTERP`	1096.870	10.320	9.409

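In a similar way, counters can be rated by CPU_TIME instead of WALL_TIME, using a double quote after the counter name:

CPU_FLOPS  = FP_INS "

The CPU time is obtained from the total number of cycles, so the value is computed as follows:

CPU_TIME = TOT_CYC / CPU_RATE
CPU_FLOPS = (FP_INS / TOT_CYC) * CPU_RATE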
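
The arithmetic behind derived counters is shown in the following Python sketch. The function names are hypothetical, and the tick rate of 1e6 ticks per second is an assumed value used only to turn the walltime from the table back into ticks.

```python
def per_wall_second(count, wall_ticks, tick_rate):
    """Events per second of walltime (single-quote suffix)."""
    wall_time = wall_ticks / tick_rate     # WALL_TIME = WALL_TICKS / TICK_RATE
    return count / wall_time


def per_cpu_second(count, tot_cyc, cpu_rate):
    """Events per second of CPU time (double-quote suffix)."""
    cpu_time = tot_cyc / cpu_rate          # CPU_TIME = TOT_CYC / CPU_RATE
    return count / cpu_time


# Region MG from the table above: 72.918 M cache misses in 13034.071 ms.
tick_rate = 1_000_000                      # assumed: ticks per second
wall_ticks = 13034.071e-3 * tick_rate
misses_per_s = per_wall_second(72.918e6, wall_ticks, tick_rate)
print(round(misses_per_s / 1e6, 3))        # 5.594
```

The value agrees with the last column of the table for `MG`. `per_cpu_second` has the same form and corresponds to (FP_INS / TOT_CYC) * CPU_RATE when applied to FP_INS.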

6.6 Output of Region Event Counting

After the termination of the program the runtime system writes an output file pm.out that contains the values of the counted events for the regions.

While the counting itself is done with integer values, the output format can be specified by entries of the following form in the configuration file:

<x>COUNTER [BRUTTO|NETTO] width precision

A counter prefixed with m is rated for milli.
A counter prefixed with k is rated for kilo.
A counter prefixed with M is rated for Mega.
A counter prefixed with % is rated for percent.
For a counter entry, the field width and the precision specify how the rated value is printed (this corresponds to the %width.precisionf conversion of printf in C programs).

# Performance Monitoring Configuration File
# inclusive walltime in milliseconds  %f12.3
WALL_TIME BRUTTO 12 3
# exclusive walltime in milliseconds  %f12.2
mWALL_TIME NETTO 12 2
# count load instructions Mega  %f14.3
MLD_INS BRUTTO 14 3
# count Level 2 Data cache misses kilo %f12.3
kL2_DCM BRUTTO 12 3
# count Level 1 Data cache misses, %f13.0
L1_DCM BRUTTO 13 0
# Level 1 miss rate in percentage, %f10.3
%L1_MISS_RATE BRUTTO 10 3
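
How such an entry could be turned into a printed value is sketched below in Python. This is an illustration under stated assumptions, not ADAPTOR's code: the scale factors are one reading of "rated for milli/kilo/Mega/percent", the helper names are hypothetical, and a prefix is recognized only if the rest of the name is a known counter.

```python
# Assumed scale factors: e.g. "rated for kilo" is read as "divide the raw value by 1000".
SCALE = {"m": 1e3, "k": 1e-3, "M": 1e-6, "%": 1e2}


def parse_entry(line, known_counters):
    """Parse '<x>COUNTER [BRUTTO|NETTO] width precision' into its parts."""
    name, mode, width, precision = line.split()
    scale = 1.0
    # Treat the first character as a prefix only if the rest is a known counter.
    if name not in known_counters and name[0] in SCALE and name[1:] in known_counters:
        scale, name = SCALE[name[0]], name[1:]
    return name, mode, scale, int(width), int(precision)


def format_value(raw, scale, width, precision):
    """Print the rated value like C's %width.precisionf."""
    return f"{raw * scale:{width}.{precision}f}"


known = {"WALL_TIME", "LD_INS", "L2_DCM", "L1_DCM", "L1_MISS_RATE"}
name, mode, scale, width, precision = parse_entry("MLD_INS BRUTTO 14 3", known)
print(repr(format_value(1_234_567_890, scale, width, precision)))
```

With these assumptions, 1234567890 load instructions are printed as 1234.568 (in millions), right-aligned in a field of 14 characters.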

Thomas Brandes 2004-03-19