Next: 7 Tracing of Performance Up: ADAPTOR Profiling Guide Previous: 5 Compiling and Linking   Contents


6 Counting Performance Events For Regions

6.1 The Idea of Event Counting

Event counting is a generalization of elapsed time measurement. Instead of just counting clock ticks, it allows counting other performance events (e.g. cache misses, floating point instructions) whose counters are enabled during the whole execution of the program. Certain events (e.g. ticks of the wall-time clock) are counted by the operating system; others (e.g. the number of calls of a region) are counted by the ADAPTOR runtime system. Most processors also allow counting processor-specific events by enabling hardware performance counters.

At runtime, the ADAPTOR runtime system reads the values of enabled counters at every entry and exit of a region and accumulates the values over the whole execution of the program.

A configuration file (e.g. pm_config) specifies which events are counted and where the counter values are accumulated for the regions. Before the program starts executing, the runtime system also makes sure that the corresponding event counters are enabled.

6.2 Event Counting with ADAPTOR

Event counting is enabled by the following steps:

6.3 Brutto and Netto Counting

The runtime system supports brutto (inclusive) and netto (exclusive) counting for regions. Brutto counting accumulates the counter values from the start to the end of a region, including events that occur in regions entered from within it. Netto counting stops accumulating for a region while another region entered from within it is active.

Table 6 shows an example of how the counter values are accumulated for brutto and netto counting. Brutto counting accumulates the number of events for all regions in the stack of called regions, while netto counting accumulates the number of events only for the region that is on top of the stack.


Table 6: Brutto and netto counting for region CAPACITY.

region calls      counter value   brutto counting   netto counting
                                  for CAPACITY      for CAPACITY
start CAPACITY             1000                 0                0
start SUB1                 1200               200              200
end SUB1                   1300               300              200
start SUB2                 1500               500              400
end SUB2                   1800               800              400
end CAPACITY               2100              1100              700
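
The accumulation rule behind Table 6 can be replayed in a few lines of Python. This is only an illustrative sketch, not ADAPTOR code: the function name `count_region` and the trace representation are assumptions made for this example. It keeps a stack of active regions and, at each entry or exit, adds the counter difference since the previous reading to the brutto value if the region is anywhere on the stack, and to the netto value only if it is on top.

```python
# Hypothetical replay of the trace in Table 6; not part of ADAPTOR.
def count_region(trace, region):
    """Return (brutto, netto) event counts accumulated for `region`.

    trace: list of (action, region_name, counter_value) tuples, where
           action is "start" or "end" and counter_value is the raw
           counter reading taken at that entry or exit.
    """
    stack = []         # currently active regions, innermost last
    brutto = 0
    netto = 0
    last_value = None  # counter reading at the previous trace entry

    for action, name, value in trace:
        if last_value is not None and stack:
            delta = value - last_value
            # Brutto: events count for every region on the stack.
            if region in stack:
                brutto += delta
            # Netto: events count only for the region on top of the stack.
            if stack[-1] == region:
                netto += delta
        if action == "start":
            stack.append(name)
        else:
            stack.pop()
        last_value = value
    return brutto, netto


TABLE_6 = [
    ("start", "CAPACITY", 1000),
    ("start", "SUB1", 1200),
    ("end", "SUB1", 1300),
    ("start", "SUB2", 1500),
    ("end", "SUB2", 1800),
    ("end", "CAPACITY", 2100),
]

print(count_region(TABLE_6, "CAPACITY"))  # (1100, 700)
```

The same trace gives (100, 100) for SUB1 and (300, 300) for SUB2, since neither of them enters another region.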


6.4 Composed Counters

It is possible to specify a new counter whose value is composed of the values of two known counters.

new_counter = counter1 op counter2

The entry itself only defines a composed counter, but does not enable it. Enabling of the new counter is done in the same way as enabling any other counter.

The typical use of composed counters is to define rates between other counters. E.g. the number of cache misses does not give much information on its own, but related to the total number of load instructions, the corresponding cache miss rate gives better information about a performance problem.

L1_MRATE  = L1_MISS / LD_INS
L2_MRATE  = L2_MISS / L2_READ
TLB_MRATE = DTLB_MISS / LD_INS

New counters can also be defined as the sum or the difference of two counter values.

L2_READ1 = L2_HIT + L2_MISS
BUS_PF   = BUS - BUS_NP

A composed counter can only be defined from exactly two known counters, and composed counters cannot be used to define other composed counters. This has proven to be a rather serious restriction in the following cases:

If a composed counter is enabled, the runtime system enables the counters that define the composed counter.


Table 7: Composed counter: Level-2 Cache Miss Rate

Region      walltime    Level-2       Level-2        Level-2 Cache
                        Cache Hits    Cache Misses   Miss Rate
                  ms             M              M                %
MG         13034.071       997.914         72.918            6.809
ZERO3       1108.649         5.392          5.306           49.594
ZRAN3       2256.180        22.118          6.733           23.338
VRANLC       915.841         6.125          0.135            2.152
COMM3        271.549         3.294          2.656           44.640
NORM2U3      411.851         6.613          1.576           19.242
RESID       5576.151       578.691         34.209            5.582
MG3P        6742.085       640.729         43.987            6.424
RPRJ3        841.641       103.109          5.745            5.278
PSINV       2215.255       255.308         11.073            4.157
INTERP      1096.870        29.069         10.320           26.200
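
To make the definition syntax and its restriction concrete, the following Python sketch evaluates a single entry of the form `new_counter = counter1 op counter2` against a dictionary of basic counter values. It is an assumption-laden illustration, not the ADAPTOR implementation: the function name `evaluate_composed` is hypothetical, and only the operators that appear in the text (`+`, `-`, `/`) are handled. The sample values are the MG row of Table 7, in units of M.

```python
import operator

# Only the three operators that appear in the text are supported here.
OPS = {"+": operator.add, "-": operator.sub, "/": operator.truediv}


def evaluate_composed(definition, basic_counters):
    """Evaluate 'NEW = A op B' where A and B must be basic counters.

    Returns (new_name, value). Operands that are not in basic_counters
    are rejected, mirroring the rule that a composed counter cannot be
    built from another composed counter.
    """
    new_name, expression = (part.strip() for part in definition.split("=", 1))
    left, op, right = expression.split()
    if op not in OPS:
        raise ValueError(f"unsupported operator: {op}")
    for operand in (left, right):
        if operand not in basic_counters:
            raise KeyError(f"{operand} is not a known basic counter")
    return new_name, OPS[op](basic_counters[left], basic_counters[right])


# MG row of Table 7 (hits and misses in M).
mg = {"L2_HIT": 997.914, "L2_MISS": 72.918}

name, l2_reads = evaluate_composed("L2_READ1 = L2_HIT + L2_MISS", mg)
miss_rate_percent = 100.0 * mg["L2_MISS"] / l2_reads
print(f"{name} = {l2_reads:.3f} M, miss rate = {miss_rate_percent:.3f} %")
```

The miss rate is computed in plain Python on purpose: a definition such as `L2_MRATE = L2_MISS / L2_READ1` would be rejected by this sketch, because `L2_READ1` is itself composed.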


6.5 Derived Counter

ADAPTOR can relate counter values to the walltime or to the CPU time, which yields rates such as events per second.

Following the terminology for composed counters, the number of load operations per second could be defined as follows:

LOADS = LD_INS / WALL_TIME

This kind of definition is not possible because WALL_TIME itself is handled internally as a composed counter (WALL_TICKS / TICK_RATE), and a composed counter cannot be defined from another composed counter. The following definition, however, gives the correct value:

LOADS = LD_INS '

Note: The definition of FLOPS as FLOPS = FP_INS ' is given by default.


Table 8: Level-2 cache misses derived by walltime.

Region      walltime    Level-2        Level-2 Cache
                        Cache Misses   Misses / s
                  ms              M                M
MG         13034.071         72.918            5.594
ZERO3       1108.649          5.306            4.786
ZRAN3       2256.180          6.733            2.984
VRANLC       915.841          0.135            0.147
COMM3        271.549          2.656            9.782
NORM2U3      411.851          1.576            3.825
RESID       5576.151         34.209            6.135
MG3P        6742.085         43.987            6.524
RPRJ3        841.641          5.745            6.826
PSINV       2215.255         11.073            4.999
INTERP      1096.870         10.320            9.409


In a similar way, counters can be derived with respect to CPU_TIME instead of WALL_TIME:

CPU_FLOPS  = FP_INS "

CPU_TIME = TOT_CYC / CPU_RATE
CPU_FLOPS = (FP_INS / TOT_CYC) * CPU_RATE
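
The arithmetic behind derived counters can be checked with a short Python sketch. The helper names are hypothetical and the FP_INS, TOT_CYC and CPU_RATE values are made up for illustration; only the MG numbers are taken from Table 8. The first part reproduces the walltime-derived miss rate, the second shows that the CPU_FLOPS formula above is the same as dividing FP_INS by CPU_TIME.

```python
def rate_per_second(count, time_ms):
    """Events per second, given an event count and a time in milliseconds."""
    return count / (time_ms / 1000.0)


def cpu_rate_per_second(count, tot_cyc, cpu_rate):
    """Events per CPU second, written as (count / TOT_CYC) * CPU_RATE."""
    return (count / tot_cyc) * cpu_rate


# MG row of Table 8: 72.918 M misses in 13034.071 ms of walltime.
mg_rate = rate_per_second(72.918e6, 13034.071)
print(f"{mg_rate / 1e6:.3f} M misses/s")  # 5.594 M misses/s

# Made-up values: 2e9 floating point instructions, 4e9 cycles, 1 GHz clock.
fp_ins, tot_cyc, cpu_rate = 2e9, 4e9, 1e9
cpu_time = tot_cyc / cpu_rate  # seconds of CPU time
cpu_flops = cpu_rate_per_second(fp_ins, tot_cyc, cpu_rate)
print(cpu_flops == fp_ins / cpu_time)  # True
```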

6.6 Output of Region Event Counting

After the program terminates, the runtime system writes an output file pm.out that contains the values of the counted events for the regions.

Counting itself is done with integer values, but the output format can be specified by certain items in the configuration file:

<x>COUNTER [BRUTTO|NETTO] width precision

# Performance Monitoring Configuration File
# inclusive walltime in milliseconds  %f12.3
WALL_TIME BRUTTO 12 3
# exclusive walltime in milliseconds  %f12.2
mWALL_TIME NETTO 12 2
# count load instructions Mega  %f14.3
MLD_INS BRUTTO 14 3
# count Level 2 Data cache misses kilo %f12.3
kL2_DCM BRUTTO 12 3
# count Level 1 Data cache misses, %f13.0
L1_DCM BRUTTO 13 0
# Level 1 miss rate in percentage, %f10.3
%L1_MISS_RATE BRUTTO 10 3
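
As a rough illustration of how such an item could be interpreted, the Python sketch below splits a configuration line into scale prefix, counter name, counting mode, width and precision, and formats a value accordingly. Everything here is an assumption inferred from the comments in the example above: the scale factors (m = milli, k = kilo, M = Mega, % = percentage), the idea that walltime is held in seconds, the use of a set of known counter names to tell a prefix from the first letter of a name (MLD_INS versus L1_DCM), and the function names themselves. ADAPTOR may resolve these differently.

```python
# Scale factors inferred from the comments in the example (assumption).
SCALE = {"m": 1e3, "k": 1e-3, "M": 1e-6, "%": 1e2}


def parse_item(line, known_counters):
    """Parse '<x>COUNTER [BRUTTO|NETTO] width precision'.

    Returns (prefix, counter, mode, width, precision); prefix is "" when
    the name carries no scale prefix, mode is None when it is omitted.
    """
    fields = line.split()
    if len(fields) == 4:
        name, mode, width, precision = fields
        if mode not in ("BRUTTO", "NETTO"):
            raise ValueError(f"unknown counting mode: {mode}")
    elif len(fields) == 3:
        name, width, precision = fields
        mode = None
    else:
        raise ValueError(f"cannot parse item: {line!r}")

    prefix = ""
    # A leading character is treated as a prefix only if the full name is
    # not itself a known counter and the remainder is.
    if name not in known_counters and name[0] in SCALE and name[1:] in known_counters:
        prefix, name = name[0], name[1:]
    return prefix, name, mode, int(width), int(precision)


def format_value(value, prefix, width, precision):
    """Scale a raw value and format it like %f<width>.<precision>."""
    scaled = value * SCALE.get(prefix, 1.0)
    return f"{scaled:{width}.{precision}f}"


KNOWN = {"WALL_TIME", "LD_INS", "L1_DCM", "L2_DCM", "L1_MISS_RATE"}

item = parse_item("kL2_DCM BRUTTO 12 3", KNOWN)
print(item)  # ('k', 'L2_DCM', 'BRUTTO', 12, 3)
print(repr(format_value(5306000, item[0], item[3], item[4])))
```

With the assumed factors, a raw count of 5306000 Level 2 data cache misses is printed as 5306.000 (kilo), right-aligned in a field of 12 characters.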


Thomas Brandes 2004-03-19