Event counting is a generalization of elapsed time measurement. Instead of counting only clock ticks, it allows counting other performance events (e.g. cache misses, floating point instructions) whose counters are enabled during the whole execution of the program. Certain events (e.g. ticks of the wall-time clock) are counted by the operating system; others are counted by the ADAPTOR runtime system (e.g. the number of calls of a region). Most processors also allow counting processor-specific events by enabling hardware performance counters.
At runtime, the ADAPTOR runtime system reads the values of enabled counters at every entry and exit of a region and accumulates the values over the whole execution of the program.
A configuration file (e.g. pm_config) specifies the events that are counted and where the counter values are accumulated for the regions. Before starting the program, the runtime system also makes sure that the corresponding event counters are enabled.
Event counting is enabled by the following steps:
adaptor ... [-pm=papi | -pm=pcl | -pm=perfctr]
a.out -pm ! measures default performance counters
a.out -pm=file ! measures performance counters specified in file
export PM=file
a.out
The runtime system supports brutto and netto counting for regions (inclusive and exclusive counting). Brutto counting accumulates the counter values from the entry to the exit of a region, including events that occur in regions entered in between. Netto counting stops accumulating for a region while another region entered from it is active.
Table 6 shows an example of how the counter values are accumulated for brutto and netto counting. Brutto counting accumulates the number of events for all regions in the stack of called regions, while netto counting accumulates the number of events only for the region that is on top of the stack.
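The difference between the two modes can be illustrated with a small simulation. The following Python sketch is not part of ADAPTOR; the class RegionCounter and the region names MAIN and SOLVE are hypothetical, and the counter values passed at entry and exit are made up. It only models the accumulation rule described above for non-recursive regions.

```python
from collections import defaultdict


class RegionCounter:
    """Toy model of brutto (inclusive) and netto (exclusive) accumulation."""

    def __init__(self):
        # Each stack frame: [region name, counter value at entry,
        #                    events that occurred in nested regions]
        self.stack = []
        self.brutto = defaultdict(int)
        self.netto = defaultdict(int)

    def enter(self, region, counter_value):
        self.stack.append([region, counter_value, 0])

    def leave(self, counter_value):
        region, start, in_nested = self.stack.pop()
        inclusive = counter_value - start
        self.brutto[region] += inclusive             # everything between entry and exit
        self.netto[region] += inclusive - in_nested  # minus events of nested regions
        if self.stack:
            # The caller must not count these events as its own netto events.
            self.stack[-1][2] += inclusive


pm = RegionCounter()
pm.enter("MAIN", 0)     # counter reads 0 when MAIN is entered
pm.enter("SOLVE", 10)   # 10 events so far, all in MAIN
pm.leave(50)            # SOLVE caused 40 events
pm.leave(60)            # 10 more events in MAIN after SOLVE returned

print(dict(pm.brutto))  # {'SOLVE': 40, 'MAIN': 60}
print(dict(pm.netto))   # {'SOLVE': 40, 'MAIN': 20}
```

In this run the brutto value of MAIN contains the 40 events of SOLVE, while its netto value contains only the 20 events counted while MAIN itself was on top of the stack.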
It is possible to specify a new counter whose value is composed by the values of two known counters.
new_counter = counter1 op counter2
The entry itself only defines a composed counter, but does not enable it. Enabling of the new counter is done in the same way as enabling any other counter.
The typical use of composed counters is to define ratios between other counters. For example, the number of cache misses does not give much information on its own, but related to the total number of load instructions, the corresponding cache miss rate is a better indicator of a performance problem.
L1_MRATE = L1_MISS / LD_INS
L2_MRATE = L2_MISS / L2_READ
TLB_MRATE = DTLB_MISS / LD_INS
New counters can also be defined as the sum of or the difference between counter values.
L2_READ1 = L2_HIT + L2_MISS
BUS_PF = BUS - BUS_NP
A composed counter can only be defined from two known counters. Composed counters cannot be used to define other composed counters. This has proven to be a rather serious restriction in the following cases:
L2_MISS_RATIO = L2_MISS / (L2_MISS + L2_HIT)
For this situation there is the following workaround:
L2_MISS_RATIO = L2_MISS # L2_HIT
CN = C1 # C2 ! stands for C1 / (C1 + C2)
BUS_PREFETCH_RATIO = (BUS - BUS_NP) / BUS
For this situation there is the following workaround:
BUS_PREFETCH_RATIO = BUS_NP ~ BUS
DN = D1 ~ D2 ! stands for (D2 - D1) / D2
If a composed counter is enabled, the runtime system enables the counters that define the composed counter.
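The meaning of the binary operators used in these definitions can be summarized in a short evaluator. The following Python sketch is an illustration only and does not reflect how the ADAPTOR runtime system is implemented; the function name eval_composed and the sample counter values are hypothetical. It covers only the five operators that appear in the examples above (/, +, -, # and ~).

```python
def eval_composed(definition, values):
    """Evaluate 'NAME = A op B' against a dict of known (raw) counter values."""
    name, expr = (part.strip() for part in definition.split("=", 1))
    a, op, b = expr.split()
    # Only known counters are looked up: a composed counter cannot be an operand.
    x, y = values[a], values[b]
    operators = {
        "/": lambda: x / y,
        "+": lambda: x + y,
        "-": lambda: x - y,
        "#": lambda: x / (x + y),   # C1 # C2 stands for C1 / (C1 + C2)
        "~": lambda: (y - x) / y,   # D1 ~ D2 stands for (D2 - D1) / D2
    }
    return name, operators[op]()


# Made-up counter values for illustration.
raw = {"L2_MISS": 25, "L2_HIT": 75, "BUS": 200, "BUS_NP": 150}

print(eval_composed("L2_MISS_RATIO = L2_MISS # L2_HIT", raw))
print(eval_composed("BUS_PREFETCH_RATIO = BUS_NP ~ BUS", raw))
print(eval_composed("L2_READ1 = L2_HIT + L2_MISS", raw))
```

With these values both ratios evaluate to 0.25: 25 / (25 + 75) for the miss ratio and (200 - 150) / 200 for the prefetch ratio.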
ADAPTOR can also rate counter values by the wall time and by the CPU time, i.e. report them as events per second.
According to the terminology for composed counters, the number of load operations per second could be defined as follows:
LOADS = LD_INS / WALL_TIME
This kind of definition is not possible as WALL_TIME itself is handled internally as a composed counter (WALL_TICKS / TICK_RATE). But the following definition gives the correct value:
LOADS = LD_INS '
Note: The definition of FLOPS as FLOPS = FP_INS ' is given by default.
In a similar way, counters can be derived for the CPU_TIME instead of WALL_TIME:
CPU_FLOPS = FP_INS "
CPU_TIME = TOT_CYC / CPU_RATE
CPU_FLOPS = (FP_INS / TOT_CYC) * CPU_RATE
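As a worked example of the CPU time rating, the following Python sketch evaluates both formulas with made-up numbers. The counter values and the clock rate of 2 GHz are assumptions chosen only for illustration; nothing here is read from a real counter.

```python
import math

# Hypothetical raw counter values, chosen only for illustration.
FP_INS = 3_000_000     # floating point instructions
TOT_CYC = 6_000_000    # total CPU cycles
CPU_RATE = 2.0e9       # cycles per second (assumed 2 GHz clock)

cpu_time = TOT_CYC / CPU_RATE               # CPU_TIME = TOT_CYC / CPU_RATE
cpu_flops = (FP_INS / TOT_CYC) * CPU_RATE   # the value meant by CPU_FLOPS = FP_INS "

# Both forms describe the same rate: events divided by CPU time.
assert math.isclose(cpu_flops, FP_INS / cpu_time)

print(f"CPU_TIME  = {cpu_time:.3f} s")
print(f"CPU_FLOPS = {cpu_flops:.3e}")
```

The second form avoids the intermediate CPU_TIME value, which matches the restriction that a composed counter cannot itself be an operand of another composed counter.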
After the program terminates, the runtime system writes an output file pm.out that contains the values of the counted events for each region.
While counting itself is done with integer values, the output format can be specified by certain items in the configuration file:
<x>COUNTER [BRUTTO|NETTO] width precision
# Performance Monitoring Configuration File
# inclusive walltime in milliseconds %f12.3
WALL_TIME BRUTTO 12 3
# exclusive walltime in milliseconds %f12.2
mWALL_TIME NETTO 12 2
# count load instructions Mega %f14.3
MLD_INS BRUTTO 14 3
# count Level 2 Data cache misses kilo %f12.3
kL2_DCM BRUTTO 12 3
# count Level 1 Data cache misses, %f13.0
L1_DCM BRUTTO 13 0
# Level 1 miss rate in percentage, %f10.3
%L1_MISS_RATE BRUTTO 10 3
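To make the entry syntax more concrete, the following Python sketch parses one configuration entry and formats a value accordingly. It is an illustration, not ADAPTOR code. The names parse_entry and format_value are hypothetical, and the scale factors are an assumption inferred from the comments in the example file (m for milli, k for kilo, M for Mega, % for percentage). The default counting mode for entries without BRUTTO or NETTO is not stated here, so the sketch leaves it unset.

```python
# Assumed meaning of the one-character prefix <x>; inferred from the comments
# in the example configuration file, not from an ADAPTOR specification.
SCALE = {"m": 1e3, "k": 1e-3, "M": 1e-6, "%": 1e2}


def parse_entry(line, known_counters):
    """Split '<x>COUNTER [BRUTTO|NETTO] width precision' into its parts."""
    fields = line.split()
    name = fields[0]
    mode = fields[1] if len(fields) == 4 else None  # default mode not stated
    width, precision = int(fields[-2]), int(fields[-1])
    if name in known_counters:
        prefix = ""
    elif name[0] in SCALE and name[1:] in known_counters:
        prefix, name = name[0], name[1:]
    else:
        raise ValueError(f"unknown counter: {name}")
    return name, prefix, mode, width, precision


def format_value(value, prefix, width, precision):
    """Scale a counter value and print it like the %f<width>.<precision> hint."""
    scaled = value * SCALE.get(prefix, 1.0)
    return f"{scaled:{width}.{precision}f}"


known = {"WALL_TIME", "LD_INS", "L2_DCM", "L1_DCM", "L1_MISS_RATE"}
name, prefix, mode, width, precision = parse_entry("MLD_INS BRUTTO 14 3", known)
print(name, prefix, mode)
print(repr(format_value(2_500_000, prefix, width, precision)))
```

Passing the set of known counter names is a design choice of this sketch: it is one simple way to decide whether a leading M belongs to the counter name or is a scaling prefix.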