Commit 71055c4

Merge branch 'v0.12-dev' into dev
2 parents 83ed8fc + 700a8f4 commit 71055c4

61 files changed, +1843 -1077 lines changed

CHANGELOG.md

Lines changed: 6 additions & 1 deletion
@@ -1,5 +1,11 @@
 # *perf-cpp*: Changelog

+## v0.12.2
+
+- **Metric Functions**: Metrics now support built-in functions such as `ratio(A, B)` and `sum(A, B, C, ...)`, enabling more expressive and reusable formulas (see the [documentation](docs/metrics.md#functions)).
+- **Optimized Compile-time Event Injection**: The generated runtime event registration class is now only created if it does not already exist, reducing unnecessary recompilation.
+- **Improved Live Event Accuracy**: Live event values now account for partial runtime durations via time scaling, improving accuracy when counters were not active for the full measurement window.
+
 ## v0.12.1
 This update extends event discovery to ARM platforms, improves hardware counter introspection, and enhances the flexibility of metric definitions.

@@ -29,7 +35,6 @@ The previous flat API is still available but deprecated and will be removed in `
 - **Explicit Latency Attributes**: Vendor-specific latency signals–*cache-access* on Intel and *cache-miss* on AMD–are now surfaced as distinct fields.
 - **Heterogeneous-core Support**: Sampling can target multiple PMU domains (e.g., *cpu_core* and *cpu_atom*) on hybrid Intel processors.

-
 ## v0.10.0
 * New feature: The *auxiliary event* is added automatically if required by the (Intel-) hardware (see the [documentation](docs/sampling.md#sapphire-rapids-and-beyond)).
 * New feature: The *Memory Access Analyzer* allows describing complex data objects and maps sampled memory addresses in order to report latency and access information (see the [documentation](docs/analyzing-memory-access-patterns)).
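
Reviewer note on the v0.12.2 entry "Improved Live Event Accuracy": the changelog does not spell out the scaling rule. A minimal sketch, assuming it is the usual extrapolation applied to perf counters that were only scheduled for part of the measurement window (raw value multiplied by time enabled over time running). The function name `scale_counter` and its parameters are hypothetical and not part of *perf-cpp*.

```cpp
#include <cstdint>

/// Hypothetical helper (not perf-cpp API): extrapolates a raw counter value
/// to the full measurement window, assuming the counter was only active for
/// `time_running` out of `time_enabled` time units.
inline double scale_counter(const std::uint64_t raw_value,
                            const std::uint64_t time_enabled,
                            const std::uint64_t time_running) noexcept
{
  /// A counter that never ran carries no information; report zero.
  if (time_running == 0U) {
    return 0.0;
  }

  const auto scale = static_cast<double>(time_enabled) / static_cast<double>(time_running);
  return static_cast<double>(raw_value) * scale;
}
```

For a counter that was active for the whole window, the scale factor is one and the raw value is returned unchanged.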

CMakeLists.txt

Lines changed: 5 additions & 1 deletion
@@ -39,13 +39,17 @@ set(PERF_CPP_SRC
     src/exception.cpp
     src/group.cpp
     src/hardware_info.cpp
-    src/metric_expression.cpp
     src/requested_event.cpp
     src/sampler.cpp
     src/sample_decoder.cpp
     src/util/table.cpp
     src/mmap_buffer.cpp
     src/symbol_resolver.cpp
+    src/metric/expression/token.cpp
+    src/metric/expression/tokenizer.cpp
+    src/metric/expression/parser.cpp
+    src/metric/expression/function.cpp
+    src/metric/expression/expression.cpp
     src/analyzer/memory_access.cpp
     src/analyzer/flame_graph_generator.cpp
 )

README.md

Lines changed: 1 addition & 0 deletions
@@ -192,6 +192,7 @@ This is a non-exhaustive list of academic research papers and blog articles (fee
 - [Analyzing memory accesses with modern processors](https://dl.acm.org/doi/abs/10.1145/3399666.3399896) (2020)
 - [Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10068807&tag=1) (2023)
 - [Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE](https://arxiv.org/html/2410.01514v1) (2024)
+- [Breaking the Cycle - A Short Overview of Memory-Access Sampling Differences on Modern x86 CPUs](https://dl.acm.org/doi/pdf/10.1145/3736227.3736241) (2025)

 ### Blog Posts
 - [C2C - False Sharing Detection in Linux Perf](https://joemario.github.io/blog/2016/09/01/c2c-blog/) (2016)

docs/metrics.md

Lines changed: 50 additions & 43 deletions
@@ -1,41 +1,39 @@
 # Metrics
-Performance metrics are critical for evaluating the efficiency of computer hardware using specific, user-defined calculations based on hardware events.
-One key metric frequently used is the "Cycles per Instruction" (CPI).
-This metric helps to measure how many CPU cycles are consumed for executiong an instruction, providing insight into the system's efficiency–the fewer the cycles needed per instruction, the more efficient the system.
+Performance metrics provide essential insights into hardware efficiency by combining multiple hardware events into meaningful calculations.
+A commonly used metric is "Cycles per Instruction" (CPI), which measures how many CPU cycles are required to execute an instruction.
+This metric reveals system efficiency–fewer cycles per instruction indicate better performance.
+

 > [!TIP]
-> Our examples include a working code-example: **[statistics/metric.cpp](../examples/statistics/metric.cpp)**.
+> Our examples include a working code example: **[statistics/metric.cpp](../examples/statistics/metric.cpp)**.
 >
-> When [defining custom metrics](#creating-custom-metrics), you should take a look at the list of metrics in the [Likwid project](https://github.com/RRZE-HPC/likwid/tree/master/groups).
-
-> [!NOTE]
-> Metrics are not applicable for [live events](recording-live-events.md).
+> When [defining custom metrics](#creating-custom-metrics), consider reviewing the comprehensive metric definitions in the [Likwid project](https://github.com/RRZE-HPC/likwid/tree/master/groups).

 ---
 ## Table of Contents
 - [Built-in Metrics](#built-in-metrics)
-- [Utilizing Metrics](#utilizing-metrics)
+- [Using Metrics](#using-metrics)
 - [Defining Metrics](#creating-custom-metrics)
 ---

 ## Built-in Metrics
-*perf-cpp* comes pre-equipped with several built-in metrics which can be used analogously to events.
-To employ these metrics, include their names in the `perf::EventCounter` instance as shown in the [Utilizing Metrics](#utilizing-metrics) section:
-
-| Metric name | Description |
-|--------------------------|-----------------------------------------------------------------------|
-| `gigahertz` | Processor speed during the measurement (`cycles/seconds*1e+09`). |
-| `cycles-per-instruction` | Represents the number of cycles required per instruction. |
-| `instructions-per-cycle` | Represents the number of instructions executed per cycle. |
-| `cache-hit-ratio` | Indicates the ratio of cache hits to total cache accesses. |
-| `cache-miss-ratio` | Indicates the ratio of cache misses to total cache accesses. |
-| `dTLB-miss-ratio` | The ratio of data TLB misses to data TLB accesses. |
-| `iTLB-miss-ratio` | The ratio of instruction TLB misses to instruction TLB accesses. |
-| `L1-data-miss-ratio` | Reflects the ratio of L1 data cache misses to L1 data cache accesses. |
-| `branch-miss-ratio` | Reflects the ratio of branch misses to executed branches. |
-
-## Utilizing Metrics
-Metrics function similarly to hardware events in the `perf::EventCounter`:
+*perf-cpp* includes several pre-defined metrics that you can use just like hardware events.
+Simply include their names in your `perf::EventCounter` by treating them as standard events (e.g., `event_counter.add("gigahertz");`):
+
+| Metric name | Description |
+|--------------------------|----------------------------------------------------------------------|
+| `gigahertz` | Processor frequency during the measurement (`cycles/seconds*1e+09`). |
+| `cycles-per-instruction` | Number of cycles required per instruction. |
+| `instructions-per-cycle` | Number of instructions executed per cycle. |
+| `cache-hit-ratio` | Ratio of cache hits to total cache accesses. |
+| `cache-miss-ratio` | Ratio of cache misses to total cache accesses. |
+| `dTLB-miss-ratio` | Ratio of data TLB misses to data TLB accesses. |
+| `iTLB-miss-ratio` | Ratio of instruction TLB misses to instruction TLB accesses. |
+| `L1-data-miss-ratio` | Ratio of L1 data cache misses to L1 data cache accesses. |
+| `branch-miss-ratio` | Ratio of branch mispredictions to total executed branches. |
+
+## Using Metrics
+Metrics work exactly like hardware events within the `perf::EventCounter`:

 ```cpp
 #include <perfcpp/event_counter.h>
@@ -56,21 +54,15 @@ const auto result = event_counter.result();
 const auto cycles_per_instruction = result.get("cycles-per-instruction");
 ```

-When metrics are used, *perf-cpp* internally counts the required hardware events (like cycles and instructions for CPI) and displays only the specified metrics and events.
+When you use metrics, *perf-cpp* automatically counts the necessary hardware events (such as *cycles* and *instructions* for the *cycles-per-instruction* metric) and presents only the requested metrics and events in the results.

 ## Creating Custom Metrics
-Metrics are often based on the performance counters supported by the underlying hardware.
-You can create custom metrics to tailor them to your specific hardware.
+Custom metrics allow you to leverage the specific performance counters available on your hardware platform.

-> [!TIP]
-> The [Likwid project](https://github.com/RRZE-HPC/likwid/tree/master) gives an excellent and extensive list of available metrics for various CPUs.
-> Take a look at their [groups/ directory](https://github.com/RRZE-HPC/likwid/tree/master/groups).
-
-There are two ways to define custom metrics.
-For both, you will need to create your own instance of the `perf::CounterDefinition` and pass it to the `perf::EventCounter` or `perf::Sampler`.
+*perf-cpp* offers two approaches for defining custom metrics: *formula-based* definitions using text expressions, or implementing custom classes that inherit from the `perf::Metric` interface.

 ### Using Formulas
-The first option is to express a metric as a calculation of several hardware and time events, for example:
+The simplest approach is to define metrics using mathematical expressions that combine hardware events and timing data:

 ```cpp
 auto counter_definition = perf::CounterDefinition{};
@@ -80,17 +72,32 @@ counter_definition.add("stalls-by-mem-loads",
 auto event_counter = perf::EventCounter{ counter_definition };
 ```

-The formular can use the following **operators**: `+`, `-`, `*`, and `/`.
+This example uses Intel SkylakeX architecture events and is adapted from [Likwid](https://github.com/RRZE-HPC/likwid/blob/master/groups/skylakeX/CYCLE_STALLS.txt).

-In addition, **scientific numbers** (e.g., `1E5`, `1e-5`) can be used.
+#### Operators
+Formulas support the following **mathematical operators**: `+`, `-`, `*`, and `/`.
+You can also use **scientific notation** (e.g., `1E5`, `1e-5`) for constants.

-> [!NOTE]
-> In formulas, event names that contain *operators* (like `-` in `L1D-misses`) need to be **escaped** using single quotes, e.g., `'L1D-misses'`.
+#### Functions
+Formulas provide built-in functions for common calculations:

-The example depends on events from the Intel SkylakeX architecture and is taken from [Likwid](https://github.com/RRZE-HPC/likwid/blob/master/groups/skylakeX/CYCLE_STALLS.txt).
+| Function | Description |
+|--------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `ratio(a,b)` or `d_ratio(a,b)` | Calculates the ratio between two operands, e.g., `ratio('branch-misses', 'branches')` calculates the *branch-miss ratio* |
+| `sum(a,b,...)` | Adds together two or more operands, e.g., `sum('mem_load_retired.l1_hit', 'mem_load_retired.l2_hit', 'mem_load_retired.l3_hit')` totals all cache hits |
+
+Functions can be combined within metric expressions:
+
+```cpp
+counter_definition.add("cache-miss-ratio",
+  "ratio( sum('mem_load_retired.l1_miss', 'mem_load_retired.l2_miss', 'mem_load_retired.l3_miss'), sum('mem_load_retired.l1_hit', 'mem_load_retired.l2_hit', 'mem_load_retired.l3_hit') )");
+```
+
+> [!NOTE]
+> Event names containing **mathematical operators** (such as the `-` in `L1D-misses`) must be **enclosed in single quotes**, e.g., `'L1D-misses'`.

 ### Implementing Metrics using the Interface
-The second option is to define metrics by implementing the `perf::Metric` interface, for example:
+For more complex calculations, you can create custom metric classes by implementing the `perf::Metric` interface:

 ```cpp
 #include <perfcpp/metric.h>
@@ -126,7 +133,7 @@ public:
 };
 ````

-After implementing custom metrics, incorporate them into the `perf::CounterDefinition` to utilize them effectively:
+After implementing your custom metric, register it with the `perf::CounterDefinition`:

 ```cpp
 auto counter_definition = perf::CounterDefinition{};
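
Reviewer note on the quoting rule in `docs/metrics.md`: the need for single quotes around names such as `'L1D-misses'` follows from how a formula is split into tokens. The sketch below is a simplified, hypothetical tokenizer written for illustration only; it is not the code added in `src/metric/expression/tokenizer.cpp`, and the names `is_operator` and `tokenize` are assumptions. It also ignores scientific notation such as `1e-5`, which a real tokenizer would have to treat as a single number.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

/// Characters that always form a token of their own (illustrative set).
inline bool is_operator(const char c)
{
  return c == '+' || c == '-' || c == '*' || c == '/' || c == '(' || c == ')' || c == ',';
}

/// Hypothetical sketch: splits a formula into tokens.
/// Text inside single quotes becomes one token, even if it contains operators.
inline std::vector<std::string> tokenize(const std::string& formula)
{
  std::vector<std::string> tokens;
  std::size_t position = 0U;

  while (position < formula.size()) {
    const char c = formula[position];

    if (std::isspace(static_cast<unsigned char>(c)) != 0) {
      /// Whitespace only separates tokens.
      ++position;
    } else if (c == '\'') {
      /// Quoted event name: take everything up to the closing quote.
      const std::size_t closing = formula.find('\'', position + 1U);
      if (closing == std::string::npos) {
        throw std::invalid_argument("unterminated quoted event name");
      }
      tokens.push_back(formula.substr(position + 1U, closing - position - 1U));
      position = closing + 1U;
    } else if (is_operator(c)) {
      tokens.push_back(std::string(1U, c));
      ++position;
    } else {
      /// Unquoted name or number: ends at whitespace, an operator, or a quote.
      std::size_t end = position;
      while (end < formula.size() && !is_operator(formula[end]) && formula[end] != '\'' &&
             std::isspace(static_cast<unsigned char>(formula[end])) == 0) {
        ++end;
      }
      tokens.push_back(formula.substr(position, end - position));
      position = end;
    }
  }

  return tokens;
}
```

With this sketch, `'L1D-misses' / cycles` yields the three tokens `L1D-misses`, `/`, and `cycles`, whereas the unquoted `L1D-misses / cycles` is read as the subtraction `L1D - misses` followed by a division.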

docs/recording-live-events.md

Lines changed: 7 additions & 11 deletions
@@ -27,10 +27,7 @@ auto event_counter = perf::EventCounter{ counter_definition };

 try {
   /// Events for live monitoring.
-  event_counter.add_live({"cache-misses", "cache-references"});
-
-  /// Traditional events for post-processing analysis.
-  event_counter.add({"instructions", "cycles", "branches", "branch-misses", "cache-misses", "cache-references"});
+  event_counter.add_live({"cache-misses", "cache-references", "branches"});
 } catch (std::runtime_error& e) {
   std::cerr << e.what() << std::endl;
 }
@@ -40,6 +37,11 @@ try {
 > The `perf::CounterDefinition` instance is used to store event configurations (e.g., names) and passed as a reference.
 > Consequently, the instance needs to be alive while using the `EventCounter`.

+> [!IMPORTANT]
+> In our experience, keeping live events separate from "traditional" events leads to more consistent results.
+
+> [!NOTE]
+> Live events can capture only hardware events, not metrics.

 ## Initializing the Hardware Counters *(optional)*
 Optionally, prepare the hardware counters ahead of time to exclude configuration time from your measurements; if skipped, this is handled automatically at the start:
@@ -114,17 +116,11 @@ for (auto i = 0U; i < runs; ++i) {
 ```

 ## Finalizing and Retrieving Results
-Upon completion, stop the counters and fetch final results for non-live events:
+Upon completion, stop the counters:

 ```cpp
 /// Stop the counter after processing.
 event_counter.stop();
-
-/// Calculate the result.
-const auto result = event_counter.result();
-
-//// Or print the results as table.
-std::cout << result.to_string() << std::endl;
 ```

 For further information, refer to the [recording basics documentation](recording.md) and the [code example](../examples/statistics/live_events.cpp).

examples/access_benchmark.h

Lines changed: 15 additions & 0 deletions
@@ -69,6 +69,21 @@ class AccessBenchmark
   [[nodiscard]] const std::vector<std::uint64_t>& indices() const noexcept { return _indices; }
   [[nodiscard]] const std::vector<cache_line>& data_to_read() const noexcept { return _data_to_read; }

+  /**
+   * Makes the compiler think that the result is used – consequently, the optimizer cannot optimize the value away.
+   *
+   * @param result Value that should not be optimized away.
+   */
+  template<typename T>
+  inline void pretend_to_use(T& result) const noexcept
+  {
+#ifdef __clang__
+    asm volatile("" : "+r,m"(result) : : "memory");
+#else
+    asm volatile("" : "+m,r"(result) : : "memory");
+#endif
+  }
+
 private:
   /// Indices, defining the order in which the memory chunk is accessed.
   std::vector<std::uint64_t> _indices;
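
Reviewer note on `pretend_to_use`: the empty inline-assembly statement declares the value as both read and written, so the optimizer has to materialize it and cannot discard the computation that produced it. The following is a minimal standalone sketch of the same idiom as a free function, assuming GCC or Clang (the inline-assembly syntax is not available on MSVC). The function `sum_up_to` is a hypothetical stand-in for the loops in the sampling examples.

```cpp
#include <cstdint>

/// Free-function variant of the helper, for illustration only.
/// The empty asm statement emits no instructions; it only tells the compiler
/// that `value` may be read and modified and that memory may be touched.
template<typename T>
inline void pretend_to_use(T& value) noexcept
{
#ifdef __clang__
  asm volatile("" : "+r,m"(value) : : "memory");
#else
  asm volatile("" : "+m,r"(value) : : "memory");
#endif
}

/// Hypothetical workload: sums 1..limit. In a benchmark the result would
/// usually be discarded, which is exactly when the loop could be removed.
inline std::uint64_t sum_up_to(const std::uint64_t limit)
{
  auto sum = std::uint64_t{ 0U };
  for (auto i = std::uint64_t{ 1U }; i <= limit; ++i) {
    sum += i;
  }

  /// Keep `sum` (and therefore the loop that computes it) alive.
  pretend_to_use(sum);
  return sum;
}
```

The helper does not change the value at runtime, so results stay correct while the measured work is preserved.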

examples/sampling/branch.cpp

Lines changed: 3 additions & 5 deletions
@@ -52,11 +52,9 @@ main()
   for (auto index = 0U; index < benchmark.size(); ++index) {
     value += branchy_function(benchmark[index]);
   }
-  asm volatile(""
-               : "+r,m"(value)
-               :
-               : "memory"); /// We do not want the compiler to optimize away
-                            /// this unused value.
+
+  /// We do not want the compiler to optimize away this (otherwise) unused value (and consequently the loop above).
+  benchmark.pretend_to_use(value);

   /// Stop sampling.
   sampler.stop();

examples/sampling/context_switch.cpp

Lines changed: 3 additions & 5 deletions
@@ -36,11 +36,9 @@ main()
   for (auto index = 0U; index < benchmark.size(); ++index) {
     value += benchmark[index].value;
   }
-  asm volatile(""
-               : "+r,m"(value)
-               :
-               : "memory"); /// We do not want the compiler to optimize away
-                            /// this unused value.
+
+  /// We do not want the compiler to optimize away this (otherwise) unused value (and consequently the loop above).
+  benchmark.pretend_to_use(value);

   /// Stop sampling.
   sampler.stop();

examples/sampling/counter.cpp

Lines changed: 3 additions & 5 deletions
@@ -42,11 +42,9 @@ main()
   for (auto index = 0U; index < benchmark.size(); ++index) {
     value += benchmark[index].value;
   }
-  asm volatile(""
-               : "+r,m"(value)
-               :
-               : "memory"); /// We do not want the compiler to optimize away
-                            /// this unused value.
+
+  /// We do not want the compiler to optimize away this (otherwise) unused value (and consequently the loop above).
+  benchmark.pretend_to_use(value);

   /// Stop sampling.
   sampler.stop();

examples/sampling/flame_graph.cpp

Lines changed: 3 additions & 5 deletions
@@ -36,11 +36,9 @@ main()
   for (auto index = 0U; index < benchmark.size(); ++index) {
     value += benchmark[index].value;
   }
-  asm volatile(""
-               : "+r,m"(value)
-               :
-               : "memory"); /// We do not want the compiler to optimize away
-                            /// this unused value.
+
+  /// We do not want the compiler to optimize away this (otherwise) unused value (and consequently the loop above).
+  benchmark.pretend_to_use(value);

   /// Stop sampling.
   sampler.stop();
