Commit 7f9019b

The Profiler Performance Engineering
1 parent 83bb335 commit 7f9019b

6 files changed: +197 −1 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion

@@ -1,3 +1,4 @@
 .hugo_build.lock
 public/
-resources/
+resources/
+.idea
Lines changed: 6 additions & 0 deletions

@@ -0,0 +1,6 @@
+---
+title: Diego Lovison
+photo: ''
+twitter: diegolovison
+---
+Senior Software Engineer - Application Services Performance Team
Binary image added (17.3 KB)
Binary image added (24.9 KB)
Lines changed: 189 additions & 0 deletions

@@ -0,0 +1,189 @@
---
title: "The Profiler Performance Engineering"
date: 2025-12-04T00:00:00Z
categories: ['performance', 'methodology']
summary: 'Learn how to properly benchmark an application'
image: 'logo.png'
related: ['']
authors:
- Diego Lovison
---

Modern software systems are increasingly complex, and ensuring their performance under real-world conditions is critical to delivering reliable and scalable applications. Traditional performance testing often focuses on high-level metrics such as average response time or throughput. While these metrics are useful, they can obscure deeper system inefficiencies and bottlenecks. To uncover these hidden issues, a more granular and methodical approach is required: one that examines the behavior of individual system components under stress.

This document introduces a performance engineering workflow that integrates profiling techniques with the https://www.brendangregg.com/usemethod.html[USE Method] (Utilization, Saturation, and Errors) to diagnose and resolve performance issues. By combining performance testing tools like Hyperfoil with low-level profilers such as https://github.com/async-profiler/async-profiler[async-profiler] and https://man7.org/linux/man-pages/man1/perf-stat.1.html[perf], we demonstrate how to identify CPU stalls, cache misses, and poor memory access patterns. Through a real-world benchmarking scenario, we show how profiling data can guide code optimizations and system configuration changes that lead to measurable improvements in Instructions Per Cycle (IPC) and overall system responsiveness.

## Software Development Life Cycle

A *software developer* implements features based on defined requirements, such as creating multiple endpoints to solve a specific business problem. Once development is complete, the *performance engineering* team gathers SLAs from stakeholders and designs performance tests that reflect real business use cases. These tests typically measure metrics like average response time. For each release that affects the business logic, the performance tests are rerun to detect any regressions. If a regression is found, the team receives feedback to address it.

There is nothing wrong with this approach, but we can go even further.

### Personas

*Software Developer*: A professional who designs, builds, tests, and maintains software applications or systems.

*Performance Engineering*: Ensures that a software system meets performance requirements under expected workloads. This involves creating and maintaining performance tests, using tools like Hyperfoil and web-based scenarios, to simulate real-world usage. The results provide valuable feedback to the team. If the system passes these tests, the product is considered ready for General Availability (GA).

*Profiler Performance Engineering*: Analyzes performance test results by profiling the source code to uncover system bottlenecks. The process typically begins by identifying which resource (CPU, memory, disk, or network) the team has chosen to stress, guiding the analysis toward the root cause of any performance degradation.

### Java Developer Persona Belt

* Software Developer: Eclipse IDE, IBM Semeru JDK
* Performance Engineering: Hyperfoil and a web-based application
* Profiler Performance Engineering: async-profiler, jfrconv, perf, sar

## The USE Method

According to Brendan Gregg, the **U**tilization, **S**aturation, and **E**rrors (USE) Method is a methodology for analyzing the performance of any system. It directs the construction of a checklist, which for server analysis can be used for quickly identifying resource bottlenecks or errors. It begins by posing questions, and then seeks answers, instead of beginning with given metrics (partial answers) and trying to work backwards.

### Terminology definitions

The terminology definitions are:

* *resource*: all physical server functional components (CPUs, disks, busses, ...)
* *utilization*: the average time that the resource was busy servicing work
* *saturation*: the degree to which the resource has extra work which it can't service, often queued
* *errors*: the count of error events

The metrics are usually expressed in the following terms:

* *utilization*: as a percent over a time interval, e.g., "one disk is running at 90% utilization".
* *saturation*: as a queue length, e.g., "the CPUs have an average run queue length of four".
* *errors*: scalar counts, e.g., "this network interface has had fifty late collisions".

## The benchmark

The benchmark scenario consists of three main components: a load generator, a System Under Test (SUT) running on an application server, and a database. The load generator initiates requests to various SUT endpoints, which in turn perform operations such as inserts, updates, and selects on the database. The benchmark is divided into two phases: warmup and steady state. During the warmup phase, the system gradually ramps up the number of requests over a defined period, aiming for a smooth transition into the steady state. This controlled increase helps stabilize the system before reaching the target injection rate for sustained testing.
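
To make the two phases concrete, here is a minimal, hypothetical sketch of an open-model injector. Hyperfoil handles this in practice; the rates, durations, and the `fireRequest` placeholder below are illustrative assumptions, not the actual benchmark configuration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative open-model injector: ramps linearly to a target rate
// during warmup, then holds that rate for the steady-state phase.
public class RampingLoadGenerator {

    public static void main(String[] args) throws InterruptedException {
        final int targetRps = 500;     // steady-state injection rate (assumed)
        final int warmupSeconds = 120; // ramp duration (assumed)
        final int steadySeconds = 700; // matches the 700s measurement window

        ScheduledExecutorService pool = Executors.newScheduledThreadPool(4);

        for (int second = 0; second < warmupSeconds + steadySeconds; second++) {
            // Linear ramp during warmup, constant rate afterwards.
            int rps = second < warmupSeconds
                    ? (int) ((second + 1) / (double) warmupSeconds * targetRps)
                    : targetRps;
            long gapNanos = TimeUnit.SECONDS.toNanos(1) / Math.max(rps, 1);
            for (int i = 0; i < rps; i++) {
                // Schedule each request at a fixed offset within the second,
                // so the rate does not depend on SUT response time.
                pool.schedule(RampingLoadGenerator::fireRequest,
                        i * gapNanos, TimeUnit.NANOSECONDS);
            }
            Thread.sleep(1000); // coarse pacing; a real tool corrects for drift
        }
        pool.shutdown();
    }

    private static void fireRequest() {
        // Placeholder for an HTTP call against a SUT endpoint.
    }
}
```

Scheduling requests at fixed offsets (an open model) keeps the injection rate independent of the SUT's response time, which is what lets the warmup ramp land smoothly on the steady-state target.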

## Hands-on

In this example, we'll examine a real-world scenario that occurred while benchmarking an application.

### SUT CPU analysis

We can start by looking at the `perf stat` output for the SUT's PID (typically collected by attaching to the running process, e.g. `perf stat -p <pid>`). `perf stat` is a powerful Linux command-line tool that gives you high-level statistics about a program's performance. It uses hardware performance counters in your CPU to measure things like instructions executed, CPU cycles, cache misses, and branch mispredictions. It's a great first step to understanding your application's behavior at a low level.

The result for the performance test is the following.

```
Performance counter stats for process id '3724985':

    3,707,235.65 msec task-clock    #    5.296 CPUs utilized
```

This metric indicates that about 5.3 CPU cores are being utilized. For this test, we have a constraint of only 15 CPU cores available. Therefore, 5.3 ÷ 15 is approximately 35%, meaning the SUT's CPU is busy roughly 35% of the time. This suggests that the CPU is not a highly utilized resource, so we could experiment by increasing the loader's injection rate to raise the system load and observe the impact. However, this is not guaranteed to succeed, as other bottlenecks might limit the outcome. In our case, the loader can sustain that increase in the injection rate, and now the perf stat output for the SUT is:

```
Performance counter stats for process id '3854032':

      8,798,820.34 msec task-clock        #   12.570 CPUs utilized
        72,884,739      context-switches  #    8.283 K/sec
        30,504,245      cpu-migrations    #    3.467 K/sec
           192,855      page-faults       #   21.918 /sec
20,804,314,839,811      cycles            #    2.364 GHz
13,372,592,699,670      instructions      #    0.64  insn per cycle
 2,535,826,054,792      branches          #  288.201 M/sec
   148,654,216,727      branch-misses     #    5.86% of all branches

     700.009883946 seconds time elapsed
```

Our CPU resource now shows roughly 84% utilization (12.57 ÷ 15 CPUs) over a 700-second interval, with the same error count observed when CPU utilization was previously around 35%. This metric provides a strong starting point for deeper investigation. The next metric that we can analyze is Instructions Per Cycle (IPC): 13,372,592,699,670 instructions ÷ 20,804,314,839,811 cycles ≈ 0.64. This is a low value for IPC; it means that the CPU is stalled and not computing.

The CPU is https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html[stalled], not "stressed." It's spending most of its time waiting for data instead of doing useful work. To investigate the root cause, we can use async-profiler and monitor the cache-misses event (for example, with something like `asprof -e cache-misses -d 60 -f cache-misses.html <pid>`; the duration and output file are up to you). The output of the profiler can be a flame graph like the following.

image::cache-miss-event.png[Interaction]

#### The Analysis: Visualizing the Bottleneck

Upon analyzing the `cache-misses` flame graph, a clear pattern emerges. The function triggering the most CPU stalls is `java.util.HashMap$KeyIterator.next`.

When we correlate this visual data with the source code, we see that the application is iterating over a large `HashSet` (which uses a `HashMap` internally). While this appears to be a standard operation in Java, the hardware reality is quite different.
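
While the post does not reproduce the application's source, the shape of the hot path implied by the flame graph looks roughly like this (the class and method names are invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical shape of the hot path: repeated linear scans over a
// large HashSet, where each iteration step chases a HashMap node
// pointer and then compares strings.
public class HotLoopShape {

    static boolean containsMatch(Set<String> largeSet, String needle) {
        for (String candidate : largeSet) { // HashMap$KeyIterator.next
            if (candidate.equals(needle)) { // String.equals shows up in cycles
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) {
            set.add("user-" + i);
        }
        System.out.println(containsMatch(set, "user-999999"));
    }
}
```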
#### "Pointer Chasing"
145+
146+
The performance penalty here stems from **memory layout** and **data dependency**.
147+
148+
Iterating over a `HashMap` is effectively a **Pointer Chasing** operation. Unlike an `ArrayList`, where elements are
149+
stored contiguously in memory (allowing the CPU to calculate the next address mathematically), a `HashMap` stores
150+
data in scattered nodes. To move from one element to the next, the CPU must:
151+
152+
1. **Fetch the current node** from memory.
153+
2. **Read the pointer** inside that node to find the address of the next node.
154+
3. **Fetch the next node** from that new address.
155+
156+
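
To see the difference on real hardware, here is a minimal, self-contained timing sketch. The collection size, round count, and naive timing are illustrative assumptions rather than the original benchmark; a rigorous comparison would use JMH.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Contrasts contiguous iteration (ArrayList) with pointer-chasing
// iteration (HashSet, whose elements live in scattered HashMap nodes).
public class IterationLayoutDemo {

    public static void main(String[] args) {
        final int n = 5_000_000; // illustrative size
        List<Integer> list = new ArrayList<>(n);
        Set<Integer> set = new HashSet<>(n);
        for (int i = 0; i < n; i++) {
            list.add(i);
            set.add(i);
        }

        // Repeat a few rounds so the JIT warms up before the last results.
        for (int round = 0; round < 5; round++) {
            time("ArrayList (contiguous)", list);
            time("HashSet (pointer chasing)", set);
        }
    }

    private static void time(String label, Iterable<Integer> data) {
        long start = System.nanoTime();
        long sum = 0;
        for (Integer value : data) { // for HashSet, each step may stall on a cache miss
            sum += value;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%-26s sum=%d, %d ms%n", label, sum, elapsedMs);
    }
}
```

On typical hardware the `HashSet` pass is noticeably slower despite performing the same additions, because each `KeyIterator.next` call depends on a load from the previous node.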

#### The Consequence: Breaking Instruction Level Parallelism

This structure creates a critical **data dependency**. The CPU *cannot* know *where* to look for the next item until the current item has been fully retrieved.

This dependency chain breaks the CPU's ability to use **Instruction Level Parallelism (ILP)**. Modern CPUs attempt to "speculatively load" data by fetching future elements while processing current ones. However, because the memory addresses in a `HashMap` are random and dependent on the previous node, the hardware prefetcher is rendered useless.

Instead of processing data in parallel, the CPU pipeline is forced to **stall** completely between every element, waiting hundreds of cycles for data to arrive from main RAM. This explains the high cache-miss rate and the low IPC (Instructions Per Cycle) observed in our `perf` stats.
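
The dependency can also be shown structurally. In this minimal sketch, the `Node` class is a stand-in for `HashMap`'s internal nodes (illustrative, not actual JDK code):

```java
// Stand-in for HashMap's internal nodes (illustrative, not JDK code).
final class Node {
    final int value;
    final Node next;

    Node(int value, Node next) {
        this.value = value;
        this.next = next;
    }
}

public class DependencyChainDemo {

    // Serial dependency: loading n.next cannot even start until n itself
    // has arrived from memory, so every element risks a full-latency stall.
    static long sumChain(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) {
            sum += n.value;
        }
        return sum;
    }

    // Computable addresses: values[i] lives at base + i * 4, so the
    // hardware prefetcher can stream future elements while the CPU
    // is still adding the current ones.
    static long sumArray(int[] values) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        Node head = null;
        int[] values = new int[1_000];
        for (int i = 0; i < values.length; i++) {
            head = new Node(i, head);
            values[i] = i;
        }
        System.out.println(sumChain(head) + " == " + sumArray(values));
    }
}
```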

#### Validating the Hypothesis: Correlating Cycles with Misses

However, observing a high number of cache misses is not enough to declare a verdict. A function could generate millions of cache misses but execute so rarely that it impacts only 1% of the total runtime. To be certain this is our bottleneck, we must correlate the "misses" (the event) with the "cycles" (the time). The `cycles` profile can be collected the same way as before, simply swapping the async-profiler event (e.g. `-e cycles`).

image::cycles.png[Interaction]

We compared the `cache-misses` flame graph against the `cycles` one, and the correlation was:

* In the `cycles` profile: the `HashMap` iteration, `Integer.intValue`, and `String.equals` logic dominated the graph, accounting for the vast majority of CPU time.
* In the `cache-misses` profile: these exact same methods appeared as the widest bars.

This overlap confirms the diagnosis: the application is not just "busy"; it is stalled. The high CPU usage seen in our original `top` output wasn't due to complex calculation, but rather to the CPU burning cycles while waiting for data to arrive from memory to satisfy the `HashMap`'s pointer-chasing requirements.
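
This kind of diagnosis is what ultimately guides an optimization. As one hypothetical direction (not necessarily the fix applied in this benchmark): if a hot loop repeatedly iterates a rarely changing `HashSet`, snapshotting it into a contiguous array restores sequential, prefetcher-friendly access.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical mitigation: keep a contiguous snapshot for the hot loop
// instead of chasing HashMap node pointers on every pass.
public class SnapshotIteration {

    private final Set<Integer> ids = new HashSet<>();
    private int[] snapshot = new int[0];

    // Rebuilds on every mutation for simplicity; a real implementation
    // would rebuild lazily or in batches to amortize the O(n) copy.
    void add(int id) {
        if (ids.add(id)) {
            snapshot = ids.stream().mapToInt(Integer::intValue).toArray();
        }
    }

    // Hot path: sequential array scan, friendly to the hardware prefetcher.
    long sumAll() {
        long sum = 0;
        for (int id : snapshot) {
            sum += id;
        }
        return sum;
    }

    public static void main(String[] args) {
        SnapshotIteration s = new SnapshotIteration();
        for (int i = 0; i < 1_000; i++) {
            s.add(i);
        }
        System.out.println(s.sumAll()); // 499500
    }
}
```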

## Acknowledgment

All this work was guided and supervised by Francesco Nigro.

Binary image added (257 KB)
