---
title: "The Profiler Performance Engineering"
date: 2025-12-04T00:00:00Z
categories: ['performance', 'methodology']
summary: 'Learn how to properly benchmark an application'
image: 'logo.png'
related: ['']
authors:
  - Diego Lovison
---

Modern software systems are increasingly complex, and ensuring their performance under real-world conditions is critical
to delivering reliable and scalable applications. Traditional performance testing often focuses on high-level metrics such
as average response time or throughput. While these metrics are useful, they can obscure deeper system inefficiencies and
bottlenecks. To uncover these hidden issues, a more granular and methodical approach is required: one that examines the
behavior of individual system components under stress.

This document introduces a performance engineering workflow that integrates profiling techniques with the
https://www.brendangregg.com/usemethod.html[USE Method] (Utilization, Saturation, and Errors) to diagnose and resolve
performance issues. By combining performance testing tools like Hyperfoil with low-level profilers such as
https://github.com/async-profiler/async-profiler[async-profiler] and https://man7.org/linux/man-pages/man1/perf-stat.1.html[perf],
we demonstrate how to identify CPU stalls, cache misses, and poor memory access patterns. Through a real-world benchmarking scenario,
we show how profiling data can guide code optimizations and system configuration changes that lead to measurable
improvements in Instructions Per Cycle (IPC) and overall system responsiveness.

## Software Development Life Cycle

A *software developer* implements features based on defined requirements, such as creating multiple endpoints to solve a
specific business problem. Once development is complete, the *performance engineering* team gathers SLAs from stakeholders
and designs performance tests that reflect real business use cases. These tests typically measure metrics like average
response time. For each release that affects the business logic, the performance tests are rerun to detect any regressions.
If a regression is found, the team receives feedback to address it.

There is nothing wrong with this approach, but we can go even further.

### Personas

*Software Developer*: A professional who designs, builds, tests, and maintains software applications or systems.

*Performance Engineering*: Ensures that a software system meets performance requirements under expected workloads. This
involves creating and maintaining performance tests, using tools like Hyperfoil and web-based scenarios, to simulate
real-world usage. The results provide valuable feedback to the team. If the system passes these tests, the product is
considered ready for General Availability (GA).

*Profiler Performance Engineering*: Analyzes performance test results by profiling the source code to uncover system
bottlenecks. The process typically begins by identifying which resource (CPU, memory, disk, or network) the team has
chosen to stress, guiding the analysis toward the root cause of any performance degradation.

### Java Developer Persona Belt

* Software Developer: Eclipse IDE, IBM Semeru JDK
* Performance Engineering: Hyperfoil and a web-based application
* Profiler Performance Engineering: async-profiler, jfrconv, perf, sar

## The USE Method

According to Brendan Gregg, the **U**tilization, **S**aturation, and **E**rrors (USE) Method is a methodology for analyzing the
performance of any system. It directs the construction of a checklist, which for server analysis can be used for
quickly identifying resource bottlenecks or errors. It begins by posing questions, and then seeks answers, instead of
beginning with given metrics (partial answers) and trying to work backwards.

### Terminology definitions

The terminology is defined as follows:

* *resource*: all physical server functional components (CPUs, disks, busses, ...)
* *utilization*: the average time that the resource was busy servicing work
* *saturation*: the degree to which the resource has extra work which it can't service, often queued
* *errors*: the count of error events

The metrics are usually expressed in the following terms:

* *utilization*: as a percent over a time interval, e.g., "one disk is running at 90% utilization"
* *saturation*: as a queue length, e.g., "the CPUs have an average run queue length of four"
* *errors*: scalar counts, e.g., "this network interface has had fifty late collisions"
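
A practical way to turn these definitions into a checklist is with the tools from the persona belt. A minimal sketch
using `sar` (the 5-second interval and 3-sample count are arbitrary):

```
# Utilization: CPU busy time as a percentage (100 minus %idle)
sar -u 5 3
# Saturation: run-queue length (runq-sz) and load averages
sar -q 5 3
# Errors: per-interface network error counters (rxerr/s, txerr/s)
sar -n EDEV 5 3
```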

## The benchmark

The benchmark scenario consists of three main components: a load generator, a System Under Test (SUT) running on an
application server, and a database. The load generator initiates requests to various SUT endpoints, which in turn
perform operations such as inserts, updates, and selects on the database. The benchmark is divided into two phases:
warmup and steady state. During the warmup phase, the system gradually ramps up the number of requests over a defined
period, aiming for a smooth transition into the steady state. This controlled increase helps stabilize the system
before reaching the target injection rate for sustained testing.
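
A minimal sketch of what the two phases can look like in a Hyperfoil benchmark, assuming its `rampRate` and
`constantRate` phase types; the host, endpoint, rates, and durations below are all hypothetical:

```
name: sut-benchmark                  # hypothetical benchmark name
http:
  host: http://sut.example.com:8080  # hypothetical SUT address
phases:
- warmup:                            # gradual ramp into steady state
    rampRate:
      initialUsersPerSec: 10
      targetUsersPerSec: 500
      duration: 120s
      scenario: &load
      - fetchItems:
        - httpRequest:
            GET: /api/items          # hypothetical endpoint
- steadyState:                       # sustained target injection rate
    constantRate:
      usersPerSec: 500
      duration: 700s
      startAfter: warmup
      scenario: *load
```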

## Hands-on

In this example, we'll examine a real-world scenario that occurred while benchmarking an application.

### SUT CPU analysis

We can start by looking at the `perf stat` output for the SUT's PID. `perf stat` is a powerful Linux command-line tool that
gives you high-level statistics about a program's performance. It uses hardware performance counters in your CPU
to measure things like instructions executed, CPU cycles, cache misses, and branch mispredictions.
It's a great first step to understanding your application's behavior at a low level.
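
A sketch of how such a measurement can be taken, attaching to the already-running SUT process for a fixed window (the
duration is illustrative):

```
perf stat -p 3724985 -- sleep 700
```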

The result of the performance test is the following.

```
Performance counter stats for process id '3724985':

    3,707,235.65 msec task-clock                #    5.296 CPUs utilized
```

This metric indicates that 5.296 CPU cores are being utilized. For this test, we have a constraint of only 15 CPU cores
available. Therefore, 5.296 ÷ 15 is approximately 35%, meaning the CPU is busy about 35% of the time. This suggests that
the SUT's CPU is not a highly utilized resource, so we could experiment with increasing the load generator's injection
rate to raise the system load and observe the impact. However, this is not guaranteed to succeed, as other bottlenecks
might limit the outcome. In our case, the loader can sustain that increase in the injection rate, and the `perf stat`
output for the SUT becomes:

```
Performance counter stats for process id '3854032':

      8,798,820.34 msec task-clock           #   12.570 CPUs utilized
        72,884,739      context-switches     #    8.283 K/sec
        30,504,245      cpu-migrations       #    3.467 K/sec
           192,855      page-faults          #   21.918 /sec
20,804,314,839,811      cycles               #    2.364 GHz
13,372,592,699,670      instructions         #    0.64  insn per cycle
 2,535,826,054,792      branches             #  288.201 M/sec
   148,654,216,727      branch-misses        #    5.86% of all branches

     700.009883946 seconds time elapsed
```

Our CPU resource is now about 84% utilized (12.57 of the 15 available cores) over a 700-second interval, with no change
in errors compared with the earlier run at roughly 35% utilization. This metric provides a strong starting point for
deeper investigation. The next metric we can analyze is Instructions Per Cycle (IPC), which is simply instructions
divided by cycles: 13,372,592,699,670 ÷ 20,804,314,839,811 ≈ 0.64, matching the `insn per cycle` column above. An IPC
of 0.64 is low; it means the CPU is largely stalled rather than computing.

The CPU is https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html[stalled],
not "stressed." It's spending most of its time waiting for data instead of doing useful work. To investigate
the root cause, we can use async-profiler and monitor the cache-misses event. The output of the profiler can be a flame
graph like the following.
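
A sketch of how the profile can be captured (the duration and output file name are arbitrary; `asprof` is the launcher
in async-profiler 3.x, while older releases ship an equivalent `profiler.sh` script):

```
asprof -e cache-misses -d 60 -f cache-misses.html 3854032
```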

image::cache-miss-event.png[Cache-miss flame graph]

#### The Analysis: Visualizing the Bottleneck

Upon analyzing the `cache-misses` flame graph, a clear pattern emerges. The function triggering the most CPU stalls is
`java.util.HashMap$KeyIterator.next`.

When we correlate this visual data with the source code, we see that the application is iterating over a
large `HashSet` (which uses a `HashMap` internally). While this appears to be a standard operation in Java, the
hardware reality is quite different.

#### "Pointer Chasing"

The performance penalty here stems from **memory layout** and **data dependency**.

Iterating over a `HashMap` is effectively a **Pointer Chasing** operation. Unlike an `ArrayList`, where elements are
stored contiguously in memory (allowing the CPU to calculate the next address mathematically), a `HashMap` stores
data in scattered nodes. To move from one element to the next, the CPU must:

1. **Fetch the current node** from memory.
2. **Read the pointer** inside that node to find the address of the next node.
3. **Fetch the next node** from that new address.
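
To make the contrast concrete, here is a minimal, self-contained sketch (the class name and sizes are illustrative, and
a naive loop like this is not a rigorous benchmark; a harness such as JMH would be the proper tool). Both loops
dereference boxed `Integer` objects, but the list's own traversal is a simple index increment, while the set's traversal
must walk scattered `HashMap` nodes:

```
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IterationLayout {
    public static void main(String[] args) {
        int n = 1_000_000;
        List<Integer> list = new ArrayList<>(n);
        Set<Integer> set = new HashSet<>(2 * n);
        for (int i = 0; i < n; i++) {
            list.add(i);
            set.add(i);
        }

        long sum = 0;
        // Next slot is base + index: the hardware prefetcher can run ahead.
        for (Integer v : list) sum += v;
        // Next address is stored inside the current node: pointer chasing.
        for (Integer v : set) sum += v;
        System.out.println(sum);
    }
}
```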

#### The Consequence: Breaking Instruction Level Parallelism

This structure creates a critical **Data Dependency**. The CPU *cannot* know *where* to look for the next item
until the current item has been fully retrieved.

This dependency chain breaks the CPU's ability to use **Instruction Level Parallelism (ILP)**.
Modern CPUs attempt to "speculatively load" data by fetching future elements while processing current ones.
However, because the memory addresses in a `HashMap` are effectively random and dependent on the previous node,
the hardware prefetcher is rendered useless.

Instead of processing data in parallel, the CPU pipeline is forced to **stall** between every element,
waiting hundreds of cycles for data to arrive from main memory. This explains the high cache-miss rate and the
low IPC (Instructions Per Cycle) observed in our `perf` stats.

#### Validating the Hypothesis: Correlating Cycles with Misses

However, observing a high number of cache misses is not enough to declare a verdict. A function could generate
millions of cache misses but execute so rarely that it impacts only 1% of the total runtime.
To be certain this is our bottleneck, we must correlate the "Misses" (the event) with the "Cycles" (the time).
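
The second profile can be captured the same way by switching the sampled event (again a sketch; duration and output
name are arbitrary):

```
asprof -e cycles -d 60 -f cycles.html 3854032
```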

image::cycles.png[Cycles flame graph]

We compared the `cache-misses` flame graph against the `cycles` flame graph, and the correlation was clear:

* In the cycles profile: the `HashMap` iteration, `Integer.intValue`, and `String.equals` logic dominated the graph, accounting for the vast majority of CPU time.
* In the cache-misses profile: these exact same methods appeared as the widest bars.

This overlap confirms the diagnosis: the application is not just "busy"; it is stalled. The high CPU utilization
measured earlier wasn't due to complex calculation, but rather the CPU burning cycles while waiting for data
to arrive from memory to satisfy the `HashMap`'s pointer-chasing access pattern.
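
This kind of diagnosis is what ultimately guides the optimization. One possible mitigation, shown as a sketch below
(assuming the set is iterated far more often than it is mutated, and that its elements are really `int`s), is to
snapshot the `HashSet` into a primitive array once and iterate over the contiguous array in the hot path:

```
import java.util.HashSet;
import java.util.Set;

public class SnapshotIteration {
    public static void main(String[] args) {
        Set<Integer> hot = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) {
            hot.add(i);
        }

        // Pay the pointer-chasing cost once, not on every pass.
        int[] snapshot = hot.stream().mapToInt(Integer::intValue).toArray();

        long sum = 0;
        for (int v : snapshot) sum += v; // contiguous, prefetcher-friendly
        System.out.println(sum);
    }
}
```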

## Acknowledgments

All this work was guided and supervised by Francesco Nigro.