---
title: "Shift Left on Performance"
date: 2025-03-04T00:00:00Z
categories: ['performance', 'methodology', 'CI/CD']
summary: 'To speed up delivery of software using CD pipelines, performance testing needs to be performed continually, with the correct bounds, to add value to the software development process'
image: 'shift-left-banner.png'
related: ['']
authors:
- John O'Hara
---
= Shift Left on Performance
:icons: font

Performance testing is often a major bottleneck in the productization of software. Performance tests are normally run late in the development lifecycle, require a lot of manual intervention, and regressions are often only detected after a long bake time.

Shifting from versioned products to continually delivered services changes the risk profile of performance regressions and requires a paradigm shift in how performance testing is managed.

== Problem Statement

How can performance engineering teams enable Eng / QE / SRE to integrate Performance Testing into their workflows, to reduce the risk that performance regressions propagate through to product releases and production services?

== Typical Product Workflow

A typical "boxed product"footnote:[A product that is shipped to customers, either physically or electronically, typically with a multi-month/year release cadence] productization workflow can be represented by:

image::typical_workflow.png[Typical Workflow,,,float="right",align="center"]

The key issues with this type of workflow are:

* There is a *break in continuity* between the `tag build` and `release` stages of the CI Build pipeline
* Development, build and performance testing are performed by different teams, each passing *async* messages between teams
* The feedback loop to developers is *manual* and *slow*
* A lot of *manual analysis* is performed, often with ad-hoc data capture and reporting

This scenario generally develops due to a number of factors:

* Dedicated performance environments are costly and difficult to set up and manage
* Performance Analysis (including system performance monitoring and analysis) is generally a specialized role, concentrated in small teams
* Maintaining reliable, accurate benchmarks is often a significant time sink

'''

=== What's the problem?

image::shift-left-chart.png[]

The further into the development cycle performance testing occurs, the more costly it is to fix performance bugs.footnote:[https://www.nist.gov/system/files/documents/director/planning/report02-3.pdf]

Over the years, methodologies have been developed to allow functional tests to be performed earlier in the development lifecycle, reducing the time between functional regressions being introduced and discovered.

This has the following benefits:

* Push testing earlier into the development cycle
* Discover quality issues more quickly
* Reduce the cost to fix
* Reduce test & deploy cycles

Functional issues are typically easier to fix than performance issues because they involve specific, reproducible errors in the software's behavior; therefore, performance testing should be "Shifted Left" in the same way that functional testing has been.

'''

=== What does it mean to Shift Left?

In the traditional Waterfall model for software development, shifting left means pushing tests earlier into the development cycle:

image::shift-left-waterfall.jpeg[]

source: https://insights.sei.cmu.edu/blog/four-types-of-shift-left-testing/

==== In the Agile world

For continually delivered services, "shifting left" includes an additional dimension:

image::shift-left-agile.jpeg[Agile Shift Left,,,float="right"]

source: https://insights.sei.cmu.edu/blog/four-types-of-shift-left-testing/

Not only do we want to include performance tests earlier in the dev/release cycle, we also want to ensure that the full suite of performance tests (or any proxy performance tests) captures performance regressions before multiple release cycles have occurred.

== Risks in the managed service world

Managed services change the risk profile associated with a software product:

* *Multiple, rapid dev cycles*: the time period between development and release is greatly reduced

* The probability of releasing a product with a performance regression is *increased*

* Performance regressions will affect *all* customers, *immediately*

'''

== Performance Shift-Left Workflow

In order to manage the changed risk profile of managed services compared to boxed products, a new methodology is required:

image::shift-workflow.png[Agile Shift Left,,,float="right"]

In a "Shifted-Left" model:

* *Code Repository Bots* allow performance engineers to initiate *Upstream Performance Tests* against open Pull Requests, returning comparative performance data to the workflow that engineers use in their day-to-day job
* *Integrated Performance Threshold* tests provide automated gating on acceptable levels of performance
* *Continual Performance Testing* allows for analyzing trends over time, as well as scaling, soak and chaos type testing, asynchronously from the CI/CD build pipeline
* *Automated Regression Detection* provides automated tooling for detecting catastrophic performance regressions related to a single commit, or creeping performance degradation over time

Continual analysis is performed by experienced engineers, but the process does not require manual intervention with each release.

Engineers are free to focus on implementing features without worrying about performance regressions. When regressions are detected, the information they need to identify the root cause is readily available, in a suitable format.

== Code Repository Bots

Code Repository Bots initiate performance tests against PRs. Their purpose is to allow engineers to decide whether or not to merge a PR, so the results need to be actionable by engineers. Profiling data should also be provided to allow engineers to understand what their changes are doing.

Engineers receive a report & analysis of the impact of their changes on key performance metrics.

Bots also allow automated capture of profiling data for the system under load, allowing engineers to *see* what their changes are doing under realistic scenarios.

* Triggered from the CI/CD pipeline
* Automatic / Manual
* Performance results reported in the PR
* Actionable data for engineers (results/profiles added to the PRs to keep all information co-located for each PR)
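
A minimal sketch of the reporting half of such a bot is shown below. It assumes hypothetical result files (`baseline-results.json`, `pr-results.json`), a simple percentage tolerance, and the standard GitHub REST endpoint for issue/PR comments; the file names, metric names and threshold are illustrative only, not part of any particular product's bot.

[source,python]
----
# Illustrative PR bot step: compare benchmark results against a baseline and
# post the comparison back to the pull request as a comment.
import json
import os
import requests

def compare(baseline: dict, candidate: dict, tolerance: float = 0.05) -> list:
    """Return one markdown table row per metric, flagging regressions beyond the tolerance.

    Assumes "higher is worse" metrics such as latency.
    """
    rows = []
    for metric, base in baseline.items():
        cand = candidate.get(metric)
        if cand is None:
            continue
        delta = (cand - base) / base
        flag = ":warning:" if delta > tolerance else ":white_check_mark:"
        rows.append(f"| {metric} | {base:.2f} | {cand:.2f} | {delta:+.1%} {flag} |")
    return rows

baseline = json.load(open("baseline-results.json"))   # results from the target branch
candidate = json.load(open("pr-results.json"))        # results from the PR build

comment = "\n".join(
    ["| Metric | Baseline | PR | Change |", "| --- | --- | --- | --- |"]
    + compare(baseline, candidate)
)

# Post the comparison as a PR comment so the data stays co-located with the change.
repo = os.environ["GITHUB_REPOSITORY"]   # e.g. "org/project"
pr_number = os.environ["PR_NUMBER"]      # hypothetical variable set by the pipeline
requests.post(
    f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    json={"body": comment},
    timeout=30,
)
----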

== Integrated Performance Thresholds

The aim of Integrated Performance Tests is to determine whether a release meets acceptable levels of performance with respect to customer expectations, not to capture changes over time. The results need to be automatically calculated and should provide a boolean Pass/Fail result.

* Pass/Fail criteria - as with functional tests, the performance should either be acceptable or not acceptable
* Fully automated - no manual intervention / analysis
* Focused on user experience
* Threshold based?
* Integrated with QE tools
* Portable tests
* Limits/thresholds defined by CPT
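
For example, a threshold gate can be as simple as the sketch below: read the measured metrics, compare them to fixed limits, and fail the pipeline stage on any breach. The metric names, file name and limits are assumptions for illustration, not recommended values.

[source,python]
----
# Illustrative integrated performance threshold gate: boolean pass/fail,
# no manual analysis, driven entirely by pre-agreed limits.
import json
import sys

# Limits agreed with the performance team; units noted per metric.
THRESHOLDS = {
    "p99_latency_ms": 250.0,   # upper bound
    "throughput_rps": 1200.0,  # lower bound
}

results = json.load(open("integration-perf-results.json"))

failures = []
if results["p99_latency_ms"] > THRESHOLDS["p99_latency_ms"]:
    failures.append(f"p99 latency {results['p99_latency_ms']}ms exceeds {THRESHOLDS['p99_latency_ms']}ms")
if results["throughput_rps"] < THRESHOLDS["throughput_rps"]:
    failures.append(f"throughput {results['throughput_rps']} rps is below {THRESHOLDS['throughput_rps']} rps")

if failures:
    print("Performance threshold check FAILED:")
    for failure in failures:
        print(f"  - {failure}")
    sys.exit(1)  # non-zero exit fails the CI stage, exactly like a failing functional test

print("Performance threshold check PASSED")
----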

== Continual Performance Testing

The aim of Continual Performance Testing is to perform larger-scale performance workloads that can take time to run.

These tests can include:

* Large scale end-to-end testing
* Soak tests
* Chaos testing
* Trend analysis
* Scale testing
* Automated tuning of the environment
* Detailed profiling and analysis work

== Automatic Change Detection

Automated tools allow detection of changes in key performance metrics over time. Tools such as Horreumfootnote:[https://horreum.hyperfoil.io/] can be configured to monitor key performance metrics for particular products, and can be integrated into existing workflows and tools to raise alerts or block build pipelines when a significant change is detected.

The key to incorporating automated tools into the CI/CD pipeline is the ability of the tools to integrate seamlessly into existing pipelines and provide accurate, actionable events.
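
As a rough illustration of the underlying idea (not how Horreum itself is implemented), change detection can be as simple as comparing the latest result against the recent history of the same metric; the window size and sensitivity below are arbitrary assumptions.

[source,python]
----
# Simplified change detection over a series of benchmark results: flag a run
# whose value deviates from the recent window by more than a few standard deviations.
from statistics import mean, stdev

def detect_change(history, latest, window=20, sigma=3.0):
    """Return True if `latest` is a significant change versus the recent window."""
    recent = history[-window:]
    if len(recent) < 5:
        return False               # not enough data points to judge
    mu, sd = mean(recent), stdev(recent)
    if sd == 0:
        return latest != mu
    return abs(latest - mu) > sigma * sd

# Example: mean request latency (ms) per nightly run, then a regressed run.
runs = [101.2, 99.8, 100.5, 102.1, 100.9, 99.5, 101.7, 100.2]
print(detect_change(runs, 100.8))  # False - within normal variation
print(detect_change(runs, 135.0))  # True  - raise an alert / block the pipeline
----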

== Continual Profiling & Monitoring

Not all performance issues will be caught during the development lifecycle. It is crucial that production systems capture sufficient performance-related data to allow performance issues to be identified in production. The data needs to be of sufficient resolution to support a root cause analysis during a post mortem, or to provide enough information to test for the performance issue in the CI/CD pipeline.

== Integration with Analytical Tools

In order to understand the performance characteristics of a service running in production, all of the performance metrics captured at different stages of testing (dev, CI/CD, production) need to be accessible to a performance engineer for analysis.

This requires the performance metrics to be co-located and available for analysis by tools, e.g. statistical analysis tools.

== Further Tooling to assist Product Teams

Other tools that can help product teams with performance-related issues include:

* *Performance Bisect*: perform an automated bisect on the source repository, running performance test(s) at each step to automatically identify the code merge that introduced the performance regression (see the sketch below)
* *Automated profiling analysis*: AI/ML models to automatically spot performance issues in profiling data
* *Proxy Metrics*: system metrics captured during functional testing that provide an indication that a performance/scale issue will manifest at runtime
* *Automatic tuning of service configuration*: using Hyper-Parameter Optimizationfootnote:[https://github.com/kruize/hpo] to automatically tune the configuration space of a service to optimize performance for a given target environment/workload
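
A performance bisect can be driven by `git bisect run` with a small helper script that builds the commit, runs a benchmark and maps the result onto git bisect's exit-code convention. The build command, benchmark script and threshold below are assumptions for illustration.

[source,python]
----
#!/usr/bin/env python3
# check_perf.py - illustrative helper for `git bisect run`.
# Exit codes follow git bisect's convention: 0 = good commit, 1 = bad (regressed),
# 125 = cannot test this commit (skip).
import subprocess
import sys

THRESHOLD_MS = 500.0  # acceptable mean latency for this (hypothetical) benchmark

build = subprocess.run(["make", "build"], capture_output=True)
if build.returncode != 0:
    sys.exit(125)     # commit does not build - tell git bisect to skip it

bench = subprocess.run(["./run-benchmark.sh"], capture_output=True, text=True)
if bench.returncode != 0:
    sys.exit(125)

mean_latency_ms = float(bench.stdout.strip())  # benchmark prints a single number
sys.exit(0 if mean_latency_ms <= THRESHOLD_MS else 1)
----

The script would be invoked with `git bisect start <bad-commit> <good-commit>` followed by `git bisect run ./check_perf.py`, letting git walk the history and run the performance test at each step automatically.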