Commit 2f0b7dc

feat: Add GitHub Actions docs and simplify chaos test workflow
Signed-off-by: XploY04 <[email protected]>
1 parent 7abea15 commit 2f0b7dc

4 files changed: +240 -8 lines changed

.github/README.md

Lines changed: 236 additions & 0 deletions
@@ -0,0 +1,236 @@
# GitHub Actions for CloudNativePG Chaos Testing

This directory contains GitHub Actions workflows and reusable composite actions for automated chaos testing of CloudNativePG clusters.

## Workflows

### `chaos-test-full.yml`

Comprehensive chaos testing workflow that validates PostgreSQL cluster resilience under failure conditions.

**What it does**:
- Provisions a Kind cluster using cnpg-playground
- Installs CloudNativePG operator and PostgreSQL cluster
- Deploys Litmus Chaos and Prometheus monitoring
- Runs Jepsen consistency tests with pod-delete chaos injection
- **Validates resilience** - fails the build if chaos tests don't pass
- Collects comprehensive artifacts including cluster state dumps on failure

**Triggers**:
- **Manual**: `workflow_dispatch` with configurable chaos duration (default: 300s)
- **Automatic**: Pull requests to `main` branch (skips documentation-only changes)
- **Scheduled**: Weekly on Sundays at 13:00 UTC
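
The three triggers above could be declared roughly as follows. This is a sketch, not the workflow's actual `on:` block: the input name `chaos_duration` comes from the workflow's `inputs.chaos_duration` expression, while the description text and the `paths-ignore` patterns are assumptions standing in for whatever the real file uses to skip documentation-only changes.

```yaml
on:
  workflow_dispatch:
    inputs:
      chaos_duration:
        description: Chaos duration in seconds   # assumed wording
        required: false
        default: "300"
  pull_request:
    branches: [main]
    paths-ignore:          # assumed patterns for documentation-only changes
      - "**.md"
      - "docs/**"
  schedule:
    - cron: "0 13 * * 0"   # 13:00 UTC every Sunday
```

Scheduled and pull-request runs carry no `inputs` context, which is why the workflow falls back with `${{ inputs.chaos_duration || '300' }}`.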

**Quality Gates**:
- Litmus chaos experiment must pass
- Jepsen consistency validation must pass (`:valid? true`)
- Workflow fails if either check fails
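
A minimal sketch of how both gates might be enforced in a single step. The `logs/jepsen-chaos-*` pattern and the `results/results.edn` path appear in the workflow's result-collection step; the location of `chaosresult.yaml` inside that directory and the step name are assumptions. The Litmus check relies on the `verdict` field that a ChaosResult reports under `status.experimentStatus`.

```yaml
- name: Enforce quality gates   # hypothetical step name
  shell: bash
  run: |
    set -euo pipefail
    # Newest results directory, same pattern as the collection step.
    RESULTS_DIR="$(ls -td logs/jepsen-chaos-* 2>/dev/null | head -1 || true)"
    if [ -z "$RESULTS_DIR" ]; then
      echo "No results directory found"
      exit 1
    fi

    # Gate 1: Jepsen must report a valid history.
    grep -qF ':valid? true' "$RESULTS_DIR/results/results.edn" || {
      echo "Jepsen verdict is not :valid? true"
      exit 1
    }

    # Gate 2: the Litmus ChaosResult verdict must be Pass.
    # Assumes the ChaosResult was saved as chaosresult.yaml in the results directory.
    grep -Eq 'verdict:[[:space:]]*"?Pass' "$RESULTS_DIR/chaosresult.yaml" || {
      echo "Litmus verdict is not Pass"
      exit 1
    }
```

A plain `grep` matches any `:valid? true` in the file, including nested checker results, so a stricter gate would parse the EDN and inspect only the top-level key.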

---

## Reusable Composite Actions

### `free-disk-space`

Removes unnecessary pre-installed software from GitHub runners to free up ~40 GB of disk space.

**What it removes**:
- .NET SDK (~15-20 GB)
- Android SDK (~12 GB)
- Haskell tools (~5-8 GB)
- Large tool caches (CodeQL, Go, Python, Ruby, Node)
- Unused browsers

**What it preserves**:
- Docker
- kubectl
- Kind
- Helm
- jq

**Usage**:
```yaml
- name: Free disk space
  uses: ./.github/actions/free-disk-space
```
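
For orientation, a composite action of this kind might look like the sketch below. The action's real contents are not shown in this commit, so everything here is an assumption: the directories are the locations where these toolchains are commonly found on GitHub-hosted Ubuntu runners, and only a subset of the listed removals is covered.

```yaml
# Hypothetical .github/actions/free-disk-space/action.yml
name: Free disk space
description: Remove large pre-installed toolchains from the runner
runs:
  using: composite
  steps:
    - name: Remove unused toolchains
      shell: bash
      run: |
        df -h /                                   # space before cleanup
        sudo rm -rf /usr/share/dotnet             # .NET SDK
        sudo rm -rf /usr/local/lib/android        # Android SDK
        sudo rm -rf /opt/ghc /usr/local/.ghcup    # Haskell tools
        sudo rm -rf /opt/hostedtoolcache/CodeQL   # CodeQL tool cache
        df -h /                                   # space after cleanup
```

Printing `df -h /` before and after makes the reclaimed space visible in the job log, which helps when runner images change.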

---

### `setup-tools`

Installs and upgrades chaos testing tools to their latest stable versions.

**Tools installed/upgraded**:
- kubectl (latest stable)
- Kind (latest release)
- Helm (latest via official installer)
- krew (kubectl plugin manager)
- kubectl-cnpg plugin (via krew)

**Usage**:
```yaml
- name: Setup chaos testing tools
  uses: ./.github/actions/setup-tools
```
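
The steps below sketch how part of this tool list could be installed; they are not taken from the action itself. They cover kubectl, Helm, and the cnpg plugin only, and they assume that krew is already installed under the default `~/.krew` location. Kind and the krew bootstrap are left out.

```yaml
- name: Install kubectl (latest stable)
  shell: bash
  run: |
    set -euo pipefail
    version="$(curl -fsSL https://dl.k8s.io/release/stable.txt)"
    curl -fsSLo kubectl "https://dl.k8s.io/release/${version}/bin/linux/amd64/kubectl"
    sudo install -m 0755 kubectl /usr/local/bin/kubectl

- name: Install Helm via the official installer script
  shell: bash
  run: |
    set -euo pipefail
    curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

- name: Install the cnpg plugin through krew
  shell: bash
  run: |
    set -euo pipefail
    # Assumes krew is already installed in its default location.
    export PATH="${KREW_ROOT:-$HOME/.krew}/bin:$PATH"
    kubectl krew install cnpg
```

Installing "latest" versions keeps the tests current but makes runs less reproducible; pinning versions is the usual alternative when flakiness matters more.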

---

### `setup-kind`

Creates a Kind cluster using the proven cnpg-playground configuration.

**Features**:
- Multi-node cluster with PostgreSQL-labeled nodes
- Configured for HA testing
- Proven configuration from cnpg-playground

**Inputs**:
- `region` (optional): Region name for the cluster (default: `eu`)

**Outputs**:
- `kubeconfig`: Path to kubeconfig file
- `cluster-name`: Name of the created cluster

**Usage**:
```yaml
- name: Create Kind cluster
  uses: ./.github/actions/setup-kind
  with:
    region: eu
```
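
To read the outputs listed above, the step needs an `id`. A minimal sketch, where the `id: kind` value and the follow-up step are illustrative additions rather than part of the original workflow:

```yaml
- name: Create Kind cluster
  id: kind                      # illustrative id, needed to reference outputs
  uses: ./.github/actions/setup-kind
  with:
    region: eu

- name: Inspect the cluster
  shell: bash
  env:
    KUBECONFIG: ${{ steps.kind.outputs.kubeconfig }}
  run: |
    echo "Cluster: ${{ steps.kind.outputs.cluster-name }}"
    kubectl get nodes --show-labels
```

Exporting the `kubeconfig` output as `KUBECONFIG` scopes it to the one step; setting it at job level would apply it to every later `kubectl` call.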

---

### `setup-cnpg`

Installs CloudNativePG operator and deploys a PostgreSQL cluster.

**What it does**:
1. Installs CNPG operator using `kubectl cnpg install generate` (recommended method)
2. Waits for operator deployment to be ready
3. Applies CNPG operator configuration
4. Waits for webhook to be fully initialized
5. Deploys PostgreSQL cluster
6. Waits for cluster to be ready with health checks

**Requirements**:
- `clusters/cnpg-config.yaml` - CNPG operator configuration
- `clusters/pg-eu-cluster.yaml` - PostgreSQL cluster definition

**Usage**:
```yaml
- name: Setup CloudNativePG
  uses: ./.github/actions/setup-cnpg
```
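
The numbered steps of this action map onto `kubectl` commands roughly as sketched below. This is an approximation rather than the action's source: it assumes the operator lands in the default `cnpg-system` namespace, that the cluster defined in `clusters/pg-eu-cluster.yaml` is named `pg-eu` in the current namespace, and that the timeouts shown are acceptable. The webhook wait (step 4) is not reproduced.

```yaml
- name: Install CNPG and deploy the cluster (sketch)
  shell: bash
  run: |
    set -euo pipefail
    # 1. Render the operator manifests with the cnpg plugin and apply them.
    kubectl cnpg install generate | kubectl apply --server-side -f -
    # 2. Wait for the operator deployment to roll out.
    kubectl rollout status deployment -n cnpg-system cnpg-controller-manager --timeout=300s
    # 3. Apply the operator configuration.
    kubectl apply -f clusters/cnpg-config.yaml
    # 5. Deploy the PostgreSQL cluster.
    kubectl apply -f clusters/pg-eu-cluster.yaml
    # 6. Block until the Cluster resource reports the Ready condition.
    kubectl wait --for=condition=Ready clusters.postgresql.cnpg.io/pg-eu --timeout=600s
```

Without the webhook wait, applying the cluster immediately after the rollout can be rejected while the admission webhook is still starting, which is presumably why the action includes that step.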

---

### `setup-litmus`

Installs Litmus Chaos operator, experiments, and RBAC configuration.

**What it installs**:
- litmus-core operator (via Helm)
- pod-delete chaos experiment
- Litmus RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)

**Verification**:
- Checks all CRDs are installed
- Verifies operator is ready
- Validates RBAC permissions

**Requirements**:
- `litmus-rbac.yaml` - RBAC configuration file

**Usage**:
```yaml
- name: Setup Litmus Chaos
  uses: ./.github/actions/setup-litmus
```
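
Two of the verification checks could be expressed as in the sketch below. The CRD names are the standard Litmus ones; the service account `litmus-admin` and the `default` namespace are hypothetical placeholders, since the contents of `litmus-rbac.yaml` are not shown here. The operator readiness check is omitted because the deployment name depends on the Helm release.

```yaml
- name: Verify Litmus installation (sketch)
  shell: bash
  run: |
    set -euo pipefail
    # Each lookup fails the step if the CRD is missing.
    for crd in chaosengines chaosexperiments chaosresults; do
      kubectl get crd "${crd}.litmuschaos.io" >/dev/null
    done
    # Exits non-zero if the service account may not delete pods.
    # The service account name and namespace are hypothetical.
    kubectl auth can-i delete pods \
      --as="system:serviceaccount:default:litmus-admin" \
      --namespace default
```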

---

### `setup-prometheus`

Installs Prometheus monitoring (without Grafana) and configures CNPG ServiceMonitor.

**What it installs**:
- kube-prometheus-stack (Grafana and AlertManager disabled)
- Prometheus Operator
- kube-state-metrics
- CNPG ServiceMonitor for PostgreSQL metrics

**Resource limits** (optimized for CI):
- Prometheus: 512Mi request, 1Gi limit
- Prometheus Operator: 128Mi request, 256Mi limit

**Requirements**:
- `monitoring/podmonitor-pg-eu.yaml` - CNPG ServiceMonitor configuration

**Usage**:
```yaml
- name: Setup Prometheus
  uses: ./.github/actions/setup-prometheus
```
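
Expressed as Helm values for the kube-prometheus-stack chart, the settings described in this section would look roughly like this. It is a sketch of an equivalent values file, not the action's actual configuration, and it assumes the listed figures are memory requests and limits.

```yaml
# Hypothetical values file for the kube-prometheus-stack chart
grafana:
  enabled: false
alertmanager:
  enabled: false
prometheus:
  prometheusSpec:
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi
prometheusOperator:
  resources:
    requests:
      memory: 128Mi
    limits:
      memory: 256Mi
```

The workflow looks up a service named `prometheus-kube-prometheus-prometheus`, which is the name this chart typically produces for a release called `prometheus`; a different release name would change the service name the check has to use.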

---

## Artifacts

Each workflow run produces the following artifacts (retained for 30 days):

**Jepsen Results**:
- `results.edn` - Test results in EDN format
- `history.edn` - Operation history
- `STATISTICS.txt` - Test statistics
- `*.png` - Visualization graphs

**Litmus Results**:
- `chaosresult.yaml` - Chaos experiment results

**Logs**:
- `test.log` - Complete test execution log

**Cluster State** (on failure only):
- `cluster-state-dump.yaml` - Complete cluster state including pods, events, and operator logs
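
A sketch of how the failure-only dump and the 30-day retention might be wired together. The artifact name, the pinned `upload-artifact` commit, and the `logs/jepsen-chaos-*` pattern come from the workflow; the dump step, its commands, and the exact `path` entries are assumptions.

```yaml
- name: Dump cluster state   # hypothetical step
  if: failure()
  shell: bash
  run: |
    {
      kubectl get pods -A -o yaml
      kubectl get events -A -o yaml
      # Operator logs are appended as plain text after the YAML documents.
      kubectl logs -n cnpg-system deployment/cnpg-controller-manager --tail=500
    } > cluster-state-dump.yaml 2>&1 || true

- name: Upload test artifacts
  if: always()
  uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
  with:
    name: chaos-test-results-${{ github.run_number }}
    path: |
      logs/jepsen-chaos-*/
      cluster-state-dump.yaml
    retention-days: 30
    if-no-files-found: warn
```

Because the upload runs under `if: always()`, results are kept even when a quality gate fails, and `if-no-files-found: warn` avoids a second failure when the dump file was never created.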

---

## Usage in Other Workflows

You can reuse these actions in your own workflows:

```yaml
name: My Chaos Test

on:
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      actions: write

    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

      - name: Free disk space
        uses: ./.github/actions/free-disk-space

      - name: Setup tools
        uses: ./.github/actions/setup-tools

      - name: Create cluster
        uses: ./.github/actions/setup-kind
        with:
          region: us

      - name: Setup CNPG
        uses: ./.github/actions/setup-cnpg

      # Your custom chaos testing steps here
```

---

.github/workflows/chaos-test-full.yml

Lines changed: 3 additions & 7 deletions
@@ -26,7 +26,7 @@ jobs:
       contents: read
     steps:
       - name: Checkout repository
-        uses: actions/checkout@v4
+        uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

       - name: Free disk space
         uses: ./.github/actions/free-disk-space
@@ -54,7 +54,6 @@ jobs:

           echo "Verifying Prometheus is ready..."

-          # Quick check that Prometheus service exists
           kubectl -n monitoring get svc prometheus-kube-prometheus-prometheus >/dev/null 2>&1 || {
             echo "❌ Prometheus service not found"
             exit 1
@@ -74,8 +73,7 @@ jobs:
           echo "Chaos duration: ${{ inputs.chaos_duration || '300' }} seconds"
           echo ""

-          # Run the chaos test script
-          ./scripts/run-jepsen-chaos-test-v2.sh pg-eu app ${{ inputs.chaos_duration || '300' }}
+          ./scripts/run-jepsen-chaos-test.sh pg-eu app ${{ inputs.chaos_duration || '300' }}

       - name: Collect test results
         if: always()
@@ -84,7 +82,6 @@ jobs:

           echo "=== Collecting Test Results ==="

-          # Find the latest results directory
           RESULTS_DIR=$(ls -td logs/jepsen-chaos-* 2>/dev/null | head -1 || echo "")

           if [ -z "$RESULTS_DIR" ]; then
@@ -95,7 +92,6 @@ jobs:
           echo "Results directory: $RESULTS_DIR"
           echo ""

-          # Parse Jepsen verdict
           echo "=== Jepsen Verdict ==="
           if [ -f "$RESULTS_DIR/results/results.edn" ]; then
             grep ':valid?' "$RESULTS_DIR/results/results.edn" || echo "No verdict found"
@@ -114,7 +110,7 @@ jobs:

       - name: Upload test artifacts
         if: always()
-        uses: actions/upload-artifact@v4
+        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
         with:
           name: chaos-test-results-${{ github.run_number }}
           path: |

README.md

Lines changed: 1 addition & 1 deletion
@@ -262,7 +262,7 @@ Import the official dashboard JSON from <https://github.com/cloudnative-pg/grafa
 ### 6. Run the Jepsen chaos test

 ```bash
-./scripts/run-jepsen-chaos-test-v2.sh pg-eu app 600
+./scripts/run-jepsen-chaos-test.sh pg-eu app 600
 ```

 This script deploys Jepsen (`jepsenpg` image), applies the Litmus ChaosEngine (primary pod delete), monitors logs, collects Elle results, and cleans up transient resources **automatically** (no manual exit needed - the script handles everything).
File renamed without changes.
