Description
/kind bug
In CI a test failed:
[FAIL] ClusterAddonReconciler [AfterEach] Basic test applies the helm chart
/home/runner/work/cluster-stack-operator/cluster-stack-operator/internal/test/integration/workloadcluster/cluster_addon_test.go:109
After pressing "re-run test" in the GitHub web UI, the test passed.
This means the test is flaky: it fails and succeeds without any code change.
Here is the successful run: https://github.com/SovereignCloudStack/cluster-stack-operator/actions/runs/10717474962?pr=255
What did you expect to happen:
I expect tests to not be flaky.
Fixing flakiness
I used this script to run the tests again and again:
#!/bin/bash
trap 'echo "ERROR: A command has failed. Exiting the script. Line was ($0:$LINENO): $(sed -n "${LINENO}p" "$0")"; exit 3' ERR
set -Eeuo pipefail
i=0
while make test-unit; do
  i=$((i + 1))
  echo "Test run $i"
  date
  sleep 1
  echo "========================================================================================="
  echo "========================================================================================="
  echo
done

The script works; the second run failed:
Summarizing 1 Failure:
[FAIL] ClusterStackReconciler Test with provider Tests with multiple versions [It] checks ProviderClusterstackrelease is deleted when version is removed from spec
/home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:616
Ran 40 of 40 Specs in 10.203 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped
From time to time this fails, too:
Summarizing 1 Failure:
[FAIL] ClusterStackReconciler Test with provider Basic test [It] creates the cluster stack release object with cluster stack auto subscribe false
/home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:487
Ran 40 of 40 Specs in 8.498 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped
Sometimes this fails:
Summarizing 1 Failure:
[FAIL] ClusterAddonReconciler Basic test [It] creates the clusterAddon object
/home/guettli/syself/cluster-stack-operator-public/internal/controller/clusteraddon_controller_test.go:112
Ran 40 of 40 Specs in 9.054 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped
How to make the test always fail
If the test waits until the clusterStackRelease v2 is created, then the test always fails.
/////////////////////////////////////////////////////////////////////
// Flaky test
FIt("checks ProviderClusterstackrelease is deleted when version is removed from spec", func() {
fmt.Println("itttttttttttttttttttttttttttttttttttttttttt")
ph, err := patch.NewHelper(clusterStack, testEnv)
Expect(err).ShouldNot(HaveOccurred())
// new
providerclusterStackReleaseRefV2 := &corev1.ObjectReference{
APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
Kind: "TestInfrastructureProviderClusterStackRelease",
Name: clusterStackReleaseTagV2,
Namespace: testNs.Name,
}
Eventually(func() bool {
obj, err := external.Get(ctx, testEnv.GetClient(), providerclusterStackReleaseRefV2, testNs.Name)
if err != nil {
fmt.Printf("foundProviderclusterStackReleaseRef is not found. %s\n", err.Error())
return false
}
fmt.Printf("foundProviderclusterStackReleaseRef is found. %s %+v %+v\n",
obj.GetName(), obj.GetFinalizers(), obj.GetOwnerReferences())
return true
}, timeout, interval).Should(BeTrue())

Instead of the Eventually block, you can use time.Sleep(time.Second * 1); it has the same effect. The test will then fail here:
Eventually(func() bool {
return apierrors.IsNotFound(testEnv.Get(ctx, clusterStackReleaseTagV2Key, &csov1alpha1.ClusterStackRelease{}))
}, timeout, interval).Should(BeTrue())

I think the real error is in BeforeEach(): it does not wait until the resources are created.
BeforeEach(func() {
clusterStack.Spec = csov1alpha1.ClusterStackSpec{
Provider: "docker",
Name: "ferrol",
KubernetesVersion: "1.27",
Versions: []string{"v1", "v2"},
AutoSubscribe: false,
ProviderRef: &corev1.ObjectReference{
APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
Kind: "TestInfrastructureProviderClusterStackReleaseTemplate",
Name: "provider-test1",
Namespace: testNs.Name,
},
}
Expect(testEnv.Create(ctx, clusterStack)).To(Succeed())
cs, err := clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v1")
Expect(err).To(BeNil())
clusterStackReleaseTagV1 = cs.String()
cs, err = clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v2")
Expect(err).To(BeNil())
clusterStackReleaseTagV2 = cs.String()
clusterStackReleaseTagV1Key = types.NamespacedName{Name: clusterStackReleaseTagV1, Namespace: testNs.Name}
clusterStackReleaseTagV2Key = types.NamespacedName{Name: clusterStackReleaseTagV2, Namespace: testNs.Name}
})

@janiskemper what do you think: does it make sense to wait in BeforeEach, so that the tests get a stable environment?
The good news: the test fails reproducibly with FIt(). This means the flakiness is inside this single test (because no other test gets called).
Current conclusion
The test (It("checks ProviderClusterstackrelease is deleted when version is removed from spec")) has always been wrong. In some edge cases (roughly every 8th run), the correct sequence was executed and the test failed.
The deletionTimestamp does not get propagated from the parent object to the child object, because in envtest the garbage collector is not running.
Related kubebuilder docs: https://book.kubebuilder.io/reference/envtest.html#testing-considerations
I asked at #controller-runtime how other developers handle it.
I will update the code so that I check the DeletionTimestamp of the parent object and then check the ownerRef on the child object. I will remove the check that the child object gets deleted.