Description
/kind bug
In CI a test failed:
[FAIL] ClusterAddonReconciler [AfterEach] Basic test applies the helm chart
/home/runner/work/cluster-stack-operator/cluster-stack-operator/internal/test/integration/workloadcluster/cluster_addon_test.go:109
After pressing "re-run test" in the GitHub web UI, the test passed.
This means the test is flaky: it fails and succeeds without any code change.
Here is the successful run: https://github.com/SovereignCloudStack/cluster-stack-operator/actions/runs/10717474962?pr=255
What did you expect to happen:
I expect tests to not be flaky.
Fixing flakiness
I used this script to run the tests again and again:
#!/bin/bash
trap 'echo "ERROR: A command has failed. Exiting the script. Line was ($0:$LINENO): $(sed -n "${LINENO}p" "$0")"; exit 3' ERR
set -Eeuo pipefail
i=0
while make test-unit; do
  i=$((i + 1))
  echo "Test run $i"
  date
  sleep 1
  echo "========================================================================================="
  echo "========================================================================================="
  echo
done

The script works; the second run failed:
Summarizing 1 Failure:
[FAIL] ClusterStackReconciler Test with provider Tests with multiple versions [It] checks ProviderClusterstackrelease is deleted when version is removed from spec
/home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:616
Ran 40 of 40 Specs in 10.203 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped
From time to time this fails, too:
Summarizing 1 Failure:
[FAIL] ClusterStackReconciler Test with provider Basic test [It] creates the cluster stack release object with cluster stack auto subscribe false
/home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:487
Ran 40 of 40 Specs in 8.498 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped
Sometimes this fails:
Summarizing 1 Failure:
[FAIL] ClusterAddonReconciler Basic test [It] creates the clusterAddon object
/home/guettli/syself/cluster-stack-operator-public/internal/controller/clusteraddon_controller_test.go:112
Ran 40 of 40 Specs in 9.054 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped
How to make the test always fail
If the test waits until the clusterStackRelease v2 is created, then the test always fails.
/////////////////////////////////////////////////////////////////////
// Flaky test
FIt("checks ProviderClusterstackrelease is deleted when version is removed from spec", func() {
fmt.Println("itttttttttttttttttttttttttttttttttttttttttt")
ph, err := patch.NewHelper(clusterStack, testEnv)
Expect(err).ShouldNot(HaveOccurred())
// new
providerclusterStackReleaseRefV2 := &corev1.ObjectReference{
APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
Kind: "TestInfrastructureProviderClusterStackRelease",
Name: clusterStackReleaseTagV2,
Namespace: testNs.Name,
}
Eventually(func() bool {
obj, err := external.Get(ctx, testEnv.GetClient(), providerclusterStackReleaseRefV2, testNs.Name)
if err != nil {
fmt.Printf("foundProviderclusterStackReleaseRef is not found. %s\n", err.Error())
return false
}
fmt.Printf("foundProviderclusterStackReleaseRef is found. %s %+v %+v\n",
obj.GetName(), obj.GetFinalizers(), obj.GetOwnerReferences())
return true
}, timeout, interval).Should(BeTrue())

Instead of the Eventually block, you can use time.Sleep(time.Second * 1); it has the same effect. The test will then fail here:
Eventually(func() bool {
return apierrors.IsNotFound(testEnv.Get(ctx, clusterStackReleaseTagV2Key, &csov1alpha1.ClusterStackRelease{}))
}, timeout, interval).Should(BeTrue())

I think the real error is in BeforeEach(): it does not wait until the resources are created.
BeforeEach(func() {
clusterStack.Spec = csov1alpha1.ClusterStackSpec{
Provider: "docker",
Name: "ferrol",
KubernetesVersion: "1.27",
Versions: []string{"v1", "v2"},
AutoSubscribe: false,
ProviderRef: &corev1.ObjectReference{
APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
Kind: "TestInfrastructureProviderClusterStackReleaseTemplate",
Name: "provider-test1",
Namespace: testNs.Name,
},
}
Expect(testEnv.Create(ctx, clusterStack)).To(Succeed())
cs, err := clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v1")
Expect(err).To(BeNil())
clusterStackReleaseTagV1 = cs.String()
cs, err = clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v2")
Expect(err).To(BeNil())
clusterStackReleaseTagV2 = cs.String()
clusterStackReleaseTagV1Key = types.NamespacedName{Name: clusterStackReleaseTagV1, Namespace: testNs.Name}
clusterStackReleaseTagV2Key = types.NamespacedName{Name: clusterStackReleaseTagV2, Namespace: testNs.Name}
})

@janiskemper what do you think: does it make sense to wait in BeforeEach, so that the tests get a stable environment?
The good news: the test fails reproducibly with FIt(). This means the flakiness is inside this single test (because no other test gets called).
Current conclusion
The test (It("checks ProviderClusterstackrelease is deleted when version is removed from spec")) has always been wrong. In some edge cases (roughly every 8th run), the correct sequence was executed and the test failed.
The deletionTimestamp does not get propagated from the parent object to the child object, because in envtest the garbage collector is not running.
Related kubebuilder docs: https://book.kubebuilder.io/reference/envtest.html#testing-considerations
I asked at #controller-runtime how other developers handle it.
I will update the code so that I check the DeletionTimestamp of the parent object and then check the ownerRef on the child object. I will remove the check that the child object gets deleted.