Fix TektonInstallerSet deadlock when resources have deletionTimestamp#3217

Open

jkhelil wants to merge 1 commit into tektoncd:main from jkhelil:fix_deadlock

Conversation

@jkhelil
Member

@jkhelil jkhelil commented Feb 14, 2026

Changes

  • Added installerSetName parameter to ensureResources() and all callers
  • Checks OwnerReferences to determine if resource belongs to this InstallerSet
  • Only waits for owned resources, skips others
  • Allows reconciliation to continue even with TERMINATING CRDs

Fixes #2474

The operator enters a deadlock when any resource (e.g., CRD) has a deletionTimestamp during InstallerSet reconciliation. The current code immediately aborts the entire reconciliation phase, preventing critical namespace-scoped resources (ServiceAccounts, RBAC) from being created.

Symptoms:

  • No ServiceAccounts created in openshift-pipelines namespace
  • InstallerSets stuck with "reconcile again and proceed"
  • No component pods running (Deployments fail: serviceaccount not found)
  • Webhooks unavailable → TektonConfig can't reconcile
  • Operator logs show infinite CRD fetching loop

Impact: Complete operator failure during installations, upgrades, downgrades, or recovery operations.

Root Cause

Location: pkg/reconciler/kubernetes/tektoninstallerset/install.go:166-168
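The change described in the bullets above can be sketched roughly as follows. This is a simplified, hypothetical reconstruction, not the PR's actual install.go code: the local `ownerReference` and `resource` types stand in for the Kubernetes `metav1.OwnerReference` and unstructured object types, and `isOwnedBy`/`ensureResource` are illustrative helper names.

```go
package main

import "fmt"

// ownerReference mirrors the metav1.OwnerReference fields that matter here.
type ownerReference struct {
	Kind string
	Name string
}

// resource is a minimal stand-in for an unstructured Kubernetes object.
type resource struct {
	Name              string
	DeletionTimestamp string // non-empty means the resource is terminating
	OwnerReferences   []ownerReference
}

// isOwnedBy reports whether the resource lists the given TektonInstallerSet
// among its owner references.
func isOwnedBy(res resource, installerSetName string) bool {
	for _, ref := range res.OwnerReferences {
		if ref.Kind == "TektonInstallerSet" && ref.Name == installerSetName {
			return true
		}
	}
	return false
}

// ensureResource sketches the decision: only wait (reconcile again) for a
// terminating resource owned by this InstallerSet; skip resources owned by
// others so reconciliation can proceed to ServiceAccounts, RBAC, etc.
func ensureResource(res resource, installerSetName string) string {
	if res.DeletionTimestamp != "" {
		if isOwnedBy(res, installerSetName) {
			return "reconcile-again" // our own resource is going away; wait for it
		}
		return "skip" // owned by another InstallerSet; do not deadlock on it
	}
	return "ensure"
}

func main() {
	terminating := resource{
		Name:              "taskruns.tekton.dev",
		DeletionTimestamp: "2026-02-14T00:00:00Z",
		OwnerReferences:   []ownerReference{{Kind: "TektonInstallerSet", Name: "old-set"}},
	}
	fmt.Println(ensureResource(terminating, "new-set")) // skip
	fmt.Println(ensureResource(terminating, "old-set")) // reconcile-again
}
```

Under this sketch, the old behavior corresponds to always returning "reconcile-again" whenever a deletionTimestamp is present, which is what caused the infinite loop.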

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Release Notes

NONE

@tekton-robot tekton-robot added the release-note-none Denotes a PR that doesn't merit a release note. label Feb 14, 2026
@tekton-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please ask for approval from jkhelil after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 14, 2026
@jkhelil
Member Author

jkhelil commented Feb 17, 2026

@vdemeester @anithapriyanatarajan PTAL

@jkhelil jkhelil closed this Feb 17, 2026
@jkhelil jkhelil reopened this Feb 17, 2026
@anithapriyanatarajan
Contributor

/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 17, 2026
@anithapriyanatarajan
Contributor

@jkhelil - could you help with steps to reproduce the issue?

}

// Resource is being deleted by another controller/InstallerSet, skip it
ressourceLogger.Debug("resource is being deleted by another owner, skipping")
Contributor

Would it be OK to log the finalizer name and the deletion timestamp here as well?

@jkhelil
Member Author

jkhelil commented Feb 18, 2026

@jkhelil - could you help with steps to reproduce the issue?
It’s a somewhat strange error that we’ve been carrying across several versions; you may remember all the issues people encountered during upgrades.

What happens:
Some TaskRuns or PipelineRuns have finalizers. A user tries to delete TektonConfig to uninstall the operator, either because they notice a problem during an upgrade or because their teams are experiencing etcd or infrastructure issues.
When TektonConfig is deleted, the CRDs are also removed, but they get stuck due to the finalizers on the TaskRuns. The user doesn’t see this because they delete TektonConfig and don’t monitor the remaining resources; TektonConfig appears deleted.
The user then tries to reinstall or upgrade using a new CSV, but it doesn’t work. The installer set is stuck in a loop: it checks whether there is a deletion timestamp on the CRDs and keeps reconciling again, never proceeding to the next step (installing the service accounts, RBAC, etc.).
The user ends up seeing an error related to the webhook service account, but in reality there is nothing wrong with it; the installer set is simply stuck reconciling continuously.

To reproduce: install once, run some workloads, check that the finalizers are present, delete TektonConfig, then reinstall or upgrade.

@anithapriyanatarajan
Contributor

anithapriyanatarajan commented Feb 24, 2026

@jkhelil - Attempted to test this implementation and have the following observations.

InstallerSets, especially the pipeline static installerset that manages the CRDs (tested with the taskruns.tekton.dev CRD), get into a deadlock if any given CRD has a finalizer and is marked for deletion.

With the current PR implementation, we skip CRD reconciliation if the resource does not belong to the new installerset. However, this behavior prevents the new installerset version from reconciling and installing the required CRDs.

What I observed is that the static installerset that reconciles the CRDs does not create a new instance of a CRD if any of the CRDs are deleted or missing. If we skip them during the deadlock, we end up in a situation where PipelineRuns/TaskRuns will not be accepted by the cluster unless the user manually figures this out and forces a reconcile by deleting the pipeline static installerset.

Hence, instead of skipping, could we attempt to remove the finalizers with proper logging, so that the stale resource is removed and the new CRD is created properly? Sample code block generated with Claude for reference:

    ressourceLogger.Warn("force deleting stale resource to break deadlock",
        "deletionTimestamp", res.GetDeletionTimestamp(),
        "finalizers", res.GetFinalizers(),
        "oldOwners", res.GetOwnerReferences(),
        "reason", "resource not owned by current InstallerSet") 

    // Remove finalizers to complete deletion
    res.SetFinalizers([]string{})
    if err := i.mfClient.Update(res); err != nil {
        ressourceLogger.Error("failed to remove finalizers", "error", err)
        return err
    }
    
    // Reconcile again to create new resource
    return v1alpha1.RECONCILE_AGAIN_ERR    

Thank you

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 25, 2026
@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026
@anithapriyanatarajan
Contributor

/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2026
@anithapriyanatarajan
Contributor

@jkhelil - Apologies for being so fussy about this PR. Thank you for the update. Further testing surfaced the following issue:

When multiple CRDs are stuck in the terminating state simultaneously (e.g., both pipelineruns.tekton.dev and taskruns.tekton.dev holding the customresourcecleanup.apiextensions.k8s.io finalizer), the current fix clears only one CRD's finalizer per reconcile pass: it returns RECONCILE_AGAIN_ERR immediately after the first successful finalizer removal, exiting the loop before reaching the subsequent CRDs.

This produced the "alternating" behavior observed in logs:

Pass N: clears pipelineruns → returns, taskruns untouched
Pass N+1: clears taskruns → returns, pipelineruns re-checked
...and so on, doubling the time to recovery for every additional stuck CRD

Proposed fix derived with help from Claude: replace the early return with a needsReconcileAgain = true + continue pattern, so that all stuck CRDs have their finalizers cleared in a single reconcile pass. RECONCILE_AGAIN_ERR is returned once after the loop completes. Proposed install.go for reference: https://gist.github.com/anithapriyanatarajan/9b78f1b09645f4adc6aea9547194fdb5

I have also included another minor return nil fix that stops the spam of "failed to get resource: error=null" log entries. Though unrelated to this PR, the fix is tiny and keeps the logs clean.

Sorry again for the delay. I think these fixes will help us completely get past the upgrade issues caused by stale PipelineRuns/TaskRuns.

Steps to capture the edge case:

  • make apply the PR changes on a kind cluster
  • Once all the pods were running in the tekton-pipelines namespace, created simple PipelineRuns and allowed them to succeed
  • Added a random finalizer finalizer.tekton.dev to the PipelineRun
  • Executed kubectl delete tektonpipelines pipeline and watched whether the ownerReference UIDs changed properly to the new pipeline CR's UID, using the following commands
  • Watched updates of the ownerReference on PipelineRuns with kubectl get crd pipelineruns.tekton.dev -w -o json | jq '. | {name: .metadata.name, ownerUID: .metadata.ownerReferences[0].uid, resourceVersion: .metadata.resourceVersion, generation: .metadata.generation}'
  • Watched updates of the ownerReference on TaskRuns with kubectl get crd taskruns.tekton.dev -w -o json | jq '. | {name: .metadata.name, ownerUID: .metadata.ownerReferences[0].uid, resourceVersion: .metadata.resourceVersion, generation: .metadata.generation}'

Observed that only on alternate deletions were the UIDs updated. With the minor enhancement to iterate through all CRDs before exiting, this issue was gone. Thank you.

@anithapriyanatarajan
Contributor

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 5, 2026
@anithapriyanatarajan anithapriyanatarajan removed the lgtm Indicates that a PR is ready to be merged. label Mar 5, 2026

Development

Successfully merging this pull request may close these issues.

Detect *stale* tekton objects prior to an upgrade / reconciliation
