Skip to content

Conversation

@leandroberetta
Copy link
Contributor

@leandroberetta leandroberetta commented Nov 19, 2025

Description

In HyperShift deployments and other external control plane configurations, the flowlogs-pipeline pods were unable to connect to the Kubernetes API server, resulting in timeout errors:

Cannot get dynamic config: Get "https://172.30.0.1:443/api/v1/namespaces/netobserv/configmaps/flowlogs-pipeline-config-dynamic": dial tcp 172.30.0.1:443: i/o timeout
Failed to watch err="failed to list *v1.Pod: Get https://172.30.0.1:443/api/v1/pods?limit=500&resourceVersion=0\": dial tcp 172.30.0.1:443: i/o
timeout"

In HyperShift deployments, the control plane (API server, etcd, controllers) runs in a separate management cluster, while worker nodes run in a different cluster. This is evident from:

  1. HyperShift label: The kubernetes service endpoint has the label hypershift.openshift.io/managed: "true"
  2. No local pods: The default namespace contains no pods serving the API server endpoint

The previous NetworkPolicy only allowed egress traffic to port 6443 when targeting specific in-cluster namespaces (openshift-apiserver, openshift-kube-apiserver) using namespaceSelector with
podSelector. However, external control plane endpoints are not represented as pods within the worker cluster, so the namespace selector rules don't apply and traffic is blocked.

NB: this issue is specific to main branch; release-1.10 is unaffected

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@leandroberetta leandroberetta self-assigned this Nov 19, 2025
@leandroberetta leandroberetta requested a review from jotak November 19, 2025 14:46
@openshift-ci
Copy link

openshift-ci bot commented Nov 19, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from leandroberetta. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

@OlivierCazade OlivierCazade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov
Copy link

codecov bot commented Nov 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.05%. Comparing base (6613e89) to head (ae5996d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2157      +/-   ##
==========================================
+ Coverage   73.02%   73.05%   +0.02%     
==========================================
  Files          80       80              
  Lines        9180     9177       -3     
==========================================
  Hits         6704     6704              
+ Misses       2062     2060       -2     
+ Partials      414      413       -1     
Flag Coverage Δ
unittests 73.05% <ø> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
internal/controller/networkpolicy/np_objects.go 92.25% <ø> (-0.15%) ⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jotak
Copy link
Member

jotak commented Nov 19, 2025

Hmm, this weakens considerably the egress policy
There should be another way, no? We're surely not the first having to deal with that

@leandroberetta
Copy link
Contributor Author

leandroberetta commented Nov 19, 2025

Hmm, this weakens considerably the egress policy There should be another way, no? We're surely not the first having to deal with that

I will investigate this issue. I think that we may want to watch the endpoint IP and add it to the network policy, but that would be something more complex, I don't think that ip would change, maybe we can do it one time, although if it changes, won't work, so we may want to watch that endpoint.

@jotak
Copy link
Member

jotak commented Nov 20, 2025

@leandroberetta yes, I asked the ovn folks, restricting on 172.20.0.1:6443 should work, and like you said we can know this address by watching the endpoint.
The only downside I see, is that the Endpoint API is deprecated so sooner or later it will break; but ok, we will fix it when that happens, I guess
EDIT: I see it's also defined as an EndpointSlice, so we can use the new API

@openshift-ci openshift-ci bot removed the lgtm label Nov 20, 2025
@openshift-ci
Copy link

openshift-ci bot commented Nov 20, 2025

New changes are detected. LGTM label has been removed.

@leandroberetta
Copy link
Contributor Author

@leandroberetta yes, I asked the ovn folks, restricting on 172.20.0.1:6443 should work, and like you said we can know this address by watching the endpoint. The only downside I see, is that the Endpoint API is deprecated so sooner or later it will break; but ok, we will fix it when that happens, I guess EDIT: I see it's also defined as an EndpointSlice, so we can use the new API

Hi @jotak, good to know, yes, I implemented a version where first looks for EndpointSlice and fallback to Endpoint in case the new API is not available.

image

It's working fine for HyperShift, although I'm not sure how to properly try if the ip changes (I'm not sure is a common case or even possible). Same for IPv6 clusters (network policy needs CIDR so the /32 or /128 in the range needs to be provided accordingly).

@openshift-ci
Copy link

openshift-ci bot commented Nov 20, 2025

@leandroberetta: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-operator c6fa841 link false /test e2e-operator

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants