Description
What happened?
- ✋ I have searched the open/closed issues and my issue is not listed.
We run many small Spark jobs that last from tens of seconds to a little over a minute, each with one driver and four executors. With the Spark Operator limited to roughly 2 CPU cores and 4 GB of memory, and about 20 such jobs running continuously, the operator sometimes marks jobs that exited successfully as failed. This happens because when the operator falls behind on processing events in its work queue, the driver pod may already be gone by the time the operator queries for it, so it sets the application status to failed.
I'd like to ask the experts: is this behavior by design? Is there any way to avoid it? In practice we mitigate the problem by giving the operator more resources, but that does not feel like a reasonable long-term solution.
I checked the code on branch 2.2.1 and found that the conditional logic still checks whether the driver pod exists.
Reproduction Code
func (c *Controller) getAndUpdateDriverState(app *v1beta2.SparkApplication) error {
    // Either the driver pod doesn't exist yet or its name has not been updated.
    if app.Status.DriverInfo.PodName == "" {
        return fmt.Errorf("empty driver pod name with application state %s", app.Status.AppState.State)
    }

    driverPod, err := c.getDriverPod(app)
    if err != nil {
        return err
    }

    // If the driver pod cannot be found, the application is immediately marked
    // as failing, even if the driver already completed successfully and its pod
    // was removed before the operator processed the corresponding events.
    if driverPod == nil {
        app.Status.AppState.ErrorMessage = "driver pod not found"
        app.Status.AppState.State = v1beta2.FailingState
        app.Status.TerminationTime = metav1.Now()
        return nil
    }
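
For discussion only, here is a minimal standalone Go sketch, not the operator's actual code and not a proposed patch, of one way a controller could tolerate a transient "driver pod not found" result: allow a short grace period and requeue before moving the application to a failed state. Every name in the sketch (appStatus, getDriverPod, updateDriverState, driverPodNotFoundGracePeriod) is invented for illustration.

package main

import (
    "errors"
    "fmt"
    "time"
)

// appStatus loosely mirrors the status fields the operator consults; the
// names here are hypothetical and exist only for this sketch.
type appStatus struct {
    state         string
    errorMessage  string
    notFoundSince time.Time // first time the driver pod was observed missing
}

var errPodNotFound = errors.New("driver pod not found")

// getDriverPod stands in for a client/informer lookup; it returns
// errPodNotFound when the pod is missing.
func getDriverPod() (string, error) {
    return "", errPodNotFound
}

// driverPodNotFoundGracePeriod is how long a missing driver pod is tolerated
// before the app is treated as failed (value chosen arbitrarily).
const driverPodNotFoundGracePeriod = 2 * time.Minute

func updateDriverState(st *appStatus) error {
    pod, err := getDriverPod()
    if err == nil {
        st.notFoundSince = time.Time{} // pod visible again; reset the grace timer
        fmt.Println("driver pod found:", pod)
        return nil
    }
    if !errors.Is(err, errPodNotFound) {
        return err
    }
    // Pod is missing: start (or continue) a grace period instead of failing
    // the application on the first missed lookup.
    if st.notFoundSince.IsZero() {
        st.notFoundSince = time.Now()
    }
    if time.Since(st.notFoundSince) < driverPodNotFoundGracePeriod {
        // Returning an error here would make a typical controller requeue
        // the item and look again later.
        return fmt.Errorf("driver pod not found, retrying within grace period")
    }
    st.state = "FAILING"
    st.errorMessage = "driver pod not found after grace period"
    return nil
}

func main() {
    st := &appStatus{}
    if err := updateDriverState(st); err != nil {
        fmt.Println("requeue:", err)
    }
    fmt.Println("state:", st.state)
}

Whether a grace period, a bounded retry count, or consulting the last observed driver phase is the right approach is a design decision for the maintainers; the sketch only illustrates the idea of not failing the application on the first lookup that misses the pod.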
Expected behavior
The final application status is determined correctly: an application whose driver exited successfully is reported as completed, not failed.
Actual behavior
The status is misjudged: applications that completed successfully are sometimes marked as failed with the error "driver pod not found".
Environment & Versions
- Kubernetes Version: 1.24
- Spark Operator Version: 1.1.25
- Apache Spark Version: 3.4.1
Additional context
No response
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.
The application's final status is determined incorrectly, which impacts usability.