Driver pod completes before the operator processes its status, so the Spark application is incorrectly marked as failed #2755

@xpfwork

Description

What happened?

  • ✋ I have searched the open/closed issues and my issue is not listed.

We run small Spark jobs that last from tens of seconds to a little over a minute, each with, for example, one driver and four executors. With limited Spark Operator resources (about 2 cores and 4 GB) and 20 jobs running continuously, the operator sometimes classifies jobs that exited correctly as failed. This happens when the operator is slow to process the events in its queue: by the time it queries the driver pod, the pod is missing, so it sets the app status to failed.

I'd like to ask the experts: is this how Spark Operator is designed, and is there a way to avoid this situation? In practice we mitigate the problem by increasing the operator's resources, but that does not feel like a reasonable fix.

I checked the code on branch 2.2.1 and found that the conditional logic there still checks whether the driver pod exists.

Reproduction Code

func (c *Controller) getAndUpdateDriverState(app *v1beta2.SparkApplication) error {
	// Either the driver pod doesn't exist yet or its name has not been updated.
	if app.Status.DriverInfo.PodName == "" {
		return fmt.Errorf("empty driver pod name with application state %s", app.Status.AppState.State)
	}

	driverPod, err := c.getDriverPod(app)
	if err != nil {
		return err
	}

	// A missing driver pod is always treated as a failure, even if the driver exited successfully.
	if driverPod == nil {
		app.Status.AppState.ErrorMessage = "driver pod not found"
		app.Status.AppState.State = v1beta2.FailingState
		app.Status.TerminationTime = metav1.Now()
		return nil
	}

	// ... (remainder of the function omitted in this report)
}
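
To make the race easier to see, here is a minimal, self-contained sketch of the decision described above. It is not the operator's code: `AppState`, `driverPod`, and `nextState` are hypothetical names invented for this illustration, and the pod phases are simplified to plain strings. It only shows that once the driver pod is gone, a successful run and a real failure look the same to the lookup.

```go
package main

import "fmt"

// AppState is a hypothetical stand-in for the operator's application state.
type AppState string

const (
	Running   AppState = "RUNNING"
	Completed AppState = "COMPLETED"
	Failing   AppState = "FAILING"
)

// driverPod is a hypothetical, simplified view of a driver pod.
type driverPod struct {
	phase string // "Running", "Succeeded", or "Failed"
}

// nextState mirrors the branch quoted above: a missing pod (nil) is
// always mapped to Failing, because there is no exit status left to read.
func nextState(pod *driverPod) AppState {
	if pod == nil {
		return Failing
	}
	switch pod.phase {
	case "Succeeded":
		return Completed
	case "Failed":
		return Failing
	default:
		return Running
	}
}

func main() {
	// The operator keeps up: the pod still exists when it is looked up.
	fmt.Println(nextState(&driverPod{phase: "Succeeded"})) // COMPLETED

	// The operator lags: the same successful pod is already gone.
	fmt.Println(nextState(nil)) // FAILING
}
```

The second call is the case reported here: the driver finished successfully, but the lookup happened too late and the result is indistinguishable from a failure.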

Expected behavior

The app status can be correctly determined.

Actual behavior

The operator misjudges the outcome: an application that exited correctly is marked as failed.

Environment & Versions

  • Kubernetes Version: 1.24
  • Spark Operator Version: 1.1.25
  • Apache Spark Version: 3.4.1

Additional context

No response

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

The app's final status determination failed, impacting usability.

Metadata

Labels: kind/bug (Something isn't working)
