Description
What happened?
- ✋ I have searched the open/closed issues and my issue is not listed.
We run many small Spark jobs that last from tens of seconds to a little over a minute, each with one driver and four executors. With the Spark Operator limited to roughly 2 CPU cores and 4 GB of memory, and about 20 such jobs running continuously, the operator sometimes marks jobs that exited successfully as failed. This happens because when the operator falls behind on processing events in its work queue, the driver pod may already be gone by the time the operator queries for it, so it sets the application status to failed.
I'd like to ask the experts: is this behavior by design? Is there any way to avoid it? In practice we mitigate the problem by giving the operator more resources, but that does not feel like a reasonable long-term solution.
I checked the code on branch 2.2.1 and found that the conditional logic still checks whether the driver pod exists.
Reproduction Code
func (c *Controller) getAndUpdateDriverState(app *v1beta2.SparkApplication) error {
    // Either the driver pod doesn't exist yet or its name has not been updated.
    if app.Status.DriverInfo.PodName == "" {
        return fmt.Errorf("empty driver pod name with application state %s", app.Status.AppState.State)
    }

    driverPod, err := c.getDriverPod(app)
    if err != nil {
        return err
    }

    // If the driver pod cannot be found, the application is immediately marked
    // as failing, even if the driver already completed successfully and its pod
    // was removed before the operator processed the corresponding events.
    if driverPod == nil {
        app.Status.AppState.ErrorMessage = "driver pod not found"
        app.Status.AppState.State = v1beta2.FailingState
        app.Status.TerminationTime = metav1.Now()
        return nil
    }
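
For discussion only, here is a minimal standalone Go sketch, not the operator's actual code and not a proposed patch, of one way a controller could tolerate a transient "driver pod not found" result: allow a short grace period and requeue before moving the application to a failed state. Every name in the sketch (appStatus, getDriverPod, updateDriverState, driverPodNotFoundGracePeriod) is invented for illustration.

package main

import (
    "errors"
    "fmt"
    "time"
)

// appStatus loosely mirrors the status fields the operator consults; the
// names here are hypothetical and exist only for this sketch.
type appStatus struct {
    state         string
    errorMessage  string
    notFoundSince time.Time // first time the driver pod was observed missing
}

var errPodNotFound = errors.New("driver pod not found")

// getDriverPod stands in for a client/informer lookup; it returns
// errPodNotFound when the pod is missing.
func getDriverPod() (string, error) {
    return "", errPodNotFound
}

// driverPodNotFoundGracePeriod is how long a missing driver pod is tolerated
// before the app is treated as failed (value chosen arbitrarily).
const driverPodNotFoundGracePeriod = 2 * time.Minute

func updateDriverState(st *appStatus) error {
    pod, err := getDriverPod()
    if err == nil {
        st.notFoundSince = time.Time{} // pod visible again; reset the grace timer
        fmt.Println("driver pod found:", pod)
        return nil
    }
    if !errors.Is(err, errPodNotFound) {
        return err
    }
    // Pod is missing: start (or continue) a grace period instead of failing
    // the application on the first missed lookup.
    if st.notFoundSince.IsZero() {
        st.notFoundSince = time.Now()
    }
    if time.Since(st.notFoundSince) < driverPodNotFoundGracePeriod {
        // Returning an error here would make a typical controller requeue
        // the item and look again later.
        return fmt.Errorf("driver pod not found, retrying within grace period")
    }
    st.state = "FAILING"
    st.errorMessage = "driver pod not found after grace period"
    return nil
}

func main() {
    st := &appStatus{}
    if err := updateDriverState(st); err != nil {
        fmt.Println("requeue:", err)
    }
    fmt.Println("state:", st.state)
}

Whether a grace period, a bounded retry count, or consulting the last observed driver phase is the right approach is a design decision for the maintainers; the sketch only illustrates the idea of not failing the application on the first lookup that misses the pod.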
Expected behavior
The final application status is determined correctly: an application whose driver exited successfully is reported as completed, not failed.
Actual behavior
The status is misjudged: applications that completed successfully are sometimes marked as failed with the error "driver pod not found".
Environment & Versions
- Kubernetes Version: 1.24
- Spark Operator Version: 1.1.25
- Apache Spark Version: 3.4.1
Additional context
No response
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.
The application's final status is determined incorrectly, which impacts usability.