@turboFei
Member

@turboFei turboFei commented Dec 7, 2025

Why are the changes needed?

Support waiting for the batch application submission during recovery, to throttle the load on the system.

Add a new config to control it:

Whether a metadata recovery task should wait for its corresponding engine submission to complete before finishing. All recovery tasks are submitted to a fixed thread pool controlled by kyuubi.metadata.recovery.threads. If true, a task blocks until the engine submission is done, helping throttle the load on the system if kyuubi.session.engine.startup.waitCompletion is false. If false, the task returns immediately after opening the session without waiting.
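The throttling effect described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not Kyuubi code: `RecoveryThrottleSketch`, `peakInFlight`, and the 200 ms sleep that stands in for an engine submission are all hypothetical. It counts how many simulated submissions are in flight at once when recovery tasks run on a fixed thread pool, with and without waiting.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical simulation; none of these names come from the Kyuubi code base.
object RecoveryThrottleSketch {

  /** Returns the peak number of simulated engine submissions in flight at once. */
  def peakInFlight(recoveryThreads: Int, batches: Int, waitSubmission: Boolean): Int = {
    // Stands in for the fixed pool sized by kyuubi.metadata.recovery.threads.
    val recoveryPool = Executors.newFixedThreadPool(recoveryThreads)
    // Stands in for the asynchronous engine submissions started by each recovery task.
    val submissionPool = Executors.newCachedThreadPool()
    val inFlight = new AtomicInteger(0)
    val peak = new AtomicInteger(0)

    (1 to batches).foreach { _ =>
      recoveryPool.execute(() => {
        // "Open the session": kick off the submission asynchronously.
        val submission = submissionPool.submit(new Runnable {
          override def run(): Unit = {
            val now = inFlight.incrementAndGet()
            peak.accumulateAndGet(now, (a, b) => math.max(a, b))
            // Simulated submission cost (assumed value).
            try Thread.sleep(200) finally inFlight.decrementAndGet()
          }
        })
        // With waiting enabled, the recovery thread stays occupied until the
        // submission finishes, so the pool size caps concurrent submissions.
        if (waitSubmission) submission.get()
      })
    }

    recoveryPool.shutdown()
    recoveryPool.awaitTermination(30, TimeUnit.SECONDS)
    submissionPool.shutdown()
    submissionPool.awaitTermination(30, TimeUnit.SECONDS)
    peak.get()
  }
}
```

With waiting enabled, the peak cannot exceed `recoveryThreads`, because each recovery thread holds its slot until its submission completes. With waiting disabled, the recovery tasks return immediately, so nearly all submissions can overlap.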

Close #7226

How was this patch tested?

GA.

Was this patch authored or co-authored using generative AI tooling?

No.

@turboFei turboFei added this to the v1.10.3 milestone Dec 7, 2025
@turboFei turboFei self-assigned this Dec 7, 2025
@codecov-commenter

codecov-commenter commented Dec 7, 2025

Codecov Report

❌ Patch coverage is 0% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (fba1f94) to head (ea6282d).
⚠️ Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ache/kyuubi/server/KyuubiRestFrontendService.scala | 0.00% | 10 Missing ⚠️ |
| ...in/scala/org/apache/kyuubi/config/KyuubiConf.scala | 0.00% | 9 Missing ⚠️ |

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7262   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         696     697    +1     
  Lines       43530   43576   +46     
  Branches     5883    5891    +8     
======================================
- Misses      43530   43576   +46     


@turboFei turboFei requested a review from pan3793 December 8, 2025 02:20
@turboFei turboFei changed the title from "Support to wait the batch recovery appliction submission to throttle the load on the system" to "[KYUUBI #7226] Support to wait the batch recovery appliction submission to throttle the load on the system" Dec 8, 2025
@JoonPark1

@turboFei So, basically, you updated the recoveryBatches() method to read this newly introduced property kyuubi.session.engine.startup.waitCompletion, so that when it is enabled, the recovery task for a given batch job does not return but blocks until the associated application has actually started on the compute engine spun up by the Kyuubi server? Is this so that the Kyuubi server isn't overloaded when the thread pool of size recoveryNumThreads tries to recover too many batches at once, since tasks could otherwise return early, concurrently, before their batch jobs have actually been recovered?

if (recoveryWaitAppSubmission) {
  info(s"Waiting for batch[$batchId] application submission during recovery")
  val batchOp = batchSession.batchJobSubmissionOp
  // Poll until the application has started or the operation reaches a terminal state.
  while (!batchOp.appStarted && !OperationState.isTerminal(batchOp.getStatus.state)) {
    Thread.sleep(300)
  }
}
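For readers who want to try the polling pattern from the snippet above in isolation, here is a minimal standalone sketch. It is not the PR's code: `SubmissionOp` is a hypothetical stand-in for the batch job submission operation, and the `timeoutMs` parameter is an added assumption (the loop quoted above has no timeout).

```scala
// Hypothetical sketch; names and the timeout are assumptions, not Kyuubi APIs.
object RecoveryWaitSketch {

  /** Stand-in for the batch job submission operation used in the loop above. */
  trait SubmissionOp {
    def appStarted: Boolean
    def isTerminal: Boolean
  }

  /**
   * Polls until the application has started or the operation is terminal.
   * Returns true if one of those conditions ended the wait, false on timeout.
   */
  def waitForSubmission(op: SubmissionOp, pollIntervalMs: Long, timeoutMs: Long): Boolean = {
    val deadline = System.nanoTime() + timeoutMs * 1000000L
    while (!op.appStarted && !op.isTerminal) {
      // Give up once the (assumed) deadline has passed.
      if (System.nanoTime() >= deadline) return false
      Thread.sleep(pollIntervalMs)
    }
    true
  }
}
```

Checking the terminal state alongside `appStarted` matters: without it, a submission that fails before the application starts would leave the recovery thread polling forever.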

@turboFei
Member Author

turboFei commented Dec 9, 2025

Hi @JoonPark1
With kyuubi.session.engine.startup.waitCompletion=false, the Spark submit process is released once the application has been submitted.

After that, the Kyuubi server only needs to monitor the application state.

I think it should be fine.

@turboFei turboFei closed this in 572cef8 Dec 11, 2025
turboFei added a commit that referenced this pull request Dec 11, 2025
…on to throttle the load on the system

### Why are the changes needed?

Support waiting for the batch application submission during recovery, to throttle the load on the system.

Add a new config to control it:

Whether a metadata recovery task should wait for its corresponding engine submission to complete before finishing. All recovery tasks are submitted to a fixed thread pool controlled by kyuubi.metadata.recovery.threads. If true, a task blocks until the engine submission is done, helping throttle the load on the system if kyuubi.session.engine.startup.waitCompletion is false. If false, the task returns immediately after opening the session without waiting.

Close #7226
### How was this patch tested?

GA.
### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #7262 from turboFei/recover_concurrent.

Closes #7226

ea6282d [Wang, Fei] config
2b0403d [Wang, Fei] refine docs
b5c5101 [Wang, Fei] refine
f6b510c [Wang, Fei] 1.10.3
b892c71 [Wang, Fei] Support to wait the batch recovery appliction submission to throttle the load on the system
c4740dc [Wang, Fei] conf

Authored-by: Wang, Fei <[email protected]>
Signed-off-by: Wang, Fei <[email protected]>
(cherry picked from commit 572cef8)
Signed-off-by: Wang, Fei <[email protected]>
@turboFei turboFei deleted the recover_concurrent branch December 11, 2025 07:08
@turboFei
Member Author

Thanks, merged to master (1.11.0) and 1.10.3.

Development

Successfully merging this pull request may close these issues.

[Bug] Kyuubi Pod OOM when polling for status of spark driver for Simultaneous Large Number of Batch Jobs
