While playing with "Add AWS CDK-based benchmarking environment", I'm seeing that queries are likely to hang forever while running the benchmarks, printing "Query still running..." in a loop with nothing happening.
I wonder if there is an issue that isn't captured by the current tests because they use local files rather than S3 ones.
I tried printing the StageKeys held in the ttl_map that stores the state of each query to see if something is wrong there (a rough sketch of that instrumentation follows the log below), and this is what I saw:
Output log:
[2025-11-18T12:40:46Z INFO worker] Executing query...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:40:51Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:40:56Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:01Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:06Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:11Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:16Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:21Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:26Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:31Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:36Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:41Z INFO worker] Query still running...
[StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 8, task_number: 0 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 4, task_number: 1 }, StageKey { query_id: b"\xb2\x7f\xe9\xa6\xd6\xc2O\x86\x82\xd1\x96\x03\\\x04\x1e0", stage_id: 3, task_number: 1 }]
[2025-11-18T12:41:46Z INFO worker] Query still running...
[]
[2025-11-18T12:41:51Z INFO worker] Query still running...
[]
[2025-11-18T12:41:56Z INFO worker] Query still running...
[]
[2025-11-18T12:42:01Z INFO worker] Query still running...
[]
[2025-11-18T12:42:06Z INFO worker] Query still running...
[]
[2025-11-18T12:42:11Z INFO worker] Query still running...
...
...
...
It goes on like this forever.
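For reference, the dump above was produced by instrumentation roughly along the lines of the sketch below. The StageKey fields match what the log prints, but the map wrapper, the field types, and the log_stage_keys helper are hypothetical stand-ins; this is not the project's real ttl_map API.

```rust
// Rough sketch of the instrumentation behind the dump above. StageKey fields
// mirror what the log prints; the map type and helper are placeholders.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct StageKey {
    query_id: Vec<u8>, // placeholder type; the real key may use a bytes type
    stage_id: u64,
    task_number: u64,
}

// Stand-in for the real ttl_map holding per-stage task state.
type StageStateMap = Arc<Mutex<HashMap<StageKey, ()>>>;

// Dump the currently tracked keys, taking a snapshot so the lock is not held
// while formatting/printing.
fn log_stage_keys(map: &StageStateMap) {
    let keys: Vec<StageKey> = map.lock().unwrap().keys().cloned().collect();
    println!("{keys:?}");
}

fn main() {
    let map: StageStateMap = Arc::new(Mutex::new(HashMap::new()));
    log_stage_keys(&map); // prints "[]", like the tail of the log above
}
```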
It looks like there's a deadlock somewhere in the code that completely stalls the query.
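To make the suspicion concrete, below is a minimal, self-contained example of the kind of async deadlock that would produce exactly this symptom (tasks still registered but never making progress): a blocking std::sync::Mutex guard held across an .await. It assumes a tokio runtime and is purely illustrative; it is not taken from the project's code.

```rust
// Illustrative only: a classic async deadlock of the kind that could leave a
// query permanently stuck. Not the project's code.
use std::sync::{Arc, Mutex};
use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let state = Arc::new(Mutex::new(0u64));
    let state2 = Arc::clone(&state);

    let task = tokio::spawn(async move {
        // A guard from a blocking Mutex is held across an .await point...
        let mut guard = state2.lock().unwrap();
        tokio::time::sleep(Duration::from_secs(1)).await;
        *guard += 1;
    });

    // Give the spawned task a chance to grab the lock and yield at its .await.
    tokio::time::sleep(Duration::from_millis(100)).await;

    // ...so this blocking lock() stalls the only runtime thread: the spawned
    // task is never polled again to release the guard, and we hang here
    // forever, the same "still running, no progress" symptom as above.
    let _value = *state.lock().unwrap();

    task.await.unwrap();
}
```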
It seems to be triggered by some queries in particular, for example, TPCH query 7. With the remote benchmarks, it can be reproduced roughly 50% of the time by running:
npm run datafusion-bench -- --sf 10 --files-per-task 4 --query 7