Description
As we found in #28520 (comment), spilling to the system disk causes problems.
TL;DR: during heavy spilling we write too much data to the system disk → the disk becomes unresponsive → the whole node becomes unresponsive.
Here are the disk write speed graphs from three failed requests (all ended with Node Disconnected):
https://nda.ya.ru/t/Pm3q0fya7P2PLq
Spilling plots for the same time period:
https://nda.ya.ru/t/eHRYgKP_7P2PZW
The default value of the spilling directory is empty. In this case spilling falls back to /tmp:
`spillingRoot = NYql::NDq::GetTmpSpillingRootForCurrentUser();`
`auto root = TFsPath{GetSystemTempDir()};`
which is located on the system disk.
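The fallback described above can be sketched in standalone C++. This is only an illustration of the "empty setting means system temp directory" behavior, not the YDB implementation: `ResolveSpillingRoot` and `SpillsToSystemTemp` are hypothetical names, and `std::filesystem::temp_directory_path()` stands in for `GetSystemTempDir()`.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of YDB): returns the directory spilling
// would use for a given configured root.
fs::path ResolveSpillingRoot(const std::string& configuredRoot) {
    if (!configuredRoot.empty()) {
        // An explicit setting wins, e.g. a mount point on a dedicated disk.
        return fs::path(configuredRoot);
    }
    // Empty setting: fall back to the system temp directory, which
    // usually lives on the system disk.
    return fs::temp_directory_path();
}

// Hypothetical check (not part of YDB): true when spilling would land
// in the system temp directory.
bool SpillsToSystemTemp(const std::string& configuredRoot) {
    return ResolveSpillingRoot(configuredRoot) == fs::temp_directory_path();
}
```

A check like `SpillsToSystemTemp` could be used at startup to warn when spilling is enabled without an explicit directory; where such a warning would belong in the actual code base is an open question.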
Warning: all databases with spilling enabled may be affected.
To prevent these failures in the future, one of the following actions is required:
Option 1: Limit system disk bandwidth for the kikimr process
I tested limiting the write bandwidth to 30 MB/s, and it prevented the crashes (see #28520 (comment)).
However, the query then timed out after 2 hours.
This requires adding the following setting to /lib/systemd/system/[email protected]:
IOWriteBandwidthMax=/dev/md3 30M
Option 2: Use a separate disk for spilling
I switched spilling to a different disk (I took one of the storage disks).
The request completed successfully in about 35 minutes: