Add limits for spilling on the system disk

As we found in https://github.com/ydb-platform/ydb/issues/28520#issuecomment-3580846432, spilling to the system disk causes problems.

TLDR: during heavy spilling we write too much data to the system disk → it becomes unresponsive → the whole node becomes unresponsive.

Here are the disk write speed graphs from three failed requests (all ended with Node Disconnected):
https://nda.ya.ru/t/Pm3q0fya7P2PLq

Spilling plots for the same time period:
https://nda.ya.ru/t/eHRYgKP_7P2PZW

The default value of the spilling directory is empty, in which case spilling falls back to `/tmp`:

https://github.com/ydb-platform/ydb/blob/1ca1acd058f165ef479f5eb70e8e8fcb856e9fad/ydb/core/kqp/proxy_service/kqp_proxy_service.cpp#L254

https://github.com/ydb-platform/ydb/blob/1ca1acd058f165ef479f5eb70e8e8fcb856e9fad/ydb/library/yql/dq/actors/spilling/spilling_file.cpp#L1063

`/tmp` is located on the system disk.
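
To check on a given node whether the spilling directory shares a filesystem with the root, something like the following sketch may help. It compares `st_dev` values, so it only shows whether two paths are on the same mounted filesystem; two partitions of one physical disk (or a tmpfs `/tmp`) would be reported as separate. The helper names `device_id` and `shares_device` are made up for this example.

```python
import os


def device_id(path: str) -> int:
    """Return the ID of the device backing the filesystem that contains `path`."""
    return os.stat(path).st_dev


def shares_device(path_a: str, path_b: str) -> bool:
    """True when both paths live on the same mounted filesystem."""
    return device_id(path_a) == device_id(path_b)


if __name__ == "__main__":
    spill_dir = "/tmp"  # the fallback directory mentioned in this issue
    if shares_device(spill_dir, "/"):
        print(f"{spill_dir} is on the same filesystem as /")
    else:
        print(f"{spill_dir} is on a separate filesystem")
```

If the two paths share a filesystem, heavy spilling competes with the system for the same disk.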

! All databases with spilling enabled may be affected.

To prevent these failures in the future, one of the following actions is required:

## Option 1: Limit system disk bandwidth for the kikimr process

I tested limiting the bandwidth to 30 MB/s, and it prevented the crashes:
https://github.com/ydb-platform/ydb/issues/28520#issuecomment-3586225891

However, the query then timed out (2 hours).

This requires adding the following setting to the `[Service]` section of `/lib/systemd/system/kikimr-multi@.service`:

```
IOWriteBandwidthMax=/dev/md3 30M
```
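
A rough back-of-the-envelope calculation shows why the query may not finish under this cap. The sketch below assumes that `30M` means 30 * 10^6 bytes per second and that the limit is saturated for the whole 2-hour window; the function name `max_bytes_written` is hypothetical.

```python
def max_bytes_written(limit_bytes_per_s: int, duration_s: int) -> int:
    """Upper bound on bytes written to the throttled device in the given time."""
    return limit_bytes_per_s * duration_s


limit = 30 * 1000**2        # assumption: 30M = 30 * 10^6 bytes/s
timeout = 2 * 60 * 60       # 2 hours in seconds

upper_bound = max_bytes_written(limit, timeout)
print(upper_bound / 1000**3)  # prints 216.0 (GB)
```

So a query that needs to spill more than roughly 216 GB to this device cannot complete within 2 hours at this limit.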

## Option 2: Use a separate disk for spilling

I switched spilling to a different disk (one taken from storage).
The request then completed successfully in about 35 minutes:

https://github.com/ydb-platform/ydb/issues/28520#issuecomment-3596475166



