Commit f7688c4
fix: use single writer for all partition streams (#3870)
# Description
Collecting the streams concurrently and sending them over a sync channel lets us keep a single writer instance that everything is written through. This prevents the edge case of writing many small files when datafusion decides to create lots of partitioned streams that are relatively small. Since we now maintain one writer, we just keep writing from all partition streams into it.
I guess this stacks well with your PR @abhiaagarwal? Wdyt?
- closes #3871
With the same code as in the above issue, we now create only one file instead of 3 small files:
```json
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"0d1846c0-71e6-4c6d-aba1-5b9f8d752937","name":null,"description":null,"format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"foo\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"createdTime":1760889081860,"configuration":{}}}
{"add":{"path":"part-00000-352305f0-eff1-44d3-993e-6bd31cdd118c-c000.snappy.parquet","partitionValues":{},"size":510,"modificationTime":1760889081877,"dataChange":true,"stats":"{\"numRecords\":9,\"minValues\":{\"foo\":1},\"maxValues\":{\"foo\":3},\"nullCount\":{\"foo\":0}}","tags":null,"baseRowId":null,"defaultRowCommitVersion":null,"clusteringProvider":null}}
{"commitInfo":{"timestamp":1760889081878,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists"},"engineInfo":"delta-rs:py-1.2.0","clientVersion":"delta-rs.py-1.2.0","operationMetrics":{"execution_time_ms":19,"num_added_files":1,"num_added_rows":9,"num_partitions":0,"num_removed_files":0}}}
```
---------
Signed-off-by: Ion Koutsouris <[email protected]>
1 parent d0ae849
File tree: 2 files changed (+241, -168 lines)
- crates/core/src/operations/write
- python