feat(compute): support casts between binary/string and binary_view/string_view#802
Open
zeroshade wants to merge 23 commits intoapache:mainfrom
Open
feat(compute): support casts between binary/string and binary_view/string_view#802zeroshade wants to merge 23 commits intoapache:mainfrom
zeroshade wants to merge 23 commits intoapache:mainfrom
Conversation
…ring_view
Adds the missing cast kernels so that compute.CastArray (and the 'cast'
compute function) can convert between the view variants (binary_view,
string_view) and the existing base-binary types (binary, large_binary,
string, large_string, fixed_size_binary). Registers three new cast
paths:
* base-binary -> view (CastBinaryToBinaryView)
* view -> base-binary (CastBinaryViewToBinary)
* view <-> view (CastBinaryViewToBinaryView, zero-copy)
UTF-8 validation mirrors the behavior of the existing binary->string
kernels: when casting a non-utf8 source into a utf8 destination, every
non-null value is validated unless CastOptions.AllowInvalidUtf8 is set.
Also fixes two related gaps in the compute/array plumbing that
otherwise prevent view types from flowing through generic code paths:
* arrow/compute/exec.getNumBuffers now returns 3 for binary_view and
string_view (bitmap + view headers + a single overflow data buffer),
matching ArraySpan's [3]BufferSpan limit. Previously it fell through
to the default case and dropped the data buffer, which crashed any
cast that walked non-inline view values through input.MakeArray().
* arrow/array.getMaxBufferLen and the nullArrayFactory learn about
BinaryViewDataType so that MakeArrayOfNull (called by the executor
when building empty-input results) works for view outputs.
Tests exercising short (<=12 byte inline) and long (out-of-line) values,
null handling, UTF-8 validation, CanCast coverage, and FSB -> view
are added to arrow/compute/cast_test.go. The tests use a direct array
comparison helper rather than the existing checkCast -- that helper
iterates scalar.GetScalar, which does not yet support view types
(a separate limitation outside the scope of this change).
Fixes apache#184
CastBinaryToBinaryView relied on BinaryViewBuilder's default 32KB block size. As soon as the cumulative out-of-line (>12 byte) payload of the cast output exceeded that, the builder allocated a second overflow block, producing a 4-buffer Array. exec.ArraySpan only has three BufferSpan slots, so out.TakeOwnership(result.Data()) indexed past Buffers[2] and panicked. The kernel now walks the input once to compute the total out-of-line payload, errors out if it exceeds math.MaxInt32 (the single-buffer offset limit on ViewHeader.BufferOffset), and pre-reserves the builder's data buffer to exactly that size before appending. With the builder's first block sized to fit everything, subsequent per-value ReserveData calls return without allocating new blocks, guaranteeing a single data buffer in the output. Also wires a reserveData callback through the internal binaryAppender helper. *StringViewBuilder shadows the embedded BinaryViewBuilder's Append([]byte) with its own Append(string), so it doesn't satisfy the array.BinaryLikeBuilder interface - type-switching via binaryAppender is the same pattern already used for appendBytes. Adds TestBinaryLikeToBinaryViewLargePayload which casts 500x200-byte values (100 KB total out-of-line, well over the 32KB block default) to both string_view and binary_view, verifies no panic, correct content, and that the output fits in \u22643 buffers. Addresses review 809 feedback on apache#802.
Addresses two more multi-buffer-view failure modes caught in review 811 of the original apacheGH-184 PR: 1. Multi-buffer view **inputs**. exec.ArraySpan is a fixed [3]BufferSpan, so a BinaryView/StringView array with more than one overflow data buffer (e.g. built via BinaryViewBuilder with >32KB of non-inline payload) panics in SetMembers before any kernel runs. The cast meta-function now calls coalesceMultiBufferViewDatum before dispatch; if the input datum is a view with >3 buffers, it rebuilds the array with a single overflow buffer (sized via ReserveData) and substitutes that into the argument list. 2. Numeric/temporal casts to string_view. addNumericAndTemporalToStringCasts reuses the shared numericToStringCastExec/timeToStringCastExec kernels for string_view output. Those kernels did not reserve data capacity, so StringViewBuilder could spill into a second overflow block and produce a 4-buffer result, which would then panic in TakeOwnership. Both kernels now call bldr.ReserveData(input.Len * maxFormattedBytes) based on a per-type upper bound derived from the formatters. This guarantees the builder's first block is large enough to hold the entire formatted output, so no spill occurs. castTimestampToString already handled this via its existing strlen/ReserveData pattern; boolToStringCastExec is safe because bool formats to at most 5 bytes (always inline). Adds three regression tests under arrow/compute/cast_test.go: - TestMultiBufferViewInputCoalesced: builds multi-buffer BinaryView/ StringView inputs and casts them to binary/utf8/string_view. - TestNumericToStringViewLargePayload: casts 5000 int64/timestamp/time64 values to string_view and asserts the output stays within one data buffer. - TestBinaryLikeToBinaryViewLargePayload (from the earlier fix) continues to cover the output-side pre-reservation path. Addresses review 811 feedback on apache#802.
…flow Addresses three more gaps in view-cast support caught in review 814 of the apacheGH-184 PR: 1. **Chunked view inputs bypassed coalescing (HIGH).** coalesceMultiBufferViewDatum only handled *ArrayDatum. A ChunkedDatum whose chunks are multi-buffer BinaryView/StringView still crashed when the executor called SetMembers on a second chunk. The meta-function helper now walks ChunkedDatum.Chunks(), rebuilds any multi-buffer chunk via rebuildViewSingleBuffer, retains single-buffer chunks as-is, and returns a new *ChunkedDatum. 2. **maxFormattedBytes(FLOAT16) bound was 12 (MEDIUM).** float16.Num.String() calls strconv.FormatFloat(..., 32) which can emit up to 15 bytes. Raised the bound to 16. (No direct cast test added because FLOAT16 -> string_view isn't registered through addNumericAndTemporalToStringCasts - numericTypes excludes Float16 - so the bound is defensive for the eventual case where it is.) 3. **Numeric/temporal ReserveData could overflow int32 (MEDIUM).** numericToStringCastExec, timeToStringCastExec, and castTimestampToString now call a new reserveFormattedData helper that computes (Len - Nulls) * perValueBytes in int64 and returns arrow.ErrInvalid when the result exceeds math.MaxInt32, rather than panicking inside BinaryViewBuilder.ReserveData for very large inputs. Tests: - TestChunkedMultiBufferViewInputCoalesced (arrow/compute/cast_test.go): 2 multi-buffer StringView chunks cast to utf8, verifies correct content in each output chunk. - TestReserveFormattedDataOverflowGuard (arrow/compute/internal/kernels/string_casts_test.go): unit-tests the overflow guard directly against a fake ArraySpan so we don't have to allocate >2GB through the full executor to trigger it. Addresses review 814 feedback on apache#802.
reserveFormattedData hard-failed any payload above MaxInt32 regardless of the destination builder, rejecting valid large_utf8 casts that LargeStringBuilder could represent via int64 offsets. Split the limit per builder: MaxInt32 for StringBuilder/StringViewBuilder (int32 offsets / single overflow buffer), MaxInt64 for LargeStringBuilder. Also fix a double-release in TestChunkedMultiBufferViewInputCoalesced: ChunkedDatum.Chunks() returns the slice owned by the datum, so per-chunk Release() under defer out.Release() double-freed the arrays.
…ilds formattedDataLimit returned math.MaxInt64 for LargeStringBuilder, but reserveFormattedData later casts total to int for ReserveData. On 32-bit builds where int is 32-bit, any total above math.MaxInt32 would overflow the conversion, bypassing the guard and producing a negative or wrapped value inside BinaryBuilder.ReserveData. Clamp the returned limit to int64(math.MaxInt) so it always fits the platform's int. On 64-bit builds this is a no-op (MaxInt == MaxInt64); on 32-bit it caps large_utf8 at MaxInt32. Added TestFormattedDataLimitFitsInPlatformInt regression test that holds on both platforms by construction.
TestReserveFormattedDataAllowsLargeUtf8AboveInt32 previously asserted formattedDataLimit(large_utf8) > MaxInt32 unconditionally. After the 32-bit clamp landed, formattedDataLimit clamps the large_utf8 limit to math.MaxInt32 on 32-bit builds, so the test would fail on the very platform the clamp is meant to protect. Split the assertion: on 64-bit require > MaxInt32 (MaxInt64 exposes the int64 offset capacity); on 32-bit require == MaxInt32 (the clamp-enforced ceiling).
unpackDictionary previously expanded every dictionary through TakeArray, but array_take registers no kernel for binary_view/string_view, so casts whose input was a dictionary with view-typed values failed with 'function array_take has no kernel matching input types (string_view, int32)'. Detect view-typed dictionary values and materialize them directly with StringViewBuilder/BinaryViewBuilder, preserving index-level nulls. Regression test TestCasts/TestViewDictionaryUnpackCast covers string_view and binary_view dictionaries through cast to utf8, string_view, binary, and with nulls. array_take itself still lacks view-type kernels; that is tracked separately.
…ries unpackViewDictionary only checked index-level nulls (dictArr.IsNull(i)), so a dictionary whose values array contained a null entry would materialize that slot as an empty string/byte slice instead of a null. Also check vals.IsNull(idx) for each referenced dictionary slot, which matches TakeArray's behavior on the non-view path. Extended the existing regression test with a 'null in values array' subtest.
…casts Two remaining view-producing paths could emit multi-buffer view arrays and blow past exec.ArraySpan's fixed [3]BufferSpan. unpackViewDictionary now pre-scans the dictionary to sum non-inline payload, rejects totals above MaxInt32, and calls ReserveData up-front so the unpacked view array lives in a single overflow buffer even on the dict<view> -> view type-equal fast path. boolToStringCastExec now matches the numeric and temporal kernels: it Reserves header slots and pre-reserves up to 5 bytes per row (the longer of 'true'/'false') via reserveFormattedData. Added a large-non-inline-payload regression subtest to TestViewDictionaryUnpackCast and TestBoolToStringViewStaysSingleBuffer asserting <= 3 output buffers.
…view dict test Two follow-ups from the apacheGH-184 review rotation: 1) boolToStringCastExec previously reserved 5*nonNull bytes unconditionally, which treated every row as 'false'. For large all-true casts the true payload is 4*nonNull, and the looser bound would reject valid inputs near MaxInt32. Count exact bytes by walking the data bitmap on non-null positions ('true'=4, 'false'=5) and pre-reserve exactly that total. Extracted reserveFormattedDataExact(bldr, total) so the per-value and exact paths share the single-buffer limit check. 2) TestViewDictionaryUnpackCast only exercised the string_view fast path; added a mirror binary_view-dict-to-binary_view subtest with 100KB of non-inline payload, asserting <=3 output buffers and value correctness. Extended TestBoolToStringViewStaysSingleBuffer with all-true, all-false, and alternating cases that assert the output data buffer equals the exact formatted payload size.
Job 824 flagged that TestViewDictionaryUnpackCast/string_view_dict_with_null_in_values_array only exercised the string_view branch of unpackViewDictionary; the binary_view branch had the same vals.IsNull(idx) logic but no test, so it could regress independently. Added a mirror subtest with a *BinaryView dictionary whose second value is null and indices that reference it twice, asserting the output null count and per-element null mask match.
…values For view destinations (string_view / binary_view), values that fit inside the view header's inline slot consume no overflow data, so there is no overflow buffer to reserve and no per-buffer limit to enforce. Add an inline-skip in reserveFormattedData so large casts whose per-value upper bound is inline (bool via 5 bytes, int8..int32 via <=11, date32, etc.) are not rejected against a limit they never touch. Apply the same logic in boolToStringCastExec's exact-count path. Added TestReserveFormattedDataInlineViewSkip asserting inline pass-through at the bool / boundary (11) and the overflow-buffer limit firing at 12+. Also restored TestReserveFormattedDataAllowsLargeUtf8AboveInt32's platform-aware branching which an earlier edit accidentally reverted.
…line ViewHeader.IsInline uses size <= 12 but arrow.IsViewInline uses strict < 12, so a 12-byte formatted value (time32[ms] 'HH:MM:SS.sss') is stored inline by the builder yet my earlier guard treated it as non-inline and rejected the cast against the overflow-buffer limit. Introduce a local viewHeaderStoresInline helper (n <= 12) that matches the storage rule, use it in reserveFormattedData's inline-skip, and add TestCasts/ TestNumericToStringViewLargePayload/time32[ms]_to_string_view_stays_inline as the cast-level smoke test. TestReserveFormattedDataInlineViewSkip now asserts that perValueBytes=12 passes inline-skip and 13 hits the limit.
ViewHeader documents inlining for '12 bytes or fewer' and ViewHeader.IsInline implements that as size <= viewInlineSize, but the exported arrow.IsViewInline used strict '<'. The inconsistency caused downstream view-payload accounting (reserveFormattedData, rebuildViewSingleBuffer, sumViewDictionaryOutOfLine, the string/binary->view cast kernel) to treat 12-byte values as non-inline and count them against the single-buffer limit even though the builder stores them inline, so large time32[ms] -> string_view casts were falsely rejected. Fixing IsViewInline at the source removes the mismatch without touching downstream call sites or introducing a parallel helper. Removed my local viewHeaderStoresInline helper added earlier and routed the inline-skip in reserveFormattedData back through arrow.IsViewInline. Tightened TestCasts/TestNumericToStringViewLargePayload/ time32[ms]_to_string_view_stays_inline to assert Buffers()[2] is empty/absent so the test actually exercises the inline path.
exec.ArraySpan.SetMembers recurses into dictionary children, so a dict<string_view>/dict<binary_view> whose dictionary values are multi-buffer trips the [3]BufferSpan limit before unpackViewDictionary gets a chance to run. needsViewCoalesce now recurses into dictionary values, and coalesceMultiBufferViewDatum dispatches via a new coalesceArrayData helper that rebuilds either a top-level view array or a dictionary whose values are a view array (reusing the existing index buffers via array.NewDataWithDictionary). Added TestCasts/TestViewDictionaryUnpackCast/dict<string_view_multi-buffer_values>_to_utf8 regression test that builds a dictionary with 2x20KB unique values (guaranteed multi-buffer with the default 32KB block size) and asserts the cast completes with the expected values.
exec.ArraySpan.SetMembers recurses through both Dictionary() and
Children(), so a nested type (list<string_view>, struct{f: binary_view},
list<dict<string_view>>, etc.) whose descendant view array is
multi-buffer still hits the fixed [3]BufferSpan limit before dispatch.
Extended needsViewCoalesce to walk Children() as well, and
coalesceArrayData now rebuilds non-view/non-dictionary nested arrays by
keeping their own buffers, recursively coalescing only the children that
need it (retaining the others), and reconstructing via array.NewData.
Added TestCasts/TestViewDictionaryUnpackCast/list<string_view_multi-buffer_child>_to_list<utf8>
regression test that builds a list<string_view> with a multi-buffer
values array (4x20KB unique entries) and asserts the list cast
completes end-to-end with the expected values.
…scing Extension arrays reuse their underlying storage buffers/children in place, so exec.ArraySpan.SetMembers iterates the storage layout directly and trips the [3]BufferSpan limit before CastFromExtension unwraps anything. needsViewCoalesce now reads through ExtensionType.StorageType() during the view/dictionary switch, and coalesceArrayData unwraps extension inputs, recursively coalesces the storage-typed ArrayData, and rewraps with the original extension datatype so downstream dispatch still sees the extension. Added StringViewExtType to the internal test types package (extension backed by string_view) and TestCasts/TestViewDictionaryUnpackCast/ extension<string_view_multi-buffer>_to_utf8 regression test that registers the extension, builds a multi-buffer storage array, wraps it, and casts to utf8 end-to-end.
array.NewData does not carry a Data's Dictionary() field, so the previous extension-type unwrap/rewrap path dropped the dictionary for extension<dictionary<...view...>>: the recursive DICTIONARY branch inside coalesceArrayData then saw a nil Dictionary() and panicked or produced malformed data. Routed both the unwrap and the rewrap through a new reshapeArrayDataType helper that honours Dictionary() via array.NewDataWithDictionary when present. Added TestCoalesceArrayDataExtensionOverDictOfView as an internal-package regression (direct coalesceArrayData call) because the full cast pipeline for extension<dict<...>> hits a pre-existing SetMembers gap (exec.ArraySpan.SetMembers does not recognize a dictionary under an extension type) that is unrelated to view coalescing. Added DictStringViewExtType to internal/types to support the regression.
Pulls the three branch-introduced duplicate patterns identified by roborev analyze duplication into shared helpers. 1) validateUtf8, validateUtf8Fsb, and validateUtf8View (in kernels/binary_view_casts.go) shared the same VisitBitBlocksShort + utf8.Valid + error-format skeleton and differed only in how they obtained the byte slice per row. Extracted validateUTF8Sequence in kernels/string_casts.go; each validator now composes it with its own valueAt closure. Removed the now-unused utf8/bitutils imports in binary_view_casts.go. 2) rebuildViewSingleBuffer and sumViewDictionaryOutOfLine duplicated the 'sum out-of-line bytes, error on > MaxInt32' loop. Collapsed them into a single sumOutOfLineBytes(n, isNull, valueLen) helper. unpackViewDictionary now calls the helper with closures that route dictionary positions through GetValueIndex. 3) unpackViewDictionary's StringView and BinaryView branches were ~95% similar copies of the same null-aware append loop. Extracted buildFromDictionary(dictArr, valIsNull, appendValue, appendNull) so both branches share the walk and the two-level null rule. 4) Promoted the TestViewDictionaryUnpackCast local buildDict closure (9 call sites within one test) to CastSuite.buildInt32Dictionary so it is discoverable and reusable across the suite. Net LOC change is close to zero (code moved into named helpers rather than removed), but the cognitive and copy/paste-drift surface area drops substantially. All apacheGH-184 regression tests still pass.
Second pass driven by the re-run roborev analyze duplication job (842): two residual branch-introduced patterns were still 60-97% similar after the first cleanup. Both are consolidated without changing behavior. 1) unpackViewDictionary's StringView and BinaryView branches shared the sum-outOfLine + builder.Reserve + ReserveData + buildFromDictionary pipeline verbatim; only the concrete builder type and Value() return type differed. Extracted unpackViewDictionaryIntoBuilder which takes a viewValuesNull (the part of the dictionary-values interface that can be unified in Go) plus closures for reserveData / appendValue / appendNull keyed to the concrete builder. The two switch arms are now just 'new builder + release + call helper + NewArray'. 2) CastBinaryToBinaryView and CastBinaryViewToBinary shared a newBinaryAppender + per-type value accessor + null-preserving append-loop skeleton. Extracted binaryLikeValueAccessor for the input type dispatch (covers all 7 binary-like arrow array types in one place) and appendBinaryValues for the null-routing loop. Each cast kernel still owns its own direction-specific validation and size-limit check (view->binary uses MaxOf[OutOffsetT]; binary->view uses the overflow-buffer MaxInt32 limit), so the correctness- sensitive pieces stay explicit and auditable. Net LOC is slightly higher because the new helpers carry docstrings, but the copy-paste-drift surface is substantially smaller and the six-fold similarity the reviewer flagged is gone.
… adapter Third-round roborev analyze complexity job (848) flagged the helper's 7-parameter signature with 3 callback parameters as harder to read than its ~30 LOC warrants. Packaged the builder + reserveData + appendValue + appendNull closures into a viewDictBuilderAdapter struct, dropping the signature to 4 parameters. Each switch arm now constructs the adapter with a named-field initializer, making the role of each closure obvious at the call site. No behaviour change; all existing regression tests pass unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale for this change
Fixes #184.
compute.CastArraycurrently fails withunsupported cast to string_view from utf8(and symmetrically forbinary_view, inverse directions, and FixedSizeBinary) because no cast kernels are registered for the view variants. PyArrow has supported this since 18.0.0 (apache/arrow#43302); this PR brings arrow-go to parity.What changes are included in this PR?
New cast kernels in
arrow/compute/internal/kernels/binary_view_casts.go:binary,large_binary,string,large_string,fixed_size_binary→binary_view/string_viewCastBinaryToBinaryViewBinaryViewBuilder/StringViewBuilder. UTF-8 validated when destination isstring_viewand source is non-utf8, unlessCastOptions.AllowInvalidUtf8is set.binary_view,string_view→binary,large_binary,string,large_stringCastBinaryViewToBinary[OutOffsetT]binary_view→string/large_string, unlessAllowInvalidUtf8is set. Checksint32overflow when targetingstring/binary.binary_view↔string_view(and identity)CastBinaryViewToBinaryViewZeroCopyCastExec; UTF-8 validated onbinary_view→string_viewdirection.These are wired into the dispatch table via
cast_binary_viewandcast_string_viewentries incast.go, andaddToBinaryKernelsnow also registersview → base-binarykernels so the existingcast_binary/cast_string/etc. paths accept view inputs.Related executor/array fixes (required to get view types through the generic dispatch):
arrow/compute/exec.getNumBuffersnow returns 3 forbinary_view/string_view(bitmap + view headers + one overflow data buffer), matchingArraySpan's fixed[3]BufferSpan. Previously this fell through to thedefault: return 2case, so the data buffer was dropped whenArraySpan.MakeDatasliced tobufs[:NumBuffers()], which crashed any cast that read non-inline values viainput.MakeArray().arrow/array.getMaxBufferLenand thenullArrayFactorynow handleBinaryViewDataType, soMakeArrayOfNullworks for view outputs. Without this, casting a zero-length input array to a view type panicked witharrayofnull not implemented for type string_view.Are these changes tested?
Yes. Added four new
CastSuitetests inarrow/compute/cast_test.go:TestBinaryLikeToBinaryView- all base-binary → view, FSB → view, short (inline) + long (out-of-line) values + nulls, UTF-8 validation + opt-out viaAllowInvalidUtf8.TestBinaryViewToBinaryLike- view → all base-binary, UTF-8 validation forbinary_view→string.TestBinaryViewToBinaryView- cross-view casts with UTF-8 validation.TestCanCastViewTypes-CanCastcoverage for the new paths.These use a simple array-level comparison (
checkCastArrayOnly) rather than the existingcheckCastbecause the latter iterates viascalar.GetScalar, which does not yet support view types (separate limitation, outside scope). Fullarrow/compute/...andarrow/array/...test suites pass;arrow/computealso passes with-race.The exact reproducer from #184 now succeeds:
Are there any user-facing changes?
Yes - previously-unsupported cast paths now work:
compute.CastArray(ctx, arr, {ToType: BinaryTypes.StringView})and inverse.compute.CastArray(ctx, arr, {ToType: BinaryTypes.BinaryView})and inverse.binary_view↔string_viewcasts.binary_view/string_view.compute.CanCastadvertises all of the above.No existing behavior changes. The change to
getNumBufferscould in principle affect other compute paths that manipulate view arrays, but the previous value of2was incorrect for views (causing data-buffer loss), so this is a bug fix rather than a behavioral change.