Skip to content

feat(compute): support casts between binary/string and binary_view/string_view#802

Open
zeroshade wants to merge 23 commits intoapache:mainfrom
zeroshade:fix/gh-184-cast-string-view
Open

feat(compute): support casts between binary/string and binary_view/string_view#802
zeroshade wants to merge 23 commits intoapache:mainfrom
zeroshade:fix/gh-184-cast-string-view

Conversation

@zeroshade
Copy link
Copy Markdown
Member

Rationale for this change

Fixes #184. compute.CastArray currently fails with unsupported cast to string_view from utf8 (and symmetrically for binary_view, inverse directions, and FixedSizeBinary) because no cast kernels are registered for the view variants. PyArrow has supported this since 18.0.0 (apache/arrow#43302); this PR brings arrow-go to parity.

What changes are included in this PR?

New cast kernels in arrow/compute/internal/kernels/binary_view_casts.go:

Direction Kernel Notes
binary, large_binary, string, large_string, fixed_size_binarybinary_view/string_view CastBinaryToBinaryView Uses BinaryViewBuilder/StringViewBuilder. UTF-8 validated when destination is string_view and source is non-utf8, unless CastOptions.AllowInvalidUtf8 is set.
binary_view, string_viewbinary, large_binary, string, large_string CastBinaryViewToBinary[OutOffsetT] Materializes into a contiguous data buffer. UTF-8 validated when casting binary_viewstring/large_string, unless AllowInvalidUtf8 is set. Checks int32 overflow when targeting string/binary.
binary_viewstring_view (and identity) CastBinaryViewToBinaryView Zero-copy via ZeroCopyCastExec; UTF-8 validated on binary_viewstring_view direction.

These are wired into the dispatch table via cast_binary_view and cast_string_view entries in cast.go, and addToBinaryKernels now also registers view → base-binary kernels so the existing cast_binary/cast_string/etc. paths accept view inputs.

Related executor/array fixes (required to get view types through the generic dispatch):

  • arrow/compute/exec.getNumBuffers now returns 3 for binary_view/string_view (bitmap + view headers + one overflow data buffer), matching ArraySpan's fixed [3]BufferSpan. Previously this fell through to the default: return 2 case, so the data buffer was dropped when ArraySpan.MakeData sliced to bufs[:NumBuffers()], which crashed any cast that read non-inline values via input.MakeArray().
  • arrow/array.getMaxBufferLen and the nullArrayFactory now handle BinaryViewDataType, so MakeArrayOfNull works for view outputs. Without this, casting a zero-length input array to a view type panicked with arrayofnull not implemented for type string_view.

Are these changes tested?

Yes. Added four new CastSuite tests in arrow/compute/cast_test.go:

  • TestBinaryLikeToBinaryView - all base-binary → view, FSB → view, short (inline) + long (out-of-line) values + nulls, UTF-8 validation + opt-out via AllowInvalidUtf8.
  • TestBinaryViewToBinaryLike - view → all base-binary, UTF-8 validation for binary_viewstring.
  • TestBinaryViewToBinaryView - cross-view casts with UTF-8 validation.
  • TestCanCastViewTypes - CanCast coverage for the new paths.

These use a simple array-level comparison (checkCastArrayOnly) rather than the existing checkCast because the latter iterates via scalar.GetScalar, which does not yet support view types (separate limitation, outside scope). Full arrow/compute/... and arrow/array/... test suites pass; arrow/compute also passes with -race.

The exact reproducer from #184 now succeeds:

input: ["a" "b" "c" "this is a longer string beyond 12 bytes"]
cast to string_view: ["a" "b" "c" "this is a longer string beyond 12 bytes"]
type: string_view

Are there any user-facing changes?

Yes - previously-unsupported cast paths now work:

  • compute.CastArray(ctx, arr, {ToType: BinaryTypes.StringView}) and inverse.
  • compute.CastArray(ctx, arr, {ToType: BinaryTypes.BinaryView}) and inverse.
  • binary_viewstring_view casts.
  • FSB → binary_view/string_view.
  • compute.CanCast advertises all of the above.

No existing behavior changes. The change to getNumBuffers could in principle affect other compute paths that manipulate view arrays, but the previous value of 2 was incorrect for views (causing data-buffer loss), so this is a bug fix rather than a behavioral change.

zeroshade added 20 commits May 5, 2026 12:37
…ring_view

Adds the missing cast kernels so that compute.CastArray (and the 'cast'
compute function) can convert between the view variants (binary_view,
string_view) and the existing base-binary types (binary, large_binary,
string, large_string, fixed_size_binary). Registers three new cast
paths:

  * base-binary -> view (CastBinaryToBinaryView)
  * view -> base-binary (CastBinaryViewToBinary)
  * view <-> view (CastBinaryViewToBinaryView, zero-copy)

UTF-8 validation mirrors the behavior of the existing binary->string
kernels: when casting a non-utf8 source into a utf8 destination, every
non-null value is validated unless CastOptions.AllowInvalidUtf8 is set.

Also fixes two related gaps in the compute/array plumbing that
otherwise prevent view types from flowing through generic code paths:

  * arrow/compute/exec.getNumBuffers now returns 3 for binary_view and
    string_view (bitmap + view headers + a single overflow data buffer),
    matching ArraySpan's [3]BufferSpan limit. Previously it fell through
    to the default case and dropped the data buffer, which crashed any
    cast that walked non-inline view values through input.MakeArray().

  * arrow/array.getMaxBufferLen and the nullArrayFactory learn about
    BinaryViewDataType so that MakeArrayOfNull (called by the executor
    when building empty-input results) works for view outputs.

Tests exercising short (<=12 byte inline) and long (out-of-line) values,
null handling, UTF-8 validation, CanCast coverage, and FSB -> view
are added to arrow/compute/cast_test.go. The tests use a direct array
comparison helper rather than the existing checkCast -- that helper
iterates scalar.GetScalar, which does not yet support view types
(a separate limitation outside the scope of this change).

Fixes apache#184
CastBinaryToBinaryView relied on BinaryViewBuilder's default 32KB block
size. As soon as the cumulative out-of-line (>12 byte) payload of the
cast output exceeded that, the builder allocated a second overflow
block, producing a 4-buffer Array. exec.ArraySpan only has three
BufferSpan slots, so out.TakeOwnership(result.Data()) indexed past
Buffers[2] and panicked.

The kernel now walks the input once to compute the total out-of-line
payload, errors out if it exceeds math.MaxInt32 (the single-buffer
offset limit on ViewHeader.BufferOffset), and pre-reserves the
builder's data buffer to exactly that size before appending. With the
builder's first block sized to fit everything, subsequent per-value
ReserveData calls return without allocating new blocks, guaranteeing a
single data buffer in the output.

Also wires a reserveData callback through the internal binaryAppender
helper. *StringViewBuilder shadows the embedded BinaryViewBuilder's
Append([]byte) with its own Append(string), so it doesn't satisfy the
array.BinaryLikeBuilder interface - type-switching via binaryAppender
is the same pattern already used for appendBytes.

Adds TestBinaryLikeToBinaryViewLargePayload which casts 500x200-byte
values (100 KB total out-of-line, well over the 32KB block default) to
both string_view and binary_view, verifies no panic, correct content,
and that the output fits in \u22643 buffers.

Addresses review 809 feedback on apache#802.
Addresses two more multi-buffer-view failure modes caught in review
811 of the original apacheGH-184 PR:

1. Multi-buffer view **inputs**. exec.ArraySpan is a fixed [3]BufferSpan,
   so a BinaryView/StringView array with more than one overflow data
   buffer (e.g. built via BinaryViewBuilder with >32KB of non-inline
   payload) panics in SetMembers before any kernel runs. The cast
   meta-function now calls coalesceMultiBufferViewDatum before dispatch;
   if the input datum is a view with >3 buffers, it rebuilds the array
   with a single overflow buffer (sized via ReserveData) and substitutes
   that into the argument list.

2. Numeric/temporal casts to string_view. addNumericAndTemporalToStringCasts
   reuses the shared numericToStringCastExec/timeToStringCastExec kernels
   for string_view output. Those kernels did not reserve data capacity,
   so StringViewBuilder could spill into a second overflow block and
   produce a 4-buffer result, which would then panic in TakeOwnership.
   Both kernels now call bldr.ReserveData(input.Len * maxFormattedBytes)
   based on a per-type upper bound derived from the formatters. This
   guarantees the builder's first block is large enough to hold the
   entire formatted output, so no spill occurs. castTimestampToString
   already handled this via its existing strlen/ReserveData pattern;
   boolToStringCastExec is safe because bool formats to at most 5 bytes
   (always inline).

Adds three regression tests under arrow/compute/cast_test.go:
- TestMultiBufferViewInputCoalesced: builds multi-buffer BinaryView/
  StringView inputs and casts them to binary/utf8/string_view.
- TestNumericToStringViewLargePayload: casts 5000 int64/timestamp/time64
  values to string_view and asserts the output stays within one data
  buffer.
- TestBinaryLikeToBinaryViewLargePayload (from the earlier fix)
  continues to cover the output-side pre-reservation path.

Addresses review 811 feedback on apache#802.
…flow

Addresses three more gaps in view-cast support caught in review 814 of
the apacheGH-184 PR:

1. **Chunked view inputs bypassed coalescing (HIGH).**
   coalesceMultiBufferViewDatum only handled *ArrayDatum. A ChunkedDatum
   whose chunks are multi-buffer BinaryView/StringView still crashed when
   the executor called SetMembers on a second chunk. The meta-function
   helper now walks ChunkedDatum.Chunks(), rebuilds any multi-buffer
   chunk via rebuildViewSingleBuffer, retains single-buffer chunks
   as-is, and returns a new *ChunkedDatum.

2. **maxFormattedBytes(FLOAT16) bound was 12 (MEDIUM).**
   float16.Num.String() calls strconv.FormatFloat(..., 32) which can emit
   up to 15 bytes. Raised the bound to 16. (No direct cast test added
   because FLOAT16 -> string_view isn't registered through
   addNumericAndTemporalToStringCasts - numericTypes excludes Float16 -
   so the bound is defensive for the eventual case where it is.)

3. **Numeric/temporal ReserveData could overflow int32 (MEDIUM).**
   numericToStringCastExec, timeToStringCastExec, and castTimestampToString
   now call a new reserveFormattedData helper that computes
   (Len - Nulls) * perValueBytes in int64 and returns arrow.ErrInvalid
   when the result exceeds math.MaxInt32, rather than panicking inside
   BinaryViewBuilder.ReserveData for very large inputs.

Tests:

- TestChunkedMultiBufferViewInputCoalesced (arrow/compute/cast_test.go):
  2 multi-buffer StringView chunks cast to utf8, verifies correct
  content in each output chunk.
- TestReserveFormattedDataOverflowGuard
  (arrow/compute/internal/kernels/string_casts_test.go):
  unit-tests the overflow guard directly against a fake ArraySpan so we
  don't have to allocate >2GB through the full executor to trigger it.

Addresses review 814 feedback on apache#802.
reserveFormattedData hard-failed any payload above MaxInt32 regardless of
the destination builder, rejecting valid large_utf8 casts that LargeStringBuilder
could represent via int64 offsets. Split the limit per builder: MaxInt32 for
StringBuilder/StringViewBuilder (int32 offsets / single overflow buffer),
MaxInt64 for LargeStringBuilder. Also fix a double-release in
TestChunkedMultiBufferViewInputCoalesced: ChunkedDatum.Chunks() returns the
slice owned by the datum, so per-chunk Release() under defer out.Release()
double-freed the arrays.
…ilds

formattedDataLimit returned math.MaxInt64 for LargeStringBuilder, but
reserveFormattedData later casts total to int for ReserveData. On 32-bit
builds where int is 32-bit, any total above math.MaxInt32 would overflow
the conversion, bypassing the guard and producing a negative or wrapped
value inside BinaryBuilder.ReserveData. Clamp the returned limit to
int64(math.MaxInt) so it always fits the platform's int. On 64-bit
builds this is a no-op (MaxInt == MaxInt64); on 32-bit it caps
large_utf8 at MaxInt32. Added TestFormattedDataLimitFitsInPlatformInt
regression test that holds on both platforms by construction.
TestReserveFormattedDataAllowsLargeUtf8AboveInt32 previously asserted
formattedDataLimit(large_utf8) > MaxInt32 unconditionally. After the
32-bit clamp landed, formattedDataLimit clamps the large_utf8 limit to
math.MaxInt32 on 32-bit builds, so the test would fail on the very
platform the clamp is meant to protect. Split the assertion: on 64-bit
require > MaxInt32 (MaxInt64 exposes the int64 offset capacity); on
32-bit require == MaxInt32 (the clamp-enforced ceiling).
unpackDictionary previously expanded every dictionary through TakeArray,
but array_take registers no kernel for binary_view/string_view, so casts
whose input was a dictionary with view-typed values failed with
'function array_take has no kernel matching input types (string_view, int32)'.
Detect view-typed dictionary values and materialize them directly with
StringViewBuilder/BinaryViewBuilder, preserving index-level nulls.
Regression test TestCasts/TestViewDictionaryUnpackCast covers
string_view and binary_view dictionaries through cast to utf8,
string_view, binary, and with nulls. array_take itself still lacks
view-type kernels; that is tracked separately.
…ries

unpackViewDictionary only checked index-level nulls (dictArr.IsNull(i)),
so a dictionary whose values array contained a null entry would
materialize that slot as an empty string/byte slice instead of a null.
Also check vals.IsNull(idx) for each referenced dictionary slot, which
matches TakeArray's behavior on the non-view path. Extended the existing
regression test with a 'null in values array' subtest.
…casts

Two remaining view-producing paths could emit multi-buffer view arrays
and blow past exec.ArraySpan's fixed [3]BufferSpan. unpackViewDictionary
now pre-scans the dictionary to sum non-inline payload, rejects totals
above MaxInt32, and calls ReserveData up-front so the unpacked view
array lives in a single overflow buffer even on the dict<view> -> view
type-equal fast path. boolToStringCastExec now matches the numeric and
temporal kernels: it Reserves header slots and pre-reserves up to 5
bytes per row (the longer of 'true'/'false') via reserveFormattedData.
Added a large-non-inline-payload regression subtest to
TestViewDictionaryUnpackCast and TestBoolToStringViewStaysSingleBuffer
asserting <= 3 output buffers.
…view dict test

Two follow-ups from the apacheGH-184 review rotation:

1) boolToStringCastExec previously reserved 5*nonNull bytes unconditionally,
   which treated every row as 'false'. For large all-true casts the true
   payload is 4*nonNull, and the looser bound would reject valid inputs
   near MaxInt32. Count exact bytes by walking the data bitmap on
   non-null positions ('true'=4, 'false'=5) and pre-reserve exactly that
   total. Extracted reserveFormattedDataExact(bldr, total) so the per-value
   and exact paths share the single-buffer limit check.
2) TestViewDictionaryUnpackCast only exercised the string_view fast path;
   added a mirror binary_view-dict-to-binary_view subtest with 100KB of
   non-inline payload, asserting <=3 output buffers and value correctness.
   Extended TestBoolToStringViewStaysSingleBuffer with all-true, all-false,
   and alternating cases that assert the output data buffer equals the
   exact formatted payload size.
Job 824 flagged that TestViewDictionaryUnpackCast/string_view_dict_with_null_in_values_array
only exercised the string_view branch of unpackViewDictionary; the
binary_view branch had the same vals.IsNull(idx) logic but no test, so
it could regress independently. Added a mirror subtest with a
*BinaryView dictionary whose second value is null and indices that
reference it twice, asserting the output null count and per-element
null mask match.
…values

For view destinations (string_view / binary_view), values that fit inside
the view header's inline slot consume no overflow data, so there is no
overflow buffer to reserve and no per-buffer limit to enforce. Add an
inline-skip in reserveFormattedData so large casts whose per-value upper
bound is inline (bool via 5 bytes, int8..int32 via <=11, date32, etc.)
are not rejected against a limit they never touch. Apply the same logic
in boolToStringCastExec's exact-count path. Added
TestReserveFormattedDataInlineViewSkip asserting inline pass-through at
the bool / boundary (11) and the overflow-buffer limit firing at 12+.
Also restored TestReserveFormattedDataAllowsLargeUtf8AboveInt32's
platform-aware branching which an earlier edit accidentally reverted.
…line

ViewHeader.IsInline uses size <= 12 but arrow.IsViewInline uses strict
< 12, so a 12-byte formatted value (time32[ms] 'HH:MM:SS.sss') is stored
inline by the builder yet my earlier guard treated it as non-inline and
rejected the cast against the overflow-buffer limit. Introduce a local
viewHeaderStoresInline helper (n <= 12) that matches the storage rule,
use it in reserveFormattedData's inline-skip, and add TestCasts/
TestNumericToStringViewLargePayload/time32[ms]_to_string_view_stays_inline
as the cast-level smoke test. TestReserveFormattedDataInlineViewSkip now
asserts that perValueBytes=12 passes inline-skip and 13 hits the limit.
ViewHeader documents inlining for '12 bytes or fewer' and ViewHeader.IsInline
implements that as size <= viewInlineSize, but the exported
arrow.IsViewInline used strict '<'. The inconsistency caused downstream
view-payload accounting (reserveFormattedData, rebuildViewSingleBuffer,
sumViewDictionaryOutOfLine, the string/binary->view cast kernel) to
treat 12-byte values as non-inline and count them against the
single-buffer limit even though the builder stores them inline, so
large time32[ms] -> string_view casts were falsely rejected. Fixing
IsViewInline at the source removes the mismatch without touching
downstream call sites or introducing a parallel helper. Removed my
local viewHeaderStoresInline helper added earlier and routed the
inline-skip in reserveFormattedData back through arrow.IsViewInline.
Tightened TestCasts/TestNumericToStringViewLargePayload/
time32[ms]_to_string_view_stays_inline to assert Buffers()[2] is
empty/absent so the test actually exercises the inline path.
exec.ArraySpan.SetMembers recurses into dictionary children, so a
dict<string_view>/dict<binary_view> whose dictionary values are
multi-buffer trips the [3]BufferSpan limit before unpackViewDictionary
gets a chance to run. needsViewCoalesce now recurses into dictionary
values, and coalesceMultiBufferViewDatum dispatches via a new
coalesceArrayData helper that rebuilds either a top-level view array or
a dictionary whose values are a view array (reusing the existing index
buffers via array.NewDataWithDictionary). Added
TestCasts/TestViewDictionaryUnpackCast/dict<string_view_multi-buffer_values>_to_utf8
regression test that builds a dictionary with 2x20KB unique values
(guaranteed multi-buffer with the default 32KB block size) and
asserts the cast completes with the expected values.
exec.ArraySpan.SetMembers recurses through both Dictionary() and
Children(), so a nested type (list<string_view>, struct{f: binary_view},
list<dict<string_view>>, etc.) whose descendant view array is
multi-buffer still hits the fixed [3]BufferSpan limit before dispatch.
Extended needsViewCoalesce to walk Children() as well, and
coalesceArrayData now rebuilds non-view/non-dictionary nested arrays by
keeping their own buffers, recursively coalescing only the children that
need it (retaining the others), and reconstructing via array.NewData.
Added TestCasts/TestViewDictionaryUnpackCast/list<string_view_multi-buffer_child>_to_list<utf8>
regression test that builds a list<string_view> with a multi-buffer
values array (4x20KB unique entries) and asserts the list cast
completes end-to-end with the expected values.
…scing

Extension arrays reuse their underlying storage buffers/children in
place, so exec.ArraySpan.SetMembers iterates the storage layout
directly and trips the [3]BufferSpan limit before CastFromExtension
unwraps anything. needsViewCoalesce now reads through
ExtensionType.StorageType() during the view/dictionary switch, and
coalesceArrayData unwraps extension inputs, recursively coalesces the
storage-typed ArrayData, and rewraps with the original extension
datatype so downstream dispatch still sees the extension.
Added StringViewExtType to the internal test types package (extension
backed by string_view) and TestCasts/TestViewDictionaryUnpackCast/
extension<string_view_multi-buffer>_to_utf8 regression test that
registers the extension, builds a multi-buffer storage array, wraps
it, and casts to utf8 end-to-end.
array.NewData does not carry a Data's Dictionary() field, so the
previous extension-type unwrap/rewrap path dropped the dictionary for
extension<dictionary<...view...>>: the recursive DICTIONARY branch
inside coalesceArrayData then saw a nil Dictionary() and panicked or
produced malformed data. Routed both the unwrap and the rewrap through
a new reshapeArrayDataType helper that honours Dictionary() via
array.NewDataWithDictionary when present. Added
TestCoalesceArrayDataExtensionOverDictOfView as an internal-package
regression (direct coalesceArrayData call) because the full cast
pipeline for extension<dict<...>> hits a pre-existing SetMembers gap
(exec.ArraySpan.SetMembers does not recognize a dictionary under an
extension type) that is unrelated to view coalescing. Added
DictStringViewExtType to internal/types to support the regression.
@zeroshade zeroshade requested review from amoeba, kou and lidavidm May 6, 2026 15:02
zeroshade added 3 commits May 6, 2026 11:18
Pulls the three branch-introduced duplicate patterns identified by
roborev analyze duplication into shared helpers.

1) validateUtf8, validateUtf8Fsb, and validateUtf8View (in
   kernels/binary_view_casts.go) shared the same VisitBitBlocksShort +
   utf8.Valid + error-format skeleton and differed only in how they
   obtained the byte slice per row. Extracted validateUTF8Sequence in
   kernels/string_casts.go; each validator now composes it with its own
   valueAt closure. Removed the now-unused utf8/bitutils imports in
   binary_view_casts.go.

2) rebuildViewSingleBuffer and sumViewDictionaryOutOfLine duplicated the
   'sum out-of-line bytes, error on > MaxInt32' loop. Collapsed them
   into a single sumOutOfLineBytes(n, isNull, valueLen) helper.
   unpackViewDictionary now calls the helper with closures that route
   dictionary positions through GetValueIndex.

3) unpackViewDictionary's StringView and BinaryView branches were
   ~95% similar copies of the same null-aware append loop. Extracted
   buildFromDictionary(dictArr, valIsNull, appendValue, appendNull) so
   both branches share the walk and the two-level null rule.

4) Promoted the TestViewDictionaryUnpackCast local buildDict closure
   (9 call sites within one test) to CastSuite.buildInt32Dictionary so
   it is discoverable and reusable across the suite.

Net LOC change is close to zero (code moved into named helpers rather
than removed), but the cognitive and copy/paste-drift surface area
drops substantially. All apacheGH-184 regression tests still pass.
Second pass driven by the re-run roborev analyze duplication job (842):
two residual branch-introduced patterns were still 60-97% similar after
the first cleanup. Both are consolidated without changing behavior.

1) unpackViewDictionary's StringView and BinaryView branches shared the
   sum-outOfLine + builder.Reserve + ReserveData + buildFromDictionary
   pipeline verbatim; only the concrete builder type and Value() return
   type differed. Extracted unpackViewDictionaryIntoBuilder which takes
   a viewValuesNull (the part of the dictionary-values interface that
   can be unified in Go) plus closures for reserveData / appendValue /
   appendNull keyed to the concrete builder. The two switch arms are
   now just 'new builder + release + call helper + NewArray'.

2) CastBinaryToBinaryView and CastBinaryViewToBinary shared a
   newBinaryAppender + per-type value accessor + null-preserving
   append-loop skeleton. Extracted binaryLikeValueAccessor for the
   input type dispatch (covers all 7 binary-like arrow array types in
   one place) and appendBinaryValues for the null-routing loop. Each
   cast kernel still owns its own direction-specific validation and
   size-limit check (view->binary uses MaxOf[OutOffsetT]; binary->view
   uses the overflow-buffer MaxInt32 limit), so the correctness-
   sensitive pieces stay explicit and auditable.

Net LOC is slightly higher because the new helpers carry docstrings,
but the copy-paste-drift surface is substantially smaller and the
six-fold similarity the reviewer flagged is gone.
… adapter

Third-round roborev analyze complexity job (848) flagged the helper's
7-parameter signature with 3 callback parameters as harder to read than
its ~30 LOC warrants. Packaged the builder + reserveData + appendValue
+ appendNull closures into a viewDictBuilderAdapter struct, dropping
the signature to 4 parameters. Each switch arm now constructs the
adapter with a named-field initializer, making the role of each
closure obvious at the call site. No behaviour change; all existing
regression tests pass unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

unsupported cast to string_view from utf8 in v18

1 participant