Support improved chunking_config in File Search Store for languages without word spacing (e.g., Japanese/Chinese) #1935

@motomura-s

Description

Problem

When importing text into a File Search Store, the current chunking_config only supports white_space_config, which assumes space-separated tokens.
This does not work well for languages that do not put spaces between words (e.g., Japanese, Chinese), and mixed-language content (e.g., Japanese + English) is also not chunked consistently.
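For reference, this is roughly how chunking is configured today with the google-genai Python SDK (method and field names follow the current File Search documentation; the store name, file path, and token limits below are illustrative and may differ by SDK version):

```python
from google import genai

client = genai.Client()

# Create a File Search Store and import a document into it.
store = client.file_search_stores.create(config={'display_name': 'docs-store'})

# white_space_config is currently the only chunking strategy. Its token
# counting assumes space-separated words, so the limits below are not
# applied reliably to Japanese or Chinese text.
client.file_search_stores.upload_to_file_search_store(
    file='manual_ja.txt',  # illustrative file name
    file_search_store_name=store.name,
    config={
        'chunking_config': {
            'white_space_config': {
                'max_tokens_per_chunk': 200,
                'max_overlap_tokens': 20,
            }
        }
    },
)
```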

Desired Behavior

  • Honor max_tokens_per_chunk and max_overlap_tokens correctly regardless of whether the text contains word boundaries.
  • Provide better control over chunk size and boundaries for mixed-language text (a client-side workaround is sketched below).
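Until the API supports this natively, one possible client-side workaround is to pre-chunk CJK text by character count with overlap before uploading each piece. This is only a sketch of the idea, not part of the API; the chunk_cjk_text helper and its size/overlap values are hypothetical:

```python
def chunk_cjk_text(text: str, max_chars: int = 400, overlap_chars: int = 40) -> list[str]:
    """Split text without relying on whitespace word boundaries.

    Character-based splitting is a rough proxy for token limits in
    Japanese/Chinese, where each character maps to one or more tokens.
    """
    if overlap_chars >= max_chars:
        raise ValueError('overlap_chars must be smaller than max_chars')
    chunks = []
    step = max_chars - overlap_chars
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars]
        if chunk:
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break
    return chunks

# Example: 1000 characters of spaceless text still yields bounded,
# overlapping chunks, unlike whitespace-based chunking.
print([len(c) for c in chunk_cjk_text('あ' * 1000)])  # [400, 400, 280]
```

A natural refinement would be to prefer splitting at sentence-ending punctuation (。！？) near the chunk boundary, but the fixed-width version above already keeps chunk sizes bounded.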

Metadata

Labels

  • priority: p3 (Desirable enhancement or fix. May not be included in next release.)
  • type: feature request ('Nice-to-have' improvement, new feature or different behavior or design.)
