Problem
When importing text into a File Search Store, the current chunking_config only supports white_space_config, which assumes tokens are separated by spaces.
This works poorly for languages that do not put spaces between words (e.g., Japanese, Chinese), and mixed-language content (e.g., Japanese + English) is also chunked inconsistently.
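For reference, a minimal sketch of how chunking is configured today. This assumes the google-genai Python SDK and the documented upload_to_file_search_store / white_space_config fields; exact names may differ by SDK version:

```python
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Create a store and upload a document with the current white-space-based
# chunking options. max_tokens_per_chunk / max_overlap_tokens are counted
# over space-separated tokens, which breaks down for Japanese/Chinese text.
store = client.file_search_stores.create(config={"display_name": "demo-store"})

operation = client.file_search_stores.upload_to_file_search_store(
    file="mixed_ja_en.txt",
    file_search_store_name=store.name,
    config={
        "chunking_config": {
            "white_space_config": {
                "max_tokens_per_chunk": 200,
                "max_overlap_tokens": 20,
            }
        }
    },
)
```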
Desired Behavior
- Allow max_tokens_per_chunk and max_overlap_tokens to be specified and honored correctly regardless of whether the text contains word boundaries.
- Improve control over chunk size and boundaries for mixed-language text; one possible shape for this is sketched after this list.
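Purely as an illustration of the requested behavior, a hypothetical language-agnostic chunking option could count tokenizer tokens instead of space-separated words. The token_based_config field below is invented for this issue and does not exist in the current API:

```python
# Hypothetical: token_based_config is NOT an existing API field; it is
# sketched here only to illustrate the desired behavior.
operation = client.file_search_stores.upload_to_file_search_store(
    file="mixed_ja_en.txt",
    file_search_store_name=store.name,
    config={
        "chunking_config": {
            "token_based_config": {
                # Limits counted in tokenizer tokens, so they hold for
                # Japanese/Chinese text with no word separators.
                "max_tokens_per_chunk": 200,
                "max_overlap_tokens": 20,
            }
        }
    },
)
```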