@einsitang einsitang commented Jun 30, 2025

AloneTokenOption

By default, when letters immediately follow a defined token's value in the input, the tokenizer still matches the token and splits the rest off as a separate token.

...
tokenizer.DefineTokens(HelloKey, []string{"hello"})
...
input := "helloworld"
stream := tokenizer.ParseString(input)

for stream.IsValid() {
  token := stream.CurrentToken()
  // yields two tokens in turn: "hello", then "world"
  stream.GoNext()
}

With AloneTokenOption:

...
tokenizer.DefineTokens(HelloKey, []string{"hello"}, AloneTokenOption)
...
input := "helloworld"
stream := tokenizer.ParseString(input)

for stream.IsValid() {
  token := stream.CurrentToken()
  // yields a single token: "helloworld"
  stream.GoNext()
}

With AloneTokenOption, hello is matched only when it stands alone, not when it is immediately followed by other letters.
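
The "alone" rule described above can be sketched as a boundary check on the character that follows the match. This is only an illustration of the idea, not the code in this pull request: `matchAlone` is a hypothetical helper, and treating "letter" as `unicode.IsLetter` is an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// matchAlone reports whether token occurs at the start of input and is not
// immediately followed by a letter. Hypothetical helper for illustration;
// it is not part of the tokenizer package.
func matchAlone(input, token string) bool {
	if !strings.HasPrefix(input, token) {
		return false
	}
	rest := input[len(token):]
	if rest == "" {
		// Nothing follows the token, so it stands alone.
		return true
	}
	// Decode the next rune (not byte) so multi-byte letters are handled.
	next, _ := utf8.DecodeRuneInString(rest)
	return !unicode.IsLetter(next)
}

func main() {
	fmt.Println(matchAlone("helloworld", "hello"))  // false
	fmt.Println(matchAlone("hello world", "hello")) // true
	fmt.Println(matchAlone("hello", "hello"))       // true
}
```

The sketch only looks at what follows the token, because that is the only case the description mentions.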

IgnoreCaseTokenOption

Makes the token value match case-insensitively (#12).

Usage example:

tokenizer.DefineTokens(HelloKey, []string{"hello"}, IgnoreCaseTokenOption)
...

Non-breaking Change: tokenizer.DefineTokens API

@bzick
Owner

bzick commented Jun 30, 2025

It doesn't work with Unicode, but Unicode support is one of the main features.

@einsitang
Author

It doesn't work with Unicode, but Unicode support is one of the main features.

Do you mean that IgnoreCase doesn't work with Unicode?

I checked the encoding table. Both Unicode and ASCII have lowercase letters, and the difference between an uppercase letter and its lowercase form is 32.

@einsitang
Author

but Unicode support is one of the main features.

What is the current progress on the Unicode feature?

@bzick
Owner

bzick commented Jul 7, 2025

I mean:
• that the shift by 32 doesn’t work with alphabets of other languages
• you’re working with only 1 byte and as a result…
• … isAlphabet checks 1 byte, but it should work with runes (multi-byte)
and so on and so forth
See https://symbl.cc/en/unicode-table/
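
To make the first point concrete, here is a small sketch comparing the fixed shift by 32 with the standard library's `unicode.ToLower`. The helper name `caseOffset` and the sample letters are chosen for illustration only.

```go
package main

import (
	"fmt"
	"unicode"
)

// caseOffset returns the distance between an uppercase rune and its
// lowercase form according to the Unicode tables in the standard library.
func caseOffset(r rune) rune {
	return unicode.ToLower(r) - r
}

func main() {
	// 'A' is ASCII, 'Ā' is Latin Extended-A, 'Ա' is Armenian.
	for _, r := range []rune{'A', 'Ā', 'Ա'} {
		fmt.Printf("%c: r+32 gives %c, unicode.ToLower gives %c (offset %d)\n",
			r, r+32, unicode.ToLower(r), caseOffset(r))
	}
}
```

The offset is 32 for 'A' but 1 for 'Ā' and 48 for 'Ա', so a fixed shift lands on an unrelated character outside the ranges where it happens to hold. For comparing whole strings, `strings.EqualFold` applies Unicode simple case folding.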

@einsitang
Author

The problem you mentioned does indeed exist.

However, when the tokenizer performs the defineToken operation, it indexes the token by its first byte, and during parsing it also advances in byte units. The indexing and advancing would need to be changed to use runes before I can perform the check correctly.
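
The difference between byte-based and rune-based stepping can be sketched as follows. This does not reproduce the tokenizer's internals; `firstByte` and `runeSteps` are hypothetical helpers that only show what each approach sees for a multi-byte string.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstByte returns the key a byte-oriented index would use for s.
// s must be non-empty.
func firstByte(s string) byte {
	return s[0]
}

// runeSteps counts how many steps it takes to walk s one rune at a time,
// advancing by each rune's encoded width instead of by one byte.
func runeSteps(s string) int {
	steps := 0
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		i += size
		steps++
	}
	return steps
}

func main() {
	s := "哈喽"
	fmt.Println(len(s))       // 6: number of bytes
	fmt.Println(runeSteps(s)) // 2: number of runes
	r, size := utf8.DecodeRuneInString(s)
	fmt.Printf("first rune %c is %d bytes wide, first byte is 0x%X\n", r, size, firstByte(s))
}
```

A byte-wise walk over the same string takes six steps and never sees a complete character, which is why a per-byte letter check cannot classify it.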

@einsitang
Author

There is a new implementation method.

IgnoreCaseTokenOption only supports alphabetic tokens; defining a token that contains other characters will panic.

tokenizer.DefineTokens(HelloKey, []string{"hello", "哈喽"}, IgnoreCaseTokenOption) // panics, because "哈喽" is not alphabetic
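
A validation step of this kind might look like the sketch below. It assumes that "alphabet" here means the ASCII letters A-Z and a-z, which the comment does not state explicitly; `isASCIILetters` and `mustBeASCIILetters` are hypothetical names, not functions from the pull request.

```go
package main

import "fmt"

// isASCIILetters reports whether token is non-empty and consists only of
// ASCII letters. Assumption: "alphabet" means A-Z and a-z.
func isASCIILetters(token string) bool {
	if token == "" {
		return false
	}
	for _, r := range token {
		isLower := r >= 'a' && r <= 'z'
		isUpper := r >= 'A' && r <= 'Z'
		if !isLower && !isUpper {
			return false
		}
	}
	return true
}

// mustBeASCIILetters panics on the first token that fails the check,
// mirroring the panic described above (hypothetical helper).
func mustBeASCIILetters(tokens []string) {
	for _, tok := range tokens {
		if !isASCIILetters(tok) {
			panic(fmt.Sprintf("token %q is not alphabetic", tok))
		}
	}
}

func main() {
	fmt.Println(isASCIILetters("hello")) // true
	fmt.Println(isASCIILetters("哈喽"))    // false

	// Recover so the demo prints the panic value instead of crashing.
	defer func() { fmt.Println("recovered:", recover()) }()
	mustBeASCIILetters([]string{"hello", "哈喽"})
}
```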

@n-peugnet
Contributor

Non-breaking Change: tokenizer.DefineTokens API

This is a breaking API change, though (see: https://go.dev/blog/module-compatibility#adding-to-a-function): adding a variadic parameter changes the function's type, which breaks callers that rely on that type.

You should add a new function, for example DefineTokensOptions(), and have DefineTokens() call it underneath.
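
A minimal sketch of that suggestion is shown here. All types are simplified stand-ins declared locally so the example compiles on its own; the real tokenizer's types and signatures may differ, and `DefineTokensOptions` is the name proposed in this comment, not an existing function.

```go
package main

import "fmt"

// Stand-in types for illustration only.
type TokenKey int
type TokenOption int

const (
	AloneTokenOption TokenOption = iota + 1
	IgnoreCaseTokenOption
)

type definition struct {
	tokens  []string
	options []TokenOption
}

type Tokenizer struct {
	defs map[TokenKey]definition
}

func New() *Tokenizer {
	return &Tokenizer{defs: make(map[TokenKey]definition)}
}

// DefineTokens keeps its original two-parameter signature, so its function
// type is unchanged and existing callers continue to compile.
func (t *Tokenizer) DefineTokens(key TokenKey, tokens []string) *Tokenizer {
	return t.DefineTokensOptions(key, tokens)
}

// DefineTokensOptions is the new entry point that accepts options.
func (t *Tokenizer) DefineTokensOptions(key TokenKey, tokens []string, options ...TokenOption) *Tokenizer {
	t.defs[key] = definition{tokens: tokens, options: options}
	return t
}

func main() {
	t := New()

	// Code that stored the method in a variable of the old function type
	// would stop compiling if DefineTokens itself gained a variadic parameter.
	var define func(TokenKey, []string) *Tokenizer = t.DefineTokens
	define(1, []string{"hello"})

	t.DefineTokensOptions(2, []string{"world"}, IgnoreCaseTokenOption)
	fmt.Println(len(t.defs)) // 2
}
```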
