Remove the outdated handle_kludgy_ordinals option from the CLI and
tokenization API. Kludgy ordinals (e.g. '1sti', '3ja') are now always
passed through unchanged as word tokens, which was the default behavior.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
README.md: 4 additions & 31 deletions
@@ -118,7 +118,6 @@ Other options can be specified on the command line:
 |`-g`, `--keep_composite_glyphs`| Do not replace composite glyphs using Unicode COMBINING codes with their accented/umlaut counterparts. |
 |`-e`, `--replace_html_escapes`| HTML escape codes replaced by their meaning, such as `&aacute;` -> `á`. |
 |`-c`, `--convert_numbers`| English-style decimal points and thousands separators in numbers changed to Icelandic style. |
-|`-k N`, `--handle_kludgy_ordinals N`| Kludgy ordinal handling defined. 0: Returns the original mixed word form, 1. Kludgy ordinal returned as pure word forms, 2: Kludgy ordinals returned as pure numbers. |
 
 Type `tokenize -h` or `tokenize --help` to get a short help message.
 
@@ -453,31 +452,6 @@ functions:
 
 The default value for the `replace_html_escapes` option is `False`.
 
-*`handle_kludgy_ordinals=[value]`
-
-This options controls the way Tokenizer handles 'kludgy' ordinals, such as
-*1sti*, *4ðu*, or *2ja*. By default, such ordinals are returned unmodified
-('passed through') as word tokens (`TOK.WORD`).
-However, this can be modified as follows:
-
-*`tokenizer.KLUDGY_ORDINALS_MODIFY`: Kludgy ordinals are corrected
-to become 'proper' word tokens, i.e. *1sti* becomes *fyrsti* and
-*2ja* becomes *tveggja*.
-
-*`tokenizer.KLUDGY_ORDINALS_TRANSLATE`: Kludgy ordinals that represent
-proper ordinal numbers are translated to ordinal tokens (`TOK.ORDINAL`),
-with their original text and their ordinal value. *1sti* thus
-becomes a `TOK.ORDINAL` token with a value of 1, and *3ja* becomes
-a `TOK.ORDINAL` with a value of 3.
-
-*`tokenizer.KLUDGY_ORDINALS_PASS_THROUGH` is the default value of
-the option. It causes kludgy ordinals to be returned unmodified as
-word tokens.
-
-Note that versions of Tokenizer prior to 1.4 behaved as if
-`handle_kludgy_ordinals` were set to
-`tokenizer.KLUDGY_ORDINALS_TRANSLATE`.
-
 ## Dash and Hyphen Handling
 
 Tokenizer distinguishes between three dash types and handles them contextually:
@@ -578,9 +552,8 @@ with the following exceptions:
 can be disabled; see the `replace_composite_glyphs` option described
 above.)
 
-* If the appropriate options are specified (see above), it converts
-kludgy ordinals (*3ja*) to proper ones (*þriðja*), and English-style
-thousand and decimal separators to Icelandic ones
+* If the `convert_numbers` option is specified (see above), English-style
+thousand and decimal separators are converted to Icelandic ones
 (*10,345.67* becomes *10.345,67*).
 
 * If the `replace_html_escapes` option is set, Tokenizer replaces
@@ -812,8 +785,8 @@ can be found in the file `test/toktest_normal_gold_expected.txt`.
 `TOK.SERIALNUMBER` token kinds; abbreviations can now have multiple
 meanings.
 * Version 1.4.0: Added the `**options` parameter to the
-`tokenize()` function, giving control over the handling of numbers,
-telephone numbers, and 'kludgy' ordinals.
+`tokenize()` function, giving control over the handling of numbers
+and telephone numbers.
 * Version 1.3.0: Added `TOK.DOMAIN` and `TOK.HASHTAG` token types;
 improved handling of capitalized month name *Ágúst*, which is
 now recognized when following an ordinal number; improved recognition
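The reworded README bullet keeps the `convert_numbers` behavior, in which *10,345.67* becomes *10.345,67*. As a rough illustration of what that conversion amounts to, here is a minimal sketch. It is not Tokenizer's actual implementation: `to_icelandic_number` is a hypothetical helper, and it assumes the input string has already been identified as an English-style number.

```python
def to_icelandic_number(text: str) -> str:
    """Swap English-style number separators for Icelandic ones.

    Hypothetical helper for illustration only; assumes `text` is
    already known to be an English-style number such as "10,345.67".
    """
    # Translate both characters in a single pass so that neither
    # replacement clobbers the other: ',' -> '.' and '.' -> ','
    swap = str.maketrans(",.", ".,")
    return text.translate(swap)


print(to_icelandic_number("10,345.67"))  # prints 10.345,67
```

Doing the swap in one `str.translate` pass avoids the bug that two chained `str.replace` calls would introduce, where the second call would undo the first.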