Commit f9bb989

Remove deprecated --handle_kludgy_ordinals flag
Remove the outdated handle_kludgy_ordinals option from the CLI and tokenization API. Kludgy ordinals (e.g. '1sti', '3ja') are now always passed through unchanged as word tokens, which was the default behavior.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 28ce86e commit f9bb989

8 files changed: 13 additions & 211 deletions


README.md

Lines changed: 4 additions & 31 deletions
@@ -118,7 +118,6 @@ Other options can be specified on the command line:
 | `-g`, `--keep_composite_glyphs` | Do not replace composite glyphs using Unicode COMBINING codes with their accented/umlaut counterparts. |
 | `-e`, `--replace_html_escapes` | HTML escape codes replaced by their meaning, such as `&aacute;` -> `á`. |
 | `-c`, `--convert_numbers` | English-style decimal points and thousands separators in numbers changed to Icelandic style. |
-| `-k N`, `--handle_kludgy_ordinals N` | Kludgy ordinal handling defined. 0: Returns the original mixed word form, 1. Kludgy ordinal returned as pure word forms, 2: Kludgy ordinals returned as pure numbers. |
 
 Type `tokenize -h` or `tokenize --help` to get a short help message.
 
@@ -453,31 +452,6 @@ functions:
 
   The default value for the `replace_html_escapes` option is `False`.
 
-* `handle_kludgy_ordinals=[value]`
-
-  This options controls the way Tokenizer handles 'kludgy' ordinals, such as
-  *1sti*, *4ðu*, or *2ja*. By default, such ordinals are returned unmodified
-  ('passed through') as word tokens (`TOK.WORD`).
-  However, this can be modified as follows:
-
-  * `tokenizer.KLUDGY_ORDINALS_MODIFY`: Kludgy ordinals are corrected
-    to become 'proper' word tokens, i.e. *1sti* becomes *fyrsti* and
-    *2ja* becomes *tveggja*.
-
-  * `tokenizer.KLUDGY_ORDINALS_TRANSLATE`: Kludgy ordinals that represent
-    proper ordinal numbers are translated to ordinal tokens (`TOK.ORDINAL`),
-    with their original text and their ordinal value. *1sti* thus
-    becomes a `TOK.ORDINAL` token with a value of 1, and *3ja* becomes
-    a `TOK.ORDINAL` with a value of 3.
-
-  * `tokenizer.KLUDGY_ORDINALS_PASS_THROUGH` is the default value of
-    the option. It causes kludgy ordinals to be returned unmodified as
-    word tokens.
-
-  Note that versions of Tokenizer prior to 1.4 behaved as if
-  `handle_kludgy_ordinals` were set to
-  `tokenizer.KLUDGY_ORDINALS_TRANSLATE`.
-
 ## Dash and Hyphen Handling
 
 Tokenizer distinguishes between three dash types and handles them contextually:
@@ -578,9 +552,8 @@ with the following exceptions:
   can be disabled; see the `replace_composite_glyphs` option described
   above.)
 
-* If the appropriate options are specified (see above), it converts
-  kludgy ordinals (*3ja*) to proper ones (*þriðja*), and English-style
-  thousand and decimal separators to Icelandic ones
+* If the `convert_numbers` option is specified (see above), English-style
+  thousand and decimal separators are converted to Icelandic ones
   (*10,345.67* becomes *10.345,67*).
 
 * If the `replace_html_escapes` option is set, Tokenizer replaces
@@ -812,8 +785,8 @@ can be found in the file `test/toktest_normal_gold_expected.txt`.
   `TOK.SERIALNUMBER` token kinds; abbreviations can now have multiple
   meanings.
 * Version 1.4.0: Added the `**options` parameter to the
-  `tokenize()` function, giving control over the handling of numbers,
-  telephone numbers, and 'kludgy' ordinals.
+  `tokenize()` function, giving control over the handling of numbers
+  and telephone numbers.
 * Version 1.3.0: Added `TOK.DOMAIN` and `TOK.HASHTAG` token types;
   improved handling of capitalized month name *Ágúst*, which is
   now recognized when following an ordinal number; improved recognition
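The README change above documents the new invariant: kludgy ordinals are always passed through unchanged as word tokens. A minimal, self-contained sketch of that pass-through behavior (illustrative only — `toy_tokenize`, `KLUDGY_ORDINALS`, and the tuple output are invented here; the real `tokenizer.tokenize()` yields `Tok` objects):

```python
# Illustrative sketch of the documented pass-through behavior.
# Names here are invented; this is NOT the library's implementation.
KLUDGY_ORDINALS = {"1sti", "1sta", "2ja", "3ji", "3ja", "4ðu"}

def toy_tokenize(text: str) -> list[tuple[str, str]]:
    """Split on whitespace and classify each token; kludgy ordinals
    like '1sti' stay unchanged and are labelled WORD."""
    tokens = []
    for tok in text.split():
        if tok in KLUDGY_ORDINALS:
            tokens.append(("WORD", tok))      # passed through unchanged
        elif tok.replace(".", "").isdigit():
            tokens.append(("NUMBER", tok))
        else:
            tokens.append(("WORD", tok))
    return tokens

print(toy_tokenize("Hann var 1sti"))
# [('WORD', 'Hann'), ('WORD', 'var'), ('WORD', '1sti')]
```

Under the removed `KLUDGY_ORDINALS_MODIFY` setting, '1sti' would instead have come back as 'fyrsti'; after this commit only the pass-through path exists.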

src/tokenizer/__init__.py

Lines changed: 0 additions & 6 deletions
@@ -36,9 +36,6 @@
     TP_WORD,
     EN_DASH,
     EM_DASH,
-    KLUDGY_ORDINALS_PASS_THROUGH,
-    KLUDGY_ORDINALS_MODIFY,
-    KLUDGY_ORDINALS_TRANSLATE,
     BIN_Tuple,
     BIN_TupleList,
 )
@@ -80,9 +77,6 @@
     "EM_DASH",
     "EN_DASH",
     "generate_raw_tokens",
-    "KLUDGY_ORDINALS_MODIFY",
-    "KLUDGY_ORDINALS_PASS_THROUGH",
-    "KLUDGY_ORDINALS_TRANSLATE",
     "mark_paragraphs",
     "normalized_text_from_tokens",
     "normalized_text",

src/tokenizer/definitions.py

Lines changed: 2 additions & 30 deletions
@@ -605,20 +605,8 @@ class PersonNameTuple(NamedTuple):
 )
 
 
-# If the handle_kludgy_ordinals option is set to
-# KLUDGY_ORDINALS_PASS_THROUGH, we do not convert
-# kludgy ordinals but pass them through as word tokens.
-KLUDGY_ORDINALS_PASS_THROUGH = 0
-# If the handle_kludgy_ordinals option is set to
-# KLUDGY_ORDINALS_MODIFY, we convert '1sti' to 'fyrsti', etc.,
-# and return the modified word as a token.
-KLUDGY_ORDINALS_MODIFY = 1
-# If the handle_kludgy_ordinals option is set to
-# KLUDGY_ORDINALS_TRANSLATE, we convert '1sti' to TOK.Ordinal('1sti', 1), etc.,
-# but otherwise pass the original word through as a word token ('2ja').
-KLUDGY_ORDINALS_TRANSLATE = 2
-
-# Incorrectly written ('kludgy') ordinals
+# Incorrectly written ('kludgy') ordinals: these are passed through unchanged
+# as word tokens, but they need to be recognized so they are not parsed as numbers
 ORDINAL_ERRORS: Mapping[str, str] = {
     "1sti": "fyrsti",
     "1sta": "fyrsta",
@@ -639,22 +627,6 @@ class PersonNameTuple(NamedTuple):
     "4ra": "fjögurra",
 }
 
-# Translations of kludgy ordinal words into numbers
-ORDINAL_NUMBERS: Mapping[str, int] = {
-    "1sti": 1,
-    "1sta": 1,
-    "1stu": 1,
-    "3ji": 3,
-    "3ja": 3,
-    "3ju": 3,
-    "4ði": 4,
-    "4ða": 4,
-    "4ðu": 4,
-    "5ti": 5,
-    "5ta": 5,
-    "5tu": 5,
-}
-
 # Handling of Roman numerals
 
 RE_ROMAN_NUMERAL = re.compile(
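The new comment on `ORDINAL_ERRORS` above explains why the mapping survives even though its values are no longer substituted: the keys must still be recognized so that a token such as *3ja* is not split into the number 3 plus a stray word. A sketch of that failure mode (illustrative code only — `classify` is invented here, not the library's scanner):

```python
# Illustrative only: shows why kludgy-ordinal keys must stay recognized.
# A naive digit-first scanner would split '3ja' into NUMBER '3' + WORD 'ja'.
ORDINAL_ERRORS = {"1sti": "fyrsti", "3ja": "þriggja", "4ra": "fjögurra"}

def classify(token: str) -> list[tuple[str, str]]:
    if token in ORDINAL_ERRORS:
        return [("WORD", token)]  # recognized: passed through unchanged
    # Otherwise, eat leading digits the way a number scanner would
    i = 0
    while i < len(token) and token[i].isdigit():
        i += 1
    if i:
        parts = [("NUMBER", token[:i])]
        if token[i:]:
            parts.append(("WORD", token[i:]))
        return parts
    return [("WORD", token)]

print(classify("3ja"))   # [('WORD', '3ja')]
print(classify("3xy"))   # [('NUMBER', '3'), ('WORD', 'xy')]
```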

src/tokenizer/main.py

Lines changed: 0 additions & 16 deletions
@@ -149,19 +149,6 @@
     ),
 )
 
-parser.add_argument(
-    "-k",
-    "--handle_kludgy_ordinals",
-    type=int,
-    default=0,
-    help=(
-        "Kludgy ordinal handling defined.\n"
-        "\t0: Returns the original word form.\n"
-        "\t1: Ordinals returned as pure words.\n"
-        "\t2: Ordinals returned as numbers."
-    ),
-)
-
 parser.add_argument(
     "-v",
     "--version",
@@ -263,9 +250,6 @@ def val(t: Tok, quote_word: bool = False) -> Any:
     if args.one_sent_per_line:
         options["one_sent_per_line"] = True
 
-    if args.handle_kludgy_ordinals:
-        options["handle_kludgy_ordinals"] = args.handle_kludgy_ordinals
-
     if args.original:
         options["original"] = args.original
 
src/tokenizer/tokenizer.py

Lines changed: 7 additions & 36 deletions
@@ -1730,42 +1730,22 @@ def _is_letter(self, char: str) -> bool:
 class NumberParser:
     """Parses a sequence of digits off the front of a raw token"""
 
-    def __init__(
-        self, rt: Tok, handle_kludgy_ordinals: int, convert_numbers: bool
-    ) -> None:
+    def __init__(self, rt: Tok, convert_numbers: bool) -> None:
         self.rt = rt
-        self.handle_kludgy_ordinals = handle_kludgy_ordinals
         self.convert_numbers = convert_numbers
 
     def parse(self) -> Iterable[Tok]:
         """Parse the raw token, yielding result tokens"""
         # Handle kludgy ordinals: '3ji', '5ti', etc.
+        # Yield them unchanged as word tokens (pass-through behavior)
         rt = self.rt
-        handle_kludgy_ordinals = self.handle_kludgy_ordinals
         convert_numbers = self.convert_numbers
-        for key, val in ORDINAL_ERRORS.items():
+        for key in ORDINAL_ERRORS:
             rtxt = rt.txt
             if rtxt.startswith(key):
-                # This is a kludgy ordinal
+                # This is a kludgy ordinal: yield it unchanged as a word token
                 key_tok, rt = rt.split(len(key))
-                if handle_kludgy_ordinals == KLUDGY_ORDINALS_MODIFY:
-                    # Convert ordinals to corresponding word tokens:
-                    # '1sti' -> 'fyrsti', '3ji' -> 'þriðji', etc.
-                    key_tok.substitute_longer((0, len(key)), val)
-                    yield TOK.Word(key_tok)
-                elif (
-                    handle_kludgy_ordinals == KLUDGY_ORDINALS_TRANSLATE
-                    and key in ORDINAL_NUMBERS
-                ):
-                    # Convert word-form ordinals into ordinal tokens,
-                    # i.e. '1sti' -> TOK.Ordinal('1sti', 1),
-                    # but leave other kludgy constructs ('2ja')
-                    # as word tokens
-                    yield TOK.Ordinal(key_tok, ORDINAL_NUMBERS[key])
-                else:
-                    # No special handling of kludgy ordinals:
-                    # yield them unchanged as word tokens
-                    yield TOK.Word(key_tok)
+                yield TOK.Word(key_tok)
                 break  # This skips the for loop 'else'
         else:
             # Not a kludgy ordinal: eat tokens starting with a digit
@@ -1898,7 +1878,6 @@ def parse(self, rt: Tok) -> Iterable[Tok]:
 
 def parse_mixed(
     rt: Tok,
-    handle_kludgy_ordinals: int,
     convert_numbers: bool,
     replace_composite_glyphs: bool = True,
 ) -> Iterable[Tok]:
@@ -1994,7 +1973,7 @@ def parse_mixed(
         rtxt[0] in DIGITS_PREFIX
         or (rtxt[0] in SIGN_PREFIX and len(rtxt) >= 2 and rtxt[1] in DIGITS_PREFIX)
     ):
-        np = NumberParser(rt, handle_kludgy_ordinals, convert_numbers)
+        np = NumberParser(rt, convert_numbers)
         yield from np.parse()
         rt = np.rt
         ate = True
@@ -2072,12 +2051,6 @@ def parse_tokens(txt: Union[str, Iterable[str]], **options: Any) -> Iterator[Tok
     replace_html_escapes: bool = options.get("replace_html_escapes", False)
     one_sent_per_line: bool = options.get("one_sent_per_line", False)
 
-    # The default behavior for kludgy ordinals is to pass them
-    # through as word tokens
-    handle_kludgy_ordinals: int = options.get(
-        "handle_kludgy_ordinals", KLUDGY_ORDINALS_PASS_THROUGH
-    )
-
     # This code proceeds roughly as follows:
     # 1) The text is split into raw tokens on whitespace boundaries.
     # 2) (By far the most common case:) Raw tokens that are purely
@@ -2178,9 +2151,7 @@ def parse_tokens(txt: Union[str, Iterable[str]], **options: Any) -> Iterator[Tok
             yield TOK.Punctuation(punct, normalized="‚")
 
         # More complex case of mixed punctuation, letters and numbers
-        yield from parse_mixed(
-            rt, handle_kludgy_ordinals, convert_numbers, replace_composite_glyphs
-        )
+        yield from parse_mixed(rt, convert_numbers, replace_composite_glyphs)
 
     # Yield a sentinel token at the end that will be cut off by the final generator
     yield TOK.End_Sentinel()
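With the option gone, the `NumberParser.parse()` loop above reduces to a prefix match followed by a split. A self-contained sketch of that pattern (stand-in code: `parse_leading_ordinal` is invented here, and the real implementation works on `Tok` objects via `rt.split()` rather than plain strings):

```python
# Stand-in for the simplified loop: match a known kludgy-ordinal prefix
# and split the raw text there, yielding the prefix as a word token.
ORDINAL_ERRORS = ("1sti", "2ja", "3ja", "4ðu")

def parse_leading_ordinal(txt):
    """Return (prefix, remainder) if txt starts with a kludgy ordinal,
    else None (the caller then falls through to digit parsing)."""
    for key in ORDINAL_ERRORS:
        if txt.startswith(key):
            return txt[:len(key)], txt[len(key):]
    return None

print(parse_leading_ordinal("3ja,"))  # ('3ja', ',')
print(parse_leading_ordinal("100"))   # None
```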

test/test_cli.py

Lines changed: 0 additions & 9 deletions
@@ -200,13 +200,4 @@ def test_cli(capsys: CaptureFixture[str], monkeypatch: MonkeyPatch) -> None:
         == "Hann fékk 7,5 í meðaleinkunn en bara 3,3 í íþróttum , og hlaut 2.000,5 USD fyrir ."
     )
 
-    # Handle kludgy ordinals
-    # --handle_kludgy_ordinals flag
-    t = "Hann var 1sti maðurinn til að heimsækja tunglið."
-    r = run_cli(c, m, ["-", "-", "--handle_kludgy_ordinals", "1"], t)
-    assert r == "Hann var fyrsti maðurinn til að heimsækja tunglið ."
-    # TODO: Broken functionality, needs to be fixed
-    # r = run_cli(c, m, ["-", "-", "--handle_kludgy_ordinals", "2"], t)
-    # assert r == "Hann var 1. maðurinn til að heimsækja tunglið ."
-
     # TODO: Add more tests for the CLI to achieve 100% coverage

test/test_index_calculation.py

Lines changed: 0 additions & 20 deletions
@@ -638,26 +638,6 @@ def test_composite_phrases() -> None:
     assert byte_indexes == [0, 25, 26]
 
 
-def test_lengthening_substitutions() -> None:
-    s = "Þetta er 3ji báturinn!"
-    # 0123456789012345678901
-    # ^ ^ ^ ^ ^
-    # x x
-    # ! lengthening happens here (3ji->þriðji)
-    toks = tokenizer.parse_tokens(
-        s, handle_kludgy_ordinals=tokenizer.KLUDGY_ORDINALS_MODIFY
-    )
-    char_indexes, byte_indexes = tokenizer.calculate_indexes(toks)
-    assert char_indexes == [0, 5, 8, 12, 21]
-    assert byte_indexes == [0, 6, 9, 13, 23]
-    toks = tokenizer.parse_tokens(
-        s, handle_kludgy_ordinals=tokenizer.KLUDGY_ORDINALS_MODIFY
-    )
-    char_indexes, byte_indexes = tokenizer.calculate_indexes(toks, last_is_end=True)
-    assert char_indexes == [0, 5, 8, 12, 21, 22]
-    assert byte_indexes == [0, 6, 9, 13, 23, 24]
-
-
 def test_converted_measurements() -> None:
     s = "Stillið ofninn á 12° C til að baka kökuna."
     # 012345678901234567890123456789012345678901

test/test_tokenizer.py

Lines changed: 0 additions & 63 deletions
@@ -498,21 +498,6 @@ def test_single_tokens() -> None:
         ("1-800-1234-545566", TOK.SERIALNUMBER),
     ]
 
-    TEST_CASES_KLUDGY_MODIFY = [
-        ("1sti", [Tok(TOK.WORD, "fyrsti", None)]),
-        ("4ðu", [Tok(TOK.WORD, "fjórðu", None)]),
-        ("2svar", [Tok(TOK.WORD, "tvisvar", None)]),
-        ("4ra", [Tok(TOK.WORD, "fjögurra", None)]),
-        ("2ja", [Tok(TOK.WORD, "tveggja", None)]),
-    ]
-
-    TEST_CASES_KLUDGY_TRANSLATE = [
-        ("1sti", [Tok(TOK.ORDINAL, "1sti", 1)]),
-        ("4ðu", [Tok(TOK.ORDINAL, "4ðu", 4)]),
-        ("2svar", [Tok(TOK.WORD, "2svar", None)]),
-        ("4ra", [Tok(TOK.WORD, "4ra", None)]),
-    ]
-
     TEST_CASES_CONVERT_TELNOS: List[TestCase] = [
         ("525-4764", TOK.TELNO),
         ("4204200", [Tok(TOK.TELNO, "4204200", ("420-4200", "354"))]),
@@ -602,10 +587,6 @@ def run_test(test_cases: Iterable[TestCase], **options: Any) -> None:
 
     run_test(cast(Iterable[TestCase], TEST_CASES))
     run_test(cast(Iterable[TestCase], TEST_CASES_CONVERT_TELNOS))
-    run_test(TEST_CASES_KLUDGY_MODIFY, handle_kludgy_ordinals=t.KLUDGY_ORDINALS_MODIFY)
-    run_test(
-        TEST_CASES_KLUDGY_TRANSLATE, handle_kludgy_ordinals=t.KLUDGY_ORDINALS_TRANSLATE
-    )
     run_test(TEST_CASES_CONVERT_NUMBERS, convert_numbers=True)
     run_test(
         cast(Iterable[TestCase], TEST_CASES_COALESCE_PERCENT), coalesce_percent=True
@@ -1051,42 +1032,6 @@ def test_correction() -> None:
             """Hann „gaf“ mér €10.780,65.""",
         ),
     ]
-    SENT_KLUDGY_ORDINALS_MODIFY = [
-        (
-            """Hann sagði: ´Þú ert fífl´! Farðu í 3ja herbergja íbúð.""",
-            """Hann sagði: ‚Þú ert fífl‘! Farðu í þriggja herbergja íbúð.""",
-        ),
-        (
-            """Hann sagði: ´Þú ert fífl´! Farðu í 1sta sinn.""",
-            """Hann sagði: ‚Þú ert fífl‘! Farðu í fyrsta sinn.""",
-        ),
-        (
-            """Hann sagði: ´Þú ert fífl´! Farðu 2svar í bað.""",
-            """Hann sagði: ‚Þú ert fífl‘! Farðu tvisvar í bað.""",
-        ),
-        (
-            """Ég keypti 4ra herbergja íbúð á verði 2ja herbergja.""",
-            """Ég keypti fjögurra herbergja íbúð á verði tveggja herbergja.""",
-        ),
-    ]
-    SENT_KLUDGY_ORDINALS_TRANSLATE = [
-        (
-            """Hann sagði: ´Þú ert fífl´! Farðu í 3ja sinn.""",
-            """Hann sagði: ‚Þú ert fífl‘! Farðu í 3ja sinn.""",
-        ),
-        (
-            """Hann sagði: ´Þú ert fífl´! Farðu í 1sta sinn.""",
-            """Hann sagði: ‚Þú ert fífl‘! Farðu í 1sta sinn.""",
-        ),
-        (
-            """Hann sagði: ´Þú ert fífl´! Farðu 2svar í bað.""",
-            """Hann sagði: ‚Þú ert fífl‘! Farðu 2svar í bað.""",
-        ),
-        (
-            """Ég keypti 4ra herbergja íbúð á verði 2ja herbergja.""",
-            """Ég keypti 4ra herbergja íbúð á verði 2ja herbergja.""",
-        ),
-    ]
     SENT_CONVERT_NUMBERS = [
         (
             """Hann "gaf" mér 10,780.65 dollara.""",
@@ -1102,14 +1047,6 @@ def test_correction() -> None:
         s = t.tokenize(sent)
         txt = t.detokenize(s, normalize=True)
         assert txt == correct
-    for sent, correct in SENT_KLUDGY_ORDINALS_MODIFY:
-        s = t.tokenize(sent, handle_kludgy_ordinals=t.KLUDGY_ORDINALS_MODIFY)
-        txt = t.detokenize(s, normalize=True)
-        assert txt == correct
-    for sent, correct in SENT_KLUDGY_ORDINALS_TRANSLATE:
-        s = t.tokenize(sent, handle_kludgy_ordinals=t.KLUDGY_ORDINALS_TRANSLATE)
-        txt = t.detokenize(s, normalize=True)
-        assert txt == correct
     for sent, correct in SENT_CONVERT_NUMBERS:
         s = t.tokenize(sent, convert_numbers=True)
         txt = t.detokenize(s, normalize=True)
