Skip to content

Utf8 regression#1000

Merged
wasade merged 6 commits intobiocore:masterfrom
wasade:utf8-regression
Dec 23, 2025
Merged

Utf8 regression#1000
wasade merged 6 commits intobiocore:masterfrom
wasade:utf8-regression

Conversation

@wasade
Copy link
Copy Markdown
Member

@wasade wasade commented Dec 22, 2025

A regression on parse of non-trivial UTF-8 characters was introduced. Reported here.

Copy link
Copy Markdown
Member

@ebolyen ebolyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @wasade! One minor question on the test setup.

Comment thread biom/tests/test_parse.py
def test_parse_biom_table_hdf5_accent_utf8_regression(self):
tab = example_table.copy()
tab.update_ids({i: f'{i}fóó' for i in tab.ids()}, inplace=True)
with NamedTemporaryFile(delete=False) as tmpfile:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should delete be False? I think this would leave behind a file on the system.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially, but err'ing on following the style of the rest of the script. And, a left over file in TMPDIR isn't the best but not the worst thing to do

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. It looks like those add this to the end of the test:

self.to_remove.append(tmpfile.name)

It's mostly github's problem these days I suppose. So your call, but wanted to mention it.

Comment thread biom/table.py
ids_dtype = 'U%d' % max([len(v) for v in ids])
ids = np.asarray(ids, dtype=ids_dtype)
else:
# .asstr does not handle an empty dataset
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🫤 that feels rude of h5py

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be HDF5 as the C-API is touchy. Or, we may not be correctly enforcing types on empty tables

@wasade wasade merged commit 20ace60 into biocore:master Dec 23, 2025
53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants