Description
In projects that work with plain C strings or std::string_view, calling methods like Encode and EncodeBatch forces the caller to construct a temporary std::string for each input. These strings can be rather large, which adds memory allocations and copies. If these methods are declared to take string views, as shown below, no extra allocation or copying is needed at the call site.
- virtual std::vector<int32_t> Encode(const std::string& text) = 0;
+ virtual std::vector<int32_t> Encode(const std::string_view& text) = 0;
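To illustrate why a string view parameter covers all caller types, here is a minimal sketch assuming C++17. `ToyTokenizer` and `ByteTokenizer` are hypothetical stand-ins written for this example, not classes from the library, and the "encoding" just returns byte values. The sketch passes `std::string_view` by value, which is the commonly recommended form since a view is only a pointer and a length; the `const std::string_view&` form in the diff above works the same way for callers.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the tokenizer interface (not the library's class).
class ToyTokenizer {
 public:
  virtual ~ToyTokenizer() = default;
  // Accepts std::string, string literals, and (pointer, length) views
  // without the caller building a std::string first.
  virtual std::vector<int32_t> Encode(std::string_view text) = 0;
};

// Toy implementation for illustration only: one id per input byte.
class ByteTokenizer : public ToyTokenizer {
 public:
  std::vector<int32_t> Encode(std::string_view text) override {
    std::vector<int32_t> ids;
    ids.reserve(text.size());
    for (unsigned char c : text) {
      ids.push_back(c);
    }
    return ids;
  }
};
```

With this signature, `tok.Encode(owned_string)`, `tok.Encode("literal")`, and `tok.Encode(std::string_view(buf, len))` all compile, and none of them copies the character data before the call. One caveat: a view is not guaranteed to be null-terminated, so an implementation that forwards the text to a C API has to pass `data()` and `size()` rather than rely on a terminator.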
Seeing how some pull requests have been sitting in the queue for over a year, I won't create one, but you can apply the patch attached to this post to change all string references to string views, which work for both strings and string views.
The source at the release tag is broken (see another issue I created), so this patch is against commit acbdc5a. It can be applied with the command below, assuming it is run from the directory above tokenizers-cpp-0.0.1 (otherwise remove -d tokenizers-cpp-0.0.1).
patch --unified -p1 -d tokenizers-cpp-0.0.1 --input ../patches/tokenizers-cpp-0.0.1.patch