The Encode and EncodeBatch methods unnecessarily require std::string instances #95

@gh-andre

In projects that use plain C strings or std::string_view, calling methods like Encode and EncodeBatch forces the caller to construct a temporary std::string for each input, which can be rather large and adds memory allocations and copies. If these methods are declared to take string views, as shown below, no extra allocation or copying is needed.

-  virtual std::vector<int32_t> Encode(const std::string& text) = 0;
+  virtual std::vector<int32_t> Encode(const std::string_view& text) = 0;
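
To illustrate why the string view signature covers every caller, here is a minimal sketch. The Tokenizer class below is a hypothetical stand-in that only mirrors the Encode signature, and ByteTokenizer is a toy implementation invented for this example; neither is the library's actual code. The sketch takes std::string_view by value, which is the more common convention, whereas the patch uses a const reference, which works as well.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the tokenizer interface (assumption: only the
// Encode signature is modeled; this is not the real tokenizers-cpp class).
class Tokenizer {
 public:
  virtual ~Tokenizer() = default;
  // A string view accepts const char*, std::string and std::string_view
  // arguments without creating a temporary std::string.
  virtual std::vector<int32_t> Encode(std::string_view text) = 0;
};

// Toy implementation for illustration only: emits one id per input byte.
class ByteTokenizer : public Tokenizer {
 public:
  std::vector<int32_t> Encode(std::string_view text) override {
    std::vector<int32_t> ids;
    ids.reserve(text.size());
    for (unsigned char c : text) {
      ids.push_back(c);
    }
    return ids;
  }
};
```

With this signature, tok.Encode("hi"), tok.Encode(some_std_string) and tok.Encode(some_view.substr(0, 2)) all compile unchanged, and none of them copies the text. One thing to keep in mind is that a string view is not guaranteed to be null-terminated, so an implementation has to pass data() together with size() rather than rely on a terminator.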

Seeing how some pull requests have been sitting in the queue for over a year, I won't create one, but you can apply the patch attached to this post to change all string references to string views, which will work for both strings and string views.

The source at the release tag is broken (see another issue I created), so this patch is against the hash acbdc5a, and can be applied with this command, assuming it runs in a directory above tokenizers-cpp-0.0.1 (otherwise remove -d tokenizers-cpp-0.0.1).

patch --unified -p1 -d tokenizers-cpp-0.0.1 --input ../patches/tokenizers-cpp-0.0.1.patch

tokenizers-cpp-0.1.1.patch
