The Encode and EncodeBatch methods unnecessarily require std::string instances #95

In projects that use plain C strings or `std::string_view`, calling methods like `Encode` and `EncodeBatch` forces the caller to construct temporary `std::string` instances, which can be rather large and cost extra memory allocations and copies. If these methods are declared to take string views, as shown below, no extra allocation or copying is needed.
```
-  virtual std::vector<int32_t> Encode(const std::string& text) = 0;
+  virtual std::vector<int32_t> Encode(const std::string_view& text) = 0;
```
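To illustrate the point, here is a minimal sketch of how a `std::string_view` parameter accepts all three kinds of input without building a temporary `std::string`. The `ToyTokenizer` and `ByteTokenizer` classes and the `CheckAllInputKinds` function are hypothetical stand-ins made up for this example, not the library's actual classes, and the one-token-per-byte encoding is only a placeholder. The sketch passes the view by value, which is the common convention; the `const std::string_view&` from the diff above accepts the same arguments.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the tokenizer interface (not the library's real class).
class ToyTokenizer {
 public:
  virtual ~ToyTokenizer() = default;
  // Taking a view means callers never have to build a std::string first.
  virtual std::vector<int32_t> Encode(std::string_view text) = 0;
};

// Placeholder implementation: emits one "token" per input byte.
class ByteTokenizer : public ToyTokenizer {
 public:
  std::vector<int32_t> Encode(std::string_view text) override {
    std::vector<int32_t> ids;
    ids.reserve(text.size());
    for (unsigned char c : text) {
      ids.push_back(static_cast<int32_t>(c));
    }
    return ids;
  }
};

// Calls Encode with a C string, a std::string, and a slice of a larger buffer.
inline void CheckAllInputKinds() {
  ByteTokenizer tokenizer;
  const char* c_string = "hi";
  const std::string owned = "hi";
  const std::string_view buffer = "say hi";
  const std::string_view slice = buffer.substr(4);  // views "hi" inside buffer

  const std::vector<int32_t> expected = {104, 105};  // ASCII 'h', 'i'
  assert(tokenizer.Encode(c_string) == expected);  // const char* converts to a view
  assert(tokenizer.Encode(owned) == expected);     // std::string converts to a view
  assert(tokenizer.Encode(slice) == expected);     // no copy of the substring
}
```

One thing to keep in mind with this approach is that a `std::string_view` is not guaranteed to be null-terminated, so an implementation that forwards the text to a C API should pass `data()` together with `size()` rather than treat `data()` as a C string.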
Seeing how some pull requests have been sitting in the queue for over a year, I won't create one, but you can apply the patch attached to this post to change all string references to string views, which work for both strings and string views.

The source at the release tag is broken (see another issue I created), so this patch is against commit acbdc5a2. It can be applied with the command below, assuming it is run from the directory above `tokenizers-cpp-0.0.1` (otherwise remove `-d tokenizers-cpp-0.0.1`).
```
patch --unified -p1 -d tokenizers-cpp-0.0.1 --input ../patches/tokenizers-cpp-0.0.1.patch
```

[tokenizers-cpp-0.1.1.patch](https://github.com/user-attachments/files/26317279/tokenizers-cpp-0.1.1.patch)
