Tokenizer.from_bytes() not available in python bindings #1567

Open

RamvigneshPasupathy opened this issue Jul 11, 2024 · 4 comments

Comments

@RamvigneshPasupathy commented Jul 11, 2024

Looking for Tokenizer.from_bytes() support in Python, similar to the one in Rust - #1013

Currently, it is not available in the Python bindings code - https://github.com/huggingface/tokenizers/blob/v0.19.1/bindings/python/src/tokenizer.rs

Why is this needed?

  • We have the "safetensors" format to serialize models, store them as remote objects, and later load them back into memory and deserialize them into Model objects safely, without any file write operations on the app server.
  • But for Tokenizers, I have to use a "zip" file of the pre_trained tokenizer as the remote object, download that zip, and perform a file write on my server before loading the Tokenizer object back; this file write is not convenient. The only way to skip it today is to save a pickled copy of the Tokenizer object created at runtime and unpickle it from my remote object store, which may not be a safe option. Having a "from_bytes" option for Tokenizer would therefore be helpful. A sketch of the current workaround is below.
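
For concreteness, a minimal sketch of the workflow described above, assuming a hypothetical `fetch_remote_bytes` helper in place of the actual object-store client; `safetensors.torch.load()` and `Tokenizer.from_file()` are the existing library APIs being compared:

```python
# Minimal sketch (not from the issue): fetch_remote_bytes is a hypothetical
# stand-in for an object-store client; safetensors.torch.load() and
# Tokenizer.from_file() are the real library calls being compared.
import tempfile

import safetensors.torch
from tokenizers import Tokenizer


def fetch_remote_bytes(key: str) -> bytes:
    """Hypothetical remote download (S3/GCS/...); here it just reads a local file."""
    with open(key, "rb") as f:
        return f.read()


# Models: safetensors deserializes directly from bytes, no file write needed.
state_dict = safetensors.torch.load(fetch_remote_bytes("model.safetensors"))

# Tokenizers: without a bytes-based constructor, the downloaded JSON has to be
# written to disk first, because Tokenizer.from_file() only accepts a path.
tokenizer_bytes = fetch_remote_bytes("tokenizer.json")
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    tmp.write(tokenizer_bytes)  # the file write the issue wants to avoid
    tmp_path = tmp.name
tokenizer = Tokenizer.from_file(tmp_path)
```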
@ArthurZucker (Collaborator) commented

Would you like to open a PR to add this feature? 🤗

@RamvigneshPasupathy (Author) commented

Hi @ArthurZucker

I was going through the code once more with a view to contributing the method I asked for, Tokenizer.from_bytes(), but then I found that the feature I am expecting is already available under a different method name: Tokenizer.from_buffer().

I tried a PoC of loading the tokenizer from the file bytes of a tokenizer.json, and it works. Attaching screenshots; please close this issue if you find the PoC good and this code is a sufficient reference for anyone using Hugging Face tokenizers.

(Screenshots attached: Page 1 and Page 2 of the PoC.)
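
For anyone landing here later, a rough reconstruction of the PoC described above (the exact code in the screenshots may differ): Tokenizer.from_buffer() accepts the raw bytes of a tokenizer.json, so no intermediate file write is needed.

```python
# Rough reconstruction of the PoC (assumed, not copied from the screenshots):
# load a Tokenizer straight from the bytes of a tokenizer.json.
from tokenizers import Tokenizer

# Stand-in for bytes fetched from a remote object store.
with open("tokenizer.json", "rb") as f:
    tokenizer_bytes = f.read()

tokenizer = Tokenizer.from_buffer(tokenizer_bytes)
print(tokenizer.encode("Hello, world!").tokens)
```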

@ArthurZucker (Collaborator) commented

Yeah, maybe update the docs to make from_buffer more findable?

github-actions bot commented

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Aug 15, 2024