Tokenizer.from_bytes() not available in python bindings #1567

Open

RamvigneshPasupathy opened this issue Jul 11, 2024 · 4 comments

Comments

@RamvigneshPasupathy commented Jul 11, 2024

Looking for Tokenizer.from_bytes() support in Python, similar to the one in Rust - #1013

Currently, it is not available in the Python bindings code - https://github.com/huggingface/tokenizers/blob/v0.19.1/bindings/python/src/tokenizer.rs

Why is this needed?

  • We have the "safetensors" format to serialize models, store them as remote objects, and later load them back into memory and deserialize them into Model objects safely, without any file write operations on the app server.
  • But for Tokenizers, I have to use a "zip" file of the pre_trained tokenizer as the remote object, download that zip, and perform a file write on my server before loading the Tokenizer object back; this file write is not convenient. The only way to skip it today is to save a pickled copy of the Tokenizer object created at runtime and unpickle it from my remote object store, which may not be a safe option. Having a "from_bytes" option for Tokenizer would therefore be helpful. A sketch of the current workaround is below.
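
For concreteness, a minimal sketch of the workflow described above, assuming a hypothetical `fetch_remote_bytes` helper in place of the actual object-store client; `safetensors.torch.load()` and `Tokenizer.from_file()` are the existing library APIs being compared:

```python
# Minimal sketch (not from the issue): fetch_remote_bytes is a hypothetical
# stand-in for an object-store client; safetensors.torch.load() and
# Tokenizer.from_file() are the real library calls being compared.
import tempfile

import safetensors.torch
from tokenizers import Tokenizer


def fetch_remote_bytes(key: str) -> bytes:
    """Hypothetical remote download (S3/GCS/...); here it just reads a local file."""
    with open(key, "rb") as f:
        return f.read()


# Models: safetensors deserializes directly from bytes, no file write needed.
state_dict = safetensors.torch.load(fetch_remote_bytes("model.safetensors"))

# Tokenizers: without a bytes-based constructor, the downloaded JSON has to be
# written to disk first, because Tokenizer.from_file() only accepts a path.
tokenizer_bytes = fetch_remote_bytes("tokenizer.json")
with tempfile.NamedTemporaryFile(suffix=".json", delete=False) as tmp:
    tmp.write(tokenizer_bytes)  # the file write the issue wants to avoid
    tmp_path = tmp.name
tokenizer = Tokenizer.from_file(tmp_path)
```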
@ArthurZucker (Collaborator) commented

Would you like to open a PR to add this feature? 🤗

@RamvigneshPasupathy (Author) commented

Hi @ArthurZucker

I was going through the code once more with a view to contributing the method I asked for, Tokenizer.from_bytes(), but then I found that the feature I am expecting is already available under a different method name: Tokenizer.from_buffer().

I tried a PoC of loading the tokenizer from the file bytes of a tokenizer.json, and it works. Attaching screenshots; please close this issue if you find the PoC good and this code is a sufficient reference for anyone using Hugging Face tokenizers.

(Screenshots attached: Page 1 and Page 2 of the PoC.)
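
For anyone landing here later, a rough reconstruction of the PoC described above (the exact code in the screenshots may differ): Tokenizer.from_buffer() accepts the raw bytes of a tokenizer.json, so no intermediate file write is needed.

```python
# Rough reconstruction of the PoC (assumed, not copied from the screenshots):
# load a Tokenizer straight from the bytes of a tokenizer.json.
from tokenizers import Tokenizer

# Stand-in for bytes fetched from a remote object store.
with open("tokenizer.json", "rb") as f:
    tokenizer_bytes = f.read()

tokenizer = Tokenizer.from_buffer(tokenizer_bytes)
print(tokenizer.encode("Hello, world!").tokens)
```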

@ArthurZucker (Collaborator) commented

Yeah, maybe update the docs to make from_buffer more findable?

github-actions bot commented

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Aug 15, 2024