Load pre-trained tokenizer from memory #1013

HaoboGu · 2022-06-20T13:02:41Z

Hello guys,

tokenizers::Tokenizer has two methods for creating a pre-trained tokenizer: Tokenizer::from_pretrained and Tokenizer::from_file. It's quite common to need to download tokenizer.json from a remote source. Right now we have to save the remote data to a local file, load the tokenizer from that file, and then delete the file after loading.

If we had something like Tokenizer::from_memory(data_bytes: &[u8]), we could download the remote tokenizer into memory and load it directly. That would be quick and safe, and there would be no local file to delete afterwards.


Narsil · 2022-07-04T15:03:02Z

This is a totally reasonable thing to ask.

Actually, if you are in pure Rust, Tokenizer already supports serde, so let tokenizer: Tokenizer = serde_json::from_slice(&data_bytes).unwrap(); should work out of the box.

That being said, adding a new function is definitely something we could do. from_bytes or from_slice seem like better names given std's naming conventions, but overall it would still be a nice addition IMHO.

HaoboGu mentioned this issue Jul 11, 2022

Add from_bytes approach for creating tokenizers #1024

Merged

Narsil closed this as completed in #1024 Jul 18, 2022

RamvigneshPasupathy mentioned this issue Jul 11, 2024

Tokenizer.from_bytes() not available in python bindings #1567

Open
