
Comparing changes

base repository: openai/tiktoken
base: 0.5.1
head repository: openai/tiktoken
compare: 0.7.0

  • 16 commits
  • 10 files changed
  • 8 contributors

Commits on Dec 3, 2023

  1. Sync codebase

    hauntsaninja committed Dec 3, 2023 (6267f91)
  2. Sync codebase

    hauntsaninja committed Dec 3, 2023 (9e79899)

Commits on Jan 30, 2024

  1. Add support for checking hash of downloaded files before use. (#230)

    We are using tiktoken in various production scenarios and sometimes have
    the problem that the download of `.tiktoken` files (e.g.,
    `cl100k_base.tiktoken`) will get interrupted or fail, causing the cached
    file to be corrupted in some way. In those cases, the results returned
    from the encoder will be incorrect and could be damaging to our
    production instances.
    
    More often, when this happens, `Encoder.encode()` will throw an
    exception such as
    ```
    pyo3_runtime.PanicException: no entry found for key
    ```
    which turns out to be quite hard to track down.
    
    In an effort to make tiktoken more robust for production use, this PR
    adds the `sha256` hash of each of the downloaded files to
    `openai_public.py` and augments `read_file` to check for the hash, if
    provided, when the file is accessed from the cache or downloaded
    directly. This causes errors to be flagged at file load time, rather
    than when the files are used, and provides a more meaningful error
    message indicating what might have gone wrong.
    
    This also protects users of tiktoken from scenarios where a network
    issue or MITM attack could have corrupted these files in transit.
    mdwelsh committed Jan 30, 2024 (3ee6c35)
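    As an illustration, here is a minimal sketch of this kind of integrity
    check (the function name and signature are hypothetical; the real logic
    lives in tiktoken's file-loading code):

    ```python
    import hashlib

    def check_hash(data: bytes, expected_sha256: str) -> None:
        # Hash the downloaded bytes and compare against the pinned digest, so
        # corruption is caught at load time rather than at encode time.
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            raise ValueError(
                f"Hash mismatch: expected {expected_sha256}, got {actual}. "
                "The cached or downloaded file may be corrupted; "
                "delete it and retry."
            )
    ```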
  2. db5bda9

Commits on Feb 9, 2024

  1. Optimize regular expressions used for splitting by ~20% (#234)

    By combining the contractions into a single non-capturing group prefixed
    by `'`, we can speed up matches by roughly 20%.
    
    By using possessive quantifiers in the word and punctuation groups of
    `cl100k_base`, we avoid some backtracking.
    
    The last whitespace groups can also be simplified to match a single
    newline explicitly, since the preceding whitespace pattern already
    matches the rest.
    
    Overall, the regex matches exactly the same sequences of characters as
    before, for every input, including Unicode sequences.
    
    Co-authored-by: Lőrinc <[email protected]>
    l0rinc and Lőrinc committed Feb 9, 2024 (6cc3a46)
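    To illustrate the kind of rewrite described, a sketch using Python's
    `regex` module (which supports possessive quantifiers); the patterns
    below are simplified and illustrative, not the exact `cl100k_base` ones:

    ```python
    import regex

    # Before: each contraction spelled out as its own alternative.
    before = regex.compile(r"'s|'t|'re|'ve|'m|'ll|'d")
    # After: one non-capturing group prefixed by the shared apostrophe.
    after = regex.compile(r"'(?:s|t|re|ve|m|ll|d)")

    # Possessive quantifier (++) forbids backtracking into the letter run.
    word = regex.compile(r"[^\r\n\p{L}\p{N}]?\p{L}++")

    text = "I'll say they're done"
    assert before.findall(text) == after.findall(text)
    ```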
  2. Add encodings for two new embedding models (#247)

    The library didn't support the encoding mapping for two new embedding
    models:
    - `text-embedding-3-small`
    - `text-embedding-3-large`

    Added the encoding mapping for these two models. The mapping is taken
    from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
    Praneet460 committed Feb 9, 2024 (55c8d83)
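    With the mapping in place, the new models resolve to `cl100k_base`
    through the usual model lookup (assuming a tiktoken version that
    includes this commit):

    ```python
    import tiktoken

    # Both new embedding models map to the cl100k_base encoding.
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
    print(enc.name)                   # cl100k_base
    print(enc.encode("hello world"))  # token ids under cl100k_base
    ```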
  3. Sync codebase

    hauntsaninja committed Feb 9, 2024 (89153d7)
  4. Update cibuildwheel

    hauntsaninja committed Feb 9, 2024 (01df436)
  5. 84d88dc
  6. Store tokens in u32 instead of usize

    And hide it behind a `Rank` type to make it easier to separate from
    other numeric values.
    Lőrinc authored and hauntsaninja committed Feb 9, 2024 (c2960c1)
  7. 6e4851a
  8. Avoid calling byte_pair_encode for existing tokens

    This way byte_pair_encode can be optimized further, since it can assume
    every piece it sees produces at least two tokens.
    Lőrinc authored and hauntsaninja committed Feb 9, 2024 (b4c687e)
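    A minimal Python sketch of the fast path this describes (hypothetical
    names; the real code is in the Rust core, and `byte_pair_encode` stands
    for the full merge routine, sketched under #255 below):

    ```python
    def encode_piece(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
        # Fast path: if the whole piece is already a token in the
        # vocabulary, return its rank directly and skip BPE entirely.
        token = ranks.get(piece)
        if token is not None:
            return [token]
        # Otherwise fall back to BPE, which may now assume the piece
        # splits into at least two tokens.
        return byte_pair_encode(piece, ranks)
    ```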
  9. Inline custom mapping function in _byte_pair_merge

    Lőrinc authored and hauntsaninja committed Feb 9, 2024 (6defed5)

Commits on Feb 11, 2024

  1. Simplify byte_pair_merge (#255)

    Based on a suggestion in #239
    (specifically 8f5dd7d)
    
    Like that commit, this:
    - Does the init in a single loop and saves a loop if there are no merges
    - Simplifies get_rank and no longer uses it in init (so you don't need
    multiple skip values)
    
    Unlike that commit:
    - We drop optimisations enabled by ignoring single tokens. These didn't
    show any benefit on benchmarks for me (this makes sense given typical
    piece sizes, but let me know if that's unexpected!). Given this, I opted
    for the simpler version.
    - I preserve some of the comments from the original that I think are
    still useful
    
    Co-authored-by: @paplorinc
    
    ---------
    
    Co-authored-by: Lőrinc Pap <[email protected]>
    hauntsaninja and l0rinc committed Feb 11, 2024 (1b9faf2)
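    For context, a pure-Python sketch of the simplified merge loop: start
    with one part per byte, then repeatedly merge the adjacent pair with the
    lowest rank. The real implementation is in Rust and tracks (start, rank)
    indices instead of materializing byte slices, but the algorithm is the
    same:

    ```python
    def byte_pair_merge(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
        parts = [bytes([b]) for b in piece]
        while len(parts) > 1:
            # Find the adjacent pair whose concatenation has the lowest rank.
            best_idx, best_rank = None, None
            for i in range(len(parts) - 1):
                rank = ranks.get(parts[i] + parts[i + 1])
                if rank is not None and (best_rank is None or rank < best_rank):
                    best_idx, best_rank = i, rank
            if best_rank is None:
                break  # No adjacent pair is in the vocabulary; we're done.
            parts[best_idx : best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
        return parts

    # Token ids are then the ranks of the final parts:
    # [ranks[p] for p in byte_pair_merge(piece, ranks)]
    ```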

Commits on May 13, 2024

  1. Sync codebase

    hauntsaninja committed May 13, 2024 (9d01e56)
  2. Bump cibuildwheel

    hauntsaninja committed May 13, 2024 (bfe00ad)