ai #315
Closed
Co-authored-by: messense <[email protected]>
…#230)

We are using tiktoken in various production scenarios and sometimes have the problem that the download of `.tiktoken` files (e.g., `cl100k_base.tiktoken`) gets interrupted or fails, leaving the cached file corrupted in some way. In those cases, the results returned from the encoder will be incorrect and could be damaging to our production instances. More often, when this happens, `Encoder.encode()` will throw an exception such as

```
pyo3_runtime.PanicException: no entry found for key
```

which turns out to be quite hard to track down.

In an effort to make tiktoken more robust for production use, this PR adds the `sha256` hash of each of the downloaded files to `openai_public.py` and augments `read_file` to check for the hash, if provided, when the file is accessed from the cache or downloaded directly. This causes errors to be flagged at file load time, rather than when the files are used, and provides a more meaningful error message indicating what might have gone wrong. It also protects users of tiktoken from scenarios where a network issue or MITM attack could have corrupted these files in transit.
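The verification described above can be sketched as follows. This is a minimal illustration of pinning a SHA-256 hash and checking downloaded bytes against it, not tiktoken's actual `read_file` implementation, and the pinned hash below is computed from placeholder data, not the real `cl100k_base.tiktoken` hash.

```python
import hashlib


def check_hash(data: bytes, expected_sha256: str) -> bool:
    # Compare the SHA-256 digest of the downloaded/cached bytes
    # against the hash pinned in the encoding definition.
    return hashlib.sha256(data).hexdigest() == expected_sha256


# Placeholder contents standing in for a downloaded .tiktoken file.
data = b"example vocabulary contents"
pinned = hashlib.sha256(data).hexdigest()

assert check_hash(data, pinned)
# A truncated or corrupted download no longer passes silently:
assert not check_hash(data[:-1], pinned)
```

Failing at load time with a clear hash-mismatch error, rather than deep inside `encode()`, is what turns the opaque `PanicException` into an actionable message.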
By combining the contractions into a single non-capturing group prefixed by `'`, we can speed up matches by roughly 20%. By using possessive quantifiers in the word and punctuation groups of the `cl100k_base` pattern, we avoid some backtracking. The last whitespace groups can also be simplified to match a single newline explicitly, since the preceding whitespace group would already match it. Overall the regex matches exactly the same sequences of characters as before, for any input and for Unicode sequences. Co-authored-by: Lőrinc <[email protected]>
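The contraction change can be demonstrated in isolation. The fragment below is a simplified excerpt, not the full `cl100k_base` pattern, and omits the possessive quantifiers (which require Python 3.11+); it shows that hoisting the shared apostrophe out of the alternation preserves the matched sequences.

```python
import re

# Original form: each contraction alternative repeats the leading apostrophe,
# so the engine re-tests the quote once per alternative.
old = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d")

# Combined form: the apostrophe is tested once, then a single
# non-capturing group distinguishes the suffixes.
new = re.compile(r"'(?:s|t|re|ve|m|ll|d)")

text = "I'd say we're sure it'll work and you've seen it's fast"
assert old.findall(text) == new.findall(text)
```

Because `(?:...)` is non-capturing, `findall` returns the whole match for both patterns, making the equivalence easy to check.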
The library didn't support the encoding mapping for two new embedding models:
- `text-embedding-3-small`
- `text-embedding-3-large`

Added an encoding mapping for the two new embedding models. The mapping is taken from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
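The change amounts to two new entries in the model-to-encoding table. The sketch below mirrors the shape of that mapping and of `tiktoken.encoding_for_model`'s lookup; the dict here is a trimmed stand-in, not the library's full table.

```python
# Trimmed stand-in for tiktoken's model-to-encoding mapping.
MODEL_TO_ENCODING = {
    "text-embedding-ada-002": "cl100k_base",
    # The two entries added by this change:
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
}


def encoding_name_for_model(model: str) -> str:
    # Resolve a model name to its encoding name, failing loudly
    # for unknown models (as encoding_for_model does).
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise KeyError(f"Could not map model {model!r} to an encoding") from None


assert encoding_name_for_model("text-embedding-3-large") == "cl100k_base"
```

With the entries in place, `tiktoken.encoding_for_model("text-embedding-3-small")` resolves to the `cl100k_base` encoding instead of raising.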
And hide it behind a `Rank` type to make it easier to separate from other numeric values
This way byte_pair_encode can be optimized further, assuming we'll always have at least 2 tokens
Based on suggestion in openai#239 (specifically 8f5dd7d)

Like that commit, this:
- Does the init in a single loop and saves a loop if there are no merges
- Simplifies get_rank and no longer uses it in init (so you don't need multiple skip values)

Unlike that commit:
- We drop optimisations enabled by ignoring single tokens. These didn't show any benefit on benchmarks for me (this makes sense given typical piece sizes, but let me know if that's unexpected!). Given this, I opted for the simpler version.
- I preserve some of the comments from the original that I think are still useful

Co-authored-by: @paplorinc
Co-authored-by: Lőrinc Pap <[email protected]>
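The merge loop these commits optimise can be sketched as follows. This is a simplified Python rendering of the byte-pair-merge idea, with `Rank` as a plain alias standing in for the Rust newtype; it is not the library's actual (Rust) implementation and ignores its performance tricks.

```python
from typing import Optional

Rank = int  # stands in for the dedicated Rank type wrapping the merge priority


def byte_pair_merge(piece: bytes, ranks: dict[bytes, Rank]) -> list[bytes]:
    # Start with one part per byte, then repeatedly merge the adjacent
    # pair with the lowest (best) rank until no mergeable pair remains.
    parts = [piece[i : i + 1] for i in range(len(piece))]

    def get_rank(i: int) -> Optional[Rank]:
        # Rank of merging parts[i] with parts[i + 1], or None if the
        # concatenation is not a known token.
        if i + 1 < len(parts):
            return ranks.get(parts[i] + parts[i + 1])
        return None

    while len(parts) > 1:
        candidates = [
            (r, i) for i in range(len(parts) - 1) if (r := get_rank(i)) is not None
        ]
        if not candidates:
            break
        _, i = min(candidates)
        parts[i : i + 2] = [parts[i] + parts[i + 1]]
    return parts


ranks = {b"ab": 0, b"abc": 1}
assert byte_pair_merge(b"abcd", ranks) == [b"abc", b"d"]
```

The "at least 2 tokens" assumption from the earlier commit corresponds to the `len(parts) > 1` guard: single-byte pieces never enter the merge loop at all.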
ai for me