ai #315

Closed
wants to merge 35 commits into from
Conversation

MITCHELLNEAL1

ai for me

hauntsaninja and others added 30 commits March 2, 2023 11:54
…#230)

We are using tiktoken in various production scenarios and sometimes have
the problem that the download of `.tiktoken` files (e.g.,
`cl100k_base.tiktoken`) will get interrupted or fail, causing the cached
file to be corrupted in some way. In those cases, the results returned
from the encoder will be incorrect and could be damaging to our
production instances.

More often, when this happens, `Encoder.encode()` will throw an
exception such as
```
pyo3_runtime.PanicException: no entry found for key
```
which turns out to be quite hard to track down.

In an effort to make tiktoken more robust for production use, this PR
adds the `sha256` hash of each of the downloaded files to
`openai_public.py` and augments `read_file` to check for the hash, if
provided, when the file is accessed from the cache or downloaded
directly. This causes errors to be flagged at file load time, rather
than when the files are used, and provides a more meaningful error
message indicating what might have gone wrong.

This also protects users of tiktoken from scenarios where a network
issue or MITM attack could have corrupted these files in transit.
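
A minimal Python sketch of the kind of verification this adds; `check_hash` and its signature are illustrative names, not the exact ones used in tiktoken's loading path:

```
import hashlib

# Illustrative sketch of the hash check described above; the actual
# implementation lives in tiktoken's file-loading code (read_file).
def check_hash(data: bytes, expected_sha256: str) -> None:
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"Hash mismatch: expected {expected_sha256}, got {actual}. "
            "The cached .tiktoken file may be corrupted; clear the cache and retry."
        )
```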
By combining the contractions into a single non-capturing group prefixed
by `'`, we can speed up matches by roughly 20%.

By using possessive quantifiers in the `cl100k_base` word and
punctuation groups, we avoid some backtracking.

The last whitespace groups can also be simplified to match a single
explicit newline, since the preceding whitespace group would already
match the rest.

Overall, the regex matches exactly the same sequences of characters as
before, for any input, including Unicode sequences.

Co-authored-by: Lőrinc <[email protected]>
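As a hedged illustration of the contraction grouping only (the real pattern lives in the Rust crate; possessive quantifiers like `?+`/`++` are available in Python's `re` from 3.11):

```
import re

text = "it's we'll they've"

# Separate alternatives, one per contraction (the old style):
old = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d")
# Combined into a single non-capturing group behind one leading quote:
new = re.compile(r"'(?:[sdmt]|ll|ve|re)")

assert old.findall(text) == new.findall(text) == ["'s", "'ll", "'ve"]
```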
The library doesn't support the encoding mapping for two new embedding models:
- `text-embedding-3-small`
- `text-embedding-3-large`

This adds the encoding mapping for the two new embedding models. The mapping
is taken from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
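
With the mapping in place, looking up the encoding for these models works like any other model name:

```
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")
print(enc.name)  # cl100k_base
```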
Hide the token rank behind a `Rank` type to make it easier to separate from other numeric values.
This way `byte_pair_encode` can be optimized further, assuming we'll always have at least 2 tokens.
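
The `Rank` change itself is in the Rust crate; as a rough Python analogue of the idea (illustrative only), a distinct type keeps ranks from being confused with other integers:

```
from typing import NewType

# Illustrative Python analogue of the Rust `Rank` type: ranks get their
# own type so they aren't mixed up with byte offsets or other ints.
Rank = NewType("Rank", int)

def get_rank(ranks: dict[bytes, Rank], piece: bytes) -> Rank | None:
    return ranks.get(piece)
```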
Lőrinc and others added 5 commits February 9, 2024 13:10
Based on a suggestion in openai#239
(specifically 8f5dd7d)

Like that commit, this:
- Does the init in a single loop and saves a loop if there are no merges
- Simplifies get_rank and no longer uses it in init (so you don't need
multiple skip values)

Unlike that commit:
- We drop optimisations enabled by ignoring single tokens. These didn't
show any benefit on benchmarks for me (this makes sense given typical
piece sizes, but let me know if that's unexpected!). Given this, I opted
for the simpler version.
- I preserve some of the comments from the original that I think are
still useful

Co-authored-by: @paplorinc

---------

Co-authored-by: Lőrinc Pap <[email protected]>
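
For context, a minimal Python sketch of the greedy merge loop this commit reworks (the real implementation is the Rust `byte_pair_merge`; names and structure here are simplified):

```
def byte_pair_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    # Start with one part per byte; assumes len(piece) >= 2.
    parts = [piece[i : i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        # Find the adjacent pair with the lowest (best) merge rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no adjacent pair can be merged further
        parts[best_i : best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[part] for part in parts]
```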