ai #315

Closed
wants to merge 35 commits into from
Conversation

MITCHELLNEAL1

ai for me

hauntsaninja and others added 30 commits March 2, 2023 11:54
…#230)

We are using tiktoken in various production scenarios and sometimes have
the problem that the download of `.tiktoken` files (e.g.,
`cl100k_base.tiktoken`) will get interrupted or fail, causing the cached
file to be corrupted in some way. In those cases, the results returned
from the encoder will be incorrect and could be damaging to our
production instances.

More often, when this happens, `Encoder.encode()` will throw an
exception such as
```
pyo3_runtime.PanicException: no entry found for key
```
which turns out to be quite hard to track down.

In an effort to make tiktoken more robust for production use, this PR
adds the `sha256` hash of each of the downloaded files to
`openai_public.py` and augments `read_file` to check for the hash, if
provided, when the file is accessed from the cache or downloaded
directly. This causes errors to be flagged at file load time, rather
than when the files are used, and provides a more meaningful error
message indicating what might have gone wrong.

This also protects users of tiktoken from scenarios where a network
issue or MITM attack could have corrupted these files in transit.
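
A minimal Python sketch of the kind of verification this adds; `check_hash` and its signature are illustrative names, not the exact ones used in tiktoken's loading path:

```
import hashlib

# Illustrative sketch of the hash check described above; the actual
# implementation lives in tiktoken's file-loading code (read_file).
def check_hash(data: bytes, expected_sha256: str) -> None:
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise ValueError(
            f"Hash mismatch: expected {expected_sha256}, got {actual}. "
            "The cached .tiktoken file may be corrupted; clear the cache and retry."
        )
```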
By combining the contractions into a single non-capturing group prefixed
by `'`, we can speed up matches by roughly 20%.

By using possessive quantifiers in the `cl100k_base` word and
punctuation groups, we avoid some backtracking.

The last whitespace groups can also be simplified to match a single
explicit newline, since the preceding whitespace group would already
match the rest.

Overall, the regex matches exactly the same sequences of characters as
before, for any input, including Unicode sequences.

Co-authored-by: Lőrinc <[email protected]>
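As a hedged illustration of the contraction grouping only (the real pattern lives in the Rust crate; possessive quantifiers like `?+`/`++` are available in Python's `re` from 3.11):

```
import re

text = "it's we'll they've"

# Separate alternatives, one per contraction (the old style):
old = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d")
# Combined into a single non-capturing group behind one leading quote:
new = re.compile(r"'(?:[sdmt]|ll|ve|re)")

assert old.findall(text) == new.findall(text) == ["'s", "'ll", "'ve"]
```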
The library doesn't support the encoding mapping for two new embedding models:
- `text-embedding-3-small`
- `text-embedding-3-large`

This adds the encoding mapping for the two new embedding models. The mapping
is taken from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
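
With the mapping in place, looking up the encoding for these models works like any other model name:

```
import tiktoken

enc = tiktoken.encoding_for_model("text-embedding-3-small")
print(enc.name)  # cl100k_base
```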
Hide the token rank behind a `Rank` type to make it easier to separate from other numeric values.
This way `byte_pair_encode` can be optimized further, assuming we'll always have at least 2 tokens.
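
The `Rank` change itself is in the Rust crate; as a rough Python analogue of the idea (illustrative only), a distinct type keeps ranks from being confused with other integers:

```
from typing import NewType

# Illustrative Python analogue of the Rust `Rank` type: ranks get their
# own type so they aren't mixed up with byte offsets or other ints.
Rank = NewType("Rank", int)

def get_rank(ranks: dict[bytes, Rank], piece: bytes) -> Rank | None:
    return ranks.get(piece)
```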
Lőrinc and others added 5 commits February 9, 2024 13:10
Based on a suggestion in openai#239
(specifically 8f5dd7d)

Like that commit, this:
- Does the init in a single loop and saves a loop if there are no merges
- Simplifies get_rank and no longer uses it in init (so you don't need
multiple skip values)

Unlike that commit:
- We drop optimisations enabled by ignoring single tokens. These didn't
show any benefit on benchmarks for me (this makes sense given typical
piece sizes, but let me know if that's unexpected!). Given this, I opted
for the simpler version.
- I preserve some of the comments from the original that I think are
still useful

Co-authored-by: @paplorinc

---------

Co-authored-by: Lőrinc Pap <[email protected]>
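
For context, a minimal Python sketch of the greedy merge loop this commit reworks (the real implementation is the Rust `byte_pair_merge`; names and structure here are simplified):

```
def byte_pair_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    # Start with one part per byte; assumes len(piece) >= 2.
    parts = [piece[i : i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        # Find the adjacent pair with the lowest (best) merge rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no adjacent pair can be merged further
        parts[best_i : best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[part] for part in parts]
```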