base repository: openai/tiktoken
base: 0.5.1
head repository: openai/tiktoken
compare: 0.7.0
- 16 commits
- 10 files changed
- 8 contributors
Commits on Dec 3, 2023
- Commit 6267f91
- Commit 9e79899
Commits on Jan 30, 2024
- Add support for checking hash of downloaded files before use. (#230)
We are using tiktoken in various production scenarios and sometimes have the problem that the download of `.tiktoken` files (e.g., `cl100k_base.tiktoken`) will get interrupted or fail, causing the cached file to be corrupted in some way. In those cases, the results returned from the encoder will be incorrect and could be damaging to our production instances.
More often, when this happens, `Encoder.encode()` will throw an exception such as `pyo3_runtime.PanicException: no entry found for key`, which turns out to be quite hard to track down.
In an effort to make tiktoken more robust for production use, this PR adds the `sha256` hash of each of the downloaded files to `openai_public.py` and augments `read_file` to check for the hash, if provided, when the file is accessed from the cache or downloaded directly. This causes errors to be flagged at file load time, rather than when the files are used, and provides a more meaningful error message indicating what might have gone wrong.
This also protects users of tiktoken from scenarios where a network issue or MITM attack could have corrupted these files in transit.
Commit 3ee6c35
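The hash check described in #230 can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names `check_hash` and `load_verified` and the error wording are assumptions made for this example, and the sketch works on bytes already in memory rather than on a cache file or download.

```python
import hashlib
from typing import Optional


def check_hash(data: bytes, expected_hash: Optional[str]) -> bool:
    """Return True if no hash was provided or the SHA-256 digest matches."""
    if expected_hash is None:
        # Hypothetical behaviour: without an expected hash there is nothing to verify.
        return True
    return hashlib.sha256(data).hexdigest() == expected_hash


def load_verified(data: bytes, expected_hash: Optional[str] = None) -> bytes:
    """Fail at load time instead of letting corrupted data reach the encoder."""
    if not check_hash(data, expected_hash):
        raise ValueError(
            "Hash mismatch: the file may be truncated or corrupted; "
            "consider deleting the cached copy and downloading it again."
        )
    return data
```

Checking at load time is the design point the commit message makes: a truncated file is reported once, with a readable message, rather than surfacing later as an opaque encoder panic.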
- Commit db5bda9
Commits on Feb 9, 2024
- Optimize regular expressions used for splitting by ~20% (#234)
By combining the contractions into a single non-capturing group prefixed by `'`, we can speed up matches by roughly 20%.
By using possessive quantifiers in the word and punctuation groups of `cl100k_base`, we avoid some backtracking.
The last whitespace groups can also be simplified to match a single newline explicitly, since the previous whitespace would already match it.
Overall the regex matches the exact same sequence of characters as before, for any case and for unicode sequences.
Co-authored-by: Lőrinc
Commit 6cc3a46
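The contraction change in #234 can be illustrated with a simplified fragment. The two patterns below are hypothetical stand-ins written for the standard-library `re` module; they cover only the contraction alternatives and are not the full splitting patterns shipped in tiktoken. The sketch checks that factoring the leading `'` out of the alternation does not change what is matched, and it makes no claim about speed.

```python
import re

# Simplified, illustrative fragments (assumptions, not the real tiktoken patterns).
# Before: every alternative repeats the leading apostrophe.
OLD_CONTRACTIONS = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d", re.IGNORECASE)
# After: one apostrophe, followed by a single non-capturing group.
NEW_CONTRACTIONS = re.compile(r"'(?:[sdmt]|ll|ve|re)", re.IGNORECASE)


def contractions(pattern: "re.Pattern[str]", text: str) -> list:
    """Return every contraction suffix the pattern finds in the text."""
    return pattern.findall(text)


sample = "I'm sure they'll say it's fine; we've heard DON'T, you're, and he'd."
old_matches = contractions(OLD_CONTRACTIONS, sample)
new_matches = contractions(NEW_CONTRACTIONS, sample)
```

With the apostrophe factored out, the engine can reject a non-apostrophe position after one comparison instead of trying each alternative in turn.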
- added two new embedding model's encoding (#247)
The library doesn't support the encoding mapping for two new embedding models, `text-embedding-3-small` and `text-embedding-3-large`. This commit adds the encoding mapping for both. The source of the mapping is https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
Commit 55c8d83
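A model-to-encoding mapping of the kind #247 extends might look like the sketch below. The dictionary and the helper `encoding_name_for` are hypothetical names for this example, not tiktoken's own table or API; the `cl100k_base` values follow the cookbook notebook cited in the commit message.

```python
# Illustrative mapping only; the real table lives inside the library.
MODEL_TO_ENCODING = {
    "text-embedding-ada-002": "cl100k_base",
    "text-embedding-3-small": "cl100k_base",
    "text-embedding-3-large": "cl100k_base",
}


def encoding_name_for(model_name: str) -> str:
    """Look up the encoding name for a model, with a readable error if unmapped."""
    try:
        return MODEL_TO_ENCODING[model_name]
    except KeyError:
        raise KeyError(f"No encoding mapped for model {model_name!r}") from None
```

Before such an entry exists, a lookup for a new model name fails even though the model uses an encoding the library already ships, which is the gap the commit closes.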
- Commit 89153d7
- Commit 01df436
- Commit 84d88dc
- Store tokens in u32 instead of usize
And hide it behind a Rank type to make it easier to separate it from other numeric values
Commit c2960c1
- Commit 6e4851a
- Avoid calling byte_pair_encode for existing tokens
This way byte_pair_encode can be optimized further, assuming we'll always have at least 2 tokens
Commit b4c687e
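The fast path this commit describes can be sketched in Python, assuming a rank table that maps byte strings to token ranks. `encode_piece` and the recording stand-in for the slow path are hypothetical names for illustration; the actual change is in the Rust core and is not reproduced here.

```python
from typing import Callable, Dict, List

Ranks = Dict[bytes, int]


def encode_piece(
    piece: bytes,
    ranks: Ranks,
    byte_pair_encode: Callable[[bytes, Ranks], List[int]],
) -> List[int]:
    """Return the rank directly when the whole piece is already a token."""
    rank = ranks.get(piece)
    if rank is not None:
        return [rank]  # fast path: no merging needed
    return byte_pair_encode(piece, ranks)  # slow path: piece must be split


slow_path_calls: List[bytes] = []


def recording_byte_pair_encode(piece: bytes, ranks: Ranks) -> List[int]:
    """Stand-in slow path: records the call and looks up each byte on its own."""
    slow_path_calls.append(piece)
    return [ranks[piece[i:i + 1]] for i in range(len(piece))]


ranks = {b"a": 0, b"b": 1, b"ab": 2}
known = encode_piece(b"ab", ranks, recording_byte_pair_encode)
unknown = encode_piece(b"ba", ranks, recording_byte_pair_encode)
```

Because pieces that are already tokens never reach the slow path, the slow path is free to assume its input will split into at least two tokens.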
- Commit 6defed5
Commits on Feb 11, 2024
- Simplify byte_pair_merge (#255)
Based on suggestion in #239 (specifically 8f5dd7d).
Like that commit, this:
  - Does the init in a single loop and saves a loop if there are no merges
  - Simplifies get_rank and no longer uses it in init (so you don't need multiple skip values)
Unlike that commit:
  - We drop optimisations enabled by ignoring single tokens. These didn't show any benefit on benchmarks for me (this makes sense given typical piece sizes, but let me know if that's unexpected!). Given this, I opted for the simpler version.
  - I preserve some of the comments from the original that I think are still useful
Co-authored-by: @paplorinc
Co-authored-by: Lőrinc Pap
Commit 1b9faf2
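For readers unfamiliar with what `byte_pair_merge` computes, here is a naive Python sketch of greedy lowest-rank pair merging. It is an assumption-laden illustration of the general technique, not a port of the Rust function: it rescans every adjacent pair on each iteration and has none of the single-loop initialisation or `get_rank` simplifications the commit message mentions.

```python
from typing import Dict, List, Optional


def byte_pair_merge(piece: bytes, ranks: Dict[bytes, int]) -> List[bytes]:
    """Repeatedly merge the adjacent pair whose concatenation has the lowest rank."""
    parts = [piece[i:i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        best_index: Optional[int] = None
        best_rank: Optional[int] = None
        # Scan all adjacent pairs and remember the lowest-ranked mergeable one.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair is a known token, so nothing more to merge
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    return parts


def byte_pair_encode(piece: bytes, ranks: Dict[bytes, int]) -> List[int]:
    """Map each merged part to its rank; assumes every single byte has a rank."""
    return [ranks[part] for part in byte_pair_merge(piece, ranks)]


example_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4, b"bc": 5}
merged_fully = byte_pair_encode(b"abc", example_ranks)
merged_partly = byte_pair_encode(b"cab", example_ranks)
```

In the first example `ab` (rank 3) is merged before `bc` (rank 5), after which `ab` and `c` combine into `abc`; in the second, only `ab` can be merged.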
Commits on May 13, 2024
- Commit 9d01e56
- Commit bfe00ad
This comparison is taking too long to generate.
Unfortunately it looks like we can’t render this comparison for you right now. It might be too big, or there might be something weird with your repository.
You can try running this command locally to see the comparison on your machine:
git diff 0.5.1...0.7.0