
Comparing changes

base repository: openai/tiktoken
base: 0.5.1
head repository: openai/tiktoken
compare: 0.7.0

  • 16 commits
  • 10 files changed
  • 8 contributors

Commits on Dec 3, 2023

  1. Sync codebase

    hauntsaninja committed Dec 3, 2023 (6267f91)
  2. Sync codebase

    hauntsaninja committed Dec 3, 2023 (9e79899)

Commits on Jan 30, 2024

  1. Add support for checking hash of downloaded files before use. (#230)

    We are using tiktoken in various production scenarios and sometimes have
    the problem that the download of `.tiktoken` files (e.g.,
    `cl100k_base.tiktoken`) will get interrupted or fail, causing the cached
    file to be corrupted in some way. In those cases, the results returned
    from the encoder will be incorrect and could be damaging to our
    production instances.
    
    More often, when this happens, `Encoder.encode()` will throw an
    exception such as
    ```
    pyo3_runtime.PanicException: no entry found for key
    ```
    which turns out to be quite hard to track down.
    
    In an effort to make tiktoken more robust for production use, this PR
    adds the `sha256` hash of each of the downloaded files to
    `openai_public.py` and augments `read_file` to check for the hash, if
    provided, when the file is accessed from the cache or downloaded
    directly. This causes errors to be flagged at file load time, rather
    than when the files are used, and provides a more meaningful error
    message indicating what might have gone wrong.
    
    This also protects users of tiktoken from scenarios where a network
    issue or MITM attack could have corrupted these files in transit.
    mdwelsh committed Jan 30, 2024 (3ee6c35)
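    As an illustration, here is a minimal sketch of this kind of integrity
    check (the function name and signature are hypothetical; the real logic
    lives in tiktoken's file-loading code):

    ```python
    import hashlib

    def check_hash(data: bytes, expected_sha256: str) -> None:
        # Hash the downloaded bytes and compare against the pinned digest, so
        # corruption is caught at load time rather than at encode time.
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256:
            raise ValueError(
                f"Hash mismatch: expected {expected_sha256}, got {actual}. "
                "The cached or downloaded file may be corrupted; "
                "delete it and retry."
            )
    ```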
  2. db5bda9

Commits on Feb 9, 2024

  1. Optimize regular expressions used for splitting by ~20% (#234)

    By combining the contractions into a single non-capturing group prefixed
    by `'`, we can speed up matches by roughly 20%.
    
    By using possessive quantifiers in the word and punctuation groups of
    `cl100k_base`, we avoid some backtracking.
    
    The last whitespace groups can also be simplified to match a single
    newline explicitly, since the preceding whitespace pattern already
    matches the rest.
    
    Overall, the regex matches exactly the same sequences of characters as
    before, for every input, including Unicode sequences.
    
    Co-authored-by: Lőrinc <[email protected]>
    l0rinc and Lőrinc committed Feb 9, 2024 (6cc3a46)
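    To illustrate the kind of rewrite described, a sketch using Python's
    `regex` module (which supports possessive quantifiers); the patterns
    below are simplified and illustrative, not the exact `cl100k_base` ones:

    ```python
    import regex

    # Before: each contraction spelled out as its own alternative.
    before = regex.compile(r"'s|'t|'re|'ve|'m|'ll|'d")
    # After: one non-capturing group prefixed by the shared apostrophe.
    after = regex.compile(r"'(?:s|t|re|ve|m|ll|d)")

    # Possessive quantifier (++) forbids backtracking into the letter run.
    word = regex.compile(r"[^\r\n\p{L}\p{N}]?\p{L}++")

    text = "I'll say they're done"
    assert before.findall(text) == after.findall(text)
    ```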
  2. Add encodings for two new embedding models (#247)

    The library didn't support the encoding mapping for two new embedding
    models:
    - `text-embedding-3-small`
    - `text-embedding-3-large`

    Added the encoding mapping for these two models. The mapping is taken
    from https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb
    Praneet460 committed Feb 9, 2024 (55c8d83)
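    With the mapping in place, the new models resolve to `cl100k_base`
    through the usual model lookup (assuming a tiktoken version that
    includes this commit):

    ```python
    import tiktoken

    # Both new embedding models map to the cl100k_base encoding.
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
    print(enc.name)                   # cl100k_base
    print(enc.encode("hello world"))  # token ids under cl100k_base
    ```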
  3. Sync codebase

    hauntsaninja committed Feb 9, 2024 (89153d7)
  4. Update cibuildwheel

    hauntsaninja committed Feb 9, 2024 (01df436)
  5. 84d88dc
  6. Store tokens in u32 instead of usize

    And hide it behind a `Rank` type to make it easier to separate from
    other numeric values.
    Lőrinc authored and hauntsaninja committed Feb 9, 2024 (c2960c1)
  7. 6e4851a
  8. Avoid calling byte_pair_encode for existing tokens

    This way byte_pair_encode can be optimized further, since it can assume
    every piece it sees produces at least two tokens.
    Lőrinc authored and hauntsaninja committed Feb 9, 2024 (b4c687e)
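    A minimal Python sketch of the fast path this describes (hypothetical
    names; the real code is in the Rust core, and `byte_pair_encode` stands
    for the full merge routine, sketched under #255 below):

    ```python
    def encode_piece(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
        # Fast path: if the whole piece is already a token in the
        # vocabulary, return its rank directly and skip BPE entirely.
        token = ranks.get(piece)
        if token is not None:
            return [token]
        # Otherwise fall back to BPE, which may now assume the piece
        # splits into at least two tokens.
        return byte_pair_encode(piece, ranks)
    ```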
  9. Inline custom mapping function in _byte_pair_merge

    Lőrinc authored and hauntsaninja committed Feb 9, 2024 (6defed5)

Commits on Feb 11, 2024

  1. Simplify byte_pair_merge (#255)

    Based on a suggestion in #239
    (specifically 8f5dd7d)
    
    Like that commit, this:
    - Does the init in a single loop and saves a loop if there are no merges
    - Simplifies get_rank and no longer uses it in init (so you don't need
    multiple skip values)
    
    Unlike that commit:
    - We drop optimisations enabled by ignoring single tokens. These didn't
    show any benefit on benchmarks for me (this makes sense given typical
    piece sizes, but let me know if that's unexpected!). Given this, I opted
    for the simpler version.
    - I preserve some of the comments from the original that I think are
    still useful
    
    Co-authored-by: @paplorinc
    
    ---------
    
    Co-authored-by: Lőrinc Pap <[email protected]>
    hauntsaninja and l0rinc committed Feb 11, 2024 (1b9faf2)
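    For context, a pure-Python sketch of the simplified merge loop: start
    with one part per byte, then repeatedly merge the adjacent pair with the
    lowest rank. The real implementation is in Rust and tracks (start, rank)
    indices instead of materializing byte slices, but the algorithm is the
    same:

    ```python
    def byte_pair_merge(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
        parts = [bytes([b]) for b in piece]
        while len(parts) > 1:
            # Find the adjacent pair whose concatenation has the lowest rank.
            best_idx, best_rank = None, None
            for i in range(len(parts) - 1):
                rank = ranks.get(parts[i] + parts[i + 1])
                if rank is not None and (best_rank is None or rank < best_rank):
                    best_idx, best_rank = i, rank
            if best_rank is None:
                break  # No adjacent pair is in the vocabulary; we're done.
            parts[best_idx : best_idx + 2] = [parts[best_idx] + parts[best_idx + 1]]
        return parts

    # Token ids are then the ranks of the final parts:
    # [ranks[p] for p in byte_pair_merge(piece, ranks)]
    ```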

Commits on May 13, 2024

  1. Sync codebase

    hauntsaninja committed May 13, 2024 (9d01e56)
  2. Bump cibuildwheel

    hauntsaninja committed May 13, 2024 (bfe00ad)