Use tiktoken #1044

Merged (9 commits, Mar 13, 2023)
Changes from 1 commit
bypassing load_tiktoken_bpe to avoid blobfile dep
jongwook committed Mar 13, 2023
commit a0bd014f13593101b95ea30ca7155356267f519b
7 changes: 5 additions & 2 deletions whisper/tokenizer.py
@@ -1,11 +1,11 @@
+import base64
 import os
 import string
 from dataclasses import dataclass, field
 from functools import cached_property, lru_cache
 from typing import Dict, List, Optional, Tuple
 
 import tiktoken
-from tiktoken.load import load_tiktoken_bpe
 from tiktoken_ext.openai_public import gpt2
 
 LANGUAGES = {
@@ -315,7 +315,10 @@ def split_tokens_on_spaces(self, tokens: List[int]):
 @lru_cache(maxsize=None)
 def get_encoding(name: str = "gpt2"):
     vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
-    ranks = load_tiktoken_bpe(vocab_path)
+    ranks = {
+        base64.b64decode(token): int(rank)
+        for token, rank in (line.split() for line in open(vocab_path) if line)

Review comment on the open(vocab_path) line: Here, you opened the file and left an unclosed handle. (See the context-manager sketch after the diff.)

+    }
     n_vocab = len(ranks)
     special_tokens = {}
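
The inline review comment above flags that open(vocab_path) leaves the file handle unclosed. A minimal sketch of the same parsing written with a context manager, assuming vocab_path as defined in get_encoding, so the handle is closed deterministically:

    # Same parsing as in the diff above, but the vocab file is closed
    # as soon as the ranks mapping has been built.
    with open(vocab_path) as vocab_file:
        ranks = {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in vocab_file if line)
        }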

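For context, the ranks mapping built in this diff is the mergeable_ranks input that tiktoken expects. The sketch below is not part of the PR; it only illustrates how such a mapping could be wrapped in a tiktoken.Encoding, assuming the standard GPT-2 split regex and special token ids that continue right after the base vocabulary (as the n_vocab counter in the diff suggests):

    import tiktoken

    # Illustration only: wrap a manually parsed ranks dict in an Encoding.
    # The pat_str is the GPT-2 split regex; the pattern and special tokens
    # actually used by the PR may differ.
    def build_encoding(name: str, ranks: dict, special_tokens: dict) -> tiktoken.Encoding:
        return tiktoken.Encoding(
            name=name,
            explicit_n_vocab=len(ranks) + len(special_tokens),
            pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
            mergeable_ranks=ranks,
            special_tokens=special_tokens,
        )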