Switch to tiktoken-based tokenizer #3

Open · wants to merge 1 commit into base: main

Conversation

@balgillo commented Nov 6, 2023

Thanks for creating this fork of whisper!

The latest code is failing for me as follows:

pip install git+https://github.com/zhuzilin/whisper-openvino.git
whisper --language en --model tiny test_data/at_the_time.wav 

Traceback (most recent call last):
  File "/home/azureuser/whisper-openvino-venv/bin/whisper", line 8, in <module>
    sys.exit(cli())
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 286, in cli
    result = transcribe(model, audio_path, temperature=temperature, **args)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 171, in transcribe
    result = decode_with_fallback(segment)[0]
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 99, in decode_with_fallback
    results = model.decode(segment, options)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/decoding.py", line 695, in decode
    result = DecodingTask(model, options).run(mel)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/decoding.py", line 463, in __init__
    self.sot_index: int = self.initial_tokens.index(tokenizer.sot)
ValueError: tuple.index(x): x not in tuple

The get_tokenizer function and _get_single_token_id("<|startoftranscript|>") in Tokenizer disagree on the value of the sot token: 50258 in the former, 50335 in the latter. As a result, the sot_sequence built by get_tokenizer starts with 50258 while the Tokenizer.sot property resolves to 50335; since initial_tokens in DecodingTask is derived from sot_sequence, the initial_tokens.index(tokenizer.sot) call at decoding.py line 463 raises the ValueError above.
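
Here is a minimal check (a sketch, not included in this PR) that surfaces the disagreement directly. It assumes the fork's whisper.tokenizer still follows upstream's pre-tiktoken layout (get_tokenizer building sot_sequence from all_special_ids, and a sot property backed by _get_single_token_id); the 50258/50335 values are simply what my run produced:

from whisper.tokenizer import get_tokenizer

# Same arguments decoding.py passes for a multilingual model with the
# failing command above (--language en, task "transcribe").
tokenizer = get_tokenizer(True, language="en", task="transcribe")

# Id that get_tokenizer places at the start of sot_sequence (50258 in my run):
print(tokenizer.sot_sequence[0])

# Id that the sot property resolves via
# _get_single_token_id("<|startoftranscript|>") (50335 in my run):
print(tokenizer.sot)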

I've been able to fix this by bringing in the latest tokenizer.py from upstream, along with the associated tiktoken dependency and token files. This PR contains those changes. It's not a full catch-up merge with upstream.
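
For context on why the tiktoken-based tokenizer avoids this class of mismatch: every special token gets an explicit id in a single tiktoken.Encoding table, so the special-token lookup and encode() read the same value. The snippet below is only an illustrative sketch using tiktoken directly, following tiktoken's documented extension pattern rather than copying upstream's tokenizer.py; the encoding name and the resulting id are made up for the demo:

import tiktoken

# Start from the stock GPT-2 ranks and register one extra special token with an
# explicit id, similar in spirit to how upstream whisper builds its encoding.
gpt2 = tiktoken.get_encoding("gpt2")
enc = tiktoken.Encoding(
    name="gpt2_sot_demo",  # hypothetical name, demo only
    pat_str=gpt2._pat_str,
    mergeable_ranks=gpt2._mergeable_ranks,
    special_tokens={**gpt2._special_tokens, "<|startoftranscript|>": gpt2.n_vocab},
)

# Both lookups read the same table, so they cannot drift apart:
sot_from_table = enc._special_tokens["<|startoftranscript|>"]
sot_from_encode = enc.encode("<|startoftranscript|>", allowed_special="all")[0]
assert sot_from_table == sot_from_encode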
