Switch to tiktoken-based tokenizer #3

Open · wants to merge 1 commit into base: main

Conversation

@balgillo commented Nov 6, 2023

Thanks for creating this fork of whisper!

The latest code is failing for me as follows:

pip install git+https://github.com/zhuzilin/whisper-openvino.git
whisper --language en --model tiny test_data/at_the_time.wav 

Traceback (most recent call last):
  File "/home/azureuser/whisper-openvino-venv/bin/whisper", line 8, in <module>
    sys.exit(cli())
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 286, in cli
    result = transcribe(model, audio_path, temperature=temperature, **args)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 171, in transcribe
    result = decode_with_fallback(segment)[0]
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/transcribe.py", line 99, in decode_with_fallback
    results = model.decode(segment, options)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/decoding.py", line 695, in decode
    result = DecodingTask(model, options).run(mel)
  File "/home/azureuser/whisper-openvino-venv/lib/python3.8/site-packages/whisper/decoding.py", line 463, in __init__
    self.sot_index: int = self.initial_tokens.index(tokenizer.sot)
ValueError: tuple.index(x): x not in tuple

The get_tokenizer function and _get_single_token_id("<|startoftranscript|>") in Tokenizer disagree on the value of the sot token: 50258 in the former, 50335 in the latter. As a result, the sot_sequence built by get_tokenizer starts with 50258 while the Tokenizer.sot property resolves to 50335; since initial_tokens in DecodingTask is derived from sot_sequence, the initial_tokens.index(tokenizer.sot) call at decoding.py line 463 raises the ValueError above.
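
Here is a minimal check (a sketch, not included in this PR) that surfaces the disagreement directly. It assumes the fork's whisper.tokenizer still follows upstream's pre-tiktoken layout (get_tokenizer building sot_sequence from all_special_ids, and a sot property backed by _get_single_token_id); the 50258/50335 values are simply what my run produced:

from whisper.tokenizer import get_tokenizer

# Same arguments decoding.py passes for a multilingual model with the
# failing command above (--language en, task "transcribe").
tokenizer = get_tokenizer(True, language="en", task="transcribe")

# Id that get_tokenizer places at the start of sot_sequence (50258 in my run):
print(tokenizer.sot_sequence[0])

# Id that the sot property resolves via
# _get_single_token_id("<|startoftranscript|>") (50335 in my run):
print(tokenizer.sot)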

I've been able to fix this by bringing in the latest tokenizer.py from upstream, along with the associated tiktoken dependency and token files. This PR contains those changes. It's not a full catch-up merge with upstream.
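
For context on why the tiktoken-based tokenizer avoids this class of mismatch: every special token gets an explicit id in a single tiktoken.Encoding table, so the special-token lookup and encode() read the same value. The snippet below is only an illustrative sketch using tiktoken directly, following tiktoken's documented extension pattern rather than copying upstream's tokenizer.py; the encoding name and the resulting id are made up for the demo:

import tiktoken

# Start from the stock GPT-2 ranks and register one extra special token with an
# explicit id, similar in spirit to how upstream whisper builds its encoding.
gpt2 = tiktoken.get_encoding("gpt2")
enc = tiktoken.Encoding(
    name="gpt2_sot_demo",  # hypothetical name, demo only
    pat_str=gpt2._pat_str,
    mergeable_ranks=gpt2._mergeable_ranks,
    special_tokens={**gpt2._special_tokens, "<|startoftranscript|>": gpt2.n_vocab},
)

# Both lookups read the same table, so they cannot drift apart:
sot_from_table = enc._special_tokens["<|startoftranscript|>"]
sot_from_encode = enc.encode("<|startoftranscript|>", allowed_special="all")[0]
assert sot_from_table == sot_from_encode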
