Update whisper decoding algorithm #1355

kyakuno · 2023-12-27T06:16:13Z

The whisper models in ailia-models were exported in October 2022. Since then, code improvements in upstream whisper have made repeated output less likely, so this issue ports the latest whisper changes.

Commit log: https://github.com/openai/whisper/commits/main/
Decoder: https://github.com/openai/whisper/commits/main/whisper/decoding.py
Main: https://github.com/openai/whisper/commits/main/whisper/transcribe.py

kyakuno · 2023-12-27T06:17:05Z

Evaluation targets.

ailia

python3 whisper.py -m base -i input.wav

whisper official

import time

import whisper

beam_size = 1

# Load the "base" checkpoint so the comparison matches the ailia run above.
model = whisper.load_model("base")

start = int(round(time.time() * 1000))
result = model.transcribe("sample.mp3", language="ja", beam_size=beam_size, verbose=True)
end = int(round(time.time() * 1000))
print("Whisper base", result["text"])
estimation_time = end - start
print(f'\ttotal processing time {estimation_time} ms')

kyakuno · 2023-12-27T06:20:20Z

First, understand what changed in each version of the official whisper and how each version performs.

kyakuno · 2023-12-28T01:49:38Z

Between 2023/03/08 and 2023/03/14, the performance of the base model improves dramatically.

kyakuno · 2023-12-28T01:53:46Z

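Changes in that range.

openai/whisper#1044
Migrates the tokenizer from HuggingFace to tiktoken, which removes the dependency on tensorflow

openai/whisper#1087
Improves word timestamps

openai/whisper#1076
Improves how the GitHub language stats are displayed

openai/whisper#1089
Guards against invalid unicode appearing in the output

openai/whisper#1090
Fixes an error that occurred on empty input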

kyakuno · 2023-12-28T02:08:40Z

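The performance difference comes from the migration to tiktoken.
I assumed tiktoken and transformers implemented equivalent logic, but perhaps some other change is included as well?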
openai/whisper#1044

kyakuno · 2023-12-29T07:43:47Z

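With the 2023/03/08 version, the temperature keeps rising from the second segment onward.
The cause is that the sot_prev API of the old tokenizer returns 50324 instead of 50361.
The 2023/03/14 tiktoken version correctly returns 50361.
Patching it to return 50361 makes the strange output disappear on 2023/03/08 as well.
The same problem is also reported here: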
zhuzilin/whisper-openvino#3
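
To see why 50324 is a wrong value here, the sketch below rebuilds the special-token ids from the vocabulary layout. It assumes the multilingual (pre-v3) vocabulary: 50257 ordinary tokens, then the specials in the order listed, with 99 language tokens. The `<|langN|>` names and the `build_special_token_ids` helper are placeholders for illustration, not whisper APIs.

```python
# Assumption: multilingual (pre-v3) vocabulary with 50257 ordinary tokens,
# followed by the special tokens in the order used below.
N_BASE_TOKENS = 50257
N_LANGUAGES = 99


def build_special_token_ids(n_base=N_BASE_TOKENS, n_languages=N_LANGUAGES):
    """Assign consecutive ids to the special tokens, starting after the base vocabulary."""
    names = ["<|endoftext|>", "<|startoftranscript|>"]
    # Placeholder names; the real tokens carry language codes.
    names += [f"<|lang{i}|>" for i in range(n_languages)]
    names += [
        "<|translate|>",
        "<|transcribe|>",
        "<|startoflm|>",
        "<|startofprev|>",
        "<|nospeech|>",
        "<|notimestamps|>",
    ]
    return {name: n_base + offset for offset, name in enumerate(names)}


ids = build_special_token_ids()
sot_prev = ids["<|startofprev|>"]
lang_first = ids["<|lang0|>"]
lang_last = ids[f"<|lang{N_LANGUAGES - 1}|>"]

print(sot_prev)                           # 50361
print(lang_first <= 50324 <= lang_last)   # True: 50324 is a language token id
```

Under these assumptions 50324 lies inside the language-token range, so the previous-text prompt would start with a language token rather than the start-of-previous marker.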

kyakuno · 2023-12-29T12:29:34Z

Comparing kv_cache in normal and dynamic modes, the results match.

python3 whisper.py -m base -i sample.mp3 --normal
python3 whisper.py -m base -i sample.mp3 --dynamic_kv_cache

The results of normal and opt do not match.

python3 whisper.py -m base -i sample.mp3 --normal
python3 whisper.py -m base -i sample.mp3

This is because MeanVarianceNormalization cannot hold an epsilon: when the Optimizer converts the layer to MeanVarianceNormalization, a numerical error arises between torch and onnx.
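
A rough way to see the size of this error is to normalize the same vector with and without an epsilon under the square root. This is a pure-Python sketch, not the ailia or ONNX implementation; eps=1e-5 is assumed because it is the torch LayerNorm default, and the input vectors are made up.

```python
import math


def normalize(x, eps):
    """Normalize x to zero mean and unit variance, adding eps to the variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def max_abs_diff(x, eps=1e-5):
    """Largest element-wise gap between normalizing with eps and with eps = 0."""
    with_eps = normalize(x, eps)
    without_eps = normalize(x, 0.0)
    return max(abs(a - b) for a, b in zip(with_eps, without_eps))


large_var = [-2.0, -1.0, 1.0, 2.0]           # variance 2.5
small_var = [v * 1e-2 for v in large_var]    # variance 2.5e-4

print(max_abs_diff(large_var))   # tiny: eps is negligible next to the variance
print(max_abs_diff(small_var))   # much larger: eps is ~4% of the variance
```

The gap only becomes visible when the activations have a small variance, which would fit the observation that the mismatch shows up as output fluctuation rather than as a gross error.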

kyakuno · 2023-12-29T12:46:30Z

Normally the effect of epsilon is small, but base is a small and inherently unstable model, so its output fluctuates easily.

kyakuno · 2023-12-29T12:47:46Z

As the straightforward next step, merge the latest timestamp handling.

kyakuno · 2023-12-29T12:52:22Z

The transcribe.py changes were already incorporated when v3 support was added.
#1313 (comment)
The decoding.py changes have not been incorporated yet.

kyakuno · 2024-01-05T05:50:46Z

Since the epsilon of MeanVarianceNormalization appears to be the problem, whisper was re-exported with opset=17 so that the layers become LayerNormalization. With opset=17, the epsilon error does not occur.
axinc-ai/whisper-export#2

kyakuno · 2024-01-09T00:59:21Z

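whisper compresses the decoded text with gzip, uses the compression ratio to detect repetition, and triggers decode_fallback to suppress it. In decode_fallback the temperature is raised and decoding switches to sampling, so the result changes on every inference run.

However, with base in the whisper-export repository, decode_fallback is triggered and repetition occurs even in torch, whereas the latest whisper-official does not trigger decode_fallback.

This appears to be caused by the timestamp rule: once the timestamp rule is updated to the latest version, decode_fallback is no longer triggered in whisper-export either, and the temperature does not rise.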
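
The repetition check and the temperature fallback described in this comment can be sketched as follows. This is a simplified illustration, not the ailia-models code: upstream whisper computes the ratio with zlib and also checks the average log probability, which is omitted here. The 2.4 threshold and the temperature schedule follow the upstream defaults; `decode_with_fallback` and `fake_decode` are hypothetical names.

```python
import zlib

COMPRESSION_RATIO_THRESHOLD = 2.4
TEMPERATURES = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)


def compression_ratio(text):
    """Raw UTF-8 size divided by compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode_fn, temperatures=TEMPERATURES):
    """Retry decoding at rising temperatures until the text stops looking repetitive."""
    for temperature in temperatures:
        text = decode_fn(temperature)
        if compression_ratio(text) <= COMPRESSION_RATIO_THRESHOLD:
            break
    return text, temperature


def fake_decode(temperature):
    # Hypothetical stand-in for the decoder: repeats itself at low temperatures.
    if temperature < 0.4:
        return "thank you. " * 30
    return "The quick brown fox jumps over the lazy dog near the river bank."


text, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature, text)
```

With a real decoder, any temperature above 0 means sampling, which is why a segment that falls back can produce different text on each run.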

kyakuno · 2024-01-09T01:29:52Z

Disabling the following causes decode_fallback to trigger, so detecting timestamps that go backwards appears to be important.

            if timestamps.numel() > 0:
                # timestamps shouldn't decrease; forbid timestamp tokens smaller than the last
                # also force each segment to have a nonzero length, to prevent infinite looping
                if last_was_timestamp and not penultimate_was_timestamp:
                    timestamp_last = timestamps[-1]
                else:
                    timestamp_last = timestamps[-1] + 1
                logits[k, self.tokenizer.timestamp_begin : timestamp_last] = -np.inf

openai/whisper#914
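
For reference, the rule quoted above can be reproduced on plain Python lists to see which timestamp tokens it forbids. This is a sketch under assumptions: the toy vocabulary (20 tokens, timestamps starting at id 10) and the `mask_decreasing_timestamps` helper are made up, and the two `*_was_timestamp` flags are computed the way I understand upstream decoding.py to compute them.

```python
import math


def mask_decreasing_timestamps(logits, sampled_tokens, timestamp_begin):
    """Set logits of timestamp tokens that would move backwards in time to -inf."""
    seq = list(sampled_tokens)
    last_was_timestamp = len(seq) >= 1 and seq[-1] >= timestamp_begin
    penultimate_was_timestamp = len(seq) < 2 or seq[-2] >= timestamp_begin

    timestamps = [t for t in seq if t >= timestamp_begin]
    if timestamps:
        if last_was_timestamp and not penultimate_was_timestamp:
            # A segment was just closed: the next opening timestamp may repeat it.
            timestamp_last = timestamps[-1]
        else:
            # Otherwise the next timestamp must be strictly later (nonzero length).
            timestamp_last = timestamps[-1] + 1
        for i in range(timestamp_begin, min(timestamp_last, len(logits))):
            logits[i] = -math.inf
    return logits


def allowed_timestamps(sampled_tokens, vocab_size=20, timestamp_begin=10):
    """Return the timestamp token ids that survive the mask (toy vocabulary)."""
    logits = mask_decreasing_timestamps([0.0] * vocab_size, sampled_tokens, timestamp_begin)
    return [i for i in range(timestamp_begin, vocab_size) if logits[i] != -math.inf]


print(allowed_timestamps([12, 3, 4]))       # text in progress: only ids above 12
print(allowed_timestamps([12, 3, 14]))      # segment just closed at 14: 14 and later
print(allowed_timestamps([12, 3, 14, 14]))  # new segment opened at 14: only ids above 14
```

Without this mask the decoder is free to emit an earlier timestamp and re-transcribe audio it has already covered, which would be one way for the repeated text that triggers decode_fallback to appear.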

kyakuno · 2024-01-09T06:22:15Z

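The base model appears to be quite sensitive: the result differs greatly depending on whether flg_ffmpeg is enabled or disabled.
Also, fallback still occurs even when the mp3 is converted to wav beforehand.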

kyakuno self-assigned this Dec 27, 2023

kyakuno mentioned this issue Dec 29, 2023

Added latest version of timestamp rule to whisper, Added layer normalization models #1356

Merged

kyakuno closed this as completed Jan 10, 2024
