DE19710545C1

DE19710545C1 - Time scale modification method for speech signals

Info

Publication number: DE19710545C1
Application number: DE19710545A
Authority: DE
Inventors: Holger Dr Carl
Original assignee: Grundig AG
Current assignee: Grundig Multimedia BV
Priority date: 1997-03-14
Filing date: 1997-03-14
Publication date: 1997-12-04
Anticipated expiration: 2017-03-15
Also published as: ATE255763T1; EP0865026A2; DE59810302D1; EP0865026A3; EP0865026B1

Abstract

In the modification method, an analog speech signal is converted into a corresponding digital signal and stored in a memory. The stored signal is lengthened or shortened by a predefined factor using an overlap-add method. To this end, the stored speech signal is divided into segments, which are weighted with a window function comprising a rising section, a constant section and a falling section. The weighted segments are compared for waveform similarity and are added where the similarity is at its maximum.

Description

The invention relates to a method for speed modification of speech signals in the time domain, in particular an efficient overlap-add method.

In various areas of speech and audio signal processing, it is desirable to change the playback speed of these signals, if possible without impairing their naturalness or, in the case of speech, their intelligibility. From a technical point of view, this goal of preserving the sound character can be stated as follows: although the time scale of the signals is modified, their short-time spectral properties should remain unchanged. For speech signals in particular, this means that the fundamental frequency and the formants must be preserved when the speed is modified.

Time compression or time expansion of audio signals is used in studios, for example to trim commercials to the intended length. In dictation technology, too, adapting the playback speed to the needs or abilities of the typist is important. A further application is the real-time transmission of speech signals, in which data packets arrive at the receiver with variable delay. By applying speed modification, the overall delay can be kept lower on average than the worst-case delay of the transmission link, without a late-arriving data packet causing dropouts or other similarly disturbing effects.

For many applications, in addition to the desire for the highest possible sound quality, the method must meet the following additional requirements:

A low-cost real-time implementation must be achievable, and it must be possible to change the speed modification factor at run time, continuously if possible. It is undoubtedly also an advantage if the algorithm manages without a pitch estimate, which is always error-prone.

The first investigations into the compression and expansion of speech signals are known from "Method for Time or Frequency Compression-Expansion of Speech", by G. Fairbanks and R. P. Jaeger, Inst. of Radio Engineers Trans. on Audio, Vol. AU-2, No. 1, pp. 7-12, Jan. 1954. Since then, frequency-domain methods have often been used, which is an obvious choice because, as mentioned at the outset, the short-time spectral properties of the speech signal are to be preserved. Since the mid-1980s, comparatively simple overlap-add methods operating in the time domain have been known, with which very good-sounding time-scaled speech signals can be generated.

In "Signal Estimation from Modified Short-Time Fourier Transform", by D. W. Griffin, in IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-32, No. 2, pp. 236-242, Apr. 1984, Griffin and Lim report on experiments with a very elaborate, iteratively operating phase determination. This approach is in turn referred to in the publication by S. Roucos and A. M. Wilgus, "High Quality Time-Scale Modification for Speech", IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 493-496, 1985, which proposes a time-domain method that generates time-scaled speech signals by means of an overlap-add approach. In this so-called SOLA method (SOLA = Synchronized OverLap-Add), the sections taken from the original signal at regular intervals are synchronized by shifting them before the respective windowing and addition in the target signal. In a broader sense, this corresponds to the phase optimization carried out in the frequency-domain methods. Closely related to the SOLA algorithm is the so-called WSOLA method (WSOLA = Waveform Similarity Overlap-Add), which W. Verhelst and M. Roelands present in "An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech", IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 554-557, 1993, and "Waveform Similarity Based Overlap-Add (WSOLA) for Time-Scale Modification of Speech: Structures and Evaluation", Int. Conf. on Speech Communication and Technology, pp. 337-340, 1993. The main difference between these two approaches lies in the synchronization, which in the WSOLA method is achieved by taking segments from the original signal at shifted positions; compared with the SOLA principle, this mainly reduces the computational effort.

The object of the invention is to specify a method for speed modification of speech signals in the time domain that works particularly efficiently and requires less effort than the prior art.

This object is achieved by the features of claims 1 and 2. Advantageous embodiments of the invention are given in the following description.

The version y(k) of a speech signal x(k), time-scaled by the factor α, is generated according to the synthesis equation

[equation missing in source]

with a window function

[equation missing in source]

The function v(k) appearing here, defined for k = 0, ..., N-1, is sensibly chosen to increase monotonically between its extrema v(0) = ε₀ with 0 < ε₀ ≪ 1 and v(N-1) = 1-ε₁ with 0 < ε₁ ≪ 1.
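The synthesis equation and the window definition are not reproduced in this text. The following is one form that would be consistent with the surrounding description (segments of length L+N, read at intervals of about αL, written at intervals of L, with a per-segment shift Δ_λ). It is an assumption, not the wording of the patent. In particular, the rounding ⌊·⌋ and the exact index ranges are illustrative choices, and the constant section is taken to span N ≤ k < L so that the window support matches the stated segment length L+N.

```latex
y(k) = \sum_{\lambda} w(k - \lambda L)\,
       x\bigl(k - \lambda L + \lfloor \lambda \alpha L \rfloor + \Delta_\lambda\bigr),
\qquad
w(k) =
\begin{cases}
  v(k),         & 0 \le k < N, \\
  1,            & N \le k < L, \\
  1 - v(k - L), & L \le k < L + N, \\
  0,            & \text{otherwise.}
\end{cases}
```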

The stated definition of w(k) ensures that the condition necessary for a meaningful overlap-add

[equation missing in source]

is satisfied.
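The condition itself is not reproduced in this text. For overlap-add with a write offset of L samples it is presumably that copies of the window spaced L apart sum to one, Σ_λ w(k - λL) = 1 for all k. The sketch below builds a window under that assumption and checks the sum numerically. The raised-cosine flank, the function names and the choice of a constant section of length L - N are illustrative assumptions, not taken from the patent.

```python
import math


def rising_flank(N):
    """Hypothetical v(k), k = 0..N-1: increasing, with v(0) slightly above 0
    and v(N-1) slightly below 1 (a raised cosine sampled at half-steps)."""
    return [0.5 - 0.5 * math.cos(math.pi * (k + 0.5) / N) for k in range(N)]


def window(L, N):
    """w(k) on k = 0..L+N-1: rising flank v, constant 1, falling flank 1 - v."""
    if not 0 < N <= L:
        raise ValueError("this sketch assumes 0 < N <= L")
    v = rising_flank(N)
    return v + [1.0] * (L - N) + [1.0 - vk for vk in v]


def overlap_sum(w, L, k):
    """Sum over lambda of w(k - lambda * L) for window copies spaced L apart."""
    total = 0.0
    lam_max = k // L
    # Only copies that start at most len(w) samples before k can contribute.
    for lam in range(lam_max - len(w) // L - 1, lam_max + 1):
        idx = k - lam * L
        if 0 <= idx < len(w):
            total += w[idx]
    return total


w = window(L=160, N=40)
worst = max(abs(overlap_sum(w, 160, k) - 1.0) for k in range(1000))
```

In this sketch `worst` stays at rounding-error level, because each falling flank is exactly one minus the rising flank it overlaps.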

For the synchronization mentioned above, the shift variable Δ_λ contained in the synthesis equation is to be determined from a "tolerance range" -Δ_max, ..., Δ_max.

The basic procedure is as follows: segments of length L+N are taken from the original signal x(k) at intervals of αL samples, regular apart from a synchronization-related "jitter", and, after weighting with w(k), are added up with an offset of L samples each. The signal y(k) obtained in this way is accelerated by the factor α relative to x(k); that is, an utterance K samples long in the original signal x(k) is mapped onto a section of y(k) of length K/α. It is thus shortened, and its playback accelerated, for α > 1, and lengthened, that is slowed down, for α < 1.
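The read and write pattern described above can be sketched without any synchronization, that is, with every shift set to zero. This only illustrates how reading every αL samples and writing every L samples maps K input samples onto roughly K/α output samples. The function names, the linear flank and the rounding of the read position are assumptions made for this example.

```python
import math


def linear_window(L, N):
    """Illustrative window: linear rising flank, constant 1, falling flank 1 - v."""
    v = [(k + 0.5) / N for k in range(N)]
    return v + [1.0] * (L - N) + [1.0 - vk for vk in v]


def time_scale_unsynchronized(x, alpha, L, N):
    """Overlap-add with all shifts equal to zero: read hop alpha*L, write hop L."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    w = linear_window(L, N)
    seg_len = L + N
    y = []
    lam = 0
    while True:
        start = int(round(lam * alpha * L))   # nominal read position in x
        if start + seg_len > len(x):
            break
        out = lam * L                         # write position in y
        if len(y) < out + seg_len:
            y.extend([0.0] * (out + seg_len - len(y)))
        for k in range(seg_len):
            y[out + k] += w[k] * x[start + k]
        lam += 1
    return y


x = [math.sin(0.05 * k) for k in range(2000)]
y_same = time_scale_unsynchronized(x, 1.0, L=100, N=50)
y_fast = time_scale_unsynchronized(x, 2.0, L=100, N=50)
y_slow = time_scale_unsynchronized(x, 0.5, L=100, N=50)
```

With α = 1 the interior of `y_same` reproduces `x`; `y_fast` is about half as long as `x` and `y_slow` about twice as long. Without synchronization, the output for α ≠ 1 generally contains audible discontinuities, which is what the shift search described next is meant to avoid.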

The synchronization of the sections to be overlapped is of great importance for the resulting sound quality. The following approach is used for this purpose: while the method is running, for each segment taken from the signal x(k), the section of x(k) offset by L samples can be regarded as the "ideal segment" for the next step, since with this choice the overlap-add operation would reproduce the original signal x(k). The desired time scaling, however, generally requires that a different section of x(k), offset from the "ideal segment", be selected for the overlap-add synthesis. The best possible synchronization is obtained when the section used for the overlap-add operation has the greatest possible similarity ("waveform similarity") to the "ideal segment".

Various measures are suitable as a criterion for the similarity of the segments mentioned. An obvious choice is, for example, the correlation coefficient. Whereas W. Verhelst and M. Roelands, in "An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech", in IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 554-557, 1993, and "Waveform Similarity Based Overlap-Add (WSOLA) for Time-Scale Modification of Speech: Structures and Evaluation", in Int. Conf. on Speech Communication and Technology, pp. 337-340, 1993, used the complete segment of length L+N to evaluate the similarity measure, it appears entirely sufficient to restrict the computation to the range of N samples in which the segments actually overlap.

For the following presentation, it is helpful to introduce the following vector notation: let x denote the N-value-long section of the "ideal segment" in which the overlap with the newly determined segment will take place, and x_q the first N values of the shifted segment. Weighting this section with the rising flank of the window is represented by multiplying this vector by a diagonal matrix V whose diagonal holds the values v(0), ..., v(N-1). Correspondingly, weighting the ideal-segment section x with the falling flank of the window is represented by multiplication by 1-V, where 1 denotes the N×N identity matrix. The section of y(k) resulting from the overlap-add synthesis in the critical overlap region is therefore

y = (1-V) x + V x_q
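Because V is diagonal, this matrix expression reduces to an element-by-element crossfade. A minimal sketch follows; the function name and the sample values are chosen only for illustration.

```python
def crossfade(x, x_q, v):
    """y = (1 - V) x + V x_q with V = diag(v), evaluated element by element."""
    if not (len(x) == len(x_q) == len(v)):
        raise ValueError("x, x_q and v must all have length N")
    return [(1.0 - vk) * xk + vk * xqk for xk, xqk, vk in zip(x, x_q, v)]


v = [0.25, 0.5, 0.75]            # rising flank values v(0), ..., v(N-1)
x = [1.0, 2.0, 3.0]              # overlap section of the "ideal segment"
x_q = [5.0, 6.0, 7.0]            # first N values of the shifted segment
y = crossfade(x, x_q, v)         # fades from x toward x_q
y_ideal = crossfade(x, x, v)     # choosing x_q = x returns x unchanged
```

The second call illustrates why the unshifted section is called the "ideal segment": if x_q equals x, the overlap region of the output equals the original signal.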

As a measure of the similarity of the components involved, a cross-correlation can now be computed, for example, according to

C_δ = x^T (1-V)^T V x_q

Maximizing this expression with respect to the shift δ ∈ {-Δ_max, ..., Δ_max}, which enters through x_q, yields the shift Δ_λ that is optimal for the segment under consideration in the sense of the chosen similarity measure.
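The search over δ can be sketched as follows. The weighted reference x^T (1-V)^T V is computed once, and each candidate shift then costs N multiply-adds. The function signature, the handling of candidates that fall outside the signal, and the test tone are assumptions made for this example.

```python
import math


def best_shift_by_correlation(x, ideal_start, nominal_start, v, delta_max):
    """Return the delta in -delta_max..delta_max that maximizes
    C_delta = x^T (1 - V)^T V x_q over the N-sample overlap region."""
    N = len(v)
    # Precompute x^T (1 - V)^T V once: elementwise x(k) * (1 - v(k)) * v(k).
    ref = [x[ideal_start + k] * (1.0 - v[k]) * v[k] for k in range(N)]
    best_delta, best_c = None, None
    for delta in range(-delta_max, delta_max + 1):
        start = nominal_start + delta
        if start < 0 or start + N > len(x):
            continue                      # candidate would leave the signal
        c = sum(ref[k] * x[start + k] for k in range(N))
        if best_c is None or c > best_c:
            best_delta, best_c = delta, c
    return best_delta


# Test tone with a period of 50 samples; the ideal section starts at 100,
# the nominal read position is 230 and the tolerance range is -24..24.
tone = [math.sin(2 * math.pi * k / 50) for k in range(600)]
flank = [(k + 0.5) / 100 for k in range(100)]
shift = best_shift_by_correlation(tone, 100, 230, flank, 24)
```

For this tone the search selects a shift of 20, which moves the read position to 250, a whole number of periods after the ideal section, so the two overlapping sections are in phase.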

Every L samples, computing the C_δ requires 2N multiplications to precompute the expression x^T (1-V)^T V, followed by (2Δ_max+1)N multiplications and additions.

Compared with W. Verhelst and M. Roelands in "An Overlap-Add Technique Based on Waveform Similarity (WSOLA) for High Quality Time-Scale Modification of Speech", in IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 554-557, 1993, and "Waveform Similarity Based Overlap-Add (WSOLA) for Time-Scale Modification of Speech: Structures and Evaluation", in Int. Conf. on Speech Communication and Technology, pp. 337-340, 1993, this represents a reduction in effort by a factor of two, which increases further for L > N. Restricting the similarity computation to the overlap region has no negative effect whatsoever on the quality of the time-scaled speech samples.

Another approach to synchronization is to minimize the error between the synthesized signal y and the original signal x instead of maximizing the waveform similarity. A simple, arbitrary choice for this error is the quadratic expression

E_δ = || x - y ||²

Neglecting the precomputations, the effort required to evaluate E_δ amounts to (2Δ_max+1)·4N DSP operations every L samples. These are understood to be operations that a signal processor of common architecture can execute in one step.
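A sketch of the error-minimizing search is given below. It uses the identity x - y = V (x - x_q), which follows directly from y = (1-V) x + V x_q. The function signature and the test tone are assumptions made for this example, and the loop is written for readability rather than to match the operation count stated above.

```python
import math


def best_shift_by_squared_error(x, ideal_start, nominal_start, v, delta_max):
    """Return the delta in -delta_max..delta_max that minimizes
    E_delta = ||x - y||^2 over the N-sample overlap region."""
    N = len(v)
    ref = x[ideal_start:ideal_start + N]  # overlap section of the ideal segment
    best_delta, best_e = None, None
    for delta in range(-delta_max, delta_max + 1):
        start = nominal_start + delta
        if start < 0 or start + N > len(x):
            continue                      # candidate would leave the signal
        # x - y = V (x - x_q), so the error is a v-weighted squared difference.
        e = sum((v[k] * (ref[k] - x[start + k])) ** 2 for k in range(N))
        if best_e is None or e < best_e:
            best_delta, best_e = delta, e
    return best_delta


# Same kind of test tone as before: period 50, ideal section at 100,
# nominal read position 230, tolerance range -24..24.
tone = [math.sin(2 * math.pi * k / 50) for k in range(600)]
flank = [(k + 0.5) / 100 for k in range(100)]
shift = best_shift_by_squared_error(tone, 100, 230, flank, 24)
```

On this periodic tone the error-based search picks the in-phase shift of 20, where the error vanishes up to rounding.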

A further approach is to minimize the relative error

[equation missing in source]

instead of the absolute error, which can be interpreted as SNR maximization. Here, (2Δ_max+1)·5N operations are required before each overlap-add operation.
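The expression for the relative error is not reproduced in this text. The sketch below assumes normalization by the energy of the synthesized section, that is ||x - y||² / ||y||². This is an assumption: it is chosen because a normalization by ||x||² would not depend on δ and would therefore select the same shift as the absolute error. The function signature and the test tone are likewise illustrative.

```python
import math


def best_shift_by_relative_error(x, ideal_start, nominal_start, v, delta_max):
    """Return the delta in -delta_max..delta_max that minimizes the assumed
    relative error ||x - y||^2 / ||y||^2 over the N-sample overlap region."""
    N = len(v)
    ref = x[ideal_start:ideal_start + N]
    best_delta, best_e = None, None
    for delta in range(-delta_max, delta_max + 1):
        start = nominal_start + delta
        if start < 0 or start + N > len(x):
            continue                      # candidate would leave the signal
        # Synthesized overlap section y = (1 - V) x + V x_q for this candidate.
        y = [(1.0 - v[k]) * ref[k] + v[k] * x[start + k] for k in range(N)]
        energy = sum(yk * yk for yk in y)
        if energy == 0.0:
            continue                      # relative error undefined for silence
        e = sum((ref[k] - y[k]) ** 2 for k in range(N)) / energy
        if best_e is None or e < best_e:
            best_delta, best_e = delta, e
    return best_delta


tone = [math.sin(2 * math.pi * k / 50) for k in range(600)]
flank = [(k + 0.5) / 100 for k in range(100)]
shift = best_shift_by_relative_error(tone, 100, 230, flank, 24)
```

On the periodic test tone this criterion again selects the in-phase shift of 20. On real speech the three criteria can differ, since the correlation and the relative error react differently to level changes between the compared sections.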

Claims

1. Method for speed modification of speech signals, in particular digitized speech signals, in which

- an analog speech signal is digitized, yielding a digitized speech signal that is stored in a memory,
- a factor α is defined by which the speech signal is lengthened or shortened,
- a window function is defined with a first, rising section of length N, a second, constant section of length L directly adjoining the first section, and a third, falling section directly adjoining the second section, such that when the first, rising section of one window is superimposed on the third, falling section of another window and the two sections are added in the overlap region, the result is one, which corresponds to the value of the second section of the window function,
- segments of length L + N are taken from the digitized, stored speech signal at irregular intervals of average length αL,
- these segments, taken from the digitized, stored speech signal, are weighted with the window function in the time domain,
- the weighted segments are added, each offset by a defined number of L samples, as a result of which the resulting speech signal is lengthened by the factor α or shortened by 1/α,

characterized

- that, successively at the points where segments are taken from the digitized speech signal, the segment taken there and weighted with the window function is compared for similarity with the subsequently taken segment, likewise weighted with the window function,
- that, for a fast comparison of the similarity of the segments, only the N-value-long third section of the segment, weighted with the falling window section, is compared with the N-value-long first section of the following segment, weighted with the rising window section,
- that these segments are added to each other with an offset when the similarity of the two compared segments is at its maximum, and
- that a correlation is used as the measure for calculating the similarity.

2. Method for speed modification of speech signals, in particular digitized speech signals, in which

characterized,

- that, at the points where segments are taken from the digitized speech signal, the segment taken there is compared with the result of the synthesis with the subsequently taken segment,
- that, for a fast comparison of the deviation of the respective synthesis result from the original signal, only the N-value-long third section of the last segment taken is used as a reference,
- that these segments are added to each other with an offset when the determined deviation is minimal, and
- that the relative error or the absolute squared error is used as the measure of the deviation.