The voiced and unvoiced amplitude in speech synthesis


Citation for published version (APA):

van Hemert, J. P. (1987). The voiced and unvoiced amplitude in speech synthesis. (IPO rapport; Vol. 595). Instituut voor Perceptie Onderzoek (IPO).

Document status and date: Published: 17/07/1987

Document Version:

Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)



Rapport no. 595


Contents.

Introduction

1. The old method

2. Problems with this method

3. The new method

3.1. The residual energy

3.2. Voiced excitation

3.3. Unvoiced excitation

4. Evaluation

Literature

Introduction.

In the IPO speech processing system (LVS), speech is synthesized using a source-filter model. The filter models the acoustic transfer function of the vocal tract and is calculated with LPC analysis (LPC: linear predictive coding). The source models the excitation of the vocal tract, which is excited by the vibration of the vocal cords in the voiced parts of the speech and by turbulence, caused by forcing air through strictures in the vocal tract, in the unvoiced parts. In the voiced parts of the speech the LPC filter is excited with a periodic pulse train in which each pulse has amplitude $A_v$. In the unvoiced regions the excitation consists of white noise that has a uniform distribution over the interval $-A_u/2$ to $A_u/2$. In this report the calculation of the voiced amplitude $A_v$ and the unvoiced amplitude $A_u$ will be discussed.

1. The old method.

Vogten (1983) describes a method in which the energy of the residual signal after linear prediction is calculated:

$$E = R(0) + \sum_{k=1}^{m} a(k)\,R(k) \qquad (1)$$

The idea is to give the filter excitation the same energy as the residual signal. During synthesis each pulse in the pulse train is given a height of:

$$A_v = \sqrt{E} \qquad (2)$$

The unvoiced amplitude was set proportional to the voiced amplitude, and the multiplication factor was determined empirically, using a perception criterion. By listening to the original and the resynthesis of a few sentences, the multiplication factor was adjusted until the unvoiced parts had approximately the proper amplitude. The unvoiced amplitude was set a factor of two smaller than the voiced amplitude:

$$A_u = A_v / 2 \qquad (3)$$
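As an illustration, the old method can be sketched in a few lines of Python. This is a minimal sketch and not the code of the LVS programs: the function names are hypothetical, and the plus sign in the residual energy assumes the convention in which the inverse filter is $1 + \sum_k a(k) z^{-k}$.

```python
import math

def autocorrelation(segment, max_lag):
    """R(k) for k = 0..max_lag of an (already windowed) signal segment."""
    n = len(segment)
    return [sum(segment[i] * segment[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]

def residual_energy(r, a):
    """Residual energy E = R(0) + sum over k = 1..m of a(k) R(k).

    Assumption: a[0] holds a(1), a[1] holds a(2), and so on, with the sign
    convention in which the inverse filter is 1 + sum a(k) z^-k.
    """
    return r[0] + sum(a_k * r[k] for k, a_k in enumerate(a, start=1))

def old_amplitudes(energy):
    """Old method: pulse height sqrt(E), noise amplitude half of that."""
    a_v = math.sqrt(energy)
    a_u = a_v / 2.0
    return a_v, a_u

# First-order example: the optimal a(1) is -R(1)/R(0).
r = [4.0, 2.0]
a = [-r[1] / r[0]]
print(residual_energy(r, a), old_amplitudes(residual_energy(r, a)))
```

Note that neither amplitude depends on the pitch period or on the length of the analysis window.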

2. Problems with this method.

This setting of the voiced and unvoiced amplitudes in the synthesis programs has functioned satisfactorily in most cases. However, when the programs were used outside the field for which the perception test was done, strange effects were audible. I want to mention two of these cases.

First, if the LVS excitation is locally, within a sentence, replaced by the residual signal, discontinuities occur in the total energy of the synthesized signal. These discontinuities are audible as "plocks" and indicate that the excitation does not have the same energy as the residual signal.

Second, if a sampling frequency much higher than 10 kHz is used, the unvoiced regions become too loud. Some utterances of a female voice were sampled at 20 kHz. In the resyntheses of these utterances extremely sharp [s] sounds occurred. They were audibly louder than the [s] sounds in the original speech, so the resynthesis was not a good representation of the original. The utterances were downsampled to 10 kHz, and the 10 kHz resynthesis did not have this effect. A possible cause is that in the 20 kHz version the pitch pulse has a duration of 50 microseconds and therefore puts less energy into the filter than in the 10 kHz version, where the sampling time is 100 microseconds. In the 20 kHz version this leads to voiced portions of the utterances that have a relatively low amplitude compared to the unvoiced parts.
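The size of this effect can be estimated with a little arithmetic. The sketch below is only an illustration under simplifying assumptions: the pulse height and the noise amplitude are held fixed, the pitch of 100 Hz is a hypothetical value, and the function names are not taken from the LVS programs. It uses two standard results: a train with one pulse of height A per period of T samples has mean square A²/T, and noise uniformly distributed over an interval of width A has mean square A²/12.

```python
import math

def pulse_train_mean_square(height, period_samples):
    """Mean square of a train with one pulse of `height` per period."""
    return height ** 2 / period_samples

def uniform_noise_mean_square(width):
    """Mean square of noise uniform over [-width/2, width/2]."""
    return width ** 2 / 12.0

def voiced_to_unvoiced_db(sample_rate_hz, pitch_hz=100.0, a_v=1.0, a_u=0.5):
    """Power ratio in dB of voiced to unvoiced excitation, amplitudes fixed.

    pitch_hz, a_v and a_u are hypothetical illustration values.
    """
    period_samples = sample_rate_hz / pitch_hz
    voiced = pulse_train_mean_square(a_v, period_samples)
    unvoiced = uniform_noise_mean_square(a_u)
    return 10.0 * math.log10(voiced / unvoiced)

for rate in (10_000, 20_000):
    print(rate, "Hz:", round(voiced_to_unvoiced_db(rate), 2), "dB")
```

Under these assumptions, doubling the sampling frequency doubles the number of samples per pitch period and lowers the voiced excitation by about 3 dB relative to the unvoiced excitation, which is in the direction of the effect described above.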

3. The new method.

We will derive the voiced and unvoiced amplitudes ($A_v$ and $A_u$) by matching the energy of the excitation with the residual energy.

3.1. The residual energy.

Equation 1 gives the total residual energy in the analysis window:

$$E = \sum_{n=1}^{N} W^2(n)\, e^2(n) \qquad (4)$$

We want to know the average squared magnitude of the residual signal and therefore have to divide by the effective number of samples in the window (N'). This does not equal the total number of samples, because the points near the sides of the window are weighted less heavily than the points in the middle.

$$N' = \sum_{n=1}^{N} W^2(n) \qquad (5)$$

The windowing function for a Hamming window is:

$$W(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N}\right) \qquad (6)$$

If the number of samples in the window is much larger than one, which is usually the case, the summation can be approximated by neglecting the cosine terms, which cancel during the summation, and keeping the constant term, which accumulates.

$$N' \approx 0.4\,N \qquad (7)$$

The average squared magnitude of the residual signal is now:

$$\overline{e^2} = E / N' \qquad (8)$$
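The effective window length can be checked numerically. The sketch below assumes that N' is the sum of the squared window values and that the Hamming window has the usual form 0.54 - 0.46 cos(2πn/N); the function names and the window length of 250 samples are hypothetical choices for illustration.

```python
import math

def hamming(n, length):
    """Hamming window value at sample n (assumed form, n = 1..length)."""
    return 0.54 - 0.46 * math.cos(2.0 * math.pi * n / length)

def effective_length(length):
    """Effective number of samples N': sum of the squared window values."""
    return sum(hamming(n, length) ** 2 for n in range(1, length + 1))

def mean_square_residual(energy, length):
    """Average squared residual magnitude, E / N'."""
    return energy / effective_length(length)

length = 250  # hypothetical window length in samples
print(effective_length(length) / length)
```

With this window the constant term that survives the summation is 0.54² + 0.46²/2 ≈ 0.397 per sample, so the effective length is roughly 0.4 times the window length.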

3.2. Voiced excitation.

Every $T_0$ samples there is one pulse with a height of $A_v$. The average squared magnitude of the voiced excitation is now:

$$\overline{x^2} = A_v^2 / T_0 \qquad (9)$$

We want the excitation to have the same energy as the residual signal.

$$A_v^2 / T_0 = E / N' \qquad (10)$$

The voiced amplitude is now:

$$A_v = \sqrt{\frac{E\,T_0}{N'}} \qquad (11)$$

Note that the pulse height is proportional to the square root of the number of samples in a pitch period. The variation of the pitch within a sentence is usually within one octave, so $T_0$ varies by at most a factor of two and the $T_0$ correction on $A_v$ is usually within 3 dB.
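The voiced amplitude and the size of the pitch correction can be sketched as follows. This is an illustration with hypothetical function names, not the code of the synthesis program; it assumes the pulse height sqrt(E·T0/N') that follows from equating the mean square of the pulse train, A_v²/T0, with E/N'.

```python
import math

def voiced_amplitude(energy, pitch_period, effective_length):
    """Pulse height sqrt(E * T0 / N') for a pitch period of T0 samples."""
    return math.sqrt(energy * pitch_period / effective_length)

def pitch_correction_db(period_from, period_to):
    """Change in pulse height (dB) when only the pitch period changes."""
    return 20.0 * math.log10(math.sqrt(period_to / period_from))

# One octave down in pitch doubles the pitch period.
print(voiced_amplitude(100.0, 100, 100.0))
print(round(pitch_correction_db(50, 100), 2), "dB")
```

A doubling of the pitch period raises the pulse height by a factor of sqrt(2), which is about 3 dB.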

3.3. Unvoiced excitation.

For every sample a random number x is generated that is uniformly distributed over the interval $[-A_u/2, A_u/2)$ (see Figure 1).

Figure 1: The probability density function P(x) of x.

The average squared magnitude of x can be calculated by integrating the probability density function:

$$\overline{x^2} = \int_{-A_u/2}^{A_u/2} x^2\, P(x)\, dx = A_u^2 / 12 \qquad (12)$$

By setting the excitation energy equal to the residual energy we get:

$$A_u^2 / 12 = E / N' \qquad (13)$$

or:

$$A_u = \sqrt{\frac{12\,E}{N'}} \qquad (14)$$
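The unvoiced amplitude can be checked with a small simulation. The sketch below assumes the relation A_u = sqrt(12·E/N'), which follows from the mean square A_u²/12 of uniformly distributed noise; the function names, the seed and the numerical values are hypothetical.

```python
import math
import random

def unvoiced_amplitude(energy, effective_length):
    """Width of the uniform noise interval: sqrt(12 * E / N')."""
    return math.sqrt(12.0 * energy / effective_length)

def uniform_noise(width, count, seed=0):
    """`count` samples drawn uniformly from [-width/2, width/2]."""
    rng = random.Random(seed)
    half = width / 2.0
    return [rng.uniform(-half, half) for _ in range(count)]

def mean_square(samples):
    """Average squared magnitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)

energy, n_eff = 3.0, 100.0            # hypothetical E and N'
a_u = unvoiced_amplitude(energy, n_eff)
noise = uniform_noise(a_u, 100_000)
print(a_u, mean_square(noise), energy / n_eff)
```

The measured mean square of the generated noise should lie close to E/N', the average squared magnitude of the residual signal.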

4. Evaluation.

The speech synthesis program SYN has been modified according to Equations 11 and 14. Sixteen sentences (eight spoken by a male and eight by a female speaker) were sampled at 10 kHz, analysed and synthesized with both the old and the new synthesis method.

Differences between the two were hardly audible. The unvoiced amplitude was slightly lower in the new version. The variation of the voiced amplitude with the pitch was not audible at all.

This means that in traditional applications the users of the [...] old method. However, when a higher sampling frequency is used, or when the residual signal is used to excite the filter in some parts of the speech, a substantial improvement can be heard.
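The relation between the two methods can also be explored numerically. The sketch below is not a reproduction of the listening test: the window length of 250 samples and the pitch period of 100 samples are hypothetical values for a 10 kHz recording, the Hamming factor 0.54² + 0.46²/2 is the assumed ratio N'/N, and the function names are made up for the example.

```python
import math

# Assumed ratio N'/N for a Hamming window (about 0.4).
HAMMING_FACTOR = 0.54 ** 2 + 0.46 ** 2 / 2.0

def old_method(energy):
    """Old amplitudes: A_v = sqrt(E), A_u = A_v / 2."""
    a_v = math.sqrt(energy)
    return a_v, a_v / 2.0

def new_method(energy, pitch_period, window_length):
    """New amplitudes: A_v = sqrt(E*T0/N'), A_u = sqrt(12*E/N')."""
    n_eff = HAMMING_FACTOR * window_length
    a_v = math.sqrt(energy * pitch_period / n_eff)
    a_u = math.sqrt(12.0 * energy / n_eff)
    return a_v, a_u

def to_db(ratio):
    """Amplitude ratio expressed in dB."""
    return 20.0 * math.log10(ratio)

old_v, old_u = old_method(1.0)
new_v, new_u = new_method(1.0, pitch_period=100, window_length=250)
print("voiced change:", round(to_db(new_v / old_v), 2), "dB")
print("unvoiced change:", round(to_db(new_u / old_u), 2), "dB")
```

With these hypothetical values the voiced amplitude is nearly unchanged and the unvoiced amplitude comes out lower than in the old method. Other window lengths and pitch periods give other numbers, because the old amplitudes do not depend on N' or on the pitch period.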

Literature

Vogten, L.L.M. (1983). Analyse, zuinige codering en resynthese van spraakgeluid [Analysis, economical coding and resynthesis of speech sound]. Proefschrift (doctoral thesis), Technische Hogeschool Eindhoven.

Symbol table

$A_u$: unvoiced amplitude

$A_v$: voiced amplitude

a(k): k-th filter parameter

E: total energy of the residual signal in the analysis window

e(n): residual signal

m: order of the LPC analysis

N: number of samples in the analysis window

N': effective number of samples in the analysis window

n: sample number

P(x): probability density function of x

R(k): the k-th autocorrelation

$T_0$: pitch period in samples

W(n): windowing function

x(n): excitation signal