Linear Predictive Modeling and Sparse Approximation of Audio Signals

Academic year: 2021

Toon van Waterschoot

Dept. ESAT-SCD, KU Leuven, Belgium

Statistics Seminar

(2)

Outline

Introduction

Recap: Linear prediction of speech signals

Linear prediction of audio signals [modeling]

–  fundamental issues

–  alternative models

–  performance evaluation

Sparse linear prediction of audio signals [estimation]

(4)

Introduction (1): Motivation

Linear prediction (LP) provides a simple yet powerful and widely used model for speech signals

Musical audio signals are typically represented using

–  non-parametric (transform-domain) models (MP3, AAC, ...)

–  parametric (transients + sinusoids + noise) models

Envisaged applications:

–  coding: represent, store, and transmit signals using relatively small number of parameters

–  analysis: summarize characteristic signal behavior in low-dimensional parameter space

–  synthesis: generate synthetic signals from limited number of parameters

–  whitening: invertible parametric signal models can be useful for signal whitening

We claim that LP modeling of audio signals has high potential, provided that some fundamental issues are dealt with

(5)

Introduction (2): LP models

All-pole / Autoregressive (AR) models & Pole-Zero / Autoregressive Moving Average (ARMA) models:

–  linear prediction interpretation: prediction of current signal sample based on past signal samples and (moving average of) excitation signal

–  source-filter interpretation: modeling signal as output of linear all-pole/pole-zero filter driven by excitation (source) signal
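For reference, a standard textbook form of the two interpretations is sketched below. The symbols y(t) (signal), e(t) (excitation), a_k and b_k (coefficients), the orders P and Q, and the sign convention are assumptions made here, not notation taken from the slide. The first two lines give the linear prediction interpretation, the last two the source-filter interpretation.

```latex
\begin{align}
\text{AR:}\quad   & y(t) = -\sum_{k=1}^{P} a_k\, y(t-k) + e(t) \\
\text{ARMA:}\quad & y(t) = -\sum_{k=1}^{P} a_k\, y(t-k) + e(t) + \sum_{k=1}^{Q} b_k\, e(t-k) \\
\text{AR:}\quad   & Y(z) = \frac{1}{A(z)}\,E(z), \qquad A(z) = 1 + \sum_{k=1}^{P} a_k z^{-k} \\
\text{ARMA:}\quad & Y(z) = \frac{B(z)}{A(z)}\,E(z), \qquad B(z) = 1 + \sum_{k=1}^{Q} b_k z^{-k}
\end{align}
```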

(7)

Linear prediction of speech signals (1): Source-filter model of speech production

Correspondence between all-pole source-filter model and human speech production system:

–  glottal air flow represented as broadband noise signal (white noise excitation for unvoiced speech)

–  vocal cords shape glottal air flow into periodic signal (impulse train excitation for voiced speech)

–  vocal tract behaves as time-varying all-pole filter (spectral shaping filter applied to speech excitation signal)

[Figure: source-filter block diagram. Impulse train excitation and white noise excitation feed an unvoiced/voiced switch, whose output drives a time-varying all-pole filter.]
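The source-filter model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the pitch lag of 80 samples, and the single-resonance filter are hypothetical choices, and the filter is held fixed over the frame rather than time-varying.

```python
import numpy as np

def make_excitation(n, voiced, pitch_lag=80, seed=0):
    """Impulse train (voiced) or white noise (unvoiced) of length n."""
    if voiced:
        e = np.zeros(n)
        e[::pitch_lag] = 1.0          # one impulse per pitch period
        return e
    return np.random.default_rng(seed).standard_normal(n)

def all_pole_filter(e, a):
    """Filter e through 1/A(z), A(z) = 1 + a[0] z^-1 + ... + a[P-1] z^-P."""
    y = np.zeros(len(e))
    for t in range(len(e)):
        past = sum(a[k] * y[t - 1 - k] for k in range(len(a)) if t - 1 - k >= 0)
        y[t] = e[t] - past
    return y

# Hypothetical "vocal tract": one resonance (pole pair) at radius r, angle theta.
r, theta = 0.95, 0.3
a = [-2.0 * r * np.cos(theta), r ** 2]

voiced = all_pole_filter(make_excitation(400, voiced=True), a)
unvoiced = all_pole_filter(make_excitation(400, voiced=False), a)
```

Switching between the two excitations while keeping the same filter mirrors the unvoiced/voiced switch in the block diagram.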

(8)

Linear prediction of speech signals (2): Practical issues

User choices:

–  length of speech signal segment = compromise between

•  accurate estimation of autocorrelation function (favors long segments)

•  speech stationarity throughout signal segment (favors short segments)

rule of thumb: ~ 20 ms

–  model order = compromise between

•  model accuracy (favors high order)

•  model complexity (favors low order)

[Figure: DFT spectrum and AR spectra for p = 10, p = 20, and p = 50; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]
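To make the two user choices concrete, the following sketch estimates an all-pole model with the autocorrelation method for a given segment and model order. The function name, the synthetic AR(2) test signal, and its coefficient values are assumptions for illustration. The normal equations are solved directly for brevity, where a Levinson-Durbin recursion would normally be used.

```python
import numpy as np

def lp_autocorrelation(x, order):
    """Autocorrelation-method LP.
    Returns (a, err): A(z) = 1 + a[0] z^-1 + ... + a[order-1] z^-order
    and the minimum prediction error energy."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Raw (unnormalized) lag products r[0..order] over the segment.
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = -r[1:].
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    err = r[0] + np.dot(a, r[1:])
    return a, err

# Synthetic segment: AR(2) process with known (hypothetical) coefficients.
rng = np.random.default_rng(0)
a_true = np.array([-1.5, 0.81])
e = rng.standard_normal(4000)
x = np.zeros(4000)
x[:2] = e[:2]
for t in range(2, 4000):
    x[t] = e[t] - a_true[0] * x[t - 1] - a_true[1] * x[t - 2]

a_hat, err = lp_autocorrelation(x, order=2)
```

A shorter segment makes the lag products noisier, and a higher order enlarges the system to solve, which is the compromise listed above.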

(9)

Linear prediction of speech signals (3): Pitch prediction

Voiced speech:

–  periodicity of voiced speech (originating at vocal cords) cannot be modeled using (low-order) all-pole model (all-pole model only represents signal autocorrelation up to a lag equal to the model order)

–  cascade of two all-pole models (formant predictor + pitch predictor)

Pitch predictor:

–  pitch lag: number of samples in one pitch period

–  comb filter behavior

–  improvements:

•  interpolation filter

•  multi-tap pitch predictor

[Figure: DFT spectrum and AR spectra for p = 10 and p = 20; magnitude (dB) vs. normalized frequency (rad)]

[Figure: DFT spectrum, formant predictor (FP) spectrum for P = 10, pitch predictor (PP) spectrum, and cascaded FP+PP spectrum for P = 10; magnitude (dB) vs. normalized frequency (rad)]
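A one-tap pitch predictor can be sketched as a search over candidate lags on the formant-prediction residual. The function name, the lag search range, and the impulse-train test signal below are illustrative assumptions, and the lag-selection rule (largest squared-error reduction at the optimal gain) is one common choice rather than the method used on the slide.

```python
import numpy as np

def one_tap_pitch_predictor(e, min_lag, max_lag):
    """Pick the lag K and gain g for the predictor e[t] ~ g * e[t-K]."""
    e = np.asarray(e, dtype=float)
    best_lag, best_gain, best_score = None, 0.0, -np.inf
    for lag in range(min_lag, max_lag + 1):
        cur, past = e[lag:], e[:-lag]
        energy = np.dot(past, past)
        if energy == 0.0:
            continue
        corr = np.dot(cur, past)
        # Squared-error reduction obtained with the optimal gain at this lag.
        score = corr * corr / energy
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain

# Idealized voiced residual: impulse train with a period of 50 samples.
e = np.zeros(400)
e[::50] = 1.0
lag, gain = one_tap_pitch_predictor(e, 20, 120)
```

The resulting prediction error filter 1 - g z^-K has K equally spaced zeros, which is the comb filter behavior noted above.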

(10)

Linear prediction of audio signals*

* Toon van Waterschoot and Marc Moonen, "Comparison of linear prediction models for audio signals", EURASIP J. Audio, Speech, Music Process., vol. 2008, Article ID 706935, 24 p., 2008.

(11)

Linear prediction of audio signals (1): Fundamental issues (1)

[Figure: waveform (amplitude vs. time in s) with ATTACK, DECAY, SUSTAIN, and RELEASE phases marked]

[Figure: magnitude spectrum (dB) vs. frequency (Hz)]

Difference between speech and audio signals:

–  temporal structure: attack-decay-sustain-release (stationarity interval ranges from a few milliseconds to a few seconds)

–  spectral structure: tonality (damped sinusoidal components) and timbre (spectral envelope)

(12)

Linear prediction of audio signals (2): Fundamental issues (2)

Sinusoidal signals:

–  sum of sinusoids admits exact AR representation with two poles per real sinusoid

–  prediction error filter (PEF): zeros on the unit circle at the sinusoid frequencies

[Figure: DFT spectrum and AR spectrum for P = 30; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]
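The AR representation of a sum of sinusoids can be checked numerically: each real sinusoid at normalized frequency w obeys y(t) = 2cos(w) y(t-1) - y(t-2), so a cascade of two-zero sections annihilates the sum. The frequencies, phases, and function name below are arbitrary illustrative choices.

```python
import numpy as np

def sinusoid_pef(freqs):
    """PEF A(z) annihilating a sum of real sinusoids at the given normalized
    frequencies (rad/sample): product of 1 - 2cos(w) z^-1 + z^-2 sections."""
    a = np.array([1.0])
    for w in freqs:
        a = np.convolve(a, [1.0, -2.0 * np.cos(w), 1.0])
    return a

freqs = [0.2, 0.5, 1.1]
t = np.arange(200)
y = sum(np.cos(w * t + 0.3 * i) for i, w in enumerate(freqs))

a = sinusoid_pef(freqs)
# Keep only outputs where the filter fully overlaps the signal.
residual = np.convolve(y, a)[len(a) - 1:len(y)]
```

The filter has 2 coefficients per sinusoid beyond the leading 1, and the residual is zero up to rounding error.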

(13)

Linear prediction of audio signals (3): Fundamental issues (3)

Sinusoidal signals with additive broadband noise:

–  pole radii decrease: poles are drawn to the origin

–  pole angles increase: poles tend to uniform angular distribution

[Figure: DFT spectrum and AR spectrum for P = 30; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]

audio signals are generally sampled at very high rate → distribution of dominating tonal components ≠ uniform

(14)

Linear prediction of audio signals (4): Fundamental issues (4)

Poles tend to uniform angle distribution: Intuition

–  in noiseless case, PEF behaves as cascade of notch filters

–  two-zero notch filter at low frequency exhibits high-frequency boost

–  cascade of two-zero notch filters exhibits very high boost

–  additive broadband noise component in PEF output is amplified in high-frequency region if majority of notch filters are centered in low-frequency region

→ contradicts the PEF output whitening property of LP

[Figure: PEF response; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]
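The size of the high-frequency boost can be worked out directly. At the Nyquist frequency, a two-zero notch 1 - 2cos(w_k) z^-1 + z^-2 has gain 2 + 2cos(w_k), which approaches 4 (about 12 dB) for a low-frequency notch. The sketch below evaluates this for one notch and for a cascade; the notch frequencies and the count of 15 are hypothetical.

```python
import numpy as np

def notch_cascade_gain_db(notch_freqs, w):
    """Magnitude (dB) at frequency w of a cascade of two-zero notch filters
    1 - 2cos(w_k) z^-1 + z^-2 centered at notch_freqs (rad/sample)."""
    z_inv = np.exp(-1j * w)
    gain = 1.0
    for wk in notch_freqs:
        gain *= abs(1.0 - 2.0 * np.cos(wk) * z_inv + z_inv ** 2)
    return 20.0 * np.log10(gain)

# One low-frequency notch, evaluated at the Nyquist frequency.
single = notch_cascade_gain_db([0.1], np.pi)
# Fifteen notches crowded into the low-frequency region.
cascade = notch_cascade_gain_db(np.linspace(0.05, 0.5, 15), np.pi)
```

The per-notch boosts add in dB, so any broadband noise passing through such a cascade is strongly amplified at high frequencies.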

(15)

Linear prediction of audio signals (5): Fundamental issues (5)

Poles tend to uniform angle distribution: Proof

–  LP model parameter estimation criterion:

–  can be shown by considering
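As a reference point for the proof outline, the usual least-squares form of the LP estimation criterion and its frequency-domain equivalent for a stationary signal can be written as follows. The notation (y(t), e(t), a_k, P, A(z), and the power spectral density S_y) is assumed here rather than taken from the slide, and only the criterion is sketched, not the proof.

```latex
\begin{align}
\hat{\mathbf{a}} &= \arg\min_{\mathbf{a}} \sum_{t} e^2(t),
  \qquad e(t) = y(t) + \sum_{k=1}^{P} a_k\, y(t-k) \\
E\{e^2(t)\} &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \bigl|A(e^{j\omega})\bigr|^2 S_y(\omega)\, d\omega,
  \qquad A(z) = 1 + \sum_{k=1}^{P} a_k z^{-k}
\end{align}
```

Read in the frequency domain, the criterion rewards PEF attenuation wherever the signal spectrum is large, while any gain the PEF exhibits elsewhere is penalized in proportion to the signal power there.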

(16)

Linear prediction of audio signals (6): Alternative models (1)

LP models different from low-order all-pole model

–  pole-zero model (PZLP)

–  high-order all-pole model (HOLP)

–  pitch prediction model (PLP)

Low-order all-pole model of transformed audio signal

–  warped all-pole model (WLP)

–  selective all-pole model (SLP)

Example: monophonic audio signal segment

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]

[Figure: DFT spectrum and AR spectrum for P = 30; magnitude (dB) vs. normalized frequency (rad)]

(17)

Linear prediction of audio signals (7): Alternative models (2)

1. Pole-zero model (PZLP):

–  sum of sinusoids + white noise admits ARMA representation with equal AR and MA orders (two poles and two zeros per real sinusoid)

[Figure: DFT spectrum and ARMA spectrum for P = 30; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]

(18)

Linear prediction of audio signals (8): Alternative models (3)

2. High-order all-pole models (HOLP):

–  high-order all-pole model can closely approximate pole-zero model

–  compression still feasible by exploiting sparsity of all-pole model parameter vector (sparse linear prediction, see later)

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]

[Figure: DFT spectrum and AR spectrum for P = 1024; magnitude (dB) vs. normalized frequency (rad)]

(19)

Linear prediction of audio signals (9): Alternative models (4)

3. Pitch prediction model (PLP):

–  comparable to cascade formant + pitch predictor for voiced speech

–  1st all-pole model = periodicity, 2nd all-pole model = spectral envelope

–  only appropriate for modeling monophonic signals (harmonicity)

[Figure: DFT spectrum and AR+PP spectrum for P = 30; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]

(20)

Linear prediction of audio signals (10): Alternative models (5)

4. Warped all-pole model (WLP):

–  frequency warping = non-uniform transformation of frequency axis

–  pole angle distribution: uniform (warped domain) → logarithmic (linear domain)

–  model order: finite (warped domain) → infinite (linear domain)

[Figure: DFT spectrum and WAR (warped AR) spectrum for P = 30; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]
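A common way to realize frequency warping, assumed here since the slide does not state the construction, is to replace each unit delay by the first-order all-pass (z^-1 - lam)/(1 - lam z^-1). The sketch below evaluates the resulting frequency mapping; the function name and the value lam = 0.75 are illustrative choices.

```python
import numpy as np

def warp_frequency(w, lam):
    """Frequency mapping induced by the first-order all-pass
    (z^-1 - lam) / (1 - lam z^-1), for w in rad/sample."""
    w = np.asarray(w, dtype=float)
    return w + 2.0 * np.arctan2(lam * np.sin(w), 1.0 - lam * np.cos(w))

w = np.linspace(0.0, np.pi, 9)
w_warped = warp_frequency(w, 0.75)   # lam = 0.75 is an illustrative value
```

For positive lam the mapping stretches the low-frequency region, so poles spread uniformly over the warped axis end up concentrated at low frequencies on the linear axis.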

(21)

Linear prediction of audio signals (11): Alternative models (6)

5. Selective all-pole model (SLP):

–  frequency zooming = uniform transformation of frequency axis

–  downsampling → conventional LP → upsampling

[Figure: DFT spectrum and SAR (selective AR) spectrum for P = 30; magnitude (dB) vs. normalized frequency (rad)]

[Figure: pole-zero plot in the z-plane (real part vs. imaginary part)]
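The zooming effect of the downsampling step can be illustrated on a single low-frequency sinusoid: after downsampling by a factor D, the pole angle found by LP is D times larger, so closely spaced low-frequency components are pulled apart. The helper name, the frequency w0, and the factor D are hypothetical. Plain sub-sampling is used only because the test signal lies far below the new Nyquist frequency; a real signal would need an anti-aliasing filter first.

```python
import numpy as np

def ar2_pole_angle(x):
    """Least-squares AR(2) fit x[t] ~ c0 x[t-1] + c1 x[t-2];
    returns the pole angle in rad/sample."""
    X = np.column_stack([x[1:-1], x[:-2]])
    c = np.linalg.lstsq(X, x[2:], rcond=None)[0]
    poles = np.roots([1.0, -c[0], -c[1]])
    return abs(np.angle(poles[0]))

w0, D = 0.05, 8                        # illustrative frequency and zoom factor
x = np.cos(w0 * np.arange(2000))

angle_full = ar2_pole_angle(x)         # close to w0
angle_zoom = ar2_pole_angle(x[::D])    # close to D * w0
```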

(22)

Linear prediction of audio signals (12): Performance evaluation (1)

Mean square frequency estimation error (MSFE) vs. SNR

[Figure: MSFE (dB) vs. SNR (dB) for LPCOV, LPAUTO, PZLP, HOLP, PLP, WLP, and SLP; axis annotations: more tonal behavior, higher frequency accuracy]

(23)

Linear prediction of audio signals (13): Performance evaluation (2)

Spectral flatness measure (SFM) of LP residual

[Figure: SFM (dB) vs. SNR (dB) for LPCOV, LPAUTO, PZLP, HOLP, PLP, WLP, and SLP; axis annotations: more tonal behavior, higher spectral flatness]
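The spectral flatness measure is commonly defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum. The sketch below assumes that definition and a plain FFT-based spectrum estimate; the function name and the small floor that guards the logarithm are illustrative choices.

```python
import numpy as np

def spectral_flatness_db(x):
    """SFM in dB: geometric mean over arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2
    power = np.maximum(power, 1e-20)           # guard against log(0)
    geometric = np.exp(np.mean(np.log(power)))
    return 10.0 * np.log10(geometric / np.mean(power))

n = np.arange(256)
impulse = np.zeros(256)
impulse[0] = 1.0                               # perfectly flat spectrum
tone = np.cos(2.0 * np.pi * 16 * n / 256)      # single spectral line

sfm_impulse = spectral_flatness_db(impulse)
sfm_tone = spectral_flatness_db(tone)
```

A perfectly white residual gives 0 dB, and any remaining tonal structure pushes the value below zero.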

(24)

Linear prediction of audio signals (14): Performance evaluation (3)

Interpeak dip depth (IDD):

–  quantifies perceptual separability of spectral peaks in LP model representing sum of two sinusoids separated by twice the ERB

[Figure: IDD (dB) vs. f1 (Hz) for LPCOV, LPAUTO, PZLP, HOLP, WLP, and SLP; axis annotations: higher frequency, better perceptual separability]

(25)

Sparse linear prediction of audio signals*

* Daniele Giacobello, Toon van Waterschoot, Mads Græsbøll Christensen, Søren Holdt Jensen, and Marc Moonen, "High-order sparse linear predictors for audio processing", in Proc. 18th European

(26)

Sparse linear prediction of audio signals (1)

Motivation:

–  the high-order all-pole model (HOLP) performs very well at representing audio signals with varying degrees of tonality

–  harmonic relations between dominating tonal components induce sparsity and structure in HOLP parameter vector

High-order sparse linear prediction (HOSpLP):

–  l1-regularized estimation problem with high-order parameter vector

–  influence of the regularization parameter:

•  related to prior knowledge on parameter vector (MAP approach) or how sparse the predictor should be

•  similar to Tikhonov regularization ( has highly correlated rows,
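A minimal sketch of an l1-regularized LP estimate is given below. It rests on several assumptions: the cost function is taken to be the LASSO form 0.5*||x - X a||_2^2 + gamma*||a||_1, the predictor is written as x(t) ~ sum_k a_k x(t-k), and the solver is plain ISTA (iterative soft thresholding), chosen only as a simple first-order method. The names, the value of gamma, and the synthetic test signal are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise shrinkage toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_lp_ista(x, order, gamma, n_iter=500):
    """Minimize 0.5*||target - X a||_2^2 + gamma*||a||_1 by ISTA, where
    column k of X holds the signal delayed by k samples (k = 1..order)."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[order - k:len(x) - k] for k in range(1, order + 1)])
    target = x[order:]
    step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of gradient
    a = np.zeros(order)
    for _ in range(n_iter):
        grad = X.T @ (X @ a - target)
        a = soft_threshold(a - step * grad, step * gamma)
    return a

# Synthetic signal with one long-lag dependency: x[t] = 0.5 x[t-40] + noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)
x = noise.copy()
for t in range(40, 2000):
    x[t] = 0.5 * x[t - 40] + noise[t]

a = sparse_lp_ista(x, order=60, gamma=150.0)
dominant_tap = int(np.argmax(np.abs(a))) + 1   # lag of the largest coefficient
```

Larger values of gamma zero out more coefficients, at the cost of shrinking the ones that remain.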

(27)

Sparse linear prediction of audio signals (2)

Example 1: Monophonic audio signal

(28)

Sparse linear prediction of audio signals (3)

Example 1: Monophonic audio signal

–  comparison of HOLP and HOSpLP signal spectrum estimates

[Figure: signal spectrum X with HOLP and HOSpLP spectrum estimates and 4th-order envelope (ENV); magnitude vs. frequency (Hz)]

(29)

Sparse linear prediction of audio signals (4)

Example 2: Polyphonic audio signal

(30)

Sparse linear prediction of audio signals (5)

Example 2: Polyphonic audio signal

–  comparison of HOLP and HOSpLP signal spectrum estimates

[Figure: signal spectrum X with HOLP and HOSpLP spectrum estimates and 12th-order envelope (ENV); magnitude vs. frequency (Hz)]

(31)

Conclusion

Why has LP not been as popular for audio as for speech?

–  audio contains strong tonal (near-sinusoidal) components

–  dominating tonal components located in low-frequency region

–  PEF zeros tend to uniform angular distribution

Alternative LP models do provide accurate representation

–  LP models different from low-order all-pole model

–  low-order all-pole model of transformed audio signal

Sparse approximation leads to efficient representation

–  HOLP model provides most accurate audio signal representation

–  harmonic relations induce sparsity and structure in HOLP model

–  HOSpLP model results from solving l1-regularized estimation problem

Ongoing work:

–  efficient first-order numerical optimization algorithms for HOSpLP
