Linear Predictive Modeling and Sparse Approximation of Audio Signals

(1)

Toon van Waterschoot

Dept. ESAT-SCD, KU Leuven, Belgium

Statistics Seminar

(2)

Outline

• Introduction

• Recap: Linear prediction of speech signals

• Linear prediction of audio signals [modeling]

–  fundamental issues

–  alternative models

–  performance evaluation

• Sparse linear prediction of audio signals [estimation]

(4)

Introduction (1): Motivation

• Linear prediction (LP) provides a simple yet powerful and widely used model for speech signals

• Musical audio signals are typically represented using

–  non-parametric (transform-domain) models (MP3, AAC, ...)

–  parametric (transients + sinusoids + noise) models

• Envisaged applications:

–  coding: represent, store, and transmit signals using relatively small number of parameters

–  analysis: summarize characteristic signal behavior in low-dimensional parameter space

–  synthesis: generate synthetic signals from limited number of parameters

–  whitening: invertible parametric signal models can be useful for signal whitening

We claim that LP modeling of audio signals has high potential, provided that some fundamental issues are dealt with

(5)

Introduction (2): LP models

• All-pole / Autoregressive (AR) models & Pole-Zero / Autoregressive Moving Average (ARMA) models:

–  linear prediction interpretation: prediction of current signal sample based on past signal samples and (moving average of) excitation signal

AR: ARMA:

–  source-filter interpretation: modeling signal as output of linear all-pole/pole-zero filter driven by excitation (source) signal

AR:
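For reference, one common way of writing the two interpretations is sketched below. The notation is an assumption and may differ from the author's: y(t) is the signal, e(t) the excitation, a_k and b_k the model coefficients, P and Q the orders, and the sign convention A(z) = 1 + sum of a_k z^{-k} is only one of several in use.

```latex
% Linear prediction interpretation (difference equations)
\text{AR:}\quad   y(t) = -\sum_{k=1}^{P} a_k\, y(t-k) + e(t)
\qquad
\text{ARMA:}\quad y(t) = -\sum_{k=1}^{P} a_k\, y(t-k) + e(t) + \sum_{k=1}^{Q} b_k\, e(t-k)

% Source-filter interpretation (transfer functions)
\text{AR:}\quad   Y(z) = \frac{1}{A(z)}\, E(z)
\qquad
\text{ARMA:}\quad Y(z) = \frac{B(z)}{A(z)}\, E(z)

A(z) = 1 + \sum_{k=1}^{P} a_k z^{-k}, \qquad
B(z) = 1 + \sum_{k=1}^{Q} b_k z^{-k}
```

Under this convention the prediction error filter is A(z) itself: filtering y(t) with A(z) returns the excitation (AR case).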

(7)

Linear prediction of speech signals (1):

Source-filter model of speech production

• Correspondence between all-pole source-filter model

and human speech production system:

–  glottal air flow represented as broadband noise signal

(white noise excitation for unvoiced speech)

–  vocal cords shape glottal air flow into periodic signal

(impulse train excitation for voiced speech)

–  vocal tract behaves as time-varying all-pole filter

(spectral shaping filter applied to speech excitation signal)

[Figure: source-filter block diagram: impulse train excitation and white noise excitation, selected by a voiced/unvoiced switch, drive a time-varying all-pole filter]
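The source-filter correspondence can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the helper names, the frame length, the pitch period, and the second-order resonance are hypothetical choices, and the filter is held fixed over the frame, whereas the vocal tract filter varies over time.

```python
import numpy as np

def all_pole_filter(a, excitation):
    """Filter `excitation` through 1/A(z), A(z) = 1 + a[1] z^-1 + ... (a[0] must be 1)."""
    y = np.zeros(len(excitation))
    for t in range(len(excitation)):
        acc = excitation[t]
        # Subtract the contribution of past output samples
        for k in range(1, min(len(a), t + 1)):
            acc -= a[k] * y[t - k]
        y[t] = acc
    return y

def make_excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        e = np.zeros(n_samples)
        e[::pitch_period] = 1.0
        return e
    return np.random.default_rng(seed).standard_normal(n_samples)

# Hypothetical stable resonance: pole pair at radius 0.95, angles +-0.3 rad
poles = 0.95 * np.exp(1j * np.array([0.3, -0.3]))
a = np.poly(poles).real

voiced_frame = all_pole_filter(a, make_excitation(400, voiced=True))
unvoiced_frame = all_pole_filter(a, make_excitation(400, voiced=False))
```

Filtering either frame with A(z), for example `np.convolve(a, voiced_frame)[:400]`, recovers the excitation, which is the whitening use of the same model.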

(8)

Linear prediction of speech signals (2):

Practical issues

• User choices:

–  length of speech signal segment = compromise between

•  accurate estimation of autocorrelation function (favors longer segments)

•  speech stationarity throughout signal segment (favors shorter segments)

rule of thumb: ~ 20 ms (e.g., at kHz)

–  model order = compromise between

•  model accuracy (favors higher order)

•  model complexity (favors lower order)

rule of thumb:

[Figure: magnitude (dB) vs. normalized frequency (rad): DFT spectrum and AR spectra for p = 10, p = 20, p = 50; pole-zero plot (real part vs. imaginary part)]
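The role of the autocorrelation function in these choices can be made concrete with a minimal sketch of the autocorrelation method, assuming the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. The function name, the synthetic AR(2) test signal, and its length are hypothetical; a real speech frame would be much shorter (around 20 ms) and usually windowed first.

```python
import numpy as np

def lp_autocorrelation(x, order):
    """Estimate [1, a_1, ..., a_p] by solving the normal (Yule-Walker) equations."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Biased autocorrelation estimate for lags 0..order
    r = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)]) / n
    # Toeplitz system R a = -r[1:], from minimizing the mean squared prediction error
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

# Synthetic AR(2) signal with known coefficients (hypothetical test case)
rng = np.random.default_rng(0)
true_a = np.array([1.0, -1.5, 0.9])
x = rng.standard_normal(4000)  # excitation, overwritten in place by the AR signal
for t in range(2, len(x)):
    x[t] -= true_a[1] * x[t - 1] + true_a[2] * x[t - 2]

est_a = lp_autocorrelation(x, order=2)
```

With fewer samples the autocorrelation estimate, and therefore `est_a`, becomes noisier, which is the segment-length trade-off listed above.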

(9)

Linear prediction of speech signals (3):

Pitch prediction

• Voiced speech:

–  periodicity of voiced speech (originating at vocal cords) cannot be modeled using (low-order) all-pole model (all-pole model only represents signal autocorrelation up to a lag equal to the model order)

–  cascade of two all-pole models (formant predictor + pitch predictor)

• Pitch predictor:

–  pitch lag:

–  comb filter behavior

–  improvements:

•  interpolation filter

•  multi-tap pitch predictor

[Figure: two magnitude (dB) spectrum plots: (1) DFT spectrum with AR spectra for p = 10 and p = 20; (2) DFT spectrum with formant predictor (FP) spectrum P = 10, pitch predictor (PP) spectrum, and FP+PP spectrum P = 10]
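A one-tap pitch predictor can be sketched as below, assuming it operates on the residual r of the formant predictor and predicts r[t] from b * r[t - T]. The function name, the lag search range, and the impulse-train test residual are hypothetical; interpolation and multi-tap variants are not covered.

```python
import numpy as np

def one_tap_pitch_predictor(r, min_lag, max_lag):
    """Return (lag, gain): for each candidate lag (min_lag >= 1) the gain is the
    least-squares optimum, and the lag with the largest error reduction is kept."""
    r = np.asarray(r, dtype=float)
    best_lag, best_gain, best_reduction = None, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        current, delayed = r[lag:], r[:-lag]
        energy = np.dot(delayed, delayed)
        if energy == 0.0:
            continue
        cross = np.dot(current, delayed)
        reduction = cross ** 2 / energy  # error energy removed by the optimal gain
        if cross > 0.0 and reduction > best_reduction:
            best_lag, best_gain, best_reduction = lag, cross / energy, reduction
    return best_lag, best_gain

# Idealized voiced residual: impulse train with a period of 50 samples
r = np.zeros(400)
r[::50] = 1.0

lag, gain = one_tap_pitch_predictor(r, min_lag=20, max_lag=120)
residual = r[lag:] - gain * r[:-lag]
```

The prediction error filter 1 - b z^-T has T zeros spread evenly around a circle of radius b^(1/T), which gives the comb filter behavior noted above.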

(10)

Outline

• Introduction

• Recap: Linear prediction of speech signals

• Linear prediction of audio signals*

• Sparse linear prediction of audio signals

• Conclusion

* Toon van Waterschoot and Marc Moonen, "Comparison of linear prediction models for audio signals", EURASIP J. Audio, Speech, Music Process., vol. 2008, Article ID 706935, 24 p., 2008.

(11)

Linear prediction of audio signals (1):

Fundamental issues (1)

[Figure: waveform, amplitude vs. time (s), with ATTACK, DECAY, SUSTAIN, and RELEASE phases marked; magnitude (dB) vs. frequency (Hz) spectrum]

• Difference between speech and audio signals:

–  temporal structure: attack-decay-sustain-release (stationarity interval ranges from a few milliseconds to a few seconds)

–  spectral structure: tonality (damped sinusoidal components) and timbre (spectral envelope)

(12)

Linear prediction of audio signals (2):

Fundamental issues (2)

• Sinusoidal signals:

–  sum of sinusoids admits AR( ) representation:

–  prediction error filter (PEF)

[Figure: magnitude (dB) spectrum plot: DFT spectrum and AR spectrum, P = 30; pole-zero plot (real part vs. imaginary part)]
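The AR representation of a sum of sinusoids can be checked numerically. The sketch below assumes S real, undamped sinusoids, for which the prediction error filter has order 2S with one conjugate zero pair on the unit circle per sinusoid; the function name, frequencies, and phase are hypothetical.

```python
import numpy as np

def pef_for_sinusoids(freqs):
    """PEF coefficients [1, a_1, ..., a_2S] with zeros at exp(+-j*w) for each w (rad/sample)."""
    freqs = np.asarray(freqs, dtype=float)
    zeros = np.concatenate([np.exp(1j * freqs), np.exp(-1j * freqs)])
    return np.poly(zeros).real

freqs = [0.2, 0.5, 1.1]  # hypothetical frequencies in rad/sample
t = np.arange(200)
x = sum(np.cos(w * t + 0.3) for w in freqs)

a = pef_for_sinusoids(freqs)
# Prediction error on the samples where the full filter memory is available
residual = np.convolve(x, a, mode="valid")
```

The residual vanishes up to rounding error, so the noiseless signal is predicted exactly from its 2S past samples.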

(13)

Linear prediction of audio signals (3):

Fundamental issues (3)

• Sinusoidal signals with additive broadband noise:

–  pole radii decrease: poles are drawn to the origin

–  pole angles increase: poles tend to uniform angular distribution

[Figure: magnitude (dB) spectrum plot: DFT spectrum and AR spectrum, P = 30]

audio signals are generally sampled at very high rate → distribution of dominating tonal components ≠ uniform

(14)

Linear prediction of audio signals (4):

Fundamental issues (4)

• Poles tend to uniform angle distribution:

Intuition

–  in noiseless case, PEF behaves as cascade of notch filters

–  two-zero notch filter at low frequency exhibits high-frequency boost

–  cascade of two-zero notch filters exhibits very high boost

–  additive broadband noise component in PEF output is amplified at high frequencies if majority of notch filters are centered at low frequencies

→ contradicts the PEF output whitening property of LP

[Figure: PEF magnitude response (dB); pole-zero plot (real part vs. imaginary part)]
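The boost can be quantified with a small sketch, assuming each notch is the two-zero filter 1 - 2cos(w0) z^-1 + z^-2 with zeros on the unit circle. The function name and the notch frequencies are hypothetical; 15 notches are used to mimic an order-30 PEF.

```python
import numpy as np

def notch_gain_db(notch_freq, eval_freq):
    """Gain in dB at eval_freq of the two-zero notch 1 - 2cos(w0) z^-1 + z^-2."""
    z_inv = np.exp(-1j * eval_freq)
    h = 1.0 - 2.0 * np.cos(notch_freq) * z_inv + z_inv ** 2
    return 20.0 * np.log10(np.abs(h))

# Single low-frequency notch evaluated at the Nyquist frequency (pi rad/sample):
# the gain there equals 2 + 2cos(w0), i.e. about 12 dB for small w0
single_boost = notch_gain_db(0.2, np.pi)

# Cascade of 15 low-frequency notches: gains in dB add up
notch_freqs = np.linspace(0.1, 0.8, 15)
cascade_boost = sum(notch_gain_db(w, np.pi) for w in notch_freqs)
```

A cascade boost of well over 100 dB at high frequencies would strongly amplify any broadband noise there, which is why the least-squares solution moves the zeros away from this configuration once noise is present.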

(15)

Linear prediction of audio signals (5):

Fundamental issues (5)

• Poles tend to uniform angle distribution:

Proof

–  LP model parameter estimation criterion:

–  can be shown by considering

(16)

Linear prediction of audio signals (6):

Alternative models (1)

• LP models different from low-order all-pole model

–  pole-zero model (PZLP)

–  high-order all-pole model (HOLP)

–  pitch prediction model (PLP)

• Low-order all-pole model of transformed audio signal

–  warped all-pole model (WLP)

–  selective all-pole model (SLP)

• Example: monophonic audio signal segment

[Figure: pole-zero plot (real part vs. imaginary part); magnitude spectrum plot]

(17)

Linear prediction of audio signals (7):

Alternative models (2)

1. Pole-zero model (PZLP):

–  sum of sinusoids + white noise admits ARMA( ) representation

[Figure: magnitude (dB) spectrum plot: DFT spectrum and ARMA spectrum, P = 30; pole-zero plot (real part vs. imaginary part)]
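A short derivation sketch shows why a pole-zero model fits this signal class. It assumes S real, undamped sinusoids x(t) observed in additive white noise e(t); the symbols q^{-1} (unit delay), omega_i, and S are assumptions introduced here.

```latex
y(t) = x(t) + e(t), \qquad
A(q)\,x(t) = 0, \qquad
A(q) = \prod_{i=1}^{S} \left( 1 - 2\cos\omega_i\, q^{-1} + q^{-2} \right)

\Longrightarrow\quad A(q)\,y(t) = A(q)\,x(t) + A(q)\,e(t) = A(q)\,e(t)
```

Under these assumptions the noisy signal obeys an ARMA(2S, 2S) relation whose AR and MA polynomials coincide, with all roots on the unit circle.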

(18)

Linear prediction of audio signals (8):

Alternative models (3)

2. High-order all-pole models (HOLP):

–  high-order all-pole model can closely approximate pole-zero model

–  compression still feasible by exploiting sparsity of all-pole model parameter vector (sparse linear prediction, see later)

[Figure: pole-zero plot (real part vs. imaginary part); magnitude spectrum plot]

(19)

Linear prediction of audio signals (9):

Alternative models (4)

3. Pitch prediction model (PLP):

–  comparable to cascade formant + pitch predictor for voiced speech

–  1st all-pole model=periodicity, 2nd all-pole model=spectral envelope

–  only appropriate for modeling monophonic signals (harmonicity)

[Figure: magnitude (dB) spectrum plot: DFT spectrum and AR+PP spectrum, P = 30; pole-zero plot (real part vs. imaginary part)]

(20)

Linear prediction of audio signals (10):

Alternative models (5)

4. Warped all-pole model (WLP):

–  frequency warping = non-uniform transformation of frequency axis

–  pole angle distribution: uniform (warped domain) → logarithmic (linear domain)

–  model order: finite (warped domain) → infinite (linear domain)

[Figure: magnitude (dB) spectrum plot: DFT spectrum and WAR spectrum, P = 30]
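The non-uniform frequency mapping can be sketched as follows, assuming warping is realized by replacing each unit delay with a first-order all-pass section with parameter lam, |lam| < 1. The function name and the value lam = 0.7 are hypothetical.

```python
import numpy as np

def warp_frequency(omega, lam):
    """Map frequency omega (rad/sample) through the phase of a first-order all-pass section."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan2(lam * np.sin(omega), 1.0 - lam * np.cos(omega))

lam = 0.7  # hypothetical warping parameter

# Pole angles spread uniformly in the warped domain ...
uniform_warped = np.linspace(0.0, np.pi, 9)
# ... map back to the linear domain with the inverse map (same formula with -lam)
linear_angles = warp_frequency(uniform_warped, -lam)
```

With lam > 0 the uniformly spaced warped angles map back to linear angles that crowd toward low frequencies, where the dominating tonal components of audio signals tend to lie.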

(21)

Linear prediction of audio signals (11):

Alternative models (6)

5. Selective all-pole model (SLP):

–  frequency zooming = uniform transformation of frequency axis

–  downsampling → conventional LP → upsampling

[Figure: magnitude (dB) spectrum plot: DFT spectrum and SAR spectrum, P = 30; pole-zero plot (real part vs. imaginary part)]

(22)

Linear prediction of audio signals (12):

Performance evaluation (1)

• Mean square frequency estimation error (MSFE) vs. SNR

[Figure: MSFE (dB) vs. SNR (dB) for LP_COV, LP_AUTO, PZLP, HOLP, PLP, WLP, SLP; axis annotations: "more tonal behavior", "higher frequency accuracy"]

(23)

Linear prediction of audio signals (13):

Performance evaluation (2)

• Spectral flatness measure (SFM) of LP residual

[Figure: SFM (dB) vs. SNR (dB) for LP_COV, LP_AUTO, PZLP, HOLP, PLP, WLP, SLP; axis annotations: "more tonal behavior", "higher spectral flatness"]
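A minimal sketch of a spectral flatness computation is given below, assuming the usual definition as the ratio of the geometric mean to the arithmetic mean of the power spectrum, expressed in dB. The function name, the small floor added to the power spectrum, and the test signals are hypothetical; the exact estimator behind the plotted results may differ.

```python
import numpy as np

def spectral_flatness_db(x):
    """Spectral flatness in dB: 0 dB for a perfectly flat spectrum, very negative for tonal signals."""
    power = np.abs(np.fft.rfft(x)) ** 2 + 1e-20  # small floor avoids log(0)
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return 10.0 * np.log10(geometric_mean / arithmetic_mean)

rng = np.random.default_rng(0)
n = 1024
noise = rng.standard_normal(n)                      # stands in for a well-whitened residual
tone = np.cos(2 * np.pi * 64 * np.arange(n) / n)    # stands in for a residual with a leftover sinusoid

sfm_noise = spectral_flatness_db(noise)
sfm_tone = spectral_flatness_db(tone)
```

A residual that still contains tonal components scores far lower than a noise-like one, so a higher SFM of the LP residual indicates that the model captured more of the tonal structure.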

(24)

Linear prediction of audio signals (14):

Performance evaluation (3)

• Interpeak dip depth (IDD):

–  quantifies perceptual separability of spectral peaks in LP model representing sum of two sinusoids separated by twice the ERB (equivalent rectangular bandwidth)

[Figure: IDD (dB) vs. f1 (Hz, logarithmic axis) for LP_COV, LP_AUTO, PZLP, HOLP, WLP, SLP; axis annotations: "higher frequency", "better perceptual separability"]

(25)

Outline

• Introduction

• Recap: Linear prediction of speech signals

• Linear prediction of audio signals

• Sparse linear prediction of audio signals*

• Conclusion

* Daniele Giacobello, Toon van Waterschoot, Mads Græsbøll Christensen, Søren Holdt Jensen, and Marc Moonen, "High-order sparse linear predictors for audio processing", in Proc. 18th European

(26)

Sparse linear prediction of audio signals (1)

• Motivation:

–  the high-order all-pole model (HOLP) performs very well at representing audio signals with varying degrees of tonality

–  harmonic relations between dominating tonal components induce sparsity and structure in HOLP parameter vector

• High-order sparse linear prediction (HOSpLP):

–  l1-regularized estimation problem with high-order parameter vector:

–  influence of regularization parameter:

•  related to prior knowledge on parameter vector (MAP approach) or how sparse the predictor should be

•  similar to Tikhonov regularization ( has highly correlated rows,
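An l1-regularized predictor estimate can be sketched as below. Several assumptions are made here: the cost is taken to be 0.5 * ||x - X alpha||_2^2 + gamma * ||alpha||_1, the solver is plain iterative soft thresholding (ISTA), chosen only because it is a simple first-order method, and the function names, test signal, and gamma value are hypothetical. The predictor convention is x[t] ≈ sum of alpha_k * x[t - k], so the prediction error filter is [1, -alpha].

```python
import numpy as np

def soft_threshold(v, thr):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def sparse_lp_ista(x, order, gamma, n_iter=500):
    """Minimize 0.5*||target - X alpha||_2^2 + gamma*||alpha||_1 with ISTA."""
    x = np.asarray(x, dtype=float)
    # Column k-1 of X holds x[t - k] for t = order .. len(x)-1
    X = np.column_stack([x[order - k : len(x) - k] for k in range(1, order + 1)])
    target = x[order:]
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    alpha = np.zeros(order)
    for _ in range(n_iter):
        grad = X.T @ (X @ alpha - target)
        alpha = soft_threshold(alpha - step * grad, step * gamma)
    return alpha

# Hypothetical test signal: periodicity at lag 40 only, x[t] = 0.8 x[t-40] + e[t]
rng = np.random.default_rng(0)
n, lag = 2000, 40
x = rng.standard_normal(n)  # excitation, overwritten in place by the signal
for t in range(lag, n):
    x[t] += 0.8 * x[t - lag]

alpha = sparse_lp_ista(x, order=50, gamma=300.0)
```

Larger gamma values drive more coefficients exactly to zero at the cost of shrinking the remaining ones, which is the sparsity trade-off described above.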

(27)

Sparse linear prediction of audio signals (2)

• Example 1: Monophonic audio signal

(28)

Sparse linear prediction of audio signals (3)

• Example 1: Monophonic audio signal

–  comparison of HOLP and HOSpLP signal spectrum estimates

[Figure: magnitude vs. frequency (Hz), with low-frequency inset; legend: X, HOLP, HOSpLP, ENV (4th order)]

(29)

Sparse linear prediction of audio signals (4)

• Example 2: Polyphonic audio signal

(30)

Sparse linear prediction of audio signals (5)

• Example 2: Polyphonic audio signal

–  comparison of HOLP and HOSpLP signal spectrum estimates

[Figure: magnitude vs. frequency (Hz), with low-frequency inset; legend: X, HOLP, HOSpLP, ENV (12th order)]

(31)

Conclusion

• Why has LP not been as popular for audio as for speech?

–  audio contains strong tonal (near-sinusoidal) components

–  dominating tonal components located in low-frequency region

–  PEF zeros tend to uniform angular distribution

• Alternative LP models do provide accurate representation

–  LP models different from low-order all-pole model

–  low-order all-pole model of transformed audio signal

• Sparse approximation leads to efficient representation

–  HOLP model provides most accurate audio signal representation

–  harmonic relations induce sparsity and structure in HOLP model

–  HOSpLP model results from solving l1-regularized estimation problem

• Ongoing work:

–  efficient first-order numerical optimization algorithms for HOSpLP