Unification of multi-microphone noise reduction systems
Ann Spriet, Simon Doclo, Marc Moonen, Jan Wouters 7th February 2006
1 General cost function
1.1 Signal model
Let X_i(f), i = 1, ..., M denote the frequency-domain microphone signals. Each microphone signal X_i(f) can be decomposed into a speech component X_i^s(f) and an additive noise component X_i^n(f) as:

X_i(f) = X_i^s(f) + X_i^n(f).   (1)
Defining H_i^s(f) as the acoustic transfer function from the speech source S(f) to the i-th microphone, the speech component X_i^s(f) equals:

X_i^s(f) = H_i^s(f)\, S(f) = \frac{H_i^s(f)}{H_1^s(f)}\, X_1^s(f),   (2)

where \tilde{H}_i^s(f) = H_i^s(f)/H_1^s(f) denotes the relative transfer function of the i-th to the first microphone.
Let X(f) ∈ C^{M×1} be defined as the stacked vector

X(f) = \left[ X_1(f)\;\; X_2(f)\;\; \cdots\;\; X_M(f) \right]^T.   (3)
Then, (1) can be written as

X(f) = X^s(f) + X^n(f) = H^s(f)\, S(f) + X^n(f) = \tilde{H}^s(f)\, X_1^s(f) + X^n(f),   (4)

with X^s(f) and X^n(f) defined similarly to (3) and

\tilde{H}^s(f) = H^s(f)/H_1^s(f) = \left[ 1\;\; \frac{H_2^s(f)}{H_1^s(f)}\;\; \ldots\;\; \frac{H_M^s(f)}{H_1^s(f)} \right]^T.   (5)
To simplify notation, we define the power spectral densities (PSDs) of the speech and noise components in the i-th microphone signal as

P_{X_i^s}(f) = \mathcal{E}\{X_i^s(f)\, X_i^{s,*}(f)\}   (6)

P_{X_i^n}(f) = \mathcal{E}\{X_i^n(f)\, X_i^{n,*}(f)\}   (7)

and the PSD of the speech source S(f) as

P_S(f) = \mathcal{E}\{S(f)\, S^*(f)\}.   (8)

In addition, we define the noise and speech correlation matrices as:

R^n(f) = \mathcal{E}\{X^n(f)\, X^{n,H}(f)\},   (9)

R^s(f) = \mathcal{E}\{X^s(f)\, X^{s,H}(f)\} = P_{X_1^s}(f)\, \tilde{H}^s(f)\, \tilde{H}^{s,H}(f).   (10)
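The rank-one structure of the speech correlation matrix in (10) is easy to verify numerically. A minimal sketch in NumPy, using an arbitrary, hypothetical relative transfer function vector and speech PSD:

```python
import numpy as np

# Sketch: the speech correlation matrix R^s(f) of Eq. (10) for one frequency
# bin, built from a hypothetical relative transfer function vector H_tilde
# (first entry 1 by definition) and a hypothetical speech PSD P_x1s.
M = 4
rng = np.random.default_rng(0)
H_tilde = np.concatenate(
    ([1.0 + 0j], rng.standard_normal(M - 1) + 1j * rng.standard_normal(M - 1)))
P_x1s = 2.5  # PSD of the speech component in the reference microphone

Rs = P_x1s * np.outer(H_tilde, H_tilde.conj())  # Eq. (10)

# R^s is Hermitian and rank one
print(np.linalg.matrix_rank(Rs))  # -> 1
```

The rank-one property is what later makes the matrix inversion lemma applicable in Sections 2 and 3.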
1.2 Free-field propagation model
Single point source
Assuming free-field propagation, mathematical expressions can be derived for the acoustic transfer function from a point source S(f, p) to the M microphones. Let S(f, p) be defined as a point source at location p with Cartesian coordinates p = (x, y, z) and spherical coordinates (R, θ, φ) (where R is the distance, θ the azimuth and φ the elevation), as defined in Figure 1. Without loss of generality, we define the origin of the coordinate system at the position of the first microphone of the microphone array. The contribution X_i(f, p) of the point source S(f, p) to the i-th microphone signal (with coordinates p_i) equals
X_i(f,p) = A_i(f,p)\, a_i(p)\, e^{-j 2\pi f \tau_i(p)}\, S(f,p),   (11)

where A_i(f,p) = A_i(f,θ,φ) includes the microphone characteristics of the i-th microphone (and, in the case of a hearing aid, the head-related transfer function to the i-th microphone), a_i(p) is the attenuation of the point source S(f,p) at the position of the i-th microphone (near-field effect) and

\tau_i(p) = \frac{\|p - p_i\|}{c},   (12)

with c the speed of sound (340 m/s), is the propagation delay from the point source S(f,p) to the i-th microphone. Defining the first microphone signal X_1(f,p) as the reference signal, X(f,p) can be written as:
X(f,p) = \tilde{d}(f,p)\, X_1(f,p),   (13)

where \tilde{d}(f,p) is the steering vector

\tilde{d}(f,p) = \left[ 1\;\; \frac{A_2(f,p)}{A_1(f,p)}\frac{a_2(p)}{a_1(p)}\, e^{-j 2\pi f (\tau_2(p)-\tau_1(p))}\;\; \ldots\;\; \frac{A_M(f,p)}{A_1(f,p)}\frac{a_M(p)}{a_1(p)}\, e^{-j 2\pi f (\tau_M(p)-\tau_1(p))} \right]^T.   (14)
Using (11), it can be shown that the PSD P_{X_1(f,p)}(f) of the first microphone signal X_1(f,p) (i.e., the reference signal) equals:

P_{X_1(f,p)}(f) = |A_1(f,p)\, a_1(p)|^2\, P_{S(f,p)}(f),   (15)

where P_{S(f,p)}(f) is the PSD of the source S(f,p). If the first microphone is omnidirectional (i.e., A_1(f,p) = 1), the PSD of the first microphone signal equals the PSD of the source signal S(f,p) up to a scalar |a_1(p)|^2. An estimate of the first microphone signal is then a scaled and delayed version of an estimate of the source signal.
Multiple point sources
If several point sources S(f,p) at positions p ∈ P are propagating, the microphone signals X(f) can be modeled as:

X(f) = \int_{p \in P} \tilde{d}(f,p)\, X_1(f,p)\, dp,   (16)

with X_1(f,p) defined by (11).

For uncorrelated point sources,

\mathcal{E}\{X_1(f,p_k)\, X_1^*(f,p_l)\} = P_{X_1}(f)\, \delta_{kl}.   (17)

The model (16)-(17) can be used when the speech/noise sources cover a certain known region in space or when an approximate position of the speech source is known.
Remark: The free-field propagation model assumes that there is no reverberation and that the microphone characteristics and positions, and (in the case of hearing aids) the HRTFs, are known. In practice, these assumptions will often be violated (e.g., microphone mismatch, reverberation), such that the true model (4) deviates from the free-field model (13)-(14), i.e.,

\tilde{H}(f) = \tilde{d}(f) + \delta\tilde{d}(f).   (18)

Because of this deviation, techniques assuming a free-field propagation model may suffer a performance degradation in practice. The amount of degradation depends on the deviation \delta\tilde{d}(f).
1.2.1 Far-field propagation
For far-field propagation, (14) equals

\tilde{d}(f,p) = \left[ 1\;\; \frac{A_2(f,p)}{A_1(f,p)}\, e^{-j 2\pi f (\tau_2(p)-\tau_1(p))}\;\; \ldots\;\; \frac{A_M(f,p)}{A_1(f,p)}\, e^{-j 2\pi f (\tau_M(p)-\tau_1(p))} \right]^T,   (19)

where

\tau_j(p) - \tau_1(p) = \frac{-x_j \sin\phi \cos\theta - y_j \sin\phi \sin\theta - z_j \cos\phi}{c}.   (20)

For the special case of a linear array (i.e., φ = 90°, y_j = 0), τ_j(p) − τ_1(p) reduces to:

\tau_j(p) - \tau_1(p) = \frac{-x_j \cos\theta}{c}.   (21)
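As a quick numerical check of (20)-(21), the delay difference can be evaluated in code; the array geometry below is a hypothetical example:

```python
import numpy as np

c = 340.0  # speed of sound [m/s]

def delay_diff(p_j, theta, phi):
    """Far-field delay difference tau_j - tau_1 of Eq. (20); p_j = (x, y, z)
    is the position of microphone j relative to the reference microphone,
    theta the azimuth and phi the elevation (both in radians)."""
    x, y, z = p_j
    return (-x * np.sin(phi) * np.cos(theta)
            - y * np.sin(phi) * np.sin(theta)
            - z * np.cos(phi)) / c

# Linear array along the x-axis (phi = 90 deg, y_j = z_j = 0): Eq. (21)
dtau = delay_diff((0.02, 0.0, 0.0), theta=0.0, phi=np.pi / 2)
print(dtau)  # -> -0.02/340, i.e. about -5.88e-5 s
```

For a source at azimuth 0° the second microphone, 2 cm further along x, receives the wavefront earlier, hence the negative delay difference.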
Figure 1: Coordinate system with cartesian and spherical coordinates. The origin corresponds to the position of the first microphone.
1.2.2 Near-field propagation
For near-field propagation, the steering vector \tilde{d}(f,p) equals

\tilde{d}(f,p) = \left[ 1\;\; \frac{A_2(f,p)}{A_1(f,p)}\frac{a_2(p)}{a_1(p)}\, e^{-j 2\pi f (\tau_2(p)-\tau_1(p))}\;\; \ldots\;\; \frac{A_M(f,p)}{A_1(f,p)}\frac{a_M(p)}{a_1(p)}\, e^{-j 2\pi f (\tau_M(p)-\tau_1(p))} \right]^T,   (22)

with

a_i(p) = \frac{1}{\|p - p_i\|}.   (23)
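For omnidirectional microphones (A_i = 1), the near-field steering vector (22)-(23) can be sketched as follows; the microphone and source positions are hypothetical:

```python
import numpy as np

c = 340.0  # speed of sound [m/s]

def steering_vector(f, p, mics):
    """Near-field steering vector (22)-(23), assuming omnidirectional
    microphones (A_i = 1); p is the source position and mics an (M, 3)
    array of microphone positions (row 0 is the reference microphone)."""
    dist = np.linalg.norm(p - mics, axis=1)  # source-microphone distances
    a = 1.0 / dist                           # attenuation a_i(p), Eq. (23)
    tau = dist / c                           # propagation delays, Eq. (12)
    return (a / a[0]) * np.exp(-2j * np.pi * f * (tau - tau[0]))

mics = np.array([[0.0, 0.0, 0.0], [0.02, 0.0, 0.0], [0.04, 0.0, 0.0]])
d_tilde = steering_vector(1000.0, np.array([1.0, 0.5, 0.0]), mics)
print(d_tilde[0])  # -> (1+0j): the reference entry is 1 by construction
```

In the far field the attenuation ratios a_i(p)/a_1(p) tend to 1 and the vector reduces to the phase-only form (19).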
1.3 Multi-microphone noise reduction
In a multi-microphone noise reduction system, the microphone signals X_i(f) are filtered by (adaptive or fixed) filters W_i(f) and combined in order to obtain an enhanced speech signal Z(f). To simplify the formulation of the noise reduction algorithms, we define

W(f) = \left[ W_1(f)\;\; W_2(f)\;\; \cdots\;\; W_M(f) \right]^H,   (24)

with W_l(f) = \sum_{m=0}^{L-1} w_{l,m}\, e^{-j 2\pi \frac{f}{f_s} m}. The output Z(f) of the multi-channel noise reduction algorithm can then be expressed as

Z(f) = W^H(f)\, X(f) = \underbrace{W^H(f)\, X^s(f)}_{Z^s(f)} + \underbrace{W^H(f)\, X^n(f)}_{Z^n(f)}.   (25)
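The filter-and-sum operation in (25) amounts to one inner product per frequency bin. A minimal sketch with random data and a simple channel-averaging filter (all values hypothetical):

```python
import numpy as np

# Filter-and-sum output (25): Z(f) = W^H(f) X(f), evaluated per frequency bin.
rng = np.random.default_rng(1)
M, F = 4, 257  # number of microphones, number of frequency bins
X = rng.standard_normal((M, F)) + 1j * rng.standard_normal((M, F))
W = np.full((M, F), 1.0 / M, dtype=complex)  # toy filter: average the channels

Z = np.einsum('mf,mf->f', W.conj(), X)  # W^H(f) X(f) for every bin
print(Z.shape)  # -> (257,)
```

With this averaging filter the output is simply the mean of the microphone channels; the algorithms below replace W with noise-suppressing designs.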
The goal of the filter W(f) is to minimize the output noise energy as much as possible without severely distorting the speech signal. The amount of speech distortion is measured with respect to a reference speech signal D^s(f). This reference signal can be the speech component X_1^s(f) in the first microphone, the speech source signal S(f) or the speech component in the output of a fixed beamformer (e.g., the speech reference in the spatially pre-processed SDW-MWF [?]).
A general cost function J(W( f )) for the filter W( f ) is:
J(W) = (1-\lambda)\, W^H(f)\, R^n(f)\, W(f) + \lambda\, W^H(f)\, R_m^n(f)\, W(f)
     + \mu_1\, \mathcal{E}\{(D^s(f) - W^H(f) X^s(f))(D^s(f) - W^H(f) X^s(f))^H\}
     + \mu_2\, \mathcal{E}\{(D_m^s(f) - W^H(f) X_m^s(f))(D_m^s(f) - W^H(f) X_m^s(f))^H\}.   (26)

The first two terms in J(W) correspond to the output noise power. This output noise power can be:
• estimated online (i.e., the term W^H(f) R^n(f) W(f)), or
| Speech model | Noise model | µ1 | µ2 | Technique |
|---|---|---|---|---|
| Fixed beamforming (Section 2) |||||
| A-priori | A-priori | 0 | ∞ | LCMV-based |
| A-priori | A-priori | 0 | ≠ ∞ | Weighted LS criterion |
| None | A-priori | 0 | 0 | Differential microphone array (constraint on W(f)) |
| Adaptive beamforming (Section 3) |||||
| A-priori | Online | 0 | ∞ | GSC |
| A-priori | Online | 0 | ≠ ∞ | Soft-constrained MWF (Nordholm) |
| A-priori | Combination | 0 | ∞ | Sensitivity-constrained GSC |
| A-priori | Combination | 0 | ≠ ∞ | Soft-constrained with partial noise model (e.g., Nordholm calibration data) |
| Adaptive beamforming (Section 4) |||||
| Online | Online | ∞ | 0 | TF-LCMV/ABM |
| Online | Online | ≠ ∞ | 0 | SDW-MWF |
| Online | Combination | ∞ | 0 | TF-LCMV with (partial) noise model |
| Online | Combination | ≠ ∞ | 0 | SDW-MWF with (partial) noise model |
| Online | Fixed | ∞ | 0 | TF-LCMV with noise model |
| Online | Fixed | ≠ ∞ | 0 | SDW-MWF with noise model (not useful) |
| Adaptive beamforming (Section 5) |||||
| Combination | Online | ≠ ∞ | ∞ | SDR-GSC |
| Combination | Online | ≠ ∞ | ≠ ∞ | Combination SDW-MWF/soft-constrained |
| Combination | Online | ∞ | ≠ ∞ | Combination TF-LCMV/GSC (cf. Kates) |
| Combination | Combination | ≠ ∞ | ∞ | SDR-GSC with partial noise model |
| Combination | Combination | ≠ ∞ | ≠ ∞ | SDW-MWF/soft-constrained + partial noise model |
| Combination | Combination | ∞ | ≠ ∞ | TF-LCMV/GSC with partial noise model |
| Combination | Fixed | ∞ | ≠ ∞ | TF-LCMV/GSC with noise model |
Table 1: Classification of multi-microphone noise reduction techniques.
• based on a pre-defined model R_m^n(f) of the noise correlation matrix, which is constructed through calibration measurements or mathematical models.
The last two terms in J(W) denote the distortion energy between the output speech component W^H(f) X^s(f) (or W^H(f) X_m^s(f)) and a reference speech signal D^s(f) (or D_m^s(f)). Again, the output speech distortion energy may be

• estimated online (i.e., as \mathcal{E}\{(D^s(f) - W^H(f) X^s(f))(D^s(f) - W^H(f) X^s(f))^H\}), or

• based on a pre-defined model X_m^s(f) for the microphone signals (i.e., as \mathcal{E}\{(D_m^s(f) - W^H(f) X_m^s(f))(D_m^s(f) - W^H(f) X_m^s(f))^H\}). Again, this model can be constructed based on calibration data or on mathematical models.

Depending on the use of a-priori knowledge of the speech and/or noise correlation matrix and the use of a hard constraint on the speech distortion term (i.e., µ_{1,2} = ∞ or µ_{1,2} ≠ ∞), different existing multi-microphone noise reduction techniques are obtained, as indicated in Table 1. When using a hard constraint (µ1 = ∞ or µ2 = ∞), noise suppression is only achieved in the subspace orthogonal to the defined or actual speech subspace. Signals in the (defined or actual) speech subspace are passed through undistorted by the noise reduction algorithm. The use of a soft constraint (µ1 ≠ ∞ or µ2 ≠ ∞) typically results in spectral filtering of the desired speech component D^s(f), since the speech and noise subspaces are generally not orthogonal (often, the noise subspace spans the complete space).
Below, the different techniques are explained in more detail.
2 A-priori speech and noise model: fixed beamforming
2.1 Hard constraint on speech distortion (µ2=∞)
Cost function
J(W) = W^H(f)\, R_m^n(f)\, W(f) + \mu_2\, \mathcal{E}\{(D_m^s(f) - W^H(f) X_m^s(f))(D_m^s(f) - W^H(f) X_m^s(f))^H\},   (27)

with µ2 = ∞.
Assumed speech model The free-field propagation model in (13)-(14) is assumed:

X_m^s(f) = \tilde{d}^s(f, p_s)\, X_{m,1}^s(f),   (28)

where p_s refers to the position of the speech source. The reference signal D_m^s(f) = X_{m,1}^s(f).
Assumed noise model Different noise models have been used in the literature. The most well-known are:
• Delay-and-sum beamformer: a homogeneous, spatially uncorrelated noise field is assumed, i.e.,

R_m^n(f) = P_N(f)\, I_M,   (29)

with P_{X_i^n}(f) = P_N(f), i = 1, ..., M.
• Superdirective beamformer [?]: a homogeneous, diffuse (spherically isotropic) noise field is assumed, i.e.,

R_m^n(f) = P_N(f)\, \Gamma^n(f),   (30)

with P_{X_i^n}(f) = P_N(f), i = 1, ..., M, and \Gamma^n(f) the coherence matrix of diffuse noise, i.e.,

\Gamma(f) = \begin{bmatrix} 1 & \Gamma_{12}(f) & \cdots & \Gamma_{1M}(f) \\ \Gamma_{21}(f) & 1 & \cdots & \Gamma_{2M}(f) \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma_{M1}(f) & \Gamma_{M2}(f) & \cdots & 1 \end{bmatrix},   (31)

\Gamma_{kl}(f) = \frac{\sin(2\pi f \|p_l - p_k\| / c)}{2\pi f \|p_l - p_k\| / c} = \mathrm{sinc}(2\pi f \|p_l - p_k\| / c),   (32)

with \|p_l - p_k\| the spacing between microphones l and k.
• Combination of diffuse and spatially uncorrelated noise fields: sensitivity-constrained superdirective beamformer [?, ?]

R_m^n(f) = P_N(f)\, \left( \Gamma_m^n(f) + \eta(f)\, I_M \right),

with \Gamma_m^n(f) the coherence matrix of diffuse noise and \eta(f) a (frequency-dependent) weighting factor.
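The diffuse-noise coherence matrix of (31)-(32) follows directly from the microphone geometry; a sketch for a hypothetical 3-microphone linear array (note that `np.sinc` computes the normalized sinc sin(πt)/(πt), so the argument must be divided by π):

```python
import numpy as np

def diffuse_coherence(f, mics, c=340.0):
    """Coherence matrix Gamma(f) of a diffuse (spherically isotropic) noise
    field, Eqs. (31)-(32); mics is an (M, 3) array of microphone positions.
    Eq. (32) uses the unnormalized sinc sin(x)/x, while np.sinc(t) computes
    sin(pi t)/(pi t), hence the division by pi below."""
    dist = np.linalg.norm(mics[:, None, :] - mics[None, :, :], axis=2)
    x = 2 * np.pi * f * dist / c
    return np.sinc(x / np.pi)

mics = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.10, 0.0, 0.0]])
Gamma = diffuse_coherence(1000.0, mics)
print(np.diag(Gamma))  # -> [1. 1. 1.]: full coherence at zero spacing
```

Off-diagonal entries shrink with increasing frequency and spacing, which is why superdirective gains are largest at low frequencies and small apertures.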
Solution The filter that minimizes (27) equals

W(f) = \left( R_m^n(f) + \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^s(f,p_s)\, \tilde{d}^{s,H}(f,p_s) \right)^{-1} \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^s(f,p_s).   (33)

Since \tilde{d}^s(f,p_s)\, \tilde{d}^{s,H}(f,p_s) is a rank-one matrix and R_m^n(f) + \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^s(f,p_s)\, \tilde{d}^{s,H}(f,p_s) is assumed to be full rank, the matrix inversion lemma can be applied (dropping the arguments (f,p_s) of \tilde{d}^s for brevity):

\left( R_m^n(f) + \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^s \tilde{d}^{s,H} \right)^{-1} = R_m^{n,-1}(f) - \frac{R_m^{n,-1}(f)\, \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^s \tilde{d}^{s,H}\, R_m^{n,-1}(f)}{1 + \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^{s,H}\, R_m^{n,-1}(f)\, \tilde{d}^s},   (34)

such that

W(f) = \frac{R_m^{n,-1}(f)}{1 + \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^{s,H}(f,p_s)\, R_m^{n,-1}(f)\, \tilde{d}^s(f,p_s)}\, \mu_2 P_{X_{m,1}^s}(f)\, \tilde{d}^s(f,p_s).   (35)

Using µ2 = ∞ (i.e., a hard constraint on the speech distortion term),

W(f) = \frac{R_m^{n,-1}(f)\, \tilde{d}^s(f,p_s)}{\tilde{d}^{s,H}(f,p_s)\, R_m^{n,-1}(f)\, \tilde{d}^s(f,p_s)} = \frac{\Gamma_m^{n,-1}(f)\, \tilde{d}^s(f,p_s)}{\tilde{d}^{s,H}(f,p_s)\, \Gamma_m^{n,-1}(f)\, \tilde{d}^s(f,p_s)}.   (36)
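The distortionless solution (36) can be checked numerically. The noise correlation matrix and steering vector below are toy values, not a calibrated model:

```python
import numpy as np

def distortionless_weights(Rn, d):
    """Hard-constraint (mu2 -> inf) solution of Eq. (36):
    W = Rn^{-1} d / (d^H Rn^{-1} d)."""
    Rinv_d = np.linalg.solve(Rn, d)
    return Rinv_d / (d.conj() @ Rinv_d)

M = 3
d = np.ones(M, dtype=complex)            # broadside far-field steering vector
Rn = np.eye(M) + 0.1 * np.ones((M, M))   # toy (diagonally loaded) noise model
W = distortionless_weights(Rn, d)

print((np.conj(W) @ d).real)  # -> 1.0: unit response in the look direction
```

Whatever Rn is used, the normalization by d^H Rn^{-1} d enforces W^H d = 1, i.e., the speech component defined by the steering vector passes undistorted.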
2.2 Soft constraint on speech distortion (µ2 ≠ ∞)
Cost function
J(W) = W^H(f)\, R_m^n(f)\, W(f) + \mu_2\, \mathcal{E}\{(D_m^s(f) - W^H(f) X_m^s(f))(D_m^s(f) - W^H(f) X_m^s(f))^H\},   (37)

with µ2 ≠ ∞.
Example The weighted least-squares (WLS) criterion [?, ?, ?, ?, ?, ?]

J_{WLS}(W(f)) = \int_{p \in P} L^2(f,p)\, \left| W^H(f)\, \tilde{d}(f,p) - D(f,p) \right|^2 dp,   (38)

with D(f,p) a desired directivity pattern, can be transformed into the cost function (37), namely when the desired directivity pattern D(f,p) is defined as:

D(f,p) ≠ 0 for p ∈ P_{pass},
D(f,p) = 0 for p ∈ P_{stop}.   (39)

Then

J(W(f)) = \int_{p \in P_{stop}} W^H(f)\, \tilde{d}(f,p)\, \tilde{d}^H(f,p)\, W(f)\, dp + \mu_2 \int_{p \in P_{pass}} (1 - W^H(f)\, \tilde{d}(f,p))(1 - W^H(f)\, \tilde{d}(f,p))^H\, dp.   (40)

This corresponds to the following speech and noise model:
• Speech model: The speech source is modeled as an infinite number of (uncorrelated) point sources with PSD L^2(f,p) in the angular region P_{pass}:

X_m^s(f) = \int_{p \in P_{pass}} \tilde{d}^s(f,p)\, X_1^s(f,p)\, dp,   (41)

with

\mathcal{E}\{X_1^s(f,p_k)\, X_1^{s,*}(f,p_l)\} = L^2(f,p_k)\, \delta_{kl} \quad \text{for } p_k, p_l \in P_{pass}.   (42)

The reference signal D_m^s(f) equals:

D_m^s(f) = \int_{p \in P_{pass}} D(f,p)\, X_1^s(f,p)\, dp.   (43)
• Noise model: The noise source is modeled as an infinite number of (uncorrelated) point sources with PSD L^2(f,p) in the angular region P_{stop}:

X_m^n(f) = \int_{p \in P_{stop}} \tilde{d}^n(f,p)\, X_1^n(f,p)\, dp,   (44)

with

\mathcal{E}\{X_1^n(f,p_k)\, X_1^{n,*}(f,p_l)\} = L^2(f,p_k)\, \delta_{kl} \quad \text{for } p_k, p_l \in P_{stop}.   (45)

From (44)-(45), it follows that:

R_m^n(f) = \int_{p \in P_{stop}} L^2(f,p)\, \tilde{d}^n(f,p)\, \tilde{d}^{n,H}(f,p)\, dp.   (46)
2.3 No constraint on speech distortion: differential microphone arrays [?, ?]
Cost function

J(W(f)) = W^H(f)\, R_m^n(f)\, W(f),   (47)

with W(f) = [1\;\; \alpha]^T to avoid the trivial solution W(f) = 0. The noise is modelled as M − 1 uncorrelated noise sources, i.e.,

X_m^n(f) = \sum_{i=1}^{M-1} X_1^n(f,p_i)\, \tilde{d}(f,p_i),   (48)

with

\mathcal{E}\{X_1^n(f,p_i)\, X_1^{n,*}(f,p_j)\} = P_{X_1^n}(f,p_i)\, \delta_{ij}.   (49)

Using (48)-(49), the noise correlation matrix R_m^n(f) equals

R_m^n(f) = \sum_{i=1}^{M-1} P_{X_1^n}(f,p_i)\, \tilde{d}(f,p_i)\, \tilde{d}^H(f,p_i),   (50)

where p_i are the coordinates of the noise sources. Typically, a linear array and far-field propagation are assumed, such that p_i is characterized by the azimuth θ_i^n of the noise source (cf. Section 1.2.1). Depending on θ_i^n, different directivity patterns are obtained (e.g., cardioid (θ_i^n = 180°), hypercardioid (θ_i^n = 90°), ...).
3 A-priori speech model
3.1 Online estimated noise model
3.1.1 Hard constraint (µ2=∞): GSC [?, ?]
Cost function
J(W) = W^H(f)\, R^n(f)\, W(f) + \mu_2\, \mathcal{E}\{(D_m^s(f) - W^H(f) X_m^s(f))(D_m^s(f) - W^H(f) X_m^s(f))^H\},   (51)

with µ2 = ∞.
Assumed speech model The free-field propagation model in (13)-(14) is assumed:

X_m^s(f) = \tilde{d}^s(f, p_s)\, X_{m,1}^s(f),   (52)

where p_s refers to the position of the speech source. The reference signal D_m^s(f) equals X_{m,1}^s(f).
Noise model The noise model is estimated online.
Solution The filter W(f) equals

W(f) = \left( R^n(f) + \mu_2 P_{X_1^s}(f)\, \tilde{d}^s(f,p_s)\, \tilde{d}^{s,H}(f,p_s) \right)^{-1} \mu_2 P_{X_1^s}(f)\, \tilde{d}^s(f,p_s).   (53)

Since \tilde{d}^s(f,p_s)\, \tilde{d}^{s,H}(f,p_s) is a rank-one matrix and R^n(f) + \mu_2 P_{X_1^s}(f)\, \tilde{d}^s(f,p_s)\, \tilde{d}^{s,H}(f,p_s) is assumed to be full rank, the matrix inversion lemma can be applied, resulting in (cf. Section 2.1):

W(f) = \frac{R^{n,-1}(f)}{1 + \mu_2 P_{X_1^s}(f)\, \tilde{d}^{s,H}(f,p_s)\, R^{n,-1}(f)\, \tilde{d}^s(f,p_s)}\, \mu_2 P_{X_1^s}(f)\, \tilde{d}^s(f,p_s).   (54)

Using µ2 = ∞ (i.e., a hard constraint on the speech distortion term),

W(f) = \frac{R^{n,-1}(f)\, \tilde{d}^s(f,p_s)}{\tilde{d}^{s,H}(f,p_s)\, R^{n,-1}(f)\, \tilde{d}^s(f,p_s)}.   (55)

In a GSC scheme, the hard constraint W^H(f)\, \tilde{d}^s(f,p_s) = 1 is imposed through the fixed beamformer and blocking matrix. The filter W(f) is then decomposed into a fixed filter W_q(f) (i.e., the so-called quiescent vector) and an adaptive filter W_a(f):

W(f) = W_q(f) + B(f)\, W_a(f),   (56)

with W_q(f) = \frac{1}{M}\, \tilde{d}^s(f,p_s) and B^H(f)\, \tilde{d}^s(f,p_s) = 0.
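The decomposition (56) can be sketched numerically. The blocking matrix below is an orthonormal basis of the subspace orthogonal to d̃^s (one valid choice among many), and the noise correlation matrix is a toy diagonal model:

```python
import numpy as np

def gsc(Rn, d):
    """GSC sketch for the decomposition (56): quiescent vector Wq = d/M,
    blocking matrix B with B^H d = 0, and the adaptive part Wa chosen to
    minimize the output noise power W^H Rn W."""
    M = len(d)
    Wq = d / M
    # B: orthonormal basis of the orthogonal complement of d (via SVD)
    _, _, Vh = np.linalg.svd(d[None, :].conj())
    B = Vh[1:].conj().T
    # optimal adaptive filter (sign chosen to match W = Wq + B Wa)
    Wa = -np.linalg.solve(B.conj().T @ Rn @ B, B.conj().T @ Rn @ Wq)
    return Wq + B @ Wa

M = 4
d = np.ones(M, dtype=complex)                     # toy steering vector
Rn = np.diag([1.0, 2.0, 3.0, 4.0]).astype(complex)  # toy noise estimate
W = gsc(Rn, d)
print(np.round((np.conj(W) @ d).real, 6))  # -> 1.0: hard constraint holds
```

Because B^H d = 0, adapting W_a can never touch the look-direction response; it only reduces the residual noise relative to the quiescent beamformer.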
3.1.2 Soft constraint (µ2 ≠ ∞): soft-constrained MWF techniques by Nordholm et al.
Cost function
J(W) = W^H(f)\, R^n(f)\, W(f) + \mu_2\, \mathcal{E}\{(D_m^s(f) - W^H(f) X_m^s(f))(D_m^s(f) - W^H(f) X_m^s(f))^H\},   (57)

with µ2 ≠ ∞.
Assumed speech model In [?], a fixed model is used for the spatial characteristics \tilde{H}^s(f) of the speech, while the speech PSD P_{X_1^s}(f) is estimated online. The speech source is modeled as an infinite number of (uncorrelated) point sources with true PSD P_{X_1^s}(f), clustered closely in space within a pre-defined area P:

X_m^s(f) = \int_{p \in P} X_{m,1}^s(f,p)\, \tilde{d}^s(f,p)\, dp,   (58)

D_m^s(f) = \int_{p \in P} X_{m,1}^s(f,p)\, dp,   (59)

with

\mathcal{E}\{X_{m,1}^s(f,p_k)\, X_{m,1}^{s,*}(f,p_l)\} = P_{X_1^s}(f)\, \delta_{kl} \quad \forall p_k, p_l \in P.   (60)

To separate the estimation of the spectral and spatial characteristics, the technique is implemented in the frequency domain.
Noise model The noise model is estimated online.
Solution The filter W(f) equals

W(f) = \left( \mu_2 R_m^s(f) + R^n(f) \right)^{-1} \mu_2\, \mathcal{E}\{X_m^s(f)\, D_m^{s,*}(f)\}.   (61)

In the soft-constrained MWF techniques by Nordholm et al., µ2 is set to 1 [?].

Assuming uncorrelated point sources, R_m^s(f) and \mathcal{E}\{X_m^s(f)\, D_m^{s,*}(f)\} in (61) can be computed as:

R_m^s(f) = \mathcal{E}\left\{ \int_{p \in P} X_{m,1}^s(f,p)\, \tilde{d}^s(f,p)\, dp \int_{p \in P} X_{m,1}^{s,*}(f,p)\, \tilde{d}^{s,H}(f,p)\, dp \right\}
         = \int_{p \in P} \tilde{d}^s(f,p)\, \tilde{d}^{s,H}(f,p)\, \mathcal{E}\{X_{m,1}^s(f,p)\, X_{m,1}^{s,*}(f,p)\}\, dp
         = P_{X_1^s}(f) \int_{p \in P} \tilde{d}^s(f,p)\, \tilde{d}^{s,H}(f,p)\, dp,   (62)

\mathcal{E}\{X_m^s(f)\, D_m^{s,*}(f)\} = P_{X_1^s}(f) \int_{p \in P} \tilde{d}^s(f,p)\, dp,   (63)

where P_{X_1^s}(f) is estimated online.
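The integrals in (62)-(63) can be approximated on a grid over the region P. The sketch below uses a hypothetical far-field grid around 90° azimuth, a unit speech PSD and a toy white-noise estimate:

```python
import numpy as np

c, f, mu2 = 340.0, 1000.0, 1.0  # mu2 = 1 as in the Nordholm-style techniques
mic_x = np.array([0.0, 0.02, 0.04])  # linear array along x [m]

def d_ff(theta):
    """Far-field steering vector (19), delays from Eq. (21)."""
    tau = -mic_x * np.cos(theta) / c
    return np.exp(-2j * np.pi * f * (tau - tau[0]))

P_x1s = 1.0                              # speech PSD (estimated online)
thetas = np.deg2rad([85.0, 90.0, 95.0])  # grid discretizing the region P

Rs_m = P_x1s * sum(np.outer(d_ff(t), d_ff(t).conj())
                   for t in thetas) / len(thetas)            # Eq. (62)
r_xd = P_x1s * sum(d_ff(t) for t in thetas) / len(thetas)    # Eq. (63)
Rn = np.eye(3, dtype=complex)            # toy online noise estimate

W = np.linalg.solve(mu2 * Rs_m + Rn, mu2 * r_xd)  # Eq. (61)
print(W.shape)  # -> (3,)
```

A finer grid (or calibration data, as discussed next) refines the spatial model without changing the closed form (61).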
Instead of using a mathematical speech model, the speech correlation matrix R_m^s(f) and the cross-correlation \mathcal{E}\{X_m^s(f)\, D_m^{s,*}(f)\} can also be computed based on calibration data [?]. This can, for example, be useful in hearing aid applications, where the head shadow effect should be taken into account.
3.2 Combination of online and fixed noise model
Instead of using only an online estimate of the noise model, the online estimated noise model can be combined with a pre-defined fixed noise model. This may be useful to increase the robustness of the noise reduction algorithm against model errors (e.g., sensitivity-constrained GSC [?]) and/or VAD failures, or when the location of some interfering sources is known a priori (e.g., echo or feedback in the case of a set-up with fixed loudspeaker and microphone positions [?]).
Cost function
J(W) = (1-\lambda)\, W^H(f)\, R^n(f)\, W(f) + \lambda\, W^H(f)\, R_m^n(f)\, W(f)
     + \mu_2\, \mathcal{E}\{(D_m^s(f) - W^H(f) X_m^s(f))(D_m^s(f) - W^H(f) X_m^s(f))^H\},   (64)

with λ > 0.
3.2.1 Hard constraint (µ2 = ∞)

Examples
• Sensitivity-constrained GSC [?, ?]
In [?, ?], the robustness of the GSC against model errors is increased by injecting spatially uncorrelated noise. This corresponds to a fixed noise model for spatially uncorrelated noise:

R_m^n(f) = P_N(f)\, I_M,   (65)

with P_{X_{m,i}^n}(f) = P_N(f), i = 1, ..., M.
Alternatives
• Alternatively, the online noise estimate of the GSC can be combined with a diffuse noise model or a model of noise in the back hemisphere to prevent amplification of sounds coming from the back. In addition, noise sources with a fixed, known location can be included (e.g., echo or feedback in the case of a fixed loudspeaker-microphone position).
3.2.2 Soft constraint (µ2 ≠ ∞)

Examples
• In [?], a model based on calibration signals is used for the speech signal and the echo signal, while the noise statistics are estimated online.