
Two-level ℓ1 Minimization for Compressed Sensing

Xiaolin Huang, Member, IEEE, Yipeng Liu, Member, IEEE, Lei Shi, Sabine Van Huffel, Fellow, IEEE, and Johan A.K. Suykens, Senior Member, IEEE

Abstract—Compressed sensing using ℓ1 minimization has been widely and successfully applied. To further enhance the sparsity, some non-convex penalties have been established, such as the ℓq-norm with 0 < q < 1, the smoothly clipped absolute deviation penalty, and the truncated ℓ1-norm. In this paper, a new non-convex penalty, which gives two different weights according to the order of the absolute values and is hence called the two-level ℓ1-norm, is proposed. On the one hand, compared with classical ℓ1 minimization, two-level ℓ1 minimization leads to a sparser solution. On the other hand, compared with other non-convex penalties, two-level ℓ1 minimization can be solved efficiently by local linearization. The corresponding iteratively reweighted method and soft thresholding algorithm are established. The soft thresholding operator for the two-level ℓ1-norm is expansive, and its convergence is proved by investigating a surrogate function. In numerical experiments, the proposed algorithms provide good sparse signal estimation performance, which makes two-level ℓ1 minimization a promising technique for compressed sensing.

Index Terms—compressed sensing, non-convex penalty, ℓ1 minimization, thresholding algorithm

I. INTRODUCTION

Compressed sensing can acquire and reconstruct a signal from far fewer measurements than classical Nyquist sampling requires. Random linear projections are used to subsample the analogue signal and obtain the digital measurements.

The basics of compressed sensing can be found in [1] [2] [3].

The sampling can be formulated in discrete form as b = Φx, where Φ ∈ R^{m×n} is the sampling matrix (also called the measurement matrix), x ∈ R^n is the signal, and b ∈ R^m represents the compressive measurements. Generally, we have m ≪ n.

Manuscript received 2013;

This work was supported by Research Council KUL: GOA/11/05 Ambiorics, GOA/10/09 MaNet, CoE EF/05/006, PFV/10/002 Optimization in Engineering (OPTEC), IOF-SCORES4CHEM, IDO 08/013 Autism, several PhD/postdoc & fellow grants; Flemish Government: FWO: PhD/postdoc grants, projects: G0226.06 (cooperative systems and optimization), G.0302.07 (SVM/Kernel), G.0320.08 (convex MPC), G.0558.08 (Robust MHE), G.0557.08 (Glycemia2), G.0588.09 (Brain-machine), research communities (WOG: ICCoS, ANMMM, MLDM); G.0377.09 (Mechatronics MPC), G.0377.12 (Structured models), G.0427.10N (Integrated EEG-fMRI), G.0108.11 (Compressed Sensing), G.0869.12N (Tumor imaging), G.0A5513N (Deep brain stimulation); IWT: TBM070713-Accelero, TBM080658-MRI (EEGfMRI), TBM110697-NeoGuard, PhD Grants, Eureka-Flite+, SBO LeCoPro, SBO Climaqs, SBO POM, O&O-Dsquare; Belgian Federal Science Policy Office: IUAP P6/04 (DYSCO, Dynamical systems, control and optimization, 2012-2017); iMinds; EU: ERNSI; ERC AdG A-DATADRIVE-B, FP7-HD-MPC (INFSO-ICT-223854), COST intelliCIS, FP7-EMBOCON (ICT-248940), FP7-HEALTH/2007-2013 (n260777); Contract Research: AMINAL; Other: Helmholtz: viCERP, ACCM, Bauknecht, Hoerbiger. L. Shi is also supported by the National Natural Science Foundation of China (11201079).

Johan Suykens and Sabine Van Huffel are professors at KU Leuven, Belgium.

The authors are with the Department of Electrical Engineering, ESAT-SCD-SISTA, KU Leuven, and with the iMinds Future Health Department, KU Leuven, B-3001 Leuven, Belgium (e-mails: huangxl06@mails.tsinghua.edu.cn, yipeng.liu@esat.kuleuven.be, leishi@fudan.edu.cn, sabine.vanhuffel@esat.kuleuven.be, johan.suykens@esat.kuleuven.be).

The signal is then reconstructed from these projections using signal recovery methods, which exploit the fact that many natural signals are sparse in certain representations, i.e., x = Ψθ, where Ψ is the representation matrix and θ is a sparse vector.

To recover a sparse signal θ from b, one can consider the following optimization problem

    min_θ ‖θ‖_0   s.t.   ΦΨθ = b,                                      (1)

where ‖θ‖_0 counts the number of nonzero components of θ.

We first consider the case Ψ = I_{n×n}, the identity matrix in R^{n×n}. This setting assumes that the signal x is sparse in the time domain, and then (1) becomes

    min_x ‖x‖_0   s.t.   Φx = b.                                       (2)

In this paper, the theoretical part focuses on (2), and one application with Ψ ≠ I_{n×n} is considered in the numerical experiments.

Problem (2) is NP-hard and difficult to solve directly. To recover sparse signals effectively, several fast algorithms have been established, including orthogonal matching pursuit [4], CoSaMP [5], subspace pursuit [6], and iterative hard thresholding [7].

Another possibility is to use a penalty function L(x) : R^n → R to approximate ‖x‖_0 and then to minimize this penalty subject to Φx = b, i.e.,

    min_x L(x)   s.t.   Φx = b.

The most popular choice of penalty function is the ℓ1-norm, i.e., L(x) = ‖x‖_1 = Σ_{i=1}^n |x(i)|, where x(i) is the i-th component of x. Then we obtain the following ℓ1 minimization problem

    min_x ‖x‖_1   s.t.   Φx = b,                                       (3)

which is also called basis pursuit [8]. Based on the ℓ1-norm, techniques such as the Dantzig selector [9] and decoding by linear programming [10], as well as practical algorithms such as iterative soft thresholding [11], have been established and widely applied in many fields.

Minimizing the ℓ1-norm is an efficient convex relaxation of (2).

If Φ satisfies certain conditions, such as the restricted isometry property [10], the null space property [12], or the incoherence condition [13], then the sparse signal can be recovered by solving (3). But due to the relaxation, the accuracy degrades [14]. Moreover, for some applications, the result of ℓ1 minimization is not sparse enough and cannot recover the original signal. The following simple example, given in [15], intuitively explains the failure of ℓ1 minimization for some sparse signals. In this example, the real signal is x̄ = [0, 1, 0]^T and

    Φ = [ 2  1  1
          1  1  2 ].

Our task is to recover the signal from b = Φx̄ = [1, 1]^T. Unfortunately, ℓ1 minimization does not lead to the real signal: instead of x̄, x̃ = [1/3, 0, 1/3]^T is optimal for (3). Fig. 1(a) (i.e., Fig. 1(b) in [15]) shows the reason: the feasible set intersects the interior of the ℓ1 ball {x : ‖x‖_1 ≤ ‖x̄‖_1}. Fig. 1(b) plots the contour map of the ℓ1-norm in the plane x(2) = 0, where x(i) denotes the i-th component of x. The solid line represents the level set {x : ‖x‖_1 = ‖x̄‖_1} and x̃ is marked by a red cross. We observe that x̃ is in the interior of {x : ‖x‖_1 ≤ ‖x̄‖_1}, which is why (3) cannot recover x̄.


Fig. 1. An example where ℓ1 minimization cannot recover x̄: (a) (Fig. 1(b) in [15]) the feasible set Φx = b intersects the interior of the ℓ1-norm ball {x : ‖x‖_1 ≤ ‖x̄‖_1}; (b) the contour map of the ℓ1-norm in the plane x(2) = 0; the solid line is the level set {x : ‖x‖_1 = ‖x̄‖_1}.
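To make the example concrete, the following small check (illustrative only, not from the paper; it assumes NumPy and SciPy are available) solves basis pursuit (3) for this Φ and b through the standard linear programming reformulation and confirms that the minimizer is x̃ = [1/3, 0, 1/3]^T rather than x̄.

    import numpy as np
    from scipy.optimize import linprog

    # The example from [15]: the true signal is 1-sparse, but l1 minimization misses it.
    Phi = np.array([[2.0, 1.0, 1.0],
                    [1.0, 1.0, 2.0]])
    x_bar = np.array([0.0, 1.0, 0.0])
    b = Phi @ x_bar                        # b = [1, 1]

    # Basis pursuit (3) as an LP over v = [x; z] with |x(i)| <= z(i):
    #   min sum(z)  s.t.  Phi x = b,  x - z <= 0,  -x - z <= 0.
    n = Phi.shape[1]
    c = np.concatenate([np.zeros(n), np.ones(n)])
    A_eq = np.hstack([Phi, np.zeros_like(Phi)])
    A_ub = np.block([[np.eye(n), -np.eye(n)],
                     [-np.eye(n), -np.eye(n)]])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                  A_eq=A_eq, b_eq=b, bounds=(None, None), method="highs")
    x_tilde = res.x[:n]
    print(x_tilde)                                        # approximately [1/3, 0, 1/3]
    print(np.abs(x_tilde).sum(), np.abs(x_bar).sum())     # 2/3 < 1: the l1-norm prefers x_tilde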

Motivated by this observation, one possible way of recovering x̄ is to replace the ℓ1-norm by another penalty L(x) that shrinks the penalty ball {x : L(x) ≤ L(x̄)}.

Using the ℓq-norm with 0 < q < 1 instead of the ℓ1-norm is one potential choice, which has been investigated in [16]–[20]. The sparsity of ℓq minimization is related to the value of q. For example, when q = 2/3, x̃ cannot be excluded from the ℓq-norm ball, as shown in Fig. 2(a), while when q = 1/2, x̄ can be recovered. Generally, the recovery performance of ℓq minimization improves but its computational complexity increases as q decreases.

Compared with ℓ1 minimization, minimizing the ℓq-norm enhances sparsity but loses computational efficiency. To maintain the computational advantage of ℓ1 minimization, researchers have tried to find a suitable penalty by modifying the ℓ1-norm. In [21], the truncated ℓ1-norm was proposed:

    L_trunc(x) = Σ_{i=1}^n min{a, |x(i)|}.

The contour map of the truncated ℓ1-norm with a = 4/5 is illustrated in Fig. 2(b), from which one can see that if the value a is properly selected, this penalty can lead to a good result. In the region |x(i)| > a, L_trunc behaves like the ℓ0-norm, i.e., the result is sparse but the optimization task is hard. In the region |x(i)| ≤ a, L_trunc is equal to the ℓ1-norm and the sparsity is not enhanced. To smooth the transition from the ℓ0-norm to the ℓ1-norm, one can consider the smoothly clipped absolute deviation penalty (SCAD) established in [22]. SCAD, whose contour map is illustrated in Fig. 2(c), has two parameters a1 and a2. SCAD equals the ℓ1-norm when |x(i)| < a1, equals the ℓ1/2-norm when a1 ≤ |x(i)| < a2, and equals the ℓ0-norm when |x(i)| ≥ a2. Both the truncated ℓ1-norm and SCAD enhance the sparsity. An efficient local algorithm was given in [23] by applying difference of convex functions (DC) programming, which was developed in [24] and [25].

The basic idea of DC programming is to linearize the penalty using its gradient and minimize the linearized problem. If the linearization is accurate over a large range, the minimization based on the linearized problem is effective.


Fig. 2. The contour maps of several non-convex penalties: (a) the ℓq-norm with q = 2/3; (b) the truncated ℓ1-norm with a = 4/5; (c) the smoothly clipped absolute deviation penalty with a1 = 1/5, a2 = 3/4; (d) the expected penalty, which can exclude x̃ and is piecewise linear.

Our goal in this paper is to establish a new penalty whose contour map is shown in Fig. 2(d). In this figure, x̃ is excluded and the level set is piecewise linear. Thus, we can on the one hand enhance the sparsity and on the other hand establish effective algorithms based on linearization. In Section II, we give the corresponding analytic formulation. Generally, the proposed penalty assigns two different penalty levels to different components according to the order of their absolute values; it is hence called the two-level ℓ1-norm. The two-level ℓ1-norm can be handled in the framework of reweighted ℓ1 minimization. But differently from some existing reweighted methods, such as [15] and [26], which assign weights based on the component values, we focus on the order of the absolute values. This difference makes it possible to establish a soft thresholding algorithm for the two-level ℓ1-norm.

The remainder of the paper is organized as follows: the formulation of the two-level ℓ1-norm is given in Section II, where we also discuss some properties and establish an iterative algorithm. In Section III, a soft thresholding algorithm for two-level ℓ1 minimization is established and its convergence is proved. The proposed algorithms are then evaluated by numerical experiments in Section IV. Section V ends the paper with concluding remarks.


II. TWO-LEVEL ℓ1 MINIMIZATION

A. Two-level ℓ1-norm

Define a penalty function

    L_ρ(x) = min{ ρ|x(1)| + |x(2)| + |x(3)|,  |x(1)| + ρ|x(2)| + |x(3)|,  |x(1)| + |x(2)| + ρ|x(3)| }.        (4)

When ρ = 1, L_ρ(x) reduces to the ℓ1-norm. With a decreasing value of ρ, the sparsity is enhanced. One can verify that L_{1/5}(x) gives the contour map shown in Fig. 2(d). For some other ρ values, we plot the corresponding L_ρ(x) in Fig. 3(a)–3(c).


Fig. 3. The contour maps of the two-level ℓ1-norm and the iteratively reweighted ℓ1-norm: (a) the two-level ℓ1-norm with ρ = 1/3; (b) the two-level ℓ1-norm with ρ = 1/10; (c) the two-level ℓ1-norm with ρ = 0; (d) the iteratively reweighted ℓ1-norm with ǫ = 1.

Using (4), we aim at realizing the expected penalty in Fig. 2(d). In a higher dimensional space, (4) is generalized into

    L^{2level}_{ρ,K}(x) = min_{I : |I| = K} { ρ Σ_{i∈I} |x(i)| + Σ_{i∈I^c} |x(i)| },        (5)

where I is a subset of {1, 2, . . . , n}, |I| stands for the cardinality of I, and I^c is the complement of I with respect to {1, 2, . . . , n}. In (5), there are two parameters ρ and K, which means that we give weight ρ to K components and weight 1 to the remainder. Thus, we call (5) the two-level ℓ1-norm.

Minimizing the two-level ℓ1-norm, i.e., min{L^{2level}_{ρ,K}(x) : Φx = b}, is formulated as

    min_x  min_{I : |I| = K} { ρ Σ_{i∈I} |x(i)| + Σ_{i∈I^c} |x(i)| }        (6)
    s.t.   Φx = b.

Obviously, (6) can be posed as

    min_{x,I}  ρ Σ_{i∈I} |x(i)| + Σ_{i∈I^c} |x(i)|
    s.t.   Φx = b,                                                          (7)
           |I| = K.

When I is determined, (7) becomes a linear programming problem, and two-level ℓ1 minimization (6) can be formulated as a mixed integer linear programming (MILP) problem. However, solving an MILP is not practical when the problem scale is large. Though (6) can be written as an MILP, it is essentially a continuous problem, and in the following we establish a descent algorithm.

For any point x, we can find a subset, related to x and denoted as I(x), such that

    L^{2level}_{ρ,K}(x) = ρ Σ_{i∈I(x)} |x(i)| + Σ_{i∈I^c(x)} |x(i)|.        (8)

A simple calculation shows that selecting the K components with the largest |x(i)| to form I(x) makes (8) valid. Therefore,

    |I(x)| = K   and   |x(i)| ≥ |x(j)|,  ∀i ∈ I(x), ∀j ∈ I^c(x).        (9)

In the rest of this paper, when we write an index set as a function of a vector, like I(x), it implies that the index set satisfies condition (9).
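For concreteness, the following sketch (illustrative, not the authors' code) evaluates L^{2level}_{ρ,K}(x) by selecting the K largest components in absolute value, and checks on the example of Fig. 1 that, with ρ = 1/5 and K = 1, the two-level ℓ1-norm ranks x̄ below x̃, whereas the ℓ1-norm does the opposite.

    import numpy as np

    def two_level_l1(x, rho, K):
        """Two-level l1-norm (5): weight rho on the K largest |x(i)|, weight 1 elsewhere."""
        a = np.abs(np.asarray(x, dtype=float))
        idx = np.argsort(a)[::-1]          # indices sorted by decreasing |x(i)|
        I = idx[:K]                        # I(x): the K largest components, cf. (9)
        Ic = idx[K:]
        return rho * a[I].sum() + a[Ic].sum()

    x_bar = np.array([0.0, 1.0, 0.0])      # true sparse signal from the example of Fig. 1
    x_tilde = np.array([1/3, 0.0, 1/3])    # l1 minimizer of (3) for that example

    print(np.abs(x_bar).sum(), np.abs(x_tilde).sum())                   # 1.0 vs 0.667
    print(two_level_l1(x_bar, 1/5, 1), two_level_l1(x_tilde, 1/5, 1))   # 0.2 vs 0.4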

Any I satisfying (9) defines a polyhedron

    D_I = {x : |x(i)| ≥ |x(j)|, ∀i ∈ I, ∀j ∈ I^c},

on which the linearization of L^{2level}_{ρ,K}(x) remains valid. That means, for any x0,

    L^{2level}_{ρ,K}(x) = ρ Σ_{i∈I(x0)} |x(i)| + Σ_{i∈I^c(x0)} |x(i)|,   ∀x ∈ D_{I(x0)}.        (10)

Hence, we can minimize a linear function over one polyhedron to obtain a descent for L^{2level}_{ρ,K}(x) from x0. Specifically, we consider the following linear programming problem related to I,

    min_{x,z,t}  ρ Σ_{i∈I} z(i) + Σ_{i∈I^c} z(i)
    s.t.   Φx = b,                                                          (11)
           z(i) ≥ x(i),  z(i) ≥ −x(i),  ∀i ∈ {1, 2, . . . , n},
           z(i) ≥ t, ∀i ∈ I,   z(i) ≤ t, ∀i ∈ I^c.

One can verify that when we choose I = I(x0), the point x0 (with z(i) = |x0(i)| and t = min_{i∈I(x0)} z(i)) is feasible for (11), and solving (11) decreases the objective value of (6) from L^{2level}_{ρ,K}(x0). Repeating this process leads to a local optimum of two-level ℓ1 minimization. Since (6) is non-convex, a good initial point is needed. In this paper, we use the result of ℓ1 minimization as the initial solution and give the following descent algorithm.

Algorithm 1: Descent method for two-level ℓ1-norm
Set l, i.e., the iteration counter, to zero;
Solve (3), denote the result as x0, and select I := I(x0);
repeat
    Set l := l + 1;
    Solve (11) and denote the result as x_l;
    Select I := I(x_l) (if there are multiple choices, select one not equal to I(x_{l−1}));
until L^{2level}_{ρ,K}(x_l) = L^{2level}_{ρ,K}(x_{l−1});
Algorithm ends and return x_l.
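As an illustration (a sketch under the assumption that scipy.optimize.linprog is an acceptable LP solver; it is not the authors' implementation), one descent step of Algorithm 1, i.e., problem (11) for a fixed index set I, can be written as a standard LP in the stacked variable [x; z; t]. Algorithm 1 then alternates this step with re-selecting I := I(x_l) until the objective of (6) stops decreasing.

    import numpy as np
    from scipy.optimize import linprog

    def descent_step(Phi, b, I, rho):
        """One step of Algorithm 1: solve the LP (11) for a fixed index set I.

        Variables are stacked as v = [x (n entries), z (n entries), t (1 entry)].
        """
        m, n = Phi.shape
        Ic = np.setdiff1d(np.arange(n), I)
        w = np.ones(n)
        w[I] = rho
        c = np.concatenate([np.zeros(n), w, [0.0]])    # rho*sum_I z(i) + sum_Ic z(i)

        A_eq = np.hstack([Phi, np.zeros((m, n)), np.zeros((m, 1))])    # Phi x = b

        rows = []
        for i in range(n):                 # z(i) >= x(i) and z(i) >= -x(i)
            r = np.zeros(2 * n + 1); r[i] = 1.0; r[n + i] = -1.0; rows.append(r)
            r = np.zeros(2 * n + 1); r[i] = -1.0; r[n + i] = -1.0; rows.append(r)
        for i in I:                        # z(i) >= t  <=>  t - z(i) <= 0
            r = np.zeros(2 * n + 1); r[n + i] = -1.0; r[-1] = 1.0; rows.append(r)
        for i in Ic:                       # z(i) <= t  <=>  z(i) - t <= 0
            r = np.zeros(2 * n + 1); r[n + i] = 1.0; r[-1] = -1.0; rows.append(r)

        res = linprog(c, A_ub=np.vstack(rows), b_ub=np.zeros(len(rows)),
                      A_eq=A_eq, b_eq=b, bounds=(None, None), method="highs")
        return res.x[:n]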


B. Iteratively reweighted method for the two-level ℓ1-norm

Minimizing the two-level ℓ1-norm by the above algorithm converges to a local optimum and enhances the sparsity compared with ℓ1 minimization. To speed up the local search for (6), in this subsection we develop another algorithm based on iteratively reweighted ℓ1 minimization. In Algorithm 1, the locally linearized function (8) is minimized with the constraint x ∈ D_{I(x)}. If we ignore this constraint, then minimizing (8) becomes a weighted ℓ1 minimization problem,

    min_x  Σ_{i=1}^n w(i)|x(i)|
    s.t.   Φx = b,                                                          (12)

where w(i) = ρ for i ∈ I(x) and w(i) = 1 for i ∈ I^c(x).

Weighted ℓ1 minimization has been implemented in [15] to establish an iterative algorithm for enhancing sparsity. The iteratively reweighted procedure sets

    w_l(i) = 1 / (|x_{l−1}(i)| + ǫ),                                        (13)

where ǫ is a pre-given parameter, and solves (12) to update x_l. We can normalize the weight vector w such that ‖w‖_2 = 1 and plot the equivalent penalty in Fig. 3(d), from which one can see its effect on enhancing sparsity. However, the convergence of iteratively reweighted ℓ1 minimization cannot be guaranteed. That algorithm essentially linearizes the penalty at the current solution and minimizes the linearized system. It is well known that, without a line search, minimization based on a locally linearized system cannot guarantee a decrease of the objective value. In [15], the authors pre-defined the maximal number of iterations.

Drawing lessons from the success of iteratively reweighted ℓ1 minimization, we linearize L^{2level}_{ρ,K}(x) at the current solution and solve (12), which essentially minimizes the local objective function without the constraint. Then we update the solution and the corresponding weights, and solve (12) again. The iteratively reweighted method for the two-level ℓ1-norm is summarized in the following algorithm.

Algorithm 2: Reweighted method for two-level ℓ1-norm
Set l := 0 and l_max (the maximal number of iterations);
Solve (3) and denote the result as x0;
repeat
    Set l := l + 1;
    Set w(i) := ρ, i ∈ I(x_{l−1}), and w(i) := 1, i ∈ I^c(x_{l−1});
    Solve (12) and denote the result as x_l;
until x_l = x_{l−1};
Algorithm ends and return x_l.
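A minimal sketch of Algorithm 2 (illustrative; the LP reformulation of the weighted ℓ1 problem (12) and the solver choice are our assumptions), using the stopping tolerance 10^{-6} described below:

    import numpy as np
    from scipy.optimize import linprog

    def weighted_l1_min(Phi, b, w):
        """Weighted l1 minimization (12) as an LP over [x; z] with |x(i)| <= z(i)."""
        m, n = Phi.shape
        c = np.concatenate([np.zeros(n), w])
        A_eq = np.hstack([Phi, np.zeros((m, n))])
        A_ub = np.block([[np.eye(n), -np.eye(n)],
                         [-np.eye(n), -np.eye(n)]])
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n),
                      A_eq=A_eq, b_eq=b, bounds=(None, None), method="highs")
        return res.x[:n]

    def two_level_reweighted(Phi, b, rho, K, l_max=20, tol=1e-6):
        """Algorithm 2: reweighted method for the two-level l1-norm (sketch)."""
        n = Phi.shape[1]
        x = weighted_l1_min(Phi, b, np.ones(n))      # x0: plain l1 minimization (3)
        for _ in range(l_max):
            I = np.argsort(np.abs(x))[::-1][:K]      # I(x): K largest |x(i)|, cf. (9)
            w = np.ones(n)
            w[I] = rho
            x_new = weighted_l1_min(Phi, b, w)
            if np.linalg.norm(x_new - x) <= tol:     # stopping rule used in the experiments
                return x_new
            x = x_new
        return x

Note that, as in the text, the weights depend only on the order of the |x(i)|, not on their values.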

In Algorithm 2, the stopping criterion can be checked via ‖x_l − x_{l−1}‖_2. In the experiments of this paper, the loop ends when ‖x_l − x_{l−1}‖_2 ≤ 10^{−6}. The basic idea of iteratively reweighted ℓ1 minimization is to penalize less the components with large absolute values, since they are unlikely to become zero. The reweighted method for minimizing the two-level ℓ1-norm has a similar idea: in Algorithm 2, we put a small penalty on the components with large absolute values. The difference is that we do not care about the value of |x(i)| but focus on the order of |x(i)|. This modification can further enhance the sparsity, as will be seen in the numerical experiments, and reduce the computational time, which is supported by the numerical study and can be explained by the fact that the effective region of linearization of the two-level ℓ1-norm is larger than that of the penalty used in iteratively reweighted ℓ1 minimization.

Similarly to the discussion in [15] on reweighted ℓ1 minimization, we cannot prove convergence for Algorithm 2 either. By comparison, Algorithm 1 converges to a local optimum of (6). However, the convergence needs many steps, and solving (11) takes much more time than solving (12). Though the precision of Algorithm 2 is lower than that of Algorithm 1, numerical experiments will show that Algorithm 2 already gives good sparse estimation performance, and hence we prefer Algorithm 2 in practice.

C. Analysis based on null space property

As analyzed previously, as ρ decreases, minimizing the two-level ℓ1-norm takes more computational time and leads to a sparser result. If we take ρ = 0, minimizing the two-level ℓ1-norm has some similarity to iterative support detection (ISD, [26]), which sets the weights of components larger than a threshold to zero. Though ISD focuses on the values rather than the order of the absolute values of the components, the analysis in [26] is applicable to the two-level ℓ1-norm as well. The analysis for ISD is mainly based on the null space property (NSP). We use x_I to denote the vector satisfying

    x_I(i) = x(i) if i ∈ I,   x_I(i) = 0 if i ∉ I.

Then the NSP can be expressed as follows.

Definition 1: For a matrix Φ ∈ R^{m×n}, if for order s and γ > 0 there is

    ‖η_S‖_1 ≤ γ‖η_{S^c}‖_1,   ∀η ∈ N(Φ), ∀S : |S| ≤ s,

where N(Φ) stands for the null space of Φ, then we say that Φ has the null space property of order s for γ.

The NSP was defined in [12], where the relationship between the NSP and the restricted isometry property (RIP, [10]) was also given: if the matrix Φ has the RIP of order k with constant δ, then Φ satisfies the NSP of order s1 for γ = √(s2/s1) (1 + δ)/(1 − δ), where s1 + s2 = k. In [27] and [28], the NSP has been applied to investigate the properties of iteratively reweighted least squares minimization. To analyze the performance of ISD, researchers extended the NSP to the truncated NSP, which was defined in [26] and is reviewed below.

Definition 2: For a matrix Φ ∈ R^{m×n}, if for order s, integer h, and γ > 0 there is

    ‖η_S‖_1 ≤ γ‖η_{H∩S^c}‖_1,   ∀η ∈ N(Φ), ∀S : |S| ≤ s, ∀H : |H| = h,

then Φ has the truncated NSP of order s for h and γ.

Using the truncated NSP, we give a sufficient condition for successful recovery via two-level ℓ1 minimization, as follows.


Theorem 1: Suppose b = Φx̄ and Φ has the truncated NSP of order ‖x̄‖_0 for K and γ. If γ < 1, then x̄ is the minimizer of (6).

Proof: x̄ being a minimizer of (6) means that

    L^{2level}_{ρ,K}(x̄ + v) ≥ L^{2level}_{ρ,K}(x̄),   ∀v ∈ N(Φ).

Denote I0 = I(x̄), I1 = I(x̄ + v), and S̄ = supp(x̄), where supp(x̄) is the support set of x̄. For any index set I ⊆ {1, 2, . . . , n}, there is

    ‖x̄_I‖_1 = ‖x̄_{I∩S̄}‖_1.

Therefore,

    L^{2level}_{ρ,K}(x̄ + v) = ρ‖x̄_{I1} + v_{I1}‖_1 + ‖x̄_{I1^c} + v_{I1^c}‖_1
        = ρ(‖x̄_{I1∩S̄} + v_{I1∩S̄}‖_1 + ‖v_{I1∩S̄^c}‖_1) + (‖x̄_{I1^c∩S̄} + v_{I1^c∩S̄}‖_1 + ‖v_{I1^c∩S̄^c}‖_1).

Considering the first term, we have

    ‖x̄_{I1∩S̄} + v_{I1∩S̄}‖_1 + ‖v_{I1∩S̄^c}‖_1
        = (‖x̄_{I1∩S̄} + v_{I1∩S̄}‖_1 − ‖x̄_{I1∩S̄}‖_1 + ‖v_{I1∩S̄}‖_1)
          + (‖x̄_{I1∩S̄}‖_1 + ‖v_{I1∩S̄^c}‖_1 − ‖v_{I1∩S̄}‖_1)
        ≥ ‖x̄_{I1}‖_1 + ‖v_{I1∩S̄^c}‖_1 − ‖v_{S̄}‖_1
        ≥ ‖x̄_{I1}‖_1.

The last inequality comes from the condition that Φ has the truncated NSP of order ‖x̄‖_0 for K and γ, together with γ < 1. Similarly, we know that

    ‖x̄_{I1^c∩S̄} + v_{I1^c∩S̄}‖_1 + ‖v_{I1^c∩S̄^c}‖_1 ≥ ‖x̄_{I1^c}‖_1,

and hence

    L^{2level}_{ρ,K}(x̄ + v) ≥ ρ‖x̄_{I1}‖_1 + ‖x̄_{I1^c}‖_1.

According to the definition of I(x̄), we know that I0 includes the K components with the largest absolute values |x̄(i)|; thus

    ρ‖x̄_{I1}‖_1 + ‖x̄_{I1^c}‖_1 ≥ ρ‖x̄_{I0}‖_1 + ‖x̄_{I0^c}‖_1 = L^{2level}_{ρ,K}(x̄),

from which it follows that

    L^{2level}_{ρ,K}(x̄ + v) ≥ L^{2level}_{ρ,K}(x̄),   ∀v ∈ N(Φ),

and the theorem is proved.

III. THRESHOLDING ALGORITHM FOR MINIMIZING THE TWO-LEVEL ℓ1-NORM

To suppress additive noise in the measurements, researchers penalize the violation of Φx = b and modify ℓ1 minimization (3) into the following problem,

    min_x ‖Φx − b‖_2^2 + 2τ‖x‖_1,                                           (14)

where τ > 0 is a trade-off parameter related to the noise level.

For solving (14), an iterative soft thresholding algorithm has been established in [11] and widely applied. Similarly, in this section we consider

    min_x ‖Φx − b‖_2^2 + 2τ L^{2level}_{ρ,K}(x),                            (15)

and establish a thresholding algorithm.

Define a soft thresholding operator S_{τ,K,ρ} as

    S_{τ,K,ρ}(x)(i) = x(i) − ρτ sign(x(i))   if i ∈ I(x) and |x(i)| ≥ ρτ,
                      0                       if i ∈ I(x) and |x(i)| < ρτ,
                      x(i) − τ sign(x(i))     if i ∈ I^c(x) and |x(i)| ≥ τ,
                      0                       if i ∈ I^c(x) and |x(i)| < τ.

We can write the local optimality condition of (15) as

    x = S_{τ,K,ρ}(x + Φ^T(b − Φx)).                                         (16)

Motivated by the optimality condition (16), we establish the following iterative soft thresholding algorithm for (15).

Algorithm 3: Two-level soft thresholding algorithm
Set l := 0 and x0 := Φ^T b;
repeat
    x_{l+1} := S_{τ,K,ρ}(x_l + Φ^T(b − Φx_l));
    Set l := l + 1;
until x_l = x_{l−1};
Algorithm ends and return x_l.
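A self-contained sketch of the operator S_{τ,K,ρ} and of Algorithm 3 (illustrative; the iteration cap and the tolerance used to implement the stopping test x_l = x_{l−1} are our assumptions, and Theorem 2 below requires ‖Φ‖_2 < 1):

    import numpy as np

    def soft_threshold_two_level(x, tau, K, rho):
        """Operator S_{tau,K,rho}: threshold rho*tau on the K largest |x(i)|, tau elsewhere."""
        x = np.asarray(x, dtype=float)
        thr = np.full(x.shape, tau)
        I = np.argsort(np.abs(x))[::-1][:K]           # I(x): K largest components, cf. (9)
        thr[I] = rho * tau
        return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

    def two_level_ist(Phi, b, tau, K, rho, max_iter=1000, tol=1e-8):
        """Algorithm 3: two-level iterative soft thresholding for (15); needs ||Phi||_2 < 1."""
        x = Phi.T @ b                                 # x0 := Phi^T b
        for _ in range(max_iter):
            x_new = soft_threshold_two_level(x + Phi.T @ (b - Phi @ x), tau, K, rho)
            if np.linalg.norm(x_new - x) <= tol:      # practical stand-in for x_l = x_{l-1}
                return x_new
            x = x_new
        return x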

According to the local optimality condition (16), we can expect that a stationary point of the two-level soft thresholding algorithm is a local optimum of (15). In the following, we prove the convergence of Algorithm 3. The proof differs from that of the soft thresholding algorithm proposed in [11] for the ℓ1-norm, since S_{τ,K,ρ} itself is expansive. Instead of an analysis based on a non-expansive operator, the convergence of Algorithm 3 is guaranteed by showing that a surrogate function is decreased by the proposed two-level soft thresholding.

Theorem 2: When ‖Φ‖_2 < 1, Algorithm 3 converges to a local optimum of (15).

Proof: Define the following function F : R^{2n} → R:

    F(x, y) = ‖Φx − b‖_2^2 + ‖x − y‖_2^2 + 2τ L^{2level}_{ρ,K}(y).          (17)

Algorithm 3 defines the following update formulation,

    y_l = S_{τ,K,ρ}(x_l),   x_{l+1} = y_l + Φ^T(b − Φy_l).

We will show that the process above iteratively decreases the value of F(x, y), which is guaranteed by proving that for any y ∈ R^n,

    F(x_l, y) ≥ F(x_l, y_l) ≥ F(y_l, y_l) ≥ F(x_{l+1}, y_l).                (18)

When x is fixed to x_l, F(x, y) is a function of y, denoted by g_{x_l}(y). Suppose y∗ is the minimizer of g_{x_l}(y). Then g_{x_l}(y) can be written as a separable function, g_{x_l}(y) = c_{x_l} + Σ_{i=1}^n g^i_{x_l,I(y)}(y(i)), where c_{x_l} = ‖Φx_l − b‖_2^2 is a constant and

    g^i_{x_l,I(y)}(y(i)) = (x_l(i) − y(i))^2 + 2ρτ|y(i)|,   i ∈ I(y),
    g^i_{x_l,I(y)}(y(i)) = (x_l(i) − y(i))^2 + 2τ|y(i)|,    i ∉ I(y).

Therefore, a simple calculation for the minimizer of g^i_{x_l,I(y∗)} shows that for any i ∈ I(y∗),

    y∗(i) = x_l(i) − ρτ sign(x_l(i))   if |x_l(i)| ≥ ρτ,
    y∗(i) = 0                          if |x_l(i)| < ρτ.

Similarly, for any i ∈ I^c(y∗),

    y∗(i) = x_l(i) − τ sign(x_l(i))   if |x_l(i)| ≥ τ,
    y∗(i) = 0                         if |x_l(i)| < τ.

In the following, we show that g_{x_l}(y_l) ≤ g_{x_l}(y∗). Recalling the formulation of y_l = S_{τ,K,ρ}(x_l), one can verify that

    y∗(i) = y_l(i),   ∀i ∈ (I(x_l) ∩ I(y∗)) ∪ (I^c(x_l) ∩ I^c(y∗)).

Then we focus only on I(x_l) ∩ I^c(y∗). If there exists an i such that i ∈ I(x_l) ∩ I^c(y∗), one can find another index j ∈ I^c(x_l) ∩ I(y∗), since |I(x_l)| = |I(y∗)| = K. Then |x_l(i)| > |x_l(j)| and |y∗(i)| < |y∗(j)|.

There are the following four cases:

1) |x_l(i)| ≤ τ, |x_l(j)| ≤ ρτ: then y∗(i) = y∗(j) = 0, which conflicts with the assumption |y∗(i)| < |y∗(j)|;

2) |x_l(i)| ≤ τ, |x_l(j)| > ρτ: then

    g^i_{x_l,I(y∗)}(y∗(i)) + g^j_{x_l,I(y∗)}(y∗(j))
        = x_l(i)^2 + ρ|x_l(i)| + 3ρ^2τ^2
        ≥ 3ρ^2τ^2 + x_l(j)^2 + ρ|x_l(j)|
        ≥ g^i_{x_l,I(x_l)}(x_l(i) − ρτ sign(x_l(i))) + g^j_{x_l,I(x_l)}(0)
        = g^i_{x_l,I(y_l)}(y_l(i)) + g^j_{x_l,I(y_l)}(y_l(j));

3) |x_l(i)| > τ, |x_l(j)| ≤ ρτ: then y∗(i) = x_l(i) − τ sign(x_l(i)) and y∗(j) = 0, which conflicts with the assumption |y∗(i)| < |y∗(j)|;

4) |x_l(i)| > τ, |x_l(j)| > ρτ: then

    g^i_{x_l,I(y∗)}(y∗(i)) + g^j_{x_l,I(y∗)}(y∗(j))
        = 3τ^2 + 3ρ^2τ^2
        = g^i_{x_l,I(y_l)}(y_l(i)) + g^j_{x_l,I(y_l)}(y_l(j)).

The discussion above means g_{x_l}(y_l) ≤ g_{x_l}(y∗), i.e., y_l = S_{τ,K,ρ}(x_l) is a minimizer of min_y F(x_l, y), which proves the first inequality of (18).

The second inequality of (18) equivalently means

    ‖Φx_l − b‖_2^2 + ‖x_l − y_l‖_2^2 − ‖Φy_l − b‖_2^2 ≥ 0.                  (19)

According to the fact that

    ‖Φx_l − b‖_2^2 − ‖Φy_l − b‖_2^2 ≥ −‖Φ(x_l − y_l)‖_2^2,

we know that the left-hand side of (19) is larger than or equal to

    ‖x_l − y_l‖_2^2 − ‖Φ(x_l − y_l)‖_2^2 = (x_l − y_l)^T (I_{n×n} − Φ^TΦ)(x_l − y_l) ≥ 0,

where the last inequality comes from ‖Φ‖_2 < 1.

Next, we verify the third inequality of (18) by considering the following optimization problem,

    min_x ‖Φx − b‖_2^2 + ‖x − y_l‖_2^2.                                     (20)

Its optimality condition can be written as

    −Φ^T(b − Φx) + (x − y_l) = 0,

from which we formulate the following update scheme

    x_{h+1} = y_l + Φ^T(b − Φx_h),                                          (21)

where h indicates the iteration step. Similarly to the analysis for the Landweber iterative method, one can find that (21) produces a non-increasing sequence of objective values of (20) if ‖Φ‖_2 < 1. Moreover, by setting x_h = y_l, (21) indeed gives x_{l+1}. Hence, we know that F(y_l, y_l) ≥ F(x_{l+1}, y_l).

From (18), we can expect that the iterations defined by Algorithm 3 provide a non-increasing sequence of F(x, y). Suppose that after l steps F(x, y) converges, i.e., F(x_l, y_l) = F(x_{l+1}, y_{l+1}). According to (18), we have

    F(x_l, y_l) = F(y_l, y_l) = F(x_{l+1}, y_l) = F(x_{l+1}, y_{l+1}).      (22)

The last equality of (22) means that y_l satisfies the optimality condition for g_{x_{l+1}}(y), i.e., y_l = y_{l+1} = S_{τ,K,ρ}(x_{l+1}), where x_{l+1} is obtained from (21) by setting x_h = y_l. Thus, we have

    y_{l+1} = S_{τ,K,ρ}(y_l + Φ^T(b − Φy_l)),

which means that y_{l+1} meets (16), the optimality condition of (15). Therefore, Algorithm 3 converges to a local optimum of (15).

IV. NUMERICAL EXPERIMENTS

In this paper, we introduce the two-level ℓ1-norm for sparse signal recovery. From Fig. 3, one can expect both the sparsity of the two-level ℓ1-norm and its computational effectiveness compared with other non-convex penalties. There are two parameters in (5): the number K of components receiving the small penalty and the weight ρ for those components. Parameter ρ trades off between sparsity and computational complexity, which can be observed from the contour map of L^{2level}_{ρ,K}(x). When ρ is small, the result is sparser but the problem is harder to solve. In the following experiments, we test the performance of two-level ℓ1 minimization with ρ = 0, 1/10, 1/3.

For any ρ < 1, L^{2level}_{ρ,K}(x) is non-convex, and the established algorithms do not have global search capability. To achieve a good solution, we use the result of ℓ1 minimization, denoted by x0, as the starting point for Algorithm 1 and Algorithm 2. There are ‖x0‖_0 non-zero components in x0, and we minimize the two-level ℓ1-norm to enhance the sparsity beyond x0. L^{2level}_{ρ,K}(x) gives a smaller penalty to K components and tries to make the remaining ‖x0‖_0 − K components zero. Hence, when we have no prior knowledge about the sparsity of the real signals, we set K according to x0. In the experiments, we test several K values and suggest using K = ⌊(1/3)‖x0‖_0⌋, where ⌊a⌋ stands for the largest integer not exceeding a. The proposed two-level soft thresholding algorithm (Algorithm 3) does not calculate x0, but we can expect that ‖x0‖_0 is less than m, the number of equality constraints. Hence, for Algorithm 3, K can be heuristically chosen as K = ⌊m/5⌋. All the experiments are done in Matlab R2011a on a Core 2 (2.83 GHz) machine with 2.96 GB RAM.
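As a hypothetical setup matching these heuristics (illustrative Python, not the authors' Matlab code; the values of n, m, k, and τ are assumptions for the sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, k = 256, 64, 8                           # assumed problem sizes

    x_true = np.zeros(n)                           # k-sparse signal
    support = rng.choice(n, size=k, replace=False)
    x_true[support] = rng.standard_normal(k)

    Phi = rng.standard_normal((m, n))
    Phi /= (np.linalg.norm(Phi, 2) * 1.01)         # enforce ||Phi||_2 < 1 (Theorem 2)
    b = Phi @ x_true

    rho = 1/10
    K = m // 5                                     # heuristic K = floor(m/5) for Algorithm 3
    tau = 0.01                                     # trade-off parameter in (15); assumed value
    # x_hat = two_level_ist(Phi, b, tau, K, rho)   # using the Algorithm 3 sketch above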

A. Simulated sparse signal recovery

We first evaluate the performance of two-level ℓ1 minimization on the sparse signal recovery problem and compare the results with ℓ1 minimization and iteratively reweighted ℓ1 minimization. We randomly generate a k-sparse signal x̄ ∈ R^n, i.e., ‖x̄‖_0 = k, and a sampling matrix Φ ∈ R^{m×n}. Then we
