Bregman forward-backward splitting for nonconvex composite optimization:
superlinear convergence to nonisolated critical points
MASOUD AHOOKHOSH, ANDREAS THEMELIS AND PANAGIOTIS PATRINOS
Abstract. We introduce Bella, a locally superlinearly convergent Bregman forward-backward splitting method for minimizing the sum of two nonconvex functions, one of which satisfies a relative smoothness condition [9, 46] while the other is possibly nonsmooth. A key tool of our methodology is the Bregman forward-backward envelope (BFBE), an exact and continuous penalty function with favorable first- and second-order properties, which enjoys a nonlinear error bound when the objective function satisfies a Łojasiewicz-type property. The proposed algorithm is of linesearch type over the BFBE along candidate update directions. It converges subsequentially to stationary points, globally under a KL condition, and, owing to the given nonlinear error bound, can attain superlinear convergence rates even when the limit point is a nonisolated minimum, provided the directions are suitably selected.
1. Introduction
In this paper, we address the composite minimization problem
\[ \operatorname*{minimize}_{x\in\mathbb{R}^n}\ \varphi(x) \equiv f(x) + g(x) \tag{1.1} \]
under the following hypotheses (see Section 2.2):
Assumption I (requirements for composite minimization (1.1)). The following hold:
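a1 $h:\mathbb{R}^n\to\overline{\mathbb{R}} := \mathbb{R}\cup\{\infty\}$ is strictly convex, 1-coercive,$^1$ and essentially smooth;$^2$
a2 $f:\mathbb{R}^n\to\overline{\mathbb{R}}$ is $L_f$-smooth relative to $h$: namely, the functions $L_f h \pm f$ are convex on $\operatorname{dom} h$;
a3 $g:\mathbb{R}^n\to\overline{\mathbb{R}}$ is proper and lower semicontinuous (lsc);
a4 $\operatorname{dom}\varphi \subseteq \operatorname{int}\operatorname{dom} h$, and $\arg\min\varphi \neq \emptyset$.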
Despite its simple structure, (1.1) encompasses a variety of optimization problems appearing frequently in scientific areas such as signal and image processing, machine learning, control and system identification; see, e.g., [33, 46]. The notion of relative smoothness was recently introduced in the seminal works [9, 46] as a generalization of smooth functions with Lipschitz-continuous gradients. Studying optimization problems involving relatively smooth functions has received much attention during the last few years [9, 19, 32, 33, 46, 49, 62]. In the composite setting (1.1), since f is relatively smooth and g is nonsmooth nonconvex, we can cover a wide spectrum of applications.
1991 Mathematics Subject Classification. 90C06, 90C25, 90C26, 49J52, 49J53.
Key words and phrases. Nonsmooth nonconvex optimization, Bregman forward-backward envelope, relative smoothness, prox-regularity, KL inequality, nonlinear error bound, nonisolated critical points, superlinear convergence.
Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
This work was supported by the Research Foundation Flanders (FWO) research projects G086518N and G086318N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen under EOS project no 30468160 (SeLMA).
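1 $h$ is 1-coercive if $h/\|\cdot\|$ is coercive, i.e., $h(x)/\|x\| \to \infty$ as $\|x\|\to\infty$.
2 $h$ is essentially smooth if $\operatorname{int}\operatorname{dom} h \neq \emptyset$, $h \in C^1(\operatorname{int}\operatorname{dom} h)$, and $\|\nabla h(x^k)\| \to \infty$ for any sequence $(x^k)_{k\in\mathbb{N}}$ converging to a point in the boundary of $\operatorname{dom} h$.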
There are plenty of optimization algorithms that can handle composite minimization of the form (1.1), such as [1, 2, 12, 17, 48, 64] for convex problems and [5, 18, 21, 20, 22, 30, 57, 66] for nonconvex problems. Recently, in the relatively smooth setting for convex f and g, [9] proposed a Bregman proximal gradient method, [46] developed primal and dual algorithms, [49] proposed an accelerated tensor method, and [32, 33] suggested a Nesterov-type accelerated method and a stochastic mirror descent method. Moreover, [19, 62] extended the results of [9] to the nonconvex setting. More recently, [8] showed linear convergence of the gradient method for relatively smooth functions. To our knowledge, apart from the latter three papers, there have not been many attempts to deal with (1.1) in the relatively smooth setting for nonsmooth nonconvex problems.
One of the most significant discussions in the field of numerical optimization has been related to designing iterative schemes guaranteeing a superlinear convergence rate; see, e.g., [50] for many algorithms attaining a superlinear convergence rate for smooth problems and [30, 31, 57, 64, 66] for other related works in the nonconvex setting. In most of these attempts, the key element is the so-called Dennis-Moré condition [25, 26], which guarantees superlinear convergence to an isolated critical point of the objective function.
However, there are many applications that have nonisolated critical points, such as low-rank matrix completion [60], low-rank matrix recovery [13], phase retrieval [59], and deep learning [37]. Up to now, besides some attempts for minimizing smooth nonlinear least-squares problems (see, e.g., [3, 4, 35] and references therein), far too little attention has been paid to superlinear convergence to nonisolated critical points for nonconvex nonsmooth problems. Our main motivation is thus to design an algorithmic framework that requires only a first-order black-box oracle with guarantees of superlinear convergence to nonisolated critical points in a fully nonsmooth nonconvex setting.
1.1. Contribution. We propose the Bregman EnveLope Linesearch Algorithm (Bella) to address problem (1.1), a method that generalizes Bregman forward-backward splitting (BFBS). Our contribution can be summarized as follows.
1) Bregman forward-backward envelope: a new key tool. We introduce an envelope function for forward-backward splitting using the Bregman distance, the Bregman forward-backward envelope (BFBE), which is a generalization of its Euclidean counterpart introduced in [52] and later further analyzed in [42, 57, 64, 66, 71]. A local equivalence of the BFBE and its Euclidean version allows us to derive first- and second-order differential properties of the BFBE based on the known Euclidean properties of prox-regularity and epi-differentiability (Theorems 4.11 and 4.13). As a byproduct of our results, we also present the first- and second-order differential properties of the Bregman-Moreau envelope as a special case.
Moreover, the existence of first derivatives of the BFBE in a neighbourhood of critical points under such assumptions allows us to derive a local nonlinear error bound for the BFBE around local (not necessarily isolated) minima of the original function, whenever the latter satisfies the Kurdyka-Łojasiewicz (KL) property.
2) Superlinear convergence to nonisolated critical points. Using the aforementioned favorable properties of the BFBE around critical points, an accelerated Bregman forward-backward splitting (Bella) is developed, and the global and linear convergence of the sequence generated by this algorithm under the KL property are given (Theorem 5.5). Remarkably, under mild assumptions and thanks to the nonlinear error bound for the BFBE, superlinear convergence to nonisolated critical points is shown (Theorem 5.8). To the best of our knowledge, this is the first analysis that exploits this nonlinear error bound to guarantee superlinear convergence to a nonisolated critical point.
1.2. Paper organization. The rest of the paper is organized as follows. In Section 2 we introduce the notation and some preliminary results. In Section 3 we review some basic properties of the Bregman forward-backward mapping, needed in Section 4 to construct and analyze the BFBE, the key tool of our analysis. In Section 5, we introduce the proposed Bella BFBE-based linesearch algorithm, and finally Section 6 concludes the paper.
2. Preliminaries
2.1. Notation. The extended-real line is denoted by $\overline{\mathbb{R}} := \mathbb{R}\cup\{\infty\}$. The open and closed balls of radius $r \ge 0$ centered at $x\in\mathbb{R}^n$ are denoted as $B(x;r)$ and $\overline{B}(x;r)$, respectively. We say that $(x^k)_{k\in\mathbb{N}}\subset\mathbb{R}^n$ is summable if $\sum_{k\in\mathbb{N}}\|x^k\|$ is finite, and square-summable if $(\|x^k\|^2)_{k\in\mathbb{N}}$ is summable.
A function $f:\mathbb{R}^n\to\overline{\mathbb{R}}$ is proper if $f\not\equiv\infty$, in which case its domain is defined as the set $\operatorname{dom} f := \{x\in\mathbb{R}^n \mid f(x)<\infty\}$. For $\alpha\in\mathbb{R}$, $[f\le\alpha] := \{x\in\mathbb{R}^n \mid f(x)\le\alpha\}$ is the $\alpha$-(sub)level set of $f$; $[f\ge\alpha]$, $[f=\alpha]$, etc. are defined accordingly. We say that $f$ is level bounded if $[f\le\alpha]$ is bounded for all $\alpha\in\mathbb{R}$. The convex conjugate of a function $h$ is denoted as $h^* := \sup_z\{\langle\,\cdot\,,z\rangle - h(z)\}$.
A vector $v\in\partial f(x)$ is a subgradient of $f$ at $x$, where $\partial f(x)$ is the subdifferential
\[ \partial f(x) := \bigl\{ v\in\mathbb{R}^n \mid \exists (x^k,v^k)_{k\in\mathbb{N}} \text{ s.t. } x^k\to x,\ f(x^k)\to f(x),\ \hat\partial f(x^k)\ni v^k\to v \bigr\} \]
and $\hat\partial f(x)$ is the set of regular subgradients of $f$ at $x$, namely
\[ v\in\hat\partial f(x) \quad\text{iff}\quad \liminf_{z\to x,\ z\neq x} \frac{f(z)-f(x)-\langle v,z-x\rangle}{\|z-x\|} \ge 0. \]
Following the terminology of [55], we say that $f:\mathbb{R}^n\to\overline{\mathbb{R}}$ is strictly continuous at $\bar x$ if
\[ \operatorname{lip} f(\bar x) := \limsup_{y,z\to\bar x,\ y\neq z} \frac{|f(y)-f(z)|}{\|y-z\|} < \infty, \]
and strictly differentiable at $\bar x$ if $\nabla f(\bar x)$ exists and
\[ \lim_{y,z\to\bar x,\ y\neq z} \frac{f(y)-f(z)-\langle\nabla f(\bar x),y-z\rangle}{\|y-z\|} = 0. \]
With $C^{1+}(\mathbb{R}^n)$ and $C^{1,1}(\mathbb{R}^n)$ we indicate the sets of functions from $\mathbb{R}^n$ to $\mathbb{R}$ with locally and globally Lipschitz continuous gradient, respectively. If $f$ is strictly continuous in an open set $O$, then its gradient exists almost everywhere on $O$, and as such its Bouligand subdifferential
\[ \partial_B f(x) := \bigl\{ v \mid \exists\, x^k\to x \text{ with } \nabla f(x^k)\to v \bigr\} \]
is nonempty and compact for all $x\in O$ [55, Thm. 9.61].
For a point-to-set mapping $F:\mathbb{R}^n\rightrightarrows\mathbb{R}^n$, the sets of its fixed points and zeros are denoted as $\operatorname{fix} F := \{x\in\mathbb{R}^n \mid x\in F(x)\}$ and $\operatorname{zer} F := \{x\in\mathbb{R}^n \mid 0\in F(x)\}$, respectively.
2.2. Relative smoothness and hypoconvexity. Here, after giving some definitions, we establish necessary facts regarding relative smoothness and hypoconvexity.
Definition 2.1. Let $h:\mathbb{R}^n\to\overline{\mathbb{R}}$ be a proper, lsc, convex function with $\operatorname{int}\operatorname{dom} h\neq\emptyset$ and such that $h\in C^1(\operatorname{int}\operatorname{dom} h)$. Then, $h$ is said to be
(i) a kernel function if it is 1-coercive, i.e., $\lim_{\|x\|\to\infty} h(x)/\|x\| = \infty$;
(ii) essentially smooth, if $\|\nabla h(x^k)\|\to\infty$ for every sequence $(x^k)_{k\in\mathbb{N}}\subseteq\operatorname{int}\operatorname{dom} h$ converging to a boundary point of $\operatorname{dom} h$;
(iii) of Legendre type if it is essentially smooth and strictly convex.
Definition 2.2 (Bregman distance [23]). For a kernel function $h$, the Bregman distance $D_h:\mathbb{R}^n\times\mathbb{R}^n\to\overline{\mathbb{R}}$ is given by
\[ D_h(x,y) := \begin{cases} h(x)-h(y)-\langle\nabla h(y),x-y\rangle & \text{if } y\in\operatorname{int}\operatorname{dom} h,\\ \infty & \text{otherwise.} \end{cases} \tag{2.1} \]
If $h$ is a strictly convex kernel function, then $D_h$ serves as a pseudo-distance, having $D_h\ge 0$ and $D_h(x,y)=0$ iff $x=y\in\operatorname{int}\operatorname{dom} h$. In general, however, $D_h$ is nonsymmetric and fails to satisfy the triangle inequality. Many popular kernel functions, such as the energy, the Boltzmann-Shannon entropy, and the Fermi-Dirac entropy, lead to various Bregman distances that appear in many applications; see, e.g., [11, Ex. 2.3].
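As an illustration of Definition 2.2, the following minimal sketch evaluates (2.1) for two of the kernels mentioned above. The helper name `bregman` and the test points are hypothetical choices made for this example only; the sketch assumes both arguments lie in the interior of the domain of the kernel.

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>, assuming y in int dom h."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# Energy kernel h(x) = 0.5*||x||^2: D_h is half the squared Euclidean distance.
energy = lambda x: 0.5 * np.dot(x, x)
grad_energy = lambda x: x

# Boltzmann-Shannon entropy h(x) = sum(x*log(x) - x) on the positive orthant.
entropy = lambda x: np.sum(x * np.log(x) - x)
grad_entropy = lambda x: np.log(x)

x = np.array([0.2, 0.8])
y = np.array([0.5, 0.5])

d_energy = bregman(energy, grad_energy, x, y)   # equals 0.5*||x - y||^2
d_xy = bregman(entropy, grad_entropy, x, y)     # entropy distance from y to x
d_yx = bregman(entropy, grad_entropy, y, x)     # arguments swapped
d_self = bregman(entropy, grad_entropy, x, x)   # vanishes when x = y

print(d_energy, d_xy, d_yx, d_self)
```

For the entropy kernel the two orderings of the arguments give different values, which is the lack of symmetry noted after Definition 2.2.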
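Remark 2.3. The following assertions hold:
(i) If $h:\mathbb{R}^n\to\overline{\mathbb{R}}$ is of Legendre type and 1-coercive, then $h^*\in C^1(\mathbb{R}^n)$ is strictly convex. In fact, $\nabla h:\operatorname{int}\operatorname{dom} h\to\mathbb{R}^n$ is a (continuous) bijection with $\nabla h^{-1}=\nabla h^*$ [56, Thm. 26.5, Cor. 13.3.1].
(ii) If $h\in C^2$ is of Legendre type and $\nabla^2 h\succ 0$ on $\operatorname{int}\operatorname{dom} h$, then $h^*\in C^2(\mathbb{R}^n)$ [54].
(iii) $D_h$ is continuous on $\operatorname{int}\operatorname{dom} h\times\operatorname{int}\operatorname{dom} h$.
(iv) $D_h(\,\cdot\,,x)$ and $D_h(x,\,\cdot\,)$ are level bounded locally uniformly in $x$ [10, Lem. 7.3(v)-(viii)].$^3$
(v) If $\nabla h$ is $\tilde\sigma_h$-strongly monotone on an open convex set $U\subseteq\operatorname{int}\operatorname{dom} h$, then $D_h(y,x)\ge\frac{\tilde\sigma_h}{2}\|y-x\|^2$ for all $x,y\in U$.
(vi) If $\nabla h$ is $\tilde L_h$-Lipschitz on an open convex set $U\subseteq\operatorname{int}\operatorname{dom} h$, then $D_h(y,x)\le\frac{\tilde L_h}{2}\|y-x\|^2$ for all $x,y\in U$.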
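The inverse relation in Remark 2.3(i) can be checked numerically on a one-dimensional kernel. The sketch below assumes the hypothetical kernel $h(x)=x^4/4+x^2/2$ (not taken from the paper), for which $\nabla h(x)=x^3+x$ is an increasing bijection of the real line, and evaluates $\nabla h^*$ by inverting $\nabla h$ with Newton's method.

```python
import numpy as np

# Hypothetical 1-D kernel h(x) = x^4/4 + x^2/2, so that grad h(x) = x^3 + x.
dh = lambda x: x**3 + x

def grad_h_conj(y, tol=1e-13, max_iter=100):
    """Evaluate grad h*(y) by solving dh(x) = y with Newton's method.

    Starting from 0, the first step lands on the far side of the root,
    after which the iterates approach it monotonically.
    """
    x = 0.0
    for _ in range(max_iter):
        step = (dh(x) - y) / (3.0 * x**2 + 1.0)
        x -= step
        if abs(step) < tol:
            break
    return x

ys = np.linspace(-30.0, 30.0, 61)
# Largest violation of grad h(grad h*(y)) = y over the sample points.
residual = max(abs(dh(grad_h_conj(y)) - y) for y in ys)
# Round trip grad h*(grad h(x)) = x at a sample point.
round_trip = grad_h_conj(dh(1.7))
print(residual, round_trip)
```

Newton's method is only one possible choice here; any scalar root finder would serve, since the derivative of $\nabla h$ is bounded below by one for this kernel.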
We will sometimes require properties such as Lipschitz differentiability or strong convexity to hold locally, where locality amounts to the existence for any point of a neighborhood in which such property holds.
Definition 2.4 (relative smoothness [9, 46]). We say that a proper and lsc function $f:\mathbb{R}^n\to\overline{\mathbb{R}}$ is smooth relative to a kernel $h:\mathbb{R}^n\to\overline{\mathbb{R}}$ if there exists $L_f\ge 0$ such that $L_fh-f$ and $L_fh+f$ are convex functions. Whenever $h$ is clear from context, we will simply say that $f$ is relatively smooth, or $L_f$-relatively smooth to make the modulus $L_f$ explicit.
Whenever there exists $\sigma_f\in\mathbb{R}$ such that the function $f-\sigma_fh$ is convex, we will say that $f$ is $\sigma_f$-hypoconvex relative to $h$. In particular, any $L_f$-relatively smooth function $f$ is also $\sigma_f$-relatively hypoconvex with $\sigma_f=-L_f$. There are however cases in which a tighter (i.e., larger) $\sigma_f$ can be considered, as is the case for convex functions $f$, for which $\sigma_f\ge 0$. As we will formally elaborate in Proposition 2.6, the modulus $L_f\ge 0$ provides some upper bounding inequalities on $f$, whereas the modulus $\sigma_f\in[-L_f,L_f]$ provides lower bounds.
Proposition 2.5. Let $f$ be smooth relative to a kernel $h$. Then, the following hold:
(i) $f\in C^1(\operatorname{int}\operatorname{dom} h)$;
(ii) if $h$ is Lipschitz differentiable on an open set $U$, then so is $f$.
Proof.
2.5(i) Convexity of $L_fh\pm f$ and continuous differentiability of $h$ on $\operatorname{int}\operatorname{dom} h$ ensure that $\operatorname{dom} f\supseteq\operatorname{int}\operatorname{dom} h$, and through [55, Ex. 8.20(b) and Cor. 9.21] that both $f$ and $-f$ are subdifferentially regular on $\operatorname{int}\operatorname{dom} h$, in the sense of [55, Def. 7.25], with $\hat\partial f$ and $\hat\partial(-f)$ both nonempty. The proof now follows from [55, Thm. 9.18(d)].
2.5(ii) Let $\tilde L_h$ be a Lipschitz modulus for $\nabla h$ on $U$. Convexity of $f-\sigma_fh$ yields
\[ \langle\nabla f(x)-\nabla f(y),x-y\rangle \ge \sigma_f\langle\nabla h(x)-\nabla h(y),x-y\rangle \ge \min\{\sigma_f,0\}\,\tilde L_h\|x-y\|^2 \]
for $x,y\in U$, while due to concavity of $f-L_fh$ it holds that
\[ \langle\nabla f(x)-\nabla f(y),x-y\rangle \le L_f\langle\nabla h(x)-\nabla h(y),x-y\rangle \le L_f\tilde L_h\|x-y\|^2. \]
The two inequalities together prove that $f$ is $\tilde L_f$-smooth and $\tilde\sigma_f$-hypoconvex (in the classical Euclidean sense) with $\tilde L_f=L_f\tilde L_h$ and $\tilde\sigma_f=\tilde L_h\min\{\sigma_f,0\}$.
3 Although [10] only states level boundedness, a trivial modification of the proof shows local uniformity too.
The proof of the following result is a simple adaptation of that of [46, Prop. 1.1].
Proposition 2.6 (characterization of relative smoothness and hypoconvexity). The following assertions are equivalent for a proper lsc function $f:\mathbb{R}^n\to\overline{\mathbb{R}}$:
(a) $f$ is $L_f$-smooth and $\sigma_f$-hypoconvex relative to $h$;
(b) $\sigma_fD_h(y,x)\le f(y)-f(x)-\langle\nabla f(x),y-x\rangle\le L_fD_h(y,x)$ for all $x,y\in\operatorname{int}\operatorname{dom} h$;
(c) $\sigma_f\langle\nabla h(x)-\nabla h(y),x-y\rangle\le\langle\nabla f(x)-\nabla f(y),x-y\rangle\le L_f\langle\nabla h(x)-\nabla h(y),x-y\rangle$ for all $x,y\in\operatorname{int}\operatorname{dom} h$;
(d) $\sigma_f\nabla^2h\preceq\nabla^2f\preceq L_f\nabla^2h$ on $\operatorname{int}\operatorname{dom} h$, provided that $f,h\in C^2(\operatorname{int}\operatorname{dom} h)$.
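The two-sided bound in Proposition 2.6(b) can be sampled numerically. The sketch below assumes a hypothetical one-dimensional pair, $h(x)=x^4/4+x^2/2$ and $f(x)=x^4/4-x^2/2$ (neither is taken from the paper), for which $h-f=x^2$ and $h+f=x^4/2$ are convex, so that $L_f=1$ and $\sigma_f=-1$ are admissible moduli.

```python
import numpy as np

# Hypothetical 1-D example: h - f = x^2 and h + f = x^4/2 are convex,
# hence f is 1-smooth and (-1)-hypoconvex relative to h.
h = lambda x: 0.25 * x**4 + 0.5 * x**2
dh = lambda x: x**3 + x
f = lambda x: 0.25 * x**4 - 0.5 * x**2
df = lambda x: x**3 - x
L_f, sigma_f = 1.0, -1.0

def bregman(phi, dphi, y, x):
    """First-order remainder phi(y) - phi(x) - phi'(x) * (y - x)."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

rng = np.random.default_rng(0)
xs = rng.uniform(-3.0, 3.0, 1000)
ys = rng.uniform(-3.0, 3.0, 1000)

remainder_f = bregman(f, df, ys, xs)   # f(y) - f(x) - <grad f(x), y - x>
D_h = bregman(h, dh, ys, xs)           # D_h(y, x)

# Small tolerance to absorb floating-point rounding.
lower_ok = bool(np.all(sigma_f * D_h <= remainder_f + 1e-9))
upper_ok = bool(np.all(remainder_f <= L_f * D_h + 1e-9))
print(lower_ok, upper_ok)
```

Random sampling cannot prove the inequalities, but a violation on any sample would reveal an incorrect modulus.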
2.3. Bregman proximal mapping. Let us now recall the definition of Bregman proximal mapping and Bregman-Moreau envelope and some of their fundamental properties.
Definition 2.7 (Bregman proximal mapping and Moreau envelope). Let $g:\mathbb{R}^n\to\overline{\mathbb{R}}$ be proper and lsc and let $h:\mathbb{R}^n\to\overline{\mathbb{R}}$ be a kernel function. The Bregman proximal mapping is the set-valued mapping $\operatorname{prox}^h_{\gamma g}:\mathbb{R}^n\rightrightarrows\mathbb{R}^n$ defined as
\[ \operatorname{prox}^h_{\gamma g}(x) := \arg\min_{z\in\mathbb{R}^n}\bigl\{ g(z)+\tfrac1\gamma D_h(z,x) \bigr\}, \tag{2.2} \]
while the Bregman-Moreau envelope is the function $g^{h/\gamma}:\mathbb{R}^n\to\overline{\mathbb{R}}$ defined as
\[ g^{h/\gamma}(x) := \inf_{z\in\mathbb{R}^n}\bigl\{ g(z)+\tfrac1\gamma D_h(z,x) \bigr\}. \tag{2.3} \]
Some results of the paper will make use of an important connection relating the proximal mapping and Moreau envelope as defined here with a Bregman kernel to the corresponding objects in the Euclidean case. To ease the notation, the kernel $h$ will be omitted in the Euclidean case $h=\tfrac12\|\cdot\|^2$, and we thus write $g^{1/\gamma}$ and $\operatorname{prox}_{\gamma g}$ to indicate the $\gamma$-Moreau envelope function and $\gamma$-proximal point mapping for $h=\tfrac12\|\cdot\|^2$.
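To make (2.2) and (2.3) concrete, the following sketch approximates both objects in one dimension by brute-force minimization over a grid. The function name `bregman_prox_env`, the choice $g=\lambda|\cdot|$, the grid, and the quartic kernel are all assumptions made for illustration. In the Euclidean case the result is compared with the classical soft-thresholding formula and the Huber-type envelope value; for the quartic kernel only the first-order optimality condition $\nabla h(z)=\nabla h(x)-\gamma\lambda$ (valid when the minimizer is positive) is checked.

```python
import numpy as np

def bregman_prox_env(g, h, dh, x, gamma, grid):
    """Grid approximation of prox^h_{gamma g}(x) and g^{h/gamma}(x) in 1-D."""
    D = h(grid) - h(x) - dh(x) * (grid - x)   # D_h(z, x) for every grid point z
    values = g(grid) + D / gamma
    i = int(np.argmin(values))
    return grid[i], values[i]

lam, gamma, x = 0.7, 0.5, 1.3
g = lambda z: lam * np.abs(z)
grid = np.linspace(-5.0, 5.0, 200001)          # spacing 5e-5

# Euclidean kernel: compare with the closed-form soft-thresholding.
p_euc, env_euc = bregman_prox_env(g, lambda z: 0.5 * z**2, lambda z: z,
                                  x, gamma, grid)
soft = np.sign(x) * max(abs(x) - gamma * lam, 0.0)
huber = lam * abs(x) - 0.5 * gamma * lam**2    # envelope value when |x| > gamma*lam

# Quartic kernel h(z) = z^4/4 + z^2/2 (hypothetical choice).
dh4 = lambda z: z**3 + z
p_q, env_q = bregman_prox_env(g, lambda z: 0.25 * z**4 + 0.5 * z**2, dh4,
                              x, gamma, grid)
residual = dh4(p_q) - (dh4(x) - gamma * lam)   # optimality residual for p_q > 0

print(p_euc, soft, env_euc, huber, p_q, residual)
```

Grid search is used only because it needs no structure from $g$; it is accurate up to the grid spacing and scales poorly beyond one or two dimensions.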
Definition 2.8 (h-prox-boundedness). Given a kernel $h$, a function $g:\mathbb{R}^n\to\overline{\mathbb{R}}$ is said to be $h$-prox-bounded if there exists $\gamma>0$ such that $g^{h/\gamma}(x)>-\infty$ for some $x\in\mathbb{R}^n$. The supremum of the set of all such $\gamma$ is the threshold $\gamma^h_g$ of the $h$-prox-boundedness, i.e.,
\[ \gamma^h_g := \sup\bigl\{ \gamma>0 \mid \exists\, x\in\mathbb{R}^n \text{ s.t. } g^{h/\gamma}(x)>-\infty \bigr\}. \]
3. Bregman forward-backward mapping
We now introduce the forward-backward operator with a Bregman kernel, and analyze some properties of its fixed points. We first make a technical observation that allows to drop prox-boundedness requirements in the sequel.
Remark 3.1. Notice that Assumption I ensures that $g$ is $h$-prox-bounded with threshold $\gamma^h_g\ge 1/L_f$. In fact, for any $x\in\operatorname{int}\operatorname{dom} h$ concavity of $f-L_fh$ yields
\begin{align*}
\min\varphi \le f(z)+g(z) &= (f-L_fh)(z)+L_fh(z)+g(z)\\
&\le (f-L_fh)(x)+\langle\nabla(f-L_fh)(x),z-x\rangle+L_fh(z)+g(z)\\
&= f(x)+\langle\nabla f(x),z-x\rangle+g(z)+L_fD_h(z,x).
\end{align*}
Thus, for any $\gamma<1/L_f$ the function $g+\tfrac1\gamma D_h(\,\cdot\,,x)$ is lower bounded, owing to 1-coercivity of $h$ (which entails that of $D_h(\,\cdot\,,x)$).
The $L_f$-relative smoothness of $f$ implies through Prop. 2.6(b) that for any $\gamma\in(0,1/L_f)$ the function $\mathcal{M}^{h/\gamma}_{f,g}:\mathbb{R}^n\times\mathbb{R}^n\to\overline{\mathbb{R}}$ given by
\[ \mathcal{M}^{h/\gamma}_{f,g}(z,x) := f(x)+\langle\nabla f(x),z-x\rangle+\tfrac1\gamma D_h(z,x)+g(z) \tag{3.1} \]
provides a majorization model for $\varphi$, i.e., $\varphi(z)\le\mathcal{M}^{h/\gamma}_{f,g}(z,x)$. More specifically,
\[ \varphi(z)+\tfrac{1-\gamma L_f}{\gamma}D_h(z,x) \le \mathcal{M}^{h/\gamma}_{f,g}(z,x) \le \varphi(z)+\tfrac{1-\gamma\sigma_f}{\gamma}D_h(z,x) \quad \forall x,z. \tag{3.2} \]
The Bregman forward-backward splitting mapping is the (set-valued) majorization-minimization operator $T^{h/\gamma}_{f,g}:\mathbb{R}^n\rightrightarrows\mathbb{R}^n$ defined as
\[ T^{h/\gamma}_{f,g}(x) := \arg\min_{z\in\mathbb{R}^n}\mathcal{M}^{h/\gamma}_{f,g}(z,x), \tag{3.3} \]
and the Bregman forward-backward residual mapping is $R^{h/\gamma}_{f,g}:\mathbb{R}^n\rightrightarrows\mathbb{R}^n$ given by
\[ R^{h/\gamma}_{f,g}(x) := x-T^{h/\gamma}_{f,g}(x). \tag{3.4} \]
3.1. Stationarity, criticality, and optimality. We investigate different aspects of suboptimality and show their connection to $\varphi=f+g$, similarly to what has been done in [66] for the Euclidean case.
Definition 3.2. Relative to problem (1.1), we say that a point $x^\star\in\operatorname{dom}\varphi$ is
(i) stationary if $0\in\hat\partial\varphi(x^\star)$;
(ii) critical if it is $\gamma$-critical for some $\gamma>0$, namely if $x^\star\in T^{h/\gamma}_{f,g}(x^\star)$;
(iii) optimal if $x^\star\in\arg\min\varphi$, i.e., $x^\star$ is a solution of (1.1).
As we will see in Proposition 3.5, criticality is a halfway property between stationarity and optimality. In fact, the higher the value of $\gamma$, the more restrictive the property of $\gamma$-criticality. As a measure of this suboptimality, we thus introduce the criticality threshold.
Definition 3.3. Relative to problem (1.1), the $h$-criticality threshold is the function $\Gamma^h_{f,g}:\mathbb{R}^n\to[0,\gamma^h_g]$ defined by
\[ \Gamma^h_{f,g}(x) := \sup\Bigl( \bigl\{ \gamma>0 \mid x\in T^{h/\gamma}_{f,g}(x) \bigr\}\cup\{0\} \Bigr). \]
Note that for $\gamma>\gamma^h_g$ the mapping $T^{h/\gamma}_{f,g}$ is empty-valued, hence the inclusion $\Gamma^h_{f,g}\in[0,\gamma^h_g]$. In the next two results, we show how the three notions introduced in Definition 3.2 are interrelated and identify some useful properties of critical points that will be used to derive regularity of $T^{h/\gamma}_{f,g}$ and $R^{h/\gamma}_{f,g}$. Although these are simple adaptations of [66, Thm. 3.4 and Prop. 3.5], where the Euclidean case $h=\tfrac12\|\cdot\|^2$ is considered, for self-containedness we detail the proofs.
Proposition 3.4 (critical point characterization). The following hold for a point $x^\star\in\mathbb{R}^n$:
(i) $x^\star$ is $\gamma$-critical iff $g(x)\ge g(x^\star)-\langle\nabla f(x^\star),x-x^\star\rangle-\tfrac1\gamma D_h(x,x^\star)$ for all $x\in\operatorname{int}\operatorname{dom} h$;
(ii) if $x^\star$ is critical, then it is $\gamma$-critical for all $\gamma\in(0,\Gamma^h_{f,g}(x^\star))$;
(iii) $T^{h/\gamma}_{f,g}(x^\star)=\{x^\star\}$ and $R^{h/\gamma}_{f,g}(x^\star)=\{0\}$ for any $\gamma\in(0,\Gamma^h_{f,g}(x^\star))$.
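Proof.
3.4(i) By definition, $x^\star$ is $\gamma$-critical iff $\mathcal{M}^{h/\gamma}_{f,g}(x^\star,x^\star)\le\mathcal{M}^{h/\gamma}_{f,g}(x,x^\star)$ for all $x$, i.e., iff
\[ f(x^\star)+g(x^\star) \le f(x^\star)+\langle\nabla f(x^\star),x-x^\star\rangle+g(x)+\tfrac1\gamma D_h(x,x^\star) \quad \forall x\in\mathbb{R}^n. \]
By suitably rearranging, the claim readily follows.
3.4(ii) Directly follows from assertion 3.4(i), since the right-hand side of the inequality therein is increasing in $\gamma$.
3.4(iii) Let $x^\star$ be a critical point, and let $x\in T^{h/\gamma}_{f,g}(x^\star)$ for some $\gamma<\Gamma^h_{f,g}(x^\star)$. Fix $\gamma'\in(\gamma,\Gamma^h_{f,g}(x^\star))$. From 3.4(i) and 3.4(ii), it then follows that
\[ g(x) \ge g(x^\star)+\langle-\nabla f(x^\star),x-x^\star\rangle-\tfrac{1}{\gamma'}D_h(x,x^\star). \tag{3.5} \]
Since $x,x^\star\in T^{h/\gamma}_{f,g}(x^\star)$, it holds that $\mathcal{M}^{h/\gamma}_{f,g}(x^\star,x^\star)=\mathcal{M}^{h/\gamma}_{f,g}(x,x^\star)$, i.e.,
\[ g(x^\star) = \langle\nabla f(x^\star),x-x^\star\rangle+\tfrac1\gamma D_h(x,x^\star)+g(x) \overset{(3.5)}{\ge} g(x^\star)+\Bigl(\tfrac1\gamma-\tfrac{1}{\gamma'}\Bigr)D_h(x,x^\star). \]
Since $\tfrac1\gamma-\tfrac{1}{\gamma'}>0$, necessarily $x=x^\star$.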
Proposition 3.5 (optimality, criticality, and stationarity). The following statements hold:
(i) (criticality ⇒ stationarity) $\operatorname{fix}T^{h/\gamma}_{f,g}\subseteq\operatorname{zer}\hat\partial\varphi$ for all $\gamma\in(0,\gamma^h_g)$;
(ii) (optimality ⇒ criticality) $\Gamma^h_{f,g}(x^\star)\ge 1/L_f$ for all $x^\star\in\arg\min\varphi$.
Proof.
3.5(i) Let $\gamma\in(0,\gamma^h_g)$ and $x\in\operatorname{fix}T^{h/\gamma}_{f,g}$. Since $x$ minimizes $\mathcal{M}^{h/\gamma}_{f,g}(\,\cdot\,,x)$, we have
\[ 0\in\hat\partial\bigl[\mathcal{M}^{h/\gamma}_{f,g}(\,\cdot\,,x)\bigr](x) = \nabla f(x)+\tfrac1\gamma[\nabla h(x)-\nabla h(x)]+\hat\partial g(x) = \hat\partial g(x)+\nabla f(x) = \hat\partial\varphi(x), \]
where the inclusion follows from [55, Thm. 10.1] and the equalities from [55, Thm. 8.8(c)].
3.5(ii) Fix $\gamma\in(0,1/L_f)$, $x^\star\in\arg\min\varphi$ and $z\in T^{h/\gamma}_{f,g}(x^\star)$. Then, (3.2) with $x=x^\star$ yields
\begin{align*}
\varphi(z) &\le \mathcal{M}^{h/\gamma}_{f,g}(z,x^\star)-\tfrac{1-\gamma L_f}{\gamma}D_h(z,x^\star) = \min_w\mathcal{M}^{h/\gamma}_{f,g}(w,x^\star)-\tfrac{1-\gamma L_f}{\gamma}D_h(z,x^\star)\\
&\le \mathcal{M}^{h/\gamma}_{f,g}(x^\star,x^\star)-\tfrac{1-\gamma L_f}{\gamma}D_h(z,x^\star) = \varphi(x^\star)-\tfrac{1-\gamma L_f}{\gamma}D_h(z,x^\star),
\end{align*}
where the first equality is due to the inclusion $z\in T^{h/\gamma}_{f,g}(x^\star)$. We then conclude that $z=x^\star$, for otherwise $\varphi(z)<\varphi(x^\star)$ would contradict global minimality of $x^\star$.
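The notions of this section can be observed on a toy instance by iterating the mapping (3.3), that is, plain Bregman forward-backward splitting without any linesearch (this is not the Bella algorithm). The sketch assumes the hypothetical one-dimensional data $h(x)=x^4/4+x^2/2$, $f(x)=x^4/4-x^2/2$ (so $L_f=1$), $g=0.2|\cdot|$, and approximates $T^{h/\gamma}_{f,g}$ by minimizing the model (3.1) over a grid. By (3.2), the cost cannot increase along the iterates as long as the current point belongs to the grid.

```python
import numpy as np

# Hypothetical 1-D data: f is 1-smooth relative to h; g is a scaled absolute value.
h = lambda z: 0.25 * z**4 + 0.5 * z**2
dh = lambda z: z**3 + z
f = lambda z: 0.25 * z**4 - 0.5 * z**2
df = lambda z: z**3 - z
g = lambda z: 0.2 * np.abs(z)
phi = lambda z: f(z) + g(z)

grid = np.linspace(-3.0, 3.0, 120001)   # spacing 5e-5

def T(x, gamma):
    """Grid approximation of the Bregman forward-backward mapping (3.3)."""
    D = h(grid) - h(x) - dh(x) * (grid - x)
    model = f(x) + df(x) * (grid - x) + D / gamma + g(grid)   # model (3.1)
    return grid[np.argmin(model)]

gamma = 0.9          # any stepsize in (0, 1/L_f) = (0, 1)
x = 2.5
history = [phi(x)]
for _ in range(200):
    x = T(x, gamma)
    history.append(phi(x))

# For x > 0 stationarity of phi reads x^3 - x + 0.2 = 0.
stationarity_residual = x**3 - x + 0.2
fixed_point_gap = abs(T(x, gamma) - x)
print(x, history[-1], stationarity_residual, fixed_point_gap)
```

On this instance the iterates settle near the positive global minimizer of $\varphi$, which is an (approximate) fixed point of the mapping, in line with Proposition 3.5.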
4. Properties of Bregman forward-backward envelope
In this section, we first introduce the Bregman forward-backward envelope (BFBE) and study its fundamental properties that we need in subsequent sections. Let us define the BFBE $\varphi^{h/\gamma}_{f,g}:\mathbb{R}^n\to\overline{\mathbb{R}}$ as
\[ \varphi^{h/\gamma}_{f,g}(x) := \inf_{z\in\mathbb{R}^n}\bigl\{ f(x)+\langle\nabla f(x),z-x\rangle+g(z)+\tfrac1\gamma D_h(z,x) \bigr\}, \tag{4.1} \]
namely the value function of the minimization subproblem defining the Bregman forward-backward mapping (3.3). We next give alternative expressions for $T^{h/\gamma}_{f,g}$ and $\varphi^{h/\gamma}_{f,g}$.
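Proposition 4.1 (alternative expressions of $T^{h/\gamma}_{f,g}$ and $\varphi^{h/\gamma}_{f,g}$). Suppose that Assumption I is satisfied. Then, the following hold:
(i) $T^{h/\gamma}_{f,g}(x)=\operatorname{prox}^h_{\gamma g}\bigl(\nabla h^*(\nabla h(x)-\gamma\nabla f(x))\bigr)$ for every $x\in\operatorname{int}\operatorname{dom} h$;
(ii) $\tfrac{h}{\gamma}-f$ is a Legendre kernel, $\varphi^{h/\gamma}_{f,g}=\varphi^{\frac{h}{\gamma}-f}$ and $T^{h/\gamma}_{f,g}=\operatorname{prox}^{\frac{h}{\gamma}-f}_{\varphi}$.
Proof.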
4.1(i) By expanding the Bregman distance and discarding constant terms in (3.3), one has
\[ T^{h/\gamma}_{f,g}(x) = \arg\min_z\bigl\{ g(z)+\tfrac1\gamma\bigl[h(z)-\langle\nabla h(x)-\gamma\nabla f(x),z-x\rangle\bigr] \bigr\} = \arg\min_z\bigl\{ g(z)+\tfrac1\gamma D_h(z,\bar z) \bigr\} \]
for $\bar z=\nabla h^*(\nabla h(x)-\gamma\nabla f(x))$, owing to the identity $\nabla h(x)-\gamma\nabla f(x)=\nabla h(\bar z)$ (cf. Rem. 2.3(i)).
4.1(ii) Let us first show that $\tfrac1\gamma h-f$ is a Legendre kernel. Let $\hat h:=\tfrac1\gamma h-f$ and $C:=\operatorname{int}\operatorname{dom} h$. Since $\hat h=\tfrac{1-\gamma L_f}{\gamma}h+(L_fh-f)$ and $L_fh-f$ is convex, strict convexity and 1-coercivity of $\hat h$ follow from the similar properties of $h$. We now show essential smoothness; to arrive at a contradiction, suppose that there exists a sequence $(x^k)_{k\in\mathbb{N}}\subset C$ converging to a boundary point $x^\star$ of $C$ and such that $\sup_{k\in\mathbb{N}}\|\nabla\hat h(x^k)\|<\infty$. By possibly extracting, we may assume that $\nabla h(x^k)/\|\nabla h(x^k)\|\to v$ for some unit vector $v$. For every $y\in C$, since $\nabla(L_fh-f)(x^k)=\nabla\hat h(x^k)-\tfrac{1-\gamma L_f}{\gamma}\nabla h(x^k)$, it holds that
\[ \langle\nabla(L_fh-f)(x^k)-\nabla(L_fh-f)(y),x^k-y\rangle \le c-\tfrac{1-\gamma L_f}{\gamma}\langle\nabla h(x^k)-\nabla h(y),x^k-y\rangle, \tag{4.2} \]
where $c:=\sup_k\langle\nabla\hat h(x^k)-\nabla\hat h(y),x^k-y\rangle$ is a finite quantity. Moreover,
\[ 0 \le \tfrac{1}{\|\nabla h(x^k)\|}\langle\nabla h(x^k)-\nabla h(y),x^k-y\rangle \to \langle v,x^\star-y\rangle \quad\text{as } k\to\infty, \]
and from the arbitrariness of $y\in C$ we conclude that $v\in\{u\mid\langle u,x^\star-y\rangle\ge 0\ \forall y\in C\}$. Since $C$ is open, $B(x^\star;\varepsilon)\cap C\neq\emptyset$ for any $\varepsilon>0$, and in particular there exists $y\in C$ such that $\langle v,x^\star-y\rangle>0$. Plugging this $y$ in (4.2) yields
\[ \langle\nabla(L_fh-f)(x^k)-\nabla(L_fh-f)(y),x^k-y\rangle \le c-\tfrac{1-\gamma L_f}{\gamma}\|\nabla h(x^k)\|\Bigl\langle\tfrac{\nabla h(x^k)-\nabla h(y)}{\|\nabla h(x^k)\|},x^k-y\Bigr\rangle \to -\infty, \]
contradicting convexity of $L_fh-f$. Therefore, $\tfrac1\gamma h-f$ is a Legendre kernel. Now, observe that
\begin{align*}
T^{h/\gamma}_{f,g}(x) &= \arg\min_z\bigl\{ f(x)+\langle\nabla f(x),z-x\rangle+g(z)+\tfrac1\gamma D_h(z,x) \bigr\}\\
&= \arg\min_z\bigl\{ f(z)+\tfrac1\gamma\bigl[(h-\gamma f)(z)-(h-\gamma f)(x)-\langle\nabla(h-\gamma f)(x),z-x\rangle\bigr]+g(z) \bigr\}\\
&= \arg\min_z\bigl\{ \varphi(z)+\tfrac1\gamma D_{h-\gamma f}(z,x) \bigr\},
\end{align*}
hence the identity $T^{h/\gamma}_{f,g}=\operatorname{prox}^{\frac{h}{\gamma}-f}_{\varphi}$. The same reasoning with the infima proves also the identity $\varphi^{h/\gamma}_{f,g}=\varphi^{\frac{h}{\gamma}-f}$, completing the proof.
The next two results characterize the fundamental relationship between the Bregman forward-backward envelope $\varphi^{h/\gamma}_{f,g}$ and the original function $\varphi$, which is essential to analyze the convergence of the Bregman forward-backward scheme that will be given in Section 5.
Theorem 4.2 (relation between $\varphi$ and $\varphi^{h/\gamma}_{f,g}$). Under the conditions given in Assumption I and for $\gamma\in(0,1/L_f)$, the following statements are true:
(i) $\varphi^{h/\gamma}_{f,g}\le\varphi$;
(ii) $\tfrac{1-\gamma L_f}{\gamma}D_h(\bar x,x)\le\varphi^{h/\gamma}_{f,g}(x)-\varphi(\bar x)\le\tfrac{1-\gamma\sigma_f}{\gamma}D_h(\bar x,x)$ for all $x\in\mathbb{R}^n$ and $\bar x\in T^{h/\gamma}_{f,g}(x)$;
(iii) $\varphi^{h/\gamma}_{f,g}(x)=\varphi(x)$ iff $x\in T^{h/\gamma}_{f,g}(x)$;
(iv) $\inf\varphi^{h/\gamma}_{f,g}=\inf\varphi$ and $\arg\min\varphi^{h/\gamma}_{f,g}=\arg\min\varphi$.
Proof.
4.2(i) Follows from [36, Prop. 2.1(i)] in light of the identification of Prop. 4.1(ii).
4.2(ii) & 4.2(iii) Follow from (3.2) with $z=\bar x$ (since $\varphi^{h/\gamma}_{f,g}(x)=\mathcal{M}^{h/\gamma}_{f,g}(\bar x,x)$) and from the fact that $D_h(\bar x,x)\ge 0$ for every $x,\bar x$, with equality holding iff $x=\bar x$.
4.2(iv) It follows from 4.2(i) that $\inf\varphi^{h/\gamma}_{f,g}\le\inf\varphi$. Let $(x^k)_{k\in\mathbb{N}}$ be such that $\varphi^{h/\gamma}_{f,g}(x^k)\to\inf\varphi^{h/\gamma}_{f,g}$ as $k\to\infty$. Then, taking $\bar x^k\in T^{h/\gamma}_{f,g}(x^k)$, assertion 4.2(ii) ensures that $\varphi(\bar x^k)\to\inf\varphi$, hence the claimed equivalence of infima.
If $x\in\arg\min\varphi^{h/\gamma}_{f,g}$ and $\bar x\in T^{h/\gamma}_{f,g}(x)$ then
\[ \varphi(\bar x) \overset{4.2(ii)}{\le} \varphi^{h/\gamma}_{f,g}(x)-\tfrac{1-\gamma L_f}{\gamma}D_h(\bar x,x) = \inf\varphi^{h/\gamma}_{f,g}-\tfrac{1-\gamma L_f}{\gamma}D_h(\bar x,x) = \inf\varphi-\tfrac{1-\gamma L_f}{\gamma}D_h(\bar x,x), \]
hence necessarily $x=\bar x\in\arg\min\varphi$. Similarly, for $x\in\arg\min\varphi$ one has
\[ \varphi^{h/\gamma}_{f,g}(x) \overset{4.2(i)}{\le} \varphi(x) = \inf\varphi = \inf\varphi^{h/\gamma}_{f,g}, \]
hence $x\in\arg\min\varphi^{h/\gamma}_{f,g}$.
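The statements of Theorem 4.2 can be sampled on a toy instance. The sketch below assumes hypothetical one-dimensional data ($h(x)=x^4/4+x^2/2$, $f(x)=x^4/4-x^2/2$ with $L_f=1$ and $\sigma_f=-1$, $g=0.2|\cdot|$) and approximates the BFBE (4.1) by restricting the inner minimization to a grid; the evaluation points are taken on the same grid, so that $z=x$ is always an admissible candidate.

```python
import numpy as np

# Hypothetical 1-D data; L_f = 1 and sigma_f = -1 are valid moduli for this pair.
h = lambda z: 0.25 * z**4 + 0.5 * z**2
dh = lambda z: z**3 + z
f = lambda z: 0.25 * z**4 - 0.5 * z**2
df = lambda z: z**3 - z
g = lambda z: 0.2 * np.abs(z)
L_f, sigma_f, gamma = 1.0, -1.0, 0.5

grid = np.linspace(-3.0, 3.0, 1201)
Z = grid[:, None]    # candidate minimizers z (rows)
X = grid[None, :]    # evaluation points x (columns)

D = h(Z) - h(X) - dh(X) * (Z - X)              # D_h(z, x)
M = f(X) + df(X) * (Z - X) + D / gamma + g(Z)  # model (3.1)

cols = np.arange(grid.size)
idx = M.argmin(axis=0)
env = M[idx, cols]        # grid approximation of the BFBE at each x
xbar = grid[idx]          # corresponding element of the forward-backward mapping
D_bar = D[idx, cols]      # D_h(xbar, x)
phi_vals = f(grid) + g(grid)

tol = 1e-9
ok_i = bool(np.all(env <= phi_vals + tol))                 # Theorem 4.2(i)
gap = env - (f(xbar) + g(xbar))
ok_ii = bool(np.all((1 - gamma * L_f) / gamma * D_bar <= gap + tol)
             and np.all(gap <= (1 - gamma * sigma_f) / gamma * D_bar + tol))
same_inf = abs(env.min() - phi_vals.min())                 # Theorem 4.2(iv)
print(ok_i, ok_ii, same_inf)
```

The sandwich in 4.2(ii) is inherited from (3.2), which holds for every pair of points, so the check remains valid even though the grid minimizer only approximates an element of $T^{h/\gamma}_{f,g}(x)$.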
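Proposition 4.3 (regularity of $T^{h/\gamma}_{f,g}$ and $\varphi^{h/\gamma}_{f,g}$). Suppose that Assumption I holds. Then, for every $\gamma\in(0,\gamma^h_g)$ the following statements are true:
(i) $\operatorname{dom}T^{h/\gamma}_{f,g}=\operatorname{dom}\varphi^{h/\gamma}_{f,g}=\operatorname{int}\operatorname{dom} h$;
(ii) $\operatorname{range}T^{h/\gamma}_{f,g}\subseteq\operatorname{dom}\varphi\subseteq\operatorname{int}\operatorname{dom} h$;
(iii) $T^{h/\gamma}_{f,g}$ and $R^{h/\gamma}_{f,g}$ are osc, locally bounded and compact-valued;
(iv) $\varphi^{h/\gamma}_{f,g}$ is real valued and continuous on $\operatorname{int}\operatorname{dom} h$; if additionally $h$ is $C^{1+}$ on $\operatorname{int}\operatorname{dom} h$, then $\varphi^{h/\gamma}_{f,g}$ is locally Lipschitz continuous on $\operatorname{int}\operatorname{dom} h$.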
Proof. The first inclusion of 4.3(ii) follows from Thms. 4.2(i) and 4.2(ii), and the second one from Assumption I a4. The other claims can invoke similar ones for the Bregman-Moreau envelope and Bregman proximal mapping, owing to the identification of Prop. 4.1(ii):
4.3(i) See [36, Thm. 2.2(ii)].
4.3(iv) Follows from [36, Cor. 2.2 and Thm. 2.3] in light of Prop. 2.5(ii).
4.3(iii) Compact-valuedness and osc are shown in [36, Thm. 2.2(i)]. Further, the continuity of $\varphi^{h/\gamma}_{f,g}$ entails its local upper boundedness; as such, local boundedness of $T^{h/\gamma}_{f,g}$ follows from [55, Ex. 5.22], which applies as ensured by [36, Thm. 2.1].
We now show two other important properties relating the BFBE with the original cost function, namely equivalence of level boundedness and of (strong/local) minimality.
Theorem 4.4 (equivalence of level boundedness). Suppose that Assumption I is satisfied, and let $\gamma\in(0,1/L_f)$. Then, $\varphi^{h/\gamma}_{f,g}$ is level bounded iff so is $\varphi$.
Proof. It follows from 4.2(i) that if $\varphi^{h/\gamma}_{f,g}$ is level bounded then so is $\varphi$. Conversely, suppose that there exists $\alpha\in\mathbb{R}$ together with an unbounded sequence $(x^k)_{k\in\mathbb{N}}\subseteq[\varphi^{h/\gamma}_{f,g}\le\alpha]$. Then, it follows from Prop. 4.3(i) that $x^k\in\operatorname{dom}\varphi^{h/\gamma}_{f,g}=\operatorname{int}\operatorname{dom} h$ for all $k$, and in turn that for any $k$ there exists $\bar x^k\in T^{h/\gamma}_{f,g}(x^k)$ which satisfies $\varphi(\bar x^k)\le\alpha-\tfrac{1-\gamma L_f}{\gamma}D_h(\bar x^k,x^k)$, as it follows from Thm. 4.2(ii). Level boundedness of $D_h$ with respect to the second variable, locally uniformly in the first (Rem. 2.3(iv)), then ensures that $(\bar x^k)_{k\in\mathbb{N}}$ is not bounded, hence that $\varphi$ is not level bounded.
Theorem 4.5 (equivalence of local minimality). In addition to Assumption I, suppose that the following requirements are satisfied:
a1 $x^\star$ is (a critical point) such that $T^{h/\gamma}_{f,g}(x^\star)=\{x^\star\}$ for some $\gamma<1/L_f$ (as is the case when $\gamma<\Gamma^h_{f,g}(x^\star)$, cf. Prop. 3.4(iii));
a2 $h$ is $\tilde\sigma_h$-strongly convex around $x^\star$ for some $\tilde\sigma_h>0$.
Then, $x^\star$ is a (strong) local minimum for $\varphi^{h/\gamma}_{f,g}$ iff it is a (strong) local minimum for $\varphi$.
Proof. That (strong) local minimality for $\varphi^{h/\gamma}_{f,g}$ implies that for $\varphi$ follows from the fact that $\varphi^{h/\gamma}_{f,g}$ "supports" $\varphi$ at $x^\star$, namely that $\varphi^{h/\gamma}_{f,g}\le\varphi$ and $\varphi^{h/\gamma}_{f,g}(x^\star)=\varphi(x^\star)$ (Thms. 4.2(i) and 4.2(iii)).
Conversely, suppose that there exists $\mu\ge 0$ such that $\varphi(x)\ge\varphi(x^\star)+\tfrac\mu2\|x-x^\star\|^2$ for all $x$ sufficiently close to $x^\star$. Let $\delta:=\tfrac12\min\bigl\{\mu,\tfrac{\tilde\sigma_h}{2}\tfrac{1-\gamma L_f}{\gamma}\bigr\}\ge 0$, and note that $\delta=0$ iff $\mu=0$.
Thus, contrary to the claim, suppose that for all $k\in\mathbb{N}_{\ge1}$ there exists $x^k\in B(x^\star;1/k)$ such that $\varphi^{h/\gamma}_{f,g}(x^k)<\varphi^{h/\gamma}_{f,g}(x^\star)+\tfrac\delta2\|x^k-x^\star\|^2$. Let $\bar x^k\in T^{h/\gamma}_{f,g}(x^k)$; since $T^{h/\gamma}_{f,g}$ is osc and $T^{h/\gamma}_{f,g}(x^\star)=\{x^\star\}$, necessarily $\bar x^k\to x^\star$ as $k\to\infty$. We have
\begin{align*}
\varphi(\bar x^k) &\overset{4.2(ii)}{\le} \varphi^{h/\gamma}_{f,g}(x^k)-\tfrac{1-\gamma L_f}{\gamma}D_h(\bar x^k,x^k)\\
&\overset{2.3(v)}{\le} \varphi^{h/\gamma}_{f,g}(x^k)-\tfrac{\tilde\sigma_h}{2}\tfrac{1-\gamma L_f}{\gamma}\|x^k-\bar x^k\|^2\\
&< \varphi^{h/\gamma}_{f,g}(x^\star)+\tfrac\delta2\|x^k-x^\star\|^2-\tfrac{\tilde\sigma_h}{2}\tfrac{1-\gamma L_f}{\gamma}\|x^k-\bar x^k\|^2\\
&= \varphi(x^\star)+\tfrac\delta2\|x^k-x^\star\|^2-\tfrac{\tilde\sigma_h}{2}\tfrac{1-\gamma L_f}{\gamma}\|x^k-\bar x^k\|^2.
\end{align*}
By using the inequality $\tfrac12\|a-c\|^2\le\|a-b\|^2+\|b-c\|^2$ holding for any $a,b,c\in\mathbb{R}^n$, we have
\[ \varphi(\bar x^k) < \varphi(x^\star)+\delta\|\bar x^k-x^\star\|^2+\Bigl(\delta-\tfrac{\tilde\sigma_h}{2}\tfrac{1-\gamma L_f}{\gamma}\Bigr)\|x^k-\bar x^k\|^2 \le \varphi(x^\star)+\tfrac\mu2\|\bar x^k-x^\star\|^2, \]
where the last inequality follows from the definition of $\delta$. Thus, $\varphi(\bar x^k)<\varphi(x^\star)+\tfrac\mu2\|\bar x^k-x^\star\|^2$ for all $k\in\mathbb{N}_{\ge1}$, which contradicts $\mu$-strong local minimality of $x^\star$ for $\varphi$ (since $\bar x^k\to x^\star$).
It was first observed in [42] that the Euclidean forward-backward envelope can be interpreted as a Bregman-Moreau envelope. The following theorem furnishes a local converse relation, namely that when $h$ is locally strongly convex and locally Lipschitz differentiable the Bregman FBE, hence in particular the Bregman-Moreau envelope, can locally be identified with a Euclidean FBE. This equivalence is a key ingredient for analyzing the local properties of the BFBE (in particular of the Bregman-Moreau envelope) close to critical points under the prox-regularity condition using the existing analysis of the FBE; cf. [53, 64]. To do so, we first state a technical lemma, whose proof is an immediate consequence of the osc and local boundedness of $T^{h/\gamma}_{f,g}$ together with the inclusion $\operatorname{range}T^{h/\gamma}_{f,g}\subseteq\operatorname{int}\operatorname{dom} h$ shown in Prop. 4.3.
Lemma 4.6. Suppose that Assumption I is satisfied, and let $U\subset\operatorname{int}\operatorname{dom} h$ be a compact set. Then $T^{h/\gamma}_{f,g}(U)\subset\operatorname{int}\operatorname{dom} h$ is also compact.
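Theorem 4.7 (local equivalence of Bregman and Euclidean FBE). Suppose that Assumption I holds and let $\gamma<1/L_f$ be fixed. Suppose further that $h$ is locally Lipschitz differentiable and locally strongly convex (as is the case when $h\in C^2$ with $\nabla^2h\succ 0$ on $\operatorname{int}\operatorname{dom} h$). Then, for all $\tilde\gamma>0$ and all bounded sets $U$ such that $\operatorname{cl}(U)\subset\operatorname{int}\operatorname{dom} h$ one has
\[ \varphi^{h/\gamma}_{f,g}=\varphi^{1/\tilde\gamma}_{\tilde f,\tilde g} \quad\text{and}\quad T^{h/\gamma}_{f,g}=T^{1/\tilde\gamma}_{\tilde f,\tilde g} \quad\text{on } U, \tag{4.3} \]
where
\[ \tilde f := f-\tfrac1\gamma h+\tfrac{1}{2\tilde\gamma}\|\cdot\|^2 \quad\text{and}\quad \tilde g := g+\tfrac1\gamma h-\tfrac{1}{2\tilde\gamma}\|\cdot\|^2+\delta_B \tag{4.4} \]
for some nonempty and compact set $B\subseteq\operatorname{dom} h$. Moreover, $\tilde g$ is proper, lsc, and prox-bounded (in the Euclidean sense) with $\gamma_{\tilde g}=\infty$, and for $\tilde\gamma$ small enough $\tilde f$ is $L_{\tilde f}$-Lipschitz-differentiable on $\operatorname{dom}\tilde g$ with $\tilde\gamma<1/L_{\tilde f}$.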
Proof. It follows from Lem. 4.6 that B B T
hf ,g /
γ(U) is compact. We have ϕ
hf ,g /
γ(x) = min
z∈B
n f (x) + h∇f (x), z − xi + γ 1 D h (z, x) + g(z) o
= min
z∈B
n f (x) − γ 1 h(x) + h∇f (x) − γ 1 ∇h(x), z − xi + g(z) + γ 1 h(z) o
= min
z∈
nn ˜ f (x) + h∇ ˜f(x), z − xi + ˜g(z) + 2 ˜γ 1 kz − xk 2 o = ϕ
1f ,˜g ˜ /
˜γ(x).
Notice that ˜g is proper, lsc and with bounded domain, hence its claimed prox-boundedness.
Let now Ω B conv(U ∪ T
hf ,g /
γ(U)), and observe that cl Ω is a bounded subset of int dom h.
In fact, boundedness follows from that of U and its image under T
hf ,g /
γ, and the inclusion from convexity of int dom h. Thus, h is L h, Ω -smooth and σ h, Ω -strongly convex on Ω for some constants L h, Ω , σ h, Ω > 0. Then, from the equalities
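$$\tilde f = (f - L_f h) - \tfrac{1 - \gamma L_f}{\gamma} h + \tfrac{1}{2\tilde\gamma}\|\cdot\|^2 = (f + L_f h) - \tfrac{1 + \gamma L_f}{\gamma} h + \tfrac{1}{2\tilde\gamma}\|\cdot\|^2,$$
the convexity of $f + L_f h$, and the concavity of $f - L_f h$, it follows that
$$\Bigl[\tfrac{1}{\tilde\gamma} - \tfrac{1 + \gamma L_f}{\gamma} L_{h,\Omega}\Bigr] \|x - y\|^2 \le \langle \nabla\tilde f(x) - \nabla\tilde f(y), x - y\rangle \le \Bigl[\tfrac{1}{\tilde\gamma} - \tfrac{1 - \gamma L_f}{\gamma} \sigma_{h,\Omega}\Bigr] \|x - y\|^2$$
for every $x, y \in \Omega$. Here the lower bound uses $\langle \nabla h(x) - \nabla h(y), x - y\rangle \le L_{h,\Omega}\|x - y\|^2$ in the second expression of $\tilde f$, and the upper bound uses $\langle \nabla h(x) - \nabla h(y), x - y\rangle \ge \sigma_{h,\Omega}\|x - y\|^2$ in the first one, together with $1 - \gamma L_f > 0$. Therefore, $\tilde f$ is $L_{\tilde f}$-smooth on $B$, with
$$L_{\tilde f} = \max\Bigl\{ \Bigl|\tfrac{1}{\tilde\gamma} - \tfrac{1 + \gamma L_f}{\gamma} L_{h,\Omega}\Bigr|,\ \Bigl|\tfrac{1}{\tilde\gamma} - \tfrac{1 - \gamma L_f}{\gamma} \sigma_{h,\Omega}\Bigr| \Bigr\}.$$
Imposing $\tilde\gamma < 1/L_{\tilde f}$ yields
$$\tilde\gamma < 2\gamma \min\Bigl\{ \tfrac{1}{(1 + \gamma L_f) L_{h,\Omega}},\ \tfrac{1}{(1 - \gamma L_f) \sigma_{h,\Omega}} \Bigr\},$$
which holds for all $\tilde\gamma$ small enough, hence the claim.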
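The last claim of Theorem 4.7, namely that $\tilde\gamma < 1/L_{\tilde f}$ holds once $\tilde\gamma$ is small enough, can be observed numerically. The sketch below is only an illustration on an assumed one-dimensional instance that does not come from the paper: $h(x) = x^4/4 + x^2/2$, $f(x) = x^4/4$ (so that $L_f = 1$ is admissible), $\gamma = 0.5$, and $\Omega = [-1, 1]$. In one dimension the smoothness modulus of $\tilde f$ on $\Omega$ is estimated as the largest value of $|\tilde f''|$ over a sample of $\Omega$.

```python
GAMMA = 0.5   # assumed Bregman stepsize, smaller than 1/L_f with L_f = 1


def d2f(x): return 3 * x**2          # second derivative of f(x) = x^4/4
def d2h(x): return 3 * x**2 + 1      # second derivative of h(x) = x^4/4 + x^2/2


def lipschitz_modulus(gamma_t, omega):
    """Largest |f~''(x)| over the sample, with f~ = f - h/GAMMA + |.|^2/(2*gamma_t)."""
    return max(abs(d2f(x) - d2h(x) / GAMMA + 1 / gamma_t) for x in omega)


OMEGA = [-1 + 0.01 * k for k in range(201)]   # sample of the assumed set Omega = [-1, 1]

small, large = 0.1, 0.5                        # two candidate Euclidean stepsizes
ok_small = small * lipschitz_modulus(small, OMEGA) < 1
ok_large = large * lipschitz_modulus(large, OMEGA) < 1
```

Here $\tilde f''(x) = 1/\tilde\gamma - (3x^2 + 2)$, so the requirement $\tilde\gamma L_{\tilde f} < 1$ is met for $\tilde\gamma = 0.1$ and violated for $\tilde\gamma = 0.5$ on this instance.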
4.1. First-order properties. We here discuss first-order properties of the BFBE. Let us begin with a subdifferential inclusion that extends known facts about the Bregman-Moreau envelope [36].
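Proposition 4.8. In addition to Assumption I, suppose that $f, h \in C^2(\operatorname{int}\operatorname{dom} h)$. Then, $\varphi^{h/\gamma}_{f,g}$ is strictly continuous on $\operatorname{int}\operatorname{dom} h$ and is strictly differentiable wherever it is differentiable. Moreover,
(i) $\operatorname{lip}\varphi^{h/\gamma}_{f,g}(x) = \max_{r \in R^{h/\gamma}_{f,g}(x)} \bigl\| \bigl[\tfrac{1}{\gamma}\nabla^2 h(x) - \nabla^2 f(x)\bigr] r \bigr\|$;
(ii) $\partial\varphi^{h/\gamma}_{f,g}(x) = \partial_B \varphi^{h/\gamma}_{f,g}(x) \subseteq \bigl[\tfrac{1}{\gamma}\nabla^2 h(x) - \nabla^2 f(x)\bigr] R^{h/\gamma}_{f,g}(x)$.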
Proof. As shown in Prop. 4.1(ii), $\varphi^{h/\gamma}_{f,g} = \varphi^{\hat h}$ with $\hat h := \tfrac{1}{\gamma} h - f$. We follow the pattern of the proof of [55, Ex. 10.32], and thus observe that, in light of Lem. 4.6, for every open set $O$ with $\operatorname{cl} O \subseteq \operatorname{int}\operatorname{dom} h$ there exists a compact set $Y \subseteq \operatorname{int}\operatorname{dom} h$ such that $-\varphi^{\hat h}(x) = \max_{y \in Y} \Phi(x, y)$, where $\Phi(x, y) := -\varphi(y) - D_{\hat h}(y, x)$ is continuously differentiable in $x$, its derivatives depending continuously on $(x, y)$, with $\nabla_x \Phi(x, y) = \nabla^2 \hat h(x)(y - x)$. In fact, the maxima are attained for $y \in T^{h/\gamma}_{f,g}(x)$. Function $-\varphi^{\hat h}$ is thus lower-$C^1$ in the sense of [55, Def. 10.29], so that [55, Thm. 10.31] ensures its strict continuity, the equivalence of differentiability and strict differentiability, and the relations
$$\operatorname{lip}\varphi^{\hat h}(x) = \max_{\bar x \in T^{h/\gamma}_{f,g}(x)} \bigl\| \nabla^2 \hat h(x)(x - \bar x) \bigr\| \quad\text{and}\quad \partial\varphi^{\hat h}(x) = \partial_B \varphi^{\hat h}(x) \subseteq \nabla^2 \hat h(x) \bigl[ x - T^{h/\gamma}_{f,g}(x) \bigr].$$
The proof now follows from the identities $\varphi^{h/\gamma}_{f,g} = \varphi^{\hat h}$ and $\hat h = \tfrac{1}{\gamma} h - f$.
Although strict continuity ensures almost everywhere differentiability, with mild additional assumptions the BFBE can be shown to be (Lipschitz-continuously) differentiable around critical points. Thanks to the local equivalence shown in Theorem 4.7, these requirements are the same as those ensuring similar properties in the Euclidean case. They amount to prox-regularity, a condition first proposed in [53], which we state next.
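Definition 4.9 (prox-regularity). A function $g : \mathbb{R}^n \to \overline{\mathbb{R}}$ is prox-regular at $\bar x \in \operatorname{int}\operatorname{dom} h$ for $\bar v \in \partial g(\bar x)$ if it is locally lsc at $\bar x$ and there exist $r, \varepsilon > 0$ such that
$$g(x') \ge g(x) + \langle v, x' - x\rangle - \tfrac{r}{2}\|x' - x\|^2 \tag{4.5}$$
holds for all $x, x' \in B(\bar x; \varepsilon)$ and $(x, v) \in \operatorname{gph}\partial g$ with $v \in B(\bar v; \varepsilon)$ and $g(x) \le g(\bar x) + \varepsilon$.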
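As a sanity check of inequality (4.5), the following Python sketch samples it on a simple nonconvex function. All specifics are assumptions chosen for illustration and are not taken from the paper: $g(x) = |x| - x^2$, the reference point $\bar x = 0$ with $\bar v = 1 \in \partial g(0) = [-1, 1]$, the radius $\varepsilon = 0.1$, and the finite sampling of $\operatorname{gph}\partial g$. Since $g + |\cdot|^2 = |\cdot|$ is convex, (4.5) is expected to hold with $r = 2$, whereas $r = 1$ should be too small.

```python
def g(x):
    return abs(x) - x * x            # hypothetical example; g + x^2 = |x| is convex


def subgradients(x, n=21):
    """Sample of the subdifferential of g at x."""
    if x > 0:
        return [1.0 - 2 * x]
    if x < 0:
        return [-1.0 - 2 * x]
    return [-1 + 2 * k / (n - 1) for k in range(n)]     # the interval [-1, 1] at the kink


def worst_violation(r, x_bar=0.0, v_bar=1.0, eps=0.1, n=101):
    """Largest right-hand side minus left-hand side of (4.5) over sampled admissible triples."""
    ball = [x_bar - eps + 2 * eps * k / (n - 1) for k in range(n)]
    worst = float("-inf")
    for x in ball:
        if g(x) > g(x_bar) + eps:             # localization g(x) <= g(x_bar) + eps
            continue
        for v in subgradients(x):
            if abs(v - v_bar) > eps:          # localization v in B(v_bar; eps)
                continue
            for xp in ball:
                rhs = g(x) + v * (xp - x) - r / 2 * (xp - x) ** 2
                worst = max(worst, rhs - g(xp))
    return worst


violation_r2 = worst_violation(2.0)   # expected to be (numerically) nonpositive
violation_r1 = worst_violation(1.0)   # expected to be strictly positive
```

A nonpositive `violation_r2` is consistent with $g$ being prox-regular at $0$ for $\bar v = 1$ with $r = 2$; the positive `violation_r1` only shows that this particular $r$ is too small on the sampled triples, not that prox-regularity fails.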
In order to ease the terminology, since prox-regularity will only be needed at critical points $x^\star$ for $v = -\nabla f(x^\star)$, we will introduce a slight abuse of notation and define prox-regularity of critical points as follows.
Definition 4.10 (prox-regularity of critical points). Relative to problem (1.1), we say that a critical point $x^\star$ is prox-regular if $g$ is prox-regular at $x^\star$ for $-\nabla f(x^\star)$.
The subsequent result connects prox-regularity of $g$ in (1.1) with the first-order properties of the BFBE, owing to the relation between the BFBE and the Euclidean forward-backward envelope given in Theorem 4.7. To shorten the notation, we introduce the matrix-valued mapping
$$Q^{h/\gamma}_{f}(x) := \tfrac{1}{\gamma}\nabla^2 h(x) - \nabla^2 f(x), \tag{4.6}$$
defined wherever it makes sense.
Theorem 4.11 (continuous differentiability of BFBE). Suppose that Assumption I holds and that $h \in C^2$ with $\nabla^2 h \succ 0$ on $\operatorname{int}\operatorname{dom} h$. Suppose further that $f$ is of class $C^2$ around a prox-regular critical point $x^\star$. Then, for all $\gamma \in (0, \Gamma^{h}_{f,g}(x^\star))$ there exists a neighborhood $U$ of $x^\star$ on which the following statements are true:
(i) $T^{h/\gamma}_{f,g}$ and $R^{h/\gamma}_{f,g}$ are Lipschitz continuous (hence single valued);
(ii) $\varphi^{h/\gamma}_{f,g} \in C^{1+}$ with $\nabla\varphi^{h/\gamma}_{f,g} = Q^{h/\gamma}_{f} R^{h/\gamma}_{f,g}$, where $Q^{h/\gamma}_{f}$ is as in (4.6).
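The gradient formula in Theorem 4.11(ii) can be checked against finite differences. The sketch below does so on an assumed one-dimensional instance that is not taken from the paper: $h(x) = x^4/4 + x^2/2$, $f(x) = x^4/4$, $g = |\cdot|$ (convex, hence prox-regular), and $\gamma = 0.5$, with critical point $x^\star = 0$. It assumes that this $\gamma$ is admissible for the instance and uses $R^{h/\gamma}_{f,g} = \operatorname{id} - T^{h/\gamma}_{f,g}$. For this choice of $g$ the forward-backward step solves $w \in h'(z) + \gamma\,\partial|z|$ with $w = h'(x) - \gamma f'(x)$, which is done here with a Newton iteration.

```python
GAMMA = 0.5   # assumed stepsize, smaller than 1/L_f with L_f = 1


def h(x): return x**4 / 4 + x**2 / 2
def dh(x): return x**3 + x
def d2h(x): return 3 * x**2 + 1
def f(x): return x**4 / 4
def df(x): return x**3
def d2f(x): return 3 * x**2


def inv_dh(c, iters=60):
    """Solve z^3 + z = c by Newton's method (dh is strictly increasing)."""
    z = c
    for _ in range(iters):
        z -= (z**3 + z - c) / (3 * z**2 + 1)
    return z


def T(x):
    """Bregman forward-backward step for g = |.|: solve w in dh(z) + GAMMA * d|z|."""
    w = dh(x) - GAMMA * df(x)
    if abs(w) <= GAMMA:
        return 0.0
    return inv_dh(w - GAMMA if w > 0 else w + GAMMA)


def bfbe(x):
    """Value of the Bregman forward-backward envelope at x."""
    z = T(x)
    d_h = h(z) - h(x) - dh(x) * (z - x)      # Bregman distance D_h(z, x)
    return f(x) + df(x) * (z - x) + d_h / GAMMA + abs(z)


def gradient_formula(x):
    """Q(x) * R(x), with Q as in (4.6) and R(x) = x - T(x)."""
    return (d2h(x) / GAMMA - d2f(x)) * (x - T(x))


def finite_difference(x, delta=1e-6):
    return (bfbe(x + delta) - bfbe(x - delta)) / (2 * delta)


POINTS = [-0.8, -0.1, 0.1, 0.3, 0.7]         # sample points away from the kinks of T
max_error = max(abs(gradient_formula(x) - finite_difference(x)) for x in POINTS)
```

On this instance the envelope reduces to $x^2 + \tfrac34 x^4$ near $x^\star = 0$, whose derivative $(3x^2 + 2)x$ matches $Q^{h/\gamma}_{f}(x)\,(x - 0)$, and the formula vanishes at $x^\star$ as expected for a critical point.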
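Proof. For any compact neighborhood $U \subset \operatorname{int}\operatorname{dom} h$ of $x^\star$ we may invoke Thm. 4.7 and identify $\varphi^{h/\gamma}_{f,g}$ with the Euclidean FBE $\varphi^{1/\tilde\gamma}_{\tilde f,\tilde g}$ on $U$, for some $\tilde\gamma$ small enough and with $\tilde f$ and $\tilde g$ as in (4.4). It follows from [55, Ex. 13.35] and the continuous differentiability of $f$ and $h$ that $\tilde g$ is prox-regular at $x^\star$ for $-\nabla\tilde f(x^\star)$. Up to possibly restricting $U$ so that $f$ is twice continuously differentiable there, we may thus invoke the analogous result in the Euclidean setting [66, Thm. 4.7] to infer that $T^{1/\tilde\gamma}_{\tilde f,\tilde g}$ is Lipschitz continuous around $x^\star$ and the Euclidean FBE $\varphi^{1/\tilde\gamma}_{\tilde f,\tilde g}$ is Lipschitz differentiable around $x^\star$ with
$$\nabla\varphi^{h/\gamma}_{f,g} = \nabla\varphi^{1/\tilde\gamma}_{\tilde f,\tilde g} = \tilde\gamma^{-1}\bigl[I - \tilde\gamma\nabla^2\tilde f\bigr]\bigl(\operatorname{id} - T^{1/\tilde\gamma}_{\tilde f,\tilde g}\bigr) = Q^{h/\gamma}_{f}\bigl(\operatorname{id} - T^{h/\gamma}_{f,g}\bigr), \tag{4.7}$$
where the last equality follows from the fact that
$$\tilde\gamma^{-1}\bigl[I - \tilde\gamma\nabla^2\tilde f\bigr] = \tilde\gamma^{-1} I - \nabla^2\bigl[f - \tfrac{1}{\gamma} h + \tfrac{1}{2\tilde\gamma}\|\cdot\|^2\bigr] = \tfrac{1}{\gamma}\nabla^2 h - \nabla^2 f = Q^{h/\gamma}_{f}, \tag{4.8}$$
together with the identity $T^{1/\tilde\gamma}_{\tilde f,\tilde g}$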