NONCONVEX FUNCTIONS: FURTHER PROPERTIES AND NONMONOTONE LINE-SEARCH ALGORITHMS∗

ANDREAS THEMELIS, LORENZO STELLA, AND PANOS PATRINOS†

Abstract. We propose ZeroFPR, a nonmonotone linesearch algorithm for minimizing the sum of two nonconvex functions, one of which is smooth and the other possibly nonsmooth. ZeroFPR is the first algorithm that, despite being fit for fully nonconvex problems and requiring only the black-box oracle of forward-backward splitting (FBS) — namely evaluations of the gradient of the smooth term and of the proximity operator of the nonsmooth one — achieves superlinear convergence rates under mild assumptions at the limit point when the linesearch directions satisfy a Dennis-Moré condition, and we show that this is the case for Broyden's quasi-Newton directions. Our approach is based on the forward-backward envelope (FBE), an exact and strictly continuous penalty function for the original cost. Extending previous results, we show that, despite being nonsmooth for fully nonconvex problems, the FBE still enjoys favorable first- and second-order properties which are key for the convergence results of ZeroFPR. Our theoretical results are backed up by promising numerical simulations: on large-scale problems, by computing linesearch directions using limited-memory quasi-Newton updates our algorithm greatly outperforms FBS and its accelerated variant (AFBS).

Key words. Nonsmooth optimization, nonconvex optimization, forward-backward splitting, linesearch methods, quasi-Newton methods, prox-regularity.

AMS subject classifications. 90C06, 90C25, 90C26, 90C53, 49J52, 49J53.
1. Introduction. In this paper we deal with optimization problems of the form

(1.1)    minimize_{x ∈ IR^n}  ϕ(x) ≡ f(x) + g(x)

under the following requirements, which will be assumed without further mention.

Assumption I (Basic assumption). In problem (1.1):
 (i) f ∈ C^{1,1}(IR^n) (differentiable with L_f-Lipschitz continuous gradient);
 (ii) g : IR^n → IR is proper, closed and γ_g-prox-bounded (see Section 2.1);
 (iii) a solution exists, that is, argmin ϕ ≠ ∅.
Both f and g are allowed to be nonconvex, making (1.1) prototypic for a plethora of applications spanning signal and image processing, machine learning, statistics, control and system identification. A well-known algorithm addressing (1.1) is forward-backward splitting (FBS), also known as the proximal gradient method. FBS has been thoroughly analyzed under the assumption of g being convex. If moreover f is convex, then FBS is known to converge globally with rate O(1/k) in terms of objective value, where k is the iteration count. In this case, accelerated variants of FBS, also known as fast forward-backward splitting (FFBS), can be derived thanks to the work of Nesterov [9, 36]; these require only minimal additional computations per iteration but achieve the provably optimal global convergence rate of order o(1/k²) [6].
The work in [41] pioneered an alternative acceleration technique. The method is based on an exact, real-valued penalty function for the original problem (1.1), namely the forward-backward envelope (FBE), defined as

(1.2)    ϕ_γ(x) = ϕ^{f,g}_γ(x) := inf_{z ∈ IR^n} { f(x) + ⟨∇f(x), z − x⟩ + 1/(2γ)‖z − x‖² + g(z) },

where γ > 0 is a given parameter. We will adopt the simpler notation ϕ_γ without superscript whenever f and g are clear from context.

∗This work was supported by KU Leuven internal funding StG/15/043; by the Fonds de la Recherche Scientifique – FNRS and the Fonds Wetenschappelijk Onderzoek – Vlaanderen under EOS Project no. 30468160 (SeLMA); and by FWO projects G086318N and G086518N.
†Department of Electrical Engineering (ESAT-STADIUS) & Optimization in Engineering Center (OPTEC) – KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium (andreas.themelis@esat.kuleuven.be, lorenzostella@gmail.com, panos.patrinos@esat.kuleuven.be).
The name forward-backward envelope comes from the fact that ϕ_γ(x) is the value of the minimization problem that defines the forward-backward step, and alludes to the kinship that it has with the Moreau envelope. These claims will be addressed in more detail in Section 4. When f is sufficiently smooth and both f and g are convex, the FBE was shown to be continuously differentiable and amenable to minimization with generalized Newton methods. More recently, [50] proposed a linesearch algorithm based on (L-)BFGS quasi-Newton directions for minimizing the FBE. The curvature information exploited by Newton-like methods acts as an online preconditioner, enabling superlinear rates of convergence under some assumptions. However, unlike plain (F)FBS schemes, such methods require access to second-order information of the smooth term f (needed for the evaluation of ∇ϕ_γ), and are well defined only as long as the nonsmooth term g is convex. On the contrary, FBS only requires first-order information on f and prox-boundedness of g, in which case all accumulation points are stationary for ϕ, i.e., they satisfy the first-order necessary conditions [5].
Contributions. In this paper we propose ZeroFPR, a nonmonotone linesearch algorithm that, to the best of our knowledge, is the first that (1) addresses the same range of problems as FBS, (2) requires the same black-box oracle as FBS (gradient of one function and proximity operator of the other), (3) yet achieves superlinear rates if some assumptions (only) at the limit point are met. Though related to the minFBE algorithm [50], ZeroFPR is conceptually different, mainly because it is gradient-free, in the sense that it does not require the gradient of the FBE. Moreover,
• We provide the necessary theoretical background linking the concepts of stationarity of a point for problem (1.1), criticality and optimality. To the best of our knowledge, such an analysis was previously made only for the proximal point algorithm [45], for a special case of the projected gradient method [7, 8] and for difference-of-convex minimization problems [40].
• The analysis of the FBE, previously studied only in the case of f being C²(IR^n) and g convex [50], is extended to f and g as in Assumption I. In particular, we discuss properties of f and g that ensure (1) continuous differentiability of the FBE around critical points, (2) (strict) twice differentiability at critical points, and (3) equivalence of strong local minimality for the original function and the FBE.
• Exploiting the investigated properties of the FBE and of critical points, we prove that ZeroFPR with monotone linesearch converges (1) globally if ϕ_γ has the Kurdyka-Łojasiewicz property [33, 34, 27], and (2) superlinearly when quasi-Newton Broyden directions are employed, under additional requirements at the limit point.
Organization of the paper. In Section 2 we introduce some notation and list known facts about FBS. In Section 3 we define and explore notions of stationarity and criticality for the investigated problem and relate them with properties of the forward-backward operator. In Section 4 we extend the results of [50] about the fundamental properties of the FBE to the more general setting addressed in this paper; for the sake of readability, some of the proofs are deferred to Appendix A. Section 5 addresses the core contribution of the paper, ZeroFPR; although arbitrary directions can be chosen, we specialize the results on superlinear convergence to a quasi-Newton Broyden method so as to truly maintain the same black-box oracle as FBS. Some ancillary results needed for the proofs are listed in Appendix B. Finally, Section 6 illustrates numerical results obtained with the proposed method.
2. Preliminaries.

2.1. Notation. The identity n × n matrix is denoted as I, and the extended real line as IR = IR ∪ {∞}. The open and closed balls of radius r ≥ 0 centered in x ∈ IR^n are denoted as B(x; r) and B̄(x; r), respectively. Given a set E and a sequence (x^k)_{k∈IN}, we write (x^k)_{k∈IN} ⊂ E with the obvious meaning of x^k ∈ E for all k ∈ IN. The (possibly empty) set of cluster points of (x^k)_{k∈IN} is denoted as ω((x^k)_{k∈IN}), or simply as ω(x^k) whenever the indexing is clear from context. We say that (x^k)_{k∈IN} ⊂ IR^n is summable if Σ_{k∈IN} ‖x^k‖ is finite, and square-summable if (‖x^k‖²)_{k∈IN} is summable.

A function h : IR^n → IR is level-bounded if for all α ∈ IR the level set lev_{≤α} h := {x ∈ IR^n | h(x) ≤ α} is bounded. Following the terminology of [49], we say that a function f : IR^n → IR is strictly continuous at x̄ if lim sup_{y,z→x̄, y≠z} |f(y) − f(z)| / ‖y − z‖ is finite, and strictly differentiable at x̄ if ∇f(x̄) exists and lim_{y,z→x̄, y≠z} (f(y) − f(z) − ⟨∇f(x̄), y − z⟩) / ‖y − z‖ = 0. The set of functions IR^n → IR with Lipschitz continuous gradient is denoted as C^{1,1}(IR^n), and for f ∈ C^{1,1}(IR^n) we write L_f to indicate the Lipschitz modulus of ∇f.

For a proper, closed function g : IR^n → IR, a vector v ∈ ∂g(x) is a subgradient of g at x, where the subdifferential ∂g(x) is considered in the sense of [49, Def. 8.3]:

    ∂g(x) = { v ∈ IR^n | ∃(x^k)_{k∈IN} → x, (v^k ∈ ∂̂g(x^k))_{k∈IN} → v s.t. g(x^k) → g(x) },

and ∂̂g(x) is the set of regular subgradients of g at x, namely

    ∂̂g(x) = { v ∈ IR^n | g(z) ≥ g(x) + ⟨v, z − x⟩ + o(‖z − x‖) ∀z ∈ IR^n }.

We have ∂ϕ(x) = ∇f(x) + ∂g(x) and ∂̂ϕ(x) = ∇f(x) + ∂̂g(x) [49, Ex. 8.8(c)].
Given a parameter value γ > 0, the Moreau envelope g^γ and the proximal mapping prox_{γg} are defined by

(2.1)    g^γ(x) := inf_z { g(z) + 1/(2γ)‖z − x‖² },
(2.2)    prox_{γg}(x) := argmin_z { g(z) + 1/(2γ)‖z − x‖² }.

We now summarize some properties of g^γ and prox_{γg}; the interested reader is referred to [49] for a detailed discussion. A function g : IR^n → IR is prox-bounded if there exists γ > 0 such that g + 1/(2γ)‖·‖² is bounded below on IR^n. The supremum of all such γ is the threshold γ_g of prox-boundedness for g. In particular, if g is convex or bounded below, then γ_g = ∞. In general, for any γ ∈ (0, γ_g) the proximal mapping prox_{γg} is nonempty- and compact-valued, and the Moreau envelope g^γ finite [49, Thm. 1.25].

Given a nonempty closed set S ⊆ IR^n we let δ_S : IR^n → IR denote its indicator function, namely δ_S(x) = 0 if x ∈ S and δ_S(x) = ∞ otherwise, and Π_S : IR^n ⇒ IR^n the (set-valued) projection x ↦ argmin_{z∈S} ‖z − x‖. Proximal mappings can be seen as generalized projections, due to the relation Π_S = prox_{γδ_S} for any γ > 0.
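As an illustrative aside (not part of the original text), definitions (2.1)-(2.2) can be probed by brute-force minimization over a grid. The Python sketch below does so for g = |·|, whose prox is the classical soft-thresholding operator, and for g = δ_S with S = {±1}, where prox_{γδ_S} reduces to the projection Π_S regardless of γ.

```python
import numpy as np

def prox_and_envelope(g, x, gamma, grid):
    """Brute-force prox_{γg}(x) and Moreau envelope g^γ(x) over a grid."""
    vals = g(grid) + (grid - x) ** 2 / (2 * gamma)
    i = np.argmin(vals)
    return grid[i], vals[i]  # (a minimizer, the infimum)

grid = np.linspace(-3, 3, 600001)   # step 1e-5
gamma, x = 0.5, 1.2

# g(z) = |z|: prox is soft-thresholding, prox_{γg}(x) = sign(x)·max(|x| − γ, 0)
p, env = prox_and_envelope(np.abs, x, gamma, grid)
assert abs(p - np.sign(x) * max(abs(x) - gamma, 0)) < 1e-4

# g = δ_S with S = {±1}: dom g = S, so it suffices to minimize over S itself;
# the prox is the projection Π_S, for any value of γ
S = np.array([-1.0, 1.0])
p, env = prox_and_envelope(lambda z: 0.0 * z, x, gamma, S)
assert p == 1.0                                   # Π_S(1.2) = {1}
assert abs(env - (x - 1.0) ** 2 / (2 * gamma)) < 1e-12
```

The same grid-based oracle is handy for sanity-checking the closed-form proximal mappings used later in the paper's examples.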
For a set-valued mapping T : IR^n ⇒ IR^n we let gph T = {(x, y) | y ∈ T(x)} denote its graph, zer T = {x ∈ IR^n | 0 ∈ T(x)} the set of its zeros, and fix T = {x ∈ IR^n | x ∈ T(x)} the set of its fixed points.
2.2. Forward-backward iterations. Due to the quadratic upper bound

(2.3)    f(z) ≤ f(x) + ⟨∇f(x), z − x⟩ + L_f/2 ‖z − x‖²

holding for all x, z ∈ IR^n [11, Prop. A.24], for any γ ∈ (0, 1/L_f) the function

(2.4)    ℓ^{f,g}_γ(z; x) := f(x) + ⟨∇f(x), z − x⟩ + 1/(2γ)‖z − x‖² + g(z)

furnishes a majorization model for ϕ, in the sense that
• ℓ^{f,g}_γ(z; x) ≥ ϕ(z) for all x, z ∈ IR^n, and
• ℓ^{f,g}_γ(x; x) = ϕ(x) for all x ∈ IR^n.
Given a point x ∈ IR^n, one iteration of forward-backward splitting (FBS) for problem (1.1) consists in the minimization of the majorizing function ℓ^{f,g}_γ, namely, in selecting

(2.5)    x⁺ ∈ T^{f,g}_γ(x) := argmin_z ℓ^{f,g}_γ(z; x),

where γ ∈ (0, min{γ_g, 1/L_f}) is the stepsize parameter. The (set-valued) forward-backward operator T^{f,g}_γ can be equivalently expressed as

(2.6a)    T^{f,g}_γ(x) = prox_{γg}(x − γ∇f(x)),

which motivates the bound γ < γ_g in (2.5) to ensure the existence of x⁺ for any x. We also introduce the corresponding (set-valued) forward-backward residual, namely

(2.6b)    R^{f,g}_γ(x) := 1/γ (x − T^{f,g}_γ(x)).

Whenever no ambiguity occurs, we will omit the superscripts and write simply ℓ_γ, T_γ and R_γ in place of ℓ^{f,g}_γ, T^{f,g}_γ and R^{f,g}_γ, respectively.
The inclusion (2.5) emphasizes that FBS is a majorization-minimization (MM) algorithm, a class of methods which has been thoroughly analyzed when the majorizing function is strongly convex in the first argument [14] (for ℓ_γ, this is the case when g is convex). MM algorithms are of interest whenever minimizing the surrogate function ℓ_γ(·; x) is significantly easier than directly addressing the unstructured minimization of ϕ. For FBS this translates into simplicity of the prox_{γg} and ∇f operations, cf. (2.6a). Under very mild assumptions, FBS iterations (2.5) converge to a critical point (see §3) independently of the choice of x⁺ in the set T_γ(x) [5]. The key is the following sufficient decrease property, whose proof can be found in [15, Lem. 2].

Lemma 2.1 (Sufficient decrease). For any γ ∈ (0, γ_g), x ∈ IR^n and x̄ ∈ T_γ(x) it holds that ϕ(x̄) ≤ ϕ(x) − (1 − γL_f)/(2γ) ‖x − x̄‖².
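To make the oracle concrete, the following Python sketch (ours; the lasso-type instance f(x) = ½‖Ax − b‖², g = λ‖·‖₁ and all names are illustrative, not from the paper) runs iteration (2.5) via the composition (2.6a), using the standard soft-thresholding formula for the prox of the ℓ₁-norm, and asserts the sufficient decrease of Lemma 2.1 along the way.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1

f     = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
gradf = lambda x: A.T @ (A @ x - b)
g     = lambda x: lam * np.linalg.norm(x, 1)
phi   = lambda x: f(x) + g(x)

L_f = np.linalg.norm(A.T @ A, 2)      # Lipschitz modulus of ∇f (spectral norm)
gamma = 0.9 / L_f                     # stepsize in (0, 1/L_f); here γ_g = ∞

def T(x):
    """Forward-backward step (2.6a); prox of γλ‖·‖₁ is soft-thresholding."""
    u = x - gamma * gradf(x)
    return np.sign(u) * np.maximum(np.abs(u) - gamma * lam, 0.0)

x = rng.standard_normal(50)
for _ in range(100):
    xbar = T(x)
    # sufficient decrease of Lemma 2.1 (small slack for floating-point error)
    assert phi(xbar) <= phi(x) \
        - (1 - gamma * L_f) / (2 * gamma) * np.linalg.norm(x - xbar) ** 2 + 1e-10
    x = xbar
```

Here g is convex, so T_γ is single-valued; the decrease inequality of Lemma 2.1 holds for nonconvex prox-bounded g as well.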
3. Stationary and critical points. Unless ϕ is convex, the stationarity condition 0 ∈ ∂̂ϕ(x⋆) in problem (1.1) is only necessary for the optimality of x⋆ [49, Thm. 10.1]. In this section we define different concepts of (sub)optimality and show how they are related for generic functions ϕ = f + g as in Assumption I.

Definition 3.1. We say that a point x⋆ ∈ dom ϕ is
 (i) stationary if 0 ∈ ∂̂ϕ(x⋆);
 (ii) critical if it is γ-critical for some γ ∈ (0, γ_g), i.e., if x⋆ ∈ T_γ(x⋆);
 (iii) optimal if x⋆ ∈ argmin ϕ, i.e., if it solves (1.1).

The notion of criticality was already discussed in [7, 8] under the name of L-stationarity (L plays the role of 1/γ) for the special case of g = δ_{B∩C_s}, where B is a convex set and C_s is the (nonconvex) set of vectors with at most s nonzero entries. In [40] it is defined as d-stationarity, although the analysis is limited to difference-of-convex minimization problems; more precisely, it addresses problem (1.1) for a concave piecewise smooth function f and a convex function g.

If g is convex, then γ_g = ∞ and we may talk of criticality without mention of γ: in this case, γ-criticality and stationarity are equivalent properties regardless of the value of γ. For more general functions g, instead, the value of γ plays a role in determining whether a point is γ-critical or not, which legitimizes the following definition.
Definition 3.2. The criticality threshold is the function Γ^{f,g} : IR^n → [0, γ_g] given by

(3.1)    Γ^{f,g}(x) := sup ({ γ > 0 | x ∈ T^{f,g}_γ(x) } ∪ {0})    for x ∈ IR^n.

As usual, whenever f and g are clear from the context we simply write Γ in place of Γ^{f,g}. The bound Γ ≤ γ_g is due to the fact that prox_{γg} (and consequently T_γ) is everywhere empty-valued for γ > γ_g. Considering also γ = 0 forces the set in the definition to be nonempty, ensuring in particular the lower bound Γ ≥ 0; more precisely, observe that, by definition, Γ(x) > 0 iff x is a critical point.
Example 3.3. Let us consider ϕ = f + g for f(x) = ½x² and g = δ_C, where C = {±1}. Clearly, γ_g = +∞ (as g is lower bounded), L_f = 1, and ±1 are the (only) optimal points. Since ∂̂ϕ(x) = IR for x ∈ C and ∂̂ϕ(x) is clearly empty elsewhere, all points in C are stationary. prox_{γg} is the (set-valued) projection onto C, therefore the forward-backward operator is T_γ(x) = Π_C((1 − γ)x). We have

    T_γ(−1) = {−1} if γ < 1,  {±1} if γ = 1,  {1} if γ > 1,

and

    T_γ(1) = {1} if γ < 1,  {±1} if γ = 1,  {−1} if γ > 1.

In particular, Γ(1) = Γ(−1) = 1.
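The case analysis above is straightforward to verify numerically; the following sketch (ours, purely illustrative) evaluates T_γ(x) = Π_C((1 − γ)x) at the critical point x = 1 in the three regimes of Example 3.3.

```python
import numpy as np

C = np.array([-1.0, 1.0])

def T(x, gamma):
    """T_γ(x) = Π_C((1−γ)x): the set of points of C closest to (1−γ)x."""
    u = (1 - gamma) * x
    d = np.abs(C - u)
    return set(C[d < d.min() + 1e-12])

# the three regimes of Example 3.3 at the critical point x = 1
assert T(1.0, 0.5) == {1.0}          # γ < 1 = Γ(1): 1 is a fixed point, γ-critical
assert T(1.0, 1.0) == {-1.0, 1.0}    # γ = 1: the projection becomes 2-valued
assert T(1.0, 1.5) == {-1.0}         # γ > 1: 1 is no longer a fixed point
assert T(-1.0, 0.5) == {-1.0}        # symmetrically at x = −1
```

The regime change at γ = 1 is exactly the criticality threshold Γ(±1) = 1 of Definition 3.2.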
We now list some properties of critical and optimal points which will be used to derive regularity properties of T_γ and g^γ.
Theorem 3.4 (Properties of critical points). The following properties hold:
 (i) for γ ∈ (0, γ_g), a point x⋆ is γ-critical iff
    g(x) ≥ g(x⋆) + ⟨−∇f(x⋆), x − x⋆⟩ − 1/(2γ)‖x − x⋆‖²  ∀x ∈ IR^n;
 (ii) if x⋆ is critical, then it is γ-critical for all γ ∈ (0, Γ(x⋆)); moreover, x⋆ is also Γ(x⋆)-critical provided that Γ(x⋆) < γ_g;
 (iii) T_γ(x⋆) = {x⋆} and R_γ(x⋆) = {0} for any critical point x⋆ and γ ∈ (0, Γ(x⋆)).

Proof.
♠ 3.4(i): by definition, x⋆ is γ-critical iff ℓ_γ(x⋆; x⋆) ≤ ℓ_γ(x; x⋆) for all x, i.e., iff

    f(x⋆) + g(x⋆) ≤ f(x⋆) + ⟨∇f(x⋆), x − x⋆⟩ + 1/(2γ)‖x − x⋆‖² + g(x)  ∀x ∈ IR^n.

By suitably rearranging, the claim readily follows.
♠ 3.4(ii): since x⋆ is γ-critical, due to 3.4(i) apparently it is also γ′-critical for any γ′ ∈ (0, γ]. From the definition (3.1) of the criticality threshold Γ(x⋆), it then follows that x⋆ is γ-critical for any γ ∈ (0, Γ(x⋆)). Suppose now that Γ(x⋆) < γ_g. Then, due to 3.4(i), for all γ ∈ (0, Γ(x⋆)) we have

    g(x) ≥ g(x⋆) + ⟨−∇f(x⋆), x − x⋆⟩ − 1/(2γ)‖x − x⋆‖²  ∀x ∈ IR^n.

By taking the limit as γ ↗ Γ(x⋆) we obtain that the inequality holds for Γ(x⋆) as well, proving the claim in light of the characterization 3.4(i).
♠ 3.4(iii): let x⋆ be a critical point, and let x ∈ T_γ(x⋆) for some γ < Γ(x⋆). Fix γ′ ∈ (γ, Γ(x⋆)). From 3.4(i) and 3.4(ii) it then follows that

(3.2)    g(x) ≥ g(x⋆) + ⟨−∇f(x⋆), x − x⋆⟩ − 1/(2γ′)‖x − x⋆‖².

Since x, x⋆ ∈ T_γ(x⋆), it holds that ℓ_γ(x⋆; x⋆) = ℓ_γ(x; x⋆), i.e.,

    g(x⋆) = ⟨∇f(x⋆), x − x⋆⟩ + 1/(2γ)‖x − x⋆‖² + g(x) ≥ g(x⋆) + (1/(2γ) − 1/(2γ′))‖x − x⋆‖²,

the inequality following from (3.2). Since 1/(2γ) − 1/(2γ′) > 0, necessarily x = x⋆.
The inequality in Theorem 3.4(i) can be rephrased as the fact that the vector −∇f(x̄) is a "global" proximal subgradient for g at x̄ as in [49, Def. 8.45], where "global" refers to the fact that δ can be taken +∞ in the cited definition. An interesting consequence is that the definition of criticality depends solely on ϕ and not on the considered decomposition f + g; in fact, it is only the threshold Γ that depends on it. To see this, let f̃ = f − h and g̃ = g + h for some h ∈ C^{1,1}(IR^n), and consider a point x⋆ which is γ-critical with respect to the decomposition f + g, i.e., such that x⋆ ∈ T^{f,g}_γ(x⋆). Combining Theorem 3.4(i) with the quadratic bound (2.3) for h, we obtain

    g̃(x) ≥ g̃(x⋆) − ⟨∇f̃(x⋆), x − x⋆⟩ − (1 + γL_h)/(2γ) ‖x − x⋆‖²  for all x ∈ IR^n.

Again from the characterization of Theorem 3.4(i), we deduce that x⋆ ∈ T^{f̃,g̃}_γ̃(x⋆), where γ̃ = γ/(1 + γL_h). In particular, considering h = −f we infer that a point x⋆ is critical iff x⋆ ∈ T^{0,ϕ}_γ(x⋆) = prox_{γϕ}(x⋆) for some γ > 0, which legitimizes the notion of criticality without mentioning a specific decomposition.
In the next result we show that criticality is a halfway property between stationarity and optimality. In light of these relations we shall seek "suboptimal" solutions, which we characterize as critical points.

Proposition 3.5 (Optimality, criticality, stationarity). Let γ̄ := min{γ_g, 1/L_f}.
 (i) (criticality ⇒ stationarity) fix T_γ ⊆ zer ∂̂ϕ for all γ ∈ (0, γ_g);
 (ii) (optimality ⇒ criticality) Γ(x⋆) ≥ γ̄ for all x⋆ ∈ argmin ϕ; in particular, argmin ϕ ⊆ fix T_γ for all γ ∈ (0, γ̄), and also for γ = 1/L_f if γ_g > 1/L_f.

Proof.
♠ 3.5(i): let γ ∈ (0, γ_g) and x ∈ fix T_γ. Since x minimizes g + 1/(2γ)‖· − x + γ∇f(x)‖², we have 0 ∈ ∂̂(g + 1/(2γ)‖· − x + γ∇f(x)‖²)(x) = ∂̂g(x) + ∇f(x) = ∂̂ϕ(x), where the first inclusion follows from [49, Thm. 10.1] and the equalities from [49, Ex. 8.8(c)]. This proves that x is stationary.
♠ 3.5(ii): fix γ ∈ (0, γ̄), x⋆ ∈ argmin ϕ and y ∈ T_γ(x⋆). Necessarily y = x⋆; otherwise, due to Lem. 2.1, ϕ(y) < ϕ(x⋆) would contradict minimality of ϕ(x⋆). Therefore, x⋆ is γ-critical and the claim follows from the arbitrariness of γ ∈ (0, γ̄).
As already seen in Example 3.3, the bound Γ(x⋆) ≥ min{γ_g, 1/L_f} at optimal points in Proposition 3.5(ii) is tight, and clearly the implication "optimality ⇒ criticality" cannot be reversed (consider, e.g., the point x⋆ = 0 for ϕ = cos). The next example shows that the other implication is also proper.

Example 3.6 (Stationarity ⇏ criticality). Let f(x) = ½x² and g(x) = x^{5/3}. We have γ_g = +∞, L_f = 1, and for x⋆ = 0 it holds that ∂̂ϕ(x⋆) = {∇ϕ(x⋆)} = {0}. Thus, x⋆ is stationary; however, T_γ(x⋆) = prox_{γg}(0) = −(5γ/3)³, and in particular x⋆ ∉ T_γ(x⋆) for any γ > 0, proving x⋆ to be non-critical.
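The claim of Example 3.6 can be checked numerically: since ∇f(0) = 0, we have T_γ(0) = prox_{γg}(0), which the following brute-force sketch (ours, illustrative; the real-valued power uses sign(z)|z|^{5/3} for negative arguments) compares against the closed-form minimizer −(5γ/3)³.

```python
import numpy as np

def prox_g(x, gamma, grid):
    """Brute-force prox of g(z) = z^{5/3} (real cube root: sign(z)·|z|^{5/3})."""
    g = np.sign(grid) * np.abs(grid) ** (5 / 3)
    return grid[np.argmin(g + (grid - x) ** 2 / (2 * gamma))]

grid = np.linspace(-1.0, 1.0, 2000001)   # step 1e-6
for gamma in (0.1, 0.3, 0.5):
    p = prox_g(0.0, gamma, grid)
    # grid minimizer matches −(5γ/3)³, the unique proximal point at 0
    assert abs(p - (-(5 * gamma / 3) ** 3)) < 1e-3
    # in particular 0 ∉ T_γ(0) for every γ > 0: 0 is stationary but not critical
    assert p != 0.0
```

The stationary point 0 is thus never recovered as a fixed point of the forward-backward map, in agreement with the example.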
4. Forward-backward envelope. The FBE (1.2) was introduced in [41] and further analyzed in [50, 32] in the case where g is convex. Under such an assumption the FBE was shown to be continuously differentiable, which made it possible to derive minimization algorithms based on its gradient. In the general setting addressed in this paper the FBE might fail to be (continuously) differentiable, and as such we need to resort to methods that do not need first-order information of the FBE. This task will be addressed in Section 5, where Algorithm ZeroFPR will be proposed; other than being applicable to a wider range of problems, the proposed scheme is entirely based on the same oracle as forward-backward iterations, unlike the approaches in [41, 50, 32] which instead require the computation of ∇²f. All this will be possible thanks to continuity properties of the FBE and to the behavior of ϕ_γ at critical points. We now focus on its continuity, while the other property will be addressed shortly after in Theorem 4.4.
Remark 4.1 (Alternative expressions for ϕ_γ). By expanding the square and rearranging the terms in the definition (1.2), ϕ_γ can equivalently be expressed as

    ϕ_γ(x) = inf_{z ∈ IR^n} { f(x) − γ/2 ‖∇f(x)‖² + g(z) + 1/(2γ)‖z − x + γ∇f(x)‖² }.

Comparing with (2.5), it is apparent that the set of minimizers z in the above expression coincides with T_γ(x), the forward-backward operator at x. Moreover, taking the constant term f(x) − γ/2 ‖∇f(x)‖² out of the infimum, we immediately obtain the following expression involving the Moreau envelope of g:

(4.1)    ϕ_γ(x) = f(x) − γ/2 ‖∇f(x)‖² + g^γ(x − γ∇f(x)).

Other than providing an explicit way of computing the FBE, (4.1) emphasizes how ϕ_γ inherits the regularity properties of the Moreau envelope of g. In particular, the next key property follows from the strict continuity of g^γ [49, Ex. 10.32].
Proposition 4.2 (Strict continuity of ϕ_γ). For any γ ∈ (0, γ_g), the FBE ϕ_γ is a real-valued and strictly continuous function on IR^n.
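The equivalence of the definition (1.2) and its Moreau-envelope form (4.1) is easy to sanity-check numerically. The sketch below (ours, with a hypothetical 1-D instance f = cos, g = 0.5|·|; not part of the paper) evaluates both expressions by grid minimization, and also verifies the bound ϕ_γ ≤ ϕ stated in Proposition 4.3(i) below.

```python
import numpy as np

# hypothetical 1-D instance: smooth nonconvex f with L_f = 1, nonsmooth g
f, df = lambda x: np.cos(x), lambda x: -np.sin(x)
g = lambda z: 0.5 * np.abs(z)

grid = np.linspace(-10, 10, 400001)

def fbe_def(x, gamma):
    """Definition (1.2): infimum of the linearized model over z."""
    return np.min(f(x) + df(x) * (grid - x)
                  + (grid - x) ** 2 / (2 * gamma) + g(grid))

def fbe_moreau(x, gamma):
    """Expression (4.1): f(x) − γ/2·∇f(x)² + g^γ(x − γ∇f(x))."""
    u = x - gamma * df(x)
    g_env = np.min(g(grid) + (grid - u) ** 2 / (2 * gamma))
    return f(x) - gamma / 2 * df(x) ** 2 + g_env

gamma = 0.5
for x in (-2.0, -0.3, 1.7):
    v1, v2 = fbe_def(x, gamma), fbe_moreau(x, gamma)
    assert abs(v1 - v2) < 1e-6            # (1.2) and (4.1) agree
    assert v1 <= f(x) + g(x) + 1e-9       # ϕ_γ ≤ ϕ, cf. Proposition 4.3(i)
```

The two routines minimize term-by-term identical quantities, which is exactly the completing-the-square computation of Remark 4.1.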
4.1. Connections with the Moreau envelope. For the special case f = 0, FBS iterations (2.5) reduce to the proximal point algorithm (PPA) x⁺ ∈ prox_{γϕ}(x), first introduced in [35] for convex functions ϕ and later generalized to functions with a convex majorizing surrogate ℓ^{0,ϕ}_γ(·; x) = ϕ(·) + 1/(2γ)‖· − x‖², see e.g. [26]. Similarly, the FBE reduces to the Moreau envelope: ϕ^γ = ϕ^{0,ϕ}_γ. In fact, the FBE extends the connection between the PPA and the Moreau envelope

(4.2a)    ϕ^γ(x) = min_z ℓ^{0,ϕ}_γ(z; x)   ↔   prox_{γϕ}(x) = argmin_z ℓ^{0,ϕ}_γ(z; x),

holding for f = 0 in (2.4), to majorizing functions ℓ^{f,g}_γ with arbitrary f ∈ C^{1,1}(IR^n):

(4.2b)    ϕ_γ(x) = min_z ℓ^{f,g}_γ(z; x)   ↔   T_γ(x) = argmin_z ℓ^{f,g}_γ(z; x).

In the next section we will see the fundamental qualitative similarities between the FBE and the Moreau envelope. Namely, for γ small enough both ϕ^γ and ϕ_γ are lower bounds for the original function ϕ with the same minimizers and minimum; in particular, the minimization of ϕ is equivalent to that of ϕ^γ or ϕ_γ. Similarly, the identity

    ϕ(x̄) = ϕ^γ(x) − 1/(2γ)‖x − x̄‖²  for x̄ ∈ prox_{γϕ}(x)

will be extended to the inequality

    ϕ(x̄) ≤ ϕ_γ(x) − (1 − γL_f)/(2γ)‖x − x̄‖²  for x̄ ∈ T_γ(x).
4.2. Basic properties. We now provide bounds relating ϕ_γ to the original function ϕ that extend the well-known inequalities involving the Moreau envelope.

Proposition 4.3. Let γ ∈ (0, γ_g) be fixed. Then
 (i) ϕ_γ ≤ ϕ;
 (ii) ϕ(x̄) ≤ ϕ_γ(x) − (1 − γL_f)/(2γ)‖x − x̄‖² for all x ∈ IR^n and x̄ ∈ T_γ(x).

Proof. 4.3(i) is obvious from the definition of the FBE (consider z = x in (1.2)). As to 4.3(ii), since the set of minimizers in (1.2) is T_γ(x) (cf. (4.2b)), (2.3) yields

    ϕ_γ(x) = f(x) + ⟨∇f(x), x̄ − x⟩ + g(x̄) + 1/(2γ)‖x − x̄‖²
           ≥ f(x̄) − L_f/2 ‖x̄ − x‖² + g(x̄) + 1/(2γ)‖x − x̄‖² = ϕ(x̄) + (1 − γL_f)/(2γ)‖x − x̄‖².
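Both inequalities of Proposition 4.3 can be checked numerically on a toy instance. The sketch below (ours; the 1-D instance f = cos with L_f = 1 and g = 0.5|·| is hypothetical) computes ϕ_γ via (4.1) on a grid and T_γ in closed form via soft-thresholding.

```python
import numpy as np

f, df = lambda x: np.cos(x), lambda x: -np.sin(x)   # L_f = 1
g = lambda z: 0.5 * np.abs(z)
gamma = 0.4
grid = np.linspace(-10, 10, 400001)

def fbe(x):
    """ϕ_γ via (4.1); the Moreau envelope of g evaluated on a grid."""
    u = x - gamma * df(x)
    return f(x) - gamma / 2 * df(x) ** 2 \
        + np.min(g(grid) + (grid - u) ** 2 / (2 * gamma))

def T(x):
    """Forward-backward step: the prox of 0.5γ|·| is soft-thresholding."""
    u = x - gamma * df(x)
    return np.sign(u) * max(abs(u) - 0.5 * gamma, 0.0)

for x in (-2.0, -0.3, 1.7):
    xb = T(x)
    phi_xb = f(xb) + g(xb)
    assert fbe(x) <= f(x) + g(x) + 1e-9        # Proposition 4.3(i)
    assert phi_xb <= fbe(x) \
        - (1 - gamma) / (2 * gamma) * (x - xb) ** 2 + 1e-6   # Proposition 4.3(ii)
```

The grid evaluation overestimates the true infimum only slightly, which works in favor of both asserted inequalities.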
Compared with the inequalities holding for convex g treated in [50], the lower bound in Proposition 4.3 is weaker, while the upper bound is unchanged. Regardless, an immediate consequence of the result is that the values of ϕ and ϕ_γ at critical points are the same, and minimizers and infima of the two functions coincide for γ small enough.

Theorem 4.4. The following hold:
 (i) ϕ(x) = ϕ_γ(x) for all γ ∈ (0, γ_g) and x ∈ fix T_γ;
 (ii) inf ϕ = inf ϕ_γ and argmin ϕ = argmin ϕ_γ for all γ ∈ (0, min{1/L_f, γ_g}).

The bound γ < 1/L_f in Theorem 4.4(ii) is tight even when f and g are convex, as the counterexample with f(x) = ½x² and g = δ_{IR₊} shows (see [50, Ex. 2.4] for details).
Although we will address problem (1.1) by simply exploiting the continuity of the FBE, ϕ_γ nevertheless enjoys favorable properties which are key for the efficacy of the method which will be discussed in Section 5. Firstly, observe that, due to strict continuity, ϕ_γ is almost everywhere differentiable, as follows from Rademacher's theorem. The same applies to the mapping x ↦ x − γ∇f(x), its Jacobian being

(4.3)    Q_γ(x) := I − γ∇²f(x),

which is symmetric wherever it exists [49, Cor. 13.42 and Prop. 13.34]. However, in order to show that the proposed method achieves fast convergence we need additional regularity properties, namely (strict) twice differentiability at critical points and continuous differentiability around them. The rest of the section is dedicated to this task.
4.3. Prox-regularity and first-order properties. In the favorable case in which g is convex and f ∈ C²(IR^n), the FBE enjoys global continuous differentiability [50]. In our setting, prox-regularity acts as a surrogate of convexity; the interested reader is referred to [49, §13.F] for a detailed discussion.

Definition 4.5 (Prox-regularity). Function g is said to be prox-regular at x₀ for v₀ ∈ ∂g(x₀) if there exist ρ, ε > 0 such that for all x′ ∈ B(x₀; ε) and

    (x, v) ∈ gph ∂g  s.t.  x ∈ B(x₀; ε),  v ∈ B(v₀; ε),  and  g(x) ≤ g(x₀) + ε,

it holds that g(x′) ≥ g(x) + ⟨v, x′ − x⟩ − ρ/2 ‖x′ − x‖².

Prox-regularity is a mild requirement enjoyed globally and for any subgradient by all convex functions, with ε = +∞ and ρ = 0. When g is prox-regular at x₀ for v₀, then for sufficiently small γ > 0 the Moreau envelope g^γ is continuously differentiable in a neighborhood of x₀ + γv₀ [45]. For our purposes, when needed, prox-regularity of g will be required only at critical points x⋆, and only for the subgradient −∇f(x⋆). Therefore, with a slight abuse of terminology we define prox-regularity of critical points as follows.

Definition 4.6 (Prox-regularity of critical points). We say that a critical point x⋆ is prox-regular if g is prox-regular at x⋆ for −∇f(x⋆).
Examples where a critical point fails to be prox-regular are challenging to construct; before illustrating a cumbersome such instance in Example 4.9, we first prove an important result that connects prox-regularity with first-order properties of the FBE.
Theorem 4.7 (Continuous differentiability of ϕ_γ). Suppose that f is of class C² around a prox-regular critical point x⋆. Then, for all γ ∈ (0, Γ(x⋆)) there exists a neighborhood U_{x⋆} of x⋆ on which the following properties hold:
 (i) T_γ and R_γ are strictly continuous, and in particular single-valued;
 (ii) ϕ_γ ∈ C¹ with ∇ϕ_γ = Q_γ R_γ, where Q_γ is as in (4.3).

Proof. For γ′ ∈ (γ, Γ(x⋆)), using Thm.s 3.4(i) and 3.4(iii) we obtain that

(4.4)    g(x) ≥ g(x⋆) − ⟨∇f(x⋆), x − x⋆⟩ − 1/(2γ′)‖x − x⋆‖²  ∀x ∈ IR^n.

Replacing γ′ with γ in the above expression, the inequality is strict for all x ≠ x⋆. From [45, Thm. 4.4] applied to the "tilted" function x ↦ g(x + x⋆) − g(x⋆) − ⟨∇f(x⋆), x⟩ it follows that there is a neighborhood V of x⋆ − γ∇f(x⋆) in which prox_{γg} is strictly continuous and g^γ is of class C^{1+} with ∇g^γ(x) = γ⁻¹(x − prox_{γg}(x)) for all x ∈ V. Since f is C² around x⋆ and ∇f is continuous, by possibly narrowing U_{x⋆} we may assume that f ∈ C²(U_{x⋆}) and x − γ∇f(x) ∈ V for all x ∈ U_{x⋆}. Part 4.7(ii) then follows from (4.1) and the chain rule of differentiation, and 4.7(i) from the fact that strict continuity is preserved by composition.
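Theorem 4.7(ii) gives the gradient formula ∇ϕ_γ = Q_γ R_γ; for convex g and f ∈ C² the formula is in fact known to hold globally [50]. The following sketch (ours; the 1-D instance is illustrative) checks it by finite differences at a point where prox_{γg} is differentiable, using the closed-form soft-thresholding prox of g = |·| and the Moreau-envelope expression (4.1) for ϕ_γ.

```python
import numpy as np

# 1-D sketch: f(x) = x²/2 (so ∇²f ≡ 1), g = |·| (convex, hence prox-regular)
f, df, d2f = lambda x: 0.5 * x * x, lambda x: x, lambda x: 1.0
gamma = 0.5

def prox_g(u):
    """prox of γ|·|: soft-thresholding with threshold γ."""
    return np.sign(u) * max(abs(u) - gamma, 0.0)

def fbe(x):
    """ϕ_γ via (4.1); the Moreau envelope is evaluated at its own prox."""
    u = x - gamma * df(x)
    p = prox_g(u)
    return f(x) - gamma / 2 * df(x) ** 2 + abs(p) + (p - u) ** 2 / (2 * gamma)

x = 3.0                                 # away from the prox kink: T_γ smooth here
Tx = prox_g(x - gamma * df(x))          # forward-backward step (2.6a)
R = (x - Tx) / gamma                    # forward-backward residual (2.6b)
Q = 1.0 - gamma * d2f(x)                # Q_γ(x) = I − γ∇²f(x), cf. (4.3)

h = 1e-6
fd = (fbe(x + h) - fbe(x - h)) / (2 * h)   # central finite difference
assert abs(fd - Q * R) < 1e-6              # ∇ϕ_γ(x) = Q_γ(x) R_γ(x)
```

On this instance ϕ_γ is locally quadratic around x = 3, so the central difference reproduces the gradient essentially to machine precision.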
When f = 0, Theorem 4.7 restates the known fact that if g is prox-regular at x⋆ for 0 ∈ ∂g(x⋆), then g^γ is continuously differentiable around x⋆ with ∇g^γ(x) = (1/γ)(x − prox_{γg}(x)). Notice that the bound γ < Γ(x⋆) is tight: in general, for γ = Γ(x⋆) neither continuity of T_γ nor continuous differentiability of ϕ_γ around x⋆ can be guaranteed. In fact, even when x⋆ is Γ(x⋆)-critical, T_γ might fail to be single-valued and ϕ_γ to be differentiable at x⋆, as the following counterexample shows.
Example 4.8 (Necessity of γ ≠ Γ(x⋆) in first-order properties). Consider f(x) = ½x² and g = δ_S, where S = {0, 1}. Then L_f = 1, γ_g = +∞, T_γ(x) = Π_S((1 − γ)x) and the FBE is ϕ_γ(x) = (1 − γ)/2 ‖x‖² + 1/(2γ) dist((1 − γ)x, S)². At the critical point x = 1, which satisfies Γ(1) = 1/2, g is prox-regular for any subgradient. For any γ ∈ (0, 1/2) it is easy to see that ϕ_γ is differentiable in a neighborhood of x = 1. However, for γ = 1/2 the distance function has a first-order singularity at x = 1, due to the 2-valuedness of T_γ(1) = Π_S(1/2) = {0, 1}.
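The closed-form FBE of Example 4.8 makes the claimed singularity easy to verify numerically. The sketch below (ours, purely illustrative) compares one-sided difference quotients of ϕ_γ at x = 1 for γ below and at the threshold Γ(1) = 1/2.

```python
import numpy as np

S = np.array([0.0, 1.0])

def fbe(x, gamma):
    """Closed form from Example 4.8: (1−γ)/2·x² + dist((1−γ)x, S)²/(2γ)."""
    d = np.min(np.abs(S - (1 - gamma) * x))
    return (1 - gamma) / 2 * x ** 2 + d ** 2 / (2 * gamma)

h = 1e-6
# γ < Γ(1) = 1/2: one-sided slopes at x = 1 agree (ϕ_γ differentiable there)
g1 = 0.4
l = (fbe(1.0, g1) - fbe(1.0 - h, g1)) / h
r = (fbe(1.0 + h, g1) - fbe(1.0, g1)) / h
assert abs(l - r) < 1e-4

# γ = Γ(1) = 1/2: first-order singularity, the one-sided slopes differ
g2 = 0.5
l = (fbe(1.0, g2) - fbe(1.0 - h, g2)) / h
r = (fbe(1.0 + h, g2) - fbe(1.0, g2)) / h
assert abs(l - 1.0) < 1e-3 and abs(r) < 1e-3   # slopes 1 and 0: kink at x = 1
```

The kink at γ = 1/2 is exactly the 2-valuedness of Π_S(1/2) noted in the example.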
Example 4.9 (Prox-nonregularity of critical points). Consider ϕ = f + g where f(x) = ½x², g(x) = δ_S(x) and S = {1/n | n ∈ IN_{≥1}} ∪ {0}. For x₀ = 0 we have Γ(x₀) = +∞; however, g fails to be prox-regular at x₀ for v₀ = 0 = −∇f(x₀). For any ρ > 0 and for any neighborhood V of (0, 0) in gph g it is always possible to find a point arbitrarily close to (0, −1/ρ) with multi-valued projection on V. Specifically, the midpoint P_n = (½(1/n + 1/(n+1)), −1/ρ) has 2-valued projection on gph g for any n ∈ IN_{≥1}, it being Π_{gph g}(P_n) = {1/n, 1/(n+1)}. By considering a large n, P_n can be made arbitrarily close to (0, −1/ρ) and at the same time its projection(s) arbitrarily close to (0, 0). It follows that g cannot be prox-regular at 0 for 0, for otherwise such projections would be single-valued close enough to (0, 0) [45, Cor. 3.4 and Thm. 3.5]. As a result, g^γ(x) = 1/(2γ) dist(x, S)² is not differentiable around x = 0, and indeed at each midpoint ½(1/n + 1/(n+1)) for n ∈ IN_{≥1} it has a nonsmooth spike.
To underline how unfortunate the situation depicted in Example 4.9 is, notice that adding a linear term λx to f for any λ ≠ 0, yet leaving g unchanged, restores the desired prox-regularity of each critical point. Indeed, this is trivially true for any nonzero critical point; besides, g is prox-regular at 0 for any λ ∈ (0, +∞), while for any λ < 0 the point 0 is not critical.
4.4. Second-order properties. In this section we discuss sufficient conditions for twice differentiability of the FBE at critical points. In addition to prox-regularity, which is needed for local continuous differentiability, we will also need generalized second-order properties of g. The interested reader is referred to [49, §13] for an extensive discussion on epi-differentiability.

Assumption II. With respect to a given critical point x⋆:
 (i) ∇²f exists and is (strictly) continuous around x⋆;
 (ii) g is prox-regular and (strictly) twice epi-differentiable at x⋆ for −∇f(x⋆), with its second-order epi-derivative being generalized quadratic:

(4.5)    d²g(x⋆ | −∇f(x⋆))[d] = ⟨d, Md⟩ + δ_S(d),  ∀d ∈ IR^n,

where S ⊆ IR^n is a linear subspace and M ∈ IR^{n×n}. Without loss of generality we take M symmetric, and such that Im(M) ⊆ S and ker(M) ⊇ S^⊥.¹
We say that the assumptions are "strictly" satisfied if the stronger conditions in parenthesis hold.
Twice epi-differentiability of g is a mild requirement, and cases where d²g is generalized quadratic are abundant [47, 48, 43, 44]. Moreover, prox-regular and C²-partly smooth functions g (see [29, 19]) comprise a wide class of functions that strictly satisfy Assumption II(ii) at a critical point x⋆ provided that strict complementarity holds, namely if −∇f(x⋆) ∈ relint ∂g(x⋆). In fact, it follows from [19, Thm. 28] applied to the tilted function g̃ = g + ⟨∇f(x⋆), ·⟩ (which is still C²-partly smooth and prox-regular at x⋆ [29, Cor. 4.6], [49, Ex. 13.35]) that prox_{γg̃} is continuously differentiable around x⋆ for γ small enough (in fact, for γ < Γ(x⋆)). From [42, Thm. 4.1(g)] we then obtain that g̃ is strictly twice epi-differentiable at x⋆ with generalized quadratic second-order epi-derivative, and the claim follows by tilting back to g.
We now show that the quite common properties required in Assumption II are all that is needed to ensure first-order properties of the proximal mapping and second-order properties of the FBE at critical points. The result generalizes the one in [50] by allowing nonconvex functions g. Although the proof is quite similar, we include it for the sake of self-containedness.
Theorem 4.10 (Twice differentiability of ϕ_γ). Suppose that Assumption II is (strictly) satisfied with respect to a critical point x⋆. Then, for any γ ∈ (0, Γ(x⋆)):
(i) prox_{γg} is (strictly) differentiable at x⋆ − γ∇f(x⋆) with symmetric and positive semidefinite Jacobian

(4.6) P_γ(x⋆) := J prox_{γg}(x⋆ − γ∇f(x⋆));
¹This can indeed be done without loss of generality: if M and S satisfy (4.5), then it suffices to replace M with M′ = (1/2) Π_S (M + M^⊤) Π_S to ensure the desired properties.
(ii) R_γ is (strictly) differentiable at x⋆ with Jacobian

(4.7) JR_γ(x⋆) = (1/γ)[I − P_γ(x⋆) Q_γ(x⋆)],

where Q_γ is as in (4.3) and P_γ as in (4.6);
(iii) ϕ_γ is (strictly) twice differentiable at x⋆ with symmetric Hessian

(4.8) ∇²ϕ_γ(x⋆) = Q_γ(x⋆) JR_γ(x⋆).

Proof. See Appendix A.
Again, when f ≡ 0 Theorem 4.10 covers the differentiability properties of the proximal mapping (and consequently the second-order properties of the Moreau envelope, due to the identity ∇g^γ(x) = (1/γ)(x − prox_{γg}(x))) as discussed in [42].
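The identity above can be checked numerically on a simple instance. The sketch below is our own illustration, not part of the paper: it takes g = |·| on IR, whose proximal mapping is soft-thresholding, and compares the closed-form gradient (1/γ)(x − prox_{γg}(x)) of the Moreau envelope against a central finite difference.

```python
import numpy as np

def prox_abs(x, gamma):
    # proximal mapping of g = |.|: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_env(x, gamma):
    # g^gamma(x) = min_z { |z| + (1/(2*gamma)) (x - z)^2 }, attained at z = prox
    z = prox_abs(x, gamma)
    return np.abs(z) + (x - z) ** 2 / (2.0 * gamma)

gamma, x, h = 0.5, 2.0, 1e-6
grad_identity = (x - prox_abs(x, gamma)) / gamma   # = 1.0 at this point
grad_fd = (moreau_env(x + h, gamma) - moreau_env(x - h, gamma)) / (2.0 * h)
# the two values agree up to finite-difference error
```

For x > γ the envelope equals x − γ/2, so both quantities evaluate to 1 here, consistent with the smoothing effect of the envelope despite g being nonsmooth at 0.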
We now provide a key result that links nonsingularity of the Jacobian of the forward-backward residual R_γ to strong (local) minimality for the original cost ϕ and for the FBE ϕ_γ, under the generalized second-order properties of Assumption II.

Theorem 4.11 (Conditions for strong local minimality). Suppose that Assumption II is satisfied with respect to a critical point x⋆, and let γ ∈ (0, min{Γ(x⋆), 1/L_f}). The following are equivalent:
(a) x⋆ is a strong local minimum for ϕ;
(b) x⋆ is a local minimum for ϕ and JR_γ(x⋆) is nonsingular;
(c) the (symmetric) matrix ∇²ϕ_γ(x⋆) is positive definite;
(d) x⋆ is a strong local minimum for ϕ_γ;
(e) x⋆ is a local minimum for ϕ_γ and JR_γ(x⋆) is nonsingular.

Proof. See Appendix A.
5. ZeroFPR algorithm. The first algorithmic framework exploiting the FBE for solving composite minimization problems was studied in [41], and other schemes have recently been investigated in [50, 32]. All such methods tackle the problem by looking for a (local) minimizer of the FBE, exploiting the equivalence of (local) minimality for the original function ϕ and for the FBE ϕ_γ for γ small enough. To do so, they all employ the concept of descent directions, thus requiring the gradient of the FBE to be well defined everywhere. In the more general framework addressed in this paper this basic requirement is not met, which is why we approach the problem from a different perspective. This leads to ZeroFPR, the first algorithm, to the best of our knowledge, that, despite requiring only the black-box oracle of FBS and being suited for fully nonconvex problems, achieves superlinear convergence rates.
5.1. Overview. Instead of directly addressing the minimization of ϕ or ϕ_γ, we seek solutions of the following nonlinear inclusion (generalized equation):

(5.2) find x⋆ ∈ IR^n such that 0 ∈ R_γ(x⋆).

By doing so we address the problem from the same perspective as FBS, that is, finding fixed points of the forward-backward operator T_γ or, equivalently, zeros of its residual R_γ. Although R_γ might be quite irregular when g is nonconvex, it enjoys favorable properties at the very solutions of (5.2), i.e., at γ-critical points, starting from single-valuedness, cf. Theorem 3.4(iii). If some assumptions are met, R_γ turns out to be continuous around, and even differentiable at, critical points (cf. Theorems 4.7 and 4.10), and as a consequence the inclusion problem (5.2) reduces, close to solutions, to a well-behaved system of equations as opposed to a generalized equation.

Algorithm ZeroFPR (generalized forward-backward with nonmonotone linesearch)
Require: γ ∈ (0, min{1/L_f, γ_g}), β, p_min ∈ (0, 1), σ ∈ (0, γ(1 − γL_f)/2), x^0 ∈ IR^n.
Initialize: Φ̄_0 = ϕ_γ(x^0), k = 0.
1: Select x̄^k ∈ T_γ(x^k) and set r^k = (1/γ)(x^k − x̄^k)
2: if ‖r^k‖ = 0 then stop end if
3: Select a direction d^k ∈ IR^n
4: Let τ_k ∈ {β^m | m ∈ IN} be the largest such that x^{k+1} = x̄^k + τ_k d^k satisfies

   (5.1) ϕ_γ(x^{k+1}) ≤ Φ̄_k − σ‖r^k‖²

5: Φ̄_{k+1} = (1 − p_k) Φ̄_k + p_k ϕ_γ(x^{k+1}) for some p_k ∈ [p_min, 1]
6: k ← k + 1 and go to step 1
This motivates addressing problem (5.2) with fast methods for nonlinear equations. Newton-like schemes are iterative methods that prescribe updates of the form

(5.3) x⁺ = x − H R_γ(x),

which essentially amounts to selecting H = H(x), a linear operator that ideally carries information on the geometry of R_γ around x, in the attempt to yield an optimal iterate x⁺. For instance, when R_γ is sufficiently regular, Newton's method corresponds to selecting H as the inverse of an element of the generalized Jacobian of R_γ at x, enabling fast convergence close to a solution under some assumptions. However, selecting H as in Newton's method would require information beyond the forward-backward oracle T_γ, and as such goes beyond the scope of this paper. For this reason we focus instead on quasi-Newton schemes, in which H is a linear operator recursively defined through low-rank updates that satisfy the (inverse) secant condition

(5.4) H⁺y = s, where s = x⁺ − x and y ∈ R_γ(x⁺) − R_γ(x).
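A classical way to maintain (5.4) is Broyden's rank-one scheme, written here directly on the inverse operator via the Sherman-Morrison formula. The sketch below is our own illustration (the breakdown safeguard threshold is ad hoc, not from the paper): after the update, the new operator satisfies the inverse secant condition exactly.

```python
import numpy as np

def broyden_inverse_update(H, s, y, eps=1e-12):
    """Rank-one (good Broyden) update of the inverse Jacobian estimate H,
    chosen so that the updated operator maps y to s (inverse secant)."""
    Hy = H @ y
    denom = s @ Hy
    if abs(denom) < eps:         # safeguard: skip the update on near-breakdown
        return H
    return H + np.outer(s - Hy, s @ H) / denom

# the secant condition (5.4) holds after one update:
H = np.eye(3)
s = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 1.0, -1.0])
H_new = broyden_inverse_update(H, s, y)
# H_new @ y equals s (up to rounding)
```

Plugging such an H into (5.3) gives a Newton-like update that never needs the Jacobian of R_γ itself, which is precisely why it fits the black-box oracle of FBS.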
A famous result [21] states that, under some assumptions and starting sufficiently close to a solution x⋆, updates as in (5.3) are superlinearly convergent to x⋆ iff the Dennis-Moré condition holds, namely the limit ‖(H⁻¹ − JR_γ(x⋆)) s^k‖ / ‖s^k‖ → 0; see also [22] for a thorough survey. More recently, in [23] the result was extended to generalized equations of the form f(x) + G(x) ∋ 0, where f is smooth and G possibly set-valued.
The study focuses on Josephy-Newton methods where the update x⁺ is the solution of the inner problem Bx − f(x) ∈ Bx⁺ + G(x⁺), where B = H⁻¹, which can be interpreted as a forward-backward step in the metric induced by B. In particular, differently from the proposed ZeroFPR, the method in [23] has the crucial limitation that, unless the operator B has a very particular structure, the backward step (B + G)⁻¹ may be prohibitively challenging. The same remark applies to proximal (quasi-)Newton-type methods, in which each iteration requires the computation of a scaled proximal gradient step; see [28] and the references therein.
5.1.1. Globalization strategy. Quasi-Newton schemes are extremely handy and widely used methods. However, it is well known that they are effective only when close enough to a solution, and might even diverge otherwise. To cope with this crucial downside a globalization strategy is needed; this is usually achieved by means of a linesearch over a suitable merit function ψ, along descent directions for ψ, so as to ensure sufficient decrease for small enough stepsizes. Unfortunately, the potential choice ψ(x) = (1/2)‖R_γ(x)‖² is not regular enough for a 'direction of descent' to be everywhere defined. The proposed Algorithm ZeroFPR bypasses this limitation by exploiting the favorable properties of the FBE. In Theorem 5.10 we will see that ZeroFPR achieves superlinear convergence, provided that f and g enjoy some regularity requirements at the limit point and the directions satisfy a Dennis-Moré condition. However, regardless of whether any such condition is met, the algorithm has the same convergence guarantees as FBS (cf. Thm. 5.6).
ZeroFPR globalizes the convergence of any fast local method, and requires exactly the same oracle as FBS. Conceptually, the algorithm is quite elementary; for simplicity, let us first consider the monotone case, i.e., with p_k ≡ 1 so that Φ̄_k = ϕ_γ(x^k) (cf. step 5). The following steps are executed for updating the iterate x^k:
1) first, at step 1 a nominal forward-backward call yields an element x̄^k ∈ T_γ(x^k) that decreases the value of ϕ_γ by at least γ(1 − γL_f)/2 ‖r^k‖² (Prop. 4.3(i));
2) then, at step 3 an update direction d^k at x̄^k (not at x^k!) is selected;
3) because of the sufficient decrease of ϕ_γ in the update x^k ↦ x̄^k and the continuity of ϕ_γ, at step 4 a stepsize τ_k ensuring a decrease of ϕ_γ by at least σ‖r^k‖² in the update x^k ↦ x̄^k + τ_k d^k can be found with finitely many backtrackings τ_k ← βτ_k, for any σ < γ(1 − γL_f)/2.
In order to reduce the number of backtrackings, p_k < 1 can be selected, resulting in a nonmonotone linesearch. The sufficient decrease is then enforced with respect to a parameter Φ̄_k ≥ ϕ_γ(x^k) (cf. Lem. 5.1), namely a convex combination of ϕ_γ(x^0), …, ϕ_γ(x^k). For the sake of convergence, (p_k)_{k∈IN} can be selected arbitrarily in (0, 1] as long as it is bounded away from 0, hence the role of the user-set lower bound p_min. Consequently, small values of σ and p_k concur in reducing conservatism in the linesearch by favoring larger stepsizes.
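The steps above can be sketched in a few lines. The sketch below is our own simplified illustration, not the paper's implementation: it fixes p_k ≡ p_min, uses a plain extra forward-backward point as a placeholder direction d^k (the paper's quasi-Newton directions are what yield the fast rates), evaluates ϕ_γ with the same FBS oracle calls, and adds an ad-hoc stepsize floor as a numerical safeguard.

```python
import numpy as np

def zerofpr_sketch(f, grad_f, g, prox_g, Lf, x0,
                   beta=0.5, p_min=0.1, max_iter=200, tol=1e-10):
    gamma = 0.95 / Lf
    sigma = 0.1 * gamma * (1.0 - gamma * Lf) / 2.0  # in (0, gamma*(1-gamma*Lf)/2)

    def fb(x):
        # one forward-backward call, also returning the FBE value phi_gamma(x);
        # no information beyond the FBS oracle is needed
        gx = grad_f(x)
        xbar = prox_g(x - gamma * gx, gamma)
        phi = f(x) + gx @ (xbar - x) + (xbar - x) @ (xbar - x) / (2 * gamma) + g(xbar)
        return xbar, phi

    x = np.asarray(x0, dtype=float)
    xbar, Phi_bar = fb(x)                     # Phi_bar_0 = phi_gamma(x^0)
    for _ in range(max_iter):
        r = (x - xbar) / gamma                # fixed-point residual r^k (step 1)
        if np.linalg.norm(r) <= tol:          # step 2
            break
        # placeholder direction at xbar (step 3): towards a second FB point;
        # the paper selects quasi-Newton directions here instead
        xbar2, _ = fb(xbar)
        d = xbar2 - xbar
        tau = 1.0                             # backtracking linesearch (step 4)
        x_new = xbar + tau * d
        xbar_new, phi_new = fb(x_new)
        while phi_new > Phi_bar - sigma * (r @ r) and tau > 1e-12:
            tau *= beta
            x_new = xbar + tau * d
            xbar_new, phi_new = fb(x_new)
        # nonmonotone update of the reference value (step 5), with p_k = p_min
        Phi_bar = (1 - p_min) * Phi_bar + p_min * phi_new
        x, xbar = x_new, xbar_new
    return x
```

With this placeholder direction every iteration reduces to (two) forward-backward steps, so the sketch inherits the FBS-type guarantees of the linesearch; replacing d by a direction satisfying the Dennis-Moré condition is what Theorem 5.10 requires for superlinear convergence.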
Lemma 5.1 (Nonmonotone linesearch globalization). For all k ∈ IN the iterates generated by ZeroFPR satisfy

(5.5) ϕ_γ(x̄^k) ≤ ϕ(x̄^k) ≤ ϕ_γ(x^k) ≤ Φ̄_k,

and there exists τ̄_k > 0 such that

(5.6) ϕ_γ(x̄^k + τd^k) ≤ Φ̄_k − σ‖r^k‖² ∀τ ∈ [0, τ̄_k].

In particular, the number of backtrackings at step 4 is finite.

Proof. The first two inequalities in (5.5) are due to Prop.s 4.3(i) and 4.3(ii), respectively. Moreover,

Φ̄_{k+1} = (1 − p_k) Φ̄_k + p_k ϕ_γ(x^{k+1}) ≥ (1 − p_k) ϕ_γ(x^{k+1}) + p_k ϕ_γ(x^{k+1}) = ϕ_γ(x^{k+1}),

where the inequality follows from the linesearch condition (5.1); this proves the last inequality in (5.5). As for (5.6), let k be fixed and, contrary to the claim, suppose that for all ε > 0 there exists τ_ε ∈ [0, ε] such that the point x_ε = x̄^k + τ_ε d^k satisfies ϕ_γ(x_ε) > ϕ_γ(x^k) − σ‖r^k‖². Taking the limit as ε → 0⁺, so that x_ε → x̄^k, we obtain

ϕ_γ(x̄^k) = lim_{ε→0⁺} ϕ_γ(x_ε) ≥ ϕ_γ(x^k) − σ‖r^k‖² ≥ ϕ(x̄^k) + (γ(1 − γL_f)/2 − σ)‖r^k‖² > ϕ(x̄^k),

which contradicts Prop. 4.3(i). Here, the equality follows from the continuity of ϕ_γ (Prop. 4.2), the first inequality from the property of x_ε, the second one from Prop.