NONCONVEX FUNCTIONS: FURTHER PROPERTIES AND NONMONOTONE LINE-SEARCH ALGORITHMS∗

ANDREAS THEMELIS, LORENZO STELLA, AND PANOS PATRINOS†

Abstract. We propose ZeroFPR, a nonmonotone linesearch algorithm for minimizing the sum of two nonconvex functions, one of which is smooth and the other possibly nonsmooth. ZeroFPR is the first algorithm that, despite being fit for fully nonconvex problems and requiring only the black-box oracle of forward-backward splitting (FBS) — namely evaluations of the gradient of the smooth term and of the proximity operator of the nonsmooth one — achieves superlinear convergence rates under mild assumptions at the limit point when the linesearch directions satisfy a Dennis-Moré condition, and we show that this is the case for Broyden's quasi-Newton directions. Our approach is based on the forward-backward envelope (FBE), an exact and strictly continuous penalty function for the original cost. Extending previous results, we show that, despite being nonsmooth for fully nonconvex problems, the FBE still enjoys favorable first- and second-order properties which are key for the convergence results of ZeroFPR. Our theoretical results are backed up by promising numerical simulations: on large-scale problems, by computing linesearch directions using limited-memory quasi-Newton updates our algorithm greatly outperforms FBS and its accelerated variant (AFBS).

Key words. Nonsmooth optimization, nonconvex optimization, forward-backward splitting, linesearch methods, quasi-Newton methods, prox-regularity.

AMS subject classifications. 90C06, 90C25, 90C26, 90C53, 49J52, 49J53.
1. Introduction. In this paper we deal with optimization problems of the form

(1.1)    minimize_{x ∈ IR^n}  ϕ(x) ≡ f(x) + g(x)

under the following requirements, which will be assumed without further mention.

Assumption I (Basic assumption). In problem (1.1):
 (i) f ∈ C^{1,1}(IR^n) (differentiable with L_f-Lipschitz continuous gradient);
 (ii) g : IR^n → IR is proper, closed and γ_g-prox-bounded (see Section 2.1);
 (iii) a solution exists, that is, argmin ϕ ≠ ∅.
Both f and g are allowed to be nonconvex, making (1.1) prototypic for a plethora of applications spanning signal and image processing, machine learning, statistics, control and system identification. A well-known algorithm addressing (1.1) is forward-backward splitting (FBS), also known as the proximal gradient method. FBS has been thoroughly analyzed under the assumption of g being convex. If moreover f is convex, then FBS is known to converge globally with rate O(1/k) in terms of objective value, where k is the iteration count. In this case, accelerated variants of FBS, also known as fast forward-backward splitting (FFBS), can be derived thanks to the work of Nesterov [9, 36]; these require only minimal additional computations per iteration but achieve the provably optimal global convergence rate of order o(1/k²) [6].
The work in [41] pioneered an alternative acceleration technique. The method is based on an exact, real-valued penalty function for the original problem (1.1), namely the forward-backward envelope (FBE), defined as

(1.2)    ϕ_γ(x) = ϕ^{f,g}_γ(x) := inf_{z ∈ IR^n} { f(x) + ⟨∇f(x), z − x⟩ + 1/(2γ)‖z − x‖² + g(z) },

where γ > 0 is a given parameter. We will adopt the simpler notation ϕ_γ without superscript whenever f and g are clear from context.

∗This work was supported by KU Leuven internal funding StG/15/043; by the Fonds de la Recherche Scientifique – FNRS and the Fonds Wetenschappelijk Onderzoek – Vlaanderen under EOS Project no. 30468160 (SeLMA); and by FWO projects G086318N and G086518N.
†Department of Electrical Engineering (ESAT-STADIUS) & Optimization in Engineering Center (OPTEC) – KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium (andreas.themelis@esat.kuleuven.be, lorenzostella@gmail.com, panos.patrinos@esat.kuleuven.be).
The name forward-backward envelope comes from the fact that ϕ_γ(x) is the value of the minimization problem that defines the forward-backward step, and alludes to the kinship that it has with the Moreau envelope. These claims will be addressed in more detail in Section 4. When f is sufficiently smooth and both f and g are convex, the FBE was shown to be continuously differentiable and amenable to minimization with generalized Newton methods. More recently, [50] proposed a linesearch algorithm based on (L-)BFGS quasi-Newton directions for minimizing the FBE. The curvature information exploited by Newton-like methods acts as an online preconditioner, enabling superlinear rates of convergence under some assumptions. However, unlike plain (F)FBS schemes, such methods require access to second-order information of the smooth term f (needed for the evaluation of ∇ϕ_γ), and are well defined only as long as the nonsmooth term g is convex. On the contrary, FBS only requires first-order information on f and prox-boundedness of g, in which case all accumulation points are stationary for ϕ, i.e., they satisfy the first-order necessary conditions [5].
Contributions. In this paper we propose ZeroFPR, a nonmonotone linesearch algorithm that, to the best of our knowledge, is the first that (1) addresses the same range of problems as FBS, (2) requires the same black-box oracle as FBS (gradient of one function and proximity operator of the other), (3) yet achieves superlinear rates if some assumptions (only) at the limit point are met. Though related to the minFBE algorithm [50], ZeroFPR is conceptually different, mainly because it is gradient-free, in the sense that it does not require the gradient of the FBE. Moreover,
• We provide the necessary theoretical background linking the concepts of stationarity of a point for problem (1.1), criticality and optimality. To the best of our knowledge, such an analysis was previously made only for the proximal point algorithm [45], for a special case of the projected gradient method [7, 8] and for difference-of-convex minimization problems [40].
• The analysis of the FBE, previously studied only in the case of f being C²(IR^n) and g convex [50], is extended to f and g as in Assumption I. In particular, we discuss properties of f and g that ensure (1) continuous differentiability of the FBE around critical points, (2) (strict) twice differentiability at critical points, and (3) equivalence of strong local minimality for the original function and the FBE.
• Exploiting the investigated properties of the FBE and of critical points, we prove that ZeroFPR with monotone linesearch converges (1) globally if ϕ_γ has the Kurdyka-Łojasiewicz property [33, 34, 27], and (2) superlinearly when quasi-Newton Broyden directions are employed, under additional requirements at the limit point.
Organization of the paper. In Section 2 we introduce some notation and list known facts about FBS. In Section 3 we define and explore notions of stationarity and criticality for the investigated problem and relate them with properties of the forward-backward operator. In Section 4 we extend the results of [50] about the fundamental properties of the FBE to the more general setting addressed in this paper; for the sake of readability, some of the proofs are deferred to Appendix A. Section 5 addresses the core contribution of the paper, ZeroFPR; although arbitrary directions can be chosen, we specialize the results on superlinear convergence to a quasi-Newton Broyden method so as to truly maintain the same black-box oracle as FBS. Some ancillary results needed for the proofs are listed in Appendix B. Finally, Section 6 illustrates numerical results obtained with the proposed method.
2. Preliminaries.

2.1. Notation. The identity n × n matrix is denoted as I, and the extended real line as IR = IR ∪ {∞}. The open and closed balls of radius r ≥ 0 centered in x ∈ IR^n are denoted as B(x; r) and B̄(x; r), respectively. Given a set E and a sequence (x^k)_{k∈IN}, we write (x^k)_{k∈IN} ⊂ E with the obvious meaning of x^k ∈ E for all k ∈ IN. The (possibly empty) set of cluster points of (x^k)_{k∈IN} is denoted as ω((x^k)_{k∈IN}), or simply as ω(x^k) whenever the indexing is clear from context. We say that (x^k)_{k∈IN} ⊂ IR^n is summable if Σ_{k∈IN} ‖x^k‖ is finite, and square-summable if (‖x^k‖²)_{k∈IN} is summable.

A function h : IR^n → IR is level-bounded if for all α ∈ IR the level set lev_{≤α} h := {x ∈ IR^n | h(x) ≤ α} is bounded. Following the terminology of [49], we say that a function f : IR^n → IR is strictly continuous at x̄ if lim sup_{y,z→x̄, y≠z} |f(y) − f(z)| / ‖y − z‖ is finite, and strictly differentiable at x̄ if ∇f(x̄) exists and lim_{y,z→x̄, y≠z} (f(y) − f(z) − ⟨∇f(x̄), y − z⟩) / ‖y − z‖ = 0. The set of functions IR^n → IR with Lipschitz continuous gradient is denoted as C^{1,1}(IR^n), and for f ∈ C^{1,1}(IR^n) we write L_f to indicate the Lipschitz modulus of ∇f.

For a proper, closed function g : IR^n → IR, a vector v ∈ ∂g(x) is a subgradient of g at x, where the subdifferential ∂g(x) is considered in the sense of [49, Def. 8.3]:

    ∂g(x) = { v ∈ IR^n | ∃(x^k)_{k∈IN} → x, (v^k ∈ ∂̂g(x^k))_{k∈IN} → v s.t. g(x^k) → g(x) },

and ∂̂g(x) is the set of regular subgradients of g at x, namely

    ∂̂g(x) = { v ∈ IR^n | g(z) ≥ g(x) + ⟨v, z − x⟩ + o(‖z − x‖) ∀z ∈ IR^n }.

We have ∂ϕ(x) = ∇f(x) + ∂g(x) and ∂̂ϕ(x) = ∇f(x) + ∂̂g(x) [49, Ex. 8.8(c)].
Given a parameter value γ > 0, the Moreau envelope g^γ and the proximal mapping prox_{γg} are defined by

(2.1)    g^γ(x) := inf_z { g(z) + 1/(2γ)‖z − x‖² },
(2.2)    prox_{γg}(x) := argmin_z { g(z) + 1/(2γ)‖z − x‖² }.

We now summarize some properties of g^γ and prox_{γg}; the interested reader is referred to [49] for a detailed discussion. A function g : IR^n → IR is prox-bounded if there exists γ > 0 such that g + 1/(2γ)‖·‖² is bounded below on IR^n. The supremum of all such γ is the threshold γ_g of prox-boundedness for g. In particular, if g is convex or bounded below, then γ_g = ∞. In general, for any γ ∈ (0, γ_g) the proximal mapping prox_{γg} is nonempty- and compact-valued, and the Moreau envelope g^γ finite [49, Thm. 1.25].

Given a nonempty closed set S ⊆ IR^n we let δ_S : IR^n → IR denote its indicator function, namely δ_S(x) = 0 if x ∈ S and δ_S(x) = ∞ otherwise, and Π_S : IR^n ⇒ IR^n the (set-valued) projection x ↦ argmin_{z∈S} ‖z − x‖. Proximal mappings can be seen as generalized projections, due to the relation Π_S = prox_{γδ_S} for any γ > 0.
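As an illustrative aside (not part of the original text), definitions (2.1)-(2.2) can be probed by brute-force minimization over a grid. The Python sketch below does so for g = |·|, whose prox is the classical soft-thresholding operator, and for g = δ_S with S = {±1}, where prox_{γδ_S} reduces to the projection Π_S regardless of γ.

```python
import numpy as np

def prox_and_envelope(g, x, gamma, grid):
    """Brute-force prox_{γg}(x) and Moreau envelope g^γ(x) over a grid."""
    vals = g(grid) + (grid - x) ** 2 / (2 * gamma)
    i = np.argmin(vals)
    return grid[i], vals[i]  # (a minimizer, the infimum)

grid = np.linspace(-3, 3, 600001)   # step 1e-5
gamma, x = 0.5, 1.2

# g(z) = |z|: prox is soft-thresholding, prox_{γg}(x) = sign(x)·max(|x| − γ, 0)
p, env = prox_and_envelope(np.abs, x, gamma, grid)
assert abs(p - np.sign(x) * max(abs(x) - gamma, 0)) < 1e-4

# g = δ_S with S = {±1}: dom g = S, so it suffices to minimize over S itself;
# the prox is the projection Π_S, for any value of γ
S = np.array([-1.0, 1.0])
p, env = prox_and_envelope(lambda z: 0.0 * z, x, gamma, S)
assert p == 1.0                                   # Π_S(1.2) = {1}
assert abs(env - (x - 1.0) ** 2 / (2 * gamma)) < 1e-12
```

The same grid-based oracle is handy for sanity-checking the closed-form proximal mappings used later in the paper's examples.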
For a set-valued mapping T : IR^n ⇒ IR^n we let gph T = {(x, y) | y ∈ T(x)} denote its graph, zer T = {x ∈ IR^n | 0 ∈ T(x)} the set of its zeros, and fix T = {x ∈ IR^n | x ∈ T(x)} the set of its fixed points.
2.2. Forward-backward iterations. Due to the quadratic upper bound

(2.3)    f(z) ≤ f(x) + ⟨∇f(x), z − x⟩ + L_f/2 ‖z − x‖²

holding for all x, z ∈ IR^n [11, Prop. A.24], for any γ ∈ (0, 1/L_f) the function

(2.4)    ℓ^{f,g}_γ(z; x) := f(x) + ⟨∇f(x), z − x⟩ + 1/(2γ)‖z − x‖² + g(z)

furnishes a majorization model for ϕ, in the sense that
• ℓ^{f,g}_γ(z; x) ≥ ϕ(z) for all x, z ∈ IR^n, and
• ℓ^{f,g}_γ(x; x) = ϕ(x) for all x ∈ IR^n.
Given a point x ∈ IR^n, one iteration of forward-backward splitting (FBS) for problem (1.1) consists in the minimization of the majorizing function ℓ^{f,g}_γ, namely, in selecting

(2.5)    x⁺ ∈ T^{f,g}_γ(x) := argmin_z ℓ^{f,g}_γ(z; x),

where γ ∈ (0, min{γ_g, 1/L_f}) is the stepsize parameter. The (set-valued) forward-backward operator T^{f,g}_γ can be equivalently expressed as

(2.6a)    T^{f,g}_γ(x) = prox_{γg}(x − γ∇f(x)),

which motivates the bound γ < γ_g in (2.5) to ensure the existence of x⁺ for any x. We also introduce the corresponding (set-valued) forward-backward residual, namely

(2.6b)    R^{f,g}_γ(x) := 1/γ (x − T^{f,g}_γ(x)).

Whenever no ambiguity occurs, we will omit the superscripts and write simply ℓ_γ, T_γ and R_γ in place of ℓ^{f,g}_γ, T^{f,g}_γ and R^{f,g}_γ, respectively.
The inclusion (2.5) emphasizes that FBS is a majorization-minimization (MM) algorithm, a class of methods which has been thoroughly analyzed when the majorizing function is strongly convex in the first argument [14] (for ℓ_γ, this is the case when g is convex). MM algorithms are of interest whenever minimizing the surrogate function ℓ_γ(·; x) is significantly easier than directly addressing the unstructured minimization of ϕ. For FBS this translates into simplicity of the prox_{γg} and ∇f operations, cf. (2.6a). Under very mild assumptions, FBS iterations (2.5) converge to a critical point (see §3) independently of the choice of x⁺ in the set T_γ(x) [5]. The key is the following sufficient decrease property, whose proof can be found in [15, Lem. 2].

Lemma 2.1 (Sufficient decrease). For any γ ∈ (0, γ_g), x ∈ IR^n and x̄ ∈ T_γ(x) it holds that ϕ(x̄) ≤ ϕ(x) − (1 − γL_f)/(2γ) ‖x − x̄‖².
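To make the oracle concrete, the following Python sketch (ours; the lasso-type instance f(x) = ½‖Ax − b‖², g = λ‖·‖₁ and all names are illustrative, not from the paper) runs iteration (2.5) via the composition (2.6a), using the standard soft-thresholding formula for the prox of the ℓ₁-norm, and asserts the sufficient decrease of Lemma 2.1 along the way.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1

f     = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
gradf = lambda x: A.T @ (A @ x - b)
g     = lambda x: lam * np.linalg.norm(x, 1)
phi   = lambda x: f(x) + g(x)

L_f = np.linalg.norm(A.T @ A, 2)      # Lipschitz modulus of ∇f (spectral norm)
gamma = 0.9 / L_f                     # stepsize in (0, 1/L_f); here γ_g = ∞

def T(x):
    """Forward-backward step (2.6a); prox of γλ‖·‖₁ is soft-thresholding."""
    u = x - gamma * gradf(x)
    return np.sign(u) * np.maximum(np.abs(u) - gamma * lam, 0.0)

x = rng.standard_normal(50)
for _ in range(100):
    xbar = T(x)
    # sufficient decrease of Lemma 2.1 (small slack for floating-point error)
    assert phi(xbar) <= phi(x) \
        - (1 - gamma * L_f) / (2 * gamma) * np.linalg.norm(x - xbar) ** 2 + 1e-10
    x = xbar
```

Here g is convex, so T_γ is single-valued; the decrease inequality of Lemma 2.1 holds for nonconvex prox-bounded g as well.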
3. Stationary and critical points. Unless ϕ is convex, the stationarity condition 0 ∈ ∂̂ϕ(x⋆) in problem (1.1) is only necessary for the optimality of x⋆ [49, Thm. 10.1]. In this section we define different concepts of (sub)optimality and show how they are related for generic functions ϕ = f + g as in Assumption I.

Definition 3.1. We say that a point x⋆ ∈ dom ϕ is
 (i) stationary if 0 ∈ ∂̂ϕ(x⋆);
 (ii) critical if it is γ-critical for some γ ∈ (0, γ_g), i.e., if x⋆ ∈ T_γ(x⋆);
 (iii) optimal if x⋆ ∈ argmin ϕ, i.e., if it solves (1.1).

The notion of criticality was already discussed in [7, 8] under the name of L-stationarity (L plays the role of 1/γ) for the special case of g = δ_{B∩C_s}, where B is a convex set and C_s is the (nonconvex) set of vectors with at most s nonzero entries. In [40] it is defined as d-stationarity, although the analysis is limited to difference-of-convex minimization problems; more precisely, it addresses problem (1.1) for a concave piecewise smooth function f and a convex function g.

If g is convex, then γ_g = ∞ and we may talk of criticality without mention of γ: in this case, γ-criticality and stationarity are equivalent properties regardless of the value of γ. For more general functions g, instead, the value of γ plays a role in determining whether a point is γ-critical or not, which legitimizes the following definition.
Definition 3.2. The criticality threshold is the function Γ^{f,g} : IR^n → [0, γ_g] given by

(3.1)    Γ^{f,g}(x) := sup ({ γ > 0 | x ∈ T^{f,g}_γ(x) } ∪ {0})    for x ∈ IR^n.

As usual, whenever f and g are clear from the context we simply write Γ in place of Γ^{f,g}. The bound Γ ≤ γ_g is due to the fact that prox_{γg} (and consequently T_γ) is everywhere empty-valued for γ > γ_g. Considering also γ = 0 forces the set in the definition to be nonempty, ensuring in particular the lower bound Γ ≥ 0; more precisely, observe that, by definition, Γ(x) > 0 iff x is a critical point.
Example 3.3. Let us consider ϕ = f + g for f(x) = ½x² and g = δ_C, where C = {±1}. Clearly, γ_g = +∞ (as g is lower bounded), L_f = 1, and ±1 are the (only) optimal points. Since ∂̂ϕ(x) = IR for x ∈ C and ∂̂ϕ(x) is clearly empty elsewhere, all points in C are stationary. prox_{γg} is the (set-valued) projection onto C, therefore the forward-backward operator is T_γ(x) = Π_C((1 − γ)x). We have

    T_γ(−1) = {−1} if γ < 1,  {±1} if γ = 1,  {1} if γ > 1,

and

    T_γ(1) = {1} if γ < 1,  {±1} if γ = 1,  {−1} if γ > 1.

In particular, Γ(1) = Γ(−1) = 1.
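The case analysis above is straightforward to verify numerically; the following sketch (ours, purely illustrative) evaluates T_γ(x) = Π_C((1 − γ)x) at the critical point x = 1 in the three regimes of Example 3.3.

```python
import numpy as np

C = np.array([-1.0, 1.0])

def T(x, gamma):
    """T_γ(x) = Π_C((1−γ)x): the set of points of C closest to (1−γ)x."""
    u = (1 - gamma) * x
    d = np.abs(C - u)
    return set(C[d < d.min() + 1e-12])

# the three regimes of Example 3.3 at the critical point x = 1
assert T(1.0, 0.5) == {1.0}          # γ < 1 = Γ(1): 1 is a fixed point, γ-critical
assert T(1.0, 1.0) == {-1.0, 1.0}    # γ = 1: the projection becomes 2-valued
assert T(1.0, 1.5) == {-1.0}         # γ > 1: 1 is no longer a fixed point
assert T(-1.0, 0.5) == {-1.0}        # symmetrically at x = −1
```

The regime change at γ = 1 is exactly the criticality threshold Γ(±1) = 1 of Definition 3.2.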
We now list some properties of critical and optimal points which will be used to derive regularity properties of T_γ and g^γ.
Theorem 3.4 (Properties of critical points). The following properties hold:
 (i) for γ ∈ (0, γ_g), a point x⋆ is γ-critical iff
    g(x) ≥ g(x⋆) + ⟨−∇f(x⋆), x − x⋆⟩ − 1/(2γ)‖x − x⋆‖²  ∀x ∈ IR^n;
 (ii) if x⋆ is critical, then it is γ-critical for all γ ∈ (0, Γ(x⋆)); moreover, x⋆ is also Γ(x⋆)-critical provided that Γ(x⋆) < γ_g;
 (iii) T_γ(x⋆) = {x⋆} and R_γ(x⋆) = {0} for any critical point x⋆ and γ ∈ (0, Γ(x⋆)).

Proof.
♠ 3.4(i): by definition, x⋆ is γ-critical iff ℓ_γ(x⋆; x⋆) ≤ ℓ_γ(x; x⋆) for all x, i.e., iff

    f(x⋆) + g(x⋆) ≤ f(x⋆) + ⟨∇f(x⋆), x − x⋆⟩ + 1/(2γ)‖x − x⋆‖² + g(x)  ∀x ∈ IR^n.

By suitably rearranging, the claim readily follows.
♠ 3.4(ii): since x⋆ is γ-critical, due to 3.4(i) apparently it is also γ′-critical for any γ′ ∈ (0, γ]. From the definition (3.1) of the criticality threshold Γ(x⋆), it then follows that x⋆ is γ-critical for any γ ∈ (0, Γ(x⋆)). Suppose now that Γ(x⋆) < γ_g. Then, due to 3.4(i), for all γ ∈ (0, Γ(x⋆)) we have

    g(x) ≥ g(x⋆) + ⟨−∇f(x⋆), x − x⋆⟩ − 1/(2γ)‖x − x⋆‖²  ∀x ∈ IR^n.

By taking the limit as γ ↗ Γ(x⋆) we obtain that the inequality holds for Γ(x⋆) as well, proving the claim in light of the characterization 3.4(i).
♠ 3.4(iii): let x⋆ be a critical point, and let x ∈ T_γ(x⋆) for some γ < Γ(x⋆). Fix γ′ ∈ (γ, Γ(x⋆)). From 3.4(i) and 3.4(ii) it then follows that

(3.2)    g(x) ≥ g(x⋆) + ⟨−∇f(x⋆), x − x⋆⟩ − 1/(2γ′)‖x − x⋆‖².

Since x, x⋆ ∈ T_γ(x⋆), it holds that ℓ_γ(x⋆; x⋆) = ℓ_γ(x; x⋆), i.e.,

    g(x⋆) = ⟨∇f(x⋆), x − x⋆⟩ + 1/(2γ)‖x − x⋆‖² + g(x) ≥ g(x⋆) + (1/(2γ) − 1/(2γ′))‖x − x⋆‖²,

the inequality following from (3.2). Since 1/(2γ) − 1/(2γ′) > 0, necessarily x = x⋆.
The inequality in Theorem 3.4(i) can be rephrased as the fact that the vector −∇f(x̄) is a "global" proximal subgradient for g at x̄ as in [49, Def. 8.45], where "global" refers to the fact that δ can be taken +∞ in the cited definition. An interesting consequence is that the definition of criticality depends solely on ϕ and not on the considered decomposition f + g; in fact, it is only the threshold Γ that depends on it. To see this, let f̃ = f − h and g̃ = g + h for some h ∈ C^{1,1}(IR^n), and consider a point x⋆ which is γ-critical with respect to the decomposition f + g, i.e., such that x⋆ ∈ T^{f,g}_γ(x⋆). Combining Theorem 3.4(i) with the quadratic bound (2.3) for h, we obtain

    g̃(x) ≥ g̃(x⋆) − ⟨∇f̃(x⋆), x − x⋆⟩ − (1 + γL_h)/(2γ) ‖x − x⋆‖²  for all x ∈ IR^n.

Again from the characterization of Theorem 3.4(i), we deduce that x⋆ ∈ T^{f̃,g̃}_γ̃(x⋆), where γ̃ = γ/(1 + γL_h). In particular, considering h = −f we infer that a point x⋆ is critical iff x⋆ ∈ T^{0,ϕ}_γ(x⋆) = prox_{γϕ}(x⋆) for some γ > 0, which legitimizes the notion of criticality without mentioning a specific decomposition.
In the next result we show that criticality is a halfway property between stationarity and optimality. In light of these relations we shall seek "suboptimal" solutions, which we characterize as critical points.

Proposition 3.5 (Optimality, criticality, stationarity). Let γ̄ := min{γ_g, 1/L_f}.
 (i) (criticality ⇒ stationarity) fix T_γ ⊆ zer ∂̂ϕ for all γ ∈ (0, γ_g);
 (ii) (optimality ⇒ criticality) Γ(x⋆) ≥ γ̄ for all x⋆ ∈ argmin ϕ; in particular, argmin ϕ ⊆ fix T_γ for all γ ∈ (0, γ̄), and also for γ = 1/L_f if γ_g > 1/L_f.

Proof.
♠ 3.5(i): let γ ∈ (0, γ_g) and x ∈ fix T_γ. Since x minimizes g + 1/(2γ)‖· − x + γ∇f(x)‖², we have 0 ∈ ∂̂(g + 1/(2γ)‖· − x + γ∇f(x)‖²)(x) = ∂̂g(x) + ∇f(x) = ∂̂ϕ(x), where the first inclusion follows from [49, Thm. 10.1] and the equalities from [49, Ex. 8.8(c)]. This proves that x is stationary.
♠ 3.5(ii): fix γ ∈ (0, γ̄), x⋆ ∈ argmin ϕ and y ∈ T_γ(x⋆). Necessarily y = x⋆; otherwise, due to Lem. 2.1, ϕ(y) < ϕ(x⋆) would contradict minimality of ϕ(x⋆). Therefore, x⋆ is γ-critical and the claim follows from the arbitrariness of γ ∈ (0, γ̄).
As already seen in Example 3.3, the bound Γ(x⋆) ≥ min{γ_g, 1/L_f} at optimal points in Proposition 3.5(ii) is tight, and clearly the implication "optimality ⇒ criticality" cannot be reversed (consider, e.g., the point x⋆ = 0 for ϕ = cos). The next example shows that the other implication is also proper.

Example 3.6 (Stationarity ⇏ criticality). Let f(x) = ½x² and g(x) = x^{5/3}. We have γ_g = +∞, L_f = 1, and for x⋆ = 0 it holds that ∂̂ϕ(x⋆) = {∇ϕ(x⋆)} = {0}. Thus, x⋆ is stationary; however, T_γ(x⋆) = prox_{γg}(0) = −(5γ/3)³, and in particular x⋆ ∉ T_γ(x⋆) for any γ > 0, proving x⋆ to be non-critical.
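The claim of Example 3.6 can be checked numerically: since ∇f(0) = 0, we have T_γ(0) = prox_{γg}(0), which the following brute-force sketch (ours, illustrative; the real-valued power uses sign(z)|z|^{5/3} for negative arguments) compares against the closed-form minimizer −(5γ/3)³.

```python
import numpy as np

def prox_g(x, gamma, grid):
    """Brute-force prox of g(z) = z^{5/3} (real cube root: sign(z)·|z|^{5/3})."""
    g = np.sign(grid) * np.abs(grid) ** (5 / 3)
    return grid[np.argmin(g + (grid - x) ** 2 / (2 * gamma))]

grid = np.linspace(-1.0, 1.0, 2000001)   # step 1e-6
for gamma in (0.1, 0.3, 0.5):
    p = prox_g(0.0, gamma, grid)
    # grid minimizer matches −(5γ/3)³, the unique proximal point at 0
    assert abs(p - (-(5 * gamma / 3) ** 3)) < 1e-3
    # in particular 0 ∉ T_γ(0) for every γ > 0: 0 is stationary but not critical
    assert p != 0.0
```

The stationary point 0 is thus never recovered as a fixed point of the forward-backward map, in agreement with the example.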
4. Forward-backward envelope. The FBE (1.2) was introduced in [41] and further analyzed in [50, 32] in the case where g is convex. Under such an assumption the FBE was shown to be continuously differentiable, which made it possible to derive minimization algorithms based on its gradient. In the general setting addressed in this paper the FBE might fail to be (continuously) differentiable, and as such we need to resort to methods that do not need first-order information of the FBE. This task will be addressed in Section 5, where Algorithm ZeroFPR will be proposed; other than being applicable to a wider range of problems, the proposed scheme is entirely based on the same oracle as forward-backward iterations, unlike the approaches in [41, 50, 32] which instead require the computation of ∇²f. All this will be possible thanks to continuity properties of the FBE and to the behavior of ϕ_γ at critical points. We now focus on its continuity, while the other property will be addressed shortly after in Theorem 4.4.
Remark 4.1 (Alternative expressions for ϕ_γ). By expanding the square and rearranging the terms in the definition (1.2), ϕ_γ can equivalently be expressed as

    ϕ_γ(x) = inf_{z ∈ IR^n} { f(x) − γ/2 ‖∇f(x)‖² + g(z) + 1/(2γ)‖z − x + γ∇f(x)‖² }.

Comparing with (2.5), it is apparent that the set of minimizers z in the above expression coincides with T_γ(x), the forward-backward operator at x. Moreover, taking the constant term f(x) − γ/2 ‖∇f(x)‖² out of the infimum, we immediately obtain the following expression involving the Moreau envelope of g:

(4.1)    ϕ_γ(x) = f(x) − γ/2 ‖∇f(x)‖² + g^γ(x − γ∇f(x)).

Other than providing an explicit way of computing the FBE, (4.1) emphasizes how ϕ_γ inherits the regularity properties of the Moreau envelope of g. In particular, the next key property follows from the strict continuity of g^γ [49, Ex. 10.32].
Proposition 4.2 (Strict continuity of ϕ_γ). For any γ ∈ (0, γ_g), the FBE ϕ_γ is a real-valued and strictly continuous function on IR^n.
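The equivalence of the definition (1.2) and its Moreau-envelope form (4.1) is easy to sanity-check numerically. The sketch below (ours, with a hypothetical 1-D instance f = cos, g = 0.5|·|; not part of the paper) evaluates both expressions by grid minimization, and also verifies the bound ϕ_γ ≤ ϕ stated in Proposition 4.3(i) below.

```python
import numpy as np

# hypothetical 1-D instance: smooth nonconvex f with L_f = 1, nonsmooth g
f, df = lambda x: np.cos(x), lambda x: -np.sin(x)
g = lambda z: 0.5 * np.abs(z)

grid = np.linspace(-10, 10, 400001)

def fbe_def(x, gamma):
    """Definition (1.2): infimum of the linearized model over z."""
    return np.min(f(x) + df(x) * (grid - x)
                  + (grid - x) ** 2 / (2 * gamma) + g(grid))

def fbe_moreau(x, gamma):
    """Expression (4.1): f(x) − γ/2·∇f(x)² + g^γ(x − γ∇f(x))."""
    u = x - gamma * df(x)
    g_env = np.min(g(grid) + (grid - u) ** 2 / (2 * gamma))
    return f(x) - gamma / 2 * df(x) ** 2 + g_env

gamma = 0.5
for x in (-2.0, -0.3, 1.7):
    v1, v2 = fbe_def(x, gamma), fbe_moreau(x, gamma)
    assert abs(v1 - v2) < 1e-6            # (1.2) and (4.1) agree
    assert v1 <= f(x) + g(x) + 1e-9       # ϕ_γ ≤ ϕ, cf. Proposition 4.3(i)
```

The two routines minimize term-by-term identical quantities, which is exactly the completing-the-square computation of Remark 4.1.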
4.1. Connections with the Moreau envelope. For the special case f = 0, FBS iterations (2.5) reduce to the proximal point algorithm (PPA) x⁺ ∈ prox_{γϕ}(x), first introduced in [35] for convex functions ϕ and later generalized to functions with a convex majorizing surrogate ℓ^{0,ϕ}_γ(·; x) = ϕ(·) + 1/(2γ)‖· − x‖², see e.g. [26]. Similarly, the FBE reduces to the Moreau envelope: ϕ^γ = ϕ^{0,ϕ}_γ. In fact, the FBE extends the connection between the PPA and the Moreau envelope

(4.2a)    ϕ^γ(x) = min_z ℓ^{0,ϕ}_γ(z; x)   ↔   prox_{γϕ}(x) = argmin_z ℓ^{0,ϕ}_γ(z; x),

holding for f = 0 in (2.4), to majorizing functions ℓ^{f,g}_γ with arbitrary f ∈ C^{1,1}(IR^n):

(4.2b)    ϕ_γ(x) = min_z ℓ^{f,g}_γ(z; x)   ↔   T_γ(x) = argmin_z ℓ^{f,g}_γ(z; x).

In the next section we will see the fundamental qualitative similarities between the FBE and the Moreau envelope. Namely, for γ small enough both ϕ^γ and ϕ_γ are lower bounds for the original function ϕ with the same minimizers and minimum; in particular, the minimization of ϕ is equivalent to that of ϕ^γ or ϕ_γ. Similarly, the identity

    ϕ(x̄) = ϕ^γ(x) − 1/(2γ)‖x − x̄‖²  for x̄ ∈ prox_{γϕ}(x)

will be extended to the inequality

    ϕ(x̄) ≤ ϕ_γ(x) − (1 − γL_f)/(2γ)‖x − x̄‖²  for x̄ ∈ T_γ(x).
4.2. Basic properties. We now provide bounds relating ϕ_γ to the original function ϕ that extend the well-known inequalities involving the Moreau envelope.

Proposition 4.3. Let γ ∈ (0, γ_g) be fixed. Then
 (i) ϕ_γ ≤ ϕ;
 (ii) ϕ(x̄) ≤ ϕ_γ(x) − (1 − γL_f)/(2γ)‖x − x̄‖² for all x ∈ IR^n and x̄ ∈ T_γ(x).

Proof. 4.3(i) is obvious from the definition of the FBE (consider z = x in (1.2)). As to 4.3(ii), since the set of minimizers in (1.2) is T_γ(x) (cf. (4.2b)), (2.3) yields

    ϕ_γ(x) = f(x) + ⟨∇f(x), x̄ − x⟩ + g(x̄) + 1/(2γ)‖x − x̄‖²
           ≥ f(x̄) − L_f/2 ‖x̄ − x‖² + g(x̄) + 1/(2γ)‖x − x̄‖² = ϕ(x̄) + (1 − γL_f)/(2γ)‖x − x̄‖².
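Both inequalities of Proposition 4.3 can be checked numerically on a toy instance. The sketch below (ours; the 1-D instance f = cos with L_f = 1 and g = 0.5|·| is hypothetical) computes ϕ_γ via (4.1) on a grid and T_γ in closed form via soft-thresholding.

```python
import numpy as np

f, df = lambda x: np.cos(x), lambda x: -np.sin(x)   # L_f = 1
g = lambda z: 0.5 * np.abs(z)
gamma = 0.4
grid = np.linspace(-10, 10, 400001)

def fbe(x):
    """ϕ_γ via (4.1); the Moreau envelope of g evaluated on a grid."""
    u = x - gamma * df(x)
    return f(x) - gamma / 2 * df(x) ** 2 \
        + np.min(g(grid) + (grid - u) ** 2 / (2 * gamma))

def T(x):
    """Forward-backward step: the prox of 0.5γ|·| is soft-thresholding."""
    u = x - gamma * df(x)
    return np.sign(u) * max(abs(u) - 0.5 * gamma, 0.0)

for x in (-2.0, -0.3, 1.7):
    xb = T(x)
    phi_xb = f(xb) + g(xb)
    assert fbe(x) <= f(x) + g(x) + 1e-9        # Proposition 4.3(i)
    assert phi_xb <= fbe(x) \
        - (1 - gamma) / (2 * gamma) * (x - xb) ** 2 + 1e-6   # Proposition 4.3(ii)
```

The grid evaluation overestimates the true infimum only slightly, which works in favor of both asserted inequalities.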
Compared with the inequalities holding for convex g treated in [50], the lower bound in Proposition 4.3 is weaker, while the upper bound is unchanged. Regardless, an immediate consequence of the result is that the values of ϕ and ϕ_γ at critical points are the same, and minimizers and infima of the two functions coincide for γ small enough.

Theorem 4.4. The following hold:
 (i) ϕ(x) = ϕ_γ(x) for all γ ∈ (0, γ_g) and x ∈ fix T_γ;
 (ii) inf ϕ = inf ϕ_γ and argmin ϕ = argmin ϕ_γ for all γ ∈ (0, min{1/L_f, γ_g}).

The bound γ < 1/L_f in Theorem 4.4(ii) is tight even when f and g are convex, as the counterexample with f(x) = ½x² and g = δ_{IR₊} shows (see [50, Ex. 2.4] for details).
Although we will address problem (1.1) by simply exploiting the continuity of the FBE, ϕ_γ nevertheless enjoys favorable properties which are key for the efficacy of the method which will be discussed in Section 5. Firstly, observe that, due to strict continuity, ϕ_γ is almost everywhere differentiable, as follows from Rademacher's theorem. The same applies to the mapping x ↦ x − γ∇f(x), its Jacobian being

(4.3)    Q_γ(x) := I − γ∇²f(x),

which is symmetric wherever it exists [49, Cor. 13.42 and Prop. 13.34]. However, in order to show that the proposed method achieves fast convergence we need additional regularity properties, namely (strict) twice differentiability at critical points and continuous differentiability around them. The rest of the section is dedicated to this task.
4.3. Prox-regularity and first-order properties. In the favorable case in which g is convex and f ∈ C²(IR^n), the FBE enjoys global continuous differentiability [50]. In our setting, prox-regularity acts as a surrogate of convexity; the interested reader is referred to [49, §13.F] for a detailed discussion.

Definition 4.5 (Prox-regularity). Function g is said to be prox-regular at x₀ for v₀ ∈ ∂g(x₀) if there exist ρ, ε > 0 such that for all x′ ∈ B(x₀; ε) and

    (x, v) ∈ gph ∂g  s.t.  x ∈ B(x₀; ε),  v ∈ B(v₀; ε),  and  g(x) ≤ g(x₀) + ε,

it holds that g(x′) ≥ g(x) + ⟨v, x′ − x⟩ − ρ/2 ‖x′ − x‖².

Prox-regularity is a mild requirement enjoyed globally and for any subgradient by all convex functions, with ε = +∞ and ρ = 0. When g is prox-regular at x₀ for v₀, then for sufficiently small γ > 0 the Moreau envelope g^γ is continuously differentiable in a neighborhood of x₀ + γv₀ [45]. For our purposes, when needed, prox-regularity of g will be required only at critical points x⋆, and only for the subgradient −∇f(x⋆). Therefore, with a slight abuse of terminology we define prox-regularity of critical points as follows.

Definition 4.6 (Prox-regularity of critical points). We say that a critical point x⋆ is prox-regular if g is prox-regular at x⋆ for −∇f(x⋆).
Examples where a critical point fails to be prox-regular are challenging to construct; before illustrating a cumbersome such instance in Example 4.9, we first prove an important result that connects prox-regularity with first-order properties of the FBE.
Theorem 4.7 (Continuous differentiability of ϕ_γ). Suppose that f is of class C² around a prox-regular critical point x⋆. Then, for all γ ∈ (0, Γ(x⋆)) there exists a neighborhood U_{x⋆} of x⋆ on which the following properties hold:
 (i) T_γ and R_γ are strictly continuous, and in particular single-valued;
 (ii) ϕ_γ ∈ C¹ with ∇ϕ_γ = Q_γ R_γ, where Q_γ is as in (4.3).

Proof. For γ′ ∈ (γ, Γ(x⋆)), using Thm.s 3.4(i) and 3.4(iii) we obtain that

(4.4)    g(x) ≥ g(x⋆) − ⟨∇f(x⋆), x − x⋆⟩ − 1/(2γ′)‖x − x⋆‖²  ∀x ∈ IR^n.

Replacing γ′ with γ in the above expression, the inequality is strict for all x ≠ x⋆. From [45, Thm. 4.4] applied to the "tilted" function x ↦ g(x + x⋆) − g(x⋆) − ⟨∇f(x⋆), x⟩ it follows that there is a neighborhood V of x⋆ − γ∇f(x⋆) in which prox_{γg} is strictly continuous and g^γ is of class C^{1+} with ∇g^γ(x) = γ⁻¹(x − prox_{γg}(x)) for all x ∈ V. Since f is C² around x⋆ and ∇f is continuous, by possibly narrowing U_{x⋆} we may assume that f ∈ C²(U_{x⋆}) and x − γ∇f(x) ∈ V for all x ∈ U_{x⋆}. Part 4.7(ii) then follows from (4.1) and the chain rule of differentiation, and 4.7(i) from the fact that strict continuity is preserved by composition.
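Theorem 4.7(ii) gives the gradient formula ∇ϕ_γ = Q_γ R_γ; for convex g and f ∈ C² the formula is in fact known to hold globally [50]. The following sketch (ours; the 1-D instance is illustrative) checks it by finite differences at a point where prox_{γg} is differentiable, using the closed-form soft-thresholding prox of g = |·| and the Moreau-envelope expression (4.1) for ϕ_γ.

```python
import numpy as np

# 1-D sketch: f(x) = x²/2 (so ∇²f ≡ 1), g = |·| (convex, hence prox-regular)
f, df, d2f = lambda x: 0.5 * x * x, lambda x: x, lambda x: 1.0
gamma = 0.5

def prox_g(u):
    """prox of γ|·|: soft-thresholding with threshold γ."""
    return np.sign(u) * max(abs(u) - gamma, 0.0)

def fbe(x):
    """ϕ_γ via (4.1); the Moreau envelope is evaluated at its own prox."""
    u = x - gamma * df(x)
    p = prox_g(u)
    return f(x) - gamma / 2 * df(x) ** 2 + abs(p) + (p - u) ** 2 / (2 * gamma)

x = 3.0                                 # away from the prox kink: T_γ smooth here
Tx = prox_g(x - gamma * df(x))          # forward-backward step (2.6a)
R = (x - Tx) / gamma                    # forward-backward residual (2.6b)
Q = 1.0 - gamma * d2f(x)                # Q_γ(x) = I − γ∇²f(x), cf. (4.3)

h = 1e-6
fd = (fbe(x + h) - fbe(x - h)) / (2 * h)   # central finite difference
assert abs(fd - Q * R) < 1e-6              # ∇ϕ_γ(x) = Q_γ(x) R_γ(x)
```

On this instance ϕ_γ is locally quadratic around x = 3, so the central difference reproduces the gradient essentially to machine precision.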
When f = 0, Theorem 4.7 restates the known fact that if g is prox-regular at x⋆ for 0 ∈ ∂g(x⋆), then g^γ is continuously differentiable around x⋆ with ∇g^γ(x) = (1/γ)(x − prox_{γg}(x)). Notice that the bound γ < Γ(x⋆) is tight: in general, for γ = Γ(x⋆) neither continuity of T_γ nor continuous differentiability of ϕ_γ around x⋆ can be guaranteed. In fact, even when x⋆ is Γ(x⋆)-critical, T_γ might fail to be single-valued and ϕ_γ to be differentiable at x⋆, as the following counterexample shows.
Example 4.8 (Necessity of γ ≠ Γ(x⋆) in first-order properties). Consider f(x) = ½x² and g = δ_S, where S = {0, 1}. Then L_f = 1, γ_g = +∞, T_γ(x) = Π_S((1 − γ)x) and the FBE is ϕ_γ(x) = (1 − γ)/2 ‖x‖² + 1/(2γ) dist((1 − γ)x, S)². At the critical point x = 1, which satisfies Γ(1) = 1/2, g is prox-regular for any subgradient. For any γ ∈ (0, 1/2) it is easy to see that ϕ_γ is differentiable in a neighborhood of x = 1. However, for γ = 1/2 the distance function has a first-order singularity at x = 1, due to the 2-valuedness of T_γ(1) = Π_S(1/2) = {0, 1}.
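The closed-form FBE of Example 4.8 makes the claimed singularity easy to verify numerically. The sketch below (ours, purely illustrative) compares one-sided difference quotients of ϕ_γ at x = 1 for γ below and at the threshold Γ(1) = 1/2.

```python
import numpy as np

S = np.array([0.0, 1.0])

def fbe(x, gamma):
    """Closed form from Example 4.8: (1−γ)/2·x² + dist((1−γ)x, S)²/(2γ)."""
    d = np.min(np.abs(S - (1 - gamma) * x))
    return (1 - gamma) / 2 * x ** 2 + d ** 2 / (2 * gamma)

h = 1e-6
# γ < Γ(1) = 1/2: one-sided slopes at x = 1 agree (ϕ_γ differentiable there)
g1 = 0.4
l = (fbe(1.0, g1) - fbe(1.0 - h, g1)) / h
r = (fbe(1.0 + h, g1) - fbe(1.0, g1)) / h
assert abs(l - r) < 1e-4

# γ = Γ(1) = 1/2: first-order singularity, the one-sided slopes differ
g2 = 0.5
l = (fbe(1.0, g2) - fbe(1.0 - h, g2)) / h
r = (fbe(1.0 + h, g2) - fbe(1.0, g2)) / h
assert abs(l - 1.0) < 1e-3 and abs(r) < 1e-3   # slopes 1 and 0: kink at x = 1
```

The kink at γ = 1/2 is exactly the 2-valuedness of Π_S(1/2) noted in the example.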
Example 4.9 (Prox-nonregularity of critical points). Consider ϕ = f + g where f(x) = ½x², g(x) = δ_S(x) and S = {1/n | n ∈ IN_{≥1}} ∪ {0}. For x₀ = 0 we have Γ(x₀) = +∞; however, g fails to be prox-regular at x₀ for v₀ = 0 = −∇f(x₀). For any ρ > 0 and for any neighborhood V of (0, 0) in gph g it is always possible to find a point arbitrarily close to (0, −1/ρ) with multi-valued projection on V. Specifically, the midpoint P_n = (½(1/n + 1/(n+1)), −1/ρ) has 2-valued projection on gph g for any n ∈ IN_{≥1}, it being Π_{gph g}(P_n) = {1/n, 1/(n+1)}. By considering a large n, P_n can be made arbitrarily close to (0, −1/ρ) and at the same time its projection(s) arbitrarily close to (0, 0). It follows that g cannot be prox-regular at 0 for 0, for otherwise such projections would be single-valued close enough to (0, 0) [45, Cor. 3.4 and Thm. 3.5]. As a result, g^γ(x) = 1/(2γ) dist(x, S)² is not differentiable around x = 0, and indeed at each midpoint ½(1/n + 1/(n+1)) for n ∈ IN_{≥1} it has a nonsmooth spike.
To underline how unfortunate the situation depicted in Example 4.9 is, notice that adding a linear term λx to f for any λ ≠ 0, yet leaving g unchanged, restores the desired prox-regularity of each critical point. Indeed, this is trivially true for any nonzero critical point; besides, g is prox-regular at 0 for any λ ∈ (0, +∞), while for any λ < 0 the point 0 is not critical.
4.4. Second-order properties. In this section we discuss sufficient conditions for twice differentiability of the FBE at critical points. In addition to prox-regularity, which is needed for local continuous differentiability, we will also need generalized second-order properties of g. The interested reader is referred to [49, §13] for an extensive discussion on epi-differentiability.

Assumption II. With respect to a given critical point x⋆:
 (i) ∇²f exists and is (strictly) continuous around x⋆;
 (ii) g is prox-regular and (strictly) twice epi-differentiable at x⋆ for −∇f(x⋆), with its second-order epi-derivative being generalized quadratic:

(4.5)    d²g(x⋆ | −∇f(x⋆))[d] = ⟨d, Md⟩ + δ_S(d),  ∀d ∈ IR^n,

where S ⊆ IR^n is a linear subspace and M ∈ IR^{n×n}. Without loss of generality we take M symmetric, and such that Im(M) ⊆ S and ker(M) ⊇ S^⊥.¹
We say that the assumptions are "strictly" satisfied if the stronger conditions in parenthesis hold.
Twice epi-differentiability of g is a mild requirement, and cases where d²g is generalized quadratic are abundant [47, 48, 43, 44]. Moreover, prox-regular and C²-partly smooth functions g (see [29, 19]) comprise a wide class of functions that strictly satisfy Assumption II(ii) at a critical point x⋆ provided that strict complementarity holds, namely if −∇f(x⋆) ∈ relint ∂g(x⋆). In fact, it follows from [19, Thm. 28] applied to the tilted function g̃ = g + ⟨∇f(x⋆), ·⟩ (which is still C²-partly smooth and prox-regular at x⋆ [29, Cor. 4.6], [49, Ex. 13.35]) that prox_{γg̃} is continuously differentiable around x⋆ for γ small enough (in fact, for γ < Γ(x⋆)). From [42, Thm. 4.1(g)] we then obtain that g̃ is strictly twice epi-differentiable at x⋆ with generalized quadratic second-order epi-derivative, and the claim follows by tilting back to g.
We now show that the quite common properties required in Assumption II are all that is needed to ensure first-order properties of the proximal mapping and second-order properties of the FBE at critical points. The result generalizes the one in [50] by allowing nonconvex functions g. Although the proof is quite similar, we include it for the sake of self-containedness.
Theorem 4.10 (Twice differentiability of ϕ_γ). Suppose that Assumption II is (strictly) satisfied with respect to a critical point x⋆. Then, for any γ ∈ (0, Γ(x⋆)):
(i) prox_{γg} is (strictly) differentiable at x⋆ − γ∇f(x⋆) with symmetric and positive semidefinite Jacobian

(4.6) P_γ(x⋆) := J prox_{γg}(x⋆ − γ∇f(x⋆));
¹This can indeed be done without loss of generality: if M and S satisfy (4.5), then it suffices to replace M with M′ = (1/2) Π_S (M + M^⊤) Π_S to ensure the desired properties.
(ii) R_γ is (strictly) differentiable at x⋆ with Jacobian

(4.7) JR_γ(x⋆) = (1/γ)[I − P_γ(x⋆) Q_γ(x⋆)],

where Q_γ is as in (4.3) and P_γ as in (4.6);
(iii) ϕ_γ is (strictly) twice differentiable at x⋆ with symmetric Hessian

(4.8) ∇²ϕ_γ(x⋆) = Q_γ(x⋆) JR_γ(x⋆).

Proof. See Appendix A.
Again, when f ≡ 0 Theorem 4.10 covers the differentiability properties of the proximal mapping (and consequently the second-order properties of the Moreau envelope, due to the identity ∇g^γ(x) = (1/γ)(x − prox_{γg}(x))) as discussed in [42].
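The identity above can be checked numerically on a simple instance. The sketch below is our own illustration, not part of the paper: it takes g = |·| on IR, whose proximal mapping is soft-thresholding, and compares the closed-form gradient (1/γ)(x − prox_{γg}(x)) of the Moreau envelope against a central finite difference.

```python
import numpy as np

def prox_abs(x, gamma):
    # proximal mapping of g = |.|: soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_env(x, gamma):
    # g^gamma(x) = min_z { |z| + (1/(2*gamma)) (x - z)^2 }, attained at z = prox
    z = prox_abs(x, gamma)
    return np.abs(z) + (x - z) ** 2 / (2.0 * gamma)

gamma, x, h = 0.5, 2.0, 1e-6
grad_identity = (x - prox_abs(x, gamma)) / gamma   # = 1.0 at this point
grad_fd = (moreau_env(x + h, gamma) - moreau_env(x - h, gamma)) / (2.0 * h)
# the two values agree up to finite-difference error
```

For x > γ the envelope equals x − γ/2, so both quantities evaluate to 1 here, consistent with the smoothing effect of the envelope despite g being nonsmooth at 0.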
We now provide a key result that links nonsingularity of the Jacobian of the forward-backward residual R_γ to strong (local) minimality for the original cost ϕ and for the FBE ϕ_γ, under the generalized second-order properties of Assumption II.

Theorem 4.11 (Conditions for strong local minimality). Suppose that Assumption II is satisfied with respect to a critical point x⋆, and let γ ∈ (0, min{Γ(x⋆), 1/L_f}). The following are equivalent:
(a) x⋆ is a strong local minimum for ϕ;
(b) x⋆ is a local minimum for ϕ and JR_γ(x⋆) is nonsingular;
(c) the (symmetric) matrix ∇²ϕ_γ(x⋆) is positive definite;
(d) x⋆ is a strong local minimum for ϕ_γ;
(e) x⋆ is a local minimum for ϕ_γ and JR_γ(x⋆) is nonsingular.

Proof. See Appendix A.
5. ZeroFPR algorithm. The first algorithmic framework exploiting the FBE for solving composite minimization problems was studied in [41], and other schemes have recently been investigated in [50, 32]. All such methods tackle the problem by looking for a (local) minimizer of the FBE, exploiting the equivalence of (local) minimality for the original function ϕ and for the FBE ϕ_γ for γ small enough. To do so, they all employ the concept of descent directions, thus requiring the gradient of the FBE to be well defined everywhere. In the more general framework addressed in this paper this basic requirement is not met, which is why we approach the problem from a different perspective. This leads to ZeroFPR, the first algorithm, to the best of our knowledge, that, despite requiring only the black-box oracle of FBS and being suited for fully nonconvex problems, achieves superlinear convergence rates.
5.1. Overview. Instead of directly addressing the minimization of ϕ or ϕ_γ, we seek solutions of the following nonlinear inclusion (generalized equation):

(5.2) find x⋆ ∈ IR^n such that 0 ∈ R_γ(x⋆).

By doing so we address the problem from the same perspective as FBS, that is, finding fixed points of the forward-backward operator T_γ or, equivalently, zeros of its residual R_γ. Although R_γ might be quite irregular when g is nonconvex, it enjoys favorable properties at the very solutions of (5.2), i.e., at γ-critical points, starting from single-valuedness, cf. Theorem 3.4(iii). If some assumptions are met, R_γ turns out to be continuous around, and even differentiable at, critical points (cf. Theorems 4.7 and 4.10), and as a consequence the inclusion problem (5.2) reduces, close to solutions, to a well-behaved system of equations as opposed to a generalized equation.

Algorithm ZeroFPR (generalized forward-backward with nonmonotone linesearch)
Require: γ ∈ (0, min{1/L_f, γ_g}), β, p_min ∈ (0, 1), σ ∈ (0, γ(1 − γL_f)/2), x^0 ∈ IR^n.
Initialize: Φ̄_0 = ϕ_γ(x^0), k = 0.
1: Select x̄^k ∈ T_γ(x^k) and set r^k = (1/γ)(x^k − x̄^k)
2: if ‖r^k‖ = 0 then stop end if
3: Select a direction d^k ∈ IR^n
4: Let τ_k ∈ {β^m | m ∈ IN} be the largest such that x^{k+1} = x̄^k + τ_k d^k satisfies

   (5.1) ϕ_γ(x^{k+1}) ≤ Φ̄_k − σ‖r^k‖²

5: Φ̄_{k+1} = (1 − p_k) Φ̄_k + p_k ϕ_γ(x^{k+1}) for some p_k ∈ [p_min, 1]
6: k ← k + 1 and go to step 1
This motivates addressing problem (5.2) with fast methods for nonlinear equations. Newton-like schemes are iterative methods that prescribe updates of the form

(5.3) x⁺ = x − H R_γ(x),

which essentially amounts to selecting H = H(x), a linear operator that ideally carries information on the geometry of R_γ around x, in the attempt to yield an optimal iterate x⁺. For instance, when R_γ is sufficiently regular, Newton's method corresponds to selecting H as the inverse of an element of the generalized Jacobian of R_γ at x, enabling fast convergence close to a solution under some assumptions. However, selecting H as in Newton's method would require information beyond the forward-backward oracle T_γ, and as such goes beyond the scope of this paper. For this reason we focus instead on quasi-Newton schemes, in which H is a linear operator recursively defined through low-rank updates that satisfy the (inverse) secant condition

(5.4) H⁺y = s, where s = x⁺ − x and y ∈ R_γ(x⁺) − R_γ(x).
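A classical way to maintain (5.4) is Broyden's rank-one scheme, written here directly on the inverse operator via the Sherman-Morrison formula. The sketch below is our own illustration (the breakdown safeguard threshold is ad hoc, not from the paper): after the update, the new operator satisfies the inverse secant condition exactly.

```python
import numpy as np

def broyden_inverse_update(H, s, y, eps=1e-12):
    """Rank-one (good Broyden) update of the inverse Jacobian estimate H,
    chosen so that the updated operator maps y to s (inverse secant)."""
    Hy = H @ y
    denom = s @ Hy
    if abs(denom) < eps:         # safeguard: skip the update on near-breakdown
        return H
    return H + np.outer(s - Hy, s @ H) / denom

# the secant condition (5.4) holds after one update:
H = np.eye(3)
s = np.array([1.0, -2.0, 0.5])
y = np.array([0.3, 1.0, -1.0])
H_new = broyden_inverse_update(H, s, y)
# H_new @ y equals s (up to rounding)
```

Plugging such an H into (5.3) gives a Newton-like update that never needs the Jacobian of R_γ itself, which is precisely why it fits the black-box oracle of FBS.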
A famous result [21] states that, under some assumptions and starting sufficiently close to a solution x⋆, updates as in (5.3) are superlinearly convergent to x⋆ iff the Dennis-Moré condition holds, namely the limit ‖(H⁻¹ − JR_γ(x⋆)) s^k‖ / ‖s^k‖ → 0; see also [22] for a thorough survey. More recently, in [23] the result was extended to generalized equations of the form f(x) + G(x) ∋ 0, where f is smooth and G possibly set-valued.
The study focuses on Josephy-Newton methods where the update x⁺ is the solution of the inner problem Bx − f(x) ∈ Bx⁺ + G(x⁺), where B = H⁻¹, which can be interpreted as a forward-backward step in the metric induced by B. In particular, differently from the proposed ZeroFPR, the method in [23] has the crucial limitation that, unless the operator B has a very particular structure, the backward step (B + G)⁻¹ may be prohibitively challenging. The same remark applies to proximal (quasi-)Newton-type methods, in which each iteration requires the computation of a scaled proximal gradient step; see [28] and the references therein.
5.1.1. Globalization strategy. Quasi-Newton schemes are extremely handy and widely used methods. However, it is well known that they are effective only when close enough to a solution, and might even diverge otherwise. To cope with this crucial downside a globalization strategy is needed; this is usually achieved by means of a linesearch over a suitable merit function ψ, along descent directions for ψ, so as to ensure sufficient decrease for small enough stepsizes. Unfortunately, the potential choice ψ(x) = (1/2)‖R_γ(x)‖² is not regular enough for a 'direction of descent' to be everywhere defined. The proposed Algorithm ZeroFPR bypasses this limitation by exploiting the favorable properties of the FBE. In Theorem 5.10 we will see that ZeroFPR achieves superlinear convergence, provided that f and g enjoy some regularity requirements at the limit point and the directions satisfy a Dennis-Moré condition. However, regardless of whether any such condition is met, the algorithm has the same convergence guarantees as FBS (cf. Thm. 5.6).
ZeroFPR globalizes the convergence of any fast local method, and requires exactly the same oracle as FBS. Conceptually, the algorithm is quite elementary; for simplicity, let us first consider the monotone case, i.e., with p_k ≡ 1 so that Φ̄_k = ϕ_γ(x^k) (cf. step 5). The following steps are executed for updating the iterate x^k:
1) first, at step 1 a nominal forward-backward call yields an element x̄^k ∈ T_γ(x^k) that decreases the value of ϕ_γ by at least γ(1 − γL_f)/2 ‖r^k‖² (Prop. 4.3(i));
2) then, at step 3 an update direction d^k at x̄^k (not at x^k!) is selected;
3) because of the sufficient decrease of ϕ_γ in the update x^k ↦ x̄^k and the continuity of ϕ_γ, at step 4 a stepsize τ_k ensuring a decrease of ϕ_γ by at least σ‖r^k‖² in the update x^k ↦ x̄^k + τ_k d^k can be found with finitely many backtrackings τ_k ← βτ_k, for any σ < γ(1 − γL_f)/2.
In order to reduce the number of backtrackings, p_k < 1 can be selected, resulting in a nonmonotone linesearch. The sufficient decrease is then enforced with respect to a parameter Φ̄_k ≥ ϕ_γ(x^k) (cf. Lem. 5.1), namely a convex combination of ϕ_γ(x^0), …, ϕ_γ(x^k). For the sake of convergence, (p_k)_{k∈IN} can be selected arbitrarily in (0, 1] as long as it is bounded away from 0, hence the role of the user-set lower bound p_min. Consequently, small values of σ and p_k concur in reducing conservatism in the linesearch by favoring larger stepsizes.
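The steps above can be sketched in a few lines. The sketch below is our own simplified illustration, not the paper's implementation: it fixes p_k ≡ p_min, uses a plain extra forward-backward point as a placeholder direction d^k (the paper's quasi-Newton directions are what yield the fast rates), evaluates ϕ_γ with the same FBS oracle calls, and adds an ad-hoc stepsize floor as a numerical safeguard.

```python
import numpy as np

def zerofpr_sketch(f, grad_f, g, prox_g, Lf, x0,
                   beta=0.5, p_min=0.1, max_iter=200, tol=1e-10):
    gamma = 0.95 / Lf
    sigma = 0.1 * gamma * (1.0 - gamma * Lf) / 2.0  # in (0, gamma*(1-gamma*Lf)/2)

    def fb(x):
        # one forward-backward call, also returning the FBE value phi_gamma(x);
        # no information beyond the FBS oracle is needed
        gx = grad_f(x)
        xbar = prox_g(x - gamma * gx, gamma)
        phi = f(x) + gx @ (xbar - x) + (xbar - x) @ (xbar - x) / (2 * gamma) + g(xbar)
        return xbar, phi

    x = np.asarray(x0, dtype=float)
    xbar, Phi_bar = fb(x)                     # Phi_bar_0 = phi_gamma(x^0)
    for _ in range(max_iter):
        r = (x - xbar) / gamma                # fixed-point residual r^k (step 1)
        if np.linalg.norm(r) <= tol:          # step 2
            break
        # placeholder direction at xbar (step 3): towards a second FB point;
        # the paper selects quasi-Newton directions here instead
        xbar2, _ = fb(xbar)
        d = xbar2 - xbar
        tau = 1.0                             # backtracking linesearch (step 4)
        x_new = xbar + tau * d
        xbar_new, phi_new = fb(x_new)
        while phi_new > Phi_bar - sigma * (r @ r) and tau > 1e-12:
            tau *= beta
            x_new = xbar + tau * d
            xbar_new, phi_new = fb(x_new)
        # nonmonotone update of the reference value (step 5), with p_k = p_min
        Phi_bar = (1 - p_min) * Phi_bar + p_min * phi_new
        x, xbar = x_new, xbar_new
    return x
```

With this placeholder direction every iteration reduces to (two) forward-backward steps, so the sketch inherits the FBS-type guarantees of the linesearch; replacing d by a direction satisfying the Dennis-Moré condition is what Theorem 5.10 requires for superlinear convergence.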
Lemma 5.1 (Nonmonotone linesearch globalization). For all k ∈ IN the iterates generated by ZeroFPR satisfy

(5.5) ϕ_γ(x̄^k) ≤ ϕ(x̄^k) ≤ ϕ_γ(x^k) ≤ Φ̄_k,

and there exists τ̄_k > 0 such that

(5.6) ϕ_γ(x̄^k + τd^k) ≤ Φ̄_k − σ‖r^k‖² ∀τ ∈ [0, τ̄_k].

In particular, the number of backtrackings at step 4 is finite.

Proof. The first two inequalities in (5.5) are due to Prop.s 4.3(i) and 4.3(ii), respectively. Moreover,

Φ̄_{k+1} = (1 − p_k) Φ̄_k + p_k ϕ_γ(x^{k+1}) ≥ (1 − p_k) ϕ_γ(x^{k+1}) + p_k ϕ_γ(x^{k+1}) = ϕ_γ(x^{k+1}),

where the inequality follows from the linesearch condition (5.1); this proves the last inequality in (5.5). As for (5.6), let k be fixed and, contrary to the claim, suppose that for all ε > 0 there exists τ_ε ∈ [0, ε] such that the point x_ε = x̄^k + τ_ε d^k satisfies ϕ_γ(x_ε) > ϕ_γ(x^k) − σ‖r^k‖². Taking the limit as ε → 0⁺, so that x_ε → x̄^k, we obtain

ϕ_γ(x̄^k) = lim_{ε→0⁺} ϕ_γ(x_ε) ≥ ϕ_γ(x^k) − σ‖r^k‖² ≥ ϕ(x̄^k) + (γ(1 − γL_f)/2 − σ)‖r^k‖² > ϕ(x̄^k),

which contradicts Prop. 4.3(i). Here, the equality follows from the continuity of ϕ_γ (Prop. 4.2), the first inequality from the property of x_ε, the second one from Prop.