Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity

PUYA LATAFAT1, ANDREAS THEMELIS2, MASOUD AHOOKHOSH3, AND PANAGIOTIS PATRINOS1

Abstract. We introduce two algorithms for nonconvex regularized finite sum minimization, where typical Lipschitz differentiability assumptions are relaxed to the notion of relative smoothness [7]. The first one is a Bregman extension of Finito/MISO [28,42], studied for fully nonconvex problems when the sampling is random, or under convexity of the nonsmooth term when it is essentially cyclic. The second algorithm is a low-memory variant, in the spirit of SVRG [34] and SARAH [48], that also allows for fully nonconvex formulations. Our analysis is made remarkably simple by employing a Bregman Moreau envelope as Lyapunov function. In the randomized case, linear convergence is established when the cost function is strongly convex, yet with no convexity requirements on the individual functions in the sum. For the essentially cyclic and low-memory variants, global and linear convergence results are established when the cost function satisfies the Kurdyka-Łojasiewicz property.

1. Introduction

We consider the following regularized finite sum minimization
$$\operatorname*{minimize}_{x\in\mathbb{R}^n}\ \varphi(x) \coloneqq \tfrac{1}{N}\sum_{i=1}^{N} f_i(x) + g(x) \quad\text{subject to } x\in\overline{C}, \tag{P}$$
where $\overline{C}$ denotes the closure of $C \coloneqq \bigcap_{i=1}^{N}\operatorname{int}\operatorname{dom} h_i$, for some convex functions $h_i$, $i\in[N]\coloneqq\{1,\dots,N\}$. Our goal in this paper is to study such problems without imposing convexity assumptions on $f_i$ and $g$, and in a setting where the $f_i$ are differentiable but their gradients need not be Lipschitz continuous. Our full setting is formalized in Assumption I.

To relax the Lipschitz differentiability assumption, we adopt the notion of smoothness relative to a distance-generating function [7], and following [40] we will use the terminology of relative smoothness. Despite the lack of Lipschitz differentiability, in many applications the involved functions satisfy a descent property where the usual quadratic upper bound is replaced by a Bregman distance (cf. Fact 2.5(i) and Definition 2.1). Owing to this property, Bregman extensions for many classical schemes have been proposed [7, 40, 6, 59, 49, 1].

In the setting of finite sum minimization, the incremental aggregated algorithm PLIAG was proposed recently [69] as a Bregman variant of the incremental aggregated gradient method [15, 16, 65]. The analysis of PLIAG is limited to the convex case and requires restrictive assumptions for the Bregman kernel [69, Thm. 4.1, Assump. B4]. Stochastic mirror descent (SMD) is another relevant algorithm which can tackle more general stochastic optimization problems. SMD may be viewed as a Bregman extension of the stochastic (sub)gradient method and has long been studied [46, 61, 11, 45]. More recently, [32] studied SMD for convex and relatively smooth formulations, and (sub)gradient versions have been analyzed under relative continuity in a convex setting [39], as well as relative weak convexity [71, 25].

1Department of Electrical Engineering (ESAT-STADIUS) – KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

2Faculty of Information Science and Electrical Engineering (ISEE) – Kyushu University, 744 Motooka, Nishi-ku 819-0395, Fukuoka, Japan

3Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium

E-mail addresses: puya.latafat@esat.kuleuven.be, andreas.themelis@ees.kyushu-u.ac.jp, masoud.ahookhosh@uantwerp.be, panos.patrinos@esat.kuleuven.be.

1991 Mathematics Subject Classification. 90C06, 90C25, 90C26, 49J52, 49J53.

Key words and phrases. Nonsmooth nonconvex optimization, incremental aggregated algorithms, Bregman Moreau envelope, KL inequality.

This work was supported by the Research Foundation Flanders (FWO) PhD grant 1196820N and research projects G0A0920N, G086518N, and G086318N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique – FNRS and the Fonds Wetenschappelijk Onderzoek – Vlaanderen under EOS project no 30468160 (SeLMA).



Algorithm 1 Bregman Finito/MISO (BFinito) for the regularized finite sum minimization (P)

Require: Legendre kernels $h_i$ such that $f_i$ is $L_{f_i}$-smooth relative to $h_i$; stepsizes $\gamma_i\in(0,N/L_{f_i})$; initial point $x^{\rm init}\in C\coloneqq\bigcap_{i=1}^N\operatorname{int}\operatorname{dom} h_i$
Initialize: table $\boldsymbol s^0=(s^0_1,\dots,s^0_N)\in\mathbb R^{nN}$ of vectors $s^0_i=\tfrac1{\gamma_i}\nabla h_i(x^{\rm init})-\tfrac1N\nabla f_i(x^{\rm init})$; $\mathbb R^n$-vector $\tilde s^0=\sum_{i=1}^N s^0_i$
Repeat for $k=0,1,\dots$ until convergence:
  1.1: Compute $z^k\in\arg\min_{w\in\mathbb R^n}\bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\langle\tilde s^k,w\rangle\bigr\}$
  1.2: Select a subset of indices $I^{k+1}\subseteq[N]\coloneqq\{1,\dots,N\}$ and update the table $\boldsymbol s^{k+1}$ as follows:
       $$s^{k+1}_i=\begin{cases}\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\nabla f_i(z^k) & \text{if } i\in I^{k+1},\\ s^k_i & \text{otherwise}\end{cases}$$
  1.3: Update the vector $\tilde s^{k+1}=\tilde s^k+\sum_{i\in I^{k+1}}(s^{k+1}_i-s^k_i)$
Return $z^k$

Motivated by these recent works, we propose a Bregman extension of the popular Finito/MISO algorithm [28, 42] in a fully nonconvex setting and with very general sampling strategies that will be made precise shortly after. In a nutshell, our analysis revolves around the fact that, regardless of the index selection strategy, the function $\mathcal L:\mathbb R^n\times\mathbb R^{nN}\to\overline{\mathbb R}$ defined as
$$\mathcal L(z,\boldsymbol s) \coloneqq \varphi(z) + \sum_{i=1}^N D_{\hat h_i^*}\bigl(s_i,\nabla\hat h_i(z)\bigr), \tag{1.1}$$
where $\hat h_i^*$ denotes the convex conjugate of $\hat h_i\coloneqq h_i/\gamma_i-f_i/N$, monotonically decreases along the iterates $(z^k,\boldsymbol s^k)_{k\in\mathbb N}$ generated by Algorithm 1 (see Assumption I for the requirements on $h_i$, $f_i$). Our methodology leverages an interpretation of Finito/MISO as a block-coordinate algorithm that was observed in [37] in the Euclidean setting. In fact, the analysis is here further simplified after noticing that the smooth function can be "hidden" in the distance-generating function, resulting in a Lyapunov function $\mathcal L$ that can be expressed as a Bregman Moreau envelope (cf. Lemma 3.2).
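For concreteness, here is a minimal Python sketch of Algorithm 1 in the Euclidean special case $h_i=\tfrac12\|\cdot\|^2$, $f_i(x)=\tfrac12\|A_ix-b_i\|^2$, and $g=\lambda\|\cdot\|_1$, where step 1.1 reduces to soft-thresholding; the synthetic data, step-size choice, and single-index random sampling are illustrative assumptions rather than prescriptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): f_i(x) = 0.5*||A_i x - b_i||^2, g(x) = lam*||x||_1,
# and Euclidean kernels h_i = 0.5*||.||^2, so that step 1.1 is a soft-thresholding.
N, n, m = 10, 20, 5
A = [rng.standard_normal((m, n)) for _ in range(N)]
b = [rng.standard_normal(m) for _ in range(N)]
lam = 0.1

def grad_f(i, x):                       # gradient of f_i
    return A[i].T @ (A[i] @ x - b[i])

# With h_i Euclidean, L_{f_i} is the usual Lipschitz modulus of grad f_i,
# and the step sizes must satisfy gamma_i in (0, N/L_{f_i}).
L = [np.linalg.norm(Ai, 2) ** 2 for Ai in A]
gamma = [0.99 * N / Li for Li in L]
c = sum(1.0 / gi for gi in gamma)       # curvature of the quadratic in step 1.1

def soft(v, t):                         # prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Initialization: table s and aggregate vector s_tilde
x_init = np.zeros(n)
s = [x_init / gamma[i] - grad_f(i, x_init) / N for i in range(N)]
s_tilde = sum(s)

for k in range(300):
    z = soft(s_tilde / c, lam / c)      # step 1.1 (closed form in this special case)
    i = rng.integers(N)                 # step 1.2: single-index random sampling (S1)
    s_new = z / gamma[i] - grad_f(i, z) / N
    s_tilde += s_new - s[i]             # step 1.3: update the aggregate vector
    s[i] = s_new

phi = sum(0.5 * np.linalg.norm(A[i] @ z - b[i]) ** 2 for i in range(N)) / N \
      + lam * np.abs(z).sum()
print("final objective:", phi)
```

In the genuinely Bregman case step 1.1 is the subproblem (2.1) with $s=\tilde s^k$ and in general requires a dedicated solver.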

We cover a wide range of sampling strategies for the index set $I^{k+1}$ at step 1.2, which we can summarize into the following two umbrella categories:

Random sampling rule: $\exists\, p_1,\dots,p_N>0:\ \mathbb P_k\bigl[i\in I^{k+1}\bigr]=p_i\quad\forall k\in\mathbb N,\ i\in[N]$. (S1)

Essentially cyclic rule: $\exists\, T>0:\ \bigcup_{t=1}^T I^{k+t}=[N]\quad\forall k\in\mathbb N$. (S2)

The randomized setting (S1), in which $\mathbb P_k$ denotes the probability conditional to the knowledge at iteration $k$, covers, for instance, a mini-batch strategy of size $b$. Another notable case is when each index $i$ is selected at random with probability $p_i$ independently of other indices.

The essentially cyclic sampling (S2) is also very general and has been considered by many authors [62, 60, 33, 24, 67]. Two notable special cases of single index selection rules complying with (S2) are the cyclic and shuffled cyclic sampling strategies:

Shuffled cyclic rule: $I^{k+1}=\bigl\{\pi_{\lfloor k/N\rfloor}\bigl(\operatorname{mod}(k,N)+1\bigr)\bigr\}$, (S2shuf)

where $\pi_0,\pi_1,\dots$ are permutations of the set of indices $[N]$ (chosen randomly or deterministically). When $\pi_{\lfloor k/N\rfloor}=\operatorname{id}$ one recovers the (plain) cyclic sampling rule

Cyclic rule: $I^{k+1}=\{\operatorname{mod}(k,N)+1\}$. (S2cycl)

We remark that, in the cyclic case, our algorithm generalizes DIAG [44] for smooth strongly convex problems, which itself may be seen as a cyclic variant of Finito/MISO.
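As a small illustration (0-based indices, hypothetical helper names), the three rules can be realized as Python generators producing the index sets $I^{k+1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8

def random_batches(batch=2):
    """(S1): each iteration draws a mini-batch uniformly at random, so every
    index is selected with the same probability p_i = batch/N at each step."""
    while True:
        yield set(rng.choice(N, size=batch, replace=False).tolist())

def shuffled_cyclic():
    """(S2shuf): sweep through a fresh permutation of [N] in every epoch."""
    while True:
        for i in rng.permutation(N):
            yield {int(i)}

def cyclic():
    """(S2cycl): plain cyclic rule, I^{k+1} = {mod(k, N) + 1} (0-based here)."""
    while True:
        for i in range(N):
            yield {i}

# For the plain cyclic rule any window of N consecutive iterations covers all
# indices (T = N in (S2)); for the shuffled rule a window of length 2N - 1
# always contains one complete epoch, so (S2) holds with T = 2N - 1.
sampler = shuffled_cyclic()
print([next(sampler) for _ in range(2 * N)])
```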

1.1. Low-memory variant. One iteration of Algorithm 1 involves the computation of $z^k$ at step 1.1 and that of the gradients $\nabla(h_i/\gamma_i-f_i/N)$, $i\in I^{k+1}$, at step 1.2. Consequently, the overall complexity of each iteration is independent of the number $N$ of functions appearing in problem (P), and is instead proportional to the number of sampled indices, which the user is allowed to upper bound by any integer between 1 and $N$. As is the case for all incremental gradient methods, the low iteration cost comes at the price of having to store in memory a table $\boldsymbol s^k$ of $N$ many $\mathbb R^n$-vectors, which can become problematic when $N$ grows large.


Algorithm 2 Low-memory Bregman Finito/MISO

Require: Legendre kernels $h_i$ such that $f_i$ is $L_{f_i}$-smooth relative to $h_i$; stepsizes $\gamma_i\in(0,N/L_{f_i})$; initial point $x^{\rm init}\in C\coloneqq\bigcap_{i=1}^N\operatorname{int}\operatorname{dom} h_i$
Initialize: $\mathbb R^n$-vector $\tilde s^0=\sum_{i=1}^N\tfrac1{\gamma_i}\nabla h_i(x^{\rm init})-\tfrac1N\nabla f_i(x^{\rm init})$; set of selectable indices $\mathcal K^0=\emptyset$   ▹ conventionally set to $\emptyset$ so as to start with a full update
Repeat for $k=0,1,\dots$ until convergence:
  2.1: $z^k\in\arg\min_{w\in\mathbb R^n}\bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\langle\tilde s^k,w\rangle\bigr\}$
  2.2: if $\mathcal K^k=\emptyset$ then   ▹ no index left to be sampled: full update
  2.3:   $I^{k+1}=\mathcal K^{k+1}=[N]$   ▹ activate all indices and reset the selectable indices
  2.4:   $\tilde z^k=z^k$   ▹ store the full update $z^k$
  2.5:   $\tilde s^{k+1}=\sum_{i=1}^N\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\sum_{i=1}^N\nabla f_i(z^k)$
  2.6: else
  2.7:   select a nonempty subset of indices $I^{k+1}\subseteq\mathcal K^k$   ▹ select among the indices not yet sampled
  2.8:   $\mathcal K^{k+1}=\mathcal K^k\setminus I^{k+1}$   ▹ update the set of selectable indices
  2.9:   $\tilde z^k=\tilde z^{k-1}$
  2.10:  $\tilde s^{k+1}=\tilde s^k+\sum_{i\in I^{k+1}}\bigl[\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\nabla f_i(z^k)-\tfrac1{\gamma_i}\nabla h_i(\tilde z^k)+\tfrac1N\nabla f_i(\tilde z^k)\bigr]$
Return $\tilde z^k$

Other incremental algorithms for convex optimization, such as IAG [15, 16, 65], IUG [64], SAG [54], and SAGA [27], can considerably reduce memory allocation from $O(nN)$ to $O(n)$ in applications such as logistic regression and lasso where the gradients $\nabla f_i$ can be expressed as scaled versions of the data vectors. Despite the favorable performance of the Finito/MISO algorithm on such problems as observed in [27], this memory reduction trick cannot be employed due to the fact that the vectors $s_i$ stored in the table depend not only on the gradients, but also on the vectors $\nabla h_i(z^k)$. Nevertheless, inspired by the popular stochastic methods SVRG [34, 66] and SARAH [48], by suitably interleaving incremental and full gradient evaluations it is possible to completely waive the need of a memory table and match the $O(n)$ storage requirement.

In a nutshell, after a full update — which in Algorithm 1 corresponds to selecting $I^{k+1}=[N]$ — all vectors $s^{k+1}_i$ in the table only depend on the variable $z^k$ computed at step 1.1, until $i$ is sampled again. As long as full gradient updates are frequent enough so that no index is sampled twice in between, it thus suffices to keep track of $z^k\in\mathbb R^n$ instead of the table $\boldsymbol s^k\in\mathbb R^{nN}$. The variant is detailed in Algorithm 2, in which $\mathcal K^k\subseteq[N]$ keeps track of the indices that have not yet been sampled between full gradient updates (and is thus reset whenever such full steps occur, cf. step 2.3). Vector $\tilde z^k\in\mathbb R^n$ is equal to the $z^k$ corresponding to the latest full gradient update (cf. step 2.4) and acts as a low-memory surrogate of the table $\boldsymbol s^k$ of Algorithm 1. Similarly to SVRG and SARAH, this reduction in the storage requirements comes at the cost of an extra gradient evaluation per sampled index (cf. step 2.10).
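For illustration, a minimal sketch of this bookkeeping in the Euclidean special case $h_i=\tfrac12\|\cdot\|^2$ with $g\equiv0$ and synthetic rank-one least-squares terms (all data hypothetical) could read as follows; only $\tilde s$ and $\tilde z$ are stored, matching the $O(n)$ memory footprint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical data): f_i(x) = 0.5*(a_i^T x - b_i)^2, g = 0, and
# Euclidean kernels h_i = 0.5*||.||^2; this only illustrates the bookkeeping
# of Algorithm 2, not the general Bregman case.
N, n = 6, 4
a = [rng.standard_normal(n) for _ in range(N)]
b = [rng.standard_normal() for _ in range(N)]
grad_f = lambda i, x: (a[i] @ x - b[i]) * a[i]
gamma = [0.9 * N / (a[i] @ a[i]) for i in range(N)]   # gamma_i < N / L_{f_i}
c = sum(1.0 / gi for gi in gamma)

x_init = np.zeros(n)
s_tilde = sum(x_init / gamma[i] - grad_f(i, x_init) / N for i in range(N))
K = set()          # selectable indices; empty set triggers a full update
z_bar = x_init     # z at the latest full update (surrogate of the table)

for k in range(120):
    z = s_tilde / c                      # step 2.1 with g = 0, Euclidean kernels
    if not K:                            # steps 2.2-2.5: full update
        K = set(range(N))
        z_bar = z
        s_tilde = sum(z / gamma[i] - grad_f(i, z) / N for i in range(N))
    else:                                # steps 2.7-2.10: incremental correction
        i = K.pop()                      # an index not sampled since the full update
        s_tilde += (z / gamma[i] - grad_f(i, z) / N) \
                 - (z_bar / gamma[i] - grad_f(i, z_bar) / N)

print("returned point z_bar:", z_bar)
```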

Since full gradient updates correspond to selecting all indices, Algorithm 2 may be viewed as Algorithm 1 with an essentially cyclic sampling rule of period $N$, a claim that will be formally shown in Lemma 4.12. In fact, not only does it naturally inherit all the convergence results, but its particular sampling strategy also allows us to waive convexity requirements on $g$ that are necessary for more general essentially cyclic rules.

1.2. Contributions. As a means to informally summarize the content of the paper, in Table 1 we synopsize the convergence results of the two algorithms.

1. To the best of our knowledge, this is the first analysis of an incremental aggregated method in a fully nonconvex setting and without Lipschitz differentiability assumptions. Our analysis is surprisingly simple, and yet it covers randomized and essentially cyclic samplings altogether, and relies on a sure descent property of the Bregman Moreau envelope (cf. Lemma 4.2).

2. We propose a novel low-memory variant of the (Bregman) Finito/MISO algorithm that, in the spirit of SVRG [34, 66] and SARAH [48], alternates between incremental steps and a full proximal gradient step. It is highly interesting even in the Euclidean case, as it can accommodate fully nonconvex formulations while maintaining an $O(n)$ memory requirement.


Property | Sampling | Requirements (additionally to Assumption I) | Claim | Reference
---------|----------|---------------------------------------------|-------|----------
$z^k$ bounded | any | $\varphi+\delta_C$ level bounded | sure | Lemma 4.2(iv)
$\varphi(z^k)$ convergent | S1 |  | a.s. | Theorem 4.6(ii)
 | S2 | $C=\mathbb R^n$; $g$ cvx; $h_i$ (loc) str cvx, smooth; ($\varphi$ level bounded) | sure | Theorem 4.9
 | LM |  | sure | Theorem 4.13
$\omega(z^k)$ stationary | S1 | either $C=\mathbb R^n$ | a.s. | Theorem 4.6(iv)
 |  | or $\operatorname{dom} h_i$ closed; $\varphi$ cvx | a.s. | Theorem 4.6(vi)
 | S2 | $C=\mathbb R^n$; $g$ cvx; $h_i$ (loc) str cvx, smooth; ($\varphi$ level bounded) | sure | Theorem 4.9
 | LM | $C=\mathbb R^n$ | sure | Theorem 4.13
$z^k$ convergent | S1 | either $C=\mathbb R^n$; $\varphi$ cvx | a.s. | Theorem 4.6(vii)
 |  | or Assumption II; $\varphi+\delta_C$ cvx, level bounded | a.s. | Theorem 4.6(vii)
 | S2 | Assumption III; $g$ cvx | sure | Theorem 4.11(i)
 | LM | Assumption III | sure | Theorem 4.14(i)
$\varphi(z^k)$ and $z^k$ linearly convergent | S1 | $C=\mathbb R^n$; $\varphi$ str cvx; $h_i$ loc smooth | E | Theorem 4.7
 | S2 | Assumption III; $\varphi$ kl$_{1/2}$; $g$ cvx | sure | Theorem 4.11(iii)
 | LM | Assumption III; $\varphi$ kl$_{1/2}$ | sure | Theorem 4.14(iii)

Table 1. Summary of the convergence results for Algorithm 1 with randomized sampling (S1) and essentially cyclic sampling (S2), and for the low-memory variant of Algorithm 2 (LM). Claims are either sure, almost sure (a.s.), or in expectation (E).

Other abbreviations: loc: locally; cvx: convex; str: strongly; smooth: Lipschitz differentiable; $\omega$: set of limit points; kl$_\theta$: Kurdyka-Łojasiewicz property with exponent $\theta$.

3. Linear convergence of Algorithm 1 in the randomized case is established when the cost function $\varphi$ is strongly convex, yet with no convexity requirement on $f_i$ or $g$. To the best of our knowledge, this is a novelty even in the Euclidean case, for all available results are bound to strong convexity of each term $f_i$ in the sum; see e.g., [28, 42, 44, 37, 50].

4. We leverage the Kurdyka-Łojasiewicz (KL) property to establish global (as opposed to subsequential) convergence as well as linear convergence, for Algorithm 1 with (essentially) cyclic sampling and for the low-memory Algorithm 2.

1.3. Organization. The problem setting is formally described in Section 2 together with a list of related definitions and known facts involving Bregman distances, relative smoothness, and the proximal mapping. Section 3 offers an alternative interpretation of Algorithm 1 as the block-coordinate Bregman proximal point Algorithm 3, which majorly simplifies the analysis, addressed in Section 4. Some auxiliary results are deferred to Appendices A and B. Section 5 applies the proposed algorithms to sparse phase retrieval problems, and Section 6 concludes the paper. We conclude this section by introducing some notational conventions.

1.4. Notation. The sets of real and extended-real numbers are $\mathbb R\coloneqq(-\infty,\infty)$ and $\overline{\mathbb R}\coloneqq\mathbb R\cup\{\infty\}$, while the positive and strictly positive reals are $\mathbb R_+\coloneqq[0,\infty)$ and $\mathbb R_{++}\coloneqq(0,\infty)$. With id we indicate the identity function $x\mapsto x$ defined on a suitable space. We denote by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ the standard Euclidean inner product and the induced norm. For a vector $w=(w_1,\dots,w_r)\in\mathbb R^{\sum_i n_i}$, $w_i\in\mathbb R^{n_i}$ is used to denote its $i$-th block coordinate. $\operatorname{int} E$ and $\operatorname{bdry} E$ respectively denote the interior and boundary of a set $E$, and for a sequence $(x^k)_{k\in\mathbb N}$ we write $(x^k)_{k\in\mathbb N}\subseteq E$ to indicate that $x^k\in E$ for all $k\in\mathbb N$. We say that $(x^k)_{k\in\mathbb N}$ converges at Q-linear rate (resp. R-linear rate) to a point $x$ if there exists $c\in(0,1)$ such that $\|x^{k+1}-x\|\le c\|x^k-x\|$ (resp. $\|x^k-x\|\le\rho c^k$ for some $\rho>0$) holds for all $k\in\mathbb N$.

We use the notation $H:\mathbb R^n\rightrightarrows\mathbb R^m$ to indicate a mapping from each point $x\in\mathbb R^n$ to a subset $H(x)$ of $\mathbb R^m$. The graph of $H$ is the set $\operatorname{gph} H\coloneqq\{(x,y)\in\mathbb R^n\times\mathbb R^m\mid y\in H(x)\}$. We say that $H$ is outer semicontinuous (osc) if $\operatorname{gph} H$ is a closed subset of $\mathbb R^n\times\mathbb R^m$, and locally bounded if for every bounded $U\subset\mathbb R^n$ the set $\bigcup_{x\in U}H(x)$ is bounded.

The domain and epigraph of an extended-real-valued function $h:\mathbb R^n\to\overline{\mathbb R}$ are the sets $\operatorname{dom} h\coloneqq\{x\in\mathbb R^n\mid h(x)<\infty\}$ and $\operatorname{epi} h\coloneqq\{(x,\alpha)\in\mathbb R^n\times\mathbb R\mid h(x)\le\alpha\}$. Function $h$ is said to be proper if $\operatorname{dom} h\ne\emptyset$, and lower semicontinuous (lsc) if $\operatorname{epi} h$ is a closed subset of $\mathbb R^{n+1}$. We say that $h$ is level bounded if its $\alpha$-sublevel set $\operatorname{lev}_{\le\alpha}h\coloneqq\{x\in\mathbb R^n\mid h(x)\le\alpha\}$ is bounded for all $\alpha\in\mathbb R$. The conjugate of $h$ is defined by $h^*(y)\coloneqq\sup_{x\in\mathbb R^n}\{\langle y,x\rangle-h(x)\}$. The indicator function of a set $E\subseteq\mathbb R^n$ is denoted by $\delta_E$, namely $\delta_E(x)=0$ if $x\in E$ and $\infty$ otherwise.

We denote by $\hat\partial h:\mathbb R^n\rightrightarrows\mathbb R^n$ the regular subdifferential of $h$, where
$$v\in\hat\partial h(\bar x)\quad\Longleftrightarrow\quad\liminf_{\substack{x\to\bar x\\ x\ne\bar x}}\ \frac{h(x)-h(\bar x)-\langle v,x-\bar x\rangle}{\|x-\bar x\|}\ge0.$$
A necessary condition for local minimality of $x$ for $h$ is $0\in\hat\partial h(x)$, see [53, Th. 10.1]. The (limiting) subdifferential of $h$ is $\partial h:\mathbb R^n\rightrightarrows\mathbb R^n$, where $v\in\partial h(x)$ iff $x\in\operatorname{dom} h$ and there exists a sequence $(x^k,v^k)_{k\in\mathbb N}\subseteq\operatorname{gph}\hat\partial h$ such that $(x^k,h(x^k),v^k)\to(x,h(x),v)$ as $k\to\infty$. Finally, the set of $r$ times continuously differentiable functions from $X$ to $\mathbb R$ is denoted by $C^r(X)$.
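For instance (a standard illustration, not taken from this paper), for $h(x)=|x|$ one has $\hat\partial h(0)=\partial h(0)=[-1,1]$, whereas for $h(x)=-|x|$ the regular subdifferential at the origin is empty while the limiting one is not:
$$\hat\partial(-|\cdot|)(0)=\emptyset,\qquad \partial(-|\cdot|)(0)=\{-1,1\},$$
the latter being obtained as limits of the gradients $\mp1$ attained along sequences $x^k\to0$ with $x^k\gtrless0$.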

2. Problem setting and preliminaries

Throughout this paper, problem (P) is studied under the following assumptions.

Assumption I (basic requirements). In problem (P),

a1: $f_i:\mathbb R^n\to\overline{\mathbb R}$ are $L_{f_i}$-smooth relative to Legendre kernels $h_i$ (Definitions 2.2 and 2.4);

a2: $g:\mathbb R^n\to\overline{\mathbb R}$ is proper and lower semicontinuous (lsc);

a3: a solution exists: $\arg\min\{\varphi(x)\mid x\in\overline C\}\ne\emptyset$;

a4: for given $\gamma_i\in(0,N/L_{f_i})$, $i\in[N]$, it holds that for any $s\in\mathbb R^n$
$$T(s)\coloneqq\arg\min_{w\in\mathbb R^n}\Bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\langle s,w\rangle\Bigr\}\subseteq C.\tag{2.1}$$

As it will become clear in Section 3, the subproblem (2.1) is in fact a reformulation of a (Bregman) proximal mapping. Assumption I.a4 excludes boundary points from $\operatorname{range} T$. This is a standard assumption that usually holds in practice [20, 56], e.g., when $g$ is convex or when the intersection of $\operatorname{dom} h_i$, $i\in[N]$, is an open set.

Definition 2.1 (Bregman distance). For a convex function $h:\mathbb R^n\to\overline{\mathbb R}$ that is continuously differentiable on $\operatorname{int}\operatorname{dom} h\ne\emptyset$, the Bregman distance $D_h:\mathbb R^n\times\mathbb R^n\to\overline{\mathbb R}$ is defined as
$$D_h(x,y)\coloneqq\begin{cases}h(x)-h(y)-\langle\nabla h(y),x-y\rangle & \text{if } y\in\operatorname{int}\operatorname{dom} h,\\ \infty & \text{otherwise.}\end{cases}\tag{2.2}$$
Function $h$ will be referred to as a distance-generating function.
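For instance (a small numerical aside, not part of the paper's development), with the Boltzmann-Shannon entropy kernel $h(x)=\sum_j(x_j\log x_j-x_j)$ on the positive orthant the Bregman distance specializes to the generalized Kullback-Leibler divergence; a direct implementation of (2.2) confirms this:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad_h(y), x - y>, for y in int dom h."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

# Boltzmann-Shannon entropy kernel on the positive orthant.
h = lambda x: np.sum(x * np.log(x) - x)
grad_h = lambda x: np.log(x)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.6, 0.3])
# Equals the generalized Kullback-Leibler divergence sum x*log(x/y) - x + y >= 0.
print(bregman(h, grad_h, x, y))
```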

Definition 2.2 (Legendre kernel). A proper, lsc, and strictly convex function $h:\mathbb R^n\to\overline{\mathbb R}$ with $\operatorname{int}\operatorname{dom} h\ne\emptyset$ and such that $h\in C^1(\operatorname{int}\operatorname{dom} h)$ is said to be a Legendre kernel if it is (i) 1-coercive, i.e., such that $\lim_{\|x\|\to\infty}h(x)/\|x\|=\infty$, and (ii) essentially smooth, i.e., if $\|\nabla h(x^k)\|\to\infty$ for every sequence $(x^k)_{k\in\mathbb N}\subseteq\operatorname{int}\operatorname{dom} h$ converging to a boundary point of $\operatorname{dom} h$.

Fact 2.3. The following hold for a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$, $x\in\mathbb R^n$, and $y,z\in\operatorname{int}\operatorname{dom} h$:

(i) $h^*\in C^1(\mathbb R^n)$ is strictly convex and $(\nabla h)^{-1}=\nabla h^*$ [52, Thm. 26.5 and Cor. 13.3.1].

(ii) $D_h(x,z)=D_h(x,y)+D_h(y,z)+\langle x-y,\nabla h(y)-\nabla h(z)\rangle$ [22, Lem. 3.1].

(iii) $D_h(y,z)=D_{h^*}(\nabla h(z),\nabla h(y))$ [8, Thm. 3.7(v)].

(iv) $D_h(\cdot,z)$ and $D_h(z,\cdot)$ are level bounded [9, Lem. 7.3(v)-(viii)].

(v) If $\operatorname{dom} h$ is closed and $D_h(x^k,y^k)\to0$ for some $x^k\in\operatorname{dom} h$ and $y^k\in\operatorname{int}\operatorname{dom} h$, then $(x^k)_{k\in\mathbb N}$ converges to a point $x$ iff so does $(y^k)_{k\in\mathbb N}$ [56, Thm. 2.4].

Moreover, for any convex set $U\subseteq\operatorname{int}\operatorname{dom} h$ and $u,v\in U$ the following hold:

(vi) If $h$ is $\mu_{h,U}$-strongly convex on $U$, then $\tfrac{\mu_{h,U}}2\|v-u\|^2\le D_h(v,u)\le\tfrac1{2\mu_{h,U}}\|\nabla h(v)-\nabla h(u)\|^2$.

(vii) If $\nabla h$ is $\ell_{h,U}$-Lipschitz on $U$, then $\tfrac1{2\ell_{h,U}}\|\nabla h(v)-\nabla h(u)\|^2\le D_h(v,u)\le\tfrac{\ell_{h,U}}2\|v-u\|^2$.

Definition 2.4 (relative smoothness [7]). We say that a proper, lsc function $f:\mathbb R^n\to\overline{\mathbb R}$ is smooth relative to a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$ if $\operatorname{dom} f\supseteq\operatorname{dom} h$, and there exists $L_f\ge0$ such that $L_fh\pm f$ are convex functions on $\operatorname{int}\operatorname{dom} h$. We will simply say that $f$ is $L_f$-smooth relative to $h$ to make the modulus $L_f$ explicit.
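A standard one-dimensional illustration (the numerical check below is only a sketch): $f(x)=x^4/4$ has gradient $x^3$, which is not globally Lipschitz, yet $f$ is 1-smooth relative to the Legendre kernel $h(x)=x^4/4+x^2/2$, since convexity of $L_fh\pm f$ amounts here to $L_fh''(x)\ge|f''(x)|$, i.e. $3x^2+1\ge3x^2$:

```python
import numpy as np

# f(x) = x**4/4 has gradient x**3 (not globally Lipschitz), while
# h(x) = x**4/4 + x**2/2 is a Legendre kernel on the whole real line.
# Relative smoothness with L_f = 1 amounts to convexity of L_f*h - f and
# L_f*h + f, i.e. L_f*h''(x) >= |f''(x)|, here 3x^2 + 1 >= 3x^2.
f2 = lambda x: 3.0 * x ** 2          # f''
h2 = lambda x: 3.0 * x ** 2 + 1.0    # h''
Lf = 1.0

x = np.linspace(-100.0, 100.0, 100001)
print(bool(np.all(Lf * h2(x) - np.abs(f2(x)) >= 0.0)))   # True
```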


Fact 2.5. Let $f:\mathbb R^n\to\overline{\mathbb R}$ be $L_f$-smooth relative to a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$. Then, $f\in C^1(\operatorname{int}\operatorname{dom} h)$ and the following hold:

(i) $\bigl|f(y)-f(x)-\langle\nabla f(x),y-x\rangle\bigr|\le L_fD_h(y,x)$ for all $x,y\in\operatorname{int}\operatorname{dom} h$.

(ii) $-L_f\nabla^2h\preceq\nabla^2f\preceq L_f\nabla^2h$ on $\operatorname{int}\operatorname{dom} h$, provided that $f,h\in C^2(\operatorname{int}\operatorname{dom} h)$.

(iii) If $\nabla h$ is Lipschitz continuous with modulus $\ell_{h,U}$ on a convex set $U$, then so is $\nabla f$ with modulus $\ell_{f,U}=L_f\ell_{h,U}$ [1, Prop. 2.5(ii)].

Relative to a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$, the Bregman proximal mapping of $\psi$ is the set-valued map $\operatorname{prox}^h_\psi:\operatorname{int}\operatorname{dom} h\rightrightarrows\mathbb R^n$ given by
$$\operatorname{prox}^h_\psi(x)\coloneqq\arg\min_{z\in\mathbb R^n}\{\psi(z)+D_h(z,x)\},\tag{2.3}$$
and the corresponding Bregman-Moreau envelope is $\psi^h:\mathbb R^n\to[-\infty,\infty]$ defined as
$$\psi^h(x)\coloneqq\inf_{z\in\mathbb R^n}\{\psi(z)+D_h(z,x)\}.\tag{2.4}$$
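As a toy illustration of these two objects (a case assumed for this sketch only, not one used in the paper), take $n=1$, the Boltzmann-Shannon kernel $h(x)=x\log x-x$, and $\psi(x)=x$: the optimality condition of (2.3) reads $1+\log z-\log x=0$, so $\operatorname{prox}^h_\psi(x)=xe^{-1}$, which a direct numerical minimization reproduces:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# One-dimensional toy case (assumed for this sketch): Boltzmann-Shannon kernel
# h(x) = x*log(x) - x on (0, inf) and psi(x) = x, lower bounded on dom h.
h = lambda x: x * np.log(x) - x
D = lambda z, x: h(z) - h(x) - np.log(x) * (z - x)      # Bregman distance D_h(z, x)
psi = lambda z: z

def bregman_prox_and_envelope(x):
    obj = lambda z: psi(z) + D(z, x)
    res = minimize_scalar(obj, bounds=(1e-12, 10.0), method="bounded")
    return res.x, res.fun        # prox^h_psi(x) and psi^h(x)

# Optimality condition: 1 + log(z) - log(x) = 0, i.e. prox^h_psi(x) = x*exp(-1).
x = 2.0
z_num, env = bregman_prox_and_envelope(x)
print(z_num, x * np.exp(-1.0), env)
```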

Fact 2.6 (regularity properties of $\operatorname{prox}^h_\psi$ and $\psi^h$ [35]). The following hold for a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$ and a proper, lsc, lower bounded function $\psi:\mathbb R^n\to\overline{\mathbb R}$:

(i) $\operatorname{prox}^h_\psi$ is locally bounded, compact-valued, and outer semicontinuous on $\operatorname{int}\operatorname{dom} h$.

(ii) $\psi^h$ is real-valued and continuous on $\operatorname{int}\operatorname{dom} h$; in fact, it is locally Lipschitz if so is $\nabla h$.

Fact 2.7 (relation between $\psi$ and $\psi^h$). Let $h$ be a Legendre kernel and $\psi:\mathbb R^n\to\overline{\mathbb R}$ be proper, lsc, and lower bounded on $\operatorname{dom} h$. Then, for every $x\in\operatorname{int}\operatorname{dom} h$, $y\in\operatorname{dom} h$, and $\bar x\in\operatorname{prox}^h_\psi(x)$

(i) $\psi^h(x)\overset{\rm def}{=}\psi(\bar x)+D_h(\bar x,x)\le\psi(y)+D_h(y,x)$, and in particular $\psi^h(x)\le\psi(x)$;

(ii) if $\psi$ is convex, then $\psi^h(x)\le\psi(y)+D_h(y,x)-D_h(y,\bar x)$ [59, Lem. 3.1].

Moreover, if $\operatorname{range}\operatorname{prox}^h_\psi\subseteq\operatorname{int}\operatorname{dom} h$, then the following also hold [1, Prop. 3.3]:

(iii) $\inf_{\operatorname{dom} h}\psi\le\inf_{\operatorname{int}\operatorname{dom} h}\psi=\inf\psi^h$ and $\arg\min\psi^h=\arg\min_{\operatorname{int}\operatorname{dom} h}\psi$.

(iv) $\psi+\delta_{\operatorname{dom} h}$ is level bounded iff so is $\psi^h$.

3. A block-coordinate interpretation

By introducing $N$ copies of $x$, problem (P) can equivalently be written as
$$\operatorname*{minimize}_{\boldsymbol x=(x_1,\dots,x_N)\in\mathbb R^{nN}}\ \Phi(\boldsymbol x)=\underbrace{\tfrac1N\sum_{i=1}^Nf_i(x_i)}_{F(\boldsymbol x)}+\underbrace{\tfrac1N\sum_{i=1}^Ng(x_i)+\delta_\Delta(\boldsymbol x)}_{G(\boldsymbol x)}\quad\text{subject to } \boldsymbol x\in\overline C\times\cdots\times\overline C,\tag{3.1}$$
where $\Delta\coloneqq\{\boldsymbol x=(x_1,\dots,x_N)\in\mathbb R^{nN}\mid x_1=x_2=\cdots=x_N\}$ is the consensus set. The equivalence between (3.1) and the original problem (P) is formally established in Lemma A.1. Note that Assumption I.a1 implies that $F$ as in (3.1) is smooth with respect to the Legendre kernel
$$H:\mathbb R^{nN}\to\overline{\mathbb R}\ \text{ defined as }\ H(\boldsymbol x)=\sum_{i=1}^Nh_i(x_i),\tag{3.2}$$
making Bregman forward-backward iterations $\boldsymbol x^+\in\arg\min\{\langle\nabla F(\boldsymbol x),\cdot\,\rangle+G(\cdot)+\tfrac1\gamma D_H(\cdot,\boldsymbol x)\}$ for some stepsize $\gamma>0$ a suitable option to address problem (3.1). In fact, it can be easily verified that $L_F=\tfrac1N\max_{i=1\dots N}L_{f_i}$ is a smoothness modulus of $F$ relative to $H$, indicating that fixed point iterations $\boldsymbol x\leftarrow\boldsymbol x^+$ under Assumption I converge (in some sense to be made precise) to a stationary point of the problem whenever $\gamma\in(0,1/L_F)$. Notice that a higher degree of flexibility can be granted by considering an $N$-tuple of individual stepsizes $\Gamma=(\gamma_1,\dots,\gamma_N)$, giving rise to the forward-backward operator $T^\Gamma_{F,G}:\mathbb R^{nN}\rightrightarrows\mathbb R^{nN}$ in the Bregman metric $(\boldsymbol z,\boldsymbol x)\mapsto\sum_{i=1}^N\tfrac1{\gamma_i}D_{h_i}(z_i,x_i)$, namely
$$T^\Gamma_{F,G}(\boldsymbol x)\coloneqq\arg\min_{\boldsymbol z\in\mathbb R^{nN}}\Bigl\{F(\boldsymbol x)+\langle\nabla F(\boldsymbol x),\boldsymbol z-\boldsymbol x\rangle+G(\boldsymbol z)+\sum_{i=1}^N\tfrac1{\gamma_i}D_{h_i}(z_i,x_i)\Bigr\}.\tag{3.3}$$
This intuition is validated in the next result, which asserts that whenever the stepsizes $\gamma_i$ are selected as in Algorithm 1 the operator $T^\Gamma_{F,G}$ coincides with a proximal mapping on a suitable Legendre kernel function $\hat H$. This observation leads to a much simpler analysis of Algorithm 1, which will be shown to be a block-coordinate variant of a Bregman proximal point method.


Lemma 3.1. Suppose that Assumption I.a1 holds and let $\gamma_i\in(0,N/L_{f_i})$ be selected as in Algorithm 1. Then, $\hat h_i\coloneqq\tfrac1{\gamma_i}h_i-\tfrac1Nf_i$ (with the convention $\infty-\infty=\infty$) is a Legendre kernel with $\operatorname{dom}\hat h_i=\operatorname{dom} h_i$, $i\in[N]$, and thus so is the function
$$\hat H:\mathbb R^{nN}\to\overline{\mathbb R}\ \text{ defined as }\ \hat H(\boldsymbol x)=\sum_{i=1}^N\hat h_i(x_i).\tag{3.4}$$
Moreover, for any $(\boldsymbol z,\boldsymbol x)\in\mathbb R^{nN}\times\mathbb R^{nN}$ it holds that
$$\Phi(\boldsymbol z)+D_{\hat H}(\boldsymbol z,\boldsymbol x)=F(\boldsymbol x)+\langle\nabla F(\boldsymbol x),\boldsymbol z-\boldsymbol x\rangle+G(\boldsymbol z)+\sum_{i=1}^N\tfrac1{\gamma_i}D_{h_i}(z_i,x_i),\tag{3.5}$$
and in particular the forward-backward operator (3.3) satisfies
$$T^\Gamma_{F,G}(\boldsymbol x)=\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x).\tag{3.6}$$
When Assumption I is satisfied, then the following also hold:

(i) $D_{\hat H}(\boldsymbol z,\boldsymbol x)\ge\sum_{i=1}^N\bigl(\tfrac1{\gamma_i}-\tfrac{L_{f_i}}N\bigr)D_{h_i}(z_i,x_i)$.

(ii) $\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)=\bigl\{(z,\dots,z)\mid z\in T\bigl(\sum_{i=1}^N\nabla\hat h_i(x_i)\bigr)\bigr\}$, with $T$ as in (2.1), is a nonempty and compact subset of $C\times\cdots\times C$ for any $\boldsymbol x\in\operatorname{int}\operatorname{dom} h_1\times\cdots\times\operatorname{int}\operatorname{dom} h_N$.

(iii) If $\boldsymbol z\in\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)$, then $\nabla\hat H(\boldsymbol x)-\nabla\hat H(\boldsymbol z)\in\hat\partial\Phi(\boldsymbol z)$; the converse also holds when $\Phi$ is convex.

(iv) If $\nabla h_i$ is $\ell_{h_i,U_i}$-Lipschitz on a convex set $U_i\subseteq\operatorname{int}\operatorname{dom} h_i$, then $\nabla\hat h_i$ is $\ell_{\hat h_i,U_i}$-Lipschitz on $U_i$ with $\ell_{\hat h_i,U_i}\le\bigl(\tfrac1{\gamma_i}+\tfrac{L_{f_i}}N\bigr)\ell_{h_i,U_i}$. If, in addition, $f_i-\tfrac{\mu_{f_i,U_i}}2\|\cdot\|^2$ is convex on $U_i$ for some $\mu_{f_i,U_i}\in\mathbb R$, then $\ell_{\hat h_i,U_i}\le\tfrac{\ell_{h_i,U_i}}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}N$.

(v) If $h_i$ is $\mu_{h_i,U_i}$-strongly convex on a convex set $U_i\subseteq\operatorname{dom} h_i$, then $\hat h_i$ is $\mu_{\hat h_i,U_i}$-strongly convex on $U_i$ with $\mu_{\hat h_i,U_i}\ge\bigl(\tfrac1{\gamma_i}-\tfrac{L_{f_i}}N\bigr)\mu_{h_i,U_i}$.

Proof. The claims on $\hat h_i$ are shown in [1, Thm. 4.1], and (3.5) and (3.6) then easily follow.

♠ 3.1(i): This is an immediate consequence of Fact 2.5(i).

♠ 3.1(ii): Let $\boldsymbol x$ be as in the statement, and observe that $\boldsymbol x\in\operatorname{int}\operatorname{dom}\hat H$; nonemptiness and compactness of $\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)$ then follow from Fact 2.6(i). Let now $\boldsymbol u\in\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)$ be fixed, and note that the consensus constraint encoded in $\Phi$ ensures that $u_i=u_j$ for all $i,j\in[N]$. Thus,
$$
\begin{aligned}
u_i&\in\arg\min_{w\in\mathbb R^n}\bigl\{\Phi(w,\dots,w)+\hat H(w,\dots,w)-\langle\nabla\hat H(\boldsymbol x),(w,\dots,w)\rangle\bigr\}\\
&=\arg\min_{w\in\mathbb R^n}\Bigl\{\tfrac1N\textstyle\sum_{i=1}^Nf_i(w)+g(w)+\sum_{i=1}^N\bigl[\hat h_i(w)-\langle\nabla\hat h_i(x_i),w\rangle\bigr]\Bigr\}\\
&=\arg\min_{w\in\mathbb R^n}\Bigl\{g(w)+\textstyle\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\bigl\langle\sum_{i=1}^N\nabla\hat h_i(x_i),w\bigr\rangle\Bigr\}
\overset{(2.1)}{=}T\Bigl(\textstyle\sum_{i=1}^N\nabla\hat h_i(x_i)\Bigr)\subseteq C,
\end{aligned}
$$
where the inclusion follows from Assumption I.a4.

♠ 3.1(iii): Observe first that necessarily $\boldsymbol x\in\operatorname{int}\operatorname{dom} h_1\times\cdots\times\operatorname{int}\operatorname{dom} h_N$, for otherwise no such $\boldsymbol z$ exists. Moreover, from assertion 3.1(ii) it follows that also $\boldsymbol z$ belongs to such open set, onto which $\hat H$ is continuously differentiable. The claim then follows from the necessary condition for optimality of $\boldsymbol z$ in the minimization problem (2.4) — which is also sufficient when $\Phi$ is convex, for so is $\Phi+D_{\hat H}(\cdot,\boldsymbol x)$ in this case — having
$$0\in\hat\partial[\Phi+D_{\hat H}(\cdot,\boldsymbol x)](\boldsymbol z)=\hat\partial[\Phi+\hat H-\langle\nabla\hat H(\boldsymbol x),\cdot\,\rangle](\boldsymbol z)=\hat\partial\Phi(\boldsymbol z)+\nabla\hat H(\boldsymbol z)-\nabla\hat H(\boldsymbol x).$$
The last equality follows from [53, Ex. 8.8(c)], owing to smoothness of $\hat H$ at $\boldsymbol z$.

♠ 3.1(iv) and 3.1(v): Observe that
$$
\tfrac{N-\gamma_iL_{f_i}}{N\gamma_i}\,h_i
\;\preceq\;
\tfrac{N-\gamma_iL_{f_i}}{N\gamma_i}\,h_i+\underbrace{\tfrac1N\bigl(L_{f_i}h_i-f_i\bigr)}_{\text{convex}}
\;=\;\hat h_i\;=\;
\tfrac{N+\gamma_iL_{f_i}}{N\gamma_i}\,h_i-\underbrace{\tfrac1N\bigl(L_{f_i}h_i+f_i\bigr)}_{\text{convex}}
\;\preceq\;
\tfrac{N+\gamma_iL_{f_i}}{N\gamma_i}\,h_i,
\tag{3.7}
$$
where for notational convenience we used the partial ordering "$\preceq$", defined as $\alpha\preceq\beta$ iff $\beta-\alpha$ is convex. The claimed moduli $\ell_{\hat h_i,U_i}\le\bigl(\tfrac1{\gamma_i}+\tfrac{L_{f_i}}N\bigr)\ell_{h_i,U_i}$ and $\mu_{\hat h_i,U_i}\ge\bigl(\tfrac1{\gamma_i}-\tfrac{L_{f_i}}N\bigr)\mu_{h_i,U_i}$ are thus readily inferred. In case $f_i$ is $\mu_{f_i,U_i}$-strongly convex on $U_i$, we may write
$$\hat h_i=\tfrac{h_i}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}{2N}\|\cdot\|^2-\underbrace{\tfrac1N\bigl(f_i-\tfrac{\mu_{f_i,U_i}}2\|\cdot\|^2\bigr)}_{\text{convex}}\;\preceq\;\tfrac{h_i}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}{2N}\|\cdot\|^2$$
to obtain the tighter bound $\ell_{\hat h_i,U_i}\le\tfrac{\ell_{h_i,U_i}}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}N$. ∎

Algorithm 3 Block-coordinate proximal point formulation of Algorithm 1

Require: Legendre kernels $h_i$ such that $f_i$ is $L_{f_i}$-smooth relative to $h_i$; stepsizes $\gamma_i\in(0,N/L_{f_i})$; initial point $x^{\rm init}\in\bigcap_{i=1}^N\operatorname{int}\operatorname{dom} h_i=C$
Denote $\boldsymbol x^0=(x^{\rm init},\dots,x^{\rm init})$, $\hat h_j\coloneqq\tfrac1{\gamma_j}h_j-\tfrac1Nf_j$, $\hat H(\boldsymbol x)\coloneqq\sum_{i=1}^N\hat h_i(x_i)$
Repeat for $k=0,1,\dots$ until convergence:
  3.1: $\boldsymbol u^k\in\arg\min_{\boldsymbol w\in\mathbb R^{nN}}\bigl\{\Phi(\boldsymbol w)+\hat H(\boldsymbol w)-\langle\nabla\hat H(\boldsymbol x^k),\boldsymbol w\rangle\bigr\}=\arg\min_{\boldsymbol w\in\mathbb R^{nN}}\bigl\{\Phi(\boldsymbol w)+D_{\hat H}(\boldsymbol w,\boldsymbol x^k)\bigr\}$
  3.2: Select a subset of indices $J^{k+1}\subseteq[N]$
  3.3: $\boldsymbol x^{k+1}_{J^{k+1}}=\boldsymbol u^k_{J^{k+1}}$ and $\boldsymbol x^{k+1}_{[N]\setminus J^{k+1}}=\boldsymbol x^k_{[N]\setminus J^{k+1}}$
Return $\boldsymbol u^k$

3.1. Block-coordinate proximal point reformulation of Algorithm 1. Algorithm 3 presents a block coordinate (BC) proximal point algorithm with the distance-generating function $\hat H$. Note that, in a departure from most of the existing literature on BC proximal methods that consider separable nonsmooth terms (see e.g., [63, 47, 12, 19, 31]), here the nonsmooth function $G$ in (3.1) is nonseparable. It is shown in the next lemma that this conceptual algorithm is equivalent to the Bregman Finito/MISO Algorithm 1.

Lemma 3.2 (equivalence of Algorithms 1 and 3). As long as the same initialization parameters are chosen in the two algorithms, to any sequence $(\boldsymbol s^k,\tilde s^k,z^k,I^{k+1})_{k\in\mathbb N}$ generated by Algorithm 1 there corresponds a sequence $(\boldsymbol x^k,\boldsymbol u^k,J^{k+1})_{k\in\mathbb N}$ generated by Algorithm 3 (and vice versa) satisfying the following identities for all $k\in\mathbb N$ and $i\in[N]$:

(i) $I^{k+1}=J^{k+1}$;

(ii) $(z^k,\dots,z^k)=\boldsymbol u^k$;

(iii) $s^k_i=\tfrac1{\gamma_i}\nabla h_i(x^k_i)-\tfrac1N\nabla f_i(x^k_i)$ (or, equivalently, $x^k_i=\nabla\hat h_i^*(s^k_i)$);

(iv) $\tilde s^k=\sum_{i=1}^N\nabla\hat h_i(x^k_i)$;

(v) $\varphi(z^k)=\Phi(\boldsymbol u^k)=\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)$;

(vi) $\Phi^{\hat H}(\boldsymbol x^k)=\mathcal L(z^k,\boldsymbol s^k)$, where $\mathcal L$ is as in (1.1).

Proof. Let the index sets $I^{k+1}$ and $J^{k+1}$ be chosen identically, $k\in\mathbb N$. It follows from Lemma 3.1(ii) that $u^k_i=u^k_j$ for all $k\in\mathbb N$ and $i,j\in[N]$, with
$$u^k_i\in\arg\min_{w\in\mathbb R^n}\Bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\bigl\langle\underbrace{\textstyle\sum_{i=1}^N\nabla\hat h_i(x^k_i)}_{\eqqcolon v^k},w\bigr\rangle\Bigr\}.\tag{3.8}$$
We now proceed by induction to show assertions 3.2(ii), 3.2(iii), and 3.2(iv). Note that the latter amounts to showing that $v^k$ as defined in (3.8) coincides with $\tilde s^k$; by comparing (3.8) and the expression of $z^k$ in step 1.1, the claimed correspondence of $u^k$ and $z^k$ as in assertion 3.2(ii) is then also obtained and, in turn, so is the identity in 3.2(v).

For $k=0$, assertions 3.2(iii) and 3.2(iv) hold because of the initialization of $\tilde s^0$ in Algorithm 1 and of $\boldsymbol x^0$ in Algorithm 3; in turn, as motivated above, the base case for assertion 3.2(ii) also holds. Suppose now that the three assertions hold for some $k\ge0$; then,
$$v^{k+1}=\sum_{i=1}^N\nabla\hat h_i(x^{k+1}_i)=\sum_{i\in I^{k+1}}\nabla\hat h_i(u^k_i)+v^k-\sum_{i\in I^{k+1}}\nabla\hat h_i(x^k_i)\overset{\text{(induction)}}{=}\sum_{i\in I^{k+1}}\nabla\hat h_i(z^k)+\tilde s^k-\sum_{i\in I^{k+1}}s^k_i=\sum_{i\in I^{k+1}}s^{k+1}_i+\tilde s^k-\sum_{i\in I^{k+1}}s^k_i=\tilde s^{k+1},$$
where the last two equalities are due to steps 1.2 and 1.3. Therefore, $v^{k+1}=\tilde s^{k+1}$ and thus $\boldsymbol u^{k+1}=(z^{k+1},\dots,z^{k+1})$. It remains to show that $s^{k+1}_i=\tfrac1{\gamma_i}\nabla h_i(x^{k+1}_i)-\tfrac1N\nabla f_i(x^{k+1}_i)$. For $i\in I^{k+1}$ this holds because of the update rule at step 1.2 and the fact that $x^{k+1}_i=u^k_i=z^k$ owing to step 3.3. For $i\notin I^{k+1}$ this holds because $(x^{k+1}_i,s^{k+1}_i)=(x^k_i,s^k_i)$. Finally,
$$\Phi^{\hat H}(\boldsymbol x^k)\overset{\rm def}{=}\Phi(\boldsymbol u^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)\overset{3.2(v)}{=}\varphi(z^k)+\sum_{i=1}^ND_{\hat h_i}(z^k,x^k_i)\overset{3.2(iii)}{=}\varphi(z^k)+\sum_{i=1}^ND_{\hat h_i}\bigl(z^k,\nabla\hat h_i^*(s^k_i)\bigr),$$
and the last term is $\mathcal L(z^k,\boldsymbol s^k)$ (cf. Facts 2.3(i) and 2.3(iii)), yielding assertion 3.2(vi). ∎

4. Convergence analysis

The block coordinate interpretation of Algorithm 1 presented in Section 3 plays a crucial role in the proposed methodology, and leads to a remarkably simple convergence analysis. In fact, many key facts can be established without confining the discussion to a particular sampling strategy. These preliminary results are presented in the next subsection and will be extensively referred to in the subsequent subsections that are instead devoted to a specific sampling strategy.

4.1. General sampling results. Unlike classical analyses of BC proximal methods that employ the cost as a Lyapunov function (see e.g., [10, §11]), here the nonseparability of $G$ precludes this possibility. To address this challenge, we instead employ the Bregman Moreau envelope equipped with the distance-generating function $\hat H$ (see (3.4)). Before showing its Lyapunov-type behavior for Algorithm 3, we list some of its properties and its relation with the original problem. The proof is a simple consequence of Facts 2.6(ii) and 2.7 and the fact that $\hat H$ is a Legendre kernel with $\operatorname{dom}\hat H=\operatorname{dom} h_1\times\cdots\times\operatorname{dom} h_N$ (cf. Lemma 3.1).

Lemma 4.1 (connections between $\varphi+\delta_C$ and $\Phi^{\hat H}$). Suppose that Assumption I holds. Then,

(i) $\Phi^{\hat H}$ is continuous on $\operatorname{dom}\Phi^{\hat H}=\operatorname{int}\operatorname{dom} h_1\times\cdots\times\operatorname{int}\operatorname{dom} h_N$; in fact, it is locally Lipschitz if so is $\nabla h_i$ on $\operatorname{int}\operatorname{dom} h_i$, $i\in[N]$;

(ii) $\min_{\overline C}\varphi\le\inf_C\varphi=\inf\Phi^{\hat H}$ and $\arg\min\Phi^{\hat H}=\bigl\{(x^\star,\dots,x^\star)\mid x^\star\in\arg\min_C\varphi\bigr\}$;

(iii) $\Phi^{\hat H}$ is level bounded iff so is $\varphi+\delta_C$.

Lemma 4.2 (sure descent). Suppose that Assumption I holds, and consider the iterates generated by Algorithm 3. Then, $\boldsymbol u^k=(u^k,\dots,u^k)$ for some $u^k\in C$ and $\boldsymbol x^k\in C\times\cdots\times C\subseteq\operatorname{int}\operatorname{dom}\hat H$ for every $k\in\mathbb N$, and the algorithm is thus well defined. Moreover, the following hold:

(i) $\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)=\Phi^{\hat H}(\boldsymbol x^k)-\sum_{i\in J^{k+1}}D_{\hat h_i}(u^k,x^k_i)$ for every $k\in\mathbb N$; when $\Phi$ is convex (i.e., when so is $\varphi$), the inequality can be strengthened to $\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)-D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1})$.

(ii) $(\Phi^{\hat H}(\boldsymbol x^k))_{k\in\mathbb N}$ monotonically decreases to a finite value $\varphi^\star\ge\inf_C\varphi\ge\min_{\overline C}\varphi$.

(iii) The sequence $(D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k))_{k\in\mathbb N}$ has finite sum (and in particular vanishes); the same holds also for $(D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1}))_{k\in\mathbb N}$ when $\Phi$ is convex (i.e., when so is $\varphi$).

(iv) If $\varphi+\delta_C$ is level bounded, then $(\boldsymbol x^k)_{k\in\mathbb N}$ and $(\boldsymbol u^k)_{k\in\mathbb N}$ are bounded.

(v) If $\operatorname{dom} h_i$ is closed, a subsequence $(x^k_i)_{k\in K}$ converges to a point $x^\star$ iff so does $(x^{k+1}_i)_{k\in K}$.

(vi) If $C=\mathbb R^n$, then $\Phi^{\hat H}$ is constant (and equals $\varphi^\star$ as above) on the set of accumulation points of $(\boldsymbol x^k)_{k\in\mathbb N}$.

Proof. It follows from Lemma 3.1(ii) that $u^k\in C$ holds for every $k\in\mathbb N$. Notice that for every $i\in[N]$ and $k\in\mathbb N$, either $x^k_i=x^{\rm init}\in C$ (by initialization), or there exists $k_i\le k$ such that $x^k_i=z^{k_i}\in C$. It readily follows that $\boldsymbol x^k\in C\times\cdots\times C\subseteq\operatorname{int}\operatorname{dom} H=\operatorname{int}\operatorname{dom}\hat H$, hence that $\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x^k)\ne\emptyset$ for all $k\in\mathbb N$ by Lemma 3.1(ii), whence the well definedness of the algorithm. We now show the numbered claims.

♠ 4.2(i): It follows from Facts 2.7(i) and 2.7(ii) that $\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi(\boldsymbol u^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^{k+1})-c_k$, where $c_k\ge0$ can be taken as $c_k=D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1})$ when $\Phi$ is convex. Therefore,
$$\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi(\boldsymbol u^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^{k+1})-c_k=\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^{k+1})-c_k\overset{2.3(ii)}{=}\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)-\langle\boldsymbol u^k-\boldsymbol x^{k+1},\nabla\hat H(\boldsymbol x^{k+1})-\nabla\hat H(\boldsymbol x^k)\rangle-c_k.$$
The claim follows by noting that the inner product is zero:
$$\langle\boldsymbol u^k-\boldsymbol x^{k+1},\nabla\hat H(\boldsymbol x^{k+1})-\nabla\hat H(\boldsymbol x^k)\rangle=\sum_{j\in[N]}\bigl\langle u^k-\underbrace{x^{k+1}_j}_{=u^k\text{ for }j\in J^{k+1}},\ \nabla\hat h_j\bigl(\underbrace{x^{k+1}_j}_{=x^k_j\text{ for }j\notin J^{k+1}}\bigr)-\nabla\hat h_j(x^k_j)\bigr\rangle=0.$$

♠ 4.2(ii): Monotonic decrease of $(\Phi^{\hat H}(\boldsymbol x^k))_{k\in\mathbb N}$ follows from assertion 4.2(i). This ensures that the sequence converges to some value $\varphi^\star$, bounded below by $\min_{\overline C}\varphi$ in light of Lemma 4.1(ii).

♠ 4.2(iii): It follows from assertion 4.2(i) that
$$\sum_{k\in\mathbb N}D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)\le\Phi^{\hat H}(\boldsymbol x^0)-\inf\Phi^{\hat H}\le\Phi^{\hat H}(\boldsymbol x^0)-\inf_C\varphi<\infty$$
owing to Lemma 4.1(ii) and Assumption I.a3. When $\varphi$ is convex, the tighter bound in assertion 4.2(i) yields the similar claim for $(D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1}))_{k\in\mathbb N}$.

♠ 4.2(iv): It follows from assertion 4.2(ii) that the entire sequence $(\boldsymbol x^k)_{k\in\mathbb N}$ is contained in the sublevel set $\{\boldsymbol w\mid\Phi^{\hat H}(\boldsymbol w)\le\Phi^{\hat H}(\boldsymbol x^0)\}$, which is bounded provided that $\varphi+\delta_C$ is level bounded, as shown in Lemma 4.1(iii). In turn, boundedness of $(\boldsymbol u^k)_{k\in\mathbb N}$ then follows from local boundedness of $T^\Gamma_{F,G}=\operatorname{prox}^{\hat H}_\Phi$, cf. (3.6) and Fact 2.6(i).

♠ 4.2(v): Follows from Fact 2.3(v), since $x^k_i\in\operatorname{int}\operatorname{dom} h_i=\operatorname{int}\operatorname{dom}\hat h_i$ for every $k$ (with equality owing to Lemma 3.1), and $D_{\hat h_i}(x^{k+1}_i,x^k_i)\to0$ by assertion 4.2(iii).

♠ 4.2(vi): Follows from assertion 4.2(ii) and the continuity of $\Phi^{\hat H}$, see Fact 2.6(ii). ∎

In conclusion of this subsection we provide an overview of the ingredients that are needed to show that the limit points of the sequence $(z^k)_{k\in\mathbb N}$ generated by Algorithm 1 are stationary for problem (P). As will be shown in Lemma 4.4, these amount to the vanishing of the residual $D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)$ together with some assumptions on the distance-generating functions $h_i$. For the iterates of Algorithm 1, this translates to $D_{\hat h_i^*}(s^k_i,\nabla\hat h_i(z^k))\to0$ for all indices $i\in[N]$, indicating that all vectors $s^{k+1}_i$ in the table should be good estimates of $\nabla\hat h_i(z^{k+1})=\tfrac1{\gamma_i}\nabla h_i(z^{k+1})-\tfrac1N\nabla f_i(z^{k+1})$, as opposed to $\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\nabla f_i(z^k)$ and for the indices in $I^{k+1}$ only (cf. step 1.2). As a result, we may view this property as jointly having $z^k-z^{k+1}$ vanish, desirable if any convergence of $(z^k)_{k\in\mathbb N}$ is expected, and the fact that a consensus is eventually reached among the sampled blocks.

In line with any result in the literature we are aware of, a complete convergence analysis for nonconvex problems will ultimately require $C=\mathbb R^n$. For convex problems, that is, when the cost function $\varphi$ is convex without any among $f_i$ and $g$ being necessarily so, the following requirement will instead suffice for our purposes in the randomized sampling setting of (S1).

Assumption II (requirements on the distance-generating functions). For $i\in[N]$, $\operatorname{dom} h_i$ is closed, and whenever $\operatorname{int}\operatorname{dom} h_i\ni z^k\to z\in\operatorname{bdry}\operatorname{dom} h_i$ it holds that $D_{h_i}(z,z^k)\to0$.

Remark 4.3.

(i) Assumption II is vacuously satisfied when $\operatorname{dom} h_i=\mathbb R^n$, having $\operatorname{bdry}\mathbb R^n=\emptyset$.

(ii) For any $i\in[N]$, function $h_i$ complies with Assumption II iff so does $\hat h_i$, owing to the inequalities $\tfrac{N-\gamma_iL_{f_i}}{N\gamma_i}D_{h_i}\le D_{\hat h_i}\le\tfrac{N+\gamma_iL_{f_i}}{N\gamma_i}D_{h_i}$ (cf. (3.7)).

Lemma 4.4 (subsequential convergence recipe). Suppose that Assumption I holds, and consider the iterates generated by Algorithm 1. Let $x^k_i=\nabla\hat h_i^*(s^k_i)$ and $z^k=u^k$ be the corresponding iterates generated by Algorithm 3 as in Lemma 3.2, and suppose that

a1: $D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)\to0$ (or, equivalently, $D_{\hat h_i^*}(s^k_i,\nabla\hat h_i(z^k))\to0$, $i\in[N]$).

Then, letting $\varphi^\star$ be as in Lemma 4.2(ii), the following hold:

(i) $\varphi(z^k)=\Phi(\boldsymbol u^k)\to\varphi^\star$ as $k\to\infty$.

(ii) If $\operatorname{dom} h_i$ is closed, $i\in[N]$, then having (a) $(z^k)_{k\in K}\to z$, (b) $(x^k_i)_{k\in K}\to z$ for some $i\in[N]$, and (c) $(z^{k+1},x^{k+1}_i)_{k\in K}\to(z,z)$ for all $i\in[N]$, are all equivalent conditions. In particular, if $(z^k)_{k\in\mathbb N}$ is bounded (e.g., when $\varphi+\delta_C$ is level bounded), then $\|z^{k+1}-z^k\|\to0$ holds, and the set of its limit points, be it $\omega$, is thus nonempty, compact, and connected.

(iii) Under Assumption II, $\varphi\equiv\varphi^\star$ on $\omega$ (the set of limit points of $(z^k)_{k\in\mathbb N}$).
