Bregman Finito/MISO for nonconvex regularized finite sum minimization without Lipschitz gradient continuity

PUYA LATAFAT1, ANDREAS THEMELIS2, MASOUD AHOOKHOSH3, AND PANAGIOTIS PATRINOS1

Abstract. We introduce two algorithms for nonconvex regularized finite sum minimization, where typical Lipschitz differentiability assumptions are relaxed to the notion of relative smoothness [7]. The first one is a Bregman extension of Finito/MISO [28,42], studied for fully nonconvex problems when the sampling is random, or under convexity of the nonsmooth term when it is essentially cyclic. The second algorithm is a low-memory variant, in the spirit of SVRG [34] and SARAH [48], that also allows for fully nonconvex formulations. Our analysis is made remarkably simple by employing a Bregman Moreau envelope as Lyapunov function. In the randomized case, linear convergence is established when the cost function is strongly convex, yet with no convexity requirements on the individual functions in the sum. For the essentially cyclic and low-memory variants, global and linear convergence results are established when the cost function satisfies the Kurdyka-Łojasiewicz property.

1. Introduction

We consider the following regularized finite sum minimization
$$\operatorname*{minimize}_{x\in\mathbb{R}^n}\ \varphi(x) \coloneqq \tfrac{1}{N}\sum_{i=1}^{N} f_i(x) + g(x) \quad\text{subject to } x\in\overline{C}, \tag{P}$$
where $\overline{C}$ denotes the closure of $C \coloneqq \bigcap_{i=1}^{N}\operatorname{int}\operatorname{dom} h_i$, for some convex functions $h_i$, $i\in[N]\coloneqq\{1,\dots,N\}$. Our goal in this paper is to study such problems without imposing convexity assumptions on $f_i$ and $g$, and in a setting where the $f_i$ are differentiable but their gradients need not be Lipschitz continuous. Our full setting is formalized in Assumption I.

To relax the Lipschitz differentiability assumption, we adopt the notion of smoothness relative to a distance-generating function [7], and following [40] we will use the terminology of relative smoothness. Despite the lack of Lipschitz differentiability, in many applications the involved functions satisfy a descent property where the usual quadratic upper bound is replaced by a Bregman distance (cf. Fact 2.5(i) and Definition 2.1). Owing to this property, Bregman extensions for many classical schemes have been proposed [7, 40, 6, 59, 49, 1].

In the setting of finite sum minimization, the incremental aggregated algorithm PLIAG was proposed recently [69] as a Bregman variant of the incremental aggregated gradient method [15, 16, 65]. The analysis of PLIAG is limited to the convex case and requires restrictive assumptions for the Bregman kernel [69, Thm. 4.1, Assump. B4]. Stochastic mirror descent (SMD) is another relevant algorithm which can tackle more general stochastic optimization problems. SMD may be viewed as a Bregman extension of the stochastic (sub)gradient method and has long been studied [46, 61, 11, 45]. More recently, [32] studied SMD for convex and relatively smooth formulations, and (sub)gradient versions have been analyzed under relative continuity in a convex setting [39], as well as relative weak convexity [71, 25].

1Department of Electrical Engineering (ESAT-STADIUS) – KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

2Faculty of Information Science and Electrical Engineering (ISEE) – Kyushu University, 744 Motooka, Nishi-ku 819-0395, Fukuoka, Japan

3Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium

E-mail addresses: puya.latafat@esat.kuleuven.be, andreas.themelis@ees.kyushu-u.ac.jp, masoud.ahookhosh@uantwerp.be, panos.patrinos@esat.kuleuven.be.

1991 Mathematics Subject Classification. 90C06, 90C25, 90C26, 49J52, 49J53.

Key words and phrases. Nonsmooth nonconvex optimization, incremental aggregated algorithms, Bregman Moreau envelope, KL inequality.

This work was supported by the Research Foundation Flanders (FWO) PhD grant 1196820N and research projects G0A0920N, G086518N, and G086318N; Research Council KU Leuven C1 project No. C14/18/068; Fonds de la Recherche Scientifique – FNRS and the Fonds Wetenschappelijk Onderzoek – Vlaanderen under EOS project no 30468160 (SeLMA).



Algorithm 1 Bregman Finito/MISO (BFinito) for the regularized finite sum minimization (P)

Require: Legendre kernels $h_i$ such that $f_i$ is $L_{f_i}$-smooth relative to $h_i$; stepsizes $\gamma_i\in(0,N/L_{f_i})$; initial point $x^{\rm init}\in C\coloneqq\bigcap_{i=1}^N\operatorname{int}\operatorname{dom} h_i$
Initialize: table $\boldsymbol s^0=(s^0_1,\dots,s^0_N)\in\mathbb R^{nN}$ of vectors $s^0_i=\tfrac1{\gamma_i}\nabla h_i(x^{\rm init})-\tfrac1N\nabla f_i(x^{\rm init})$; $\mathbb R^n$-vector $\tilde s^0=\sum_{i=1}^N s^0_i$
Repeat for $k=0,1,\dots$ until convergence:
  1.1: Compute $z^k\in\arg\min_{w\in\mathbb R^n}\bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\langle\tilde s^k,w\rangle\bigr\}$
  1.2: Select a subset of indices $I^{k+1}\subseteq[N]\coloneqq\{1,\dots,N\}$ and update the table $\boldsymbol s^{k+1}$ as follows:
       $$s^{k+1}_i=\begin{cases}\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\nabla f_i(z^k) & \text{if } i\in I^{k+1},\\ s^k_i & \text{otherwise}\end{cases}$$
  1.3: Update the vector $\tilde s^{k+1}=\tilde s^k+\sum_{i\in I^{k+1}}(s^{k+1}_i-s^k_i)$
Return $z^k$

Motivated by these recent works, we propose a Bregman extension of the popular Finito/MISO algorithm [28, 42] in a fully nonconvex setting and with very general sampling strategies that will be made precise shortly after. In a nutshell, our analysis revolves around the fact that, regardless of the index selection strategy, the function $\mathcal L:\mathbb R^n\times\mathbb R^{nN}\to\overline{\mathbb R}$ defined as
$$\mathcal L(z,\boldsymbol s) \coloneqq \varphi(z) + \sum_{i=1}^N D_{\hat h_i^*}\bigl(s_i,\nabla\hat h_i(z)\bigr), \tag{1.1}$$
where $\hat h_i^*$ denotes the convex conjugate of $\hat h_i\coloneqq h_i/\gamma_i-f_i/N$, monotonically decreases along the iterates $(z^k,\boldsymbol s^k)_{k\in\mathbb N}$ generated by Algorithm 1 (see Assumption I for the requirements on $h_i$, $f_i$). Our methodology leverages an interpretation of Finito/MISO as a block-coordinate algorithm that was observed in [37] in the Euclidean setting. In fact, the analysis is here further simplified after noticing that the smooth function can be "hidden" in the distance-generating function, resulting in a Lyapunov function $\mathcal L$ that can be expressed as a Bregman Moreau envelope (cf. Lemma 3.2).
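For concreteness, here is a minimal Python sketch of Algorithm 1 in the Euclidean special case $h_i=\tfrac12\|\cdot\|^2$, $f_i(x)=\tfrac12\|A_ix-b_i\|^2$, and $g=\lambda\|\cdot\|_1$, where step 1.1 reduces to soft-thresholding; the synthetic data, step-size choice, and single-index random sampling are illustrative assumptions rather than prescriptions of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): f_i(x) = 0.5*||A_i x - b_i||^2, g(x) = lam*||x||_1,
# and Euclidean kernels h_i = 0.5*||.||^2, so that step 1.1 is a soft-thresholding.
N, n, m = 10, 20, 5
A = [rng.standard_normal((m, n)) for _ in range(N)]
b = [rng.standard_normal(m) for _ in range(N)]
lam = 0.1

def grad_f(i, x):                       # gradient of f_i
    return A[i].T @ (A[i] @ x - b[i])

# With h_i Euclidean, L_{f_i} is the usual Lipschitz modulus of grad f_i,
# and the step sizes must satisfy gamma_i in (0, N/L_{f_i}).
L = [np.linalg.norm(Ai, 2) ** 2 for Ai in A]
gamma = [0.99 * N / Li for Li in L]
c = sum(1.0 / gi for gi in gamma)       # curvature of the quadratic in step 1.1

def soft(v, t):                         # prox of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Initialization: table s and aggregate vector s_tilde
x_init = np.zeros(n)
s = [x_init / gamma[i] - grad_f(i, x_init) / N for i in range(N)]
s_tilde = sum(s)

for k in range(300):
    z = soft(s_tilde / c, lam / c)      # step 1.1 (closed form in this special case)
    i = rng.integers(N)                 # step 1.2: single-index random sampling (S1)
    s_new = z / gamma[i] - grad_f(i, z) / N
    s_tilde += s_new - s[i]             # step 1.3: update the aggregate vector
    s[i] = s_new

phi = sum(0.5 * np.linalg.norm(A[i] @ z - b[i]) ** 2 for i in range(N)) / N \
      + lam * np.abs(z).sum()
print("final objective:", phi)
```

In the genuinely Bregman case step 1.1 is the subproblem (2.1) with $s=\tilde s^k$ and in general requires a dedicated solver.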

We cover a wide range of sampling strategies for the index set $I^{k+1}$ at step 1.2, which we can summarize into the following two umbrella categories:

Random sampling rule: $\exists\, p_1,\dots,p_N>0:\ \mathbb P_k\bigl[i\in I^{k+1}\bigr]=p_i\quad\forall k\in\mathbb N,\ i\in[N]$. (S1)

Essentially cyclic rule: $\exists\, T>0:\ \bigcup_{t=1}^T I^{k+t}=[N]\quad\forall k\in\mathbb N$. (S2)

The randomized setting (S1), in which $\mathbb P_k$ denotes the probability conditional to the knowledge at iteration $k$, covers, for instance, a mini-batch strategy of size $b$. Another notable case is when each index $i$ is selected at random with probability $p_i$ independently of other indices.

The essentially cyclic sampling (S2) is also very general and has been considered by many authors [62, 60, 33, 24, 67]. Two notable special cases of single index selection rules complying with (S2) are the cyclic and shuffled cyclic sampling strategies:

Shuffled cyclic rule: $I^{k+1}=\bigl\{\pi_{\lfloor k/N\rfloor}\bigl(\operatorname{mod}(k,N)+1\bigr)\bigr\}$, (S2shuf)

where $\pi_0,\pi_1,\dots$ are permutations of the set of indices $[N]$ (chosen randomly or deterministically). When $\pi_{\lfloor k/N\rfloor}=\operatorname{id}$ one recovers the (plain) cyclic sampling rule

Cyclic rule: $I^{k+1}=\{\operatorname{mod}(k,N)+1\}$. (S2cycl)

We remark that, in the cyclic case, our algorithm generalizes DIAG [44] for smooth strongly convex problems, which itself may be seen as a cyclic variant of Finito/MISO.
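As a small illustration (0-based indices, hypothetical helper names), the three rules can be realized as Python generators producing the index sets $I^{k+1}$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8

def random_batches(batch=2):
    """(S1): each iteration draws a mini-batch uniformly at random, so every
    index is selected with the same probability p_i = batch/N at each step."""
    while True:
        yield set(rng.choice(N, size=batch, replace=False).tolist())

def shuffled_cyclic():
    """(S2shuf): sweep through a fresh permutation of [N] in every epoch."""
    while True:
        for i in rng.permutation(N):
            yield {int(i)}

def cyclic():
    """(S2cycl): plain cyclic rule, I^{k+1} = {mod(k, N) + 1} (0-based here)."""
    while True:
        for i in range(N):
            yield {i}

# For the plain cyclic rule any window of N consecutive iterations covers all
# indices (T = N in (S2)); for the shuffled rule a window of length 2N - 1
# always contains one complete epoch, so (S2) holds with T = 2N - 1.
sampler = shuffled_cyclic()
print([next(sampler) for _ in range(2 * N)])
```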

1.1. Low-memory variant. One iteration of Algorithm 1 involves the computation of $z^k$ at step 1.1 and that of the gradients $\nabla(h_i/\gamma_i-f_i/N)$, $i\in I^{k+1}$, at step 1.2. Consequently, the overall complexity of each iteration is independent of the number $N$ of functions appearing in problem (P), and is instead proportional to the number of sampled indices, which the user is allowed to upper bound by any integer between 1 and $N$. As is the case for all incremental gradient methods, the low iteration cost comes at the price of having to store in memory a table $\boldsymbol s^k$ of $N$ many $\mathbb R^n$-vectors, which can become problematic when $N$ grows large.


Algorithm 2 Low-memory Bregman Finito/MISO

Require: Legendre kernels $h_i$ such that $f_i$ is $L_{f_i}$-smooth relative to $h_i$; stepsizes $\gamma_i\in(0,N/L_{f_i})$; initial point $x^{\rm init}\in C\coloneqq\bigcap_{i=1}^N\operatorname{int}\operatorname{dom} h_i$
Initialize: $\mathbb R^n$-vector $\tilde s^0=\sum_{i=1}^N\tfrac1{\gamma_i}\nabla h_i(x^{\rm init})-\tfrac1N\nabla f_i(x^{\rm init})$; set of selectable indices $\mathcal K^0=\emptyset$   ▹ conventionally set to $\emptyset$ so as to start with a full update
Repeat for $k=0,1,\dots$ until convergence:
  2.1: $z^k\in\arg\min_{w\in\mathbb R^n}\bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\langle\tilde s^k,w\rangle\bigr\}$
  2.2: if $\mathcal K^k=\emptyset$ then   ▹ no index left to be sampled: full update
  2.3:   $I^{k+1}=\mathcal K^{k+1}=[N]$   ▹ activate all indices and reset the selectable indices
  2.4:   $\tilde z^k=z^k$   ▹ store the full update $z^k$
  2.5:   $\tilde s^{k+1}=\sum_{i=1}^N\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\sum_{i=1}^N\nabla f_i(z^k)$
  2.6: else
  2.7:   select a nonempty subset of indices $I^{k+1}\subseteq\mathcal K^k$   ▹ select among the indices not yet sampled
  2.8:   $\mathcal K^{k+1}=\mathcal K^k\setminus I^{k+1}$   ▹ update the set of selectable indices
  2.9:   $\tilde z^k=\tilde z^{k-1}$
  2.10:  $\tilde s^{k+1}=\tilde s^k+\sum_{i\in I^{k+1}}\bigl[\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\nabla f_i(z^k)-\tfrac1{\gamma_i}\nabla h_i(\tilde z^k)+\tfrac1N\nabla f_i(\tilde z^k)\bigr]$
Return $\tilde z^k$

Other incremental algorithms for convex optimization, such as IAG [15, 16, 65], IUG [64], SAG [54], and SAGA [27], can considerably reduce memory allocation from $O(nN)$ to $O(n)$ in applications such as logistic regression and lasso where the gradients $\nabla f_i$ can be expressed as scaled versions of the data vectors. Despite the favorable performance of the Finito/MISO algorithm on such problems as observed in [27], this memory reduction trick cannot be employed due to the fact that the vectors $s_i$ stored in the table depend not only on the gradients, but also on the vectors $\nabla h_i(z^k)$. Nevertheless, inspired by the popular stochastic methods SVRG [34, 66] and SARAH [48], by suitably interleaving incremental and full gradient evaluations it is possible to completely waive the need of a memory table and match the $O(n)$ storage requirement.

In a nutshell, after a full update — which in Algorithm 1 corresponds to selecting $I^{k+1}=[N]$ — all vectors $s^{k+1}_i$ in the table only depend on the variable $z^k$ computed at step 1.1, until $i$ is sampled again. As long as full gradient updates are frequent enough so that no index is sampled twice in between, it thus suffices to keep track of $z^k\in\mathbb R^n$ instead of the table $\boldsymbol s^k\in\mathbb R^{nN}$. The variant is detailed in Algorithm 2, in which $\mathcal K^k\subseteq[N]$ keeps track of the indices that have not yet been sampled between full gradient updates (and is thus reset whenever such full steps occur, cf. step 2.3). Vector $\tilde z^k\in\mathbb R^n$ is equal to the $z^k$ corresponding to the latest full gradient update (cf. step 2.4) and acts as a low-memory surrogate of the table $\boldsymbol s^k$ of Algorithm 1. Similarly to SVRG and SARAH, this reduction in the storage requirements comes at the cost of an extra gradient evaluation per sampled index (cf. step 2.10).
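For illustration, a minimal sketch of this bookkeeping in the Euclidean special case $h_i=\tfrac12\|\cdot\|^2$ with $g\equiv0$ and synthetic rank-one least-squares terms (all data hypothetical) could read as follows; only $\tilde s$ and $\tilde z$ are stored, matching the $O(n)$ memory footprint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical data): f_i(x) = 0.5*(a_i^T x - b_i)^2, g = 0, and
# Euclidean kernels h_i = 0.5*||.||^2; this only illustrates the bookkeeping
# of Algorithm 2, not the general Bregman case.
N, n = 6, 4
a = [rng.standard_normal(n) for _ in range(N)]
b = [rng.standard_normal() for _ in range(N)]
grad_f = lambda i, x: (a[i] @ x - b[i]) * a[i]
gamma = [0.9 * N / (a[i] @ a[i]) for i in range(N)]   # gamma_i < N / L_{f_i}
c = sum(1.0 / gi for gi in gamma)

x_init = np.zeros(n)
s_tilde = sum(x_init / gamma[i] - grad_f(i, x_init) / N for i in range(N))
K = set()          # selectable indices; empty set triggers a full update
z_bar = x_init     # z at the latest full update (surrogate of the table)

for k in range(120):
    z = s_tilde / c                      # step 2.1 with g = 0, Euclidean kernels
    if not K:                            # steps 2.2-2.5: full update
        K = set(range(N))
        z_bar = z
        s_tilde = sum(z / gamma[i] - grad_f(i, z) / N for i in range(N))
    else:                                # steps 2.7-2.10: incremental correction
        i = K.pop()                      # an index not sampled since the full update
        s_tilde += (z / gamma[i] - grad_f(i, z) / N) \
                 - (z_bar / gamma[i] - grad_f(i, z_bar) / N)

print("returned point z_bar:", z_bar)
```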

Since full gradient updates correspond to selecting all indices, Algorithm 2 may be viewed as Algorithm 1 with an essentially cyclic sampling rule of period $N$, a claim that will be formally shown in Lemma 4.12. In fact, not only does it naturally inherit all the convergence results, but its particular sampling strategy also allows us to waive convexity requirements on $g$ that are necessary for more general essentially cyclic rules.

1.2. Contributions. As a means to informally summarize the content of the paper, in Table 1 we synopsize the convergence results of the two algorithms.

1. To the best of our knowledge, this is the first analysis of an incremental aggregated method in a fully nonconvex setting and without Lipschitz differentiability assumptions. Our analysis is surprisingly simple, and yet it covers randomized and essentially cyclic samplings altogether, and relies on a sure descent property of the Bregman Moreau envelope (cf. Lemma 4.2).

2. We propose a novel low-memory variant of the (Bregman) Finito/MISO algorithm that, in the spirit of SVRG [34, 66] and SARAH [48], alternates between incremental steps and a full proximal gradient step. It is highly interesting even in the Euclidean case, as it can accommodate fully nonconvex formulations while maintaining an $O(n)$ memory requirement.


Property | Sampling | Requirements (additionally to Assumption I) | Claim | Reference
---------|----------|---------------------------------------------|-------|----------
$z^k$ bounded | any | $\varphi+\delta_C$ level bounded | sure | Lemma 4.2(iv)
$\varphi(z^k)$ convergent | S1 |  | a.s. | Theorem 4.6(ii)
 | S2 | $C=\mathbb R^n$; $g$ cvx; $h_i$ (loc) str cvx, smooth; ($\varphi$ level bounded) | sure | Theorem 4.9
 | LM |  | sure | Theorem 4.13
$\omega(z^k)$ stationary | S1 | either $C=\mathbb R^n$ | a.s. | Theorem 4.6(iv)
 |  | or $\operatorname{dom} h_i$ closed; $\varphi$ cvx | a.s. | Theorem 4.6(vi)
 | S2 | $C=\mathbb R^n$; $g$ cvx; $h_i$ (loc) str cvx, smooth; ($\varphi$ level bounded) | sure | Theorem 4.9
 | LM | $C=\mathbb R^n$ | sure | Theorem 4.13
$z^k$ convergent | S1 | either $C=\mathbb R^n$; $\varphi$ cvx | a.s. | Theorem 4.6(vii)
 |  | or Assumption II; $\varphi+\delta_C$ cvx, level bounded | a.s. | Theorem 4.6(vii)
 | S2 | Assumption III; $g$ cvx | sure | Theorem 4.11(i)
 | LM | Assumption III | sure | Theorem 4.14(i)
$\varphi(z^k)$ and $z^k$ linearly convergent | S1 | $C=\mathbb R^n$; $\varphi$ str cvx; $h_i$ loc smooth | E | Theorem 4.7
 | S2 | Assumption III; $\varphi$ kl$_{1/2}$; $g$ cvx | sure | Theorem 4.11(iii)
 | LM | Assumption III; $\varphi$ kl$_{1/2}$ | sure | Theorem 4.14(iii)

Table 1. Summary of the convergence results for Algorithm 1 with randomized sampling (S1) and essentially cyclic sampling (S2), and for the low-memory variant of Algorithm 2 (LM). Claims are either sure, almost sure (a.s.), or in expectation (E).

Other abbreviations: loc: locally; cvx: convex; str: strongly; smooth: Lipschitz differentiable; $\omega$: set of limit points; kl$_\theta$: Kurdyka-Łojasiewicz property with exponent $\theta$.

3. Linear convergence of Algorithm 1 in the randomized case is established when the cost function $\varphi$ is strongly convex, yet with no convexity requirement on $f_i$ or $g$. To the best of our knowledge, this is a novelty even in the Euclidean case, for all available results are bound to strong convexity of each term $f_i$ in the sum; see e.g., [28, 42, 44, 37, 50].

4. We leverage the Kurdyka-Łojasiewicz (KL) property to establish global (as opposed to subsequential) convergence as well as linear convergence, for Algorithm 1 with (essentially) cyclic sampling and for the low-memory Algorithm 2.

1.3. Organization. The problem setting is formally described in Section 2 together with a list of related definitions and known facts involving Bregman distances, relative smoothness, and the proximal mapping. Section 3 offers an alternative interpretation of Algorithm 1 as the block-coordinate Bregman proximal point Algorithm 3, which majorly simplifies the analysis, addressed in Section 4. Some auxiliary results are deferred to Appendices A and B. Section 5 applies the proposed algorithms to sparse phase retrieval problems, and Section 6 concludes the paper. We conclude this section by introducing some notational conventions.

1.4. Notation. The sets of real and extended-real numbers are $\mathbb R\coloneqq(-\infty,\infty)$ and $\overline{\mathbb R}\coloneqq\mathbb R\cup\{\infty\}$, while the positive and strictly positive reals are $\mathbb R_+\coloneqq[0,\infty)$ and $\mathbb R_{++}\coloneqq(0,\infty)$. With id we indicate the identity function $x\mapsto x$ defined on a suitable space. We denote by $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ the standard Euclidean inner product and the induced norm. For a vector $w=(w_1,\dots,w_r)\in\mathbb R^{\sum_i n_i}$, $w_i\in\mathbb R^{n_i}$ is used to denote its $i$-th block coordinate. $\operatorname{int} E$ and $\operatorname{bdry} E$ respectively denote the interior and boundary of a set $E$, and for a sequence $(x^k)_{k\in\mathbb N}$ we write $(x^k)_{k\in\mathbb N}\subseteq E$ to indicate that $x^k\in E$ for all $k\in\mathbb N$. We say that $(x^k)_{k\in\mathbb N}$ converges at Q-linear rate (resp. R-linear rate) to a point $x$ if there exists $c\in(0,1)$ such that $\|x^{k+1}-x\|\le c\|x^k-x\|$ (resp. $\|x^k-x\|\le\rho c^k$ for some $\rho>0$) holds for all $k\in\mathbb N$.

We use the notation $H:\mathbb R^n\rightrightarrows\mathbb R^m$ to indicate a mapping from each point $x\in\mathbb R^n$ to a subset $H(x)$ of $\mathbb R^m$. The graph of $H$ is the set $\operatorname{gph} H\coloneqq\{(x,y)\in\mathbb R^n\times\mathbb R^m\mid y\in H(x)\}$. We say that $H$ is outer semicontinuous (osc) if $\operatorname{gph} H$ is a closed subset of $\mathbb R^n\times\mathbb R^m$, and locally bounded if for every bounded $U\subset\mathbb R^n$ the set $\bigcup_{x\in U}H(x)$ is bounded.

The domain and epigraph of an extended-real-valued function $h:\mathbb R^n\to\overline{\mathbb R}$ are the sets $\operatorname{dom} h\coloneqq\{x\in\mathbb R^n\mid h(x)<\infty\}$ and $\operatorname{epi} h\coloneqq\{(x,\alpha)\in\mathbb R^n\times\mathbb R\mid h(x)\le\alpha\}$. Function $h$ is said to be proper if $\operatorname{dom} h\ne\emptyset$, and lower semicontinuous (lsc) if $\operatorname{epi} h$ is a closed subset of $\mathbb R^{n+1}$. We say that $h$ is level bounded if its $\alpha$-sublevel set $\operatorname{lev}_{\le\alpha}h\coloneqq\{x\in\mathbb R^n\mid h(x)\le\alpha\}$ is bounded for all $\alpha\in\mathbb R$. The conjugate of $h$ is defined by $h^*(y)\coloneqq\sup_{x\in\mathbb R^n}\{\langle y,x\rangle-h(x)\}$. The indicator function of a set $E\subseteq\mathbb R^n$ is denoted by $\delta_E$, namely $\delta_E(x)=0$ if $x\in E$ and $\infty$ otherwise.

We denote by $\hat\partial h:\mathbb R^n\rightrightarrows\mathbb R^n$ the regular subdifferential of $h$, where
$$v\in\hat\partial h(\bar x)\quad\Longleftrightarrow\quad\liminf_{\substack{x\to\bar x\\ x\ne\bar x}}\ \frac{h(x)-h(\bar x)-\langle v,x-\bar x\rangle}{\|x-\bar x\|}\ge0.$$
A necessary condition for local minimality of $x$ for $h$ is $0\in\hat\partial h(x)$, see [53, Th. 10.1]. The (limiting) subdifferential of $h$ is $\partial h:\mathbb R^n\rightrightarrows\mathbb R^n$, where $v\in\partial h(x)$ iff $x\in\operatorname{dom} h$ and there exists a sequence $(x^k,v^k)_{k\in\mathbb N}\subseteq\operatorname{gph}\hat\partial h$ such that $(x^k,h(x^k),v^k)\to(x,h(x),v)$ as $k\to\infty$. Finally, the set of $r$ times continuously differentiable functions from $X$ to $\mathbb R$ is denoted by $C^r(X)$.
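For instance (a standard illustration, not taken from this paper), for $h(x)=|x|$ one has $\hat\partial h(0)=\partial h(0)=[-1,1]$, whereas for $h(x)=-|x|$ the regular subdifferential at the origin is empty while the limiting one is not:
$$\hat\partial(-|\cdot|)(0)=\emptyset,\qquad \partial(-|\cdot|)(0)=\{-1,1\},$$
the latter being obtained as limits of the gradients $\mp1$ attained along sequences $x^k\to0$ with $x^k\gtrless0$.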

2. Problem setting and preliminaries

Throughout this paper, problem (P) is studied under the following assumptions.

Assumption I (basic requirements). In problem (P),

a1: $f_i:\mathbb R^n\to\overline{\mathbb R}$ are $L_{f_i}$-smooth relative to Legendre kernels $h_i$ (Definitions 2.2 and 2.4);

a2: $g:\mathbb R^n\to\overline{\mathbb R}$ is proper and lower semicontinuous (lsc);

a3: a solution exists: $\arg\min\{\varphi(x)\mid x\in\overline C\}\ne\emptyset$;

a4: for given $\gamma_i\in(0,N/L_{f_i})$, $i\in[N]$, it holds that for any $s\in\mathbb R^n$
$$T(s)\coloneqq\arg\min_{w\in\mathbb R^n}\Bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\langle s,w\rangle\Bigr\}\subseteq C.\tag{2.1}$$

As it will become clear in Section 3, the subproblem (2.1) is in fact a reformulation of a (Bregman) proximal mapping. Assumption I.a4 excludes boundary points from $\operatorname{range} T$. This is a standard assumption that usually holds in practice [20, 56], e.g., when $g$ is convex or when the intersection of $\operatorname{dom} h_i$, $i\in[N]$, is an open set.

Definition 2.1 (Bregman distance). For a convex function $h:\mathbb R^n\to\overline{\mathbb R}$ that is continuously differentiable on $\operatorname{int}\operatorname{dom} h\ne\emptyset$, the Bregman distance $D_h:\mathbb R^n\times\mathbb R^n\to\overline{\mathbb R}$ is defined as
$$D_h(x,y)\coloneqq\begin{cases}h(x)-h(y)-\langle\nabla h(y),x-y\rangle & \text{if } y\in\operatorname{int}\operatorname{dom} h,\\ \infty & \text{otherwise.}\end{cases}\tag{2.2}$$
Function $h$ will be referred to as a distance-generating function.
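For instance (a small numerical aside, not part of the paper's development), with the Boltzmann-Shannon entropy kernel $h(x)=\sum_j(x_j\log x_j-x_j)$ on the positive orthant the Bregman distance specializes to the generalized Kullback-Leibler divergence; a direct implementation of (2.2) confirms this:

```python
import numpy as np

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad_h(y), x - y>, for y in int dom h."""
    return h(x) - h(y) - grad_h(y) @ (x - y)

# Boltzmann-Shannon entropy kernel on the positive orthant.
h = lambda x: np.sum(x * np.log(x) - x)
grad_h = lambda x: np.log(x)

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.1, 0.6, 0.3])
# Equals the generalized Kullback-Leibler divergence sum x*log(x/y) - x + y >= 0.
print(bregman(h, grad_h, x, y))
```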

Definition 2.2 (Legendre kernel). A proper, lsc, and strictly convex function $h:\mathbb R^n\to\overline{\mathbb R}$ with $\operatorname{int}\operatorname{dom} h\ne\emptyset$ and such that $h\in C^1(\operatorname{int}\operatorname{dom} h)$ is said to be a Legendre kernel if it is (i) 1-coercive, i.e., such that $\lim_{\|x\|\to\infty}h(x)/\|x\|=\infty$, and (ii) essentially smooth, i.e., if $\|\nabla h(x^k)\|\to\infty$ for every sequence $(x^k)_{k\in\mathbb N}\subseteq\operatorname{int}\operatorname{dom} h$ converging to a boundary point of $\operatorname{dom} h$.

Fact 2.3. The following hold for a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$, $x\in\mathbb R^n$, and $y,z\in\operatorname{int}\operatorname{dom} h$:

(i) $h^*\in C^1(\mathbb R^n)$ is strictly convex and $(\nabla h)^{-1}=\nabla h^*$ [52, Thm. 26.5 and Cor. 13.3.1].

(ii) $D_h(x,z)=D_h(x,y)+D_h(y,z)+\langle x-y,\nabla h(y)-\nabla h(z)\rangle$ [22, Lem. 3.1].

(iii) $D_h(y,z)=D_{h^*}(\nabla h(z),\nabla h(y))$ [8, Thm. 3.7(v)].

(iv) $D_h(\cdot,z)$ and $D_h(z,\cdot)$ are level bounded [9, Lem. 7.3(v)-(viii)].

(v) If $\operatorname{dom} h$ is closed and $D_h(x^k,y^k)\to0$ for some $x^k\in\operatorname{dom} h$ and $y^k\in\operatorname{int}\operatorname{dom} h$, then $(x^k)_{k\in\mathbb N}$ converges to a point $x$ iff so does $(y^k)_{k\in\mathbb N}$ [56, Thm. 2.4].

Moreover, for any convex set $U\subseteq\operatorname{int}\operatorname{dom} h$ and $u,v\in U$ the following hold:

(vi) If $h$ is $\mu_{h,U}$-strongly convex on $U$, then $\tfrac{\mu_{h,U}}2\|v-u\|^2\le D_h(v,u)\le\tfrac1{2\mu_{h,U}}\|\nabla h(v)-\nabla h(u)\|^2$.

(vii) If $\nabla h$ is $\ell_{h,U}$-Lipschitz on $U$, then $\tfrac1{2\ell_{h,U}}\|\nabla h(v)-\nabla h(u)\|^2\le D_h(v,u)\le\tfrac{\ell_{h,U}}2\|v-u\|^2$.

Definition 2.4 (relative smoothness [7]). We say that a proper, lsc function $f:\mathbb R^n\to\overline{\mathbb R}$ is smooth relative to a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$ if $\operatorname{dom} f\supseteq\operatorname{dom} h$, and there exists $L_f\ge0$ such that $L_fh\pm f$ are convex functions on $\operatorname{int}\operatorname{dom} h$. We will simply say that $f$ is $L_f$-smooth relative to $h$ to make the modulus $L_f$ explicit.
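A standard one-dimensional illustration (the numerical check below is only a sketch): $f(x)=x^4/4$ has gradient $x^3$, which is not globally Lipschitz, yet $f$ is 1-smooth relative to the Legendre kernel $h(x)=x^4/4+x^2/2$, since convexity of $L_fh\pm f$ amounts here to $L_fh''(x)\ge|f''(x)|$, i.e. $3x^2+1\ge3x^2$:

```python
import numpy as np

# f(x) = x**4/4 has gradient x**3 (not globally Lipschitz), while
# h(x) = x**4/4 + x**2/2 is a Legendre kernel on the whole real line.
# Relative smoothness with L_f = 1 amounts to convexity of L_f*h - f and
# L_f*h + f, i.e. L_f*h''(x) >= |f''(x)|, here 3x^2 + 1 >= 3x^2.
f2 = lambda x: 3.0 * x ** 2          # f''
h2 = lambda x: 3.0 * x ** 2 + 1.0    # h''
Lf = 1.0

x = np.linspace(-100.0, 100.0, 100001)
print(bool(np.all(Lf * h2(x) - np.abs(f2(x)) >= 0.0)))   # True
```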


Fact 2.5. Let $f:\mathbb R^n\to\overline{\mathbb R}$ be $L_f$-smooth relative to a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$. Then, $f\in C^1(\operatorname{int}\operatorname{dom} h)$ and the following hold:

(i) $\bigl|f(y)-f(x)-\langle\nabla f(x),y-x\rangle\bigr|\le L_fD_h(y,x)$ for all $x,y\in\operatorname{int}\operatorname{dom} h$.

(ii) $-L_f\nabla^2h\preceq\nabla^2f\preceq L_f\nabla^2h$ on $\operatorname{int}\operatorname{dom} h$, provided that $f,h\in C^2(\operatorname{int}\operatorname{dom} h)$.

(iii) If $\nabla h$ is Lipschitz continuous with modulus $\ell_{h,U}$ on a convex set $U$, then so is $\nabla f$ with modulus $\ell_{f,U}=L_f\ell_{h,U}$ [1, Prop. 2.5(ii)].

Relative to a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$, the Bregman proximal mapping of $\psi$ is the set-valued map $\operatorname{prox}^h_\psi:\operatorname{int}\operatorname{dom} h\rightrightarrows\mathbb R^n$ given by
$$\operatorname{prox}^h_\psi(x)\coloneqq\arg\min_{z\in\mathbb R^n}\{\psi(z)+D_h(z,x)\},\tag{2.3}$$
and the corresponding Bregman-Moreau envelope is $\psi^h:\mathbb R^n\to[-\infty,\infty]$ defined as
$$\psi^h(x)\coloneqq\inf_{z\in\mathbb R^n}\{\psi(z)+D_h(z,x)\}.\tag{2.4}$$
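As a toy illustration of these two objects (a case assumed for this sketch only, not one used in the paper), take $n=1$, the Boltzmann-Shannon kernel $h(x)=x\log x-x$, and $\psi(x)=x$: the optimality condition of (2.3) reads $1+\log z-\log x=0$, so $\operatorname{prox}^h_\psi(x)=xe^{-1}$, which a direct numerical minimization reproduces:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# One-dimensional toy case (assumed for this sketch): Boltzmann-Shannon kernel
# h(x) = x*log(x) - x on (0, inf) and psi(x) = x, lower bounded on dom h.
h = lambda x: x * np.log(x) - x
D = lambda z, x: h(z) - h(x) - np.log(x) * (z - x)      # Bregman distance D_h(z, x)
psi = lambda z: z

def bregman_prox_and_envelope(x):
    obj = lambda z: psi(z) + D(z, x)
    res = minimize_scalar(obj, bounds=(1e-12, 10.0), method="bounded")
    return res.x, res.fun        # prox^h_psi(x) and psi^h(x)

# Optimality condition: 1 + log(z) - log(x) = 0, i.e. prox^h_psi(x) = x*exp(-1).
x = 2.0
z_num, env = bregman_prox_and_envelope(x)
print(z_num, x * np.exp(-1.0), env)
```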

Fact 2.6 (regularity properties of $\operatorname{prox}^h_\psi$ and $\psi^h$ [35]). The following hold for a Legendre kernel $h:\mathbb R^n\to\overline{\mathbb R}$ and a proper, lsc, lower bounded function $\psi:\mathbb R^n\to\overline{\mathbb R}$:

(i) $\operatorname{prox}^h_\psi$ is locally bounded, compact-valued, and outer semicontinuous on $\operatorname{int}\operatorname{dom} h$.

(ii) $\psi^h$ is real-valued and continuous on $\operatorname{int}\operatorname{dom} h$; in fact, it is locally Lipschitz if so is $\nabla h$.

Fact 2.7 (relation between $\psi$ and $\psi^h$). Let $h$ be a Legendre kernel and $\psi:\mathbb R^n\to\overline{\mathbb R}$ be proper, lsc, and lower bounded on $\operatorname{dom} h$. Then, for every $x\in\operatorname{int}\operatorname{dom} h$, $y\in\operatorname{dom} h$, and $\bar x\in\operatorname{prox}^h_\psi(x)$

(i) $\psi^h(x)\overset{\rm def}{=}\psi(\bar x)+D_h(\bar x,x)\le\psi(y)+D_h(y,x)$, and in particular $\psi^h(x)\le\psi(x)$;

(ii) if $\psi$ is convex, then $\psi^h(x)\le\psi(y)+D_h(y,x)-D_h(y,\bar x)$ [59, Lem. 3.1].

Moreover, if $\operatorname{range}\operatorname{prox}^h_\psi\subseteq\operatorname{int}\operatorname{dom} h$, then the following also hold [1, Prop. 3.3]:

(iii) $\inf_{\operatorname{dom} h}\psi\le\inf_{\operatorname{int}\operatorname{dom} h}\psi=\inf\psi^h$ and $\arg\min\psi^h=\arg\min_{\operatorname{int}\operatorname{dom} h}\psi$.

(iv) $\psi+\delta_{\operatorname{dom} h}$ is level bounded iff so is $\psi^h$.

3. A block-coordinate interpretation

By introducing $N$ copies of $x$, problem (P) can equivalently be written as
$$\operatorname*{minimize}_{\boldsymbol x=(x_1,\dots,x_N)\in\mathbb R^{nN}}\ \Phi(\boldsymbol x)=\underbrace{\tfrac1N\sum_{i=1}^Nf_i(x_i)}_{F(\boldsymbol x)}+\underbrace{\tfrac1N\sum_{i=1}^Ng(x_i)+\delta_\Delta(\boldsymbol x)}_{G(\boldsymbol x)}\quad\text{subject to } \boldsymbol x\in\overline C\times\cdots\times\overline C,\tag{3.1}$$
where $\Delta\coloneqq\{\boldsymbol x=(x_1,\dots,x_N)\in\mathbb R^{nN}\mid x_1=x_2=\cdots=x_N\}$ is the consensus set. The equivalence between (3.1) and the original problem (P) is formally established in Lemma A.1. Note that Assumption I.a1 implies that $F$ as in (3.1) is smooth with respect to the Legendre kernel
$$H:\mathbb R^{nN}\to\overline{\mathbb R}\ \text{ defined as }\ H(\boldsymbol x)=\sum_{i=1}^Nh_i(x_i),\tag{3.2}$$
making Bregman forward-backward iterations $\boldsymbol x^+\in\arg\min\{\langle\nabla F(\boldsymbol x),\cdot\,\rangle+G(\cdot)+\tfrac1\gamma D_H(\cdot,\boldsymbol x)\}$ for some stepsize $\gamma>0$ a suitable option to address problem (3.1). In fact, it can be easily verified that $L_F=\tfrac1N\max_{i=1\dots N}L_{f_i}$ is a smoothness modulus of $F$ relative to $H$, indicating that fixed point iterations $\boldsymbol x\leftarrow\boldsymbol x^+$ under Assumption I converge (in some sense to be made precise) to a stationary point of the problem whenever $\gamma\in(0,1/L_F)$. Notice that a higher degree of flexibility can be granted by considering an $N$-tuple of individual stepsizes $\Gamma=(\gamma_1,\dots,\gamma_N)$, giving rise to the forward-backward operator $T^\Gamma_{F,G}:\mathbb R^{nN}\rightrightarrows\mathbb R^{nN}$ in the Bregman metric $(\boldsymbol z,\boldsymbol x)\mapsto\sum_{i=1}^N\tfrac1{\gamma_i}D_{h_i}(z_i,x_i)$, namely
$$T^\Gamma_{F,G}(\boldsymbol x)\coloneqq\arg\min_{\boldsymbol z\in\mathbb R^{nN}}\Bigl\{F(\boldsymbol x)+\langle\nabla F(\boldsymbol x),\boldsymbol z-\boldsymbol x\rangle+G(\boldsymbol z)+\sum_{i=1}^N\tfrac1{\gamma_i}D_{h_i}(z_i,x_i)\Bigr\}.\tag{3.3}$$
This intuition is validated in the next result, which asserts that whenever the stepsizes $\gamma_i$ are selected as in Algorithm 1 the operator $T^\Gamma_{F,G}$ coincides with a proximal mapping on a suitable Legendre kernel function $\hat H$. This observation leads to a much simpler analysis of Algorithm 1, which will be shown to be a block-coordinate variant of a Bregman proximal point method.


Lemma 3.1. Suppose that Assumption I.a1 holds and let $\gamma_i\in(0,N/L_{f_i})$ be selected as in Algorithm 1. Then, $\hat h_i\coloneqq\tfrac1{\gamma_i}h_i-\tfrac1Nf_i$ (with the convention $\infty-\infty=\infty$) is a Legendre kernel with $\operatorname{dom}\hat h_i=\operatorname{dom} h_i$, $i\in[N]$, and thus so is the function
$$\hat H:\mathbb R^{nN}\to\overline{\mathbb R}\ \text{ defined as }\ \hat H(\boldsymbol x)=\sum_{i=1}^N\hat h_i(x_i).\tag{3.4}$$
Moreover, for any $(\boldsymbol z,\boldsymbol x)\in\mathbb R^{nN}\times\mathbb R^{nN}$ it holds that
$$\Phi(\boldsymbol z)+D_{\hat H}(\boldsymbol z,\boldsymbol x)=F(\boldsymbol x)+\langle\nabla F(\boldsymbol x),\boldsymbol z-\boldsymbol x\rangle+G(\boldsymbol z)+\sum_{i=1}^N\tfrac1{\gamma_i}D_{h_i}(z_i,x_i),\tag{3.5}$$
and in particular the forward-backward operator (3.3) satisfies
$$T^\Gamma_{F,G}(\boldsymbol x)=\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x).\tag{3.6}$$
When Assumption I is satisfied, then the following also hold:

(i) $D_{\hat H}(\boldsymbol z,\boldsymbol x)\ge\sum_{i=1}^N\bigl(\tfrac1{\gamma_i}-\tfrac{L_{f_i}}N\bigr)D_{h_i}(z_i,x_i)$.

(ii) $\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)=\bigl\{(z,\dots,z)\mid z\in T\bigl(\sum_{i=1}^N\nabla\hat h_i(x_i)\bigr)\bigr\}$, with $T$ as in (2.1), is a nonempty and compact subset of $C\times\cdots\times C$ for any $\boldsymbol x\in\operatorname{int}\operatorname{dom} h_1\times\cdots\times\operatorname{int}\operatorname{dom} h_N$.

(iii) If $\boldsymbol z\in\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)$, then $\nabla\hat H(\boldsymbol x)-\nabla\hat H(\boldsymbol z)\in\hat\partial\Phi(\boldsymbol z)$; the converse also holds when $\Phi$ is convex.

(iv) If $\nabla h_i$ is $\ell_{h_i,U_i}$-Lipschitz on a convex set $U_i\subseteq\operatorname{int}\operatorname{dom} h_i$, then $\nabla\hat h_i$ is $\ell_{\hat h_i,U_i}$-Lipschitz on $U_i$ with $\ell_{\hat h_i,U_i}\le\bigl(\tfrac1{\gamma_i}+\tfrac{L_{f_i}}N\bigr)\ell_{h_i,U_i}$. If, in addition, $f_i-\tfrac{\mu_{f_i,U_i}}2\|\cdot\|^2$ is convex on $U_i$ for some $\mu_{f_i,U_i}\in\mathbb R$, then $\ell_{\hat h_i,U_i}\le\tfrac{\ell_{h_i,U_i}}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}N$.

(v) If $h_i$ is $\mu_{h_i,U_i}$-strongly convex on a convex set $U_i\subseteq\operatorname{dom} h_i$, then $\hat h_i$ is $\mu_{\hat h_i,U_i}$-strongly convex on $U_i$ with $\mu_{\hat h_i,U_i}\ge\bigl(\tfrac1{\gamma_i}-\tfrac{L_{f_i}}N\bigr)\mu_{h_i,U_i}$.

Proof. The claims on $\hat h_i$ are shown in [1, Thm. 4.1], and (3.5) and (3.6) then easily follow.

♠ 3.1(i): This is an immediate consequence of Fact 2.5(i).

♠ 3.1(ii): Let $\boldsymbol x$ be as in the statement, and observe that $\boldsymbol x\in\operatorname{int}\operatorname{dom}\hat H$; nonemptiness and compactness of $\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)$ then follow from Fact 2.6(i). Let now $\boldsymbol u\in\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x)$ be fixed, and note that the consensus constraint encoded in $\Phi$ ensures that $u_i=u_j$ for all $i,j\in[N]$. Thus,
$$
\begin{aligned}
u_i&\in\arg\min_{w\in\mathbb R^n}\bigl\{\Phi(w,\dots,w)+\hat H(w,\dots,w)-\langle\nabla\hat H(\boldsymbol x),(w,\dots,w)\rangle\bigr\}\\
&=\arg\min_{w\in\mathbb R^n}\Bigl\{\tfrac1N\textstyle\sum_{i=1}^Nf_i(w)+g(w)+\sum_{i=1}^N\bigl[\hat h_i(w)-\langle\nabla\hat h_i(x_i),w\rangle\bigr]\Bigr\}\\
&=\arg\min_{w\in\mathbb R^n}\Bigl\{g(w)+\textstyle\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\bigl\langle\sum_{i=1}^N\nabla\hat h_i(x_i),w\bigr\rangle\Bigr\}
\overset{(2.1)}{=}T\Bigl(\textstyle\sum_{i=1}^N\nabla\hat h_i(x_i)\Bigr)\subseteq C,
\end{aligned}
$$
where the inclusion follows from Assumption I.a4.

♠ 3.1(iii): Observe first that necessarily $\boldsymbol x\in\operatorname{int}\operatorname{dom} h_1\times\cdots\times\operatorname{int}\operatorname{dom} h_N$, for otherwise no such $\boldsymbol z$ exists. Moreover, from assertion 3.1(ii) it follows that also $\boldsymbol z$ belongs to such open set, onto which $\hat H$ is continuously differentiable. The claim then follows from the necessary condition for optimality of $\boldsymbol z$ in the minimization problem (2.4) — which is also sufficient when $\Phi$ is convex, for so is $\Phi+D_{\hat H}(\cdot,\boldsymbol x)$ in this case — having
$$0\in\hat\partial[\Phi+D_{\hat H}(\cdot,\boldsymbol x)](\boldsymbol z)=\hat\partial[\Phi+\hat H-\langle\nabla\hat H(\boldsymbol x),\cdot\,\rangle](\boldsymbol z)=\hat\partial\Phi(\boldsymbol z)+\nabla\hat H(\boldsymbol z)-\nabla\hat H(\boldsymbol x).$$
The last equality follows from [53, Ex. 8.8(c)], owing to smoothness of $\hat H$ at $\boldsymbol z$.

♠ 3.1(iv) and 3.1(v): Observe that
$$
\tfrac{N-\gamma_iL_{f_i}}{N\gamma_i}\,h_i
\;\preceq\;
\tfrac{N-\gamma_iL_{f_i}}{N\gamma_i}\,h_i+\underbrace{\tfrac1N\bigl(L_{f_i}h_i-f_i\bigr)}_{\text{convex}}
\;=\;\hat h_i\;=\;
\tfrac{N+\gamma_iL_{f_i}}{N\gamma_i}\,h_i-\underbrace{\tfrac1N\bigl(L_{f_i}h_i+f_i\bigr)}_{\text{convex}}
\;\preceq\;
\tfrac{N+\gamma_iL_{f_i}}{N\gamma_i}\,h_i,
\tag{3.7}
$$
where for notational convenience we used the partial ordering "$\preceq$", defined as $\alpha\preceq\beta$ iff $\beta-\alpha$ is convex. The claimed moduli $\ell_{\hat h_i,U_i}\le\bigl(\tfrac1{\gamma_i}+\tfrac{L_{f_i}}N\bigr)\ell_{h_i,U_i}$ and $\mu_{\hat h_i,U_i}\ge\bigl(\tfrac1{\gamma_i}-\tfrac{L_{f_i}}N\bigr)\mu_{h_i,U_i}$ are thus readily inferred. In case $f_i$ is $\mu_{f_i,U_i}$-strongly convex on $U_i$, we may write
$$\hat h_i=\tfrac{h_i}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}{2N}\|\cdot\|^2-\underbrace{\tfrac1N\bigl(f_i-\tfrac{\mu_{f_i,U_i}}2\|\cdot\|^2\bigr)}_{\text{convex}}\;\preceq\;\tfrac{h_i}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}{2N}\|\cdot\|^2$$
to obtain the tighter bound $\ell_{\hat h_i,U_i}\le\tfrac{\ell_{h_i,U_i}}{\gamma_i}-\tfrac{\mu_{f_i,U_i}}N$. ∎

Algorithm 3 Block-coordinate proximal point formulation of Algorithm 1

Require: Legendre kernels $h_i$ such that $f_i$ is $L_{f_i}$-smooth relative to $h_i$; stepsizes $\gamma_i\in(0,N/L_{f_i})$; initial point $x^{\rm init}\in\bigcap_{i=1}^N\operatorname{int}\operatorname{dom} h_i=C$
Denote $\boldsymbol x^0=(x^{\rm init},\dots,x^{\rm init})$, $\hat h_j\coloneqq\tfrac1{\gamma_j}h_j-\tfrac1Nf_j$, $\hat H(\boldsymbol x)\coloneqq\sum_{i=1}^N\hat h_i(x_i)$
Repeat for $k=0,1,\dots$ until convergence:
  3.1: $\boldsymbol u^k\in\arg\min_{\boldsymbol w\in\mathbb R^{nN}}\bigl\{\Phi(\boldsymbol w)+\hat H(\boldsymbol w)-\langle\nabla\hat H(\boldsymbol x^k),\boldsymbol w\rangle\bigr\}=\arg\min_{\boldsymbol w\in\mathbb R^{nN}}\bigl\{\Phi(\boldsymbol w)+D_{\hat H}(\boldsymbol w,\boldsymbol x^k)\bigr\}$
  3.2: Select a subset of indices $J^{k+1}\subseteq[N]$
  3.3: $\boldsymbol x^{k+1}_{J^{k+1}}=\boldsymbol u^k_{J^{k+1}}$ and $\boldsymbol x^{k+1}_{[N]\setminus J^{k+1}}=\boldsymbol x^k_{[N]\setminus J^{k+1}}$
Return $\boldsymbol u^k$

3.1. Block-coordinate proximal point reformulation of Algorithm 1. Algorithm 3 presents a block coordinate (BC) proximal point algorithm with the distance-generating function $\hat H$. Note that, in a departure from most of the existing literature on BC proximal methods that consider separable nonsmooth terms (see e.g., [63, 47, 12, 19, 31]), here the nonsmooth function $G$ in (3.1) is nonseparable. It is shown in the next lemma that this conceptual algorithm is equivalent to the Bregman Finito/MISO Algorithm 1.

Lemma 3.2 (equivalence of Algorithms 1 and 3). As long as the same initialization parameters are chosen in the two algorithms, to any sequence $(\boldsymbol s^k,\tilde s^k,z^k,I^{k+1})_{k\in\mathbb N}$ generated by Algorithm 1 there corresponds a sequence $(\boldsymbol x^k,\boldsymbol u^k,J^{k+1})_{k\in\mathbb N}$ generated by Algorithm 3 (and vice versa) satisfying the following identities for all $k\in\mathbb N$ and $i\in[N]$:

(i) $I^{k+1}=J^{k+1}$;

(ii) $(z^k,\dots,z^k)=\boldsymbol u^k$;

(iii) $s^k_i=\tfrac1{\gamma_i}\nabla h_i(x^k_i)-\tfrac1N\nabla f_i(x^k_i)$ (or, equivalently, $x^k_i=\nabla\hat h_i^*(s^k_i)$);

(iv) $\tilde s^k=\sum_{i=1}^N\nabla\hat h_i(x^k_i)$;

(v) $\varphi(z^k)=\Phi(\boldsymbol u^k)=\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)$;

(vi) $\Phi^{\hat H}(\boldsymbol x^k)=\mathcal L(z^k,\boldsymbol s^k)$, where $\mathcal L$ is as in (1.1).

Proof. Let the index sets $I^{k+1}$ and $J^{k+1}$ be chosen identically, $k\in\mathbb N$. It follows from Lemma 3.1(ii) that $u^k_i=u^k_j$ for all $k\in\mathbb N$ and $i,j\in[N]$, with
$$u^k_i\in\arg\min_{w\in\mathbb R^n}\Bigl\{g(w)+\sum_{i=1}^N\tfrac1{\gamma_i}h_i(w)-\bigl\langle\underbrace{\textstyle\sum_{i=1}^N\nabla\hat h_i(x^k_i)}_{\eqqcolon v^k},w\bigr\rangle\Bigr\}.\tag{3.8}$$
We now proceed by induction to show assertions 3.2(ii), 3.2(iii), and 3.2(iv). Note that the latter amounts to showing that $v^k$ as defined in (3.8) coincides with $\tilde s^k$; by comparing (3.8) and the expression of $z^k$ in step 1.1, the claimed correspondence of $u^k$ and $z^k$ as in assertion 3.2(ii) is then also obtained and, in turn, so is the identity in 3.2(v).

For $k=0$, assertions 3.2(iii) and 3.2(iv) hold because of the initialization of $\tilde s^0$ in Algorithm 1 and of $\boldsymbol x^0$ in Algorithm 3; in turn, as motivated above, the base case for assertion 3.2(ii) also holds. Suppose now that the three assertions hold for some $k\ge0$; then,
$$v^{k+1}=\sum_{i=1}^N\nabla\hat h_i(x^{k+1}_i)=\sum_{i\in I^{k+1}}\nabla\hat h_i(u^k_i)+v^k-\sum_{i\in I^{k+1}}\nabla\hat h_i(x^k_i)\overset{\text{(induction)}}{=}\sum_{i\in I^{k+1}}\nabla\hat h_i(z^k)+\tilde s^k-\sum_{i\in I^{k+1}}s^k_i=\sum_{i\in I^{k+1}}s^{k+1}_i+\tilde s^k-\sum_{i\in I^{k+1}}s^k_i=\tilde s^{k+1},$$
where the last two equalities are due to steps 1.2 and 1.3. Therefore, $v^{k+1}=\tilde s^{k+1}$ and thus $\boldsymbol u^{k+1}=(z^{k+1},\dots,z^{k+1})$. It remains to show that $s^{k+1}_i=\tfrac1{\gamma_i}\nabla h_i(x^{k+1}_i)-\tfrac1N\nabla f_i(x^{k+1}_i)$. For $i\in I^{k+1}$ this holds because of the update rule at step 1.2 and the fact that $x^{k+1}_i=u^k_i=z^k$ owing to step 3.3. For $i\notin I^{k+1}$ this holds because $(x^{k+1}_i,s^{k+1}_i)=(x^k_i,s^k_i)$. Finally,
$$\Phi^{\hat H}(\boldsymbol x^k)\overset{\rm def}{=}\Phi(\boldsymbol u^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)\overset{3.2(v)}{=}\varphi(z^k)+\sum_{i=1}^ND_{\hat h_i}(z^k,x^k_i)\overset{3.2(iii)}{=}\varphi(z^k)+\sum_{i=1}^ND_{\hat h_i}\bigl(z^k,\nabla\hat h_i^*(s^k_i)\bigr),$$
and the last term is $\mathcal L(z^k,\boldsymbol s^k)$ (cf. Facts 2.3(i) and 2.3(iii)), yielding assertion 3.2(vi). ∎

4. Convergence analysis

The block coordinate interpretation of Algorithm 1 presented in Section 3 plays a crucial role in the proposed methodology, and leads to a remarkably simple convergence analysis. In fact, many key facts can be established without confining the discussion to a particular sampling strategy. These preliminary results are presented in the next subsection and will be extensively referred to in the subsequent subsections that are instead devoted to a specific sampling strategy.

4.1. General sampling results. Unlike classical analyses of BC proximal methods that employ the cost as a Lyapunov function (see e.g., [10, §11]), here the nonseparability of $G$ precludes this possibility. To address this challenge, we instead employ the Bregman Moreau envelope equipped with the distance-generating function $\hat H$ (see (3.4)). Before showing its Lyapunov-type behavior for Algorithm 3, we list some of its properties and its relation with the original problem. The proof is a simple consequence of Facts 2.6(ii) and 2.7 and the fact that $\hat H$ is a Legendre kernel with $\operatorname{dom}\hat H=\operatorname{dom} h_1\times\cdots\times\operatorname{dom} h_N$ (cf. Lemma 3.1).

Lemma 4.1 (connections between $\varphi+\delta_C$ and $\Phi^{\hat H}$). Suppose that Assumption I holds. Then,

(i) $\Phi^{\hat H}$ is continuous on $\operatorname{dom}\Phi^{\hat H}=\operatorname{int}\operatorname{dom} h_1\times\cdots\times\operatorname{int}\operatorname{dom} h_N$; in fact, it is locally Lipschitz if so is $\nabla h_i$ on $\operatorname{int}\operatorname{dom} h_i$, $i\in[N]$;

(ii) $\min_{\overline C}\varphi\le\inf_C\varphi=\inf\Phi^{\hat H}$ and $\arg\min\Phi^{\hat H}=\bigl\{(x^\star,\dots,x^\star)\mid x^\star\in\arg\min_C\varphi\bigr\}$;

(iii) $\Phi^{\hat H}$ is level bounded iff so is $\varphi+\delta_C$.

Lemma 4.2 (sure descent). Suppose that Assumption I holds, and consider the iterates generated by Algorithm 3. Then, $\boldsymbol u^k=(u^k,\dots,u^k)$ for some $u^k\in C$ and $\boldsymbol x^k\in C\times\cdots\times C\subseteq\operatorname{int}\operatorname{dom}\hat H$ for every $k\in\mathbb N$, and the algorithm is thus well defined. Moreover, the following hold:

(i) $\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)=\Phi^{\hat H}(\boldsymbol x^k)-\sum_{i\in J^{k+1}}D_{\hat h_i}(u^k,x^k_i)$ for every $k\in\mathbb N$; when $\Phi$ is convex (i.e., when so is $\varphi$), the inequality can be strengthened to $\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)-D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1})$.

(ii) $(\Phi^{\hat H}(\boldsymbol x^k))_{k\in\mathbb N}$ monotonically decreases to a finite value $\varphi^\star\ge\inf_C\varphi\ge\min_{\overline C}\varphi$.

(iii) The sequence $(D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k))_{k\in\mathbb N}$ has finite sum (and in particular vanishes); the same holds also for $(D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1}))_{k\in\mathbb N}$ when $\Phi$ is convex (i.e., when so is $\varphi$).

(iv) If $\varphi+\delta_C$ is level bounded, then $(\boldsymbol x^k)_{k\in\mathbb N}$ and $(\boldsymbol u^k)_{k\in\mathbb N}$ are bounded.

(v) If $\operatorname{dom} h_i$ is closed, a subsequence $(x^k_i)_{k\in K}$ converges to a point $x^\star$ iff so does $(x^{k+1}_i)_{k\in K}$.

(vi) If $C=\mathbb R^n$, then $\Phi^{\hat H}$ is constant (and equals $\varphi^\star$ as above) on the set of accumulation points of $(\boldsymbol x^k)_{k\in\mathbb N}$.

Proof. It follows from Lemma 3.1(ii) that $u^k\in C$ holds for every $k\in\mathbb N$. Notice that for every $i\in[N]$ and $k\in\mathbb N$, either $x^k_i=x^{\rm init}\in C$ (by initialization), or there exists $k_i\le k$ such that $x^k_i=z^{k_i}\in C$. It readily follows that $\boldsymbol x^k\in C\times\cdots\times C\subseteq\operatorname{int}\operatorname{dom} H=\operatorname{int}\operatorname{dom}\hat H$, hence that $\operatorname{prox}^{\hat H}_\Phi(\boldsymbol x^k)\ne\emptyset$ for all $k\in\mathbb N$ by Lemma 3.1(ii), whence the well definedness of the algorithm. We now show the numbered claims.

♠ 4.2(i): It follows from Facts 2.7(i) and 2.7(ii) that $\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi(\boldsymbol u^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^{k+1})-c_k$, where $c_k\ge0$ can be taken as $c_k=D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1})$ when $\Phi$ is convex. Therefore,
$$\Phi^{\hat H}(\boldsymbol x^{k+1})\le\Phi(\boldsymbol u^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^{k+1})-c_k=\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)+D_{\hat H}(\boldsymbol u^k,\boldsymbol x^{k+1})-c_k\overset{2.3(ii)}{=}\Phi^{\hat H}(\boldsymbol x^k)-D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)-\langle\boldsymbol u^k-\boldsymbol x^{k+1},\nabla\hat H(\boldsymbol x^{k+1})-\nabla\hat H(\boldsymbol x^k)\rangle-c_k.$$
The claim follows by noting that the inner product is zero:
$$\langle\boldsymbol u^k-\boldsymbol x^{k+1},\nabla\hat H(\boldsymbol x^{k+1})-\nabla\hat H(\boldsymbol x^k)\rangle=\sum_{j\in[N]}\bigl\langle u^k-\underbrace{x^{k+1}_j}_{=u^k\text{ for }j\in J^{k+1}},\ \nabla\hat h_j\bigl(\underbrace{x^{k+1}_j}_{=x^k_j\text{ for }j\notin J^{k+1}}\bigr)-\nabla\hat h_j(x^k_j)\bigr\rangle=0.$$

♠ 4.2(ii): Monotonic decrease of $(\Phi^{\hat H}(\boldsymbol x^k))_{k\in\mathbb N}$ follows from assertion 4.2(i). This ensures that the sequence converges to some value $\varphi^\star$, bounded below by $\min_{\overline C}\varphi$ in light of Lemma 4.1(ii).

♠ 4.2(iii): It follows from assertion 4.2(i) that
$$\sum_{k\in\mathbb N}D_{\hat H}(\boldsymbol x^{k+1},\boldsymbol x^k)\le\Phi^{\hat H}(\boldsymbol x^0)-\inf\Phi^{\hat H}\le\Phi^{\hat H}(\boldsymbol x^0)-\inf_C\varphi<\infty$$
owing to Lemma 4.1(ii) and Assumption I.a3. When $\varphi$ is convex, the tighter bound in assertion 4.2(i) yields the similar claim for $(D_{\hat H}(\boldsymbol u^k,\boldsymbol u^{k+1}))_{k\in\mathbb N}$.

♠ 4.2(iv): It follows from assertion 4.2(ii) that the entire sequence $(\boldsymbol x^k)_{k\in\mathbb N}$ is contained in the sublevel set $\{\boldsymbol w\mid\Phi^{\hat H}(\boldsymbol w)\le\Phi^{\hat H}(\boldsymbol x^0)\}$, which is bounded provided that $\varphi+\delta_C$ is level bounded, as shown in Lemma 4.1(iii). In turn, boundedness of $(\boldsymbol u^k)_{k\in\mathbb N}$ then follows from local boundedness of $T^\Gamma_{F,G}=\operatorname{prox}^{\hat H}_\Phi$, cf. (3.6) and Fact 2.6(i).

♠ 4.2(v): Follows from Fact 2.3(v), since $x^k_i\in\operatorname{int}\operatorname{dom} h_i=\operatorname{int}\operatorname{dom}\hat h_i$ for every $k$ (with equality owing to Lemma 3.1), and $D_{\hat h_i}(x^{k+1}_i,x^k_i)\to0$ by assertion 4.2(iii).

♠ 4.2(vi): Follows from assertion 4.2(ii) and the continuity of $\Phi^{\hat H}$, see Fact 2.6(ii). ∎

In conclusion of this subsection we provide an overview of the ingredients that are needed to show that the limit points of the sequence $(z^k)_{k\in\mathbb N}$ generated by Algorithm 1 are stationary for problem (P). As will be shown in Lemma 4.4, these amount to the vanishing of the residual $D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)$ together with some assumptions on the distance-generating functions $h_i$. For the iterates of Algorithm 1, this translates to $D_{\hat h_i^*}(s^k_i,\nabla\hat h_i(z^k))\to0$ for all indices $i\in[N]$, indicating that all vectors $s^{k+1}_i$ in the table should be good estimates of $\nabla\hat h_i(z^{k+1})=\tfrac1{\gamma_i}\nabla h_i(z^{k+1})-\tfrac1N\nabla f_i(z^{k+1})$, as opposed to $\tfrac1{\gamma_i}\nabla h_i(z^k)-\tfrac1N\nabla f_i(z^k)$ and for the indices in $I^{k+1}$ only (cf. step 1.2). As a result, we may view this property as jointly having $z^k-z^{k+1}$ vanish, desirable if any convergence of $(z^k)_{k\in\mathbb N}$ is expected, and the fact that a consensus is eventually reached among the sampled blocks.

In line with any result in the literature we are aware of, a complete convergence analysis for nonconvex problems will ultimately require $C=\mathbb R^n$. For convex problems, that is, when the cost function $\varphi$ is convex without any among $f_i$ and $g$ being necessarily so, the following requirement will instead suffice for our purposes in the randomized sampling setting of (S1).

Assumption II (requirements on the distance-generating functions). For $i\in[N]$, $\operatorname{dom} h_i$ is closed, and whenever $\operatorname{int}\operatorname{dom} h_i\ni z^k\to z\in\operatorname{bdry}\operatorname{dom} h_i$ it holds that $D_{h_i}(z,z^k)\to0$.

Remark 4.3.

(i) Assumption II is vacuously satisfied when $\operatorname{dom} h_i=\mathbb R^n$, having $\operatorname{bdry}\mathbb R^n=\emptyset$.

(ii) For any $i\in[N]$, function $h_i$ complies with Assumption II iff so does $\hat h_i$, owing to the inequalities $\tfrac{N-\gamma_iL_{f_i}}{N\gamma_i}D_{h_i}\le D_{\hat h_i}\le\tfrac{N+\gamma_iL_{f_i}}{N\gamma_i}D_{h_i}$ (cf. (3.7)).

Lemma 4.4 (subsequential convergence recipe). Suppose that Assumption I holds, and consider the iterates generated by Algorithm 1. Let $x^k_i=\nabla\hat h_i^*(s^k_i)$ and $z^k=u^k$ be the corresponding iterates generated by Algorithm 3 as in Lemma 3.2, and suppose that

a1: $D_{\hat H}(\boldsymbol u^k,\boldsymbol x^k)\to0$ (or, equivalently, $D_{\hat h_i^*}(s^k_i,\nabla\hat h_i(z^k))\to0$, $i\in[N]$).

Then, letting $\varphi^\star$ be as in Lemma 4.2(ii), the following hold:

(i) $\varphi(z^k)=\Phi(\boldsymbol u^k)\to\varphi^\star$ as $k\to\infty$.

(ii) If $\operatorname{dom} h_i$ is closed, $i\in[N]$, then having (a) $(z^k)_{k\in K}\to z$, (b) $(x^k_i)_{k\in K}\to z$ for some $i\in[N]$, and (c) $(z^{k+1},x^{k+1}_i)_{k\in K}\to(z,z)$ for all $i\in[N]$, are all equivalent conditions. In particular, if $(z^k)_{k\in\mathbb N}$ is bounded (e.g., when $\varphi+\delta_C$ is level bounded), then $\|z^{k+1}-z^k\|\to0$ holds, and the set of its limit points, be it $\omega$, is thus nonempty, compact, and connected.

(iii) Under Assumption II, $\varphi\equiv\varphi^\star$ on $\omega$ (the set of limit points of $(z^k)_{k\in\mathbb N}$).
