
Multi-agent structured optimization over message-passing architectures with bounded communication delays

Puya Latafat, Panagiotis Patrinos

Abstract— We consider the problem of solving structured convex optimization problems over a network of agents with communication delays. It is assumed that each agent performs its local updates using possibly outdated information from its neighbors, under the assumption that the delay with respect to each neighbor is bounded but otherwise arbitrary. The private objective of each agent is represented by the sum of two possibly nonsmooth functions, one of which is composed with a linear mapping. The global optimization problem consists of the aggregate of the local cost functions and a common Lipschitz-differentiable function. In the case when the coupling between agents is represented only through the common function, we employ the primal-dual algorithm proposed by Vũ and Condat.

In the case when the linear maps introduce additional coupling between agents, a new algorithm is developed. In both cases convergence is obtained under a strong convexity assumption.

To the best of our knowledge, this is the first time that this form of delay is analyzed for a primal-dual algorithm in a message-passing local-memory model.

I. INTRODUCTION

In this paper we consider a class of structured optimization problems that can be represented as follows:

$$\underset{x\in\mathbb{R}^n}{\text{minimize}}\quad f(x) + \sum_{i=1}^{m} g_i(x_i) + h_i(N_i x), \tag{1}$$

where $x = (x_1, \ldots, x_m)$, $N_i$ is a linear mapping, $h_i, g_i$ are proper closed convex (possibly) nonsmooth functions, and $f$ is convex, continuously differentiable with Lipschitz continuous gradient. The goal is to solve (1) over a network of agents through local communications. Each agent is assumed to maintain its own private cost functions $g_i$ and $h_i \circ N_i$, while $f$ and (possibly) the linear mappings $N_i$ represent the coupling between the agents. An important challenge in such a network is that the agents may not have access to the latest information required for their computations.

Most iterative algorithms for convex optimization can be written as

$$x^{k+1} = x^k - T x^k, \tag{2}$$

where the mapping $\mathrm{Id} - T$ ($\mathrm{Id}$ is the identity operator) has some contractive property resulting in the convergence of the sequence to a zero of $T$. In distributed optimization the goal is to devise algorithms where a group of agents/processors distributively update certain coordinates of $x$ while guaranteeing convergence to a zero of $T$.

Puya Latafat¹,²; Email: puya.latafat@{kuleuven.be,imtlucca.it}

Panagiotis Patrinos¹; Email: panos.patrinos@esat.kuleuven.be

This work was supported by: FWO PhD fellowship 1196818N; FWO projects G086318N and G086518N; Fonds de la Recherche Scientifique – FNRS and the Fonds Wetenschappelijk Onderzoek – Vlaanderen under EOS Project no. 30468160 (SeLMA).

¹KU Leuven, Department of Electrical Engineering (ESAT-STADIUS), Kasteelpark Arenberg 10, 3001 Leuven-Heverlee, Belgium.

²IMT School for Advanced Studies Lucca, Piazza San Francesco 19, 55100 Lucca, Italy.

There are two main computational models in distributed optimization (depicted in Fig. 1), with a range of hybrid models in between [1, Chap. 1]. These models are conceptually different and require different analysis. The model considered in this work is the local/private-memory model. Let us first describe the two models.

Shared-memory model: This model is characterized by the access of all agents/processors to a shared memory. A large body of literature exists on parallel coordinate descent algorithms for this problem. Typically, coordinate descent algorithms would require a memory lock to ensure consistent reading. Interesting recent works allow inconsistent reads [2], [3]. In this model, for the fixed point iteration (2), each processor reads the global memory and proceeds to choose a random coordinate $i \in \{1,\ldots,m\}$ and to perform

$$x_i^{k+1} = x_i^k - T_i \hat{x}^k,$$

where $\hat{x}^k$ denotes the data loaded from the global memory to the local storage at the clock tick $k$, and $T_i$ represents the operator that updates the $i$th coordinate. This form of update is asynchronous in the sense that the processors update the global memory simultaneously, resulting in a possibly inconsistent local copy $\hat{x}^k$ due to other processors modifying the global memory during a read. The analysis of such algorithms would in general rely on either using the properties of the operator that updates the $i$th coordinate when possible (coordinate-wise Lipschitz continuity in the case of the gradient [2]), or the properties of the global operator (see [3] for nonexpansive operators). A crucial point in the convergence analysis of such methods is the fact that for a given processor, the index of the coordinate to be updated is selected at random, but no matter which coordinate is selected the same local data $\hat{x}^k$ is used for the update.

Let $\hat{T}_i x := (0, \ldots, 0, T_i x, 0, \ldots, 0)$. Then in a randomized scheme the operators $\hat{T}_i$ can be summed over $i$:

$$\sum_{i=1}^{m} \hat{T}_i \hat{x}^k = T \hat{x}^k,$$

allowing one to use the properties known for the global operator (see the proof of [3, Lem. 2]). As we discuss below, the difficulty in the local-memory model is precisely due to the fact that this summation no longer holds.
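To fix ideas, the following toy sketch (our illustration, not code from the paper) simulates the shared-memory model: each step reads a copy $\hat{x}^k$ of the global vector, draws a random coordinate, and applies $T_i$; the operator $T$ is taken to be a gradient step on a least-squares cost, a hypothetical choice. A serial simulation cannot reproduce truly inconsistent reads, but the mechanics of reading once and then updating a single coordinate are the same.

```python
import numpy as np

# Sketch of the shared-memory model (our illustration). T is a gradient
# step on f(x) = 0.5*||Ax - b||^2, so Id - T is contractive for a small
# enough stepsize and the iterates approach the zero of T.
rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
gamma = 0.9 / np.linalg.norm(A, 2) ** 2   # stepsize below 1/beta

def T(x):
    return gamma * A.T @ (A @ x - b)      # full update: x^{k+1} = x^k - Tx^k

x = np.zeros(n)                            # the global memory
for k in range(20000):
    x_hat = x.copy()                       # read (possibly stale in parallel)
    i = rng.integers(n)                    # random coordinate index
    x[i] -= T(x_hat)[i]                    # update only coordinate i

print(np.linalg.norm(A @ x - b))           # residual shrinks toward zero
```

Because the same read $\hat{x}^k$ is used regardless of which coordinate is drawn, summing the coordinate operators over $i$ recovers $T\hat{x}^k$, which is exactly the property exploited in [3].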

Local/private-memory model: In this model each agent/processor has its own private local memory. The agents can send information to and receive information from other agents as needed, and agent $i$ can only update $x_i$. This model is also referred to as the message-passing model [1].

Fig. 1. The two main memory models; (left) agents cooperating to perform a task, (right) processors updating a global memory.

In the absence of delay between agents, randomized block-coordinate updates may be used to develop distributed asynchronous algorithms. Such schemes would typically involve random independent activation of agents to perform their local updates, and are in this sense also referred to as asynchronous [4]–[7]. In this work we are only concerned with the use of outdated information by the agents and do not pursue this form of asynchrony.

In accordance with the notation of the seminal work [1, Chap. 7] we define the following local (outdated) version of the generic vector $x^k = (x_1^k, \ldots, x_m^k)$ used by agent $i$:

$$x^k_{[i]} := \big(x_1^{\tau^i_1(k)}, \ldots, x_m^{\tau^i_m(k)}\big), \tag{3}$$

where $\tau^i_j(k)$ is the latest time at which the value of $x_j$ was transmitted to agent $i$ by agent $j$. In our setting the delay is assumed to be bounded:

Assumption 1. There exists an integer $B$ such that for all $k \geq 0$ the following holds:

$$(\forall i,j)\quad 0 \leq k - \tau^i_j(k) \leq B, \quad \text{and} \quad \tau^i_i(k) = k.$$

The fact that each agent knows its own local variable without delay is reflected in the assumption $\tau^i_i(k) = k$. This is a natural assumption and is satisfied in practice. Notice that for ease of notation we defined the complete outdated vector, while in practice each agent would only keep a local copy of the coordinates that are required for its computation; see Fig. 1. The direction of the arrows in Fig. 1 signifies the nature of the coupling between two agents. For example, the arrow from $A_4$ to $A_3$ indicates that agent $A_3$ requires $x_4$ for its computation. Such a relation between agents depends on the formulation and the nature of coupling between agents. For instance, in the minimization (1), the coupling is represented through $f$ and possibly $N_i$. As we shall see in §II, the coupling through $f$ may be one-sided, since agent $i$ may require $x_j$ for computing $\nabla_i f$ (the partial derivative of $f$ with respect to $x_i$) without agent $j$ requiring $x_i$.
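As a concrete illustration of (3) and Assumption 1 (our sketch, not from the paper), each agent can be given read access to a history of past iterates, with block $j$ read at an arbitrary staleness of at most $B$ and the agent's own block always fresh, matching $\tau^i_i(k) = k$:

```python
import numpy as np

# Sketch of the outdated vector x^k_{[i]} of (3) under Assumption 1 (our
# illustration). history[t] stores the full iterate at time t; a real
# message-passing implementation would only store, per agent, the blocks
# of its in-neighbors.
rng = np.random.default_rng(1)
m, B = 4, 3                              # number of agents, delay bound

def outdated_copy(i, k, history):
    """Return x^k_{[i]}: block j is x_j^{tau^i_j(k)} with 0 <= k - tau <= B."""
    x_local = np.empty(m)
    for j in range(m):
        tau = k if j == i else max(0, k - rng.integers(0, B + 1))
        x_local[j] = history[tau][j]     # tau^i_i(k) = k: own block is fresh
    return x_local
```

Each agent $i$ would then update its own block using `outdated_copy(i, k, history)` in place of the fresh $x^k$.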

In summary, each agent controls only one block of coordinates and updates according to

$$x_i^{k+1} = x_i^k - T_i x^k_{[i]},$$

the result of which will be sent (possibly with delay) to the agents that require it in their computations. The difficulty in this model comes from the impossibility of summing $T_i x^k_{[i]}$ over all $i$, given that $x^k_{[i]}$ is different for each $i$.

In addition to the above described delay, the partially asynchronous model considered in [1, Chap. 7] involves a second assumption: each agent must perform an update at least once during any time interval of a given length. Instead, we are not concerned with asynchrony but rather with the use of outdated information by the agents. We emphasize that developing partially asynchronous schemes for primal-dual algorithms or randomized schemes that comply with the delay model described in (3) remains a challenge.

In [1, Chap. 7.5] a partially asynchronous variant of the gradient method is studied. This analysis is further extended to the projected-gradient method in the convex case. In [8] a periodic linear convergence rate is established for the projected-gradient method. The recent work [9] extends this analysis to the proximal-gradient method. Notice that the aforementioned primal methods are not well equipped for problems with more complex structures as in (1).

In this work we study two primal-dual algorithms for solving (1) in the presence of bounded communication delays.

Primal-dual proximal algorithms are a class of first-order methods that are easy to implement, are parallelizable, and yield the primal and dual solutions simultaneously. They are able to exploit the structure in (1) efficiently, resulting in fully split algorithms applicable to a wide range of applications. It is worth noting that while this paper focuses on two particular primal-dual algorithms, a similar analysis should be applicable to other primal-dual methods such as the ones developed in [6], [10]–[13].

A. Main Contributions

To the best of our knowledge this is the first work that considers the general delay described in (3) for a primal-dual algorithm. Unlike primal methods (gradient or proximal-gradient), this scheme can be applied to solve problems with complex structures as in (1) without the need to invert matrices or to solve inner loops.

The analyses of [1], [8], [9] rely on the use of the cost as the Lyapunov function. In contrast, we show that under the bounded delay assumption and some strong convexity assumption, the generated sequence is quasi-Fejér monotone provided that the stepsizes are sufficiently small. Moreover, linear convergence is established with an explicit convergence factor.

Two primal-dual algorithms are presented: (i) when the coupling between agents is enforced only through $f$, the algorithm of [14], [15] is considered; (ii) when the coupling is enforced through $f$ and the linear mappings, a modified algorithm is developed which appears to be new. In the second case, due to the presence of additional coupling, smaller stepsizes must be used to ensure convergence.

B. Motivating Example

Consider the problem of formation control [16], where each agent (vehicle) has its own private dynamics and cost function, and the goal is to achieve a specific formation while communicating only with a selected number of agents. Let $w_i = (\xi_i, v_i)$, where $\xi_i$ and $v_i$ denote the local state and input sequences. The location of agent $i$ is given by $y_i = C\xi_i$ and the set of its neighbors is denoted by $\mathcal{A}_i$. The linear dynamics of each agent over a control horizon is represented by the constraints $E_i w_i = b_i$. In order to enforce a formation between agents $i$ and $j$ the quadratic cost function $\|C(\xi_i - \xi_j) - d_{ij}\|^2$ is used, where $d_{ij}$ is the target relative distance between them (refer to [16] for details). Hence, the formation control problem is formulated as the following constrained minimization:

$$\begin{aligned} \text{minimize}\quad & \tfrac{1}{2}\sum_{i=1}^{m} \sum_{j\in\mathcal{A}_i} \|C(\xi_i - \xi_j) - d_{ij}\|^2 + \tfrac{1}{2}\sum_{i=1}^{m} w_i^\top Q_i w_i \\ \text{subject to}\quad & E_i w_i = b_i, \quad i = 1,\ldots,m. \end{aligned}$$

This problem can easily be cast in the form of (1) by setting $f$ equal to the first term, $g_i$ equal to the quadratic local cost, $h_i$ the indicator of the point $b_i$, and the linear mapping $N_i = E_i$; a sketch of this casting follows below. Therefore, the objective is to enforce a formation between agents by solving this optimization problem in the presence of communication delays, allowing the agents to use outdated information. Notice that in this case the coupling between agents is enforced only through $f$. This special case of (1) is studied in §III.
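For concreteness, here is how the pieces of the formation problem map onto (1) in code (our construction; the matrices $C$, $E_i$, the vectors $b_i$, $d_{ij}$, and the weights $Q_i$ below are hypothetical placeholders): $f$ collects the pairwise formation costs, $g_i$ is the local quadratic cost, and $h_i$ is the indicator of $\{b_i\}$ composed with $N_i = E_i$.

```python
import numpy as np

# Casting the formation-control example into the form (1). All problem
# data below are hypothetical placeholders chosen for illustration.
m, nw, nx, p = 3, 6, 4, 2                 # agents, dims of w_i, xi_i, y_i
rng = np.random.default_rng(2)
C = rng.standard_normal((p, nx))          # output map: y_i = C xi_i
E = [rng.standard_normal((p, nw)) for _ in range(m)]  # dynamics E_i w_i = b_i
b = [rng.standard_normal(p) for _ in range(m)]
Q = [np.eye(nw) for _ in range(m)]        # local quadratic weights
nbrs = {0: [1], 1: [0, 2], 2: [1]}        # neighbor sets A_i
d = {(i, j): rng.standard_normal(p) for i in nbrs for j in nbrs[i]}

def xi(w_i):                              # state block of w_i = (xi_i, v_i)
    return w_i[:nx]

def f(w):                                 # smooth coupling term of (1)
    return 0.5 * sum(np.linalg.norm(C @ (xi(w[i]) - xi(w[j])) - d[i, j]) ** 2
                     for i in nbrs for j in nbrs[i])

def g(i, w_i):                            # private quadratic cost g_i
    return 0.5 * w_i @ Q[i] @ w_i

def h(i, z):                              # h_i = indicator of {b_i}; N_i = E_i
    return 0.0 if np.allclose(z, b[i]) else np.inf
```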

C. Notation and Preliminaries

Throughout, $\mathbb{R}^n$ is the $n$-dimensional Euclidean space with inner product $\langle\cdot,\cdot\rangle$ and induced norm $\|\cdot\|$. For a positive definite matrix $P$ we define the scalar product $\langle x,y\rangle_P = \langle x, Py\rangle$ and the induced norm $\|x\|_P = \sqrt{\langle x,x\rangle_P}$.

For a set $C$, we denote its relative interior by $\mathrm{ri}\,C$. Let $q : \mathbb{R}^n \to \overline{\mathbb{R}} := \mathbb{R} \cup \{+\infty\}$ be a proper closed convex function. Its domain is denoted by $\mathrm{dom}\,q$. Its subdifferential is the set-valued operator $\partial q : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$,

$$\partial q(x) = \{ y \in \mathbb{R}^n \mid \forall z \in \mathbb{R}^n,\ \langle z - x, y\rangle + q(x) \leq q(z) \}.$$

For a positive scalar $\rho$, the proximal map associated with $q$ is the single-valued mapping defined by

$$\mathrm{prox}_{\rho q}(x) := \underset{z\in\mathbb{R}^n}{\mathrm{argmin}} \left\{ q(z) + \tfrac{1}{2\rho}\|x - z\|^2 \right\}.$$

The Fenchel conjugate of $q$, denoted by $q^*$, is defined as $q^*(v) := \sup_{x\in\mathbb{R}^n}\{\langle v,x\rangle - q(x)\}$. The function $q$ is said to be $\mu$-convex with $\mu \geq 0$ if $q(x) - \tfrac{\mu}{2}\|x\|^2$ is convex.
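Two standard instances may help fix ideas (our examples, not from the paper): for $q = \|\cdot\|_1$ the proximal map is soft-thresholding, and for $q$ the indicator of a single point $\{b\}$ (as appears in the formation example) the proximal map returns $b$ regardless of its input.

```python
import numpy as np

# prox_{rho q}(x) = argmin_z { q(z) + (1/(2 rho)) * ||x - z||^2 }

def prox_l1(x, rho):
    """Soft-thresholding: the prox of q = ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - rho, 0.0)

def prox_indicator_point(x, rho, b):
    """Prox of the indicator of {b}: the projection onto the point b."""
    return b

print(prox_l1(np.array([1.5, -0.2, 0.7]), 0.5))   # [1.0, -0.0, 0.2]
```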

A sequence $(w^k)_{k\in\mathbb{N}}$ is said to be quasi-Fejér monotone relative to the set $U$ if for all $v \in U$ and all $k \in \mathbb{N}$

$$\|w^{k+1} - v\|^2 \leq \|w^k - v\|^2 + \varepsilon_k,$$

where $(\varepsilon_k)_{k\in\mathbb{N}}$ is a summable nonnegative sequence [17].

The positive part of $x \in \mathbb{R}$ is denoted by $[x]_+ := \max\{x, 0\}$.

II. PROBLEM SETUP

Throughout this paper the primal and dual vectors, denoted $x$ and $u$, are assumed to be composed of $m$ blocks as follows:

$$x = (x_1,\ldots,x_m) \in \mathbb{R}^n, \qquad u = (u_1,\ldots,u_m) \in \mathbb{R}^r,$$

where $x_i \in \mathbb{R}^{n_i}$ and $u_i \in \mathbb{R}^{r_i}$. Consider a linear mapping $L : \mathbb{R}^n \to \mathbb{R}^r$ that is partitioned as follows:

$$L = \begin{pmatrix} L_{11} & \cdots & L_{1m} \\ \vdots & \ddots & \vdots \\ L_{m1} & \cdots & L_{mm} \end{pmatrix}, \tag{4}$$

where $L_{ij} : \mathbb{R}^{n_j} \to \mathbb{R}^{r_i}$. Furthermore, the $i$th (block) row of $L$ is denoted by $L_i : \mathbb{R}^n \to \mathbb{R}^{r_i}$ and the $i$th (block) column by $L^i : \mathbb{R}^{n_i} \to \mathbb{R}^r$, i.e.,

$$L = \begin{pmatrix} L_1 \\ \vdots \\ L_m \end{pmatrix} = \begin{pmatrix} L^1 & \cdots & L^m \end{pmatrix}.$$

The following holds:

$$\langle Lx, u\rangle = \sum_{i=1}^{m} \langle L_i x, u_i\rangle = \sum_{i=1}^{m} \langle x_i, (L^i)^\top u\rangle. \tag{5}$$

Consider the structured optimization problem (1) where the linear mapping $N_i$ has been replaced by $L_i$ defined above in order to clarify the structure of the mapping:

$$\underset{x\in\mathbb{R}^n}{\text{minimize}}\quad f(x) + \sum_{i=1}^{m} g_i(x_i) + h_i(L_i x). \tag{6}$$

The cost functions $g_i$ and $h_i \circ L_i$ are private functions belonging to agent $i$. The coupling between agents is through the smooth term $f$ and the linear term $L_i x$. An agent $i$ is assumed to have access to the information required for its computation, be it outdated; cf. Algorithms 1 and 2.

Let the following assumptions hold:

Assumption 2.

(i) For $i = 1,\ldots,m$, $g_i : \mathbb{R}^{n_i} \to \overline{\mathbb{R}}$, $h_i : \mathbb{R}^{r_i} \to \overline{\mathbb{R}}$ are proper closed convex functions, and $L_i : \mathbb{R}^n \to \mathbb{R}^{r_i}$ is a linear mapping.

(ii) $f : \mathbb{R}^n \to \mathbb{R}$ is convex, continuously differentiable, and $\nabla f$ is $\beta$-Lipschitz continuous for some nonnegative $\beta$:

$$\|\nabla f(x) - \nabla f(x')\| \leq \beta \|x - x'\|, \quad \forall x, x' \in \mathbb{R}^n.$$

(iii) For every $i = 1,\ldots,m$ there exists a nonnegative constant $\bar{\beta}_i$ such that for all $x, x' \in \mathbb{R}^n$ satisfying $x_i = x'_i$:

$$\|\nabla_i f(x) - \nabla_i f(x')\| \leq \bar{\beta}_i \|x - x'\|. \tag{7}$$

(iv) The set of solutions to (6) is nonempty.

(v) (Constraint qualification) There exists $x_i \in \mathrm{ri}\,\mathrm{dom}\,g_i$, for $i = 1,\ldots,m$, such that $L_j x \in \mathrm{ri}\,\mathrm{dom}\,h_j$, for $j = 1,\ldots,m$.

Assumption 2(iii) quantifies the strength of the coupling (through $f$) between agents [1, Sec. 7.5]. In particular, if $f$ is separable, i.e., $f(x) = \sum_{i=1}^{m} f_i(x_i)$, then there is no coupling and $\bar{\beta}_i = 0$.

Problem (6) can be compactly represented as

$$\underset{x\in\mathbb{R}^n}{\text{minimize}}\quad f(x) + g(x) + h(Lx),$$

where $g(x) = \sum_{i=1}^{m} g_i(x_i)$, $h(u) = \sum_{i=1}^{m} h_i(u_i)$, and $L$ is as in (4). The dual problem is given by

$$\underset{u\in\mathbb{R}^r}{\text{minimize}}\quad (g + f)^*(-L^\top u) + h^*(u).$$

Under the constraint qualification of Assumption 2(v), the set of solutions to the dual problem is nonempty and the duality gap is zero [18, Cor. 31.2.1]. Furthermore, $x^\star$ is a primal solution and $u^\star$ is a dual solution if and only if the pair $(x^\star, u^\star)$ satisfies

$$\begin{aligned} 0 &\in \partial g(x^\star) + \nabla f(x^\star) + L^\top u^\star, \\ 0 &\in \partial h^*(u^\star) - L x^\star. \end{aligned} \tag{8}$$

Such a point is called a primal-dual solution, and the set of all primal-dual solutions is denoted by $\mathcal{S}$.

Let us define a few parameters used throughout the paper. For each agent $i \in \{1,\ldots,m\}$ define the positive stepsizes $\gamma_i, \sigma_i$ associated with the primal and the dual variables, respectively. Moreover, set

$$\bar{\beta} := (\bar{\beta}_1, \ldots, \bar{\beta}_m),$$
$$\Gamma := \mathrm{blkdiag}(\gamma_1 I_{n_1}, \ldots, \gamma_m I_{n_m}), \qquad \Sigma := \mathrm{blkdiag}(\sigma_1 I_{r_1}, \ldots, \sigma_m I_{r_m}).$$

Applying the algorithm of Vũ and Condat [14], [15] to (6), with stepsize matrices $\Sigma$ and $\Gamma$ as defined above, results in the following updates for agent $i$ at iteration $k$:

$$x_i^{k+1} = \mathrm{prox}_{\gamma_i g_i}\big( x_i^k - \gamma_i (L^i)^\top u^k - \gamma_i \nabla_i f(x^k) \big), \tag{9a}$$
$$u_i^{k+1} = \mathrm{prox}_{\sigma_i h_i^*}\big( u_i^k + \sigma_i L_i (2x^{k+1} - x^k) \big). \tag{9b}$$

Notice that each agent requires the latest variables $x^k$, $x^{k+1}$ and $u^k$ in the above updates, which may not be available due to communication delays. In the next section we consider the case when $L$ is block-diagonal. The case of general $L$ is studied in Section IV, where a modified primal-dual algorithm is proposed in place of (9) to allow for a larger stepsize in this case.
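A minimal synchronous sketch of the updates (9), with all blocks stacked into full vectors, may clarify the data flow (our illustration; the toy instance below, with $g = 0$, $h = \|\cdot\|_1$ and $L = I$, is a hypothetical choice, and the delayed variants of Sections III and IV replace $x^k$, $u^k$ by outdated copies):

```python
import numpy as np

def vu_condat_step(x, u, L, grad_f, prox_g, prox_hconj, Gamma, Sigma):
    # (9a): primal step on the stacked vector
    x_new = prox_g(x - Gamma @ (L.T @ u) - Gamma @ grad_f(x))
    # (9b): dual step with the reflected primal 2 x^{k+1} - x^k
    u_new = prox_hconj(u + Sigma @ (L @ (2 * x_new - x)))
    return x_new, u_new

# Toy instance: f(x) = 0.5*||x - a||^2, g = 0, h = ||.||_1, L = I, so the
# prox of sigma*h* is the projection onto the unit box (a clip).
n = 4
a = np.array([2.0, -3.0, 0.1, 0.5])
L = np.eye(n)
Gamma, Sigma = 0.4 * np.eye(n), 0.4 * np.eye(n)
grad_f = lambda x: x - a
prox_g = lambda v: v                          # g = 0
prox_hconj = lambda v: np.clip(v, -1.0, 1.0)  # h = ||.||_1

x, u = np.zeros(n), np.zeros(n)
for _ in range(500):
    x, u = vu_condat_step(x, u, L, grad_f, prox_g, prox_hconj, Gamma, Sigma)
print(x)   # approaches soft-thresholding of a at level 1: [1, -2, 0, 0]
```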

III. THE CASE OF BLOCK-DIAGONAL LINEAR MAPPING

Throughout this section we assume that the linear mapping $L$ has a block-diagonal structure. Therefore, the coupling between agents is enacted only through the smooth function $f$. The example of formation control in Section I-B is of this structure.

Under this assumption problem (6) becomes

$$\underset{x\in\mathbb{R}^n}{\text{minimize}}\quad f(x) + \sum_{i=1}^{m} g_i(x_i) + h_i(L_{ii} x_i),$$

where $L_{ii}$ is the $i$th diagonal block of $L$; see (4). Given this diagonal structure, in the updates (9), agent $i$ must receive those $x_j$'s that are required for the computation of $\nabla_i f$, and all other operations are local. Let us define the set of agents that are required to send their variable to $i$ as follows:

$$\mathcal{N}_i^{\mathrm{in}} := \{ j \mid \nabla_i f \text{ depends on } x_j \},$$

and the set of $j$'s that agent $i$ must send $x_i$ to as $\mathcal{N}_i^{\mathrm{out}} := \{ j \mid i \in \mathcal{N}_j^{\mathrm{in}} \}$.

Algorithm 1 summarizes the proposed scheme for this problem. At every iteration each agent $i$ performs the updates described in (9) using the last information it has received from agents $j \in \mathcal{N}_i^{\mathrm{in}}$. It then transmits the updated $x_i^{k+1}$ to the agents that require it (possibly with delay). Note that $x^k_{[i]}$ was defined as the outdated version of the full vector $x^k$ for simplicity of notation, and in practical implementation it would only involve the coordinates that are required for the computation of $\nabla_i f$.

Algorithm 1 Vũ-Condat algorithm with bounded delays

Initialize: $x_i^0 \in \mathbb{R}^{n_i}$, $u_i^0 \in \mathbb{R}^{r_i}$ for each $i \in \{1,\ldots,m\}$.

for $k = 0, 1, \ldots$ do

  for each agent $i = 1,\ldots,m$ do

    – perform the local updates using the last received information, i.e., using the locally stored vector $x^k_{[i]}$ as defined in (3):

    $$x_i^{k+1} = \mathrm{prox}_{\gamma_i g_i}\big( x_i^k - \gamma_i L_{ii}^\top u_i^k - \gamma_i \nabla_i f(x^k_{[i]}) \big)$$
    $$u_i^{k+1} = \mathrm{prox}_{\sigma_i h_i^*}\big( u_i^k + \sigma_i L_{ii}(2x_i^{k+1} - x_i^k) \big)$$

    – send $x_i^{k+1}$ to all $j \in \mathcal{N}_i^{\mathrm{out}}$ (possibly with different delays)

As shown in Theorem 1, for small enough stepsizes the generated sequence converges to a primal-dual solution under the bounded delay assumption, provided that the functions $g_i$ are strongly convex. These requirements are summarized below:

Assumption 3. For $i = 1,\ldots,m$:

(i) (Strong convexity) $g_i$ is $\mu_g^i$-convex for some $\mu_g^i > 0$.

(ii) (Convergence condition) The stepsizes $\sigma_i, \gamma_i > 0$ satisfy the following assumption:

$$\gamma_i < \frac{1}{\sigma_i \|L_{ii}\|^2 + \beta + \frac{B^2}{2}\|\bar{\beta}\|^2_{M_g^{-1}}}, \tag{10}$$

where

$$M_g = \mathrm{blkdiag}\big(\mu_g^1 I_{n_1}, \ldots, \mu_g^m I_{n_m}\big). \tag{11}$$

Notice that according to Assumption 3(ii) we require a one-time global communication of $\|\bar{\beta}\|_{M_g^{-1}}$ and $\beta$ when initiating the algorithm.
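As a worked instance of the bound (10) (our arithmetic; all constants below are hypothetical), each agent can compute an admissible $\gamma_i$ once $\beta$ and $\|\bar{\beta}\|_{M_g^{-1}}$ have been shared:

```python
import numpy as np

# Hypothetical constants illustrating the stepsize bound (10).
B = 2                                     # delay bound of Assumption 1
beta = 1.0                                # Lipschitz constant of grad f
mu_g = np.array([0.5, 0.8, 1.0])          # strong convexity moduli mu_g^i
bar_beta = np.array([0.3, 0.2, 0.4])      # coupling constants bar{beta}_i
norm_Lii = np.array([1.0, 2.0, 0.5])      # ||L_ii|| for each agent
sigma = np.array([0.1, 0.1, 0.1])         # chosen dual stepsizes sigma_i

w2 = np.sum(bar_beta ** 2 / mu_g)         # ||bar{beta}||^2_{M_g^{-1}}
gamma_bound = 1.0 / (sigma * norm_Lii ** 2 + beta + 0.5 * B ** 2 * w2)
gamma = 0.99 * gamma_bound                # any gamma_i strictly below (10)
print(gamma)
```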

Before proceeding with the convergence results, let us define the following:

$$P := \begin{pmatrix} \Gamma^{-1} & -L^\top \\ -L & \Sigma^{-1} \end{pmatrix}. \tag{12}$$

Noting that $\Sigma, \Gamma$ are positive definite, and using the Schur complement, we have that $P$ is positive definite if and only if $\Gamma^{-1} - L^\top \Sigma L$ is positive definite, a condition that holds if (10) is satisfied (since $L$ has a block-diagonal structure).

Our analysis in Theorem 1 relies on showing that the generated sequence is quasi-Fejér monotone relative to the set of primal-dual solutions in the space equipped with the inner product $\langle\cdot,\cdot\rangle_P$. Notice that without communication delays ($B \equiv 0$), this analysis leads to the usual Fejér monotonicity of the sequence. The use of outdated information introduces additional error terms that are shown to be tolerated by the algorithm if the stepsizes are small enough and the functions $g_i$ are strongly convex.

The proof of Theorem 1 can be found in [19].

Theorem 1. Consider Algorithm 1 and let Assumptions 1 to 3 hold. Then the sequence $(z^k)_{k\in\mathbb{N}}$, where $z^k := (x^k, u^k)$, is quasi-Fejér monotone relative to $\mathcal{S}$ in the space equipped with the inner product $\langle\cdot,\cdot\rangle_P$. Furthermore, $(z^k)_{k\in\mathbb{N}}$ converges to some $z^\star \in \mathcal{S}$.

IV. THE CASE OF GENERAL LINEAR MAPPING

In this section we consider the general optimization problem (6) where additional coupling is present through the linear maps, i.e., $L$ is not block-diagonal. We consider a modified primal-dual algorithm that resembles (9), with the difference that in the dual update the linear map $L_i$ operates on $x^k_{[i]}$ in place of $2x^{k+1}_{[i]} - x^k_{[i]}$. This modification results in the possibility of using larger stepsizes, since the terms $2x^{k+1}_{[i]} - x^k_{[i]}$ would introduce additional sources of error.

Let us define the following two sets:

$$\mathcal{M}_i^{p} := \{ j \mid L_{ji} \neq 0 \}, \qquad \mathcal{M}_i^{d} := \{ j \mid L_{ij} \neq 0 \},$$

where $0$ denotes a zero matrix of appropriate dimensions. In Algorithm 2, due to the additional coupling through the linear maps, the primal vector of agent $i$ must be transmitted to all $j \in \mathcal{M}_i^{p} \cup \mathcal{N}_i^{\mathrm{out}}$, while the dual vector is to be transmitted to all $j \in \mathcal{M}_i^{d}$. Notice that the outdated primal and dual vectors $x^k_{[i]}$ and $u^k_{[i]}$ need not have the same delay pattern and are arbitrary as long as Assumption 1 is satisfied, i.e., agent $i$ may use the primal vector $x_j^{k_1}$ and the dual vector $u_j^{k_2}$ transmitted by $j$ at times $k_1$ and $k_2$.

Algorithm 2 A primal-dual algorithm with bounded delays

Initialize: $x_i^0 \in \mathbb{R}^{n_i}$, $u_i^0 \in \mathbb{R}^{r_i}$ for each $i \in \{1,\ldots,m\}$.

for $k = 0, 1, \ldots$ do

  for each agent $i = 1,\ldots,m$ do

    – perform the local updates using the last received information, i.e., using the locally stored vectors $x^k_{[i]}$ and $u^k_{[i]}$ as defined in (3):

    $$x_i^{k+1} = \mathrm{prox}_{\gamma_i g_i}\big( x_i^k - \gamma_i (L^i)^\top u^k_{[i]} - \gamma_i \nabla_i f(x^k_{[i]}) \big)$$
    $$u_i^{k+1} = \mathrm{prox}_{\sigma_i h_i^*}\big( u_i^k + \sigma_i L_i x^k_{[i]} \big)$$

    – send $x_i^{k+1}$ to all $j \in \mathcal{N}_i^{\mathrm{out}} \cup \mathcal{M}_i^{p}$, and $u_i^{k+1}$ to all $j \in \mathcal{M}_i^{d}$ (possibly with different delays)
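Compared with (9), the only change in the local step is the dual half, which applies $L_i$ to the outdated primal vector instead of a reflected fresh one; a per-agent sketch (our illustration; the prox and gradient callables and the blocks of $L$ are placeholders) is:

```python
# Local update of agent i in Algorithm 2 (sketch). x_loc and u_loc are the
# outdated copies x^k_{[i]}, u^k_{[i]} read from the agent's receive
# buffers; Lcol_i is the i-th block column L^i and Lrow_i the i-th block
# row L_i (numpy arrays are assumed).
def agent2_step(x_i, u_i, x_loc, u_loc, Lcol_i, Lrow_i,
                grad_f_i, prox_g_i, prox_hconj_i, gamma_i, sigma_i):
    x_i_new = prox_g_i(x_i - gamma_i * (Lcol_i.T @ u_loc)
                           - gamma_i * grad_f_i(x_loc))
    # dual step uses L_i x^k_{[i]}, not L_i (2 x^{k+1} - x^k) as in (9b)
    u_i_new = prox_hconj_i(u_i + sigma_i * (Lrow_i @ x_loc))
    return x_i_new, u_i_new
```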

In Theorem 2, convergence is established for Algorithm 2 when the stepsizes are small enough, under the assumption that the functions $g_i$ are strongly convex and $h_i^*$ are continuously differentiable with Lipschitz continuous gradient. We summarize these requirements below:

Assumption 4. For all $i = 1,\ldots,m$:

(i) (Strong convexity) $g_i$ is $\mu_g^i$-convex for some $\mu_g^i > 0$.

(ii) (Lipschitz continuity) $h_i^*$ is continuously differentiable, and $\nabla h_i^*$ is $\tfrac{1}{\mu_h^i}$-Lipschitz continuous for some $\mu_h^i > 0$. Equivalently, $h_i$ is $\mu_h^i$-convex.

(iii) (Convergence condition) The stepsizes $\sigma_i, \gamma_i > 0$ satisfy the following inequalities:

$$\sigma_i < \frac{1}{C_s (B+1)^2}, \qquad \gamma_i < \frac{1}{\beta + \tfrac{1}{2} R_s (B+1)^2 + B^2 \|\bar{\beta}\|^2_{M_g^{-1}}},$$

where

$$R_s := \sum_{i=1}^{m} \frac{1}{\mu_h^i} \|L_i\|^2, \qquad C_s := \sum_{i=1}^{m} \frac{1}{\mu_g^i} \|(L^i)^\top\|^2. \tag{13}$$

Notice that by Assumption 4(iii) we require a one-time global communication of $R_s$, $C_s$, $\beta$ and $\|\bar{\beta}\|_{M_g^{-1}}$.

Let us define the following positive definite matrix that is used in the convergence analysis:

$$D := \mathrm{blkdiag}(\Gamma^{-1}, \Sigma^{-1}). \tag{14}$$

We proceed with the convergence results for Algorithm 2. The proofs of Theorems 2 and 3 can be found in [19].

Theorem 2. Consider Algorithm 2 and let Assumptions 1, 2 and 4 hold. Then the sequence $(z^k)_{k\in\mathbb{N}}$ is quasi-Fejér monotone relative to $\mathcal{S}$ in the space equipped with $\langle\cdot,\cdot\rangle_D$. Furthermore, $(z^k)_{k\in\mathbb{N}}$ converges to some $z^\star \in \mathcal{S}$.

The next theorem provides a sufficient condition on the stepsizes under which linear convergence is attained.

Theorem 3 (Linear convergence). Consider Algorithm 2 and let Assumptions 1, 2, 4(i) and 4(ii) hold. Let $c$ be a positive scalar and set $\gamma_i = \tfrac{c}{\mu_g^i}$, $\sigma_i = \tfrac{c}{\mu_h^i}$ for $i = 1,\ldots,m$. Let $\mu_g^{\min} = \min\{\mu_g^1,\ldots,\mu_g^m\}$, $\mu_h^{\min} = \min\{\mu_h^1,\ldots,\mu_h^m\}$. Suppose that the following holds:

$$c \leq (1 + c_2)^{\frac{1}{B+1}} - 1,$$

where

$$c_2 = \min\left\{ \frac{\mu_g^{\min}}{2B\|\bar{\beta}\|^2_{M_g^{-1}} + R_s(B+1) + \beta},\ \frac{\mu_h^{\min}}{2 C_s (B+1)} \right\}.$$

Then the following linear convergence rate holds:

$$\|z^k - z^\star\|^2 \leq \left(\tfrac{1}{1+c}\right)^{k} \|z^0 - z^\star\|^2.$$

V. CONCLUSION & FUTURE WORKS

In this paper we considered the application of primal-dual algorithms for solving structured optimization problems in a message-passing network model. It is shown that the communication delay is tolerated by the considered algorithms provided that the stepsizes are small enough, and that some strong convexity assumption holds. Future work consists of extending the convergence analysis to the partially asynchronous framework. Another research direction is to devise randomized schemes where, in addition to the use of outdated information, the agents would wake up at random, independently from one another.

REFERENCES

[1] D. P. Bertsekas and J. N. Tsitsiklis, Parallel and distributed computation: numerical methods. Prentice-Hall, 1989, vol. 23.

[2] J. Liu and S. J. Wright, "Asynchronous stochastic coordinate descent: Parallelism and convergence properties," SIAM Journal on Optimization, vol. 25, no. 1, pp. 351–376, 2015.

[3] Z. Peng, Y. Xu, M. Yan, and W. Yin, "ARock: An algorithmic framework for asynchronous parallel coordinate updates," SIAM Journal on Scientific Computing, vol. 38, no. 5, pp. A2851–A2879, 2016.

[4] F. Iutzeler, P. Bianchi, P. Ciblat, and W. Hachem, "Asynchronous distributed optimization using a randomized alternating direction method of multipliers," in 52nd IEEE Conference on Decision and Control, 2013, pp. 3671–3676.

[5] P. Bianchi, W. Hachem, and F. Iutzeler, "A coordinate descent primal-dual algorithm and application to distributed asynchronous optimization," IEEE Transactions on Automatic Control, vol. 61, no. 10, pp. 2947–2957, Oct 2016.

[6] P. Latafat, N. M. Freris, and P. Patrinos, "A new randomized block-coordinate primal-dual proximal algorithm for distributed optimization," arXiv preprint arXiv:1706.02882, 2017.

[7] J.-C. Pesquet and A. Repetti, "A class of randomized primal-dual algorithms for distributed optimization," Journal of Nonlinear and Convex Analysis, vol. 16, no. 12, pp. 2453–2490, 2015.

[8] P. Tseng, "On the rate of convergence of a partially asynchronous gradient projection algorithm," SIAM Journal on Optimization, vol. 1, no. 4, pp. 603–619, 1991.

[9] Y. Zhou, Y. Liang, Y. Yu, W. Dai, and E. P. Xing, "Distributed proximal gradient algorithm for partially asynchronous computer clusters," Journal of Machine Learning Research, vol. 19, no. 19, pp. 1–32, 2018.

[10] P. L. Combettes and J.-C. Pesquet, "Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators," Set-Valued and Variational Analysis, vol. 20, no. 2, pp. 307–330, 2012.

[11] L. M. Briceño-Arias and P. L. Combettes, "A monotone + skew splitting model for composite monotone inclusions in duality," SIAM Journal on Optimization, vol. 21, no. 4, pp. 1230–1250, 2011.

[12] Y. Drori, S. Sabach, and M. Teboulle, "A simple algorithm for a class of nonsmooth convex-concave saddle-point problems," Operations Research Letters, vol. 43, no. 2, pp. 209–214, 2015.

[13] P. Latafat and P. Patrinos, "Asymmetric forward–backward–adjoint splitting for solving monotone inclusions involving three operators," Computational Optimization and Applications, pp. 1–37, 2017.

[14] L. Condat, "A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms," Journal of Optimization Theory and Applications, vol. 158, no. 2, pp. 460–479, 2013.

[15] B. C. Vũ, "A splitting algorithm for dual monotone inclusions involving cocoercive operators," Advances in Computational Mathematics, vol. 38, no. 3, pp. 667–681, 2013.

[16] R. L. Raffard, C. J. Tomlin, and S. P. Boyd, "Distributed optimization for cooperative agents: application to formation flight," in 43rd IEEE Conference on Decision and Control, vol. 3, 2004, pp. 2453–2459.

[17] P. L. Combettes, "Quasi-Fejérian analysis of some optimization algorithms," Studies in Computational Mathematics, vol. 8, pp. 115–152, 2001.

[18] R. Rockafellar, Convex analysis. Princeton University Press, 1997.

[19] P. Latafat and P. Patrinos, "Multi-agent structured optimization over message-passing architectures with bounded communication delays," arXiv preprint arXiv:1809.07199, 2018.
