CANONICAL POLYADIC DECOMPOSITION OF INCOMPLETE TENSORS WITH LINEARLY CONSTRAINED FACTORS∗

NICO VERVLIET†, OTTO DEBALS†, AND LIEVEN DE LATHAUWER†

∗Submitted to the editors on April 19, 2017.
Funding: Nico Vervliet is supported by an Aspirant Grant from the Research Foundation — Flanders (FWO). Otto Debals is supported by a Ph.D. grant of the Agency for Innovation by Science and Technology (IWT). Research furthermore supported by: (1) Flemish Government: FWO projects G.0830.14N and G.0881.14N; (2) EU: the research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Advanced Grant BIOTENSORS (no. 339804); this paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information; (3) KU Leuven Internal Funds C16/15/059.
†KU Leuven, Dept. of Electrical Engineering ESAT/STADIUS, Kasteelpark Arenberg 10, bus 2446, B-3001 Leuven, Belgium; and Group Science, Engineering and Technology, KU Leuven - Kulak, E. Sabbelaan 53, 8500 Kortrijk, Belgium (Nico.Vervliet@kuleuven.be, Otto.Debals@kuleuven.be, Lieven.DeLathauwer@kuleuven.be).
Abstract. Incomplete datasets are prevalent in domains like machine learning, signal processing and scientific computing. Simple models often allow missing data to be predicted or can be used as representations for full datasets to reduce the computational cost. When data are explicitly or implicitly represented as tensors, which are multiway arrays of numerical values, a low-rank assumption is often appropriate. A nonlinear least squares algorithm to decompose an incomplete tensor into a sum of rank-1 terms is presented here. Furthermore, the algorithm allows the incorporation of prior knowledge in the form of linearly constrained factor matrices. By combining the exact Gramian and a novel statistical preconditioner, the algorithm allows faster and more accurate decompositions compared to state-of-the-art methods when few entries are known or when missing entries are structured. By carefully exploiting the linear constraints, the accuracy is improved further and fewer known entries are required. A specialized variant of the algorithm removes the data dependence from the optimization step by utilizing a compressed representation. When relatively many entries are known, computation time can be reduced significantly using this variant, as illustrated in a novel materials science application.
Key words. canonical polyadic decomposition, candecomp, parafac, CPD, CANDELINC, missing entries, incomplete tensors, big data, compressed sensing, multivariate regression

AMS subject classifications. 15A69
1. Introduction. Tensors, or multiway arrays of numerical values, are a higher-order generalization of vectors and matrices and can be found in a variety of applications in, e.g., signal separation, object recognition, pattern detection, numerical simulations, function approximation and uncertainty quantification. Many more examples can be found in numerous overview papers [11, 15, 16, 22, 32]. As the number of entries in a tensor scales exponentially in its order, tensor problems quickly become large-scale. The computational challenges related to this exponential increase are known as the curse of dimensionality. To extract information or to represent tensors efficiently, low-rank assumptions are often appropriate. The (canonical) polyadic decomposition (CPD), which is also known as parafac, candecomp, tensor rank decomposition or separated representation, writes an $N$th-order tensor as a sum of a (minimal) number $R$ of rank-1 terms, each of which is the outer product of $N$ nonzero vectors, with $R$ the tensor rank. Mathematically, we have

$$\mathcal{T} = \sum_{r=1}^{R} \mathbf{a}_r^{(1)} \otimes \cdots \otimes \mathbf{a}_r^{(N)} = \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket, \tag{1}$$

where $\otimes$ denotes the outer product and $\mathbf{a}_r^{(n)}$ is the $r$th column of the factor matrix $\mathbf{A}^{(n)}$, for $r = 1, \ldots, R$ and $n = 1, \ldots, N$. Many applications benefit from the mild uniqueness properties of a CPD, allowing components to be separated using few restrictions [11, 22, 32]. In contrast to the number of entries, the number of parameters in a CPD scales linearly in the order, allowing a compact representation of higher-order data. To handle large-scale problems, various approaches have been proposed, including techniques based on compression [34], randomization [29, 42] and sparsity combined with parallel architectures [18, 35].
Measuring, computing and/or storing all entries is often expensive in terms of time, memory or cost. For example, in high-dimensional integration, the number of points easily exceeds the number of atoms in the universe [28]; in uncertainty quantification, each entry requires the solution of a PDE [23]; and in multivariate regression in many variables it may be impossible to store all possible values of a discretized function [7]. Reducing the number of entries by deliberately taking few samples is therefore important or even necessary to keep the problem manageable [43]. Also, some entries can be missing or unknown due to, e.g., broken sensors, artifacts which have been removed, or physically impossible situations. Whether few entries are sampled or some entries are missing, specialized algorithms, e.g., [2, 20, 39, 40, 43], are key to decompose and analyze the resulting incomplete tensors. It is often important that the creation of the full tensor is avoided, especially for large-scale tensors.
Prior knowledge is often available, allowing the construction of more accurate and/or interpretable models. Moreover, additional constraints are important to further lower the number of required samples or to improve the accuracy. In this paper, we focus on prior knowledge that can be represented as linear constraints on the factor matrices of a CPD, i.e., $\mathbf{A}^{(n)} = \mathbf{B}^{(n)}\mathbf{C}^{(n)}$, in which $\mathbf{B}^{(n)}$ is a known matrix, while $\mathbf{C}^{(n)}$ contains the unknown coefficients. This model is also known as candelinc [13] and is ubiquitous, although it is not always recognized as a constrained CPD. For example, in multivariate regression with polynomials or sparse grids, $\mathbf{B}^{(n)}$ can be a basis of polynomials or multilevel tent functions [7, 14]. Some applications benefit from other types of bases such as Gaussian kernels or trigonometric functions [45]. Another example can be found in compressed sensing (CS), where a compressed measurement $\mathbf{b}$ is represented in a dictionary $\mathbf{D}$ by a sparse coefficient vector $\mathbf{x}$ [10, 12]. If $\mathbf{D}$ is expressed as a Kronecker product and $\mathbf{x}$ is a (possibly sparse) vectorized CPD [9, 33], the CS problem can be written as a linearly constrained CPD.
Computing a linearly constrained CPD of a full tensor is straightforward, as the tensor can be compressed using the known matrices $\mathbf{B}^{(n)}$, after which the matrices $\mathbf{C}^{(n)}$ can be found from the unconstrained CPD of the compressed tensor. Because of its efficiency, compression using suitable basis matrices [4] or the Tucker decomposition/MLSVD [21, 8] is a commonly used preprocessing step. For incomplete tensors, only algorithms for unconstrained decompositions are currently available. These algorithms can be divided into imputation or expectation-maximization algorithms, which replace missing entries with some value, and direct minimization techniques, which take only known entries into account. Examples of the former can be found in [17, 30, 40], while ccd++ [20], cp_wopt [2], indafac [40] and cpd [39] are examples of the latter. The structured data fusion (SDF) framework from [39] can be used to model linear constraints, but it does not fully exploit the inherent multilinear structure. Dedicated algorithms have been presented for specific problems; e.g., in [7] an alternating least squares (ALS) algorithm is presented for multivariate regression problems. Here, we focus on direct minimization techniques, as the memory requirements often prohibit the construction of the imputed tensor.
The efficient computation of a linearly constrained CPD of an incomplete tensor is our main focus. Two variants are presented in section 2: a data-dependent algorithm cpdli_dd, which has a per-iteration complexity linear in the number of known entries, and a data-independent algorithm cpdli_di, which has a per-iteration complexity depending only on the number of variables in the decomposition and not on the tensor entries. The presented algorithms are of the (inexact) Gauss–Newton type (see subsection 1.2); quasi-Newton or ALS variants can be derived easily. After fixing the notation and reviewing the optimization framework in the remainder of this section, the algorithms and their complexity are derived in section 2. The unconstrained CPD of an incomplete tensor is discussed as a special case in section 3. A new statistical preconditioner is discussed in section 4. Section 5 and section 6 illustrate the performance of the presented algorithms on synthetic data and a materials science example, respectively. To keep the main text and idea clear, the derivations of the algorithms and preconditioner are discussed in the supplementary material accompanying this paper.
1.1. Notation. Scalars, vectors, matrices and tensors are denoted by lower case (e.g., $a$), bold lower case (e.g., $\mathbf{a}$), bold upper case (e.g., $\mathbf{A}$) and calligraphic (e.g., $\mathcal{T}$) letters, respectively. $\mathbb{K}$ denotes either $\mathbb{R}$ or $\mathbb{C}$. Sets are indexed by superscripts within parentheses, e.g., $\mathbf{A}^{(n)}$, $n = 1, \ldots, N$. An incomplete tensor $\mathcal{T}$ has $N_{\mathrm{ke}}$ known entries at positions $(i_k^{(1)}, i_k^{(2)}, \ldots, i_k^{(N)})$, $k = 1, \ldots, N_{\mathrm{ke}}$. We collect all indices of known entries in index vectors $\mathbf{i}^{(1)}, \ldots, \mathbf{i}^{(N)}$. The extended form of a matrix $\mathbf{A}^{(n)} \in \mathbb{K}^{I_n \times R}$ is denoted by $\mathbf{A}^{[n]} = \mathbf{A}^{(n)}(\mathbf{i}^{(n)}, :)$ using Matlab-style indexing. $\mathbf{A}^{[n]} \in \mathbb{K}^{N_{\mathrm{ke}} \times R}$ contains repeated rows, as $1 \le i_k^{(n)} \le I_n$ and $N_{\mathrm{ke}} \ge I_n$, and is used to facilitate the derivation of the algorithms. A mode-$n$ vector is the generalization of a column (mode-1) and row (mode-2) vector and is defined by fixing all but the $n$th index of an $N$th-order tensor. The mode-$n$ unfolding of a tensor $\mathcal{T}$ is denoted by $\mathbf{T}_{(n)}$ and has the mode-$n$ vectors as its columns. The vectorization operator $\mathrm{vec}(\mathcal{T})$ stacks all mode-1 vectors into a column vector. Details on the order of the mode-$n$ vectors in unfoldings and vectorizations can be found in [11]. A number of products are needed. The mode-$n$ tensor-matrix product is denoted by $\mathcal{T} \cdot_n \mathbf{A}$. The Kronecker and Hadamard, or element-wise, products are denoted by $\otimes$ and $*$, respectively. The column-wise (row-wise) Khatri–Rao product between two matrices, denoted by $\odot$ ($\odot_{\mathrm{T}}$), is defined as the column-wise (row-wise) Kronecker product. To simplify expressions, the following shorthand notations are used:

$$\bigotimes_{n} \equiv \bigotimes_{n=N}^{1}, \qquad \bigotimes_{k \ne n} \equiv \bigotimes_{\substack{k=N \\ k \ne n}}^{1}, \tag{2}$$
where $N$ is the order of the tensor and indices are traversed in reverse order. Similar definitions are used for $\odot$, $\odot_{\mathrm{T}}$ and $*$. To reduce the number of parentheses needed, we assume the matrix product takes precedence over the Kronecker, (row-wise) Khatri–Rao and Hadamard products, e.g.,

$$\mathbf{A}\mathbf{B} \odot \mathbf{C}\mathbf{D} \equiv (\mathbf{A}\mathbf{B}) \odot (\mathbf{C}\mathbf{D}). \tag{3}$$
Similarly, shorthand notations are expanded first, e.g., for $N = 3$, we have

$$\bigodot_{k \ne 2} \mathbf{A}^{(k)} \odot \mathbf{A}^{(2)} \equiv \left(\mathbf{A}^{(3)} \odot \mathbf{A}^{(1)}\right) \odot \mathbf{A}^{(2)}, \tag{4}$$
$$\bigodot_{k \ne 2} \mathbf{B}^{(k)}\mathbf{C}^{(k)} \otimes \mathbf{A}^{(2)} \equiv \left(\left(\mathbf{B}^{(3)}\mathbf{C}^{(3)}\right) \odot \left(\mathbf{B}^{(1)}\mathbf{C}^{(1)}\right)\right) \otimes \mathbf{A}^{(2)}. \tag{5}$$
The complex conjugate, transpose, conjugated transpose and pseudo-inverse are denoted by $\bar{\cdot}$, $\cdot^{\mathrm{T}}$, $\cdot^{\mathrm{H}}$ and $\cdot^{\dagger}$, respectively. The column-wise concatenation of vectors $\mathbf{a}$ and $\mathbf{b}$ is written as $\mathbf{x} = [\mathbf{a}; \mathbf{b}]$ and is a shorthand for $\mathbf{x} = [\mathbf{a}^{\mathrm{T}}\ \mathbf{b}^{\mathrm{T}}]^{\mathrm{T}}$. The inner product between $\mathcal{A}$ and $\mathcal{B}$ is denoted by $\langle \mathcal{A}, \mathcal{B} \rangle = \mathrm{vec}(\mathcal{B})^{\mathrm{H}} \mathrm{vec}(\mathcal{A})$. The weighted inner product is defined as $\langle \mathcal{A}, \mathcal{B} \rangle_{\mathcal{S}} = \mathrm{vec}(\mathcal{B})^{\mathrm{H}} \mathrm{diag}(\mathrm{vec}(\mathcal{S}))\, \mathrm{vec}(\mathcal{A})$, where the weight tensor $\mathcal{S}$ has the same dimensions as $\mathcal{A}$ and $\mathcal{B}$. The Frobenius norm is denoted by $\|\cdot\|$. The real part of a complex number is written as $\mathrm{Re}(\cdot)$ and the expectation operator as $\mathrm{E}[\cdot]$. The permutation matrix $\mathbf{P}^{(n)}$ maps entry $(i_1, \ldots, i_{n-1}, i_n, i_{n+1}, \ldots, i_N)$ of a vectorized tensor to entry $(i_n, i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N)$. Details on the used notation can be found in [11].
1.2. Optimization framework. We discuss the expressions required to implement quasi-Newton (qN) and nonlinear least squares (NLS) algorithms to solve the optimization problem

$$\min_{\mathbf{z}} f = \frac{1}{2} \left\| F(\mathbf{z}) \right\|^2. \tag{6}$$

All presented algorithms allow complex tensors and/or factors. We use the complex optimization framework from [36, 37], which implements a large number of qN and NLS algorithms, including implementations of various line search, plane search and trust region strategies. This framework requires functions computing the objective function, the gradient and, in the case of NLS algorithms, the Gramian or Gramian-vector product. In the remainder of this subsection, we elaborate on a particular NLS algorithm called Gauss–Newton. A more detailed explanation of other qN and NLS algorithms and complex optimization can be found in [3, 27, 36, 38].

Unconstrained NLS algorithms attempt to find optimal values $\mathbf{z}^* \in \mathbb{K}^M$ using objective function (6). In iteration $k + 1$, the residual $F(\mathbf{z})$ is linearized using the first-order Taylor expansion at the current iterate $\mathbf{z}_k$:

$$F(\mathbf{z}) \approx F(\mathbf{z}_k) + \left.\frac{\partial F}{\partial \mathbf{z}}\right|_{\mathbf{z}=\mathbf{z}_k} (\mathbf{z} - \mathbf{z}_k) \tag{7}$$
$$= F(\mathbf{z}_k) + \mathbf{J}\mathbf{p}, \tag{8}$$
in which $\mathbf{J}$ is the Jacobian of $F(\mathbf{z})$ and $\mathbf{p}$ the step direction. Substituting the linearization in objective function (6) leads to the quadratic model

$$\min_{\mathbf{p}} \; \frac{1}{2} F(\mathbf{z}_k)^{\mathrm{H}} F(\mathbf{z}_k) + F(\mathbf{z}_k)^{\mathrm{H}} \mathbf{J}\mathbf{p} + \frac{1}{2} \mathbf{p}^{\mathrm{H}} \mathbf{J}^{\mathrm{H}} \mathbf{J} \mathbf{p}. \tag{9}$$

The first-order optimality criterion leads to the minimizer $\mathbf{p}^*$ by solving

$$\mathbf{J}^{\mathrm{H}}\mathbf{J}\mathbf{p}^* = -\mathbf{J}^{\mathrm{H}} F(\mathbf{z}_k) \tag{10}$$
$$\mathbf{H}\mathbf{p}^* = -\bar{\mathbf{g}}. \tag{11}$$
The matrix $\mathbf{H}$ is the Gramian (of the Jacobian) and approximates the Hessian of $f$. The right-hand side is the negated and conjugated gradient $\bar{\mathbf{g}} = \left(\frac{\mathrm{d}f}{\mathrm{d}\mathbf{z}}\right)^{\mathrm{H}}$. Note that (11) only requires the derivative w.r.t. $\mathbf{z}$, even for complex variables $\mathbf{z}$ while $f$ is non-analytic. As shown in [36], the gradient actually consists of the partial derivatives w.r.t. $\mathbf{z}$ and $\bar{\mathbf{z}}$, keeping $\bar{\mathbf{z}}$ and $\mathbf{z}$ constant, respectively. Both partial derivatives of $f$ are each other's complex conjugate, and the Jacobian $\partial F/\partial \bar{\mathbf{z}} = \mathbf{0}$, as $F$ is analytic in $\mathbf{z}$.
When the number of variables $M$ is large, the computational cost of computing the (pseudo)inverse of $\mathbf{H} \in \mathbb{C}^{M \times M}$ can be a bottleneck. To reduce the complexity, (preconditioned) conjugate gradients (CG) can be used to solve (11) [27]. CG iteratively refines an initial guess $\mathbf{p}_0$ using only Gramian-vector products $\mathbf{H}\mathbf{p}_l$, $l = 0, \ldots, M - 1$. In the case of CPD computations, [38] showed that preconditioned CG can effectively reduce the computational cost for large-scale systems.
Finally, the variables $\mathbf{z}$ are updated as

$$\mathbf{z}_{k+1} = \mathbf{z}_k + \alpha_{k+1} \mathbf{p}, \tag{12}$$

in which $\alpha_{k+1}$ is computed using a line search algorithm. Other variants use plane search or trust-region methods to compute $\mathbf{z}_{k+1}$. For more information on different updating methods, we refer the interested reader to, e.g., [27]. In this paper, we use a dogleg trust region (DLTR) approach [27, 38].
It is clear that an NLS-based algorithm requires a way of computing the objective function, the gradient and the Gramian or Gramian-vector products. In the latter case, an effective preconditioner reduces the computational cost of the CG algorithm. In section 2, we derive these expressions for the linearly constrained CPD of incomplete tensors. An effective preconditioner is derived in section 4.
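As a concrete illustration of how these ingredients interact, the following minimal sketch implements one inexact Gauss–Newton step in Python/SciPy: system (11) is solved with preconditioned CG using Gramian-vector products only. The callables `grad`, `gramian_vec` and `prec_solve` are hypothetical placeholders for the expressions derived in section 2 and section 4; this is not the Tensorlab implementation of [36, 37].

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gauss_newton_step(z, grad, gramian_vec, prec_solve):
    """One inexact Gauss-Newton step: solve H p = -conj(g) with PCG."""
    g = grad(z)                                   # vectorized gradient df/dz
    n = g.size
    # Gramian H = J^H J is only available through matrix-vector products.
    H = LinearOperator((n, n), matvec=lambda p: gramian_vec(z, p), dtype=g.dtype)
    # The preconditioner is applied through its inverse, e.g., (45) or (46).
    Minv = LinearOperator((n, n), matvec=prec_solve, dtype=g.dtype)
    p, info = cg(H, -np.conj(g), M=Minv)
    return p                                      # step handed to the DLTR update
```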
2. CPD with linearly constrained factors for incomplete tensors. The proposed cpdli algorithms compute the rank-$R$ CPD of an $N$th-order incomplete tensor $\mathcal{T} \in \mathbb{K}^{I_1 \times \cdots \times I_N}$ with linear constraints on the factor matrices using an (inexact) Gauss–Newton algorithm with dogleg trust regions. Mathematically, this results in the following optimization problem:

$$\min_{\mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)}} \; f = \frac{1}{2} \left\| \mathcal{S} * \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket - \mathcal{T} \right\|^2 \tag{13}$$
$$\text{subject to} \quad \mathbf{A}^{(n)} = \mathbf{B}^{(n)}\mathbf{C}^{(n)}, \quad n = 1, \ldots, N. \tag{14}$$

An entry in the binary sampling tensor $\mathcal{S}$ is one if the corresponding entry in $\mathcal{T}$ is known and zero otherwise. The factor matrices $\mathbf{A}^{(n)} \in \mathbb{K}^{I_n \times R}$ are the product of known matrices $\mathbf{B}^{(n)} \in \mathbb{K}^{I_n \times D_n}$ and unknown coefficient matrices $\mathbf{C}^{(n)} \in \mathbb{K}^{D_n \times R}$, $n = 1, \ldots, N$.
Two variants of cpdli are derived: the data-dependent algorithm uses the data in every iteration, while the data-independent variant computes a projected representation of the data prior to the optimization and uses this representation in every iteration. The derivations of all expressions are presented in section SM1.
2.1. A data-dependent algorithm: cpdli_dd. The cpdli_dd algorithm computes the CPD with linearly constrained factor matrices of an incomplete tensor with a computational complexity linear in the number of known entries. This section summarizes the elements needed for the algorithm outlined in Algorithm 1.
Algorithm 1 cpdli_dd using Gauss–Newton with dogleg trust region
1: Input: $\mathcal{S}$, $\mathcal{T}$, $\{\mathbf{B}^{(n)}\}_{n=1}^{N}$ and $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
2: Output: $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
3: while not converged do
4:   Compute gradient $\mathbf{g}$ using (15) (or (16)).
5:   Compute step $\mathbf{p}$ using either
     • $\mathbf{p} = -\mathbf{H}^{\dagger}\bar{\mathbf{g}}$ with $\mathbf{H}$ from (18), or
     • PCG of $\mathbf{H}\mathbf{p} = -\bar{\mathbf{g}}$ with Gramian-vector products $\mathbf{H}\mathbf{p}$ computed using (23), (25) and (26) and preconditioner (45).
6:   Update $\mathbf{C}^{(n)}$, $n = 1, \ldots, N$, using DLTR from $\mathbf{p}$, $\mathbf{g}$ and function value (13).
7: end while
Objective function. Computing the function value is straightforward. After constructing the factor matrices $\mathbf{A}^{(n)} = \mathbf{B}^{(n)}\mathbf{C}^{(n)}$, $n = 1, \ldots, N$, the sum of the squared entries of the residual tensor $\mathcal{F} = \mathcal{S} * \llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \rrbracket - \mathcal{T}$ is computed. For sparsely sampled tensors with $N_{\mathrm{ke}} \ll I_1 I_2 \cdots I_N$, only the nonzero entries of $\mathcal{F}$ are constructed.
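A minimal sketch of this evaluation (Python/NumPy, real-valued data assumed; not the reference implementation) shows why the cost is linear in $N_{\mathrm{ke}}$: the CPD is evaluated only at the known positions.

```python
import numpy as np

def objective_incomplete(factors, idx, vals):
    """factors: list of A^(n) (I_n x R); idx: index vectors i^(n) of the known
    entries; vals: the corresponding known values of T."""
    n_ke, R = vals.size, factors[0].shape[1]
    prod = np.ones((n_ke, R))
    for A, i in zip(factors, idx):
        prod *= A[i, :]                  # Hadamard product of extended rows A^[n]
    residual = prod.sum(axis=1) - vals   # residual F at the known entries only
    return 0.5 * np.dot(residual, residual)
```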
Gradient. The gradient $\mathbf{g}$ of (13) w.r.t. the (vectorized) optimization variables $\mathbf{C}^{(n)}$ is partitioned as $\mathbf{g} = [\mathrm{vec}(\mathbf{G}^{(1)}); \ldots; \mathrm{vec}(\mathbf{G}^{(N)})]$, in which each subgradient $\mathbf{G}^{(n)}$ is given by

$$\mathbf{G}^{(n)} = \frac{\partial f}{\partial \mathbf{C}^{(n)}} = \mathbf{B}^{(n)\mathrm{T}} \bar{\mathbf{F}}_{(n)} \Bigl(\bigodot_{k \ne n} \mathbf{A}^{(k)}\Bigr) \tag{15}$$
$$= \left(\bar{\mathcal{F}} \cdot_1 \mathbf{B}^{(1)\mathrm{T}} \cdots \cdot_N \mathbf{B}^{(N)\mathrm{T}}\right)_{(n)} \Bigl(\bigodot_{k \ne n} \mathbf{C}^{(k)}\Bigr), \tag{16}$$

which follows from the chain rule and from multilinear identities. Which one of the expressions (15) and (16) should be used depends on the order, rank and number of known entries in the tensor and on $D_n$. Gradient expression (15) is computed using a sparse matricized tensor times Khatri–Rao product (mtkrprod), which can be computed in $\mathcal{O}\left(2N^2 N_{\mathrm{ke}} R\right)$ operations for the $N$ subgradients $\mathbf{G}^{(n)}$. Gradient expression (16) computes a projection onto the basis matrices $\mathbf{B}^{(n)}$, $n = 1, \ldots, N$, which is a series of $N$ sparse tensor-matrix products, followed by a dense mtkrprod. The former operation can also be seen as a matricized tensor times Kronecker product (mtkronprod). The overall complexity of (16) is therefore $\mathcal{O}\left(2N N_{\mathrm{ke}} \prod_n D_n\right)$ for the (sparse) mtkronprod and $\mathcal{O}\left(2N R \prod_n D_n\right)$ for the dense mtkrprod. Both mtkrprod and mtkronprod are standard operations in tensor computations. Efficient implementations for dense and sparse tensors are described in, e.g., [5, 19, 31, 35, 41].
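The following sketch (Python/NumPy, real-valued data assumed, names hypothetical) illustrates gradient expression (15): a sparse mtkrprod of the residual with the factor matrices, followed by the projection onto $\mathbf{B}^{(n)}$.

```python
import numpy as np

def gradient_dd(factors, bases, idx, resid):
    """factors: A^(n) = B^(n) C^(n); bases: B^(n); idx: index vectors i^(n);
    resid: residual values of F at the N_ke known entries."""
    N, R = len(factors), factors[0].shape[1]
    ext = [A[i, :] for A, i in zip(factors, idx)]      # extended factors A^[n]
    grads = []
    for n in range(N):
        W = resid[:, None] * np.ones((1, R))           # N_ke x R
        for k in range(N):
            if k != n:
                W = W * ext[k]                         # Hadamard accumulation
        M = np.zeros((factors[n].shape[0], R))
        np.add.at(M, idx[n], W)                        # sparse mtkrprod (scatter-add)
        grads.append(bases[n].T @ M)                   # projection: G^(n) = B^(n)T M
    return grads
```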
Gramian of the Jacobian. Finally, we compute the Gramian $\mathbf{H} = \mathbf{J}^{\mathrm{H}}\mathbf{J}$. Similar to the gradient, the Gramian is partitioned into a grid of $N \times N$ blocks $\mathbf{H}^{(m,n)}$, $m, n = 1, \ldots, N$, corresponding to the approximation of the second-order derivative w.r.t. $\mathbf{C}^{(m)}$ and $\mathbf{C}^{(n)}$. Each block is computed from the explicitly constructed Jacobians $\mathbf{J}^{(n)}$:

$$\mathbf{J}^{(n)} = \frac{\partial\, \mathrm{vec}(\mathcal{F})}{\partial\, \mathrm{vec}(\mathbf{C}^{(n)})} = \mathbf{V}^{[n]} \odot_{\mathrm{T}} \mathbf{B}^{[n]} \quad \text{in which} \quad \mathbf{V}^{[n]} = \mathop{*}_{k \ne n} \mathbf{A}^{[k]}. \tag{17}$$

The Jacobian is therefore a dense matrix of size $N_{\mathrm{ke}} \times D_n R$, and the Gramian blocks $\mathbf{H}^{(m,n)}$ of size $D_m R \times D_n R$ are given by

$$\mathbf{H}^{(m,n)} = \left(\mathbf{V}^{[m]} \odot_{\mathrm{T}} \mathbf{B}^{[m]}\right)^{\mathrm{H}} \left(\mathbf{V}^{[n]} \odot_{\mathrm{T}} \mathbf{B}^{[n]}\right). \tag{18}$$

The data-independent algorithm cpdli_di exploits the remaining structure in $\mathbf{V}^{[n]}$. To reduce the cost of computing the pseudoinverse of $\mathbf{H}$, CG is used when dealing with many variables. Each CG iteration requires the matrix-vector product $\mathbf{y} = \mathbf{H}\mathbf{x}$. A partitioning similar to the gradient and Gramian is used for the vectors $\mathbf{x}$ and $\mathbf{y}$:

$$\mathbf{x} = \left[\mathrm{vec}\left(\mathbf{X}^{(1)}\right); \ldots; \mathrm{vec}\left(\mathbf{X}^{(N)}\right)\right], \tag{19}$$
$$\mathbf{y} = \left[\mathrm{vec}\left(\mathbf{Y}^{(1)}\right); \ldots; \mathrm{vec}\left(\mathbf{Y}^{(N)}\right)\right]. \tag{20}$$
The part of the product $\mathbf{y} = \mathbf{H}\mathbf{x}$ corresponding to $\mathbf{Y}^{(m)}$ is then computed as

$$\mathrm{vec}\left(\mathbf{Y}^{(m)}\right) = \mathbf{J}^{(m)\mathrm{H}} \left[\mathbf{J}^{(1)} \cdots \mathbf{J}^{(N)}\right] \left[\mathrm{vec}\left(\mathbf{X}^{(1)}\right); \ldots; \mathrm{vec}\left(\mathbf{X}^{(N)}\right)\right] \tag{21}$$
$$= \mathbf{J}^{(m)\mathrm{H}} \sum_{n=1}^{N} \mathbf{J}^{(n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right). \tag{22}$$

Using multilinear identities, the definition of the Jacobian (17) and the auxiliary vectors $\mathbf{z}^{(n)}$, we then have

$$\mathbf{z}^{(n)} = \mathbf{J}^{(n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right) = \left(\mathbf{V}^{[n]} * \mathbf{B}^{[n]}\mathbf{X}^{(n)}\right) \mathbf{1}, \tag{23}$$
$$\mathrm{vec}\left(\mathbf{Y}^{(m)}\right) = \left(\mathbf{V}^{[m]} \odot_{\mathrm{T}} \mathbf{B}^{[m]}\right)^{\mathrm{H}} \sum_{n=1}^{N} \mathbf{z}^{(n)}. \tag{24}$$
The matrix $\mathbf{B}^{[n]}\mathbf{X}^{(n)}$ can contain many identical rows, as $\mathbf{B}^{[n]}$ is the extended form of $\mathbf{B}^{(n)}$. The computational cost can therefore be reduced by extending the product $\mathbf{B}^{(n)}\mathbf{X}^{(n)}$ instead of $\mathbf{B}^{(n)}$. The product $\left(\mathbf{V}^{[m]} \odot_{\mathrm{T}} \mathbf{B}^{[m]}\right)^{\mathrm{H}} \sum_{n=1}^{N} \mathbf{z}^{(n)}$ can be computed efficiently using accumulation. Let $\mathbf{z} = \sum_{n=1}^{N} \mathbf{z}^{(n)}$ and define the $i$th row of $\mathbf{Q}^{(m)} \in \mathbb{K}^{I_m \times R}$ by

$$\mathbf{Q}^{(m)}(i, :) = \sum_{k \in \pi^{(m)}} z_k \bar{\mathbf{V}}^{[m]}(k, :) \quad \text{with} \quad \pi^{(m)} = \left\{ k \mid i^{(m)}(k) = i \right\}. \tag{25}$$

The cost of the row-wise Khatri–Rao product in (24) is avoided by computing

$$\mathbf{Y}^{(m)} = \mathbf{B}^{(m)\mathrm{H}} \mathbf{Q}^{(m)}. \tag{26}$$
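A sketch of the resulting Gramian-vector product (Python/NumPy, real-valued data assumed, names hypothetical): $\mathbf{z}^{(n)}$ is formed entry-wise as in (23), the contributions are accumulated per row index as in (25), and the final projection (26) avoids the row-wise Khatri–Rao product.

```python
import numpy as np

def gramian_vec_dd(factors, bases, idx, X):
    """X: list of blocks X^(n) (D_n x R), partitioned as in (19)."""
    N, R = len(factors), factors[0].shape[1]
    ext = [A[i, :] for A, i in zip(factors, idx)]               # A^[n], N_ke x R
    V = [np.prod([ext[k] for k in range(N) if k != n], axis=0)  # V^[n], cf. (17)
         for n in range(N)]
    # z = sum_n z^(n) with z^(n) from (23); B^(n) X^(n) is extended, not B^(n)
    z = sum((((bases[n] @ X[n])[idx[n], :]) * V[n]).sum(axis=1) for n in range(N))
    Y = []
    for m in range(N):
        Q = np.zeros((bases[m].shape[0], R))                    # Q^(m), eq. (25)
        np.add.at(Q, idx[m], z[:, None] * V[m])
        Y.append(bases[m].T @ Q)                                # Y^(m), eq. (26)
    return Y
```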
2.2. Removing data dependence: cpdli_di. In each iteration, the cpdli_dd algorithm has a complexity dependent on the number of known entries $N_{\mathrm{ke}}$. In this section, we show how the structure resulting from the linear constraints allows the per-iteration complexity to be independent of the number of known entries by computing a data-dependent matrix $\mathbf{D}$ prior to the optimization process. This matrix $\mathbf{D}$ is defined as the inner product

$$\mathbf{D} = \Bigl(\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}\Bigr)^{\mathrm{H}} \Bigl(\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}\Bigr), \tag{27}$$

which only depends on the positions of the known entries and the given matrices $\mathbf{B}^{(n)}$. Blocking is used to lower the intermediate memory requirements for expanding $\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}$. As the size of $\mathbf{D}$ is $\prod_n D_n \times \prod_n D_n$, it is clear that this approach is limited to tensors of low order $N$ and small $D_n$, as the number of entries in $\mathbf{D}$ scales as $D^{2N}$, assuming $D_n = D$ for $n = 1, \ldots, N$.
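The precomputation of $\mathbf{D}$ can be sketched as follows (Python/NumPy; `build_D` and the block size are hypothetical): the row-wise Khatri–Rao product of the extended basis matrices is expanded in blocks of known entries, so that only a panel with $\prod_n D_n$ columns resides in memory at a time.

```python
import numpy as np
from functools import reduce

def build_D(bases, idx, block=4096):
    """bases: B^(n) (I_n x D_n); idx: index vectors i^(n) of the known entries."""
    Dtot = int(np.prod([B.shape[1] for B in bases]))
    D = np.zeros((Dtot, Dtot))
    n_ke = idx[0].size
    rowkron = lambda X, Y: (X[:, :, None] * Y[:, None, :]).reshape(X.shape[0], -1)
    for start in range(0, n_ke, block):
        sl = slice(start, min(start + block, n_ke))
        rows = [B[i[sl], :] for B, i in zip(bases, idx)]   # blocks of B^[n]
        panel = reduce(rowkron, reversed(rows))            # modes in reverse order, cf. (2)
        D += panel.T @ panel                                # accumulate (27)
    return D
```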
In the remainder of this section, we show how $\mathbf{D}$ can be used to reduce the computational cost of the ingredients in the cpdli algorithm. Algorithm 2 gives an overview of cpdli_di.
Algorithm 2 cpdli_di using Gauss–Newton with dogleg trust region
1: Input: $\mathcal{S}$, $\mathcal{T}$, $\{\mathbf{B}^{(n)}\}_{n=1}^{N}$ and $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
2: Output: $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
3: Compute $\mathbf{D}$ from (27), $\hat{\mathbf{t}}_{\mathrm{B}}$ from (30), and $\|\mathcal{S} * \mathcal{T}\|^2$.
4: while not converged do
5:   Compute gradient $\mathbf{g}$ using (36) and (34).
6:   Compute step $\mathbf{p}$ using either
     • $\mathbf{p} = -\mathbf{H}^{\dagger}\bar{\mathbf{g}}$ with $\mathbf{H}$ from (37), or
     • PCG of $\mathbf{H}\mathbf{p} = -\bar{\mathbf{g}}$ with Gramian-vector products $\mathbf{H}\mathbf{p}$ computed using (40) and (41), and preconditioner (45).
7:   Update $\mathbf{C}^{(n)}$, $n = 1, \ldots, N$, using DLTR from $\mathbf{p}$, $\mathbf{g}$ and function value (29).
8: end while
Objective function. The objective function (13) can be expanded as follows:

$$f = \frac{1}{2} \left\| \mathcal{S} * \mathcal{T} \right\|^2 - \mathrm{Re}\left( \left\langle \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket, \mathcal{T} \right\rangle_{\mathcal{S}} \right) + \frac{1}{2} \left\| \mathcal{S} * \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket \right\|^2. \tag{28}$$

Let $\hat{\mathbf{t}}$ be the vector containing all values of known entries of $\mathcal{T}$. The norm $\left\| \mathcal{S} * \mathcal{T} \right\|^2$ is then computed prior to the optimization process as $\hat{\mathbf{t}}^{\mathrm{H}}\hat{\mathbf{t}}$. In subsection SM1.3 we show that $f$ can be computed as

$$f = \frac{1}{2} \hat{\mathbf{t}}^{\mathrm{H}}\hat{\mathbf{t}} + \mathrm{Re}\left( \left( \frac{1}{2} \mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right)^{\mathrm{H}} \mathbf{D} - \hat{\mathbf{t}}_{\mathrm{B}}^{\mathrm{H}} \right) \mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right) \right). \tag{29}$$

The vector $\hat{\mathbf{t}}_{\mathrm{B}}$ is computed a priori and is defined as

$$\hat{\mathbf{t}}_{\mathrm{B}} = \Bigl(\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}\Bigr)^{\mathrm{H}} \hat{\mathbf{t}}. \tag{30}$$
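A sketch of evaluating (29) (Python/NumPy, real-valued data assumed; `cpd_vec` is a hypothetical helper computing $\mathrm{vec}(\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)} \rrbracket)$ with the mode-1 index varying fastest): once $\mathbf{D}$, $\hat{\mathbf{t}}_{\mathrm{B}}$ and $\hat{\mathbf{t}}^{\mathrm{H}}\hat{\mathbf{t}}$ are available, the cost no longer depends on $N_{\mathrm{ke}}$.

```python
import numpy as np
from functools import reduce

def cpd_vec(C):
    """vec of the small D_1 x ... x D_N tensor with CPD factors C^(n)."""
    R = C[0].shape[1]
    cols = [reduce(np.kron, [Cn[:, r] for Cn in reversed(C)]) for r in range(R)]
    return np.sum(cols, axis=0)

def objective_di(C, D, t_B, t_norm2):
    """t_norm2 = t^H t, computed once before the optimization starts."""
    m = cpd_vec(C)
    return 0.5 * t_norm2 + 0.5 * m @ (D @ m) - t_B @ m    # eq. (29), real case
```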
While using (29) is computationally more efficient, it is numerically less accurate. Consider only the known entries $\hat{\mathbf{t}}$ and assume an approximation up to an error vector $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\| = \epsilon \|\hat{\mathbf{t}}\|$ has been found. The data-dependent objective function is then equivalent to

$$2f = \left\| (\hat{\mathbf{t}} + \boldsymbol{\delta}) - \hat{\mathbf{t}} \right\|^2 = \|\boldsymbol{\delta}\|^2 = \mathcal{O}\left(\epsilon^2\right). \tag{31}$$

For the data-independent objective function we have

$$2f = \left\| \hat{\mathbf{t}} + \boldsymbol{\delta} \right\|^2 - 2\,\mathrm{Re}\left( (\hat{\mathbf{t}} + \boldsymbol{\delta})^{\mathrm{H}} \hat{\mathbf{t}} \right) + \left\|\hat{\mathbf{t}}\right\|^2 \tag{32}$$
$$= \mathrm{Re}\left( \left\|\hat{\mathbf{t}}\right\|^2 + 2\hat{\mathbf{t}}^{\mathrm{H}}\boldsymbol{\delta} + \|\boldsymbol{\delta}\|^2 - 2\left( \left\|\hat{\mathbf{t}}\right\|^2 + \hat{\mathbf{t}}^{\mathrm{H}}\boldsymbol{\delta} \right) + \left\|\hat{\mathbf{t}}\right\|^2 \right). \tag{33}$$

If $\epsilon < \sqrt{\epsilon_{\mathrm{mach}}}$, with $\epsilon_{\mathrm{mach}}$ the machine precision, $\|\boldsymbol{\delta}\|^2$ is numerically zero, and $2f = 0$ or $2f = \pm\mathcal{O}(\epsilon_{\mathrm{mach}})$ instead of $\mathcal{O}(\epsilon^2)$. This limits the accuracy to $\mathcal{O}(\sqrt{\epsilon_{\mathrm{mach}}})$. In applications, however, other factors often put more stringent restrictions on the accuracy. Due to the signal-to-noise ratio, for example, only a few accurate digits are required in signal processing, or, in machine learning and data analysis, a limited precision is expected because of model errors and regularization terms. In both examples, the limited accuracy caused by using (29) is therefore not a problem. If, in a specific application, a solution accurate up to machine precision is required, the solution found by the data-independent algorithm can be refined using a few iterations of the data-dependent algorithm.
Gradient. Similar to the derivation of the objective function, the residual tensor $\mathcal{F}$ in (16) is separated into a data part and a decomposition part. As shown in subsection SM1.3, the gradient can be written as

$$\mathbf{G}^{(n)} = \left( \bar{\mathcal{Z}} - \bar{\mathcal{T}}_{\mathrm{B}} \right)_{(n)} \Bigl(\bigodot_{k \ne n} \mathbf{C}^{(k)}\Bigr), \tag{34}$$

which is an mtkrprod involving the following compressed, dense tensors. The values of the compressed data representation $\bar{\mathcal{T}}_{\mathrm{B}} = \left(\mathcal{S} * \bar{\mathcal{T}}\right) \cdot_1 \mathbf{B}^{(1)\mathrm{T}} \cdots \cdot_N \mathbf{B}^{(N)\mathrm{T}}$ are $\hat{\mathbf{t}}_{\mathrm{B}}$ from (30). The conjugated auxiliary tensor $\bar{\mathcal{Z}}$ has size $D_1 \times \cdots \times D_N$ and is computed as

$$\bar{\mathcal{Z}} = \left(\mathcal{S} * \left\llbracket \bar{\mathbf{A}}^{(1)}, \ldots, \bar{\mathbf{A}}^{(N)} \right\rrbracket\right) \cdot_1 \mathbf{B}^{(1)\mathrm{T}} \cdots \cdot_N \mathbf{B}^{(N)\mathrm{T}} \tag{35}$$
$$= \mathrm{reshape}\left( \bar{\mathbf{D}}\,\mathrm{vec}\left(\left\llbracket \bar{\mathbf{C}}^{(1)}, \ldots, \bar{\mathbf{C}}^{(N)} \right\rrbracket\right), [D_1, \ldots, D_N] \right). \tag{36}$$
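A sketch of the data-independent gradient (Python/NumPy, real-valued data assumed, names hypothetical): the auxiliary tensor $\mathcal{Z}$ follows from $\mathbf{D}$ and the coefficient matrices as in (36), after which (34) is a dense mtkrprod involving tensors of size $D_1 \times \cdots \times D_N$ only.

```python
import numpy as np
from functools import reduce

def gradient_di(C, D, T_B):
    """C: coefficient matrices C^(n) (D_n x R); D: matrix from (27);
    T_B: compressed data tensor (D_1 x ... x D_N), consistent with vec(.)."""
    dims, R, N = [Cn.shape[0] for Cn in C], C[0].shape[1], len(C)
    m = np.sum([reduce(np.kron, [Cn[:, r] for Cn in reversed(C)])
                for r in range(R)], axis=0)              # vec([[C^(1),...,C^(N)]])
    Z = (D @ m).reshape(dims, order="F")                 # eq. (36)
    E = Z - T_B                                          # compressed residual
    kr = lambda X, Y: (Y[None, :, :] * X[:, None, :]).reshape(-1, R)
    grads = []
    for n in range(N):
        En = np.moveaxis(E, n, 0).reshape(dims[n], -1, order="F")  # mode-n unfolding
        others = [C[k] for k in range(N) if k != n][::-1]          # reverse order
        grads.append(En @ reduce(kr, others))            # eq. (34): dense mtkrprod
    return grads
```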
Gramian of the Jacobian. The complexity of computing the data-dependent Gramian in (18) can be reduced by removing the dependency on $\mathbf{B}^{[n]}$. In subsection SM1.3, we show that

$$\mathbf{H}^{(m,n)} = \Bigl(\bigodot_{k \ne m} \mathbf{C}^{(k)} \otimes \mathbf{I}\Bigr)^{\mathrm{H}} \mathbf{P}^{(m)} \mathbf{D} \mathbf{P}^{(n)\mathrm{T}} \Bigl(\bigodot_{k \ne n} \mathbf{C}^{(k)} \otimes \mathbf{I}\Bigr). \tag{37}$$

This is a variation on the mtkrprod called the Khatri–Rao times matricized tensor times Khatri–Rao product (krmtkrprod). While an mtkrprod can be computed in a memory and cache efficient way, i.e., without permuting entries of the tensor [31, 41], permutations are required for krmtkrprod. Using similar ideas as in [31, 41], the performance penalty incurred by permuting the tensor can be minimized using appropriate left-to-right and right-to-left contractions. The exact description of this algorithm is out of the scope of this paper. An implementation of krmtkrprod will be made available in Tensorlab [44].
Similar to the data-dependent algorithm, CG can be used for large-scale problems, hence requiring a Gramian-vector product. From (37) we have for a single Gramian block $\mathbf{H}^{(m,n)}$

$$\mathbf{H}^{(m,n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right) = \Bigl(\bigodot_{k \ne m} \mathbf{C}^{(k)} \otimes \mathbf{I}\Bigr)^{\mathrm{H}} \mathbf{P}^{(m)} \mathrm{vec}\left(\mathcal{Z}^{(n)}\right), \tag{38}$$

in which the vectorized tensor $\mathcal{Z}^{(n)}$ of size $D_1 \times \cdots \times D_N$ is given by

$$\mathrm{vec}\left(\mathcal{Z}^{(n)}\right) = \mathbf{D}\,\mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{X}^{(n)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right), \tag{39}$$
$$\mathrm{vec}\left(\mathcal{Z}\right) = \sum_{n=1}^{N} \mathrm{vec}\left(\mathcal{Z}^{(n)}\right) = \mathbf{D} \sum_{n=1}^{N} \mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{X}^{(n)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right). \tag{40}$$

The matricized result $\mathbf{Y}^{(m)}$ is then computed using an mtkrprod:

$$\mathbf{Y}^{(m)} = \sum_{n=1}^{N} \mathbf{H}^{(m,n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right) = \mathbf{Z}_{(m)} \Bigl(\bigodot_{k \ne m} \bar{\mathbf{C}}^{(k)}\Bigr). \tag{41}$$
2.3. Complexity. To conclude this section, the per-iteration complexity of both variants is investigated and compared. For simplicity of notation, assume $\mathcal{T}$ is a cubical tensor, i.e., $I_n = I$, and the matrices $\mathbf{B}^{(n)}$ have the same number of columns, i.e., $D_n = D$, for $n = 1, \ldots, N$. A summary of the per-iteration complexity for cpdli_dd and cpdli_di is given in Table 1 and Table 2, respectively. For large-scale problems, i.e., problems with a large number of variables, the choice between cpdli_dd and cpdli_di depends on $N_{\mathrm{ke}}$ and $D$: the former's overall complexity per iteration scales with $\mathcal{O}\left(N^2 R D N_{\mathrm{ke}}\right)$ compared to $\mathcal{O}\left(D^{2N}\right)$ for the latter. Therefore, the cpdli_di algorithm is favored for low-order tensors with $D$ small and relatively many known entries.

Table 1
The per-iteration complexity of the cpdli_dd algorithm is dominated by Gramian operations. The overall complexity depends linearly on the number of known entries $N_{\mathrm{ke}}$. A trust region method is used to determine the update of the variables, requiring $\mathrm{it}_{\mathrm{TR}}$ additional function evaluations.

|                     | Calls per iteration | Complexity |
|---------------------|---------------------|------------|
| Function value      | $1 + \mathrm{it}_{\mathrm{TR}}$ | $\mathcal{O}\left(N_{\mathrm{ke}} N R\right)$ |
| Gradient (sparse)   | 1 | $\mathcal{O}\left(2N^2 R N_{\mathrm{ke}}\right)$ |
| Gradient (dense)    | 1 | $\mathcal{O}\left(2N D^N (R + N_{\mathrm{ke}})\right)$ |
| Gramian             | 1 | $\mathcal{O}\left(N^2 D^2 R^2 N_{\mathrm{ke}} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Gramian-vector      | $\mathrm{it}_{\mathrm{CG}}$ | $\mathcal{O}\left(2N^2 D R N_{\mathrm{ke}}\right)$ |
| Total (small-scale) |  | $\mathcal{O}\left(N^2 D^2 R^2 N_{\mathrm{ke}} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Total (large-scale) |  | $\mathcal{O}\left(2\,\mathrm{it}_{\mathrm{CG}} N_{\mathrm{ke}} N^2 R D\right)$ |

Table 2
The per-iteration complexity of the cpdli_di algorithm is dominated by Gramian operations involving $\mathbf{D}$. The number of known entries $N_{\mathrm{ke}}$ appears only in the preparation complexity, which is independent of the number of iterations. A trust region method is used to determine the update of the variables, requiring $\mathrm{it}_{\mathrm{TR}}$ additional function evaluations.

|                     | Calls per iteration | Complexity |
|---------------------|---------------------|------------|
| Preparation         |  | $\mathcal{O}\left(2N D^{2N} N_{\mathrm{ke}}\right)$ |
| Function value      | $1 + \mathrm{it}_{\mathrm{TR}}$ | $\mathcal{O}\left(2D^{2N}\right)$ |
| Gradient            | 1 | $\mathcal{O}\left(2D^{2N} + 3D^N R N\right)$ |
| Gramian             | 1 | $\mathcal{O}\left(N^2 R D^{2N} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Gramian-vector      | $\mathrm{it}_{\mathrm{CG}}$ | $\mathcal{O}\left(2D^{2N} + N^2 D^N R\right)$ |
| Total (small-scale) |  | $\mathcal{O}\left(N^2 R D^{2N} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Total (large-scale) |  | $\mathcal{O}\left(2(\mathrm{it}_{\mathrm{CG}} + 2) D^{2N}\right)$ |
3. Unconstrained CPD for incomplete tensors. How to deal with missing entries when computing a CPD has been an important topic for many years, and many algorithms have been proposed. We present a new algorithm called cpdi, which is a special case of cpdli_dd. In contrast to imputation based methods [17, 30, 40], the per-iteration computational complexity is linear in $N_{\mathrm{ke}}$, and in contrast to other direct minimization techniques [2, 20, 39], second-order convergence can be achieved by using the exact Gramian. By combining fast convergence with efficient Gramian-vector products, cpdi outperforms state-of-the-art methods when few entries are known or when the positions of the missing entries follow a pattern, as shown in the experiments in subsection 5.1. These cases are particularly interesting when dealing with large-scale problems, as illustrated in [43].

The unconstrained CPD of an incomplete tensor can be seen as a special case of the cpdli algorithm in which the known matrices are identity matrices, i.e., $\mathbf{B}^{(n)} = \mathbf{I}_{I_n}$, for $n = 1, \ldots, N$. As $D_n = I_n$, the data-independent variant cpdli_di is usually not attractive, as $I_n$ is often large. Therefore, the cpdi algorithm is based on cpdli_dd. The expressions for the function value and the gradient can be derived trivially from cpdli_dd and result in the same expressions as in [2], which uses a quasi-Newton approach, and [39, 40], which use a Levenberg–Marquardt or Gauss–Newton approach (as in this paper). The main difference with [40] is that an inexact algorithm is used, meaning CG is used instead of inverting the Gramian. While the cpdi algorithm proposed here uses the exact Gramian, [39] computes the Gramian for a full tensor and then scales the result by the fraction of known entries. As this scaling is countered by a longer step length in the line search or trust region step, the incompleteness is in fact ignored.

In the experiments in subsection 5.1, we illustrate that the use of the exact Gramian is beneficial if the known entries are not scattered uniformly at random across the tensor, or if high accuracy is required while few entries are available. As $\mathbf{B}^{(n)}$ is an identity matrix in the case of cpdi, the Jacobian (17) is sparse. Exploiting this sparsity using sparse matrix-matrix or matrix-vector products reduces the complexity to $\mathcal{O}\left(N^2 R^2 N_{\mathrm{ke}} + \frac{8}{3} N^3 D^3 R^3\right)$ for the small-scale version (the first term is a factor $D^2$ smaller compared to cpdli_dd) and to $\mathcal{O}\left(2\,\mathrm{it}_{\mathrm{CG}} N^2 R N_{\mathrm{ke}}\right)$ for the large-scale version (a factor $D$ smaller).
4. Preconditioner. For relatively large-scale tensors, the cost of inverting the Gramian matrix in (11) can be alleviated by using preconditioned CG, which only requires matrix-vector products [27]. In PCG, the system

$$\mathbf{M}^{-1}\mathbf{H}\mathbf{p} = -\mathbf{M}^{-1}\bar{\mathbf{g}} \tag{42}$$

is solved instead, in which the preconditioner $\mathbf{M}$ is a symmetric positive definite matrix. As the convergence of CG depends on the clustering of the eigenvalues of $\mathbf{H}$, the number of CG iterations can be reduced if the eigenvalues of $\mathbf{M}^{-1}\mathbf{H}$ are more clustered. At the same time, applying the inverse of the preconditioner should be computationally cheap.

For CPD algorithms, [38] proposes a block-Jacobi type preconditioner

$$\mathbf{M}_{\mathrm{full}} = \mathrm{blkdiag}\left(\mathbf{M}^{(1)}, \ldots, \mathbf{M}^{(N)}\right), \tag{43}$$

where $\mathbf{M}^{(n)} = \mathbf{W}^{(n)} \otimes \mathbf{I}$ with $\mathbf{W}^{(n)} = \mathbf{V}^{(n)\mathrm{H}}\mathbf{V}^{(n)}$, which is computed efficiently as $\mathop{*}_{k \ne n} \mathbf{A}^{(k)\mathrm{H}}\mathbf{A}^{(k)}$. In [39], an extension to incomplete tensors uses a scaled preconditioner

$$\mathbf{M}_{\mathrm{inc,scaled}} = \rho \mathbf{M}_{\mathrm{full}}. \tag{44}$$

The fraction $\rho = N_{\mathrm{ke}} / \prod_n I_n$ is simply the fraction of known entries. This is a reasonable approximation for tensors with known entries scattered randomly across the tensor and with entries of similar magnitudes.
Here, we propose a similar block-Jacobi type preconditioner based on the expected value of the diagonal Gramian blocks $\mathbf{H}^{(n,n)}$. First, we derive a preconditioner for the cpdi algorithm, which we then extend to the cpdli algorithms by incorporating the linear constraints.

To precondition the linear systems arising when decomposing incomplete tensors using cpdi, we propose the block-Jacobi preconditioner $\mathbf{M}_{\mathrm{inc,stat}}$ with diagonal blocks $\mathbf{M}^{(n)}_{\mathrm{inc,stat}}$. The exact expression for the diagonal block $\mathbf{H}^{(n,n)} = \mathbf{J}^{(n)\mathrm{H}}\mathbf{J}^{(n)}$ would require the inversion of an $I_n R \times I_n R$ matrix, which is often too expensive for large-scale problems. Instead, the expected value of the block $\mathbf{H}^{(n,n)}$ is used. More concretely, consider the $i_n$th slice of order $N - 1$ and assume there are $Q^{(n)}_{i_n}$ known entries in this slice. Instead of taking the exact positions, or distribution, of the known entries into account, we assume that all possible distributions of exactly $Q^{(n)}_{i_n}$ known entries in this slice are equally likely. By taking the expected value over all distributions, only the fraction of known entries $f^{(n)}_{i_n} = Q^{(n)}_{i_n} / \prod_{k \ne n} I_k$ is required to approximate the blocks on the diagonal by

$$\mathbf{M}^{(n)}_{\mathrm{inc,stat}} = \mathrm{E}\left[\mathbf{H}^{(n,n)}\right] = \mathbf{W}^{(n)} \otimes \mathrm{diag}\left(\mathbf{f}^{(n)}\right), \tag{45}$$

as shown in section SM2. This expression reduces to the result from [39] if every slice contains the same number of known entries.

For both cpdli algorithms, the preconditioner is extended to the case with linear constraints on the factor matrices, resulting in the expression

$$\mathbf{M}^{(n)}_{\mathrm{linc,stat}} = \mathrm{E}\left[\mathbf{H}^{(n,n)}\right] = \mathbf{W}^{(n)} \otimes \left(\mathbf{B}^{(n)\mathrm{H}} \mathrm{diag}\left(\mathbf{f}^{(n)}\right) \mathbf{B}^{(n)}\right). \tag{46}$$

Even though this preconditioner only approximates the exact Gramian blocks, which take the exact positions of the known entries into account, the experiments in subsection 5.2 show that the number of CG iterations is effectively reduced.
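Applying the inverse of this block-Jacobi preconditioner amounts to two small linear solves per mode, as the following sketch shows (Python/NumPy, real-valued data assumed, names hypothetical); it uses the identity $(\mathbf{W} \otimes \mathbf{S})\,\mathrm{vec}(\mathbf{Y}) = \mathrm{vec}(\mathbf{S}\mathbf{Y}\mathbf{W}^{\mathrm{T}})$.

```python
import numpy as np

def apply_preconditioner_inv(blocks, factors, bases, frac):
    """blocks: right-hand-side blocks X^(n) (D_n x R); factors: A^(n) = B^(n) C^(n);
    bases: B^(n); frac: vectors f^(n) with the fraction of known entries per slice."""
    N = len(factors)
    out = []
    for n in range(N):
        R = factors[n].shape[1]
        W = np.ones((R, R))
        for k in range(N):
            if k != n:
                W *= factors[k].T @ factors[k]          # W^(n) = *_{k!=n} A^(k)T A^(k)
        S = bases[n].T @ (frac[n][:, None] * bases[n])  # B^(n)T diag(f^(n)) B^(n), cf. (46)
        Y = np.linalg.solve(W, blocks[n].T).T           # right-solve with symmetric W
        out.append(np.linalg.solve(S, Y))               # left-solve with S
    return out
```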
5. Experiments. The performance of the cpdi method and the effectiveness of the proposed preconditioner are shown with two experiments in this section. Section 6 illustrates the benefits of linear constraints and second-order information on a materials science dataset using cpdli. Details about the parameters and implementation used for the different algorithms are reported in Appendix A to allow reproducibility.
5.1. CPD of incomplete tensors. The effectiveness of using second-order information to reduce the computational cost of incomplete tensor decompositions is illustrated for two scenarios: randomly missing entries and structured missing entries. The proposed cpdi algorithm is compared to another Gauss–Newton type algorithm, cpd [38, 39], and to two quasi-Newton algorithms based on nonlinear conjugate gradients (NCG): cpd_minf [38, 39] and cp_wopt [2]. The first three algorithms are implemented in Tensorlab [44]. The Tensor Toolbox [6] implementation is used for cp_wopt. For all algorithms, we compute a low accuracy solution ($E_{\mathrm{cpd}} \approx 10^{-5}$) and a high accuracy solution ($E_{\mathrm{cpd}} \le 10^{-12}$), in which

$$E_{\mathrm{cpd}} = \max_{n=1,\ldots,N} \frac{\left\| \mathbf{A}^{(n)} - \hat{\mathbf{A}}^{(n)} \right\|}{\left\| \mathbf{A}^{(n)} \right\|}, \tag{47}$$

with $\mathbf{A}^{(n)}$ the (known) exact factor matrices and $\hat{\mathbf{A}}^{(n)}$ the estimated factor matrices after resolving the scaling and permutation indeterminacies. All experiments are repeated 25 times using different tensors and different initial guesses.
First, let $\mathcal{T} \in \mathbb{R}^{1000 \times 1000 \times 1000}$ be a rank-10 tensor generated using random factor matrices $\mathbf{A}^{(n)}$ drawn from a uniform distribution $\mathrm{U}(0, 1)$. From this tensor, entries are sampled uniformly at random. Figure 1 shows the computation time when 300 000 entries and 90 000 entries are sampled, corresponding to 0.03% and 0.009% of all entries, respectively, or, alternatively, 10 and 3 entries on average per variable, respectively. Both in time and in number of iterations, the Gauss–Newton based methods cpdi and cpd outperform the NCG based methods. Hence, the higher per-iteration complexity is countered by the reduction in the number of iterations. While cpd outperforms cpdi when many entries are known or when a low accuracy is needed, the number of iterations is higher for cpd compared to cpdi. For high accuracy and few known entries, cpdi benefits from its improved convergence properties. cp_wopt fails to find a high-accuracy solution due to its implementation of the objective function.
[Figure 1: computation time (s) and number of iterations for cpd, cpdi, cpd_minf and cp_wopt, for 90 000 and 300 000 randomly sampled entries at low and high accuracy.]
Fig. 1. The Gauss–Newton type algorithms cpd and cpdi outperform the first-order NCG type algorithms as the higher per-iteration cost is countered by a significantly lower number of required iterations. When many entries are available, the lower complexity of cpd compared to cpdi is an advantage. For higher accuracies or fewer entries, cpdi is competitive because of its faster convergence rate. cp_wopt fails to find a highly accurate solution. The reported timings and number of iterations are averages over 25 experiments.
The second scenario starts from a dense rank-10 tensor $\mathcal{T} \in \mathbb{R}^{100 \times 100 \times 100}$ constructed using random factor matrices with entries drawn from $\mathrm{U}(0, 1)$. A structured missing data pattern is constructed as follows. From $\mathcal{T}$, 95 mode-1 slices are selected uniformly at random, and from each of the selected slices 99% of the entries are discarded at random. The remaining 5 mode-1 slices are dense. While the fraction of known entries is relatively high (5.95%), the pattern prevents cpd from finding a solution. Figure 2 shows that the computational cost is again reduced significantly because of the improved convergence properties of cpdi.
5.2. Preconditioner. cpdi and cpdli have a relatively high per-iteration complexity, while only a few iterations are necessary. Reducing the complexity or the number of iterations therefore has a large impact on the overall computation time. Here, we show the effectiveness of the preconditioners proposed in section 4 in reducing the number of CG iterations and therefore in reducing the per-iteration complexity of the algorithm.

[Figure 2: computation time (s) and number of iterations for cpdi, cpd_minf and cp_wopt in the structured missing data (slices) scenario, at low and high accuracy.]
Fig. 2. When missing entries follow a structured pattern, cpdi needs only few iterations to find a solution. cpd fails to find a solution in a reasonable amount of time for both accuracy levels and is therefore omitted. cp_wopt fails to find a highly accurate solution. The reported timings and number of iterations are averages over 25 experiments.
Consider the following experiment. A random tensor $\mathcal{T} \in \mathbb{R}^{100 \times 100 \times 100}$ is constructed using linearly constrained factor matrices with random basis matrices $\mathbf{B}^{(n)} \in \mathbb{R}^{100 \times 10}$ and coefficient matrices $\mathbf{C}^{(n)} \in \mathbb{R}^{10 \times 5}$. The entries of both $\mathbf{B}^{(n)}$ and $\mathbf{C}^{(n)}$ are drawn from a uniform distribution $\mathrm{U}(0, 1)$, with $n = 1, 2, 3$. To simulate a typical iteration, the Gramian $\mathbf{H}$ and the gradient $\mathbf{g}$ are computed for random coefficients $\hat{\mathbf{C}}^{(n)}$, and the system $\mathbf{H}\mathbf{x} = -\bar{\mathbf{g}}$ is solved using PCG until convergence, i.e., until the relative residual is smaller than $10^{-6}$. The number of CG iterations is monitored when using no preconditioner, the incompleteness agnostic preconditioner (44) from [39] and the proposed preconditioner (46). As in subsection 5.1, two scenarios are considered: 0.1% randomly sampled entries, and structured missing entries constructed by removing 99% of the entries of 95 randomly selected mode-1 slices. Table 3 shows the results averaged over 25 tensors and 20 random coefficient matrices $\hat{\mathbf{C}}^{(n)}$. While both preconditioners have a similar performance for randomly missing entries, the proposed preconditioner outperforms the scaled preconditioner for structured missing entries. Both preconditioners reduce the number of CG iterations significantly compared to the non-preconditioned case.
Table 3
The proposed PC and the incompleteness agnostic PC from [39] reduce the number of CG iterations significantly compared to the non-preconditioned case. When missing entries are scattered randomly across the tensor, both preconditioners have a similar performance, as expected. When the missing entries are structured, the proposed PC reduces the number of iterations by 14 on average. The average (standard deviation) is computed over 25 tensors for 20 sets of variables each.

| Scenario            | No PC        | Agnostic PC (44) | Proposed PC (46) |
|---------------------|--------------|------------------|------------------|
| Uniformly at random | 115.8 (19.7) | 45.5 (9.3)       | 45.1 (9.2)       |
| Structured missing  | 124.6 (18.4) | 53.3 (10.0)      | 39.3 (7.3)       |
6. Materials science application. Designing new materials often requires extensive numerical simulations depending on material parameters. For example, the melting temperature of an alloy depends, among others, on the concentrations of its constituent materials [43]. As a material property can depend on a large number of variables, e.g., concentrations, temperature and pressure, the corresponding tensors with measured or computed values often have a high order. The curse of dimensionality therefore prohibits measuring all possible combinations. In [43], the CPD of an incomplete tensor allowed modeling a ninth-order tensor using only $10^{-11}$% of the data. Here, we show that using incomplete tensors and a CPD with linearly constrained factors allows the Gibbs free energy to be modeled using very few data points, while improving the accuracy w.r.t. the unconstrained problem.

Concretely, the Gibbs free energy for an alloy with silver, copper, nickel and tin as constituent materials is modeled in its liquid phase, at a temperature $T = 1000$ K. All parameters other than the concentrations (expressed in mole fraction) of silver ($c_1$), copper ($c_2$) and nickel ($c_3$) are kept constant. The concentration of tin ($c_4$) is a dependent variable, as all concentrations must sum to 1, hence $c_4 = 1 - c_1 - c_2 - c_3$. Following the Calphad model [24, 26], the Gibbs free energy can be modeled as

$$G(c_1, c_2, c_3) = G_{\mathrm{p}}(c_1, c_2, c_3) + G_{\mathrm{log}}(c_1, c_2, c_3), \tag{48}$$
$$G_{\mathrm{log}}(c_1, c_2, c_3) = \sum_{n=1}^{4} U T c_n \log_{10} c_n, \tag{49}$$
in which $U$ is the universal gas constant. Thermo-Calc software [1] with the COST531-v3 database [25] computes $G$ in chosen concentrations $c_1$, $c_2$ and $c_3$. By discretizing the concentrations, an incomplete third-order tensor $\mathcal{G}$ is obtained with the concentrations as its modes. As the logarithmic terms in $G_{\mathrm{log}}$ depend solely on known values, we subtract these terms from $\mathcal{G}$ and model only $G_{\mathrm{p}}$ as a multivariate polynomial. Note that $\mathcal{G}$ is necessarily incomplete: if sampled densely, i.e., if all feasible entries are sampled on the chosen discretized grid of concentrations, all entries lie inside a pyramid (Figure 3), as $c_1 + c_2 + c_3 > 1$ is physically infeasible. Consequently, variables at the vertices of the pyramid are estimated using only a single data point.
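The preprocessing step can be sketched as follows (Python/NumPy; variable names are hypothetical and the concentrations are assumed strictly positive): the logarithmic part (49) is evaluated at the sampled concentrations and subtracted, so that only $G_{\mathrm{p}}$ has to be modeled by the linearly constrained CPD.

```python
import numpy as np

U = 8.314462618          # universal gas constant [J/(mol K)]
T = 1000.0               # temperature [K]

def subtract_glog(c1, c2, c3, G):
    """c1, c2, c3: sampled mole fractions; G: Gibbs free energies at those points."""
    c4 = 1.0 - c1 - c2 - c3                                   # dependent tin fraction
    conc = np.stack([c1, c2, c3, c4])
    G_log = U * T * np.sum(conc * np.log10(conc), axis=0)     # eq. (49)
    return G - G_log                                          # polynomial part G_p
```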
[Figure 3: the sampled concentrations (c1, c2, c3) fill a pyramid-shaped region.]
Fig. 3. A 'dense' sampling scheme for $\mathcal{G}$ results in a pyramidal structure as $c_1 + c_2 + c_3 \le 1$.
Five algorithms are compared in this section:
• the data-dependent (cpdli_dd) and data-independent (cpdli_di) algorithms proposed in this paper,
• multivariate regression using alternating least squares (mvr_als) [7],
• structured data fusion (sdf) using struct_matvec to implement the constraints [39] and the cpdi factorization which implements cpdi as proposed here [44], and
• cpdi, i.e., without linear constraints.
The precise parameters are outlined in Appendix A. All tested algorithms behave differently and achieve a different accuracy, defined as the median relative error over all feasible entries of the discretized Gibbs free energy tensor:

$$E = \underset{i,j,k}{\mathrm{median}} \; \frac{\left| \mathcal{G}_{\mathrm{p}}(i, j, k) - \sum_{r=1}^{R} a^{(1)}_{ir} a^{(2)}_{jr} a^{(3)}_{kr} \right|}{\left| \mathcal{G}_{\mathrm{p}}(i, j, k) \right|}. \tag{50}$$
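A sketch of this accuracy measure (Python/NumPy; names hypothetical): the rank-$R$ model is evaluated at the feasible entries and the median point-wise relative error is returned.

```python
import numpy as np

def median_rel_error(vals, idx, factors):
    """vals: feasible entries of G_p; idx: their index vectors (i, j, k);
    factors: estimated factor matrices a^(n) of the rank-R model."""
    approx = np.ones((vals.size, factors[0].shape[1]))
    for A, i in zip(factors, idx):
        approx *= A[i, :]
    approx = approx.sum(axis=1)
    return np.median(np.abs(vals - approx) / np.abs(vals))    # eq. (50)
```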
As the accuracy of the final solutions computed by the different algorithms can differ, accuracy thresholds are used to allow for a fair comparison of timing results. Five levels with accuracy thresholds are set: $10^{-3}$, $5 \cdot 10^{-4}$, $10^{-4}$, $5 \cdot 10^{-5}$ and $10^{-5}$. The algorithm proceeds to the next level when the median relative error $E$ is below the current level's threshold, and the (cumulative) time for each level is recorded. A maximum computation time of three minutes is set for each level. This is repeated for 50 initial guesses for $\mathbf{C}^{(n)}$, $n = 1, 2, 3$.
In the first experiment, the entries of $\mathcal{G}_{\mathrm{p}} \in \mathbb{R}^{99 \times 99 \times 99}$ are sampled according to a dense, pyramidal grid with $c_n = 0.005, 0.015, 0.025, \ldots, 0.985$ for $n = 1, 2, 3$, which amounts to 166 338 known entries (17% of all entries). A CPD with $R = 5$ terms is computed from $\mathcal{G}_{\mathrm{p}}$ with polynomial constraints on the factor matrices, i.e., each factor vector is a polynomial of maximal degree $d = 4$, hence $D_n = 5$, $n = 1, 2, 3$. A monomial basis is used. Figure 4 shows that both cpdli algorithms find a high accuracy approximation quickly. The computation time for the data-independent algorithm cpdli_di is almost a factor ten lower, as the number of known entries is high compared to the number of variables. The accuracy of the solution achieved by the other algorithms is lower, while the time needed to find a comparable accuracy is higher. Table 4 shows that the cpdli algorithms result in a more accurate solution, proving the usefulness of second-order information. Compared to an unconstrained model, computed by cpdi, the linear constraints improve the accuracy by more than an order of magnitude. Note that we did not report values for the cpd method [39], as it failed to reach the first accuracy level, illustrating again that its Gramian approximation does not suffice for structured missing entries.
Table 4
The results computed by the cpdli algorithms are almost an order of magnitude more accurate than those computed by mvr_als and sdf in terms of median error and maximal error. Using linear constraints improves the achievable accuracy, as the cpdi algorithm consistently returns the least accurate solution. The point-wise relative error on the training set is reported and is the median over 50 experiments.

| Algorithm | Median rel. error E |
|-----------|---------------------|
| cpdli_di  | 6.91 · 10⁻⁶         |
| cpdli_dd  | 7.03 · 10⁻⁶         |
| mvr_als   | 5.29 · 10⁻⁵         |
| sdf       | 3.49 · 10⁻⁵         |
| cpdi      | 1.28 · 10⁻⁴         |

[The original table also shows the distribution of the point-wise relative errors on a logarithmic axis from 10⁻⁶ to 10⁻²; only the median values are reproduced here.]

In the second experiment, we sample 200 entries at random from $\mathcal{G}_{\mathrm{p}}$ and test cpdli_dd and the mvr_als algorithm using the same parameters as in the first experiment. All errors are computed for all feasible entries, i.e., $E$ can be seen as the