CANONICAL POLYADIC DECOMPOSITION OF INCOMPLETE TENSORS WITH LINEARLY CONSTRAINED FACTORS∗

NICO VERVLIET†, OTTO DEBALS†, AND LIEVEN DE LATHAUWER†

∗Submitted to the editors on April 19, 2017.
Funding: Nico Vervliet is supported by an Aspirant Grant from the Research Foundation — Flanders (FWO). Otto Debals is supported by a Ph.D. grant of the Agency for Innovation by Science and Technology (IWT). Research furthermore supported by: (1) Flemish Government: FWO projects G.0830.14N and G.0881.14N; (2) EU: the research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Advanced Grant BIOTENSORS (no. 339804); this paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information; (3) KU Leuven Internal Funds C16/15/059.
†KU Leuven, Dept. of Electrical Engineering ESAT/STADIUS, Kasteelpark Arenberg 10, bus 2446, B-3001 Leuven, Belgium; and Group Science, Engineering and Technology, KU Leuven - Kulak, E. Sabbelaan 53, 8500 Kortrijk, Belgium (Nico.Vervliet@kuleuven.be, Otto.Debals@kuleuven.be, Lieven.DeLathauwer@kuleuven.be).
Abstract. Incomplete datasets are prevalent in domains like machine learning, signal processing and scientific computing. Simple models often allow missing data to be predicted or can be used as representations for full datasets to reduce the computational cost. When data are explicitly or implicitly represented as tensors, which are multiway arrays of numerical values, a low-rank assumption is often appropriate. A nonlinear least squares algorithm to decompose an incomplete tensor into a sum of rank-1 terms is presented here. Furthermore, the algorithm allows the incorporation of prior knowledge in the form of linearly constrained factor matrices. By combining the exact Gramian and a novel statistical preconditioner, the algorithm allows faster and more accurate decompositions compared to state-of-the-art methods when few entries are known or when missing entries are structured. By carefully exploiting the linear constraints, the accuracy is improved further and fewer known entries are required. A specialized variant of the algorithm removes the data dependence from the optimization step by utilizing a compressed representation. When relatively many entries are known, computation time can be reduced significantly using this variant, as illustrated in a novel materials science application.
Key words. canonical polyadic decomposition, candecomp, parafac, CPD, CANDELINC, missing entries, incomplete tensors, big data, compressed sensing, multivariate regression

AMS subject classifications. 15A69
1. Introduction. Tensors, or multiway arrays of numerical values, are a higher-order generalization of vectors and matrices and can be found in a variety of applications in, e.g., signal separation, object recognition, pattern detection, numerical simulations, function approximation and uncertainty quantification. Many more examples can be found in numerous overview papers [11, 15, 16, 22, 32]. As the number of entries in a tensor scales exponentially in its order, tensor problems quickly become large-scale. The computational challenges related to this exponential increase are known as the curse of dimensionality. To extract information or to represent tensors efficiently, low-rank assumptions are often appropriate. The (canonical) polyadic decomposition (CPD), which is also known as parafac, candecomp, tensor rank decomposition or separated representation, writes an $N$th-order tensor as a sum of a (minimal) number $R$ of rank-1 terms, each of which is the outer product of $N$ nonzero vectors, with $R$ the tensor rank. Mathematically, we have

$$\mathcal{T} = \sum_{r=1}^{R} \mathbf{a}_r^{(1)} \otimes \cdots \otimes \mathbf{a}_r^{(N)} = \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket, \tag{1}$$

where $\otimes$ denotes the outer product and $\mathbf{a}_r^{(n)}$ is the $r$th column of the factor matrix $\mathbf{A}^{(n)}$, for $r = 1, \ldots, R$ and $n = 1, \ldots, N$. Many applications benefit from the mild uniqueness properties of a CPD, allowing components to be separated using few restrictions [11, 22, 32]. In contrast to the number of entries, the number of parameters in a CPD scales linearly in the order, allowing a compact representation of higher-order data. To handle large-scale problems, various approaches have been proposed, including techniques based on compression [34], randomization [29, 42] and sparsity combined with parallel architectures [18, 35].
Measuring, computing and/or storing all entries is often expensive in terms of time, memory or cost. For example, in high-dimensional integration, the number of points easily exceeds the number of atoms in the universe [28]; in uncertainty quantification, each entry requires the solution of a PDE [23]; and in multivariate regression in many variables it may be impossible to store all possible values of a discretized function [7]. Reducing the number of entries by deliberately taking few samples is therefore important or even necessary to keep the problem manageable [43]. Also, some entries can be missing or unknown due to, e.g., broken sensors, artifacts which have been removed, or physically impossible situations. Whether few entries are sampled or some entries are missing, specialized algorithms, e.g., [2, 20, 39, 40, 43], are key to decompose and analyze the resulting incomplete tensors. It is often important that the creation of the full tensor is avoided, especially for large-scale tensors.
Prior knowledge is often available, allowing the construction of more accurate and/or interpretable models. Moreover, additional constraints are important to further lower the number of required samples or to improve the accuracy. In this paper, we focus on prior knowledge that can be represented as linear constraints on the factor matrices of a CPD, i.e., $\mathbf{A}^{(n)} = \mathbf{B}^{(n)}\mathbf{C}^{(n)}$, in which $\mathbf{B}^{(n)}$ is a known matrix, while $\mathbf{C}^{(n)}$ contains the unknown coefficients. This model is also known as candelinc [13] and is ubiquitous, although it is not always recognized as a constrained CPD. For example, in multivariate regression with polynomials or sparse grids, $\mathbf{B}^{(n)}$ can be a basis of polynomials or multilevel tent functions [7, 14]. Some applications benefit from other types of bases such as Gaussian kernels or trigonometric functions [45]. Another example can be found in compressed sensing (CS), where a compressed measurement $\mathbf{b}$ is represented in a dictionary $\mathbf{D}$ by a sparse coefficient vector $\mathbf{x}$ [10, 12]. If $\mathbf{D}$ is expressed as a Kronecker product and $\mathbf{x}$ is a (possibly sparse) vectorized CPD [9, 33], the CS problem can be written as a linearly constrained CPD.
Computing a linearly constrained CPD of a full tensor is straightforward, as the tensor can be compressed using the known matrices $\mathbf{B}^{(n)}$, after which the matrices $\mathbf{C}^{(n)}$ can be found from the unconstrained CPD of the compressed tensor. Because of its efficiency, compression using suitable basis matrices [4] or the Tucker decomposition/MLSVD [21, 8] is a commonly used preprocessing step. For incomplete tensors, only algorithms for unconstrained decompositions are currently available. These algorithms can be divided into imputation or expectation-maximization algorithms, which replace missing entries with some value, and direct minimization techniques, which take only known entries into account. Examples of the former can be found in [17, 30, 40], while ccd++ [20], cp_wopt [2], indafac [40] and cpd [39] are examples of the latter. The structured data fusion (SDF) framework from [39] can be used to model linear constraints, but it does not fully exploit the inherent multilinear structure. Dedicated algorithms have been presented for specific problems; e.g., in [7] an alternating least squares (ALS) algorithm is presented for multivariate regression problems. Here, we focus on direct minimization techniques, as the memory requirements often prohibit the construction of the imputed tensor.
The efficient computation of a linearly constrained CPD of an incomplete tensor is our main focus. Two variants are presented in section 2: a data-dependent algorithm cpdli_dd, which has a per-iteration complexity linear in the number of known entries, and a data-independent algorithm cpdli_di, which has a per-iteration complexity depending only on the number of variables in the decomposition and not on the tensor entries. The presented algorithms are of the (inexact) Gauss–Newton type (see subsection 1.2); quasi-Newton or ALS variants can be derived easily. After fixing the notation and reviewing the optimization framework in the remainder of this section, the algorithms and their complexity are derived in section 2. The unconstrained CPD of an incomplete tensor is discussed as a special case in section 3. A new statistical preconditioner is discussed in section 4. Section 5 and section 6 illustrate the performance of the presented algorithms on synthetic data and a materials science example, respectively. To keep the main text and idea clear, the derivations of the algorithms and preconditioner are discussed in the supplementary material accompanying this paper.
1.1. Notation. Scalars, vectors, matrices and tensors are denoted by lower case (e.g., $a$), bold lower case (e.g., $\mathbf{a}$), bold upper case (e.g., $\mathbf{A}$) and calligraphic (e.g., $\mathcal{T}$) letters, respectively. $\mathbb{K}$ denotes either $\mathbb{R}$ or $\mathbb{C}$. Sets are indexed by superscripts within parentheses, e.g., $\mathbf{A}^{(n)}$, $n = 1, \ldots, N$. An incomplete tensor $\mathcal{T}$ has $N_{\mathrm{ke}}$ known entries at positions $(i_k^{(1)}, i_k^{(2)}, \ldots, i_k^{(N)})$, $k = 1, \ldots, N_{\mathrm{ke}}$. We collect all indices of known entries in index vectors $\mathbf{i}^{(1)}, \ldots, \mathbf{i}^{(N)}$. The extended form of a matrix $\mathbf{A}^{(n)} \in \mathbb{K}^{I_n \times R}$ is denoted by $\mathbf{A}^{[n]} = \mathbf{A}^{(n)}(\mathbf{i}^{(n)}, :)$ using Matlab-style indexing. $\mathbf{A}^{[n]} \in \mathbb{K}^{N_{\mathrm{ke}} \times R}$ contains repeated rows, as $1 \le i_k^{(n)} \le I_n$ and $N_{\mathrm{ke}} \ge I_n$, and is used to facilitate the derivation of the algorithms. A mode-$n$ vector is the generalization of a column (mode-1) and row (mode-2) vector and is defined by fixing all but the $n$th index of an $N$th-order tensor. The mode-$n$ unfolding of a tensor $\mathcal{T}$ is denoted by $\mathbf{T}_{(n)}$ and has the mode-$n$ vectors as its columns. The vectorization operator $\mathrm{vec}(\mathcal{T})$ stacks all mode-1 vectors into a column vector. Details on the order of the mode-$n$ vectors in unfoldings and vectorizations can be found in [11]. A number of products are needed. The mode-$n$ tensor-matrix product is denoted by $\mathcal{T} \cdot_n \mathbf{A}$. The Kronecker and Hadamard, or element-wise, products are denoted by $\otimes$ and $*$, respectively. The column-wise (row-wise) Khatri–Rao product between two matrices, denoted by $\odot$ ($\odot_{\mathrm{T}}$), is defined as the column-wise (row-wise) Kronecker product. To simplify expressions, the following shorthand notations are used:

$$\bigotimes_{n} \equiv \bigotimes_{n=N}^{1}, \qquad \bigotimes_{k \ne n} \equiv \bigotimes_{\substack{k=N \\ k \ne n}}^{1}, \tag{2}$$
where $N$ is the order of the tensor and indices are traversed in reverse order. Similar definitions are used for $\odot$, $\odot_{\mathrm{T}}$ and $*$. To reduce the number of parentheses needed, we assume the matrix product takes precedence over the Kronecker, (row-wise) Khatri–Rao and Hadamard products, e.g.,

$$\mathbf{A}\mathbf{B} \odot \mathbf{C}\mathbf{D} \equiv (\mathbf{A}\mathbf{B}) \odot (\mathbf{C}\mathbf{D}). \tag{3}$$
Similarly, shorthand notations are expanded first, e.g., for $N = 3$, we have

$$\bigodot_{k \ne 2} \mathbf{A}^{(k)} \odot \mathbf{A}^{(2)} \equiv \left(\mathbf{A}^{(3)} \odot \mathbf{A}^{(1)}\right) \odot \mathbf{A}^{(2)}, \tag{4}$$
$$\bigodot_{k \ne 2} \mathbf{B}^{(k)}\mathbf{C}^{(k)} \otimes \mathbf{A}^{(2)} \equiv \left(\left(\mathbf{B}^{(3)}\mathbf{C}^{(3)}\right) \odot \left(\mathbf{B}^{(1)}\mathbf{C}^{(1)}\right)\right) \otimes \mathbf{A}^{(2)}. \tag{5}$$
The complex conjugate, transpose, conjugated transpose and pseudo-inverse are denoted by $\bar{\cdot}$, $\cdot^{\mathrm{T}}$, $\cdot^{\mathrm{H}}$ and $\cdot^{\dagger}$, respectively. The column-wise concatenation of vectors $\mathbf{a}$ and $\mathbf{b}$ is written as $\mathbf{x} = [\mathbf{a}; \mathbf{b}]$ and is a shorthand for $\mathbf{x} = [\mathbf{a}^{\mathrm{T}}\ \mathbf{b}^{\mathrm{T}}]^{\mathrm{T}}$. The inner product between $\mathcal{A}$ and $\mathcal{B}$ is denoted by $\langle \mathcal{A}, \mathcal{B} \rangle = \mathrm{vec}(\mathcal{B})^{\mathrm{H}} \mathrm{vec}(\mathcal{A})$. The weighted inner product is defined as $\langle \mathcal{A}, \mathcal{B} \rangle_{\mathcal{S}} = \mathrm{vec}(\mathcal{B})^{\mathrm{H}} \mathrm{diag}(\mathrm{vec}(\mathcal{S}))\, \mathrm{vec}(\mathcal{A})$, where the weight tensor $\mathcal{S}$ has the same dimensions as $\mathcal{A}$ and $\mathcal{B}$. The Frobenius norm is denoted by $\|\cdot\|$. The real part of a complex number is written as $\mathrm{Re}(\cdot)$ and the expectation operator as $\mathrm{E}[\cdot]$. The permutation matrix $\mathbf{P}^{(n)}$ maps entry $(i_1, \ldots, i_{n-1}, i_n, i_{n+1}, \ldots, i_N)$ of a vectorized tensor to entry $(i_n, i_1, \ldots, i_{n-1}, i_{n+1}, \ldots, i_N)$. Details on the used notation can be found in [11].
1.2. Optimization framework. We discuss the expressions required to implement quasi-Newton (qN) and nonlinear least squares (NLS) algorithms to solve the optimization problem

$$\min_{\mathbf{z}} f = \frac{1}{2} \left\| F(\mathbf{z}) \right\|^2. \tag{6}$$

All presented algorithms allow complex tensors and/or factors. We use the complex optimization framework from [36, 37], which implements a large number of qN and NLS algorithms, including implementations of various line search, plane search and trust region strategies. This framework requires functions computing the objective function, the gradient and, in the case of NLS algorithms, the Gramian or Gramian-vector product. In the remainder of this subsection, we elaborate on a particular NLS algorithm called Gauss–Newton. A more detailed explanation of other qN and NLS algorithms and complex optimization can be found in [3, 27, 36, 38].

Unconstrained NLS algorithms attempt to find optimal values $\mathbf{z}^* \in \mathbb{K}^M$ using objective function (6). In iteration $k + 1$, the residual $F(\mathbf{z})$ is linearized using the first-order Taylor expansion at the current iterate $\mathbf{z}_k$:

$$F(\mathbf{z}) \approx F(\mathbf{z}_k) + \left.\frac{\partial F}{\partial \mathbf{z}}\right|_{\mathbf{z}=\mathbf{z}_k} (\mathbf{z} - \mathbf{z}_k) \tag{7}$$
$$= F(\mathbf{z}_k) + \mathbf{J}\mathbf{p}, \tag{8}$$
in which $\mathbf{J}$ is the Jacobian of $F(\mathbf{z})$ and $\mathbf{p}$ the step direction. Substituting the linearization in objective function (6) leads to the quadratic model

$$\min_{\mathbf{p}} \; \frac{1}{2} F(\mathbf{z}_k)^{\mathrm{H}} F(\mathbf{z}_k) + F(\mathbf{z}_k)^{\mathrm{H}} \mathbf{J}\mathbf{p} + \frac{1}{2} \mathbf{p}^{\mathrm{H}} \mathbf{J}^{\mathrm{H}} \mathbf{J} \mathbf{p}. \tag{9}$$

The first-order optimality criterion leads to the minimizer $\mathbf{p}^*$ by solving

$$\mathbf{J}^{\mathrm{H}}\mathbf{J}\mathbf{p}^* = -\mathbf{J}^{\mathrm{H}} F(\mathbf{z}_k) \tag{10}$$
$$\mathbf{H}\mathbf{p}^* = -\bar{\mathbf{g}}. \tag{11}$$
The matrix $\mathbf{H}$ is the Gramian (of the Jacobian) and approximates the Hessian of $f$. The right-hand side is the negated and conjugated gradient $\bar{\mathbf{g}} = \left(\frac{\mathrm{d}f}{\mathrm{d}\mathbf{z}}\right)^{\mathrm{H}}$. Note that (11) only requires the derivative w.r.t. $\mathbf{z}$, even for complex variables $\mathbf{z}$ while $f$ is non-analytic. As shown in [36], the gradient actually consists of the partial derivatives w.r.t. $\mathbf{z}$ and $\bar{\mathbf{z}}$, keeping $\bar{\mathbf{z}}$ and $\mathbf{z}$ constant, respectively. Both partial derivatives of $f$ are each other's complex conjugate, and the Jacobian $\partial F/\partial \bar{\mathbf{z}} = \mathbf{0}$, as $F$ is analytic in $\mathbf{z}$.
When the number of variables $M$ is large, the computational cost of computing the (pseudo)inverse of $\mathbf{H} \in \mathbb{C}^{M \times M}$ can be a bottleneck. To reduce the complexity, (preconditioned) conjugate gradients (CG) can be used to solve (11) [27]. CG iteratively refines an initial guess $\mathbf{p}_0$ using only Gramian-vector products $\mathbf{H}\mathbf{p}_l$, $l = 0, \ldots, M - 1$. In the case of CPD computations, [38] showed that preconditioned CG can effectively reduce the computational cost for large-scale systems.
Finally, the variables $\mathbf{z}$ are updated as

$$\mathbf{z}_{k+1} = \mathbf{z}_k + \alpha_{k+1} \mathbf{p}, \tag{12}$$

in which $\alpha_{k+1}$ is computed using a line search algorithm. Other variants use plane search or trust-region methods to compute $\mathbf{z}_{k+1}$. For more information on different updating methods, we refer the interested reader to, e.g., [27]. In this paper, we use a dogleg trust region (DLTR) approach [27, 38].
It is clear that an NLS-based algorithm requires a way of computing the objective function, the gradient and the Gramian or Gramian-vector products. In the latter case, an effective preconditioner reduces the computational cost of the CG algorithm. In section 2, we derive these expressions for the linearly constrained CPD of incomplete tensors. An effective preconditioner is derived in section 4.
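As a concrete illustration of how these ingredients interact, the following minimal sketch implements one inexact Gauss–Newton step in Python/SciPy: system (11) is solved with preconditioned CG using Gramian-vector products only. The callables `grad`, `gramian_vec` and `prec_solve` are hypothetical placeholders for the expressions derived in section 2 and section 4; this is not the Tensorlab implementation of [36, 37].

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gauss_newton_step(z, grad, gramian_vec, prec_solve):
    """One inexact Gauss-Newton step: solve H p = -conj(g) with PCG."""
    g = grad(z)                                   # vectorized gradient df/dz
    n = g.size
    # Gramian H = J^H J is only available through matrix-vector products.
    H = LinearOperator((n, n), matvec=lambda p: gramian_vec(z, p), dtype=g.dtype)
    # The preconditioner is applied through its inverse, e.g., (45) or (46).
    Minv = LinearOperator((n, n), matvec=prec_solve, dtype=g.dtype)
    p, info = cg(H, -np.conj(g), M=Minv)
    return p                                      # step handed to the DLTR update
```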
2. CPD with linearly constrained factors for incomplete tensors. The proposed cpdli algorithms compute the rank-$R$ CPD of an $N$th-order incomplete tensor $\mathcal{T} \in \mathbb{K}^{I_1 \times \cdots \times I_N}$ with linear constraints on the factor matrices using an (inexact) Gauss–Newton algorithm with dogleg trust regions. Mathematically, this results in the following optimization problem:

$$\min_{\mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)}} \; f = \frac{1}{2} \left\| \mathcal{S} * \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket - \mathcal{T} \right\|^2 \tag{13}$$
$$\text{subject to} \quad \mathbf{A}^{(n)} = \mathbf{B}^{(n)}\mathbf{C}^{(n)}, \quad n = 1, \ldots, N. \tag{14}$$

An entry in the binary sampling tensor $\mathcal{S}$ is one if the corresponding entry in $\mathcal{T}$ is known and zero otherwise. The factor matrices $\mathbf{A}^{(n)} \in \mathbb{K}^{I_n \times R}$ are the product of known matrices $\mathbf{B}^{(n)} \in \mathbb{K}^{I_n \times D_n}$ and unknown coefficient matrices $\mathbf{C}^{(n)} \in \mathbb{K}^{D_n \times R}$, $n = 1, \ldots, N$.
Two variants of cpdli are derived: the data-dependent algorithm uses the data in every iteration, while the data-independent variant computes a projected representation of the data prior to the optimization and uses this representation in every iteration. The derivations of all expressions are presented in section SM1.
2.1. A data-dependent algorithm: cpdli_dd. The cpdli_dd algorithm computes the CPD with linearly constrained factor matrices of an incomplete tensor with a computational complexity linear in the number of known entries. This section summarizes the elements needed for the algorithm outlined in Algorithm 1.
Algorithm 1 cpdli_dd using Gauss–Newton with dogleg trust region
1: Input: $\mathcal{S}$, $\mathcal{T}$, $\{\mathbf{B}^{(n)}\}_{n=1}^{N}$ and $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
2: Output: $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
3: while not converged do
4:   Compute gradient $\mathbf{g}$ using (15) (or (16)).
5:   Compute step $\mathbf{p}$ using either
     • $\mathbf{p} = -\mathbf{H}^{\dagger}\bar{\mathbf{g}}$ with $\mathbf{H}$ from (18), or
     • PCG of $\mathbf{H}\mathbf{p} = -\bar{\mathbf{g}}$ with Gramian-vector products $\mathbf{H}\mathbf{p}$ computed using (23), (25) and (26) and preconditioner (45).
6:   Update $\mathbf{C}^{(n)}$, $n = 1, \ldots, N$, using DLTR from $\mathbf{p}$, $\mathbf{g}$ and function value (13).
7: end while
Objective function. Computing the function value is straightforward. After constructing the factor matrices $\mathbf{A}^{(n)} = \mathbf{B}^{(n)}\mathbf{C}^{(n)}$, $n = 1, \ldots, N$, the sum of the squared entries of the residual tensor $\mathcal{F} = \mathcal{S} * \llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \rrbracket - \mathcal{T}$ is computed. For sparsely sampled tensors with $N_{\mathrm{ke}} \ll I_1 I_2 \cdots I_N$, only the nonzero entries of $\mathcal{F}$ are constructed.
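A minimal sketch of this evaluation (Python/NumPy, real-valued data assumed; not the reference implementation) shows why the cost is linear in $N_{\mathrm{ke}}$: the CPD is evaluated only at the known positions.

```python
import numpy as np

def objective_incomplete(factors, idx, vals):
    """factors: list of A^(n) (I_n x R); idx: index vectors i^(n) of the known
    entries; vals: the corresponding known values of T."""
    n_ke, R = vals.size, factors[0].shape[1]
    prod = np.ones((n_ke, R))
    for A, i in zip(factors, idx):
        prod *= A[i, :]                  # Hadamard product of extended rows A^[n]
    residual = prod.sum(axis=1) - vals   # residual F at the known entries only
    return 0.5 * np.dot(residual, residual)
```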
Gradient. The gradient $\mathbf{g}$ of (13) w.r.t. the (vectorized) optimization variables $\mathbf{C}^{(n)}$ is partitioned as $\mathbf{g} = [\mathrm{vec}(\mathbf{G}^{(1)}); \ldots; \mathrm{vec}(\mathbf{G}^{(N)})]$, in which each subgradient $\mathbf{G}^{(n)}$ is given by

$$\mathbf{G}^{(n)} = \frac{\partial f}{\partial \mathbf{C}^{(n)}} = \mathbf{B}^{(n)\mathrm{T}} \bar{\mathbf{F}}_{(n)} \Bigl(\bigodot_{k \ne n} \mathbf{A}^{(k)}\Bigr) \tag{15}$$
$$= \left(\bar{\mathcal{F}} \cdot_1 \mathbf{B}^{(1)\mathrm{T}} \cdots \cdot_N \mathbf{B}^{(N)\mathrm{T}}\right)_{(n)} \Bigl(\bigodot_{k \ne n} \mathbf{C}^{(k)}\Bigr), \tag{16}$$

which follows from the chain rule and from multilinear identities. Which one of the expressions (15) and (16) should be used depends on the order, rank and number of known entries in the tensor and on $D_n$. Gradient expression (15) is computed using a sparse matricized tensor times Khatri–Rao product (mtkrprod), which can be computed in $\mathcal{O}\left(2N^2 N_{\mathrm{ke}} R\right)$ operations for the $N$ subgradients $\mathbf{G}^{(n)}$. Gradient expression (16) computes a projection onto the basis matrices $\mathbf{B}^{(n)}$, $n = 1, \ldots, N$, which is a series of $N$ sparse tensor-matrix products, followed by a dense mtkrprod. The former operation can also be seen as a matricized tensor times Kronecker product (mtkronprod). The overall complexity of (16) is therefore $\mathcal{O}\left(2N N_{\mathrm{ke}} \prod_n D_n\right)$ for the (sparse) mtkronprod and $\mathcal{O}\left(2N R \prod_n D_n\right)$ for the dense mtkrprod. Both mtkrprod and mtkronprod are standard operations in tensor computations. Efficient implementations for dense and sparse tensors are described in, e.g., [5, 19, 31, 35, 41].
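The following sketch (Python/NumPy, real-valued data assumed, names hypothetical) illustrates gradient expression (15): a sparse mtkrprod of the residual with the factor matrices, followed by the projection onto $\mathbf{B}^{(n)}$.

```python
import numpy as np

def gradient_dd(factors, bases, idx, resid):
    """factors: A^(n) = B^(n) C^(n); bases: B^(n); idx: index vectors i^(n);
    resid: residual values of F at the N_ke known entries."""
    N, R = len(factors), factors[0].shape[1]
    ext = [A[i, :] for A, i in zip(factors, idx)]      # extended factors A^[n]
    grads = []
    for n in range(N):
        W = resid[:, None] * np.ones((1, R))           # N_ke x R
        for k in range(N):
            if k != n:
                W = W * ext[k]                         # Hadamard accumulation
        M = np.zeros((factors[n].shape[0], R))
        np.add.at(M, idx[n], W)                        # sparse mtkrprod (scatter-add)
        grads.append(bases[n].T @ M)                   # projection: G^(n) = B^(n)T M
    return grads
```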
Gramian of the Jacobian. Finally, we compute the Gramian $\mathbf{H} = \mathbf{J}^{\mathrm{H}}\mathbf{J}$. Similar to the gradient, the Gramian is partitioned into a grid of $N \times N$ blocks $\mathbf{H}^{(m,n)}$, $m, n = 1, \ldots, N$, corresponding to the approximation of the second-order derivative w.r.t. $\mathbf{C}^{(m)}$ and $\mathbf{C}^{(n)}$. Each block is computed from the explicitly constructed Jacobians $\mathbf{J}^{(n)}$:

$$\mathbf{J}^{(n)} = \frac{\partial\, \mathrm{vec}(\mathcal{F})}{\partial\, \mathrm{vec}(\mathbf{C}^{(n)})} = \mathbf{V}^{[n]} \odot_{\mathrm{T}} \mathbf{B}^{[n]} \quad \text{in which} \quad \mathbf{V}^{[n]} = \mathop{*}_{k \ne n} \mathbf{A}^{[k]}. \tag{17}$$

The Jacobian is therefore a dense matrix of size $N_{\mathrm{ke}} \times D_n R$, and the Gramian blocks $\mathbf{H}^{(m,n)}$ of size $D_m R \times D_n R$ are given by

$$\mathbf{H}^{(m,n)} = \left(\mathbf{V}^{[m]} \odot_{\mathrm{T}} \mathbf{B}^{[m]}\right)^{\mathrm{H}} \left(\mathbf{V}^{[n]} \odot_{\mathrm{T}} \mathbf{B}^{[n]}\right). \tag{18}$$

The data-independent algorithm cpdli_di exploits the remaining structure in $\mathbf{V}^{[n]}$. To reduce the cost of computing the pseudoinverse of $\mathbf{H}$, CG is used when dealing with many variables. Each CG iteration requires the matrix-vector product $\mathbf{y} = \mathbf{H}\mathbf{x}$. A partitioning similar to the gradient and Gramian is used for the vectors $\mathbf{x}$ and $\mathbf{y}$:

$$\mathbf{x} = \left[\mathrm{vec}\left(\mathbf{X}^{(1)}\right); \ldots; \mathrm{vec}\left(\mathbf{X}^{(N)}\right)\right], \tag{19}$$
$$\mathbf{y} = \left[\mathrm{vec}\left(\mathbf{Y}^{(1)}\right); \ldots; \mathrm{vec}\left(\mathbf{Y}^{(N)}\right)\right]. \tag{20}$$
The part of the product $\mathbf{y} = \mathbf{H}\mathbf{x}$ corresponding to $\mathbf{Y}^{(m)}$ is then computed as

$$\mathrm{vec}\left(\mathbf{Y}^{(m)}\right) = \mathbf{J}^{(m)\mathrm{H}} \left[\mathbf{J}^{(1)} \cdots \mathbf{J}^{(N)}\right] \left[\mathrm{vec}\left(\mathbf{X}^{(1)}\right); \ldots; \mathrm{vec}\left(\mathbf{X}^{(N)}\right)\right] \tag{21}$$
$$= \mathbf{J}^{(m)\mathrm{H}} \sum_{n=1}^{N} \mathbf{J}^{(n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right). \tag{22}$$

Using multilinear identities, the definition of the Jacobian (17) and the auxiliary vectors $\mathbf{z}^{(n)}$, we then have

$$\mathbf{z}^{(n)} = \mathbf{J}^{(n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right) = \left(\mathbf{V}^{[n]} * \mathbf{B}^{[n]}\mathbf{X}^{(n)}\right) \mathbf{1}, \tag{23}$$
$$\mathrm{vec}\left(\mathbf{Y}^{(m)}\right) = \left(\mathbf{V}^{[m]} \odot_{\mathrm{T}} \mathbf{B}^{[m]}\right)^{\mathrm{H}} \sum_{n=1}^{N} \mathbf{z}^{(n)}. \tag{24}$$
The matrix $\mathbf{B}^{[n]}\mathbf{X}^{(n)}$ can contain many identical rows, as $\mathbf{B}^{[n]}$ is the extended form of $\mathbf{B}^{(n)}$. The computational cost can therefore be reduced by extending the product $\mathbf{B}^{(n)}\mathbf{X}^{(n)}$ instead of $\mathbf{B}^{(n)}$. The product $\left(\mathbf{V}^{[m]} \odot_{\mathrm{T}} \mathbf{B}^{[m]}\right)^{\mathrm{H}} \sum_{n=1}^{N} \mathbf{z}^{(n)}$ can be computed efficiently using accumulation. Let $\mathbf{z} = \sum_{n=1}^{N} \mathbf{z}^{(n)}$ and define the $i$th row of $\mathbf{Q}^{(m)} \in \mathbb{K}^{I_m \times R}$ by

$$\mathbf{Q}^{(m)}(i, :) = \sum_{k \in \pi^{(m)}} z_k \bar{\mathbf{V}}^{[m]}(k, :) \quad \text{with} \quad \pi^{(m)} = \left\{ k \mid i^{(m)}(k) = i \right\}. \tag{25}$$

The cost of the row-wise Khatri–Rao product in (24) is avoided by computing

$$\mathbf{Y}^{(m)} = \mathbf{B}^{(m)\mathrm{H}} \mathbf{Q}^{(m)}. \tag{26}$$
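A sketch of the resulting Gramian-vector product (Python/NumPy, real-valued data assumed, names hypothetical): $\mathbf{z}^{(n)}$ is formed entry-wise as in (23), the contributions are accumulated per row index as in (25), and the final projection (26) avoids the row-wise Khatri–Rao product.

```python
import numpy as np

def gramian_vec_dd(factors, bases, idx, X):
    """X: list of blocks X^(n) (D_n x R), partitioned as in (19)."""
    N, R = len(factors), factors[0].shape[1]
    ext = [A[i, :] for A, i in zip(factors, idx)]               # A^[n], N_ke x R
    V = [np.prod([ext[k] for k in range(N) if k != n], axis=0)  # V^[n], cf. (17)
         for n in range(N)]
    # z = sum_n z^(n) with z^(n) from (23); B^(n) X^(n) is extended, not B^(n)
    z = sum((((bases[n] @ X[n])[idx[n], :]) * V[n]).sum(axis=1) for n in range(N))
    Y = []
    for m in range(N):
        Q = np.zeros((bases[m].shape[0], R))                    # Q^(m), eq. (25)
        np.add.at(Q, idx[m], z[:, None] * V[m])
        Y.append(bases[m].T @ Q)                                # Y^(m), eq. (26)
    return Y
```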
2.2. Removing data dependence: cpdli_di. In each iteration, the cpdli_dd algorithm has a complexity dependent on the number of known entries $N_{\mathrm{ke}}$. In this section, we show how the structure resulting from the linear constraints allows the per-iteration complexity to be independent of the number of known entries by computing a data-dependent matrix $\mathbf{D}$ prior to the optimization process. This matrix $\mathbf{D}$ is defined as the inner product

$$\mathbf{D} = \Bigl(\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}\Bigr)^{\mathrm{H}} \Bigl(\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}\Bigr), \tag{27}$$

which only depends on the positions of the known entries and the given matrices $\mathbf{B}^{(n)}$. Blocking is used to lower the intermediate memory requirements for expanding $\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}$. As the size of $\mathbf{D}$ is $\prod_n D_n \times \prod_n D_n$, it is clear that this approach is limited to tensors of low order $N$ and small $D_n$, as the number of entries in $\mathbf{D}$ scales as $D^{2N}$, assuming $D_n = D$ for $n = 1, \ldots, N$.
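The precomputation of $\mathbf{D}$ can be sketched as follows (Python/NumPy; `build_D` and the block size are hypothetical): the row-wise Khatri–Rao product of the extended basis matrices is expanded in blocks of known entries, so that only a panel with $\prod_n D_n$ columns resides in memory at a time.

```python
import numpy as np
from functools import reduce

def build_D(bases, idx, block=4096):
    """bases: B^(n) (I_n x D_n); idx: index vectors i^(n) of the known entries."""
    Dtot = int(np.prod([B.shape[1] for B in bases]))
    D = np.zeros((Dtot, Dtot))
    n_ke = idx[0].size
    rowkron = lambda X, Y: (X[:, :, None] * Y[:, None, :]).reshape(X.shape[0], -1)
    for start in range(0, n_ke, block):
        sl = slice(start, min(start + block, n_ke))
        rows = [B[i[sl], :] for B, i in zip(bases, idx)]   # blocks of B^[n]
        panel = reduce(rowkron, reversed(rows))            # modes in reverse order, cf. (2)
        D += panel.T @ panel                                # accumulate (27)
    return D
```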
In the remainder of this section, we show how $\mathbf{D}$ can be used to reduce the computational cost of the ingredients in the cpdli algorithm. Algorithm 2 gives an overview of cpdli_di.
Algorithm 2 cpdli_di using Gauss–Newton with dogleg trust region
1: Input: $\mathcal{S}$, $\mathcal{T}$, $\{\mathbf{B}^{(n)}\}_{n=1}^{N}$ and $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
2: Output: $\{\mathbf{C}^{(n)}\}_{n=1}^{N}$
3: Compute $\mathbf{D}$ from (27), $\hat{\mathbf{t}}_{\mathrm{B}}$ from (30), and $\|\mathcal{S} * \mathcal{T}\|^2$.
4: while not converged do
5:   Compute gradient $\mathbf{g}$ using (36) and (34).
6:   Compute step $\mathbf{p}$ using either
     • $\mathbf{p} = -\mathbf{H}^{\dagger}\bar{\mathbf{g}}$ with $\mathbf{H}$ from (37), or
     • PCG of $\mathbf{H}\mathbf{p} = -\bar{\mathbf{g}}$ with Gramian-vector products $\mathbf{H}\mathbf{p}$ computed using (40) and (41), and preconditioner (45).
7:   Update $\mathbf{C}^{(n)}$, $n = 1, \ldots, N$, using DLTR from $\mathbf{p}$, $\mathbf{g}$ and function value (29).
8: end while
Objective function. The objective function (13) can be expanded as follows:

$$f = \frac{1}{2} \left\| \mathcal{S} * \mathcal{T} \right\|^2 - \mathrm{Re}\left( \left\langle \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket, \mathcal{T} \right\rangle_{\mathcal{S}} \right) + \frac{1}{2} \left\| \mathcal{S} * \left\llbracket \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)} \right\rrbracket \right\|^2. \tag{28}$$

Let $\hat{\mathbf{t}}$ be the vector containing all values of known entries of $\mathcal{T}$. The norm $\left\| \mathcal{S} * \mathcal{T} \right\|^2$ is then computed prior to the optimization process as $\hat{\mathbf{t}}^{\mathrm{H}}\hat{\mathbf{t}}$. In subsection SM1.3 we show that $f$ can be computed as

$$f = \frac{1}{2} \hat{\mathbf{t}}^{\mathrm{H}}\hat{\mathbf{t}} + \mathrm{Re}\left( \left( \frac{1}{2} \mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right)^{\mathrm{H}} \mathbf{D} - \hat{\mathbf{t}}_{\mathrm{B}}^{\mathrm{H}} \right) \mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right) \right). \tag{29}$$

The vector $\hat{\mathbf{t}}_{\mathrm{B}}$ is computed a priori and is defined as

$$\hat{\mathbf{t}}_{\mathrm{B}} = \Bigl(\bigodot_{\mathrm{T},\,n} \mathbf{B}^{[n]}\Bigr)^{\mathrm{H}} \hat{\mathbf{t}}. \tag{30}$$
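A sketch of evaluating (29) (Python/NumPy, real-valued data assumed; `cpd_vec` is a hypothetical helper computing $\mathrm{vec}(\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{C}^{(N)} \rrbracket)$ with the mode-1 index varying fastest): once $\mathbf{D}$, $\hat{\mathbf{t}}_{\mathrm{B}}$ and $\hat{\mathbf{t}}^{\mathrm{H}}\hat{\mathbf{t}}$ are available, the cost no longer depends on $N_{\mathrm{ke}}$.

```python
import numpy as np
from functools import reduce

def cpd_vec(C):
    """vec of the small D_1 x ... x D_N tensor with CPD factors C^(n)."""
    R = C[0].shape[1]
    cols = [reduce(np.kron, [Cn[:, r] for Cn in reversed(C)]) for r in range(R)]
    return np.sum(cols, axis=0)

def objective_di(C, D, t_B, t_norm2):
    """t_norm2 = t^H t, computed once before the optimization starts."""
    m = cpd_vec(C)
    return 0.5 * t_norm2 + 0.5 * m @ (D @ m) - t_B @ m    # eq. (29), real case
```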
While using (29) is computationally more efficient, it is numerically less accurate. Consider only the known entries $\hat{\mathbf{t}}$ and assume an approximation up to an error vector $\boldsymbol{\delta}$ with $\|\boldsymbol{\delta}\| = \epsilon \|\hat{\mathbf{t}}\|$ has been found. The data-dependent objective function is then equivalent to

$$2f = \left\| (\hat{\mathbf{t}} + \boldsymbol{\delta}) - \hat{\mathbf{t}} \right\|^2 = \|\boldsymbol{\delta}\|^2 = \mathcal{O}\left(\epsilon^2\right). \tag{31}$$

For the data-independent objective function we have

$$2f = \left\| \hat{\mathbf{t}} + \boldsymbol{\delta} \right\|^2 - 2\,\mathrm{Re}\left( (\hat{\mathbf{t}} + \boldsymbol{\delta})^{\mathrm{H}} \hat{\mathbf{t}} \right) + \left\|\hat{\mathbf{t}}\right\|^2 \tag{32}$$
$$= \mathrm{Re}\left( \left\|\hat{\mathbf{t}}\right\|^2 + 2\hat{\mathbf{t}}^{\mathrm{H}}\boldsymbol{\delta} + \|\boldsymbol{\delta}\|^2 - 2\left( \left\|\hat{\mathbf{t}}\right\|^2 + \hat{\mathbf{t}}^{\mathrm{H}}\boldsymbol{\delta} \right) + \left\|\hat{\mathbf{t}}\right\|^2 \right). \tag{33}$$

If $\epsilon < \sqrt{\epsilon_{\mathrm{mach}}}$, with $\epsilon_{\mathrm{mach}}$ the machine precision, $\|\boldsymbol{\delta}\|^2$ is numerically zero, and $2f = 0$ or $2f = \pm\mathcal{O}(\epsilon_{\mathrm{mach}})$ instead of $\mathcal{O}(\epsilon^2)$. This limits the accuracy to $\mathcal{O}(\sqrt{\epsilon_{\mathrm{mach}}})$. In applications, however, other factors often put more stringent restrictions on the accuracy. Due to the signal-to-noise ratio, for example, only a few accurate digits are required in signal processing, or, in machine learning and data analysis, a limited precision is expected because of model errors and regularization terms. In both examples, the limited accuracy caused by using (29) is therefore not a problem. If, in a specific application, a solution accurate up to machine precision is required, the solution found by the data-independent algorithm can be refined using a few iterations of the data-dependent algorithm.
Gradient. Similar to the derivation of the objective function, the residual tensor $\mathcal{F}$ in (16) is separated into a data part and a decomposition part. As shown in subsection SM1.3, the gradient can be written as

$$\mathbf{G}^{(n)} = \left( \bar{\mathcal{Z}} - \bar{\mathcal{T}}_{\mathrm{B}} \right)_{(n)} \Bigl(\bigodot_{k \ne n} \mathbf{C}^{(k)}\Bigr), \tag{34}$$

which is an mtkrprod involving the following compressed, dense tensors. The values of the compressed data representation $\bar{\mathcal{T}}_{\mathrm{B}} = \left(\mathcal{S} * \bar{\mathcal{T}}\right) \cdot_1 \mathbf{B}^{(1)\mathrm{T}} \cdots \cdot_N \mathbf{B}^{(N)\mathrm{T}}$ are $\hat{\mathbf{t}}_{\mathrm{B}}$ from (30). The conjugated auxiliary tensor $\bar{\mathcal{Z}}$ has size $D_1 \times \cdots \times D_N$ and is computed as

$$\bar{\mathcal{Z}} = \left(\mathcal{S} * \left\llbracket \bar{\mathbf{A}}^{(1)}, \ldots, \bar{\mathbf{A}}^{(N)} \right\rrbracket\right) \cdot_1 \mathbf{B}^{(1)\mathrm{T}} \cdots \cdot_N \mathbf{B}^{(N)\mathrm{T}} \tag{35}$$
$$= \mathrm{reshape}\left( \bar{\mathbf{D}}\,\mathrm{vec}\left(\left\llbracket \bar{\mathbf{C}}^{(1)}, \ldots, \bar{\mathbf{C}}^{(N)} \right\rrbracket\right), [D_1, \ldots, D_N] \right). \tag{36}$$
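A sketch of the data-independent gradient (Python/NumPy, real-valued data assumed, names hypothetical): the auxiliary tensor $\mathcal{Z}$ follows from $\mathbf{D}$ and the coefficient matrices as in (36), after which (34) is a dense mtkrprod involving tensors of size $D_1 \times \cdots \times D_N$ only.

```python
import numpy as np
from functools import reduce

def gradient_di(C, D, T_B):
    """C: coefficient matrices C^(n) (D_n x R); D: matrix from (27);
    T_B: compressed data tensor (D_1 x ... x D_N), consistent with vec(.)."""
    dims, R, N = [Cn.shape[0] for Cn in C], C[0].shape[1], len(C)
    m = np.sum([reduce(np.kron, [Cn[:, r] for Cn in reversed(C)])
                for r in range(R)], axis=0)              # vec([[C^(1),...,C^(N)]])
    Z = (D @ m).reshape(dims, order="F")                 # eq. (36)
    E = Z - T_B                                          # compressed residual
    kr = lambda X, Y: (Y[None, :, :] * X[:, None, :]).reshape(-1, R)
    grads = []
    for n in range(N):
        En = np.moveaxis(E, n, 0).reshape(dims[n], -1, order="F")  # mode-n unfolding
        others = [C[k] for k in range(N) if k != n][::-1]          # reverse order
        grads.append(En @ reduce(kr, others))            # eq. (34): dense mtkrprod
    return grads
```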
Gramian of the Jacobian. The complexity of computing the data-dependent Gramian in (18) can be reduced by removing the dependency on $\mathbf{B}^{[n]}$. In subsection SM1.3, we show that

$$\mathbf{H}^{(m,n)} = \Bigl(\bigodot_{k \ne m} \mathbf{C}^{(k)} \otimes \mathbf{I}\Bigr)^{\mathrm{H}} \mathbf{P}^{(m)} \mathbf{D} \mathbf{P}^{(n)\mathrm{T}} \Bigl(\bigodot_{k \ne n} \mathbf{C}^{(k)} \otimes \mathbf{I}\Bigr). \tag{37}$$

This is a variation on the mtkrprod called the Khatri–Rao times matricized tensor times Khatri–Rao product (krmtkrprod). While an mtkrprod can be computed in a memory and cache efficient way, i.e., without permuting entries of the tensor [31, 41], permutations are required for krmtkrprod. Using similar ideas as in [31, 41], the performance penalty incurred by permuting the tensor can be minimized using appropriate left-to-right and right-to-left contractions. The exact description of this algorithm is out of the scope of this paper. An implementation of krmtkrprod will be made available in Tensorlab [44].
Similar to the data-dependent algorithm, CG can be used for large-scale problems, hence requiring a Gramian-vector product. From (37) we have for a single Gramian block $\mathbf{H}^{(m,n)}$

$$\mathbf{H}^{(m,n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right) = \Bigl(\bigodot_{k \ne m} \mathbf{C}^{(k)} \otimes \mathbf{I}\Bigr)^{\mathrm{H}} \mathbf{P}^{(m)} \mathrm{vec}\left(\mathcal{Z}^{(n)}\right), \tag{38}$$

in which the vectorized tensor $\mathcal{Z}^{(n)}$ of size $D_1 \times \cdots \times D_N$ is given by

$$\mathrm{vec}\left(\mathcal{Z}^{(n)}\right) = \mathbf{D}\,\mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{X}^{(n)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right), \tag{39}$$
$$\mathrm{vec}\left(\mathcal{Z}\right) = \sum_{n=1}^{N} \mathrm{vec}\left(\mathcal{Z}^{(n)}\right) = \mathbf{D} \sum_{n=1}^{N} \mathrm{vec}\left(\left\llbracket \mathbf{C}^{(1)}, \ldots, \mathbf{X}^{(n)}, \ldots, \mathbf{C}^{(N)} \right\rrbracket\right). \tag{40}$$

The matricized result $\mathbf{Y}^{(m)}$ is then computed using an mtkrprod:

$$\mathbf{Y}^{(m)} = \sum_{n=1}^{N} \mathbf{H}^{(m,n)} \mathrm{vec}\left(\mathbf{X}^{(n)}\right) = \mathbf{Z}_{(m)} \Bigl(\bigodot_{k \ne m} \bar{\mathbf{C}}^{(k)}\Bigr). \tag{41}$$
2.3. Complexity. To conclude this section, the per-iteration complexity of both variants is investigated and compared. For simplicity of notation, assume $\mathcal{T}$ is a cubical tensor, i.e., $I_n = I$, and the matrices $\mathbf{B}^{(n)}$ have the same number of columns, i.e., $D_n = D$, for $n = 1, \ldots, N$. A summary of the per-iteration complexity for cpdli_dd and cpdli_di is given in Table 1 and Table 2, respectively. For large-scale problems, i.e., problems with a large number of variables, the choice between cpdli_dd and cpdli_di depends on $N_{\mathrm{ke}}$ and $D$: the former's overall complexity per iteration scales with $\mathcal{O}\left(N^2 R D N_{\mathrm{ke}}\right)$ compared to $\mathcal{O}\left(D^{2N}\right)$ for the latter. Therefore, the cpdli_di algorithm is favored for low-order tensors with $D$ small and relatively many known entries.

Table 1
The per-iteration complexity of the cpdli_dd algorithm is dominated by Gramian operations. The overall complexity depends linearly on the number of known entries $N_{\mathrm{ke}}$. A trust region method is used to determine the update of the variables, requiring $\mathrm{it}_{\mathrm{TR}}$ additional function evaluations.

|                     | Calls per iteration | Complexity |
|---------------------|---------------------|------------|
| Function value      | $1 + \mathrm{it}_{\mathrm{TR}}$ | $\mathcal{O}\left(N_{\mathrm{ke}} N R\right)$ |
| Gradient (sparse)   | 1 | $\mathcal{O}\left(2N^2 R N_{\mathrm{ke}}\right)$ |
| Gradient (dense)    | 1 | $\mathcal{O}\left(2N D^N (R + N_{\mathrm{ke}})\right)$ |
| Gramian             | 1 | $\mathcal{O}\left(N^2 D^2 R^2 N_{\mathrm{ke}} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Gramian-vector      | $\mathrm{it}_{\mathrm{CG}}$ | $\mathcal{O}\left(2N^2 D R N_{\mathrm{ke}}\right)$ |
| Total (small-scale) |  | $\mathcal{O}\left(N^2 D^2 R^2 N_{\mathrm{ke}} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Total (large-scale) |  | $\mathcal{O}\left(2\,\mathrm{it}_{\mathrm{CG}} N_{\mathrm{ke}} N^2 R D\right)$ |

Table 2
The per-iteration complexity of the cpdli_di algorithm is dominated by Gramian operations involving $\mathbf{D}$. The number of known entries $N_{\mathrm{ke}}$ appears only in the preparation complexity, which is independent of the number of iterations. A trust region method is used to determine the update of the variables, requiring $\mathrm{it}_{\mathrm{TR}}$ additional function evaluations.

|                     | Calls per iteration | Complexity |
|---------------------|---------------------|------------|
| Preparation         |  | $\mathcal{O}\left(2N D^{2N} N_{\mathrm{ke}}\right)$ |
| Function value      | $1 + \mathrm{it}_{\mathrm{TR}}$ | $\mathcal{O}\left(2D^{2N}\right)$ |
| Gradient            | 1 | $\mathcal{O}\left(2D^{2N} + 3D^N R N\right)$ |
| Gramian             | 1 | $\mathcal{O}\left(N^2 R D^{2N} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Gramian-vector      | $\mathrm{it}_{\mathrm{CG}}$ | $\mathcal{O}\left(2D^{2N} + N^2 D^N R\right)$ |
| Total (small-scale) |  | $\mathcal{O}\left(N^2 R D^{2N} + \frac{8}{3} N^3 D^3 R^3\right)$ |
| Total (large-scale) |  | $\mathcal{O}\left(2(\mathrm{it}_{\mathrm{CG}} + 2) D^{2N}\right)$ |
3. Unconstrained CPD for incomplete tensors. How to deal with missing entries when computing a CPD has been an important topic for many years, and many algorithms have been proposed. We present a new algorithm called cpdi, which is a special case of cpdli_dd. In contrast to imputation based methods [17, 30, 40], the per-iteration computational complexity is linear in $N_{\mathrm{ke}}$, and in contrast to other direct minimization techniques [2, 20, 39], second-order convergence can be achieved by using the exact Gramian. By combining fast convergence with efficient Gramian-vector products, cpdi outperforms state-of-the-art methods when few entries are known or when the positions of the missing entries follow a pattern, as shown in the experiments in subsection 5.1. These cases are particularly interesting when dealing with large-scale problems, as illustrated in [43].

The unconstrained CPD of an incomplete tensor can be seen as a special case of the cpdli algorithm in which the known matrices are identity matrices, i.e., $\mathbf{B}^{(n)} = \mathbf{I}_{I_n}$, for $n = 1, \ldots, N$. As $D_n = I_n$, the data-independent variant cpdli_di is usually not attractive, as $I_n$ is often large. Therefore, the cpdi algorithm is based on cpdli_dd. The expressions for the function value and the gradient can be derived trivially from cpdli_dd and result in the same expressions as in [2], which uses a quasi-Newton approach, and [39, 40], which use a Levenberg–Marquardt or Gauss–Newton approach (as in this paper). The main difference with [40] is that an inexact algorithm is used, meaning CG is used instead of inverting the Gramian. While the cpdi algorithm proposed here uses the exact Gramian, [39] computes the Gramian for a full tensor and then scales the result by the fraction of known entries. As this scaling is countered by a longer step length in the line search or trust region step, the incompleteness is in fact ignored.

In the experiments in subsection 5.1, we illustrate that the use of the exact Gramian is beneficial if the known entries are not scattered uniformly at random across the tensor, or if high accuracy is required while few entries are available. As $\mathbf{B}^{(n)}$ is an identity matrix in the case of cpdi, the Jacobian (17) is sparse. Exploiting this sparsity using sparse matrix-matrix or matrix-vector products reduces the complexity to $\mathcal{O}\left(N^2 R^2 N_{\mathrm{ke}} + \frac{8}{3} N^3 D^3 R^3\right)$ for the small-scale version (the first term is a factor $D^2$ smaller compared to cpdli_dd) and to $\mathcal{O}\left(2\,\mathrm{it}_{\mathrm{CG}} N^2 R N_{\mathrm{ke}}\right)$ for the large-scale version (a factor $D$ smaller).
4. Preconditioner. For relatively large-scale tensors, the cost of inverting the Gramian matrix in (11) can be alleviated by using preconditioned CG, which only requires matrix-vector products [27]. In PCG, the system

$$\mathbf{M}^{-1}\mathbf{H}\mathbf{p} = -\mathbf{M}^{-1}\bar{\mathbf{g}} \tag{42}$$

is solved instead, in which the preconditioner $\mathbf{M}$ is a symmetric positive definite matrix. As the convergence of CG depends on the clustering of the eigenvalues of $\mathbf{H}$, the number of CG iterations can be reduced if the eigenvalues of $\mathbf{M}^{-1}\mathbf{H}$ are more clustered. At the same time, applying the inverse of the preconditioner should be computationally cheap.

For CPD algorithms, [38] proposes a block-Jacobi type preconditioner

$$\mathbf{M}_{\mathrm{full}} = \mathrm{blkdiag}\left(\mathbf{M}^{(1)}, \ldots, \mathbf{M}^{(N)}\right), \tag{43}$$

where $\mathbf{M}^{(n)} = \mathbf{W}^{(n)} \otimes \mathbf{I}$ with $\mathbf{W}^{(n)} = \mathbf{V}^{(n)\mathrm{H}}\mathbf{V}^{(n)}$, which is computed efficiently as $\mathop{*}_{k \ne n} \mathbf{A}^{(k)\mathrm{H}}\mathbf{A}^{(k)}$. In [39], an extension to incomplete tensors uses a scaled preconditioner

$$\mathbf{M}_{\mathrm{inc,scaled}} = \rho \mathbf{M}_{\mathrm{full}}. \tag{44}$$

The fraction $\rho = N_{\mathrm{ke}} / \prod_n I_n$ is simply the fraction of known entries. This is a reasonable approximation for tensors with known entries scattered randomly across the tensor and with entries of similar magnitudes.
Here, we propose a similar block-Jacobi type preconditioner based on the expected value of the diagonal Gramian blocks $\mathbf{H}^{(n,n)}$. First, we derive a preconditioner for the cpdi algorithm, which we then extend to the cpdli algorithms by incorporating the linear constraints.

To precondition the linear systems arising when decomposing incomplete tensors using cpdi, we propose the block-Jacobi preconditioner $\mathbf{M}_{\mathrm{inc,stat}}$ with diagonal blocks $\mathbf{M}^{(n)}_{\mathrm{inc,stat}}$. The exact expression for the diagonal block $\mathbf{H}^{(n,n)} = \mathbf{J}^{(n)\mathrm{H}}\mathbf{J}^{(n)}$ would require the inversion of an $I_n R \times I_n R$ matrix, which is often too expensive for large-scale problems. Instead, the expected value of the block $\mathbf{H}^{(n,n)}$ is used. More concretely, consider the $i_n$th slice of order $N - 1$ and assume there are $Q^{(n)}_{i_n}$ known entries in this slice. Instead of taking the exact positions, or distribution, of the known entries into account, we assume that all possible distributions of exactly $Q^{(n)}_{i_n}$ known entries in this slice are equally likely. By taking the expected value over all distributions, only the fraction of known entries $f^{(n)}_{i_n} = Q^{(n)}_{i_n} / \prod_{k \ne n} I_k$ is required to approximate the blocks on the diagonal by

$$\mathbf{M}^{(n)}_{\mathrm{inc,stat}} = \mathrm{E}\left[\mathbf{H}^{(n,n)}\right] = \mathbf{W}^{(n)} \otimes \mathrm{diag}\left(\mathbf{f}^{(n)}\right), \tag{45}$$

as shown in section SM2. This expression reduces to the result from [39] if every slice contains the same number of known entries.

For both cpdli algorithms, the preconditioner is extended to the case with linear constraints on the factor matrices, resulting in the expression

$$\mathbf{M}^{(n)}_{\mathrm{linc,stat}} = \mathrm{E}\left[\mathbf{H}^{(n,n)}\right] = \mathbf{W}^{(n)} \otimes \left(\mathbf{B}^{(n)\mathrm{H}} \mathrm{diag}\left(\mathbf{f}^{(n)}\right) \mathbf{B}^{(n)}\right). \tag{46}$$

Even though this preconditioner only approximates the exact Gramian blocks, which take the exact positions of the known entries into account, the experiments in subsection 5.2 show that the number of CG iterations is effectively reduced.
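Applying the inverse of this block-Jacobi preconditioner amounts to two small linear solves per mode, as the following sketch shows (Python/NumPy, real-valued data assumed, names hypothetical); it uses the identity $(\mathbf{W} \otimes \mathbf{S})\,\mathrm{vec}(\mathbf{Y}) = \mathrm{vec}(\mathbf{S}\mathbf{Y}\mathbf{W}^{\mathrm{T}})$.

```python
import numpy as np

def apply_preconditioner_inv(blocks, factors, bases, frac):
    """blocks: right-hand-side blocks X^(n) (D_n x R); factors: A^(n) = B^(n) C^(n);
    bases: B^(n); frac: vectors f^(n) with the fraction of known entries per slice."""
    N = len(factors)
    out = []
    for n in range(N):
        R = factors[n].shape[1]
        W = np.ones((R, R))
        for k in range(N):
            if k != n:
                W *= factors[k].T @ factors[k]          # W^(n) = *_{k!=n} A^(k)T A^(k)
        S = bases[n].T @ (frac[n][:, None] * bases[n])  # B^(n)T diag(f^(n)) B^(n), cf. (46)
        Y = np.linalg.solve(W, blocks[n].T).T           # right-solve with symmetric W
        out.append(np.linalg.solve(S, Y))               # left-solve with S
    return out
```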
5. Experiments. The performance of the cpdi method and the effectiveness of the proposed preconditioner are shown with two experiments in this section. Section 6 illustrates the benefits of linear constraints and second-order information on a materials science dataset using cpdli. Details about the parameters and implementation used for the different algorithms are reported in Appendix A to allow reproducibility.
5.1. CPD of incomplete tensors. The effectiveness of using second-order information to reduce the computational cost of incomplete tensor decompositions is illustrated for two scenarios: randomly missing entries and structured missing entries. The proposed cpdi algorithm is compared to another Gauss–Newton type algorithm, cpd [38, 39], and to two quasi-Newton algorithms based on nonlinear conjugate gradients (NCG): cpd_minf [38, 39] and cp_wopt [2]. The first three algorithms are implemented in Tensorlab [44]. The Tensor Toolbox [6] implementation is used for cp_wopt. For all algorithms, we compute a low accuracy solution ($E_{\mathrm{cpd}} \approx 10^{-5}$) and a high accuracy solution ($E_{\mathrm{cpd}} \le 10^{-12}$), in which

$$E_{\mathrm{cpd}} = \max_{n=1,\ldots,N} \frac{\left\| \mathbf{A}^{(n)} - \hat{\mathbf{A}}^{(n)} \right\|}{\left\| \mathbf{A}^{(n)} \right\|}, \tag{47}$$

with $\mathbf{A}^{(n)}$ the (known) exact factor matrices and $\hat{\mathbf{A}}^{(n)}$ the estimated factor matrices after resolving the scaling and permutation indeterminacies. All experiments are repeated 25 times using different tensors and different initial guesses.
First, let $\mathcal{T} \in \mathbb{R}^{1000 \times 1000 \times 1000}$ be a rank-10 tensor generated using random factor matrices $\mathbf{A}^{(n)}$ drawn from a uniform distribution $\mathrm{U}(0, 1)$. From this tensor, entries are sampled uniformly at random. Figure 1 shows the computation time when 300 000 entries and 90 000 entries are sampled, corresponding to 0.03% and 0.009% of all entries, respectively, or, alternatively, 10 and 3 entries on average per variable, respectively. Both in time and in number of iterations, the Gauss–Newton based methods cpdi and cpd outperform the NCG based methods. Hence, the higher per-iteration complexity is countered by the reduction in the number of iterations. While cpd outperforms cpdi when many entries are known or when a low accuracy is needed, the number of iterations is higher for cpd compared to cpdi. For high accuracy and few known entries, cpdi benefits from its improved convergence properties. cp_wopt fails to find a high-accuracy solution due to its implementation of the objective function.
[Figure 1: computation time (s) and number of iterations for cpd, cpdi, cpd_minf and cp_wopt, for 90 000 and 300 000 randomly sampled entries at low and high accuracy.]
Fig. 1. The Gauss–Newton type algorithms cpd and cpdi outperform the first-order NCG type algorithms as the higher per-iteration cost is countered by a significantly lower number of required iterations. When many entries are available, the lower complexity of cpd compared to cpdi is an advantage. For higher accuracies or fewer entries, cpdi is competitive because of its faster convergence rate. cp_wopt fails to find a highly accurate solution. The reported timings and number of iterations are averages over 25 experiments.
The second scenario starts from a dense rank-10 tensor $\mathcal{T} \in \mathbb{R}^{100 \times 100 \times 100}$ constructed using random factor matrices with entries drawn from $\mathrm{U}(0, 1)$. A structured missing data pattern is constructed as follows. From $\mathcal{T}$, 95 mode-1 slices are selected uniformly at random, and from each of the selected slices 99% of the entries are discarded at random. The remaining 5 mode-1 slices are dense. While the fraction of known entries is relatively high (5.95%), the pattern prevents cpd from finding a solution. Figure 2 shows that the computational cost is again reduced significantly because of the improved convergence properties of cpdi.
5.2. Preconditioner. cpdi and cpdli have a relatively high per-iteration complexity, while only a few iterations are necessary. Reducing the complexity or the number of iterations therefore has a large impact on the overall computation time. Here, we show the effectiveness of the preconditioners proposed in section 4 in reducing the number of CG iterations and therefore in reducing the per-iteration complexity of the algorithm.

[Figure 2: computation time (s) and number of iterations for cpdi, cpd_minf and cp_wopt in the structured missing data (slices) scenario, at low and high accuracy.]
Fig. 2. When missing entries follow a structured pattern, cpdi needs only few iterations to find a solution. cpd fails to find a solution in a reasonable amount of time for both accuracy levels and is therefore omitted. cp_wopt fails to find a highly accurate solution. The reported timings and number of iterations are averages over 25 experiments.
Consider the following experiment. A random tensor $\mathcal{T} \in \mathbb{R}^{100 \times 100 \times 100}$ is constructed using linearly constrained factor matrices with random basis matrices $\mathbf{B}^{(n)} \in \mathbb{R}^{100 \times 10}$ and coefficient matrices $\mathbf{C}^{(n)} \in \mathbb{R}^{10 \times 5}$. The entries of both $\mathbf{B}^{(n)}$ and $\mathbf{C}^{(n)}$ are drawn from a uniform distribution $\mathrm{U}(0, 1)$, with $n = 1, 2, 3$. To simulate a typical iteration, the Gramian $\mathbf{H}$ and the gradient $\mathbf{g}$ are computed for random coefficients $\hat{\mathbf{C}}^{(n)}$, and the system $\mathbf{H}\mathbf{x} = -\bar{\mathbf{g}}$ is solved using PCG until convergence, i.e., until the relative residual is smaller than $10^{-6}$. The number of CG iterations is monitored when using no preconditioner, the incompleteness agnostic preconditioner (44) from [39] and the proposed preconditioner (46). As in subsection 5.1, two scenarios are considered: 0.1% randomly sampled entries, and structured missing entries constructed by removing 99% of the entries of 95 randomly selected mode-1 slices. Table 3 shows the results averaged over 25 tensors and 20 random coefficient matrices $\hat{\mathbf{C}}^{(n)}$. While both preconditioners have a similar performance for randomly missing entries, the proposed preconditioner outperforms the scaled preconditioner for structured missing entries. Both preconditioners reduce the number of CG iterations significantly compared to the non-preconditioned case.
Table 3
The proposed PC and the incompleteness agnostic PC from [39] reduce the number of CG iterations significantly compared to the non-preconditioned case. When missing entries are scattered randomly across the tensor, both preconditioners have a similar performance, as expected. When the missing entries are structured, the proposed PC reduces the number of iterations by 14 on average. The average (standard deviation) is computed over 25 tensors for 20 sets of variables each.

| Scenario            | No PC        | Agnostic PC (44) | Proposed PC (46) |
|---------------------|--------------|------------------|------------------|
| Uniformly at random | 115.8 (19.7) | 45.5 (9.3)       | 45.1 (9.2)       |
| Structured missing  | 124.6 (18.4) | 53.3 (10.0)      | 39.3 (7.3)       |
6. Materials science application. Designing new materials often requires extensive numerical simulations depending on material parameters. For example, the melting temperature of an alloy depends, among others, on the concentrations of its constituent materials [43]. As a material property can depend on a large number of variables, e.g., concentrations, temperature and pressure, the corresponding tensors with measured or computed values often have a high order. The curse of dimensionality therefore prohibits measuring all possible combinations. In [43], the CPD of an incomplete tensor allowed modeling a ninth-order tensor using only $10^{-11}$% of the data. Here, we show that using incomplete tensors and a CPD with linearly constrained factors allows the Gibbs free energy to be modeled using very few data points, while improving the accuracy w.r.t. the unconstrained problem.

Concretely, the Gibbs free energy for an alloy with silver, copper, nickel and tin as constituent materials is modeled in its liquid phase, at a temperature $T = 1000$ K. All parameters other than the concentrations (expressed in mole fraction) of silver ($c_1$), copper ($c_2$) and nickel ($c_3$) are kept constant. The concentration of tin ($c_4$) is a dependent variable, as all concentrations must sum to 1, hence $c_4 = 1 - c_1 - c_2 - c_3$. Following the Calphad model [24, 26], the Gibbs free energy can be modeled as

$$G(c_1, c_2, c_3) = G_{\mathrm{p}}(c_1, c_2, c_3) + G_{\mathrm{log}}(c_1, c_2, c_3), \tag{48}$$
$$G_{\mathrm{log}}(c_1, c_2, c_3) = \sum_{n=1}^{4} U T c_n \log_{10} c_n, \tag{49}$$
in which $U$ is the universal gas constant. Thermo-Calc software [1] with the COST531-v3 database [25] computes $G$ in chosen concentrations $c_1$, $c_2$ and $c_3$. By discretizing the concentrations, an incomplete third-order tensor $\mathcal{G}$ is obtained with the concentrations as its modes. As the logarithmic terms in $G_{\mathrm{log}}$ depend solely on known values, we subtract these terms from $\mathcal{G}$ and model only $G_{\mathrm{p}}$ as a multivariate polynomial. Note that $\mathcal{G}$ is necessarily incomplete: if sampled densely, i.e., if all feasible entries are sampled on the chosen discretized grid of concentrations, all entries lie inside a pyramid (Figure 3), as $c_1 + c_2 + c_3 > 1$ is physically infeasible. Consequently, variables at the vertices of the pyramid are estimated using only a single data point.
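The preprocessing step can be sketched as follows (Python/NumPy; variable names are hypothetical and the concentrations are assumed strictly positive): the logarithmic part (49) is evaluated at the sampled concentrations and subtracted, so that only $G_{\mathrm{p}}$ has to be modeled by the linearly constrained CPD.

```python
import numpy as np

U = 8.314462618          # universal gas constant [J/(mol K)]
T = 1000.0               # temperature [K]

def subtract_glog(c1, c2, c3, G):
    """c1, c2, c3: sampled mole fractions; G: Gibbs free energies at those points."""
    c4 = 1.0 - c1 - c2 - c3                                   # dependent tin fraction
    conc = np.stack([c1, c2, c3, c4])
    G_log = U * T * np.sum(conc * np.log10(conc), axis=0)     # eq. (49)
    return G - G_log                                          # polynomial part G_p
```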
[Figure 3: the sampled concentrations (c1, c2, c3) fill a pyramid-shaped region.]
Fig. 3. A 'dense' sampling scheme for $\mathcal{G}$ results in a pyramidal structure as $c_1 + c_2 + c_3 \le 1$.
Five algorithms are compared in this section:
• the data-dependent (cpdli_dd) and data-independent (cpdli_di) algorithms proposed in this paper,
• multivariate regression using alternating least squares (mvr_als) [7],
• structured data fusion (sdf) using struct_matvec to implement the constraints [39] and the cpdi factorization which implements cpdi as proposed here [44], and
• cpdi, i.e., without linear constraints.
The precise parameters are outlined in Appendix A. All tested algorithms behave differently and achieve a different accuracy, defined as the median relative error over all feasible entries of the discretized Gibbs free energy tensor:

$$E = \underset{i,j,k}{\mathrm{median}} \; \frac{\left| \mathcal{G}_{\mathrm{p}}(i, j, k) - \sum_{r=1}^{R} a^{(1)}_{ir} a^{(2)}_{jr} a^{(3)}_{kr} \right|}{\left| \mathcal{G}_{\mathrm{p}}(i, j, k) \right|}. \tag{50}$$
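A sketch of this accuracy measure (Python/NumPy; names hypothetical): the rank-$R$ model is evaluated at the feasible entries and the median point-wise relative error is returned.

```python
import numpy as np

def median_rel_error(vals, idx, factors):
    """vals: feasible entries of G_p; idx: their index vectors (i, j, k);
    factors: estimated factor matrices a^(n) of the rank-R model."""
    approx = np.ones((vals.size, factors[0].shape[1]))
    for A, i in zip(factors, idx):
        approx *= A[i, :]
    approx = approx.sum(axis=1)
    return np.median(np.abs(vals - approx) / np.abs(vals))    # eq. (50)
```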
As the accuracy of the final solutions computed by the different algorithms can differ, accuracy thresholds are used to allow for a fair comparison of timing results. Five levels with accuracy thresholds are set: $10^{-3}$, $5 \cdot 10^{-4}$, $10^{-4}$, $5 \cdot 10^{-5}$ and $10^{-5}$. The algorithm proceeds to the next level when the median relative error $E$ is below the current level's threshold, and the (cumulative) time for each level is recorded. A maximum computation time of three minutes is set for each level. This is repeated for 50 initial guesses for $\mathbf{C}^{(n)}$, $n = 1, 2, 3$.
In the first experiment, the entries of $\mathcal{G}_{\mathrm{p}} \in \mathbb{R}^{99 \times 99 \times 99}$ are sampled according to a dense, pyramidal grid with $c_n = 0.005, 0.015, 0.025, \ldots, 0.985$ for $n = 1, 2, 3$, which amounts to 166 338 known entries (17% of all entries). A CPD with $R = 5$ terms is computed from $\mathcal{G}_{\mathrm{p}}$ with polynomial constraints on the factor matrices, i.e., each factor vector is a polynomial of maximal degree $d = 4$, hence $D_n = 5$, $n = 1, 2, 3$. A monomial basis is used. Figure 4 shows that both cpdli algorithms find a high accuracy approximation quickly. The computation time for the data-independent algorithm cpdli_di is almost a factor ten lower, as the number of known entries is high compared to the number of variables. The accuracy of the solution achieved by the other algorithms is lower, while the time needed to find a comparable accuracy is higher. Table 4 shows that the cpdli algorithms result in a more accurate solution, proving the usefulness of second-order information. Compared to an unconstrained model, computed by cpdi, the linear constraints improve the accuracy by more than an order of magnitude. Note that we did not report values for the cpd method [39], as it failed to reach the first accuracy level, illustrating again that its Gramian approximation does not suffice for structured missing entries.
Table 4
The results computed by the cpdli algorithms are almost an order of magnitude more accurate than those computed by mvr_als and sdf in terms of median error and maximal error. Using linear constraints improves the achievable accuracy, as the cpdi algorithm consistently returns the least accurate solution. The point-wise relative error on the training set is reported and is the median over 50 experiments.

| Algorithm | Median rel. error E |
|-----------|---------------------|
| cpdli_di  | 6.91 · 10⁻⁶         |
| cpdli_dd  | 7.03 · 10⁻⁶         |
| mvr_als   | 5.29 · 10⁻⁵         |
| sdf       | 3.49 · 10⁻⁵         |
| cpdi      | 1.28 · 10⁻⁴         |

[The original table also shows the distribution of the point-wise relative errors on a logarithmic axis from 10⁻⁶ to 10⁻²; only the median values are reproduced here.]

In the second experiment, we sample 200 entries at random from $\mathcal{G}_{\mathrm{p}}$ and test cpdli_dd and the mvr_als algorithm using the same parameters as in the first experiment. All errors are computed for all feasible entries, i.e., $E$ can be seen as the