EXPLOITING EFFICIENT REPRESENTATIONS IN LARGE-SCALE
TENSOR DECOMPOSITIONS∗

NICO VERVLIET†, OTTO DEBALS†, AND LIEVEN DE LATHAUWER†
Abstract. Decomposing tensors into simple terms is often an essential step to discover and understand underlying processes or to compress data. However, storing the tensor and computing its decomposition is challenging in a large-scale setting. In many cases, though, a tensor is structured, i.e., it can be represented using few parameters: a sparse tensor is determined by the positions and values of its nonzeros, a polyadic decomposition by its factor matrices, a tensor train by its core tensors, a Hankel tensor by its generating vector, etc. We show that the complexity of tensor decomposition algorithms is reduced significantly in terms of time and memory if these efficient representations are exploited directly, thereby avoiding the explicit construction of multiway arrays. Key to this performance increase is rewriting the least squares cost function, which exposes core operations such as norms and inner products. Moreover, as the optimization variables do not change, constraints and coupling can be handled trivially. To illustrate this, large-scale nonnegative tensor factorization is performed using Tucker and tensor train compression. We show how vector and matrix data can be analyzed using tensorization while keeping a vector or matrix complexity through the new concept of implicit tensorization, as illustrated for Hankelization and Löwnerization. The concepts and numerical properties are extensively investigated with experiments.
Key words. tensor decompositions, canonical polyadic decomposition, parafac, Tucker decomposition, block term decomposition, Hankel, Löwner, large-scale, nonnegative factorization, tensorization

AMS subject classifications. 15A69
1. Introduction. In signal processing and data analysis, tensors are often given as multiway arrays of numerical values. Tensor decompositions are then used to discover patterns, separate signals, model behavior, cluster similar phenomena, detect anomalies, predict missing data, etc. More concrete examples include intrusion detection in computer networks [61], analysis of food samples [6], crop classification using hyperspectral imaging [88], direction of arrival estimation using a grid of antennas [68] and modeling melting temperatures of alloys [86]. In these contexts, the canonical polyadic decomposition (CPD), the low multilinear rank approximation (LMLRA) or Tucker decomposition, and the block term decomposition (BTD) are common. To improve interpretability, constraints such as nonnegativity and smoothness can be imposed. When multiple datasets describing (partially) the same phenomenon are available, coupling or data fusion allows more meaningful patterns to be extracted. More examples can be found in several overview papers; see, e.g., [12, 48, 69].
Given as dense arrays of numerical values, the cost of storing and processing tensors tends to increase quickly, especially for high orders: for an Nth-order tensor of size I × I × · · · × I, the number of entries is I^N. To deal with these large-scale tensors, various techniques have emerged to handle their decompositions, e.g., by deliberately sampling entries [84, 86], through randomization [83], or through compression and/or parallelism, e.g., [11, 39, 51, 61, 70, 73]. Here we consider tensors that are structured and can be represented efficiently using fewer parameters than the number of entries in the dense array. Moreover, we assume that this structure is exploitable to reduce the computational and memory complexity from O(tensor entries) to O(parameters). We identify three important types of efficient representation that allow such exploitation: compact, exact representation; compression-based representation; and (implicit) tensorization.

∗Submitted to the editors on October 16, 2017.
Funding: Nico Vervliet is supported by an Aspirant Grant from the Research Foundation — Flanders (FWO). Otto Debals is supported by a Ph.D. grant of the Agency for Innovation by Science and Technology (IWT). Research furthermore supported by: (1) Flemish Government: FWO: projects: G.0830.14N, G.0881.14N; (2) EU: The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Advanced Grant: BIOTENSORS (no. 339804). This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information; (3) KU Leuven Internal Funds C16/15/059.
†KU Leuven, Dept. of Electrical Engineering ESAT/STADIUS, Kasteelpark Arenberg 10, bus 2446, B-3001 Leuven, Belgium; and Group Science, Engineering and Technology, KU Leuven - Kulak, E. Sabbelaan 53, 8500 Kortrijk, Belgium (Nico.Vervliet@kuleuven.be, Otto.Debals@gmail.com,
For the first type, the tensor has an exact and known structure that can be represented compactly. A sparse tensor, for instance, is represented by the indices and values of its nonzeros, or more efficiently using compressed formats [4, 71]. For sparse tensors, the compact representation has been exploited extensively; see, e.g., [4, 11, 37, 38, 39, 40, 41, 49, 61, 71, 72, 73]. A system may also be modeled mathematically as a polyadic decomposition, with a possibly non-minimal number of rank-1 terms. In this case, the factor matrices form the compact representation [52, 64].
Second, in the case of compression-based representations, the tensor is first approximated using a decomposition that is easier or cheaper to compute, and this approximation is then used to speed up subsequent computations. Common examples are a truncated MLSVD [19], a hierarchical Tucker approximation [27, 31] or a TT approximation [59]. Tucker compression is used extensively as a preprocessing step to speed up the computation of an unconstrained CPD, as it suffices to decompose the core tensor [7], which follows from the CANDELINC model [10, 46]. To design effective updating algorithms, a previously computed compact representation of the old data combined with a newly arrived slice is used in [79]. In scientific computing, recompression or rounding is often used to lower the multilinear rank of a PD or a Tucker decomposition [44, 45, 58, 66, 67], or to lower the TT rank of a tensor given as a PD [59] or a TT approximation [33, 55]. Other examples include the approximation of a matrix or tensor by a greedy orthogonal rank-1 decomposition [47], a hierarchical Kronecker tensor product decomposition [30] or a two-level rank-(r_1, . . . , r_d) decomposition [43].
The third type is implicit tensorization. Tensorization involves mapping vectors or matrices to higher-order tensors [12, 21]. By decomposing the obtained tensors, mild uniqueness conditions of tensor decompositions can be leveraged. However, for some types of tensorization, the number of data points increases drastically. For example, third-order Hankelization of a vector of length M results in a tensor with O(M³) entries [68], while Löwnerization [23] of K observations of mixed rational functions, sampled in M points, may result in a tensor of dimensions M/2 × M/2 × K. In both examples, the signal length M is limited by the respective cubic and quadratic dependence of the number of entries on M. To avoid this data explosion, we propose the concept of implicit tensorization: by operating on the underlying data directly, no explicit construction of the tensor is required. Implicit tensorization alleviates computation and memory cost if the data underlying the tensor has few parameters and the structure of this representation can be exploited easily, e.g., through fast Fourier transforms (FFT); see section 3. Examples are Hankelization, Löwnerization, Toeplitzization, segmentation with overlap and outer product structures [5, 21]. In this way, longer signals can be handled, as illustrated in subsection 4.3.
1.1. Contributions. We explain how efficient representations originating from each of the three types can be exploited in optimization-based algorithms for the computation of a CPD, a decomposition into multilinear rank-(L_r, L_r, 1) terms (LL1), an LMLRA or a general BTD. By rewriting the least squares loss function, core operations are exposed, each of which can be implemented such that the structure underlying the efficient representation is exploited. While a number of papers propose similar ideas for the computation of a CPD or LMLRA in the specific cases of sparse tensors [4, 11, 37, 38, 39, 40, 41, 49, 61, 71, 72, 73] or CPD and LMLRA approximations [89, 90, 91], we show that these ideas fit in a broader framework for structured tensor decompositions, including tensorization methods. Moreover, we show that, in contrast to previous results, both first- and second-order optimization algorithms can be used and that constraints and joint factorizations are handled trivially. We also pay attention to often neglected numerical issues. Finally, we show that our approach enables enormous speedups for large-scale problems even when constraints are imposed or when multiple datasets are jointly factorized, in contrast to the traditional compression approach using the CANDELINC model [7], which cannot be used in these cases. We demonstrate this experimentally for large-scale nonnegative tensor factorization using Tucker and TT compression and for implicit Hankelization of signals with up to 500 000 samples. Matlab implementations of the framework are available in Tensorlab [87].
1.2. Outline. After discussing the notation in the remainder of this section, section 2 explains how efficient representations of structured tensors can be exploited by rewriting the Frobenius norm objective function and gradient for the CPD, LL1, LMLRA and BTD. Implementations of the four core operations — norm, inner product, matricized tensor times Khatri–Rao product and matricized tensor times Kronecker product — are given in section 3 for structured tensors given as a CPD, LMLRA, TT, Hankel tensor or Löwner tensor. Finally, the experiments in section 4 discuss the numerical properties of the presented framework and illustrate the performance for compression-based nonnegative CPD and blind separation of exponential polynomials through implicit Hankelization.
1.3. Notation. Scalars, vectors, matrices and tensors are denoted by lower case, e.g., a, bold lower case, e.g., a, bold upper case, e.g., A, and calligraphic, e.g., T, letters, respectively. K denotes either R or C. Sets are indexed by superscripts within parentheses, e.g., A^(n), n = 1, . . . , N. A mode-n vector is the generalization of a column (mode-1) and row (mode-2) vector and is defined by fixing all but the nth index of an Nth-order tensor. The mode-n unfolding of a tensor T is denoted by T_(n) and has the mode-n vectors as its columns. The mode-(m, n) unfolding is defined similarly and is denoted by T_(m,n). The vectorization operator vec(T) stacks all mode-1 vectors into a column vector.

A number of products are needed. The mode-n tensor-matrix product is denoted by T ·_n A. The Kronecker and Hadamard, or element-wise, products are denoted by ⊗ and ∗, respectively. The column-wise (row-wise) Khatri–Rao product between two matrices, denoted by ⊙ (⊙^T), is defined as the column-wise (row-wise) Kronecker product. To simplify expressions, the following shorthand notations are used:

    ⊗_n ≡ ⊗_{n=N}^{1},    ⊗_{k≠n} ≡ ⊗_{k=N, k≠n}^{1},

where N is the order of the tensor and indices are traversed in reverse order. Similar definitions are used for ⊙, ⊙^T and ∗. To reduce the number of parentheses, we assume the matrix product takes precedence over ⊗, ⊙, ⊙^T and ∗, e.g., AB ⊙ CD ≡ (AB) ⊙ (CD). The complex conjugate, transpose and conjugated transpose are denoted by ·̄, ·^T and ·^H, respectively. The column-wise concatenation of vectors a and b is written as x = [a; b] and is shorthand for x = [a^T b^T]^T. The inner product between A and B is denoted by ⟨A, B⟩ = vec(B)^H vec(A). The Frobenius norm is denoted by ‖·‖. Re(·) returns the real part of a complex number. I_I is the I × I identity matrix and 1_I is a length-I column vector of ones. Finally, V^f is the row-wise 'flipped' version of V ∈ K^{I×J}, i.e., V^f(i, :) = V(I − i + 1, :) for i = 1, . . . , I. See [12] for more details.
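These product definitions are easy to check numerically. The following is a small illustrative sketch (not from the paper) in NumPy; the helper construction for the column-wise Khatri–Rao product is our own, since NumPy only provides the Kronecker product directly.

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)        # 3 x 2
B = np.arange(4.0).reshape(2, 2)        # 2 x 2

# Kronecker product: a (3*2) x (2*2) matrix.
K = np.kron(A, B)

# Column-wise Khatri-Rao product: the r-th column is the Kronecker
# product of the r-th columns, giving a (3*2) x 2 matrix.
KR = np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(2)])

# Hadamard (element-wise) product: requires equal sizes.
H = B * B

print(K.shape, KR.shape, H.shape)       # (6, 4) (6, 2) (2, 2)
```

Note how the Khatri–Rao product keeps the number of columns fixed, which is why it appears whenever CPD factor matrices are combined column by column.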
2. Exploiting efficient representations. Many tensor decompositions used in signal processing and data analysis can be computed by minimizing the least squares (LS) error between the approximation and the given tensor T:

    min_z f(z)   with   f(z) = (1/2) ‖M(z) − T‖²,    (1)

in which M(z) is a tensor decomposition with variables z. Various algorithms have been proposed to solve (1), including alternating least squares (ALS) [8, 9, 20, 32, 50], nonlinear conjugate gradient (NCG) [1], quasi-Newton (qN) [76] and Gauss–Newton (GN) [76, 77, 78] algorithms. We first review the notation for common tensor decompositions in subsection 2.1. In subsection 2.2 we give an overview of LS optimization theory for algorithms using gradients (first-order information) and Hessian approximations (second-order information) and show that many approximations to the Hessian do not depend on the data T. Hence, second-order algorithms can be used without additional changes in implementation compared to first-order algorithms. Finally, the core operations required to reduce the computational complexity are introduced by rewriting f(z) in subsection 2.3.
2.1. Overview of tensor decompositions. The techniques derived in this paper focus on four main decompositions, which can be grouped into two families from an optimization point of view: CPD-based and BTD-based algorithms. Note that we use different families compared to the usual division between decompositions computed using numerical linear algebra techniques such as singular value decompositions (SVD), e.g., MLSVD, TT and hierarchical Tucker, and decompositions usually computed through optimization (CPD, LL1, BTD). Here, we focus mainly on notation; for pointers to uniqueness results and applications, we refer to the overview papers [12, 28, 29, 48, 69].

CPD-based algorithms. The (canonical) polyadic decomposition (CPD) writes an Nth-order tensor as a (minimal) sum of R rank-1 terms, each of which is an outer product, denoted by ⊗, of N nonzero factor vectors a_r^(n):

    M_CPD(z) = Σ_{r=1}^{R} a_r^(1) ⊗ · · · ⊗ a_r^(N) =: ⟦A^(1), . . . , A^(N)⟧.

Each factor matrix A^(n) contains a_r^(n), r = 1, . . . , R, as its columns, and the variables z are the vectorized factor matrices A^(n), i.e., z = [vec(A^(1)); . . . ; vec(A^(N))]. Depending on the field, the CPD is also known as PARAFAC, CANDECOMP or tensor/separation rank decomposition.

The decomposition into multilinear rank-(L_r, L_r, 1) terms (LL1) is accommodated within this family as it is often computed as a constrained CPD [8, 20, 76]:

    M_LL1(z) = Σ_{r=1}^{R} (A_r B_r^T) ⊗ c_r = ⟦A, B, CP⟧,

in which A_r ∈ K^{I1×Lr}, B_r ∈ K^{I2×Lr} for r = 1, . . . , R, A = [A_1 · · · A_R], B = [B_1 · · · B_R], C = [c_1, . . . , c_R] and P is a binary matrix replicating column c_r L_r times, i.e., P = blockdiag(1_{L1}^T, . . . , 1_{LR}^T). Similar to the CPD, z = [vec(A); vec(B); vec(C)]. The LL1 decomposition is similar to PARALIND [8]. CONFAC generalizes this decomposition by allowing an arbitrary full row rank matrix P in every mode [15]. In [76], the decomposition into (rank-L_r ⊗ rank-1) terms is discussed as one generalization of M_LL1 to higher-order tensors.

BTD-based algorithms. The low multilinear rank approximation (LMLRA) or Tucker decomposition writes a tensor as a multilinear transformation of a core tensor S ∈ K^{J1×···×JN} by the factor matrices U^(n) ∈ K^{In×Jn}, n = 1, . . . , N:

    M_LMLRA(z) = S ·_1 U^(1) · · · ·_N U^(N).

An LMLRA can be computed using SVDs of tensor unfoldings, i.e., as a (truncated) multilinear singular value decomposition (MLSVD) [18, 82], by higher-order orthogonal iteration [19] or through optimization [35, 36, 76, 77].

The more general block term decomposition writes a tensor as a sum of R multilinear rank-(J_{r1}, J_{r2}, . . . , J_{rN}) terms [16]:

    M_BTD(z) = Σ_{r=1}^{R} S^(r) ·_1 U^(r,1) · · · ·_N U^(r,N).

From an optimization point of view, the difference between M_LMLRA and M_BTD is the additional summation over r = 1, . . . , R. For simplicity of notation, we only use M_LMLRA in the derivations in this paper.
2.2. Optimization for least squares problems. In gradient-based optimization, an initial guess z_0 for the variables is iteratively refined as

    z_k = z_{k−1} + α_k p_k,

with the step direction p_k found by solving

    H_k p_k = −g_k,    (2)

in which g_k and H_k are the gradient and the Hessian (or an approximation) of f(z) w.r.t. z in (1), respectively, assuming M(z) is analytic in z [75]. The step size α_k ensures sufficient decrease in the objective function value, e.g., through line search. Alternatively, z can be updated using trust-region approaches or using plane search [53, 74, 76]. The linear system (2) can be solved using direct (pseudo)inversion or using (preconditioned) conjugate gradients [53].

For the LS objective function (1), various Hessian approximations are used depending on the algorithm, e.g.,
• H_k = I for gradient descent,
• H_k = (I − U_{k−1}^H) H_{k−1} (I − U_{k−1}) + V_{k−1}, with U_{k−1} and V_{k−1} rank-1 matrices constructed using previous gradients and updates of z, for BFGS,
• H_k = J_k^H J_k with J_k = ∂vec(M(z) − T)/∂z |_{z=z_k} = ∂vec(M(z))/∂z |_{z=z_k} for Gauss–Newton,
• H_k = J_k^H J_k + λ_k I for Levenberg–Marquardt.

None of these approximations of the Hessian depends on the tensor T. Therefore, only the objective function evaluation and the gradient computation require adaptation to exploit structure in T. Note that this also holds for ALS algorithms.
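The update (2) with the Levenberg–Marquardt approximation from the list above can be sketched for the simplest bilinear model, a real rank-1 matrix M(z) = a bᵀ with z = [a; b]. This is an illustrative sketch, not code from the paper; note that the damping λ also regularizes the scaling indeterminacy (a, b) → (ca, b/c), which makes JᵀJ singular.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
T = np.outer(rng.standard_normal(m), rng.standard_normal(n))  # rank-1 data

a, b = rng.standard_normal(m), rng.standard_normal(n)         # initial guess z0
lam = 1e-2                                                    # LM damping
for _ in range(200):
    F = np.outer(a, b) - T                                    # residual M(z) - T
    # Jacobian of vec(a b^T) w.r.t. z = [a; b] (column-major vectorization).
    J = np.hstack([np.kron(b.reshape(-1, 1), np.eye(m)),
                   np.kron(np.eye(n), a.reshape(-1, 1))])
    g = J.T @ F.ravel(order='F')                              # gradient g_k
    p = np.linalg.solve(J.T @ J + lam * np.eye(m + n), -g)    # step, cf. (2)
    a_new, b_new = a + p[:m], b + p[m:]
    if np.linalg.norm(np.outer(a_new, b_new) - T) < np.linalg.norm(F):
        a, b, lam = a_new, b_new, lam / 3    # accept step, reduce damping
    else:
        lam *= 3                             # reject step, increase damping
print(np.linalg.norm(np.outer(a, b) - T))    # final residual norm
```

Because the data T enters only through the residual F (and hence g), swapping in any of the listed Hessian approximations leaves the data-dependent part of the iteration unchanged, as claimed above.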
2.3. Exploiting efficient representations. Due to the explicit construction of the residual tensor F = M(z) − T in (1) and the computation of the gradient, the per-iteration complexity is governed by the number of entries in the tensor, i.e., O(Π_n I_n). To reduce the complexity, we assume T is a function of a limited number of parameters. The parameters can, for instance, be the set of positions and values of nonzero entries in a sparse tensor, the factor matrices of a CPD or the underlying signal in a Hankelized tensor. To exploit the structure in T and to avoid the creation of the residual tensor, we rewrite the objective function and gradients, which reveals the core operations. More concretely, we have

    f(z) = (1/2) ‖M(z)‖² − Re(⟨M(z), T⟩) + (1/2) ‖T‖²,    (3)

which requires the efficient computation of the norms of the model and the tensor, and the inner product between the model and the tensor. Similarly, the gradient, which is given by ḡ = J^H vec(F) with J = ∂vec(F)/∂z the Jacobian matrix, is rewritten by grouping the model and data dependent terms as

    ḡ = J^H vec(M(z)) − J^H vec(T) = ḡ_M − ḡ_T.    (4)

(In the derivations, we work with g rather than ḡ.) The core operations are the mtkrprod for the CPD-based algorithms and the mtkronprod for the BTD-based algorithms, as shown below. Extensions to constraints and coupling conclude this section.
CPD-based algorithms. Let us partition g according to the variables z, i.e., in the case M(z) is a CPD, we have

    g = [vec(G^(1)); . . . ; vec(G^(N))]   with   vec(G^(n)) = ∂f/∂vec(A^(n)),   n = 1, . . . , N.

By grouping the model and data dependent terms as in (4), the gradients become

    G_M^(n) = Ā^(n) (∗_{k≠n} A^(k)H A^(k)),
    G_T^(n) = T̄_(n) (⊙_{k≠n} A^(k)).    (5)

It is clear that G_M^(n) can be computed efficiently after forming the inner products A^(k)H A^(k), k = 1, . . . , N. We therefore focus on the operation in (5), which involves the common matricized tensor times Khatri–Rao product (mtkrprod).
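For real data (so that the conjugates can be dropped), the two terms in (5) can be checked against the gradient obtained from an explicitly formed residual. The following NumPy sketch, written under those assumptions, is ours rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
A, B, C = (rng.standard_normal(s) for s in [(I, R), (J, R), (K, R)])
T = rng.standard_normal((I, J, K))

def kr(X, Y):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(X[:, r], Y[:, r]) for r in range(X.shape[1])])

# Model-dependent term: only small R x R Gramians are needed.
GM = A @ ((B.T @ B) * (C.T @ C))
# Data-dependent term: one mtkrprod with the mode-1 unfolding of T.
T1 = T.reshape(I, J * K, order='F')
GT = T1 @ kr(C, B)

# Reference: gradient of 0.5*||[[A,B,C]] - T||^2 w.r.t. A, computed
# from the explicitly formed residual.
F1 = A @ kr(C, B).T - T1
assert np.allclose(GM - GT, F1 @ kr(C, B))
```

The point of the split is visible in the shapes: GM never touches the I × J × K array, and GT is exactly one mtkrprod, the operation the rest of the paper implements for each efficient representation.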
In the case M(z) is an LL1 decomposition, A^(3) = CP is used to form the inner products and to compute the mtkrprod. From the chain rule it follows that for n = 3, G^(n) is given by

    G_M^(3) = Ā^(3) (∗_{k≠3} A^(k)H A^(k)) P^T,
    G_T^(3) = T̄_(3) (⊙_{k≠3} A^(k)) P^T.

BTD-based algorithms. Similarly, in the case M(z) is an LMLRA, the gradient is partitioned as

    g = [vec(G^(0)); vec(G^(1)); . . . ; vec(G^(N))]

with

    vec(G^(0)) = ∂f/∂vec(S),   vec(G^(n)) = ∂f/∂vec(U^(n)),   n = 1, . . . , N.

More specifically, the gradient expressions for the core tensor are given by

    G_M^(0) = S̄ ·_1 (U^(1)T Ū^(1)) · · · ·_N (U^(N)T Ū^(N)),
    G_T^(0) = T̄ ·_1 U^(1)T · · · ·_N U^(N)T,    (6)

and the expressions for the factor matrices by

    G_M^(n) = Ū^(n) S̄_(n) (⊗_{k≠n} U^(k)H U^(k)) S_(n)^T,
    G_T^(n) = T̄_(n) (⊗_{k≠n} U^(k)) S_(n)^T    (7)

for n = 1, . . . , N. In the case of a more general BTD with R terms, an additional summation is introduced; see, e.g., [77]. As (6) can be written as vec(G_T^(0))^T = vec(T)^H (⊗_n U^(n)), both (6) and (7) require a vectorized or matricized tensor times Kronecker product (mtkronprod). Note that

    vec(G_T^(0))^T = vec(U^(1)T T̄_(1) (⊗_{k≠1} U^(k))),

hence the mtkronprod from (7) can be reused, although this may lead to a higher complexity.
Core operations, constraints and coupling. In order to reduce the complexity from O(tensor entries) to O(parameters), structure-exploiting implementations of the Frobenius norm and of inner products with CPD and BTD models are required to evaluate the objective function (3), as well as implementations of the mtkrprod and the mtkronprod for CPD and BTD gradients, respectively. Examples of such implementations are discussed in section 3. In contrast to approaches taken in, e.g., [7, 13], the original factor matrices and core tensors remain the optimization variables. Therefore, algorithms for coupling datasets, e.g., [2, 77], only require new implementations of the core operations. Similarly, parametric constraints such as nonnegativity, orthogonality or Vandermonde structure can be handled using the chain rule as in [77], as can other types of constraints involving projection onto a feasible set; see, e.g., [34, 42]. Concrete examples are given in the experiments in section 4.
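The rewritten objective (3) can be evaluated from the core operations alone. A small NumPy sketch for a real-valued CPD model (our illustration, with the dense tensor formed only to verify the result):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, R = 4, 5, 6, 2
A, B, C = (rng.standard_normal(s) for s in [(I, R), (J, R), (K, R)])
T = rng.standard_normal((I, J, K))
T1 = T.reshape(I, J * K, order='F')          # mode-1 unfolding

def kr(X, Y):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(X[:, r], Y[:, r]) for r in range(X.shape[1])])

# f(z) = 0.5||M||^2 - <M,T> + 0.5||T||^2: the model norm comes from
# R x R Gramians, the inner product from one mtkrprod, and ||T||^2 is
# a constant that can be precomputed once.
normM2 = np.sum((A.T @ A) * (B.T @ B) * (C.T @ C))
inner = np.sum(A * (T1 @ kr(C, B)))
f_fast = 0.5 * normM2 - inner + 0.5 * np.linalg.norm(T) ** 2

# Reference: objective from the explicitly formed residual.
f_ref = 0.5 * np.linalg.norm(A @ kr(C, B).T - T1) ** 2
assert np.isclose(f_fast, f_ref)
```

When T itself is stored in an efficient representation, the same three terms are computed from that representation instead of from an unfolding, which is exactly what section 3 provides.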
3. Operations on efficient representations. In the previous section, the computation of the norm, inner product, mtkrprod and mtkronprod has been identified as the four key operations that allow tensor decomposition algorithms to work on structured tensors. In this section, we show for a number of efficiently represented tensors that the explicit construction of the full tensor is not required and that the complexity is governed by the number of parameters in the efficient representation, rather than the number of entries in the tensor. First, the polyadic and Tucker formats are discussed, as expressions involving one of these formats occur in all of the discussed decompositions. Then, efficient implementations for the TT format and two implicit tensorizations — Hankelization and Löwnerization — are derived. These and other implementations are available in Tensorlab [87].

Unless stated otherwise, only third-order tensors T ∈ K^{I×I×I} are considered to simplify notations and complexity expressions for the two computational families: the rank-R CPD M_CPD = ⟦A, B, C⟧ and the multilinear rank-(R, R, R) LMLRA M_LMLRA = S ·_1 U ·_2 V ·_3 W. The extension to a BTD introduces additional summations and is omitted here; see [77]. For the mtkrprod and mtkronprod operations, only the case n = 1 is shown, unless the expressions do not generalize trivially for n = 2, . . . , N. Table 1 gives a summary of the computational per-iteration complexity when computing a rank-R CPD and clearly illustrates the independence of the total number of entries I^N.
3.1. Polyadic format. In the polyadic format, a third-order tensor T is given by a set of factor matrices X, Y and Z with F columns, i.e., T = ⟦X, Y, Z⟧. The number of variables is therefore F Σ_n I_n instead of Π_n I_n. Efficient implementations in the case T admits a PD, with F not necessarily equal to R, have been presented in [4] (see Kruskal tensor) and are here extended to complex data:

    ‖T‖² = 1^T (X^H X ∗ Y^H Y ∗ Z^H Z) 1,
    ⟨M_CPD, T⟩ = 1^T (X^H A ∗ Y^H B ∗ Z^H C) 1,
    T̄_(1) (C ⊙ B) = X̄ (Y^H B ∗ Z^H C).    (8)

The complexity of these operations is governed by the construction of the inner products, which requires O(NIF²) and O(NIFR) flop. Each mtkrprod in (8) required for G_T^(n) can then be computed using O(IFR) flop.
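The identities in (8) can be verified for real data (dropping the conjugates) with a short NumPy sketch of our own; the dense tensor is built only as a reference:

```python
import numpy as np

rng = np.random.default_rng(3)
I, F_, R = 5, 4, 3                  # tensor size I^3, F terms in T, CPD rank R
X, Y, Z = (rng.standard_normal((I, F_)) for _ in range(3))
A, B, C = (rng.standard_normal((I, R)) for _ in range(3))

def kr(U, V):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(U[:, r], V[:, r]) for r in range(U.shape[1])])

# Core operations on T = [[X, Y, Z]] without forming the I x I x I array:
norm2 = np.sum((X.T @ X) * (Y.T @ Y) * (Z.T @ Z))    # ||T||^2
inner = np.sum((X.T @ A) * (Y.T @ B) * (Z.T @ C))    # <M_CPD, T>
mtkr = X @ ((Y.T @ B) * (Z.T @ C))                   # T_(1) (C kr B)

# Reference: dense mode-1 unfolding of T.
T1 = X @ kr(Z, Y).T
assert np.isclose(norm2, np.linalg.norm(T1) ** 2)
assert np.isclose(inner, np.sum((A @ kr(C, B).T) * T1))
assert np.allclose(mtkr, T1 @ kr(C, B))
```

All three quantities are obtained from F × F or F × R inner-product matrices, matching the stated O(NIF²) and O(NIFR) costs.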
To compute an LMLRA, the inner product with an LMLRA

    ⟨M_LMLRA, T⟩ = trace(S ·_1 X^H U ·_2 Y^H V ·_3 Z^H W)    (9)

and the mtkronprod operations are required:

    T̄_(1) (W ⊗ V) = X̄ (W^H Z ⊙ V^H Y)^H,    (10)
    vec(T)^H (W ⊗ V ⊗ U) = vec(⟦U^H X, V^H Y, W^H Z⟧)^H.    (11)
Table 1
Computational per-iteration complexity when computing a rank-R CPD of an Nth-order I × · · · × I tensor given in its efficient representation. For Hankelization and Löwnerization, second-order tensorization of K signals is assumed, resulting in a tensor of size I × I × K. The number of samples is M = 2I − 1 for Hankel and M = 2I for Löwner.

Structure  Quantity     Complexity
Dense      Objective    O(R I^N)
           Gradient     O(N R I^N)
CPD        Parameters   N I F
           Objective    O(N I F R + N I R²)
           Gradient     O(N I F R + N I R²)
LMLRA      Parameters   O(N I J + J^N)
           Objective    O(N I J R + J^N R + N I R²)
           Gradient     O(N I J R + J^N N R + N I R²)
TT         Parameters   O(N I r²)
           Objective    O(N I r² R + N I R²)
           Gradient     O(N I r² R + N I R²)
Hankel     Parameters   M K
           Objective    O(M (K + log₂ M) R + (M + K) R²)
           Gradient     O(M (K + log₂ M) R + (M + K) R²)
Löwner     Parameters   M (K + 1)
           Objective    O(M (K + log₂ M) R + (M + K) R²)
           Gradient     O(M (K + log₂ M) R + (M + K) R²)
The computation of the trace in (9) requires only the diagonal entries of the LMLRA and takes O(NIFR + FR^N) flop, including the construction of the inner products. To compute the actual gradient G_T^(1), the result of (10) is multiplied by S_(1)^T, requiring the product (W^H Z ⊙ V^H Y)^H S_(1)^T, which is a transposed mtkrprod involving usually small matrices. This product can be computed efficiently using, e.g., [63, 81] and requires O(NFR^N + NIFR) flop. The complexity of (11) is the same.
3.2. Tucker format. In the Tucker format, a third-order tensor T = Q ·_1 X ·_2 Y ·_3 Z is defined by a third-order core tensor Q ∈ K^{J1×J2×J3} and the factor matrices X ∈ K^{I×J1}, Y ∈ K^{I×J2}, Z ∈ K^{I×J3}. The Tucker format often follows from the computation of an LMLRA or an MLSVD, e.g., as the result of a compression step [7]. For the complexity analysis, we assume J_n = J for n = 1, . . . , N. We assume that the factor matrices have orthonormal columns, which is always possible through normalization.

The norm of T is computed in O(J^N) operations as

    ‖T‖² = ‖Q‖².

The inner product with M_CPD and the mtkrprod are similar to (9) and (10):

    ⟨M_CPD, T⟩ = trace(Q̄ ·_1 A^T X̄ ·_2 B^T Ȳ ·_3 C^T Z̄),
    T̄_(1) (C ⊙ B) = X̄ Q̄_(1) (Z^H C ⊙ Y^H B),

and both require O(NIJR + J^N R) flop.

To compute an LMLRA, the following expressions for the inner product and the mtkronprod in (7) can be used:

    ⟨M_LMLRA, T⟩ = ⟨S, Q ·_1 U^H X ·_2 V^H Y ·_3 W^H Z⟩,
    T̄_(1) (W ⊗ V) = X̄ Q̄_(1) (Z^H W ⊗ Y^H V).

Similarly, the mtkronprod in (6) is computed as

    vec(T)^H (W ⊗ V ⊗ U) = vec(Q ·_1 U^H X ·_2 V^H Y ·_3 W^H Z)^H.

All computations involving the LMLRA require O(NIJR + J^N R + JR^N) flop.
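For real data, the Tucker-format norm and mtkrprod expressions can be checked as follows. This sketch is our own illustration; the factor matrices are orthonormalized with a QR factorization, which stands in for the normalization assumed above:

```python
import numpy as np

rng = np.random.default_rng(6)
I, J, R = 8, 3, 2
Q = rng.standard_normal((J, J, J))                   # core tensor
# Orthonormal factor matrices (always achievable through normalization).
X, Y, Z = (np.linalg.qr(rng.standard_normal((I, J)))[0] for _ in range(3))
B, C = rng.standard_normal((I, R)), rng.standard_normal((I, R))

def kr(U, V):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(U[:, r], V[:, r]) for r in range(U.shape[1])])

# Core operations on T = Q ._1 X ._2 Y ._3 Z without forming T
# (real data, so Hermitian transposes become plain transposes):
Q1 = Q.reshape(J, J * J, order='F')                  # mode-1 unfolding of Q
norm2 = np.linalg.norm(Q) ** 2                       # ||T||^2 = ||Q||^2
mtkr = X @ Q1 @ kr(Z.T @ C, Y.T @ B)                 # T_(1) (C kr B)

# Reference: construct T densely.
T = np.einsum('abc,ia,jb,kc->ijk', Q, X, Y, Z)
T1 = T.reshape(I, I * I, order='F')
assert np.isclose(norm2, np.linalg.norm(T) ** 2)
assert np.allclose(mtkr, T1 @ kr(C, B))
```

The key saving is that the Khatri–Rao product is formed between the small J × R matrices Zᵀ C and Yᵀ B, never between the tall I × R factors.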
3.3. Tensor train format. An Nth-order tensor T ∈ K^{I1×···×IN} can be represented by N core matrices or tensors Q^(n) ∈ K^{r_{n−1}×In×rn}, serially linked by N − 1 indices such that

    t_{i1,...,iN} = Σ_{s1=1}^{r1} · · · Σ_{sN−1=1}^{rN−1} q^(1)_{1,i1,s1} q^(2)_{s1,i2,s2} · · · q^(N)_{sN−1,iN,1},   or   T = tt(Q^(1), . . . , Q^(N)).

The parameters r_n, n = 0, . . . , N, are the compression ranks, with r_0 = r_N = 1. This formulation is known as matrix product states [54] in computational chemistry and as tensor trains (TT) [55] in scientific computing, and can be computed using SVDs, cross approximation or optimization [55, 56, 57, 60]. Assuming all dimensions and compression ranks are equal, i.e., r_n = r for n = 1, . . . , N − 1, the total number of parameters is O(NIr²).

The squared norm of T is given by ‖T‖² = |F^(N)|, in which the scalar F^(N) is constructed using the recursive formula

    F^(0) = 1,
    F^(n) = (Q^(n) ·_1 F^(n−1))_(1,2)^H Q^(n)_(1,2),   n = 1, . . . , N.

This way, ‖T‖² can be computed in O(NIr³) flop. While this recursive formula is essentially identical to the formula derived in [60], it reveals the inner products more clearly, allowing efficient implementations.
auxiliary core tensors ˆQ(n). Let A(1) = A, A(2) = Band A(3) = Cfor M
CPD, and 395
A(1) = U, A(2) = V and A(3) = W for M
LMLRA, then ˆQ(n) is computed using 396 O Ir2Rflop as 397 ˆ Q(n)= ¯Q(n)·2A(n) T . 398 399
The inner products are then computed as
400 hMCPD, T i =tracett( ˆQ(1), ˆQ(2), ˆQ(3)) (12) 401 hMLMLRA, T i =Dtt( ˆQ(1), ˆQ(2), ˆQ(3)), ¯S E . (13) 402 403
As only the diagonal entries of tt( ˆQ(1), ˆQ(2), ˆQ(3))are required to compute the trace 404
in(12), the construction of the R × · · · × R tensor can be avoided. The construction
405
of the diagonal entries in (12)and the inner product in (13)require O NRr2and
406
O RNr2
flop, respectively.
407
To compute an mtkrprod, the auxiliary cores ˆQ(n), n = 1, . . . , N are used 408
again. Define P(1) = tt( ¯Q(1), ˆQ(2), ˆQ(3)) and p
r as the mode-1 vectors P(1)(:, r, r), 409 r = 1, . . . , R, then 410 ¯ T(1)(C B) = h p(1)1 p(1)2 · · · p(1)R i . 411 412
The complexity of computing one mtkrprod is therefore O Ir2R
, assuming
auxil-413
iary cores have been constructed.
414
Similarly, the mtkronprod can be formed efficiently as
415 ¯ T(1)(W ⊗ V) =tt( ¯Q(1), ˆQ(2), ˆQ(3)) (1). 416 417
To compute the gradient terms G(n)
T in (7) the matrix tt( ¯Q(1), ˆQ(2), ˆQ(3))
(1) can 418
be constructed and multiplied with ST
(1), which requires O IR
N+ IRN−1r2 flop.
419
However, by exploiting the TT structure, the product can be computed directly in
420
O rRN + r2RN−1+ Ir2R
flop. The precise implementation is out of the scope of
421
this paper. Finally, to compute G(0)
T in O r2RN
flop, we can use
422
vec (T )H
(W ⊗ V ⊗ U) =vectt( ˆQ(1), ˆQ(2), ˆQ(3))T. 423
424
3.4. Implicit Hankelization. Hankelization of a (possibly complex) signal
vec-425
tor m ∈ KM of length M = PN
n=1In− N + 1yields an I1× I2 matrix or I1× · · · × IN 426
tensor with constant anti-diagonals or constant anti-diagonal hyperplanes for N = 2
427
and N > 2, respectively [21,62]. In blind source separation, a set of K vectors
428
mk ∈ KM is often given [14,21]. Each vector mk can be Hankelized separately and 429
the resulting matrices or tensors can be concatenated along the (N + 1)th mode. To
430
simplify the expressions1, we only consider second-order Hankelization and assume
431
I1 = I2 = I. Let us stack the vectors mk as the columns of M. Each entry of the 432
I × I × Ktensor T is then given by tijk= mi+j−1,k. As multiplying a Hankel matrix 433
by a vector corresponds to a convolution, fast Fourier transforms (FFT) can be used
434
to speed up computations [3,24]. Let F denote the M-point FFT and F−1 the inverse 435
M-point FFT. Only the last I rows of the inverse M-point FFT are retained when
436
the operator F−1 is used. 437
The squared norm can be computed in $\mathcal{O}\left(MK\right)$ flop as
$$\|\mathcal{T}\|^2 = \mathbf{w}^{\mathrm{T}}(\bar{\mathbf{M}} \ast \mathbf{M})\mathbf{1}_K,$$
in which $\mathbf{w}$ is a vector that contains the number of occurrences of each entry of $\mathbf{M}$ in $\mathcal{T}$ and can be computed as $\mathbf{w} = \mathrm{convolution}(\mathbf{1}_I, \mathbf{1}_I)$. The inner products can be computed as
$$\langle \mathcal{M}_{\mathrm{CPD}}, \mathcal{T} \rangle = \left\langle \mathcal{F}^{-1}\left(\mathcal{F}\mathbf{A} \ast \mathcal{F}\mathbf{B}\right)\mathbf{C}^{\mathrm{T}}, \mathbf{M} \right\rangle,$$
$$\langle \mathcal{M}_{\mathrm{LMLRA}}, \mathcal{T} \rangle = \left\langle \mathcal{F}^{-1}\left(\left(\mathcal{F}\mathbf{V} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}\right)\mathbf{S}_{(1,2)}\mathbf{W}^{\mathrm{T}}\right), \mathbf{M} \right\rangle.$$
The computational cost is $\mathcal{O}\left(RM(K + \log_2 M)\right)$ and $\mathcal{O}\left((R+K)M\log_2 M + MR^3\right)$ flop for the inner product with $\mathcal{M}_{\mathrm{CPD}}$ and $\mathcal{M}_{\mathrm{LMLRA}}$, respectively.
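The weight-vector formula for the squared norm can be verified directly against an explicitly constructed Hankel tensor. A small NumPy sketch for the real case (our illustration, not Tensorlab code):

```python
import numpy as np

I, K = 4, 3
Mlen = 2 * I - 1
M = np.random.randn(Mlen, K)

# explicit Hankel tensor t_{ijk} = m_{i+j-1,k} (0-based: M[i + j, k])
T = np.empty((I, I, K))
for i in range(I):
    for j in range(I):
        T[i, j, :] = M[i + j, :]

# w contains the number of occurrences of each sample of M in T
w = np.convolve(np.ones(I), np.ones(I))          # length 2I - 1

norm_full = np.sum(T ** 2)
norm_struct = w @ (M * M) @ np.ones(K)

assert np.allclose(norm_full, norm_struct)
```

The structured evaluation touches only the $MK$ parameters instead of the $I^2K$ tensor entries.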
To compute the mtkrprod, the Hankel structure can be exploited as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{C} \odot \mathbf{B}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{C} \ast \mathcal{F}\mathbf{B}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{C} \odot \mathbf{A}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{C} \ast \mathcal{F}\mathbf{A}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{B} \odot \mathbf{A}) = \mathbf{M}^{\mathrm{H}}\mathcal{F}^{-1}\left(\mathcal{F}\mathbf{A} \ast \mathcal{F}\mathbf{B}\right),$$
in which $\cdot^{\mathrm{f}}$ denotes reversing the row order, each requiring $\mathcal{O}\left(RM(K + \log_2 M)\right)$ flop.
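As a check of the mode-1 expression, the following NumPy sketch (our illustration for the real case, with Matlab-style column ordering for the unfolding and the Khatri–Rao product) compares the FFT-based mtkrprod to the explicit one:

```python
import numpy as np

I, K, R = 4, 3, 2
Mlen = 2 * I - 1
M = np.random.randn(Mlen, K)
B = np.random.randn(I, R)
C = np.random.randn(K, R)

# explicit tensor and mtkrprod as reference
T = np.empty((I, I, K))
for i in range(I):
    for j in range(I):
        T[i, j, :] = M[i + j, :]
T1 = T.reshape(I, I * K, order='F')              # mode-1 unfolding
KR = np.einsum('kr,jr->jkr', C, B).reshape(I * K, R, order='F')  # C kr B
ref = T1 @ KR

# FFT-based computation: one length-M FFT per column
Bf = np.zeros((Mlen, R))
Bf[:I] = B[::-1]                                  # flipped, zero-padded B
conv = np.fft.fft(M @ C, axis=0) * np.fft.fft(Bf, axis=0)
out = np.fft.ifft(conv, axis=0).real[-I:]         # keep the last I rows

assert np.allclose(ref, out)
```

Each column reduces to a Hankel-matrix-times-vector product with the generating vector $\mathbf{M}\mathbf{c}_r$, which is why only $R$ FFT pairs are needed.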
The mtkronprod in $\mathbf{G}^{(n)}_{\mathcal{T}}$ in (7) can be computed in $\mathcal{O}\left(R^2M\log_2 M + RMK\right)$ flop as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{W} \otimes \mathbf{V}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{W} \odot^{\mathrm{T}} \mathcal{F}\mathbf{V}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{W} \otimes \mathbf{U}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{W} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{V} \otimes \mathbf{U}) = \mathbf{M}^{\mathrm{H}}\mathcal{F}^{-1}\left(\mathcal{F}\mathbf{V} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}\right).$$
The multiplication with $\mathbf{S}^{\mathrm{T}}_{(n)}$ in (7) additionally takes $\mathcal{O}\left(IR^3\right)$ flop for $n = 1, 2$ and $\mathcal{O}\left(KR^3\right)$ flop for $n = 3$. However, by first computing the multiplication of the row-wise Khatri–Rao product with $\mathbf{S}^{\mathrm{T}}_{(n)}$, i.e., before performing the inverse FFT, the total cost can be reduced to $\mathcal{O}\left(RM(K + \log_2 M) + MR^3\right)$. Finally, the mtkronprod in all modes can be computed in $\mathcal{O}\left(R^2M\log_2 M + R^3M + RMK\right)$ flop using
$$\mathrm{vec}\left(\mathcal{T}\right)^{\mathrm{H}}(\mathbf{W} \otimes \mathbf{V} \otimes \mathbf{U}) = \mathrm{vec}\left(\left(\mathcal{F}^{-1}\left(\mathcal{F}\mathbf{V} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}\right)\right)^{\mathrm{T}}\bar{\mathbf{M}}\mathbf{W}\right)^{\mathrm{T}}.$$
3.5. Implicit Löwnerization. Löwner matrices and tensors have attractive properties for applications involving rational functions [21, 23]. Given a function $h : \mathbb{K} \rightarrow \mathbb{K}$ evaluated at $M$ points $t_i \in T = \{t_1, \ldots, t_M\}$, we partition $T$ into two disjoint point sets $X = \{x_1, \ldots, x_I\}$ and $Y = \{y_1, \ldots, y_J\}$ such that $T = X \cup Y$ and $M = I + J$. The entries of a Löwner matrix $\mathbf{L} \in \mathbb{K}^{I \times J}$ are then given by
$$l_{ij} = \frac{h(x_i) - h(y_j)}{x_i - y_j}, \quad i = 1, \ldots, I, \quad j = 1, \ldots, J.$$
While higher-order generalizations of the Löwner transformation exist [21, 22], we focus on third-order tensors in which each $k$th frontal slice is a Löwner matrix constructed from a function $h_k$ evaluated at the points in $X$ and $Y$. Let $\mathbf{P} \in \mathbb{K}^{I \times K}$ and $\mathbf{Q} \in \mathbb{K}^{J \times K}$ contain the sampled function values at the points in $X$ and $Y$, respectively, i.e.,
$$p_{ik} = h_k(x_i), \quad i = 1, \ldots, I, \quad k = 1, \ldots, K,$$
$$q_{jk} = h_k(y_j), \quad j = 1, \ldots, J, \quad k = 1, \ldots, K.$$
The tensor $\mathcal{T} \in \mathbb{K}^{I \times J \times K}$ is then fully determined by $\mathbf{P}$, $\mathbf{Q}$, $X$ and $Y$. To simplify the complexity expressions, we take $I = J$. Each frontal slice $\mathbf{T}_k$ can be written as
$$\mathbf{T}_k = \mathrm{Diag}(\mathbf{p}_k)\mathbf{M} - \mathbf{M}\,\mathrm{Diag}(\mathbf{q}_k), \quad k = 1, \ldots, K,$$
in which $\mathbf{M}$ is a Cauchy matrix with
$$m_{ij} = \frac{1}{x_i - y_j}, \quad 1 \leq i, j \leq I.$$
By assuming that all points in $X$ and $Y$ are equidistant, $\mathbf{M}$ is also a Toeplitz matrix. As multiplying a Toeplitz matrix with a vector $\mathbf{x}$ can be seen as a convolution of the generating vector $\mathbf{v} = \left[m_{1,I}; m_{1,I-1}; \ldots; m_{1,1}; m_{2,1}; \ldots; m_{I,1}\right] \in \mathbb{K}^{2I-1}$ with $\mathbf{x}$, it is not necessary to construct $\mathbf{M}$, and the multiplication can be performed using FFTs. Let $\mathcal{F}$ be the $(2I-1)$-point FFT with zero padding for shorter vectors, $\mathcal{F}^{-1}$ the inverse $(2I-1)$-point FFT, and define $\underline{\mathcal{F}}^{-1}$ to be the last $I$ rows of the result of the inverse FFT. For a matrix $\mathbf{X} \in \mathbb{K}^{I \times K}$, $\mathbf{M}\mathbf{X}$ and $\mathbf{M}^{\mathrm{T}}\mathbf{X}$ can then be computed as
$$\mathbf{M}\mathbf{X} = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\mathbf{v}\mathbf{1}^{\mathrm{T}}_K \ast \mathcal{F}\mathbf{X}\right), \tag{14}$$
$$\mathbf{M}^{\mathrm{T}}\mathbf{X} = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\mathbf{v}^{\mathrm{f}}\mathbf{1}^{\mathrm{T}}_K \ast \mathcal{F}\mathbf{X}\right). \tag{15}$$
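Equation (14) can be verified numerically. The NumPy sketch below (our own illustration, real data, equidistant points shifted so that $x_i \neq y_j$) builds the generating vector from the reversed first row and the first column of the Toeplitz matrix and keeps the last $I$ rows of the inverse FFT:

```python
import numpy as np

I, K = 5, 3
x_pts = np.arange(1, I + 1) + 0.5        # equidistant points in X
y_pts = np.arange(1, I + 1).astype(float)  # equidistant points in Y
M = 1.0 / (x_pts[:, None] - y_pts[None, :])  # Cauchy, here also Toeplitz

# generating vector: reversed first row followed by first column
v = np.concatenate((M[0, ::-1], M[1:, 0]))    # length 2I - 1

X = np.random.randn(I, K)
Xp = np.zeros((2 * I - 1, K))
Xp[:I] = X                                    # zero padding
prod = np.fft.ifft(np.fft.fft(v)[:, None] * np.fft.fft(Xp, axis=0), axis=0)
out = prod.real[-I:]                          # keep the last I rows

assert np.allclose(M @ X, out)
```

Precomputing the FFT of $\mathbf{v}$, as suggested in the text, removes one transform per product.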
If $\mathcal{F}\mathbf{v}$ is precomputed, the cost of this multiplication is $\mathcal{O}\left(2KI + 4KI\log_2 2I\right)$ flop.

We now define the different operations for Löwner tensors. As the unfoldings $\mathbf{T}_{(n)}$, $n = 1, 2, 3$, are given by
$$\mathbf{T}_{(1)} = \mathbf{P} \odot^{\mathrm{T}} \mathbf{M} - \mathbf{M}\left(\mathbf{Q} \odot^{\mathrm{T}} \mathbf{I}_J\right),$$
$$\mathbf{T}_{(2)} = \mathbf{M}^{\mathrm{T}}\left(\mathbf{P} \odot^{\mathrm{T}} \mathbf{I}_I\right) - \left(\mathbf{Q} \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{T}}\right),$$
$$\mathbf{T}_{(3)} = \mathbf{P}^{\mathrm{T}}\left(\mathbf{M} \odot^{\mathrm{T}} \mathbf{I}_I\right) - \mathbf{Q}^{\mathrm{T}}\left(\mathbf{I}_J \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{T}}\right),$$
the expressions for the mtkrprod and mtkronprod can be derived easily using multilinear algebra identities. The mtkrprod can be computed efficiently as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{C} \odot \mathbf{B}) = \bar{\mathbf{P}}\mathbf{C} \ast \bar{\mathbf{M}}\mathbf{B} - \bar{\mathbf{M}}\left(\bar{\mathbf{Q}}\mathbf{C} \ast \mathbf{B}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{C} \odot \mathbf{A}) = \mathbf{M}^{\mathrm{H}}\left(\bar{\mathbf{P}}\mathbf{C} \ast \mathbf{A}\right) - \bar{\mathbf{Q}}\mathbf{C} \ast \mathbf{M}^{\mathrm{H}}\mathbf{A},$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{B} \odot \mathbf{A}) = \mathbf{P}^{\mathrm{H}}\left(\bar{\mathbf{M}}\mathbf{B} \ast \mathbf{A}\right) - \mathbf{Q}^{\mathrm{H}}\left(\mathbf{B} \ast \mathbf{M}^{\mathrm{H}}\mathbf{A}\right).$$
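The mode-1 expression can be checked against an explicitly built Löwner tensor. A NumPy sketch for the real case (our illustration; $\ast$ is the Hadamard product, and the unfolding uses Matlab-style column ordering):

```python
import numpy as np

I, K, R = 4, 3, 2
x = np.arange(1, I + 1) + 0.5
y = np.arange(1, I + 1).astype(float)
P = np.random.randn(I, K)          # samples h_k(x_i)
Q = np.random.randn(I, K)          # samples h_k(y_j)
M = 1.0 / (x[:, None] - y[None, :])

# explicit Löwner tensor: t_{ijk} = (p_{ik} - q_{jk}) / (x_i - y_j)
T = (P[:, None, :] - Q[None, :, :]) * M[:, :, None]

B = np.random.randn(I, R)
C = np.random.randn(K, R)
KR = np.einsum('kr,jr->jkr', C, B).reshape(I * K, R, order='F')  # C kr B
ref = T.reshape(I, I * K, order='F') @ KR

# structured evaluation: only products with M, P and Q
out = (P @ C) * (M @ B) - M @ ((Q @ C) * B)
assert np.allclose(ref, out)
```

Only matrix products with the $I \times I$ Cauchy matrix and the parameter matrices are needed, never the $I \times I \times K$ tensor.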
As each mtkrprod requires two FFT-based multiplications with $\mathbf{M}$ and two regular matrix-matrix multiplications with $\mathbf{P}$ and $\mathbf{Q}$, the complexity is $\mathcal{O}\left(RI(K + \log_2 I)\right)$. Similarly, the mtkronprod in (7) can be computed as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{W} \otimes \mathbf{V}) = \bar{\mathbf{P}}\mathbf{W} \odot^{\mathrm{T}} \bar{\mathbf{M}}\mathbf{V} - \bar{\mathbf{M}}\left(\bar{\mathbf{Q}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{V}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{W} \otimes \mathbf{U}) = \mathbf{M}^{\mathrm{H}}\left(\bar{\mathbf{P}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{U}\right) - \bar{\mathbf{Q}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{H}}\mathbf{U},$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{V} \otimes \mathbf{U}) = \mathbf{P}^{\mathrm{H}}\left(\bar{\mathbf{M}}\mathbf{V} \odot^{\mathrm{T}} \mathbf{U}\right) - \mathbf{Q}^{\mathrm{H}}\left(\mathbf{V} \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{H}}\mathbf{U}\right),$$
in $\mathcal{O}\left(R^2I + R^2I\log_2 I + RKI\right)$ flop for $n = 1, 2$ and $\mathcal{O}\left(RI\log_2 I + R^2IK\right)$ flop for $n = 3$. This can be reduced further to $\mathcal{O}\left(RI\log_2 I + RKI\right)$ flop for $n = 1, 2$ by exploiting the row-wise Khatri–Rao products when computing the subsequent multiplication with $\mathbf{S}^{\mathrm{T}}_{(n)}$, which costs $\mathcal{O}\left(IR^3\right)$. A similar improvement can be made for $n = 3$. The mtkronprod in (6) is computed as
$$\mathrm{vec}\left(\mathcal{T}\right)^{\mathrm{H}}(\mathbf{W} \otimes \mathbf{V} \otimes \mathbf{U}) = \mathrm{vec}\left(\mathbf{U}^{\mathrm{T}}\left(\bar{\mathbf{P}}\mathbf{W} \odot^{\mathrm{T}} \bar{\mathbf{M}}\mathbf{V}\right) - (\mathbf{M}^{\mathrm{H}}\mathbf{U})^{\mathrm{T}}\left(\bar{\mathbf{Q}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{V}\right)\right)^{\mathrm{T}}$$
and requires $\mathcal{O}\left(RI\log_2 I + R^3I\right)$ flop.

The squared Frobenius norm is computed as
$$\|\mathcal{T}\|^2 = \mathbf{1}^{\mathrm{T}}_I\left(\bar{\mathbf{M}} \ast \mathbf{M}\right)^{\mathrm{T}}\left(\bar{\mathbf{P}} \ast \mathbf{P}\right)\mathbf{1}_K + \mathbf{1}^{\mathrm{T}}_I\left(\bar{\mathbf{M}} \ast \mathbf{M}\right)\left(\bar{\mathbf{Q}} \ast \mathbf{Q}\right)\mathbf{1}_K - 2\,\mathrm{Re}\left(\mathbf{1}^{\mathrm{T}}_I\left(\left(\bar{\mathbf{M}} \ast \mathbf{M}\right)^{\mathrm{T}}\bar{\mathbf{P}} \ast \mathbf{Q}\right)\mathbf{1}_K\right).$$
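The three terms of this norm expression follow from expanding $|t_{ijk}|^2 = |m_{ij}|^2(|p_{ik}|^2 + |q_{jk}|^2 - 2\,\mathrm{Re}(\bar{p}_{ik}q_{jk}))$. A NumPy sketch for the real case with $I = J$ (our own check, not Tensorlab code):

```python
import numpy as np

I, K = 4, 3
x = np.arange(1, I + 1) + 0.5
y = np.arange(1, I + 1).astype(float)
P = np.random.randn(I, K)
Q = np.random.randn(I, K)
M = 1.0 / (x[:, None] - y[None, :])
T = (P[:, None, :] - Q[None, :, :]) * M[:, :, None]   # explicit tensor

# the three terms of the structured norm expression
term1 = np.ones(I) @ ((M * M).T @ (P * P)) @ np.ones(K)
term2 = np.ones(I) @ ((M * M) @ (Q * Q)) @ np.ones(K)
term3 = 2 * np.ones(I) @ (((M * M).T @ P) * Q) @ np.ones(K)

assert np.allclose(np.sum(T ** 2), term1 + term2 - term3)
```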
As only $K + 2$ FFTs are required, the total computational cost is $\mathcal{O}\left(4IK\log_2 2I\right)$ flop. To compute the inner products with $\mathcal{M}_{\mathrm{CPD}}$ and $\mathcal{M}_{\mathrm{LMLRA}}$, the results from the mtkrprod and the mtkronprod are reused:
$$\langle \mathcal{M}_{\mathrm{CPD}}, \mathcal{T} \rangle = \left\langle \bar{\mathbf{T}}_{(3)}(\mathbf{B} \odot \mathbf{A}), \bar{\mathbf{C}} \right\rangle,$$
$$\langle \mathcal{M}_{\mathrm{LMLRA}}, \mathcal{T} \rangle = \left\langle \mathbf{U}^{\mathrm{T}}\bar{\mathbf{T}}_{(1)}(\mathbf{W} \otimes \mathbf{V}), \bar{\mathbf{S}}_{(1)} \right\rangle.$$
Apart from the cost of the mtkrprod and the mtkronprod, an additional cost of $\mathcal{O}\left(KR\right)$ and $\mathcal{O}\left(R^N\right)$ flop is incurred for $\langle \mathcal{M}_{\mathrm{CPD}}, \mathcal{T} \rangle$ and $\langle \mathcal{M}_{\mathrm{LMLRA}}, \mathcal{T} \rangle$, respectively.

In the case the point sets $X$ or $Y$ do not contain equidistant points, $\mathbf{M}$ no longer admits a Toeplitz structure. By exploiting the low displacement rank of Cauchy matrices, it is again not necessary to explicitly construct the tensor. The computational complexity of the multiplications with $\mathbf{M}$ in (14) and (15) is slightly increased to $\mathcal{O}\left(4KI\log_2^2 2I\right)$ flop [25, 26].
4. Experiments. The following experiments illustrate the scalability of constrained CPD, LL1 and BTD algorithms that exploit efficient representations of tensors, as well as the effect on the accuracy of the results. The latter is discussed in subsection 4.1 for ill-conditioned problems and in subsection 4.3 for Hankelized signals. The former is illustrated by computing the nonnegative CPD of a compressed tensor in subsection 4.2 and for the unconstrained LL1 decomposition and the constrained BTD in subsection 4.3. The CPD error $E_{\mathrm{CPD}}$ is defined as
$$E_{\mathrm{CPD}} = \max_n \frac{\left\|\mathbf{A}^{(n)} - \hat{\mathbf{A}}^{(n)}\right\|}{\left\|\mathbf{A}^{(n)}\right\|},$$
in which $\mathbf{A}^{(n)}$ ($\hat{\mathbf{A}}^{(n)}$) are the exact (estimated) factor matrices, $n = 1, \ldots, N$, and scaling and permutation indeterminacies are assumed to be resolved. All timing results are total computation times for a complete run of the optimization algorithm, in contrast to the derived per-iteration complexities in section 3. Tensorlab 3.0 [87] is used for all experiments in combination with Matlab 2016b running on a dual-socket 20-core Intel Xeon E5-2660 v2 machine with 128 GiB of RAM running CentOS 7.
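The error measure is straightforward to implement. A minimal NumPy helper (our own sketch; it assumes, as the text does, that permutation and scaling have already been resolved):

```python
import numpy as np

def cpd_error(A_list, Ahat_list):
    """Relative CPD error: max_n ||A^(n) - Ahat^(n)|| / ||A^(n)||.

    Assumes the permutation and scaling indeterminacies of the CPD
    have already been resolved.
    """
    return max(np.linalg.norm(A - Ah) / np.linalg.norm(A)
               for A, Ah in zip(A_list, Ahat_list))

# small sanity check: a tiny perturbation gives a tiny error
A = [np.random.randn(25, 5) for _ in range(3)]
Ahat = [An + 1e-8 * np.random.randn(25, 5) for An in A]
err = cpd_error(A, Ahat)
assert 0 <= err < 1e-6
```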
4.1. Accuracy and conditioning. In this first experiment, we study the accuracy of the recovered factor matrices for the full tensor and its efficient representation. More specifically, the CPD of a tensor with an exact rank-$R$ structure is computed while varying the relative condition number $\kappa$, which is defined as in [80]. Concretely, a rank-5 tensor of size $25 \times 25 \times 25$ is constructed using random factor matrices $\mathbf{A}^{(n)}$, $n = 1, 2, 3$. Each random factor vector $\mathbf{a}^{(n)}_r$ has norm one and a fixed angle $\alpha$ w.r.t. the other factor vectors in the same factor matrix, i.e., for $n = 1, 2, 3$, $\|\mathbf{a}^{(n)}_r\| = 1$ and $\cos\alpha = \mathbf{a}^{(n)\mathrm{T}}_r\mathbf{a}^{(n)}_s$ for $r, s = 1, \ldots, R$, $r \neq s$. As $\alpha$ decreases, the rank-1 terms become more collinear and the condition number $\kappa$ increases. A rank-5 CPD is computed from the full tensor and from the structured tensor in the polyadic format using cpd_nls, starting from a perturbed exact solution. Figure 1 shows the error $E_{\mathrm{CPD}}$ on the recovered factor matrices for $\alpha = \pi/2, \ldots, \pi/180$ (using a logarithmic scale). As expected, the error increases when the condition number $\kappa$ worsens. For well-conditioned problems, $E_{\mathrm{CPD}}$ is in the order of the machine precision $\epsilon \approx 10^{-16}$ if the full tensor is used, while $E_{\mathrm{CPD}}$ is higher for the structured tensor. This can be explained by the fact that changes smaller than $\sqrt{\epsilon} \approx 10^{-8}$ cannot be distinguished when using the structured tensor, as the computation of the objective function (3) requires the difference of squared, almost equal numbers. For very ill-conditioned problems, $E_{\mathrm{CPD}}$ may be undesirably high when using the structured tensor. If the tensor admits an exact decomposition and if this exact decomposition is required up to machine precision, using the full tensor may be necessary. However, the structured tensor can still be used as an initialization to reduce the overall computational cost.

Figure 1. By reducing the angle α between the factor vectors, the condition of the CPD worsens, resulting in a higher error E_CPD. For very ill-conditioned problems, using the structured tensor approach may result in an undesirably large error, and a few additional iterations using the full tensor may be required to improve the accuracy. Results are medians over 100 experiments.
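The $\sqrt{\epsilon}$ barrier can be reproduced in two lines of plain Python (our own toy example, not from the experiments): evaluating a tiny squared difference via expanded squares, as the structured objective function effectively does, destroys every digit below the rounding level of the large intermediate quantities.

```python
# a and b agree to nine digits; the true squared difference is ~1e-18,
# far below sqrt(eps) ~ 1e-8 relative to the magnitudes involved
a = 1.0
b = 1.0 + 1e-9

direct = (a - b) ** 2                 # accurate small result
expanded = a**2 - 2 * a * b + b**2    # difference of squared, almost
                                      # equal numbers: cancellation at
                                      # the eps level swamps the answer
print(direct, expanded)
```

The direct evaluation corresponds to working with the full tensor; the expanded evaluation mimics the structured formulation $\|\mathcal{T}\|^2 - 2\langle\mathcal{M},\mathcal{T}\rangle + \|\mathcal{M}\|^2$.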
4.2. Compression for nonnegative CPD. The following experiments illustrate how compression can be used to reduce the complexity of constrained tensor decompositions. Here, we focus on the widely used nonnegative CPD, which is sped up by first computing a truncated MLSVD or a TT approximation. Two approaches to enforce the constraints are used: a projected Gauss–Newton approach generating updates such that the variables are always positive using active sets [42], and parameter-based constraints, i.e., each entry is the square of an underlying variable [65]. The former approach is implemented using the boundary-constrained Gauss–Newton algorithm nlsb_gndl; the latter is implemented using the structured data fusion framework [77] (sdf_nls and struct_nonneg).
For the first experiment, let $\mathcal{T}$ be a rank-10 tensor of size $I \times I \times I$ constructed using random factor matrices $\mathbf{A}^{(n)}$ with entries drawn from a uniform distribution $\mathrm{U}(0, 1)$. Gaussian i.i.d. noise is added such that the signal-to-noise ratio (SNR) is 20 dB. Using a randomized MLSVD (mlsvd_rsi [85]), the tensor is compressed such that the core $\mathcal{G}$ has size $10 \times 10 \times 10$, i.e., $\mathcal{T} \approx \llbracket \mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathbf{U}^{(3)} \rrbracket$. The core tensor $\mathcal{G}$ and the factors $\mathbf{U}^{(n)}$ are then used as the structured format to compute a rank-10 CPD. Random initial factor matrices are drawn from the same distribution as $\mathbf{A}^{(n)}$. Figure 2 shows the time required to compute an unconstrained CPD (hence relying on CPD uniqueness), a constrained CPD using projected GN (cpd_nls with the nlsb_gndl solver) and a constrained CPD using parametrization (sdf_nls). In the three cases, the computation time for the original tensor $\mathcal{T}$ rises cubically in the dimension $I$, for $I = 2^6, 2^7, \ldots, 2^{10}$, while the time rises linearly using the structured approach. Hence, existing CPD algorithms can be scaled to handle large problems, provided that the MLSVD approximation can be computed and the rank is modest. Note that the compression time is not included as it can be amortized over multiple initializations, ranks or experiments.
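For readers outside Matlab, the compression step can be sketched with a plain truncated HOSVD in NumPy (our own illustration; the experiment itself uses the randomized mlsvd_rsi from Tensorlab, which is more efficient for large tensors):

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding of a third-order tensor (column-major order)."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order='F')

def compress(T, core_size):
    """Truncated-MLSVD-style compression: T ~ G x1 U1 x2 U2 x3 U3."""
    U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :core_size]
         for n in range(3)]
    G = np.einsum('ijk,ia,jb,kc->abc', T, U[0], U[1], U[2])
    return G, U

# exact rank-10 tensor of size I x I x I, compressed to a 10^3 core
I, R = 40, 10
A = [np.random.rand(I, R) for _ in range(3)]
T = np.einsum('ir,jr,kr->ijk', A[0], A[1], A[2])
G, U = compress(T, R)
T_approx = np.einsum('abc,ia,jb,kc->ijk', G, U[0], U[1], U[2])

# an exact rank-10 tensor is captured exactly by a 10 x 10 x 10 core
assert np.linalg.norm(T - T_approx) / np.linalg.norm(T) < 1e-8
```

The subsequent CPD is then computed from the small core $\mathcal{G}$ and the factors $\mathbf{U}^{(n)}$ instead of the $I^3$ entries.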
Figure 2. When using a compressed tensor instead of the full tensor, the computational cost scales linearly in the tensor dimensions instead of cubically for a rank-10 nonnegative tensor of size I × I × I. MLSVD compression with a core size of 10 × 10 × 10 is used. The increase for the structured tensor is actually less than linear (×1.6–×1.7 instead of ×2), which is caused by improved multicore usage. The time required to compute the MLSVD increases cubically from 20 ms for I = 2^6 to 8 s for I = 2^10 and is not included. Medians over 100 experiments for each parameter are reported.

Figure 3. Using a TT approximation instead of the full 10 × 10 × · · · × 10 tensor removes the exponential dependence on the order N when computing a rank-5 nonnegative CPD: from N = 8 to N = 9 the time increases with a factor 1.2 ≈ 9/8 using the TT approximation, while the time increase is 8.3 ≈ 10 using the full tensor. The median TT compression time is not included and increases exponentially from 20 ms for N = 5 to 114 s for N = 9. The results are medians over 100 experiments.

In a second experiment, the complexity as a function of the order $N$ of the tensor
is investigated. We compare the time required to compute a nonnegative CPD of an $N$th-order rank-5 tensor with dimensions $10 \times \cdots \times 10$ using the full tensor and its TT approximation. The data is generated by constructing $N$ random factor matrices with entries drawn from the uniform distribution $\mathrm{U}(0, 1)$. Random Gaussian i.i.d. noise is added to the generated tensor such that the SNR is 20 dB. The core tensors $\mathcal{G}^{(n)}$ corresponding to the TT approximation are computed using tt_tensor from the TT Toolbox [57], which we slightly adapted such that all compression ranks $r_n = 5$, $n = 1, \ldots, N-1$. Random factor matrices using the same distribution are used as initialization. Nonnegativity is enforced using the projected GN approach (cpd_nls with the nlsb_gndl solver). The timing results in Figure 3 clearly show that using the TT approximation avoids the curse of dimensionality, as the time increases linearly in $N$, in contrast to the time required for the decomposition of the full tensor. Note that we expect the time to increase with a factor 10 if the order increases by one, but the actual time increase is lower, as seen in Figure 3. This is due to the fact that the decomposition problem becomes easier as relatively more data is available per variable. This is also reflected in the accuracy: for $N = 5$ the median accuracy is $E_{\mathrm{CPD}} = 4.7 \cdot 10^{-3}$, while $E_{\mathrm{CPD}} = 3.8 \cdot 10^{-5}$ for $N = 9$.
4.3. Signal separation through Hankelization. When the source signals are sums of exponential polynomials, the sources $\mathbf{S} \in \mathbb{R}^{M \times R}$ can be recovered from a noisy mixture $\mathbf{X} \in \mathbb{R}^{M \times K}$ using Hankelization [17]. We use the setup
$$\mathbf{X} = \mathbf{S}\mathbf{M}^{\mathrm{T}} + \mathbf{N},$$
in which $\mathbf{M} \in \mathbb{R}^{K \times R}$ is the mixing matrix and $\mathbf{N} \in \mathbb{R}^{M \times K}$ is additive Gaussian i.i.d. noise scaled such that a given SNR is attained. Hankelization along the columns of $\mathbf{X}$ leads to a third-order tensor $\mathcal{T}$ of size $\left\lceil \frac{M+1}{2} \right\rceil \times \left\lfloor \frac{M+1}{2} \right\rfloor \times K$, in which $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ are the floor and ceil operators, respectively. Hence, if the number of samples $M$ doubles, the number of entries in $\mathcal{T}$ quadruples. If each of the $R$ source signals is a sum of $Q_r$ exponential polynomials of degree $d_{qr}$, $q = 1, \ldots, Q_r$, $\mathcal{T}$ can be decomposed into a sum of $R$ multilinear rank-$(L_r, L_r, 1)$ terms with $L_r = \sum_{q=1}^{Q_r}(d_{qr} + 1)$ [17], i.e.,
$$\mathcal{T} = \sum_{r=1}^{R} \left(\mathbf{A}_r\mathbf{B}^{\mathrm{T}}_r\right) \otimes \mathbf{c}_r,$$
in which $\mathbf{c}_r$ estimates the $r$th mixing vector $\mathbf{M}(:, r)$. The source signals can be estimated up to scaling and permutation by dehankelizing $\mathbf{A}_r\mathbf{B}^{\mathrm{T}}_r$, e.g., by averaging over the anti-diagonals. The example from [17] is slightly adapted in this experiment: two sources are mixed using
$$\mathbf{M} = \begin{bmatrix} 2 & 1 \\ -1 & 1 \end{bmatrix},$$
i.e., $K = R = 2$. The sources are given by
$$s_1(t) = 0.5\sin(6\pi t),$$
$$s_2(t) = (4t^2 - 2.8t)\exp(-t),$$
hence $L_1 = 2$ and $L_2 = 3$. Estimating $R$ and $L_r$ is beyond the scope of this paper; see, e.g., [17]. The true $R$ and $L_r$ are therefore used. $M$ equidistant samples are taken between 0 s and 1 s.
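The dehankelization step mentioned above, averaging over the anti-diagonals, can be sketched in a few lines of NumPy (our own toy version; the Tensorlab routine dehankelize used in the experiments is more general and avoids forming the matrix explicitly):

```python
import numpy as np

def dehankelize(H):
    """Recover a signal from a (possibly perturbed) I x J Hankel matrix
    by averaging over its anti-diagonals."""
    I, J = H.shape
    s = np.zeros(I + J - 1)
    counts = np.zeros(I + J - 1)
    for i in range(I):
        for j in range(J):
            s[i + j] += H[i, j]       # entry belongs to anti-diagonal i+j
            counts[i + j] += 1
    return s / counts

# an exact Hankel matrix of a signal is recovered exactly
m = np.random.randn(9)
H = np.array([[m[i + j] for j in range(5)] for i in range(5)])
assert np.allclose(dehankelize(H), m)
```

Applied to the low-rank products $\mathbf{A}_r\mathbf{B}^{\mathrm{T}}_r$, the averaging also suppresses part of the noise.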
In the first experiment, the SNR is varied for a fixed number of $M = 501$ samples and the relative error $E$ on the mixing matrix is compared when using the full Hankelized tensor or the implicitly Hankelized tensor, i.e., the structured format. Hankelization is performed using hankelize, which returns both the explicit and implicit tensorization. The resulting tensors of size $251 \times 251 \times 2$ are decomposed using ll1_nls, starting from a random initialization. From the factor matrices, the signals are recovered using dehankelize, without constructing the full tensor. As shown in Figure 4, when increasing the SNR from 0 dB to 300 dB, the error decreases at the same rate for the explicit and the implicit tensorization until the SNR reaches 180 dB: while the error continues to decrease for the explicit tensorization, the error stagnates at $E \approx 10^{-10}$ for the implicit tensorization. This loss of accuracy is explained in subsection 4.1 and only occurs for a very high SNR and in the case the sources are exact sums of exponential polynomials. As illustrated in the following experiments, a trade-off between computational cost and accuracy can be made. In the case of a high SNR, the solution using the implicit tensorization can be used to initialize the algorithm with the full tensor.

Figure 4. For low and medium SNR, the errors on the mixing matrix for the explicit and implicit Hankelization approaches are equal. While the error for the implicit tensorization stagnates for an SNR larger than 180 dB, the error continues to decrease when using the explicit tensorization. The errors are medians over 100 experiments, each using a best-out-of-five initializations strategy.
In the second experiment, the scaling behavior as a function of the number of samples is illustrated for $M = 501$, $5\,001$, $50\,001$ and $500\,001$ points. The resulting Hankelized tensors require 953 KiB, 95.3 MiB, 9.53 GiB and 953 GiB of memory, respectively, if formed explicitly. The SNR is fixed at 20 dB. We compute an unconstrained LL1 decomposition (ll1_nls) and a constrained BTD, which models the tensor as
$$\mathcal{T} = \sum_{r=1}^{R} \left(\mathbf{V}^{(r)}\mathbf{G}^{(r)}\mathbf{V}^{(r)\mathrm{T}}\right) \otimes \mathbf{c}_r,$$
in which $\mathbf{V}^{(r)}$ are confluent Vandermonde matrices and $\mathbf{G}^{(r)}$ are upper anti-triangular matrices; see [17] for details. By imposing the structure in $\mathbf{V}^{(r)}$, the recovered signals are guaranteed to be exponential polynomials, and the underlying poles can be recovered easily. These parametric constraints are modeled in the SDF framework [77] using struct_confvander and sdf_nls. Figure 5 and Figure 6 show the median time and error on the mixing matrix over 50 noise realizations. Each algorithm is initialized three and six times using random variables for the LL1 and the BTD model, respectively. The time and error for the best initialization, i.e., the one resulting in the lowest error on the mixing matrix, are retained. The timing results in Figure 5 clearly show that implicit Hankelization drastically reduces the computation time, allowing longer signals to be analyzed. The computation of the constrained BTD is sensitive to the initialization and is difficult from an optimization point of view, as the factors are ill-conditioned due to the generalized Vandermonde structure. This results in higher computation times and a lower accuracy for random initializations. Being able to analyze longer signals allows one to improve the accuracy for a fixed SNR, as is clear from Figure 6: by taking 100 times as many samples, $E$ decreases by a factor 10.
5. Conclusion. By rewriting the objective function and gradient expressions commonly used to compute tensor decompositions, four core operations are exposed:
Figure 5. For both the unconstrained LL1 decomposition and the constrained BTD, the computation time using implicit tensorization increases more slowly, enabling large-scale applications. For the LL1 decomposition, the time increases with a factor 14 ≈ 13 (O(M log₂ M)) using the efficient representation and with a factor 17 when the full tensor is used. The latter factor is better than the expected factor 100 as fewer iterations are required. Computing the constrained BTD is more difficult due to ill-conditioned factors, and the variation in time is large, causing larger deviations from the expected scaling factors. The results are medians over 50 experiments with multiple initializations.

Figure 6. The relative error E of the estimated mixing matrix decreases when the number of samples M increases. In the case of the unconstrained LL1 decomposition, E decreases with a factor 9.2 ≈ 10 when 100 times as many samples are used. The constrained BTD is more difficult from an optimization point of view and involves ill-conditioned factor matrices, resulting in a modest improvement in terms of error. The errors are computed using implicit tensorization and are medians over 50 noise realizations with multiple initializations. The results for full tensors are similar.
the squared norm, the inner product, the matricized tensor times Khatri–Rao product and the matricized tensor times Kronecker product. By specializing these operations for efficient representations of tensors, both the computational and the memory complexity are reduced from $\mathcal{O}\left(\text{tensor entries}\right)$ to $\mathcal{O}\left(\text{parameters}\right)$, as illustrated for the polyadic, Tucker and tensor train formats, as well as for Hankelization and Löwnerization. The operations can be used in many decomposition algorithms and frameworks, including ALS, second-order algorithms and algorithms for constrained and/or coupled decompositions. The numerical consequences of exploiting efficient representations are studied for the case in which a highly accurate solution is required. Finally, important concepts such as Tucker or TT compression for constrained decompositions and implicit tensorization allow large-scale datasets to be handled easily, as illustrated for nonnegative CPD and the Hankelization of mixtures of exponentials.
REFERENCES

[1] E. Acar, D. M. Dunlavy, and T. G. Kolda, A scalable optimization approach for fitting canonical tensor decompositions, J. Chemometrics, 25 (2011), pp. 67–86, doi:10.1002/cem.1335.
[2] E. Acar, T. G. Kolda, and D. M. Dunlavy, All-at-once optimization for coupled matrix and tensor factorizations, May 2011, arXiv:1105.3422.
[3] R. Badeau and R. Boyer, Fast multilinear singular value decomposition for structured tensors, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 1008–1021, doi:10.1137/060655936.
[4] B. Bader and T. G. Kolda, Efficient MATLAB computations with sparse and factored tensors, SIAM J. Sci. Comput., 30 (2007), pp. 205–231, doi:10.1137/060676489.
[5] H. N. Bharath, D. Sima, N. Sauwen, U. Himmelreich, L. De Lathauwer, and S. Van Huffel, Nonnegative canonical polyadic decomposition for tissue-type differentiation in gliomas, IEEE J. Biomed. Health Inform., 21 (2017), pp. 1124–1132, doi:10.1109/JBHI.2016.2583539.
[6] R. Bro, Multi-way analysis in the food industry: models, algorithms, and applications, PhD thesis, University of Amsterdam, 1998.
[7] R. Bro and C. A. Andersson, Improving the speed of multiway algorithms: Part II: Compression, Chemometr. Intell. Lab., 42 (1998), pp. 105–113, doi:10.1016/S0169-7439(98)00011-2.
[8] R. Bro, R. A. Harshman, N. D. Sidiropoulos, and M. E. Lundy, Modeling multi-way data with linearly dependent loadings, J. Chemometrics, 23 (2009), pp. 324–340, doi:10.1002/cem.1206.
[9] J. D. Carroll and J.-J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart–Young" decomposition, Psychometrika, 35 (1970), pp. 283–319, doi:10.1007/bf02310791.
[10] J. D. Carroll, S. Pruzansky, and J. B. Kruskal, CANDELINC: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters, Psychometrika, 45 (1980), pp. 3–24, doi:10.1007/bf02293596.
[11] J. H. Choi and S. V. N. Vishwanathan, DFacTo: Distributed factorization of tensors, in Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, Cambridge, MA, USA, 2014, MIT Press, pp. 1296–1304.
[12] A. Cichocki, D. Mandic, A.-H. Phan, C. Caiafa, G. Zhou, Q. Zhao, and L. De Lathauwer, Tensor decompositions for signal processing applications: From two-way to multiway component analysis, IEEE Signal Process. Mag., 32 (2015), pp. 145–163, doi:10.1109/msp.2013.2297439.
[13] J. E. Cohen, R. C. Farias, and P. Comon, Fast decomposition of large nonnegative tensors, IEEE Signal Process. Lett., 22 (2015), pp. 862–866, doi:10.1109/lsp.2014.2374838.
[14] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010.
[15] A. de Almeida, G. Favier, and J. C. M. Mota, A constrained factor decomposition with application to MIMO antenna systems, IEEE Trans. Signal Process., 56 (2008), pp. 2429–2442, doi:10.1109/TSP.2008.917026.
[16] L. De Lathauwer, Decompositions of a higher-order tensor in block terms — Part II: Definitions and uniqueness, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 1033–1066, doi:10.1137/070690729.
[17] L. De Lathauwer, Blind separation of exponential polynomials and the decomposition of a tensor in rank-(Lr, Lr, 1) terms, SIAM J. Matrix Anal. Appl., 32 (2011), pp. 1451–1474, doi:10.1137/100805510.
[18] L. De Lathauwer, B. De Moor, and J. Vandewalle, A multilinear singular value decomposition, SIAM J. Matrix Anal. Appl., 21 (2000), pp. 1253–1278, doi:10.1137/S0895479896305696.
[19] L. De Lathauwer, B. De Moor, and J. Vandewalle, On the best rank-1 and rank-(R1, R2, . . . , RN) approximation of higher-order tensors, SIAM J. Matrix Anal. Appl., 21 (2000), pp. 1324–1342, doi:10.1137/S0895479898346995.
[20] L. De Lathauwer and D. Nion, Decompositions of a higher-order tensor in block terms — Part III: Alternating least squares algorithms, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 1067–1083, doi:10.1137/070690730.
[21] O. Debals and L. De Lathauwer, The concept of tensorization, Technical Report 17–99, ESAT-STADIUS, KU Leuven, Belgium, 2017.
[22] O. Debals, L. De Lathauwer, and M. Van Barel, About higher-order Löwner tensors, Technical Report 17–98, ESAT-STADIUS, KU Leuven, Belgium, 2017, ftp://ftp.esat.kuleuven.be/pub/stadius/odebals/debals2017higher.pdf.
[23] O. Debals, M. Van Barel, and L. De Lathauwer, Löwner-based blind signal separation of rational functions with applications, IEEE Trans. Signal Process., 64 (2016), pp. 1909–1918, doi:10.1109/tsp.2015.2500179.
[24] W. Ding, L. Qi, and Y. Wei, Fast Hankel tensor-vector product and its application to exponential data fitting, Numer. Linear Algebra Appl., 22 (2015), pp. 814–832, doi:10.1002/nla.1970.
[25] I. Gohberg and V. Olshevsky, Complexity of multiplication with vectors for structured matrices, Linear Algebra Appl., 202 (1994), pp. 163–192, doi:10.1016/0024-3795(94)90189-9.
[26] I. Gohberg and V. Olshevsky, Fast algorithms with preprocessing for matrix-vector multiplication problems, Journal of Complexity, 10 (1994), pp. 411–427, doi:10.1006/jcom.1994.1021.
[27] L. Grasedyck, Hierarchical singular value decomposition of tensors, SIAM J. Matrix Anal. Appl., 31 (2010), pp. 2029–2054, doi:10.1137/090764189.
[28] L. Grasedyck, D. Kressner, and C. Tobler, A literature survey of low-rank tensor approximation techniques, GAMM-Mitteilungen, 36 (2013), pp. 53–78, doi:10.1002/gamm.201310004.
[29] W. Hackbusch, Tensor Spaces and Numerical Tensor Calculus, Springer Series in Computational Mathematics, 2012, doi:10.1007/978-3-642-28027-6.
[30] W. Hackbusch, B. N. Khoromskij, and E. E. Tyrtyshnikov, Hierarchical Kronecker tensor-product approximations, J. Numer. Math., 13 (2005), doi:10.1515/1569395054012767.
[31] W. Hackbusch and S. Kühn, A new scheme for the tensor representation, J. Fourier Anal. Appl., 15 (2009), pp. 706–722, doi:10.1007/s00041-009-9094-9.
[32] F. L. Hitchcock, The expression of a tensor or a polyadic as a sum of products, Journal of Mathematics and Physics, 6 (1927), pp. 164–189, doi:10.1002/sapm192761164.
[33] S. Holtz, T. Rohwedder, and R. Schneider, The alternating linear scheme for tensor optimization in the Tensor Train format, SIAM J. Sci. Comput., 34 (2012), pp. A683–A713, doi:10.1137/100818893.
[34] K. Huang, N. D. Sidiropoulos, and A. P. Liavas, A flexible and efficient algorithmic framework for constrained matrix and tensor factorization, IEEE Trans. Signal Process., 64 (2016), pp. 5052–5065, doi:10.1109/TSP.2016.2576427.
[35] M. Ishteva, P.-A. Absil, S. Van Huffel, and L. De Lathauwer, Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme, SIAM J. Matrix Anal. Appl., 32 (2011), pp. 115–135, doi:10.1137/090764827.
[36] M. Ishteva, L. De Lathauwer, P.-A. Absil, and S. Van Huffel, The best rank-(R1, R2, R3) approximation of tensors by means of a geometric Newton method, AIP Conference Proceedings, 1048 (2008), pp. 274–277, doi:10.1063/1.2990911.
[37] B. Jeon, I. Jeon, L. Sael, and U. Kang, SCouT: Scalable coupled matrix-tensor factorization — algorithm and discoveries, in 2016 IEEE 32nd International Conference on Data Engineering (ICDE), IEEE, May 2016, pp. 811–822, doi:10.1109/icde.2016.7498292.
[38] I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos, HaTen2: Billion-scale tensor decompositions, in 2015 IEEE 31st International Conference on Data Engineering, IEEE, Apr. 2015, pp. 1047–1058, doi:10.1109/icde.2015.7113355.
[39] U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos, GigaTensor: Scaling tensor analysis up by 100 times — algorithms and discoveries, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining — KDD '12, 2012, doi:10.1145/2339530.2339583.
[40] O. Kaya and B. Uçar, Scalable sparse tensor decompositions in distributed memory systems, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis — SC '15, ACM, 2015, pp. 77:1–77:11, doi:10.1145/2807591.2807624.
[41] O. Kaya and B. Uçar, High performance parallel algorithms for the Tucker decomposition of sparse tensors, in 2016 45th International Conference on Parallel Processing (ICPP), IEEE, Aug. 2016, pp. 103–112, doi:10.1109/icpp.2016.19.
[42] C. Kelley, Iterative Methods for Optimization, SIAM, 1999.
[43] B. N. Khoromskij, Structured rank-(r1, . . . , rD) decomposition of function-related tensors in R^D, Comput. Methods Appl. Math., 6 (2006), doi:10.2478/cmam-2006-0010.
[44] B. N. Khoromskij and V. Khoromskaia, Low rank Tucker-type tensor approximation to classical potentials, Central European Journal of Mathematics, 5 (2007), pp. 523–550, doi:10.2478/s11533-007-0018-0.