EXPLOITING EFFICIENT REPRESENTATIONS IN LARGE-SCALE
TENSOR DECOMPOSITIONS∗

NICO VERVLIET†, OTTO DEBALS†, AND LIEVEN DE LATHAUWER†
Abstract. Decomposing tensors into simple terms is often an essential step to discover and understand underlying processes or to compress data. However, storing the tensor and computing its decomposition is challenging in a large-scale setting. In many cases, though, a tensor is structured, i.e., it can be represented using few parameters: a sparse tensor is determined by the positions and values of its nonzeros, a polyadic decomposition by its factor matrices, a tensor train by its core tensors, a Hankel tensor by its generating vector, etc. We show that the complexity of tensor decomposition algorithms is reduced significantly in terms of time and memory if these efficient representations are exploited directly, thereby avoiding the explicit construction of multiway arrays. Key to this performance increase is rewriting the least squares cost function, which exposes core operations such as norms and inner products. Moreover, as the optimization variables do not change, constraints and coupling can be handled trivially. To illustrate this, large-scale nonnegative tensor factorization is performed using Tucker and tensor train compression. We show how vector and matrix data can be analyzed using tensorization while keeping a vector or matrix complexity through the new concept of implicit tensorization, as illustrated for Hankelization and Löwnerization. The concepts and numerical properties are extensively investigated with experiments.
Key words. tensor decompositions, canonical polyadic decomposition, parafac, Tucker decomposition, block term decomposition, Hankel, Löwner, large-scale, nonnegative factorization, tensorization

AMS subject classifications. 15A69
1. Introduction. In signal processing and data analysis, tensors are often given as multiway arrays of numerical values. Tensor decompositions are then used to discover patterns, separate signals, model behavior, cluster similar phenomena, detect anomalies, predict missing data, etc. More concrete examples include intrusion detection in computer networks [61], analysis of food samples [6], crop classification using hyperspectral imaging [88], direction of arrival estimation using a grid of antennas [68] and modeling melting temperatures of alloys [86]. In these contexts, the canonical polyadic decomposition (CPD), the low multilinear rank approximation (LMLRA) or Tucker decomposition, and the block term decomposition (BTD) are common. To improve interpretability, constraints such as nonnegativity and smoothness can be imposed. When multiple datasets describing (partially) the same phenomenon are available, coupling or data fusion allows more meaningful patterns to be extracted. More examples can be found in several overview papers; see, e.g., [12, 48, 69].
Given as dense arrays of numerical values, the cost of storing and processing tensors tends to increase quickly, especially for high orders: for an Nth-order tensor of size I × I × · · · × I, the number of entries is I^N. To deal with these large-scale tensors, various techniques have emerged to handle their decompositions, e.g., by deliberately sampling entries [84, 86], through randomization [83], or through compression and/or parallelism, e.g., [11, 39, 51, 61, 70, 73]. Here we consider tensors that are structured and can be represented efficiently using fewer parameters than the number of entries in the dense array. Moreover, we assume that this structure is exploitable to reduce the computational and memory complexity from O(tensor entries) to O(parameters). We identify three important types of efficient representation that allow such exploitation: compact, exact representation; compression-based representation; and (implicit) tensorization.

∗Submitted to the editors on October 16, 2017.
Funding: Nico Vervliet is supported by an Aspirant Grant from the Research Foundation — Flanders (FWO). Otto Debals is supported by a Ph.D. grant of the Agency for Innovation by Science and Technology (IWT). Research furthermore supported by: (1) Flemish Government: FWO: projects: G.0830.14N, G.0881.14N; (2) EU: The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Advanced Grant: BIOTENSORS (no. 339804). This paper reflects only the authors' views and the Union is not liable for any use that may be made of the contained information; (3) KU Leuven Internal Funds C16/15/059.
†KU Leuven, Dept. of Electrical Engineering ESAT/STADIUS, Kasteelpark Arenberg 10, bus 2446, B-3001 Leuven, Belgium; and Group Science, Engineering and Technology, KU Leuven - Kulak, E. Sabbelaan 53, 8500 Kortrijk, Belgium (Nico.Vervliet@kuleuven.be, Otto.Debals@gmail.com,
For the first type, the tensor has an exact and known structure that can be represented compactly. A sparse tensor, for instance, is represented by the indices and values of its nonzeros, or more efficiently using compressed formats [4, 71]. For sparse tensors, the compact representation has been exploited extensively; see, e.g., [4, 11, 37, 38, 39, 40, 41, 49, 61, 71, 72, 73]. A system may also be modeled mathematically as a polyadic decomposition, with a possibly non-minimal number of rank-1 terms. In this case, the factor matrices form the compact representation [52, 64].
Second, in the case of compression-based representations, the tensor is first approximated using a decomposition that is easier or cheaper to compute, and this approximation is then used to speed up subsequent computations. Common examples are a truncated MLSVD [19], a hierarchical Tucker approximation [27, 31] or a TT approximation [59]. Tucker compression is used extensively as a preprocessing step to speed up the computation of an unconstrained CPD, as it suffices to decompose the core tensor [7], which follows from the CANDELINC model [10, 46]. To design effective updating algorithms, a previously computed compact representation of the old data combined with a newly arrived slice is used in [79]. In scientific computing, recompression or rounding is often used to lower the multilinear rank of a PD or a Tucker decomposition [44, 45, 58, 66, 67], or to lower the TT rank of a tensor given as a PD [59] or a TT approximation [33, 55]. Other examples include the approximation of a matrix or tensor by a greedy orthogonal rank-1 decomposition [47], a hierarchical Kronecker tensor product decomposition [30] or a two-level rank-(r_1, . . . , r_d) decomposition [43].
The third type is implicit tensorization. Tensorization involves mapping vectors or matrices to higher-order tensors [12, 21]. By decomposing the obtained tensors, mild uniqueness conditions of tensor decompositions can be leveraged. However, for some types of tensorization, the number of data points increases drastically. For example, third-order Hankelization of a vector of length M results in a tensor with O(M³) entries [68], while Löwnerization [23] of K observations of mixed rational functions, sampled in M points, may result in a tensor of dimensions M/2 × M/2 × K. In both examples, the signal length M is limited by the respective cubic and quadratic dependence of the number of entries on M. To avoid this data explosion, we propose the concept of implicit tensorization: by operating on the underlying data directly, no explicit construction of the tensor is required. Implicit tensorization alleviates computation and memory cost if the data underlying the tensor has few parameters and the structure of this representation can be exploited easily, e.g., through fast Fourier transforms (FFT); see section 3. Examples are Hankelization, Löwnerization, Toeplitzization, segmentation with overlap and outer product structures [5, 21]. In this way, longer signals can be handled, as illustrated in subsection 4.3.
1.1. Contributions. We explain how efficient representations originating from each of the three types can be exploited in optimization-based algorithms for the computation of a CPD, a decomposition into multilinear rank-(L_r, L_r, 1) terms (LL1), an LMLRA or a general BTD. By rewriting the least squares loss function, core operations are exposed, each of which can be implemented such that the structure underlying the efficient representation is exploited. While a number of papers propose similar ideas for the computation of a CPD or LMLRA in the specific cases of sparse tensors [4, 11, 37, 38, 39, 40, 41, 49, 61, 71, 72, 73] or CPD and LMLRA approximations [89, 90, 91], we show that these ideas fit in a broader framework for structured tensor decompositions, including tensorization methods. Moreover, we show that, in contrast to previous results, both first- and second-order optimization algorithms can be used and that constraints and joint factorizations are handled trivially. We also pay attention to often neglected numerical issues. Finally, we show that our approach enables enormous speedups for large-scale problems even when constraints are imposed or when multiple datasets are jointly factorized, in contrast to the traditional compression approach using the CANDELINC model [7], which cannot be used in these cases. We demonstrate this experimentally for large-scale nonnegative tensor factorization using Tucker and TT compression and for implicit Hankelization of signals with up to 500 000 samples. Matlab implementations of the framework are available in Tensorlab [87].
1.2. Outline. After discussing the notation in the remainder of this section, section 2 explains how efficient representations of structured tensors can be exploited by rewriting the Frobenius norm objective function and gradient for the CPD, LL1, LMLRA and BTD. Implementations of the four core operations — norm, inner product, matricized tensor times Khatri–Rao product and matricized tensor times Kronecker product — are given in section 3 for structured tensors given as a CPD, LMLRA, TT, Hankel tensor or Löwner tensor. Finally, the experiments in section 4 discuss the numerical properties of the presented framework and illustrate the performance for compression-based nonnegative CPD and blind separation of exponential polynomials through implicit Hankelization.
1.3. Notation. Scalars, vectors, matrices and tensors are denoted by lower case, e.g., a, bold lower case, e.g., a, bold upper case, e.g., A, and calligraphic, e.g., T, letters, respectively. K denotes either R or C. Sets are indexed by superscripts within parentheses, e.g., A^(n), n = 1, . . . , N. A mode-n vector is the generalization of a column (mode-1) and row (mode-2) vector and is defined by fixing all but the nth index of an Nth-order tensor. The mode-n unfolding of a tensor T is denoted by T_(n) and has the mode-n vectors as its columns. The mode-(m, n) unfolding is defined similarly and is denoted by T_(m,n). The vectorization operator vec(T) stacks all mode-1 vectors into a column vector.

A number of products are needed. The mode-n tensor-matrix product is denoted by T ·_n A. The Kronecker and Hadamard, or element-wise, products are denoted by ⊗ and ∗, respectively. The column-wise (row-wise) Khatri–Rao product between two matrices, denoted by ⊙ (⊙^T), is defined as the column-wise (row-wise) Kronecker product. To simplify expressions, the following shorthand notations are used:

    ⊗_n ≡ ⊗_{n=N}^{1},    ⊗_{k≠n} ≡ ⊗_{k=N, k≠n}^{1},

where N is the order of the tensor and indices are traversed in reverse order. Similar definitions are used for ⊙, ⊙^T and ∗. To reduce the number of parentheses, we assume the matrix product takes precedence over ⊗, ⊙, ⊙^T and ∗, e.g., AB ⊙ CD ≡ (AB) ⊙ (CD). The complex conjugate, transpose and conjugated transpose are denoted by ·̄, ·^T and ·^H, respectively. The column-wise concatenation of vectors a and b is written as x = [a; b] and is shorthand for x = [a^T b^T]^T. The inner product between A and B is denoted by ⟨A, B⟩ = vec(B)^H vec(A). The Frobenius norm is denoted by ‖·‖. Re(·) returns the real part of a complex number. I_I is the I × I identity matrix and 1_I is a length-I column vector of ones. Finally, V^f is the row-wise 'flipped' version of V ∈ K^{I×J}, i.e., V^f(i, :) = V(I − i + 1, :) for i = 1, . . . , I. See [12] for more details.
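These product definitions are easy to check numerically. The following is a small illustrative sketch (not from the paper) in NumPy; the helper construction for the column-wise Khatri–Rao product is our own, since NumPy only provides the Kronecker product directly.

```python
import numpy as np

A = np.arange(6.0).reshape(3, 2)        # 3 x 2
B = np.arange(4.0).reshape(2, 2)        # 2 x 2

# Kronecker product: a (3*2) x (2*2) matrix.
K = np.kron(A, B)

# Column-wise Khatri-Rao product: the r-th column is the Kronecker
# product of the r-th columns, giving a (3*2) x 2 matrix.
KR = np.column_stack([np.kron(A[:, r], B[:, r]) for r in range(2)])

# Hadamard (element-wise) product: requires equal sizes.
H = B * B

print(K.shape, KR.shape, H.shape)       # (6, 4) (6, 2) (2, 2)
```

Note how the Khatri–Rao product keeps the number of columns fixed, which is why it appears whenever CPD factor matrices are combined column by column.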
2. Exploiting efficient representations. Many tensor decompositions used in signal processing and data analysis can be computed by minimizing the least squares (LS) error between the approximation and the given tensor T:

    min_z f(z)   with   f(z) = (1/2) ‖M(z) − T‖²,    (1)

in which M(z) is a tensor decomposition with variables z. Various algorithms have been proposed to solve (1), including alternating least squares (ALS) [8, 9, 20, 32, 50], nonlinear conjugate gradient (NCG) [1], quasi-Newton (qN) [76] and Gauss–Newton (GN) [76, 77, 78] algorithms. We first review the notation for common tensor decompositions in subsection 2.1. In subsection 2.2 we give an overview of LS optimization theory for algorithms using gradients (first-order information) and Hessian approximations (second-order information) and show that many approximations to the Hessian do not depend on the data T. Hence, second-order algorithms can be used without additional changes in implementation compared to first-order algorithms. Finally, the core operations required to reduce the computational complexity are introduced by rewriting f(z) in subsection 2.3.
2.1. Overview of tensor decompositions. The techniques derived in this paper focus on four main decompositions, which can be grouped into two families from an optimization point of view: CPD-based and BTD-based algorithms. Note that we use different families compared to the usual division between decompositions computed using numerical linear algebra techniques such as singular value decompositions (SVD), e.g., MLSVD, TT and hierarchical Tucker, and decompositions usually computed through optimization (CPD, LL1, BTD). Here, we focus mainly on notation; for pointers to uniqueness results and applications, we refer to the overview papers [12, 28, 29, 48, 69].

CPD-based algorithms. The (canonical) polyadic decomposition (CPD) writes an Nth-order tensor as a (minimal) sum of R rank-1 terms, each of which is an outer product, denoted by ⊗, of N nonzero factor vectors a_r^(n):

    M_CPD(z) = Σ_{r=1}^{R} a_r^(1) ⊗ · · · ⊗ a_r^(N) =: ⟦A^(1), . . . , A^(N)⟧.

Each factor matrix A^(n) contains a_r^(n), r = 1, . . . , R, as its columns, and the variables z are the vectorized factor matrices A^(n), i.e., z = [vec(A^(1)); . . . ; vec(A^(N))]. Depending on the field, the CPD is also known as PARAFAC, CANDECOMP or tensor/separation rank decomposition.

The decomposition into multilinear rank-(L_r, L_r, 1) terms (LL1) is accommodated within this family as it is often computed as a constrained CPD [8, 20, 76]:

    M_LL1(z) = Σ_{r=1}^{R} (A_r B_r^T) ⊗ c_r = ⟦A, B, CP⟧,

in which A_r ∈ K^{I1×Lr}, B_r ∈ K^{I2×Lr} for r = 1, . . . , R, A = [A_1 · · · A_R], B = [B_1 · · · B_R], C = [c_1, . . . , c_R] and P is a binary matrix replicating column c_r L_r times, i.e., P = blockdiag(1_{L1}^T, . . . , 1_{LR}^T). Similar to the CPD, z = [vec(A); vec(B); vec(C)]. The LL1 decomposition is similar to PARALIND [8]. CONFAC generalizes this decomposition by allowing an arbitrary full row rank matrix P in every mode [15]. In [76], the decomposition into (rank-L_r ⊗ rank-1) terms is discussed as one generalization of M_LL1 to higher-order tensors.

BTD-based algorithms. The low multilinear rank approximation (LMLRA) or Tucker decomposition writes a tensor as a multilinear transformation of a core tensor S ∈ K^{J1×···×JN} by the factor matrices U^(n) ∈ K^{In×Jn}, n = 1, . . . , N:

    M_LMLRA(z) = S ·_1 U^(1) · · · ·_N U^(N).

An LMLRA can be computed using SVDs of tensor unfoldings, i.e., as a (truncated) multilinear singular value decomposition (MLSVD) [18, 82], by higher-order orthogonal iteration [19] or through optimization [35, 36, 76, 77].

The more general block term decomposition writes a tensor as a sum of R multilinear rank-(J_{r1}, J_{r2}, . . . , J_{rN}) terms [16]:

    M_BTD(z) = Σ_{r=1}^{R} S^(r) ·_1 U^(r,1) · · · ·_N U^(r,N).

From an optimization point of view, the difference between M_LMLRA and M_BTD is the additional summation over r = 1, . . . , R. For simplicity of notation, we only use M_LMLRA in the derivations in this paper.
2.2. Optimization for least squares problems. In gradient-based optimization, an initial guess z_0 for the variables is iteratively refined as

    z_k = z_{k−1} + α_k p_k,

with the step direction p_k found by solving

    H_k p_k = −g_k,    (2)

in which g_k and H_k are the gradient and the Hessian (or an approximation) of f(z) w.r.t. z in (1), respectively, assuming M(z) is analytic in z [75]. The step size α_k ensures sufficient decrease in the objective function value, e.g., through line search. Alternatively, z can be updated using trust-region approaches or using plane search [53, 74, 76]. The linear system (2) can be solved using direct (pseudo)inversion or using (preconditioned) conjugate gradients [53].

For the LS objective function (1), various Hessian approximations are used depending on the algorithm, e.g.,
• H_k = I for gradient descent,
• H_k = (I − U_{k−1}^H) H_{k−1} (I − U_{k−1}) + V_{k−1}, with U_{k−1} and V_{k−1} rank-1 matrices constructed using previous gradients and updates of z, for BFGS,
• H_k = J_k^H J_k with J_k = ∂vec(M(z) − T)/∂z |_{z=z_k} = ∂vec(M(z))/∂z |_{z=z_k} for Gauss–Newton,
• H_k = J_k^H J_k + λ_k I for Levenberg–Marquardt.

None of these approximations of the Hessian depends on the tensor T. Therefore, only the objective function evaluation and the gradient computation require adaptation to exploit structure in T. Note that this also holds for ALS algorithms.
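The update (2) with the Levenberg–Marquardt approximation from the list above can be sketched for the simplest bilinear model, a real rank-1 matrix M(z) = a bᵀ with z = [a; b]. This is an illustrative sketch, not code from the paper; note that the damping λ also regularizes the scaling indeterminacy (a, b) → (ca, b/c), which makes JᵀJ singular.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
T = np.outer(rng.standard_normal(m), rng.standard_normal(n))  # rank-1 data

a, b = rng.standard_normal(m), rng.standard_normal(n)         # initial guess z0
lam = 1e-2                                                    # LM damping
for _ in range(200):
    F = np.outer(a, b) - T                                    # residual M(z) - T
    # Jacobian of vec(a b^T) w.r.t. z = [a; b] (column-major vectorization).
    J = np.hstack([np.kron(b.reshape(-1, 1), np.eye(m)),
                   np.kron(np.eye(n), a.reshape(-1, 1))])
    g = J.T @ F.ravel(order='F')                              # gradient g_k
    p = np.linalg.solve(J.T @ J + lam * np.eye(m + n), -g)    # step, cf. (2)
    a_new, b_new = a + p[:m], b + p[m:]
    if np.linalg.norm(np.outer(a_new, b_new) - T) < np.linalg.norm(F):
        a, b, lam = a_new, b_new, lam / 3    # accept step, reduce damping
    else:
        lam *= 3                             # reject step, increase damping
print(np.linalg.norm(np.outer(a, b) - T))    # final residual norm
```

Because the data T enters only through the residual F (and hence g), swapping in any of the listed Hessian approximations leaves the data-dependent part of the iteration unchanged, as claimed above.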
2.3. Exploiting efficient representations. Due to the explicit construction of the residual tensor F = M(z) − T in (1) and the computation of the gradient, the per-iteration complexity is governed by the number of entries in the tensor, i.e., O(Π_n I_n). To reduce the complexity, we assume T is a function of a limited number of parameters. The parameters can, for instance, be the set of positions and values of nonzero entries in a sparse tensor, the factor matrices of a CPD or the underlying signal in a Hankelized tensor. To exploit the structure in T and to avoid the creation of the residual tensor, we rewrite the objective function and gradients, which reveals the core operations. More concretely, we have

    f(z) = (1/2) ‖M(z)‖² − Re(⟨M(z), T⟩) + (1/2) ‖T‖²,    (3)

which requires the efficient computation of the norms of the model and the tensor, and the inner product between the model and the tensor. Similarly, the gradient, which is given by ḡ = J^H vec(F) with J = ∂vec(F)/∂z the Jacobian matrix, is rewritten by grouping the model and data dependent terms as

    ḡ = J^H vec(M(z)) − J^H vec(T) = ḡ_M − ḡ_T.    (4)

(In the derivations, we work with g rather than ḡ.) The core operations are the mtkrprod for the CPD-based algorithms and the mtkronprod for the BTD-based algorithms, as shown below. Extensions to constraints and coupling conclude this section.
CPD-based algorithms. Let us partition g according to the variables z, i.e., in the case M(z) is a CPD, we have

    g = [vec(G^(1)); . . . ; vec(G^(N))]   with   vec(G^(n)) = ∂f/∂vec(A^(n)),   n = 1, . . . , N.

By grouping the model and data dependent terms as in (4), the gradients become

    G_M^(n) = Ā^(n) (∗_{k≠n} A^(k)H A^(k)),
    G_T^(n) = T̄_(n) (⊙_{k≠n} A^(k)).    (5)

It is clear that G_M^(n) can be computed efficiently after forming the inner products A^(k)H A^(k), k = 1, . . . , N. We therefore focus on the operation in (5), which involves the common matricized tensor times Khatri–Rao product (mtkrprod).
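For real data (so that the conjugates can be dropped), the two terms in (5) can be checked against the gradient obtained from an explicitly formed residual. The following NumPy sketch, written under those assumptions, is ours rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R = 4, 5, 6, 3
A, B, C = (rng.standard_normal(s) for s in [(I, R), (J, R), (K, R)])
T = rng.standard_normal((I, J, K))

def kr(X, Y):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(X[:, r], Y[:, r]) for r in range(X.shape[1])])

# Model-dependent term: only small R x R Gramians are needed.
GM = A @ ((B.T @ B) * (C.T @ C))
# Data-dependent term: one mtkrprod with the mode-1 unfolding of T.
T1 = T.reshape(I, J * K, order='F')
GT = T1 @ kr(C, B)

# Reference: gradient of 0.5*||[[A,B,C]] - T||^2 w.r.t. A, computed
# from the explicitly formed residual.
F1 = A @ kr(C, B).T - T1
assert np.allclose(GM - GT, F1 @ kr(C, B))
```

The point of the split is visible in the shapes: GM never touches the I × J × K array, and GT is exactly one mtkrprod, the operation the rest of the paper implements for each efficient representation.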
In the case M(z) is an LL1 decomposition, A^(3) = CP is used to form the inner products and to compute the mtkrprod. From the chain rule it follows that for n = 3, G^(n) is given by

    G_M^(3) = Ā^(3) (∗_{k≠3} A^(k)H A^(k)) P^T,
    G_T^(3) = T̄_(3) (⊙_{k≠3} A^(k)) P^T.

BTD-based algorithms. Similarly, in the case M(z) is an LMLRA, the gradient is partitioned as

    g = [vec(G^(0)); vec(G^(1)); . . . ; vec(G^(N))]

with

    vec(G^(0)) = ∂f/∂vec(S),   vec(G^(n)) = ∂f/∂vec(U^(n)),   n = 1, . . . , N.

More specifically, the gradient expressions for the core tensor are given by

    G_M^(0) = S̄ ·_1 (U^(1)T Ū^(1)) · · · ·_N (U^(N)T Ū^(N)),
    G_T^(0) = T̄ ·_1 U^(1)T · · · ·_N U^(N)T,    (6)

and the expressions for the factor matrices by

    G_M^(n) = Ū^(n) S̄_(n) (⊗_{k≠n} U^(k)H U^(k)) S_(n)^T,
    G_T^(n) = T̄_(n) (⊗_{k≠n} U^(k)) S_(n)^T    (7)

for n = 1, . . . , N. In the case of a more general BTD with R terms, an additional summation is introduced; see, e.g., [77]. As (6) can be written as vec(G_T^(0))^T = vec(T)^H (⊗_n U^(n)), both (6) and (7) require a vectorized or matricized tensor times Kronecker product (mtkronprod). Note that

    vec(G_T^(0))^T = vec(U^(1)T T̄_(1) (⊗_{k≠1} U^(k))),

hence the mtkronprod from (7) can be reused, although this may lead to a higher complexity.
Core operations, constraints and coupling. In order to reduce the complexity from O(tensor entries) to O(parameters), structure-exploiting implementations of the Frobenius norm and of inner products with CPD and BTD models are required to evaluate the objective function (3), as well as implementations of the mtkrprod and the mtkronprod for CPD and BTD gradients, respectively. Examples of such implementations are discussed in section 3. In contrast to approaches taken in, e.g., [7, 13], the original factor matrices and core tensors remain the optimization variables. Therefore, algorithms for coupling datasets, e.g., [2, 77], only require new implementations of the core operations. Similarly, parametric constraints such as nonnegativity, orthogonality or Vandermonde structure can be handled using the chain rule as in [77], as can other types of constraints involving projection onto a feasible set; see, e.g., [34, 42]. Concrete examples are given in the experiments in section 4.
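The rewritten objective (3) can be evaluated from the core operations alone. A small NumPy sketch for a real-valued CPD model (our illustration, with the dense tensor formed only to verify the result):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, R = 4, 5, 6, 2
A, B, C = (rng.standard_normal(s) for s in [(I, R), (J, R), (K, R)])
T = rng.standard_normal((I, J, K))
T1 = T.reshape(I, J * K, order='F')          # mode-1 unfolding

def kr(X, Y):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(X[:, r], Y[:, r]) for r in range(X.shape[1])])

# f(z) = 0.5||M||^2 - <M,T> + 0.5||T||^2: the model norm comes from
# R x R Gramians, the inner product from one mtkrprod, and ||T||^2 is
# a constant that can be precomputed once.
normM2 = np.sum((A.T @ A) * (B.T @ B) * (C.T @ C))
inner = np.sum(A * (T1 @ kr(C, B)))
f_fast = 0.5 * normM2 - inner + 0.5 * np.linalg.norm(T) ** 2

# Reference: objective from the explicitly formed residual.
f_ref = 0.5 * np.linalg.norm(A @ kr(C, B).T - T1) ** 2
assert np.isclose(f_fast, f_ref)
```

When T itself is stored in an efficient representation, the same three terms are computed from that representation instead of from an unfolding, which is exactly what section 3 provides.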
3. Operations on efficient representations. In the previous section, the computation of the norm, inner product, mtkrprod and mtkronprod has been identified as the four key operations that allow tensor decomposition algorithms to work on structured tensors. In this section, we show for a number of efficiently represented tensors that the explicit construction of the full tensor is not required and that the complexity is governed by the number of parameters in the efficient representation, rather than the number of entries in the tensor. First, the polyadic and Tucker formats are discussed, as expressions involving one of these formats occur in all of the discussed decompositions. Then, efficient implementations for the TT format and two implicit tensorizations — Hankelization and Löwnerization — are derived. These and other implementations are available in Tensorlab [87].

Unless stated otherwise, only third-order tensors T ∈ K^{I×I×I} are considered to simplify notations and complexity expressions for the two computational families: the rank-R CPD M_CPD = ⟦A, B, C⟧ and the multilinear rank-(R, R, R) LMLRA M_LMLRA = S ·_1 U ·_2 V ·_3 W. The extension to a BTD introduces additional summations and is omitted here; see [77]. For the mtkrprod and mtkronprod operations, only the case n = 1 is shown, unless the expressions do not generalize trivially for n = 2, . . . , N. Table 1 gives a summary of the computational per-iteration complexity when computing a rank-R CPD and clearly illustrates the independence of the total number of entries I^N.
3.1. Polyadic format. In the polyadic format, a third-order tensor T is given by a set of factor matrices X, Y and Z with F columns, i.e., T = ⟦X, Y, Z⟧. The number of variables is therefore F Σ_n I_n instead of Π_n I_n. Efficient implementations in the case T admits a PD, with F not necessarily equal to R, have been presented in [4] (see Kruskal tensor) and are here extended to complex data:

    ‖T‖² = 1^T (X^H X ∗ Y^H Y ∗ Z^H Z) 1,
    ⟨M_CPD, T⟩ = 1^T (X^H A ∗ Y^H B ∗ Z^H C) 1,
    T̄_(1) (C ⊙ B) = X̄ (Y^H B ∗ Z^H C).    (8)

The complexity of these operations is governed by the construction of the inner products, which requires O(NIF²) and O(NIFR) flop. Each mtkrprod in (8) required for G_T^(n) can then be computed using O(IFR) flop.
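The identities in (8) can be verified for real data (dropping the conjugates) with a short NumPy sketch of our own; the dense tensor is built only as a reference:

```python
import numpy as np

rng = np.random.default_rng(3)
I, F_, R = 5, 4, 3                  # tensor size I^3, F terms in T, CPD rank R
X, Y, Z = (rng.standard_normal((I, F_)) for _ in range(3))
A, B, C = (rng.standard_normal((I, R)) for _ in range(3))

def kr(U, V):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(U[:, r], V[:, r]) for r in range(U.shape[1])])

# Core operations on T = [[X, Y, Z]] without forming the I x I x I array:
norm2 = np.sum((X.T @ X) * (Y.T @ Y) * (Z.T @ Z))    # ||T||^2
inner = np.sum((X.T @ A) * (Y.T @ B) * (Z.T @ C))    # <M_CPD, T>
mtkr = X @ ((Y.T @ B) * (Z.T @ C))                   # T_(1) (C kr B)

# Reference: dense mode-1 unfolding of T.
T1 = X @ kr(Z, Y).T
assert np.isclose(norm2, np.linalg.norm(T1) ** 2)
assert np.isclose(inner, np.sum((A @ kr(C, B).T) * T1))
assert np.allclose(mtkr, T1 @ kr(C, B))
```

All three quantities are obtained from F × F or F × R inner-product matrices, matching the stated O(NIF²) and O(NIFR) costs.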
To compute an LMLRA, the inner product with an LMLRA

    ⟨M_LMLRA, T⟩ = trace(S ·_1 X^H U ·_2 Y^H V ·_3 Z^H W)    (9)

and the mtkronprod operations are required:

    T̄_(1) (W ⊗ V) = X̄ (W^H Z ⊙ V^H Y)^H,    (10)
    vec(T)^H (W ⊗ V ⊗ U) = vec(⟦U^H X, V^H Y, W^H Z⟧)^H.    (11)
Table 1
Computational per-iteration complexity when computing a rank-R CPD of an Nth-order I × · · · × I tensor given in its efficient representation. For Hankelization and Löwnerization, second-order tensorization of K signals is assumed, resulting in a tensor of size I × I × K. The number of samples is M = 2I − 1 for Hankel and M = 2I for Löwner.

Structure  Quantity     Complexity
Dense      Objective    O(R I^N)
           Gradient     O(N R I^N)
CPD        Parameters   N I F
           Objective    O(N I F R + N I R²)
           Gradient     O(N I F R + N I R²)
LMLRA      Parameters   O(N I J + J^N)
           Objective    O(N I J R + J^N R + N I R²)
           Gradient     O(N I J R + J^N N R + N I R²)
TT         Parameters   O(N I r²)
           Objective    O(N I r² R + N I R²)
           Gradient     O(N I r² R + N I R²)
Hankel     Parameters   M K
           Objective    O(M (K + log₂ M) R + (M + K) R²)
           Gradient     O(M (K + log₂ M) R + (M + K) R²)
Löwner     Parameters   M (K + 1)
           Objective    O(M (K + log₂ M) R + (M + K) R²)
           Gradient     O(M (K + log₂ M) R + (M + K) R²)
The computation of the trace in (9) requires only the diagonal entries of the LMLRA and takes O(NIFR + FR^N) flop, including the construction of the inner products. To compute the actual gradient G_T^(1), the result of (10) is multiplied by S_(1)^T, requiring the product (W^H Z ⊙ V^H Y)^H S_(1)^T, which is a transposed mtkrprod involving usually small matrices. This product can be computed efficiently using, e.g., [63, 81] and requires O(NFR^N + NIFR) flop. The complexity of (11) is the same.
3.2. Tucker format. In the Tucker format, a third-order tensor T = Q ·_1 X ·_2 Y ·_3 Z is defined by a third-order core tensor Q ∈ K^{J1×J2×J3} and the factor matrices X ∈ K^{I×J1}, Y ∈ K^{I×J2}, Z ∈ K^{I×J3}. The Tucker format often follows from the computation of an LMLRA or an MLSVD, e.g., as the result of a compression step [7]. For the complexity analysis, we assume J_n = J for n = 1, . . . , N. We assume that the factor matrices have orthonormal columns, which is always possible through normalization.

The norm of T is computed in O(J^N) operations as

    ‖T‖² = ‖Q‖².

The inner product with M_CPD and the mtkrprod are similar to (9) and (10):

    ⟨M_CPD, T⟩ = trace(Q̄ ·_1 A^T X̄ ·_2 B^T Ȳ ·_3 C^T Z̄),
    T̄_(1) (C ⊙ B) = X̄ Q̄_(1) (Z^H C ⊙ Y^H B),

and both require O(NIJR + J^N R) flop.

To compute an LMLRA, the following expressions for the inner product and the mtkronprod in (7) can be used:

    ⟨M_LMLRA, T⟩ = ⟨S, Q ·_1 U^H X ·_2 V^H Y ·_3 W^H Z⟩,
    T̄_(1) (W ⊗ V) = X̄ Q̄_(1) (Z^H W ⊗ Y^H V).

Similarly, the mtkronprod in (6) is computed as

    vec(T)^H (W ⊗ V ⊗ U) = vec(Q ·_1 U^H X ·_2 V^H Y ·_3 W^H Z)^H.

All computations involving the LMLRA require O(NIJR + J^N R + JR^N) flop.
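For real data, the Tucker-format norm and mtkrprod expressions can be checked as follows. This sketch is our own illustration; the factor matrices are orthonormalized with a QR factorization, which stands in for the normalization assumed above:

```python
import numpy as np

rng = np.random.default_rng(6)
I, J, R = 8, 3, 2
Q = rng.standard_normal((J, J, J))                   # core tensor
# Orthonormal factor matrices (always achievable through normalization).
X, Y, Z = (np.linalg.qr(rng.standard_normal((I, J)))[0] for _ in range(3))
B, C = rng.standard_normal((I, R)), rng.standard_normal((I, R))

def kr(U, V):
    """Column-wise Khatri-Rao product."""
    return np.column_stack([np.kron(U[:, r], V[:, r]) for r in range(U.shape[1])])

# Core operations on T = Q ._1 X ._2 Y ._3 Z without forming T
# (real data, so Hermitian transposes become plain transposes):
Q1 = Q.reshape(J, J * J, order='F')                  # mode-1 unfolding of Q
norm2 = np.linalg.norm(Q) ** 2                       # ||T||^2 = ||Q||^2
mtkr = X @ Q1 @ kr(Z.T @ C, Y.T @ B)                 # T_(1) (C kr B)

# Reference: construct T densely.
T = np.einsum('abc,ia,jb,kc->ijk', Q, X, Y, Z)
T1 = T.reshape(I, I * I, order='F')
assert np.isclose(norm2, np.linalg.norm(T) ** 2)
assert np.allclose(mtkr, T1 @ kr(C, B))
```

The key saving is that the Khatri–Rao product is formed between the small J × R matrices Zᵀ C and Yᵀ B, never between the tall I × R factors.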
3.3. Tensor train format. An Nth-order tensor T ∈ K^{I1×···×IN} can be represented by N core matrices or tensors Q^(n) ∈ K^{r_{n−1}×In×rn}, serially linked by N − 1 indices such that

    t_{i1,...,iN} = Σ_{s1=1}^{r1} · · · Σ_{sN−1=1}^{rN−1} q^(1)_{1,i1,s1} q^(2)_{s1,i2,s2} · · · q^(N)_{sN−1,iN,1},   or   T = tt(Q^(1), . . . , Q^(N)).

The parameters r_n, n = 0, . . . , N, are the compression ranks, with r_0 = r_N = 1. This formulation is known as matrix product states [54] in computational chemistry and as tensor trains (TT) [55] in scientific computing, and can be computed using SVDs, cross approximation or optimization [55, 56, 57, 60]. Assuming all dimensions and compression ranks are equal, i.e., r_n = r for n = 1, . . . , N − 1, the total number of parameters is O(NIr²).

The squared norm of T is given by ‖T‖² = |F^(N)|, in which the scalar F^(N) is constructed using the recursive formula

    F^(0) = 1,
    F^(n) = (Q^(n) ·_1 F^(n−1))_(1,2)^H Q^(n)_(1,2),   n = 1, . . . , N.

This way, ‖T‖² can be computed in O(NIr³) flop. While this recursive formula is essentially identical to the formula derived in [60], it reveals the inner products more clearly, allowing efficient implementations.
auxiliary core tensors ˆQ(n). Let A(1) = A, A(2) = Band A(3) = Cfor M
CPD, and 395
A(1) = U, A(2) = V and A(3) = W for M
LMLRA, then ˆQ(n) is computed using 396 O Ir2Rflop as 397 ˆ Q(n)= ¯Q(n)·2A(n) T . 398 399
The inner products are then computed as
400 hMCPD, T i =tracett( ˆQ(1), ˆQ(2), ˆQ(3)) (12) 401 hMLMLRA, T i =Dtt( ˆQ(1), ˆQ(2), ˆQ(3)), ¯S E . (13) 402 403
As only the diagonal entries of tt( ˆQ(1), ˆQ(2), ˆQ(3))are required to compute the trace 404
in(12), the construction of the R × · · · × R tensor can be avoided. The construction
405
of the diagonal entries in (12)and the inner product in (13)require O NRr2and
406
O RNr2
flop, respectively.
407
To compute an mtkrprod, the auxiliary cores ˆQ(n), n = 1, . . . , N are used 408
again. Define P(1) = tt( ¯Q(1), ˆQ(2), ˆQ(3)) and p
r as the mode-1 vectors P(1)(:, r, r), 409 r = 1, . . . , R, then 410 ¯ T(1)(C B) = h p(1)1 p(1)2 · · · p(1)R i . 411 412
The complexity of computing one mtkrprod is therefore O Ir2R
, assuming
auxil-413
iary cores have been constructed.
414
Similarly, the mtkronprod can be formed efficiently as
415 ¯ T(1)(W ⊗ V) =tt( ¯Q(1), ˆQ(2), ˆQ(3)) (1). 416 417
To compute the gradient terms G(n)
T in (7) the matrix tt( ¯Q(1), ˆQ(2), ˆQ(3))
(1) can 418
be constructed and multiplied with ST
(1), which requires O IR
N+ IRN−1r2 flop.
419
However, by exploiting the TT structure, the product can be computed directly in
420
O rRN + r2RN−1+ Ir2R
flop. The precise implementation is out of the scope of
421
this paper. Finally, to compute G(0)
T in O r2RN
flop, we can use
422
vec (T )H
(W ⊗ V ⊗ U) =vectt( ˆQ(1), ˆQ(2), ˆQ(3))T. 423
424
3.4. Implicit Hankelization. Hankelization of a (possibly complex) signal
vec-425
tor m ∈ KM of length M = PN
n=1In− N + 1yields an I1× I2 matrix or I1× · · · × IN 426
tensor with constant anti-diagonals or constant anti-diagonal hyperplanes for N = 2
427
and N > 2, respectively [21,62]. In blind source separation, a set of K vectors
428
mk ∈ KM is often given [14,21]. Each vector mk can be Hankelized separately and 429
the resulting matrices or tensors can be concatenated along the (N + 1)th mode. To
430
simplify the expressions1, we only consider second-order Hankelization and assume
431
I1 = I2 = I. Let us stack the vectors mk as the columns of M. Each entry of the 432
I × I × Ktensor T is then given by tijk= mi+j−1,k. As multiplying a Hankel matrix 433
by a vector corresponds to a convolution, fast Fourier transforms (FFT) can be used
434
to speed up computations [3,24]. Let F denote the M-point FFT and F−1 the inverse 435
M-point FFT. Only the last I rows of the inverse M-point FFT are retained when
436
the operator F−1 is used. 437
The squared norm can be computed in $\mathcal{O}\left(MK\right)$ flop as
$$\|\mathcal{T}\|^2 = \mathbf{w}^{\mathrm{T}}(\bar{\mathbf{M}} \ast \mathbf{M})\mathbf{1}_K,$$
in which $\mathbf{w}$ is a vector that contains the number of occurrences of each entry of $\mathbf{M}$ in $\mathcal{T}$ and can be computed as $\mathbf{w} = \mathrm{convolution}(\mathbf{1}_I, \mathbf{1}_I)$. The inner products can be computed as
$$\langle \mathcal{M}_{\mathrm{CPD}}, \mathcal{T} \rangle = \left\langle \mathcal{F}^{-1}\left(\mathcal{F}\mathbf{A} \ast \mathcal{F}\mathbf{B}\right)\mathbf{C}^{\mathrm{T}}, \mathbf{M} \right\rangle,$$
$$\langle \mathcal{M}_{\mathrm{LMLRA}}, \mathcal{T} \rangle = \left\langle \mathcal{F}^{-1}\left(\left(\mathcal{F}\mathbf{V} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}\right)\mathbf{S}_{(1,2)}\mathbf{W}^{\mathrm{T}}\right), \mathbf{M} \right\rangle.$$
The computational cost is $\mathcal{O}\left(RM(K + \log_2 M)\right)$ and $\mathcal{O}\left((R+K)M\log_2 M + MR^3\right)$ flop for the inner product with $\mathcal{M}_{\mathrm{CPD}}$ and $\mathcal{M}_{\mathrm{LMLRA}}$, respectively.
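The weight-vector formula for the squared norm can be verified directly against an explicitly constructed Hankel tensor. A small NumPy sketch for the real case (our illustration, not Tensorlab code):

```python
import numpy as np

I, K = 4, 3
Mlen = 2 * I - 1
M = np.random.randn(Mlen, K)

# explicit Hankel tensor t_{ijk} = m_{i+j-1,k} (0-based: M[i + j, k])
T = np.empty((I, I, K))
for i in range(I):
    for j in range(I):
        T[i, j, :] = M[i + j, :]

# w contains the number of occurrences of each sample of M in T
w = np.convolve(np.ones(I), np.ones(I))          # length 2I - 1

norm_full = np.sum(T ** 2)
norm_struct = w @ (M * M) @ np.ones(K)

assert np.allclose(norm_full, norm_struct)
```

The structured evaluation touches only the $MK$ parameters instead of the $I^2K$ tensor entries.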
To compute the mtkrprod, the Hankel structure can be exploited as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{C} \odot \mathbf{B}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{C} \ast \mathcal{F}\mathbf{B}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{C} \odot \mathbf{A}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{C} \ast \mathcal{F}\mathbf{A}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{B} \odot \mathbf{A}) = \mathbf{M}^{\mathrm{H}}\mathcal{F}^{-1}\left(\mathcal{F}\mathbf{A} \ast \mathcal{F}\mathbf{B}\right),$$
in which $\cdot^{\mathrm{f}}$ denotes reversing the row order, each requiring $\mathcal{O}\left(RM(K + \log_2 M)\right)$ flop.
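As a check of the mode-1 expression, the following NumPy sketch (our illustration for the real case, with Matlab-style column ordering for the unfolding and the Khatri–Rao product) compares the FFT-based mtkrprod to the explicit one:

```python
import numpy as np

I, K, R = 4, 3, 2
Mlen = 2 * I - 1
M = np.random.randn(Mlen, K)
B = np.random.randn(I, R)
C = np.random.randn(K, R)

# explicit tensor and mtkrprod as reference
T = np.empty((I, I, K))
for i in range(I):
    for j in range(I):
        T[i, j, :] = M[i + j, :]
T1 = T.reshape(I, I * K, order='F')              # mode-1 unfolding
KR = np.einsum('kr,jr->jkr', C, B).reshape(I * K, R, order='F')  # C kr B
ref = T1 @ KR

# FFT-based computation: one length-M FFT per column
Bf = np.zeros((Mlen, R))
Bf[:I] = B[::-1]                                  # flipped, zero-padded B
conv = np.fft.fft(M @ C, axis=0) * np.fft.fft(Bf, axis=0)
out = np.fft.ifft(conv, axis=0).real[-I:]         # keep the last I rows

assert np.allclose(ref, out)
```

Each column reduces to a Hankel-matrix-times-vector product with the generating vector $\mathbf{M}\mathbf{c}_r$, which is why only $R$ FFT pairs are needed.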
The mtkronprod in $\mathbf{G}^{(n)}_{\mathcal{T}}$ in (7) can be computed in $\mathcal{O}\left(R^2M\log_2 M + RMK\right)$ flop as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{W} \otimes \mathbf{V}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{W} \odot^{\mathrm{T}} \mathcal{F}\mathbf{V}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{W} \otimes \mathbf{U}) = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\bar{\mathbf{M}}\mathbf{W} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}^{\mathrm{f}}\right),$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{V} \otimes \mathbf{U}) = \mathbf{M}^{\mathrm{H}}\mathcal{F}^{-1}\left(\mathcal{F}\mathbf{V} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}\right).$$
The multiplication with $\mathbf{S}^{\mathrm{T}}_{(n)}$ in (7) additionally takes $\mathcal{O}\left(IR^3\right)$ flop for $n = 1, 2$ and $\mathcal{O}\left(KR^3\right)$ flop for $n = 3$. However, by first computing the multiplication of the row-wise Khatri–Rao product with $\mathbf{S}^{\mathrm{T}}_{(n)}$, i.e., before performing the inverse FFT, the total cost can be reduced to $\mathcal{O}\left(RM(K + \log_2 M) + MR^3\right)$. Finally, the mtkronprod in all modes can be computed in $\mathcal{O}\left(R^2M\log_2 M + R^3M + RMK\right)$ flop using
$$\mathrm{vec}\left(\mathcal{T}\right)^{\mathrm{H}}(\mathbf{W} \otimes \mathbf{V} \otimes \mathbf{U}) = \mathrm{vec}\left(\left(\mathcal{F}^{-1}\left(\mathcal{F}\mathbf{V} \odot^{\mathrm{T}} \mathcal{F}\mathbf{U}\right)\right)^{\mathrm{T}}\bar{\mathbf{M}}\mathbf{W}\right)^{\mathrm{T}}.$$
3.5. Implicit Löwnerization. Löwner matrices and tensors have attractive properties for applications involving rational functions [21, 23]. Given a function $h : \mathbb{K} \rightarrow \mathbb{K}$ evaluated at $M$ points $t_i \in T = \{t_1, \ldots, t_M\}$, we partition $T$ into two disjoint point sets $X = \{x_1, \ldots, x_I\}$ and $Y = \{y_1, \ldots, y_J\}$ such that $T = X \cup Y$ and $M = I + J$. The entries of a Löwner matrix $\mathbf{L} \in \mathbb{K}^{I \times J}$ are then given by
$$l_{ij} = \frac{h(x_i) - h(y_j)}{x_i - y_j}, \quad i = 1, \ldots, I, \quad j = 1, \ldots, J.$$
While higher-order generalizations of the Löwner transformation exist [21, 22], we focus on third-order tensors in which each $k$th frontal slice is a Löwner matrix constructed from a function $h_k$ evaluated at the points in $X$ and $Y$. Let $\mathbf{P} \in \mathbb{K}^{I \times K}$ and $\mathbf{Q} \in \mathbb{K}^{J \times K}$ contain the sampled function values at the points in $X$ and $Y$, respectively, i.e.,
$$p_{ik} = h_k(x_i), \quad i = 1, \ldots, I, \quad k = 1, \ldots, K,$$
$$q_{jk} = h_k(y_j), \quad j = 1, \ldots, J, \quad k = 1, \ldots, K.$$
The tensor $\mathcal{T} \in \mathbb{K}^{I \times J \times K}$ is then fully determined by $\mathbf{P}$, $\mathbf{Q}$, $X$ and $Y$. To simplify the complexity expressions, we take $I = J$. Each frontal slice $\mathbf{T}_k$ can be written as
$$\mathbf{T}_k = \mathrm{Diag}(\mathbf{p}_k)\mathbf{M} - \mathbf{M}\,\mathrm{Diag}(\mathbf{q}_k), \quad k = 1, \ldots, K,$$
in which $\mathbf{M}$ is a Cauchy matrix with
$$m_{ij} = \frac{1}{x_i - y_j}, \quad 1 \leq i, j \leq I.$$
By assuming that all points in $X$ and $Y$ are equidistant, $\mathbf{M}$ is also a Toeplitz matrix. As multiplying a Toeplitz matrix with a vector $\mathbf{x}$ can be seen as a convolution of the generating vector $\mathbf{v} = \left[m_{1,I}; m_{1,I-1}; \ldots; m_{1,1}; m_{2,1}; \ldots; m_{I,1}\right] \in \mathbb{K}^{2I-1}$ with $\mathbf{x}$, it is not necessary to construct $\mathbf{M}$, and the multiplication can be performed using FFTs. Let $\mathcal{F}$ be the $(2I-1)$-point FFT with zero padding for shorter vectors, $\mathcal{F}^{-1}$ the inverse $(2I-1)$-point FFT, and define $\underline{\mathcal{F}}^{-1}$ to be the last $I$ rows of the result of the inverse FFT. For a matrix $\mathbf{X} \in \mathbb{K}^{I \times K}$, $\mathbf{M}\mathbf{X}$ and $\mathbf{M}^{\mathrm{T}}\mathbf{X}$ can then be computed as
$$\mathbf{M}\mathbf{X} = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\mathbf{v}\mathbf{1}^{\mathrm{T}}_K \ast \mathcal{F}\mathbf{X}\right), \tag{14}$$
$$\mathbf{M}^{\mathrm{T}}\mathbf{X} = \underline{\mathcal{F}}^{-1}\left(\mathcal{F}\mathbf{v}^{\mathrm{f}}\mathbf{1}^{\mathrm{T}}_K \ast \mathcal{F}\mathbf{X}\right). \tag{15}$$
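Equation (14) can be verified numerically. The NumPy sketch below (our own illustration, real data, equidistant points shifted so that $x_i \neq y_j$) builds the generating vector from the reversed first row and the first column of the Toeplitz matrix and keeps the last $I$ rows of the inverse FFT:

```python
import numpy as np

I, K = 5, 3
x_pts = np.arange(1, I + 1) + 0.5        # equidistant points in X
y_pts = np.arange(1, I + 1).astype(float)  # equidistant points in Y
M = 1.0 / (x_pts[:, None] - y_pts[None, :])  # Cauchy, here also Toeplitz

# generating vector: reversed first row followed by first column
v = np.concatenate((M[0, ::-1], M[1:, 0]))    # length 2I - 1

X = np.random.randn(I, K)
Xp = np.zeros((2 * I - 1, K))
Xp[:I] = X                                    # zero padding
prod = np.fft.ifft(np.fft.fft(v)[:, None] * np.fft.fft(Xp, axis=0), axis=0)
out = prod.real[-I:]                          # keep the last I rows

assert np.allclose(M @ X, out)
```

Precomputing the FFT of $\mathbf{v}$, as suggested in the text, removes one transform per product.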
If $\mathcal{F}\mathbf{v}$ is precomputed, the cost of this multiplication is $\mathcal{O}\left(2KI + 4KI\log_2 2I\right)$ flop.

We now define the different operations for Löwner tensors. As the unfoldings $\mathbf{T}_{(n)}$, $n = 1, 2, 3$, are given by
$$\mathbf{T}_{(1)} = \mathbf{P} \odot^{\mathrm{T}} \mathbf{M} - \mathbf{M}\left(\mathbf{Q} \odot^{\mathrm{T}} \mathbf{I}_J\right),$$
$$\mathbf{T}_{(2)} = \mathbf{M}^{\mathrm{T}}\left(\mathbf{P} \odot^{\mathrm{T}} \mathbf{I}_I\right) - \left(\mathbf{Q} \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{T}}\right),$$
$$\mathbf{T}_{(3)} = \mathbf{P}^{\mathrm{T}}\left(\mathbf{M} \odot^{\mathrm{T}} \mathbf{I}_I\right) - \mathbf{Q}^{\mathrm{T}}\left(\mathbf{I}_J \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{T}}\right),$$
the expressions for the mtkrprod and mtkronprod can be derived easily using multilinear algebra identities. The mtkrprod can be computed efficiently as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{C} \odot \mathbf{B}) = \bar{\mathbf{P}}\mathbf{C} \ast \bar{\mathbf{M}}\mathbf{B} - \bar{\mathbf{M}}\left(\bar{\mathbf{Q}}\mathbf{C} \ast \mathbf{B}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{C} \odot \mathbf{A}) = \mathbf{M}^{\mathrm{H}}\left(\bar{\mathbf{P}}\mathbf{C} \ast \mathbf{A}\right) - \bar{\mathbf{Q}}\mathbf{C} \ast \mathbf{M}^{\mathrm{H}}\mathbf{A},$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{B} \odot \mathbf{A}) = \mathbf{P}^{\mathrm{H}}\left(\bar{\mathbf{M}}\mathbf{B} \ast \mathbf{A}\right) - \mathbf{Q}^{\mathrm{H}}\left(\mathbf{B} \ast \mathbf{M}^{\mathrm{H}}\mathbf{A}\right).$$
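The mode-1 expression can be checked against an explicitly built Löwner tensor. A NumPy sketch for the real case (our illustration; $\ast$ is the Hadamard product, and the unfolding uses Matlab-style column ordering):

```python
import numpy as np

I, K, R = 4, 3, 2
x = np.arange(1, I + 1) + 0.5
y = np.arange(1, I + 1).astype(float)
P = np.random.randn(I, K)          # samples h_k(x_i)
Q = np.random.randn(I, K)          # samples h_k(y_j)
M = 1.0 / (x[:, None] - y[None, :])

# explicit Löwner tensor: t_{ijk} = (p_{ik} - q_{jk}) / (x_i - y_j)
T = (P[:, None, :] - Q[None, :, :]) * M[:, :, None]

B = np.random.randn(I, R)
C = np.random.randn(K, R)
KR = np.einsum('kr,jr->jkr', C, B).reshape(I * K, R, order='F')  # C kr B
ref = T.reshape(I, I * K, order='F') @ KR

# structured evaluation: only products with M, P and Q
out = (P @ C) * (M @ B) - M @ ((Q @ C) * B)
assert np.allclose(ref, out)
```

Only matrix products with the $I \times I$ Cauchy matrix and the parameter matrices are needed, never the $I \times I \times K$ tensor.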
As each mtkrprod requires two FFT-based multiplications with $\mathbf{M}$ and two regular matrix-matrix multiplications with $\mathbf{P}$ and $\mathbf{Q}$, the complexity is $\mathcal{O}\left(RI(K + \log_2 I)\right)$. Similarly, the mtkronprod in (7) can be computed as
$$\bar{\mathbf{T}}_{(1)}(\mathbf{W} \otimes \mathbf{V}) = \bar{\mathbf{P}}\mathbf{W} \odot^{\mathrm{T}} \bar{\mathbf{M}}\mathbf{V} - \bar{\mathbf{M}}\left(\bar{\mathbf{Q}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{V}\right),$$
$$\bar{\mathbf{T}}_{(2)}(\mathbf{W} \otimes \mathbf{U}) = \mathbf{M}^{\mathrm{H}}\left(\bar{\mathbf{P}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{U}\right) - \bar{\mathbf{Q}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{H}}\mathbf{U},$$
$$\bar{\mathbf{T}}_{(3)}(\mathbf{V} \otimes \mathbf{U}) = \mathbf{P}^{\mathrm{H}}\left(\bar{\mathbf{M}}\mathbf{V} \odot^{\mathrm{T}} \mathbf{U}\right) - \mathbf{Q}^{\mathrm{H}}\left(\mathbf{V} \odot^{\mathrm{T}} \mathbf{M}^{\mathrm{H}}\mathbf{U}\right),$$
in $\mathcal{O}\left(R^2I + R^2I\log_2 I + RKI\right)$ flop for $n = 1, 2$ and $\mathcal{O}\left(RI\log_2 I + R^2IK\right)$ flop for $n = 3$. This can be reduced further to $\mathcal{O}\left(RI\log_2 I + RKI\right)$ flop for $n = 1, 2$ by exploiting the row-wise Khatri–Rao products when computing the subsequent multiplication with $\mathbf{S}^{\mathrm{T}}_{(n)}$, which costs $\mathcal{O}\left(IR^3\right)$. A similar improvement can be made for $n = 3$. The mtkronprod in (6) is computed as
$$\mathrm{vec}\left(\mathcal{T}\right)^{\mathrm{H}}(\mathbf{W} \otimes \mathbf{V} \otimes \mathbf{U}) = \mathrm{vec}\left(\mathbf{U}^{\mathrm{T}}\left(\bar{\mathbf{P}}\mathbf{W} \odot^{\mathrm{T}} \bar{\mathbf{M}}\mathbf{V}\right) - (\mathbf{M}^{\mathrm{H}}\mathbf{U})^{\mathrm{T}}\left(\bar{\mathbf{Q}}\mathbf{W} \odot^{\mathrm{T}} \mathbf{V}\right)\right)^{\mathrm{T}}$$
and requires $\mathcal{O}\left(RI\log_2 I + R^3I\right)$ flop.

The squared Frobenius norm is computed as
$$\|\mathcal{T}\|^2 = \mathbf{1}^{\mathrm{T}}_I\left(\bar{\mathbf{M}} \ast \mathbf{M}\right)^{\mathrm{T}}\left(\bar{\mathbf{P}} \ast \mathbf{P}\right)\mathbf{1}_K + \mathbf{1}^{\mathrm{T}}_I\left(\bar{\mathbf{M}} \ast \mathbf{M}\right)\left(\bar{\mathbf{Q}} \ast \mathbf{Q}\right)\mathbf{1}_K - 2\,\mathrm{Re}\left(\mathbf{1}^{\mathrm{T}}_I\left(\left(\bar{\mathbf{M}} \ast \mathbf{M}\right)^{\mathrm{T}}\bar{\mathbf{P}} \ast \mathbf{Q}\right)\mathbf{1}_K\right).$$
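The three terms of this norm expression follow from expanding $|t_{ijk}|^2 = |m_{ij}|^2(|p_{ik}|^2 + |q_{jk}|^2 - 2\,\mathrm{Re}(\bar{p}_{ik}q_{jk}))$. A NumPy sketch for the real case with $I = J$ (our own check, not Tensorlab code):

```python
import numpy as np

I, K = 4, 3
x = np.arange(1, I + 1) + 0.5
y = np.arange(1, I + 1).astype(float)
P = np.random.randn(I, K)
Q = np.random.randn(I, K)
M = 1.0 / (x[:, None] - y[None, :])
T = (P[:, None, :] - Q[None, :, :]) * M[:, :, None]   # explicit tensor

# the three terms of the structured norm expression
term1 = np.ones(I) @ ((M * M).T @ (P * P)) @ np.ones(K)
term2 = np.ones(I) @ ((M * M) @ (Q * Q)) @ np.ones(K)
term3 = 2 * np.ones(I) @ (((M * M).T @ P) * Q) @ np.ones(K)

assert np.allclose(np.sum(T ** 2), term1 + term2 - term3)
```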
As only $K + 2$ FFTs are required, the total computational cost is $\mathcal{O}\left(4IK\log_2 2I\right)$ flop. To compute the inner products with $\mathcal{M}_{\mathrm{CPD}}$ and $\mathcal{M}_{\mathrm{LMLRA}}$, the results from the mtkrprod and the mtkronprod are reused:
$$\langle \mathcal{M}_{\mathrm{CPD}}, \mathcal{T} \rangle = \left\langle \bar{\mathbf{T}}_{(3)}(\mathbf{B} \odot \mathbf{A}), \bar{\mathbf{C}} \right\rangle,$$
$$\langle \mathcal{M}_{\mathrm{LMLRA}}, \mathcal{T} \rangle = \left\langle \mathbf{U}^{\mathrm{T}}\bar{\mathbf{T}}_{(1)}(\mathbf{W} \otimes \mathbf{V}), \bar{\mathbf{S}}_{(1)} \right\rangle.$$
Apart from the cost of the mtkrprod and the mtkronprod, an additional cost of $\mathcal{O}\left(KR\right)$ and $\mathcal{O}\left(R^N\right)$ flop is incurred for $\langle \mathcal{M}_{\mathrm{CPD}}, \mathcal{T} \rangle$ and $\langle \mathcal{M}_{\mathrm{LMLRA}}, \mathcal{T} \rangle$, respectively.

In the case the point sets $X$ or $Y$ do not contain equidistant points, $\mathbf{M}$ no longer admits a Toeplitz structure. By exploiting the low displacement rank of Cauchy matrices, it is again not necessary to explicitly construct the tensor. The computational complexity of the multiplications with $\mathbf{M}$ in (14) and (15) is slightly increased to $\mathcal{O}\left(4KI\log_2^2 2I\right)$ flop [25, 26].
4. Experiments. The following experiments illustrate the scalability of constrained CPD, LL1 and BTD algorithms that exploit efficient representations of tensors, as well as the effect on the accuracy of the results. The latter is discussed in subsection 4.1 for ill-conditioned problems and in subsection 4.3 for Hankelized signals. The former is illustrated by computing the nonnegative CPD of a compressed tensor in subsection 4.2 and for the unconstrained LL1 decomposition and the constrained BTD in subsection 4.3. The CPD error $E_{\mathrm{CPD}}$ is defined as
$$E_{\mathrm{CPD}} = \max_n \frac{\left\|\mathbf{A}^{(n)} - \hat{\mathbf{A}}^{(n)}\right\|}{\left\|\mathbf{A}^{(n)}\right\|},$$
in which $\mathbf{A}^{(n)}$ ($\hat{\mathbf{A}}^{(n)}$) are the exact (estimated) factor matrices, $n = 1, \ldots, N$, and scaling and permutation indeterminacies are assumed to be resolved. All timing results are total computation times for a complete run of the optimization algorithm, in contrast to the derived per-iteration complexities in section 3. Tensorlab 3.0 [87] is used for all experiments in combination with Matlab 2016b running on a dual-socket 20-core Intel Xeon E5-2660 v2 machine with 128 GiB of RAM running CentOS 7.
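The error measure is straightforward to implement. A minimal NumPy helper (our own sketch; it assumes, as the text does, that permutation and scaling have already been resolved):

```python
import numpy as np

def cpd_error(A_list, Ahat_list):
    """Relative CPD error: max_n ||A^(n) - Ahat^(n)|| / ||A^(n)||.

    Assumes the permutation and scaling indeterminacies of the CPD
    have already been resolved.
    """
    return max(np.linalg.norm(A - Ah) / np.linalg.norm(A)
               for A, Ah in zip(A_list, Ahat_list))

# small sanity check: a tiny perturbation gives a tiny error
A = [np.random.randn(25, 5) for _ in range(3)]
Ahat = [An + 1e-8 * np.random.randn(25, 5) for An in A]
err = cpd_error(A, Ahat)
assert 0 <= err < 1e-6
```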
4.1. Accuracy and conditioning. In this first experiment, we study the accuracy of the recovered factor matrices for the full tensor and its efficient representation. More specifically, the CPD of a tensor with an exact rank-$R$ structure is computed while varying the relative condition number $\kappa$, which is defined as in [80]. Concretely, a rank-5 tensor of size $25 \times 25 \times 25$ is constructed using random factor matrices $\mathbf{A}^{(n)}$, $n = 1, 2, 3$. Each random factor vector $\mathbf{a}^{(n)}_r$ has norm one and a fixed angle $\alpha$ w.r.t. the other factor vectors in the same factor matrix, i.e., for $n = 1, 2, 3$, $\|\mathbf{a}^{(n)}_r\| = 1$ and $\cos\alpha = \mathbf{a}^{(n)\mathrm{T}}_r\mathbf{a}^{(n)}_s$ for $r, s = 1, \ldots, R$, $r \neq s$. As $\alpha$ decreases, the rank-1 terms become more collinear and the condition number $\kappa$ increases. A rank-5 CPD is computed from the full tensor and from the structured tensor in the polyadic format using cpd_nls, starting from a perturbed exact solution. Figure 1 shows the error $E_{\mathrm{CPD}}$ on the recovered factor matrices for $\alpha = \pi/2, \ldots, \pi/180$ (using a logarithmic scale). As expected, the error increases when the condition number $\kappa$ worsens. For well-conditioned problems, $E_{\mathrm{CPD}}$ is in the order of the machine precision $\epsilon \approx 10^{-16}$ if the full tensor is used, while $E_{\mathrm{CPD}}$ is higher for the structured tensor. This can be explained by the fact that changes smaller than $\sqrt{\epsilon} \approx 10^{-8}$ cannot be distinguished when using the structured tensor, as the computation of the objective function (3) requires the difference of squared, almost equal numbers. For very ill-conditioned problems, $E_{\mathrm{CPD}}$ may be undesirably high when using the structured tensor. If the tensor admits an exact decomposition and if this exact decomposition is required up to machine precision, using the full tensor may be necessary. However, the structured tensor can still be used as an initialization to reduce the overall computational cost.

Figure 1. By reducing the angle α between the factor vectors, the condition of the CPD worsens, resulting in a higher error E_CPD. For very ill-conditioned problems, using the structured tensor approach may result in an undesirably large error, and a few additional iterations using the full tensor may be required to improve the accuracy. Results are medians over 100 experiments.
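The $\sqrt{\epsilon}$ barrier can be reproduced in two lines of plain Python (our own toy example, not from the experiments): evaluating a tiny squared difference via expanded squares, as the structured objective function effectively does, destroys every digit below the rounding level of the large intermediate quantities.

```python
# a and b agree to nine digits; the true squared difference is ~1e-18,
# far below sqrt(eps) ~ 1e-8 relative to the magnitudes involved
a = 1.0
b = 1.0 + 1e-9

direct = (a - b) ** 2                 # accurate small result
expanded = a**2 - 2 * a * b + b**2    # difference of squared, almost
                                      # equal numbers: cancellation at
                                      # the eps level swamps the answer
print(direct, expanded)
```

The direct evaluation corresponds to working with the full tensor; the expanded evaluation mimics the structured formulation $\|\mathcal{T}\|^2 - 2\langle\mathcal{M},\mathcal{T}\rangle + \|\mathcal{M}\|^2$.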
4.2. Compression for nonnegative CPD. The following experiments illustrate how compression can be used to reduce the complexity of constrained tensor decompositions. Here, we focus on the widely used nonnegative CPD, which is sped up by first computing a truncated MLSVD or a TT approximation. Two approaches to enforce the constraints are used: a projected Gauss–Newton approach generating updates such that the variables are always positive using active sets [42], and parameter-based constraints, i.e., each entry is the square of an underlying variable [65]. The former approach is implemented using the boundary-constrained Gauss–Newton algorithm nlsb_gndl; the latter is implemented using the structured data fusion framework [77] (sdf_nls and struct_nonneg).
For the first experiment, let $\mathcal{T}$ be a rank-10 tensor of size $I \times I \times I$ constructed using random factor matrices $\mathbf{A}^{(n)}$ with entries drawn from a uniform distribution $\mathrm{U}(0, 1)$. Gaussian i.i.d. noise is added such that the signal-to-noise ratio (SNR) is 20 dB. Using a randomized MLSVD (mlsvd_rsi [85]), the tensor is compressed such that the core $\mathcal{G}$ has size $10 \times 10 \times 10$, i.e., $\mathcal{T} \approx \llbracket \mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathbf{U}^{(3)} \rrbracket$. The core tensor $\mathcal{G}$ and the factors $\mathbf{U}^{(n)}$ are then used as the structured format to compute a rank-10 CPD. Random initial factor matrices are drawn from the same distribution as $\mathbf{A}^{(n)}$. Figure 2 shows the time required to compute an unconstrained CPD (hence relying on CPD uniqueness), a constrained CPD using projected GN (cpd_nls with the nlsb_gndl solver) and a constrained CPD using parametrization (sdf_nls). In the three cases, the computation time for the original tensor $\mathcal{T}$ rises cubically in the dimension $I$, for $I = 2^6, 2^7, \ldots, 2^{10}$, while the time rises linearly using the structured approach. Hence, existing CPD algorithms can be scaled to handle large problems, provided that the MLSVD approximation can be computed and the rank is modest. Note that the compression time is not included as it can be amortized over multiple initializations, ranks or experiments.
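For readers outside Matlab, the compression step can be sketched with a plain truncated HOSVD in NumPy (our own illustration; the experiment itself uses the randomized mlsvd_rsi from Tensorlab, which is more efficient for large tensors):

```python
import numpy as np

def unfold(T, n):
    """Mode-n unfolding of a third-order tensor (column-major order)."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1, order='F')

def compress(T, core_size):
    """Truncated-MLSVD-style compression: T ~ G x1 U1 x2 U2 x3 U3."""
    U = [np.linalg.svd(unfold(T, n), full_matrices=False)[0][:, :core_size]
         for n in range(3)]
    G = np.einsum('ijk,ia,jb,kc->abc', T, U[0], U[1], U[2])
    return G, U

# exact rank-10 tensor of size I x I x I, compressed to a 10^3 core
I, R = 40, 10
A = [np.random.rand(I, R) for _ in range(3)]
T = np.einsum('ir,jr,kr->ijk', A[0], A[1], A[2])
G, U = compress(T, R)
T_approx = np.einsum('abc,ia,jb,kc->ijk', G, U[0], U[1], U[2])

# an exact rank-10 tensor is captured exactly by a 10 x 10 x 10 core
assert np.linalg.norm(T - T_approx) / np.linalg.norm(T) < 1e-8
```

The subsequent CPD is then computed from the small core $\mathcal{G}$ and the factors $\mathbf{U}^{(n)}$ instead of the $I^3$ entries.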
Figure 2. When using a compressed tensor instead of the full tensor, the computational cost scales linearly in the tensor dimensions instead of cubically for a rank-10 nonnegative tensor of size I × I × I. MLSVD compression with a core size of 10 × 10 × 10 is used. The increase for the structured tensor is actually less than linear (×1.6–×1.7 instead of ×2), which is caused by improved multicore usage. The time required to compute the MLSVD increases cubically from 20 ms for I = 2^6 to 8 s for I = 2^10 and is not included. Medians over 100 experiments for each parameter are reported.

Figure 3. Using a TT approximation instead of the full 10 × 10 × · · · × 10 tensor removes the exponential dependence on the order N when computing a rank-5 nonnegative CPD: from N = 8 to N = 9 the time increases with a factor 1.2 ≈ 9/8 using the TT approximation, while the time increase is 8.3 ≈ 10 using the full tensor. The median TT compression time is not included and increases exponentially from 20 ms for N = 5 to 114 s for N = 9. The results are medians over 100 experiments.

In a second experiment, the complexity as a function of the order $N$ of the tensor
is investigated. We compare the time required to compute a nonnegative CPD of an $N$th-order rank-5 tensor with dimensions $10 \times \cdots \times 10$ using the full tensor and its TT approximation. The data is generated by constructing $N$ random factor matrices with entries drawn from the uniform distribution $\mathrm{U}(0, 1)$. Random Gaussian i.i.d. noise is added to the generated tensor such that the SNR is 20 dB. The core tensors $\mathcal{G}^{(n)}$ corresponding to the TT approximation are computed using tt_tensor from the TT Toolbox [57], which we slightly adapted such that all compression ranks $r_n = 5$, $n = 1, \ldots, N-1$. Random factor matrices using the same distribution are used as initialization. Nonnegativity is enforced using the projected GN approach (cpd_nls with the nlsb_gndl solver). The timing results in Figure 3 clearly show that using the TT approximation avoids the curse of dimensionality, as the time increases linearly in $N$, in contrast to the time required for the decomposition of the full tensor. Note that we expect the time to increase with a factor 10 if the order increases by one, but the actual time increase is lower, as seen in Figure 3. This is due to the fact that the decomposition problem becomes easier as relatively more data is available per variable. This is also reflected in the accuracy: for $N = 5$ the median accuracy is $E_{\mathrm{CPD}} = 4.7 \cdot 10^{-3}$, while $E_{\mathrm{CPD}} = 3.8 \cdot 10^{-5}$ for $N = 9$.
4.3. Signal separation through Hankelization. When the source signals are sums of exponential polynomials, the sources $\mathbf{S} \in \mathbb{R}^{M \times R}$ can be recovered from a noisy mixture $\mathbf{X} \in \mathbb{R}^{M \times K}$ using Hankelization [17]. We use the setup
$$\mathbf{X} = \mathbf{S}\mathbf{M}^{\mathrm{T}} + \mathbf{N},$$
in which $\mathbf{M} \in \mathbb{R}^{K \times R}$ is the mixing matrix and $\mathbf{N} \in \mathbb{R}^{M \times K}$ is additive Gaussian i.i.d. noise scaled such that a given SNR is attained. Hankelization along the columns of $\mathbf{X}$ leads to a third-order tensor $\mathcal{T}$ of size $\left\lceil \frac{M+1}{2} \right\rceil \times \left\lfloor \frac{M+1}{2} \right\rfloor \times K$, in which $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ are the floor and ceil operators, respectively. Hence, if the number of samples $M$ doubles, the number of entries in $\mathcal{T}$ quadruples. If each of the $R$ source signals is a sum of $Q_r$ exponential polynomials of degree $d_{qr}$, $q = 1, \ldots, Q_r$, $\mathcal{T}$ can be decomposed into a sum of $R$ multilinear rank-$(L_r, L_r, 1)$ terms with $L_r = \sum_{q=1}^{Q_r}(d_{qr} + 1)$ [17], i.e.,
$$\mathcal{T} = \sum_{r=1}^{R} \left(\mathbf{A}_r\mathbf{B}^{\mathrm{T}}_r\right) \otimes \mathbf{c}_r,$$
in which $\mathbf{c}_r$ estimates the $r$th mixing vector $\mathbf{M}(:, r)$. The source signals can be estimated up to scaling and permutation by dehankelizing $\mathbf{A}_r\mathbf{B}^{\mathrm{T}}_r$, e.g., by averaging over the anti-diagonals. The example from [17] is slightly adapted in this experiment: two sources are mixed using
$$\mathbf{M} = \begin{bmatrix} 2 & 1 \\ -1 & 1 \end{bmatrix},$$
i.e., $K = R = 2$. The sources are given by
$$s_1(t) = 0.5\sin(6\pi t),$$
$$s_2(t) = (4t^2 - 2.8t)\exp(-t),$$
hence $L_1 = 2$ and $L_2 = 3$. Estimating $R$ and $L_r$ is beyond the scope of this paper; see, e.g., [17]. The true $R$ and $L_r$ are therefore used. $M$ equidistant samples are taken between 0 s and 1 s.
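The dehankelization step mentioned above, averaging over the anti-diagonals, can be sketched in a few lines of NumPy (our own toy version; the Tensorlab routine dehankelize used in the experiments is more general and avoids forming the matrix explicitly):

```python
import numpy as np

def dehankelize(H):
    """Recover a signal from a (possibly perturbed) I x J Hankel matrix
    by averaging over its anti-diagonals."""
    I, J = H.shape
    s = np.zeros(I + J - 1)
    counts = np.zeros(I + J - 1)
    for i in range(I):
        for j in range(J):
            s[i + j] += H[i, j]       # entry belongs to anti-diagonal i+j
            counts[i + j] += 1
    return s / counts

# an exact Hankel matrix of a signal is recovered exactly
m = np.random.randn(9)
H = np.array([[m[i + j] for j in range(5)] for i in range(5)])
assert np.allclose(dehankelize(H), m)
```

Applied to the low-rank products $\mathbf{A}_r\mathbf{B}^{\mathrm{T}}_r$, the averaging also suppresses part of the noise.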
In the first experiment, the SNR is varied for a fixed number of $M = 501$ samples and the relative error $E$ on the mixing matrix is compared when using the full Hankelized tensor or the implicitly Hankelized tensor, i.e., the structured format. Hankelization is performed using hankelize, which returns both the explicit and implicit tensorization. The resulting tensors of size $251 \times 251 \times 2$ are decomposed using ll1_nls, starting from a random initialization. From the factor matrices, the signals are recovered using dehankelize, without constructing the full tensor. As shown in Figure 4, when increasing the SNR from 0 dB to 300 dB, the error decreases at the same rate for the explicit and the implicit tensorization until the SNR reaches 180 dB: while the error continues to decrease for the explicit tensorization, the error stagnates at $E \approx 10^{-10}$ for the implicit tensorization. This loss of accuracy is explained in subsection 4.1 and only occurs for a very high SNR and in the case the sources are exact sums of exponential polynomials. As illustrated in the following experiments, a trade-off between computational cost and accuracy can be made. In the case of a high SNR, the solution using the implicit tensorization can be used to initialize the algorithm with the full tensor.

Figure 4. For low and medium SNR, the errors on the mixing matrix for the explicit and implicit Hankelization approaches are equal. While the error for the implicit tensorization stagnates for an SNR larger than 180 dB, the error continues to decrease when using the explicit tensorization. The errors are medians over 100 experiments, each using a best-out-of-five initializations strategy.
In the second experiment, the scaling behavior as a function of the number of samples is illustrated for $M = 501$, $5\,001$, $50\,001$ and $500\,001$ points. The resulting Hankelized tensors require 953 KiB, 95.3 MiB, 9.53 GiB and 953 GiB of memory, respectively, if formed explicitly. The SNR is fixed at 20 dB. We compute an unconstrained LL1 decomposition (ll1_nls) and a constrained BTD, which models the tensor as
$$\mathcal{T} = \sum_{r=1}^{R} \left(\mathbf{V}^{(r)}\mathbf{G}^{(r)}\mathbf{V}^{(r)\mathrm{T}}\right) \otimes \mathbf{c}_r,$$
in which $\mathbf{V}^{(r)}$ are confluent Vandermonde matrices and $\mathbf{G}^{(r)}$ are upper anti-triangular matrices; see [17] for details. By imposing the structure in $\mathbf{V}^{(r)}$, the recovered signals are guaranteed to be exponential polynomials, and the underlying poles can be recovered easily. These parametric constraints are modeled in the SDF framework [77] using struct_confvander and sdf_nls. Figure 5 and Figure 6 show the median time and error on the mixing matrix over 50 noise realizations. Each algorithm is initialized three and six times using random variables for the LL1 and the BTD model, respectively. The time and error for the best initialization, i.e., the one resulting in the lowest error on the mixing matrix, are retained. The timing results in Figure 5 clearly show that implicit Hankelization drastically reduces the computation time, allowing longer signals to be analyzed. The computation of the constrained BTD is sensitive to the initialization and is difficult from an optimization point of view, as the factors are ill-conditioned due to the generalized Vandermonde structure. This results in higher computation times and a lower accuracy for random initializations. Being able to analyze longer signals allows one to improve the accuracy for a fixed SNR, as is clear from Figure 6: by taking 100 times as many samples, $E$ decreases by a factor 10.
5. Conclusion. By rewriting the objective function and gradient expressions commonly used to compute tensor decompositions, four core operations are exposed:
Figure 5. For both the unconstrained LL1 decomposition and the constrained BTD, the computation time using implicit tensorization increases more slowly, enabling large-scale applications. For the LL1 decomposition, the time increases with a factor 14 ≈ 13 (O(M log₂ M)) using the efficient representation and with a factor 17 when the full tensor is used. The latter factor is better than the expected factor 100 as fewer iterations are required. Computing the constrained BTD is more difficult due to ill-conditioned factors, and the variation in time is large, causing larger deviations from the expected scaling factors. The results are medians over 50 experiments with multiple initializations.

Figure 6. The relative error E of the estimated mixing matrix decreases when the number of samples M increases. In the case of the unconstrained LL1 decomposition, E decreases with a factor 9.2 ≈ 10 when 100 times as many samples are used. The constrained BTD is more difficult from an optimization point of view and involves ill-conditioned factor matrices, resulting in a modest improvement in terms of error. The errors are computed using implicit tensorization and are medians over 50 noise realizations with multiple initializations. The results for full tensors are similar.
the squared norm, the inner product, the matricized tensor times Khatri–Rao product and the matricized tensor times Kronecker product. By specializing these operations for efficient representations of tensors, both the computational and the memory complexity are reduced from $\mathcal{O}\left(\text{tensor entries}\right)$ to $\mathcal{O}\left(\text{parameters}\right)$, as illustrated for the polyadic, Tucker and tensor train formats, as well as for Hankelization and Löwnerization. The operations can be used in many decomposition algorithms and frameworks, including ALS, second-order algorithms and algorithms for constrained and/or coupled decompositions. The numerical consequences of exploiting efficient representations are studied for the case in which a highly accurate solution is required. Finally, important concepts such as Tucker or TT compression for constrained decompositions and implicit tensorization allow large-scale datasets to be handled easily, as illustrated for nonnegative CPD and the Hankelization of mixtures of exponentials.
REFERENCES

[1] E. Acar, D. M. Dunlavy, and T. G. Kolda, A scalable optimization approach for fitting canonical tensor decompositions, J. Chemometrics, 25 (2011), pp. 67–86, doi:10.1002/cem.1335.
[2] E. Acar, T. G. Kolda, and D. M. Dunlavy, All-at-once optimization for coupled matrix and tensor factorizations, May 2011, arXiv:1105.3422.
[3] R. Badeau and R. Boyer, Fast multilinear singular value decomposition for structured tensors, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 1008–1021, doi:10.1137/060655936.
[4] B. Bader and T. G. Kolda, Efficient MATLAB computations with sparse and factored tensors, SIAM J. Sci. Comput., 30 (2007), pp. 205–231, doi:10.1137/060676489.
[5] H. N. Bharath, D. Sima, N. Sauwen, U. Himmelreich, L. De Lathauwer, and S. Van Huffel, Nonnegative canonical polyadic decomposition for tissue-type differentiation in gliomas, IEEE J. Biomed. Health Inform., 21 (2017), pp. 1124–1132, doi:10.1109/JBHI.2016.2583539.
[6] R. Bro, Multi-way analysis in the food industry: models, algorithms, and applications, PhD thesis, University of Amsterdam, 1998.
[7] R. Bro and C. A. Andersson, Improving the speed of multiway algorithms: Part II: Compression, Chemometr. Intell. Lab., 42 (1998), pp. 105–113, doi:10.1016/S0169-7439(98)00011-2.
[8] R. Bro, R. A. Harshman, N. D. Sidiropoulos, and M. E. Lundy, Modeling multi-way data with linearly dependent loadings, J. Chemometrics, 23 (2009), pp. 324–340, doi:10.1002/cem.1206.
[9] J. D. Carroll and J.-J. Chang, Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart–Young" decomposition, Psychometrika, 35 (1970), pp. 283–319, doi:10.1007/bf02310791.
[10] J. D. Carroll, S. Pruzansky, and J. B. Kruskal, CANDELINC: A general approach to multidimensional analysis of many-way arrays with linear constraints on parameters, Psychometrika, 45 (1980), pp. 3–24, doi:10.1007/bf02293596.
[11] J. H. Choi and S. V. N. Vishwanathan, DFacTo: Distributed factorization of tensors, in Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS'14, Cambridge, MA, USA, 2014, MIT Press, pp. 1296–1304.
[12] A. Cichocki, D. Mandic, A.-H. Phan, C. Caiafa, G. Zhou, Q. Zhao, and L. De Lathauwer, Tensor decompositions for signal processing applications: From two-way to multiway component analysis, IEEE Signal Process. Mag., 32 (2015), pp. 145–163, doi:10.1109/msp.2013.2297439.
[13] J. E. Cohen, R. C. Farias, and P. Comon, Fast decomposition of large nonnegative tensors, IEEE Signal Process. Lett., 22 (2015), pp. 862–866, doi:10.1109/lsp.2014.2374838.
[14] P. Comon and C. Jutten, Handbook of Blind Source Separation: Independent Component Analysis and Applications, Academic Press, 2010.
[15] A. de Almeida, G. Favier, and J. C. M. Mota, A constrained factor decomposition with application to MIMO antenna systems, IEEE Trans. Signal Process., 56 (2008), pp. 2429–2442, doi:10.1109/TSP.2008.917026.
[16] L. De Lathauwer, Decompositions of a higher-order tensor in block terms — Part II: Definitions and uniqueness, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 1033–1066, doi:10.1137/070690729.
[17] L. De Lathauwer, Blind separation of exponential polynomials and the decomposition of a tensor in rank-(Lr, Lr, 1) terms, SIAM J. Matrix Anal. Appl., 32 (2011), pp. 1451–1474, doi:10.1137/100805510.
[18] L. De Lathauwer, B. De Moor, and J. Vandewalle, A multilinear singular value decomposition, SIAM J. Matrix Anal. Appl., 21 (2000), pp. 1253–1278, doi:10.1137/S0895479896305696.
[19] L. De Lathauwer, B. De Moor, and J. Vandewalle, On the best rank-1 and rank-(R1, R2, . . . , RN) approximation of higher-order tensors, SIAM J. Matrix Anal. Appl., 21 (2000), pp. 1324–1342, doi:10.1137/S0895479898346995.
[20] L. De Lathauwer and D. Nion, Decompositions of a higher-order tensor in block terms — Part III: Alternating least squares algorithms, SIAM J. Matrix Anal. Appl., 30 (2008), pp. 1067–1083, doi:10.1137/070690730.
[21] O. Debals and L. De Lathauwer, The concept of tensorization, Technical Report 17–99, ESAT-STADIUS, KU Leuven, Belgium, 2017.
[22] O. Debals, L. De Lathauwer, and M. Van Barel, About higher-order Löwner tensors, Technical Report 17–98, ESAT-STADIUS, KU Leuven, Belgium, 2017, ftp://ftp.esat.kuleuven.be/pub/stadius/odebals/debals2017higher.pdf.
[23] O. Debals, M. Van Barel, and L. De Lathauwer, Löwner-based blind signal separation of rational functions with applications, IEEE Trans. Signal Process., 64 (2016), pp. 1909–1918, doi:10.1109/tsp.2015.2500179.
[24] W. Ding, L. Qi, and Y. Wei, Fast Hankel tensor-vector product and its application to exponential data fitting, Numer. Linear Algebra Appl., 22 (2015), pp. 814–832, doi:10.1002/nla.1970.
[25] I. Gohberg and V. Olshevsky, Complexity of multiplication with vectors for structured matrices, Linear Algebra Appl., 202 (1994), pp. 163–192, doi:10.1016/0024-3795(94)90189-9.
[26] I. Gohberg and V. Olshevsky, Fast algorithms with preprocessing for matrix-vector multiplication problems, Journal of Complexity, 10 (1994), pp. 411–427, doi:10.1006/jcom.1994.1021.
[27] L. Grasedyck, Hierarchical singular value decomposition of tensors, SIAM J. Matrix Anal. Appl., 31 (2010), pp. 2029–2054, doi:10.1137/090764189.
[28] L. Grasedyck, D. Kressner, and C. Tobler, A literature survey of low-rank tensor approximation techniques, GAMM-Mitteilungen, 36 (2013), pp. 53–78, doi:10.1002/gamm.201310004.
[29] W. Hackbusch, Tensor Spaces and Numerical Tensor Calculus, Springer Series in Computational Mathematics, 2012, doi:10.1007/978-3-642-28027-6.
[30] W. Hackbusch, B. N. Khoromskij, and E. E. Tyrtyshnikov, Hierarchical Kronecker tensor-product approximations, J. Numer. Math., 13 (2005), doi:10.1515/1569395054012767.
[31] W. Hackbusch and S. Kühn, A new scheme for the tensor representation, J. Fourier Anal. Appl., 15 (2009), pp. 706–722, doi:10.1007/s00041-009-9094-9.
[32] F. L. Hitchcock, The expression of a tensor or a polyadic as a sum of products, Journal of Mathematics and Physics, 6 (1927), pp. 164–189, doi:10.1002/sapm192761164.
[33] S. Holtz, T. Rohwedder, and R. Schneider, The alternating linear scheme for tensor optimization in the Tensor Train format, SIAM J. Sci. Comput., 34 (2012), pp. A683–A713, doi:10.1137/100818893.
[34] K. Huang, N. D. Sidiropoulos, and A. P. Liavas, A flexible and efficient algorithmic framework for constrained matrix and tensor factorization, IEEE Trans. Signal Process., 64 (2016), pp. 5052–5065, doi:10.1109/TSP.2016.2576427.
[35] M. Ishteva, P.-A. Absil, S. Van Huffel, and L. De Lathauwer, Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme, SIAM J. Matrix Anal. Appl., 32 (2011), pp. 115–135, doi:10.1137/090764827.
[36] M. Ishteva, L. De Lathauwer, P.-A. Absil, and S. Van Huffel, The best rank-(R1, R2, R3) approximation of tensors by means of a geometric Newton method, AIP Conference Proceedings, 1048 (2008), pp. 274–277, doi:10.1063/1.2990911.
[37] B. Jeon, I. Jeon, L. Sael, and U. Kang, SCouT: Scalable coupled matrix-tensor factorization — algorithm and discoveries, in 2016 IEEE 32nd International Conference on Data Engineering (ICDE), IEEE, May 2016, pp. 811–822, doi:10.1109/icde.2016.7498292.
[38] I. Jeon, E. E. Papalexakis, U. Kang, and C. Faloutsos, HaTen2: Billion-scale tensor decompositions, in 2015 IEEE 31st International Conference on Data Engineering, IEEE, Apr. 2015, pp. 1047–1058, doi:10.1109/icde.2015.7113355.
[39] U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos, GigaTensor: Scaling tensor analysis up by 100 times — algorithms and discoveries, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining — KDD '12, 2012, doi:10.1145/2339530.2339583.
[40] O. Kaya and B. Uçar, Scalable sparse tensor decompositions in distributed memory systems, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis — SC '15, ACM, 2015, pp. 77:1–77:11, doi:10.1145/2807591.2807624.
[41] O. Kaya and B. Uçar, High performance parallel algorithms for the Tucker decomposition of sparse tensors, in 2016 45th International Conference on Parallel Processing (ICPP), IEEE, Aug. 2016, pp. 103–112, doi:10.1109/icpp.2016.19.
[42] C. Kelley, Iterative Methods for Optimization, SIAM, 1999.
[43] B. N. Khoromskij, Structured rank-(r1, . . . , rD) decomposition of function-related tensors in R^D, Comput. Methods Appl. Math., 6 (2006), doi:10.2478/cmam-2006-0010.
[44] B. N. Khoromskij and V. Khoromskaia, Low rank Tucker-type tensor approximation to classical potentials, Central European Journal of Mathematics, 5 (2007), pp. 523–550, doi:10.2478/s11533-007-0018-0.