
Distance-based analysis of dynamical systems and time series by optimal transport

Muskulus, M.

Citation

Muskulus, M. (2010, February 11). Distance-based analysis of dynamical systems and time series by optimal transport. Retrieved from https://hdl.handle.net/1887/14735

Version: Corrected Publisher's Version
License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/14735

Note: To cite this publication please use the final published version (if applicable).


Optimal transportation distances

Science is what we understand well enough to explain to a computer. Art is everything else we do.

Donald Knuth

In Section B.1 the general, probabilistic setting is introduced with which we work in the following. Section B.2 introduces the optimal transportation problem, which is used to define a distance in Section B.3.

B.1 The setting

Recall the setting introduced in Section 1.1: a complex system S is measured by a measuring device D. The system S is an element of an abstract space of systems S, and a measuring device is a function that maps S ∈ S into a space of measurements M. Since we are interested in quantitative measurements, the space M will be a metric space (M, d), equipped with a distance d. For example, we could take (M, d) to be some Euclidean space E^n or, more generally, a manifold with distance induced by geodesics (shortest paths). However, to account for random influences in the measurement process, we will more generally consider spaces of probability measures on M.

Let (M, d) be a metric space. For simplicity of exposition, let us also assume that M is complete, path-connected and has a continuous distance function, such that it is Hausdorff in the induced topology. A curve on M is a continuous function γ : [0, 1] → M. It is a curve from x to y if γ(0) = x and γ(1) = y. The arc length of γ is defined by

$$L_\gamma = \sup_{0=t_0<t_1<\cdots<t_n=1}\ \sum_{i=0}^{n-1} d(\gamma(t_i),\gamma(t_{i+1})), \tag{B.1}$$

where the supremum is taken over all possible partitions of [0, 1], for all n ∈ N. Note that L_γ can be infinite; the curve γ is then called non-rectifiable.

Let us define a new metric d_I on M by letting the value of d_I(x, y) be the infimum of the lengths of all paths from x to y. This is called the induced intrinsic metric of M. If d_I(x, y) = d(x, y) for all points x, y ∈ M, then (M, d) is a length space and d is called intrinsic. Euclidean space E^n and Riemannian manifolds are examples of


length spaces. Since M is path-connected, it is a convex metric space, i.e., for any two points x, y ∈ M there exists a point z ∈ M between x and y in the intrinsic metric.

Let µ be a probability measure on M with σ-algebra B. We will assume µ to be a Radon measure, i.e., a tight locally-finite measure on the Borel σ-algebra of M, and denote the space of all such measures by P(M). Most of the time, however, we will be working in the much simpler setting of a discrete probability space: let µ be a singular measure on M that is finitely presentable, i.e., such that there exists a representation

$$\mu = \sum_{i=1}^{n} a_i \delta_{x_i}, \tag{B.2}$$

where δ_{x_i} is the Dirac measure at the point x_i ∈ M, and the norming constraint \(\sum_{i=1}^{n} a_i = 1\) is fulfilled. We further assume that x_i ≠ x_j if i ≠ j, which makes the representation (B.2) unique (up to permutation of indices). Denote the space of all such measures by P_F(M). Measures in P_F correspond to the notion of a weighted point set from the literature on classification. In our setting they represent a finite amount of information obtained from a complex system.

In particular, let a probability measure µ_0 ∈ P(M) represent the possible measurements on a system S. Each elementary measurement corresponds to a point of M, and if the state of the system S is repeatedly measured, we obtain a finite sequence X_1, X_2, …, X_n of iid random variables (with respect to the measure µ_0) taking values in M. These give rise to an empirical measure

$$\mu_n[A] = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}[A], \qquad A \in \mathcal{B}. \tag{B.3}$$

The measure µ_n is itself a random variable, but fixing the outcomes, i.e., considering a realization (x_1, x_2, …, x_n) ∈ M^n, a measure µ ∈ P_F(M) is obtained,

$$\mu = \sum_{i=1}^{n} \frac{1}{n}\, \delta_{x_i}, \tag{B.4}$$

which we call a realization of the measure µ_0. Denote the space of all probability measures (B.4) for fixed n ∈ N and µ_0 ∈ P(M) by P_n(µ_0).
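To illustrate how a measure in P_n(µ_0) is stored in practice, the short Python sketch below (added in editing; the choice of µ_0 as the standard normal is an arbitrary assumption) draws n iid outcomes and keeps them as a weighted point set with uniform weights 1/n:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100
support = rng.normal(size=n)    # observed outcomes x_1, ..., x_n as in (B.3)
weights = np.full(n, 1.0 / n)   # uniform weights a_i = 1/n as in (B.4)
print(weights.sum())            # 1.0 up to floating-point rounding
```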

B.2 Discrete optimal transportation

In this section we will motivate the notion of distance with which we will be concerned in the rest of the thesis. The starting point is the question of how to define a useful distance for the measures in P_F.

Example 10 (Total variation). The distance in variation between two measures µ and ν is

$$d_{TV}(\mu,\nu) = \sup_{A\in\mathcal{B}} |\mu[A] - \nu[A]|. \tag{B.5}$$

It is obviously reflexive and symmetric. For the triangle inequality, let ε > 0 and consider A ∈ B such that d_TV(µ, ν) < |µ[A] − ν[A]| + ε. Then

$$d_{TV}(\mu,\nu) < |\mu[A]-\rho[A]| + |\rho[A]-\nu[A]| + \varepsilon \le \sup_{A\in\mathcal{B}}|\mu[A]-\rho[A]| + \sup_{A\in\mathcal{B}}|\rho[A]-\nu[A]| + \varepsilon. \tag{B.6}$$

Since this holds for all ε > 0, the triangle inequality is established. Total variation distance metrizes the strong topology on the space of measures, and can be interpreted easily: if two measures µ and ν have total variation p = d_TV(µ, ν), then for any set A ∈ B the probability assigned to it by µ and ν differs by at most p. For two measures µ, ν ∈ P_F concentrated on a countable set x_1, x_2, …, it simplifies to

$$d_{TV}(\mu,\nu) = \frac{1}{2}\sum_i |\mu[x_i] - \nu[x_i]|, \tag{B.7}$$

where the factor 1/2 makes the sum consistent with the sup definition (B.5).

Unfortunately, total variation needs further effort to be usable in practice. Consider an absolutely continuous µ_0 ∈ P(M) with density f : M → [0, ∞). For two realizations µ, ν ∈ P_n(µ_0) we have that pr(supp µ ∩ supp ν ≠ ∅) = 0, so d_TV(µ, ν) = 1, the maximal possible value, almost surely. In practice, therefore, we will need to use some kind of density estimation to achieve a non-trivial value d_TV(µ, ν); cf. (Schmid and Schmidt, 2006).
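For weighted point sets, the total variation distance is straightforward to evaluate. The sketch below (hypothetical helper `d_tv`, assuming measures on the real line; an illustration added in editing) uses the sup definition (B.5), which for point sets equals half the ℓ¹ difference of the weights, and shows the degeneracy just described: realizations with disjoint supports are at maximal distance.

```python
def d_tv(xs, a, ys, b):
    """Total variation between mu = sum_i a_i delta_{x_i} and
    nu = sum_j b_j delta_{y_j} in the sup definition (B.5): half the
    l1-difference of the weights over the union of the supports."""
    support = sorted(set(xs) | set(ys))
    mu = {x: 0.0 for x in support}
    nu = {x: 0.0 for x in support}
    for x, w in zip(xs, a):
        mu[x] += w
    for y, w in zip(ys, b):
        nu[y] += w
    return 0.5 * sum(abs(mu[x] - nu[x]) for x in support)

# Two realizations with disjoint supports are at the maximal distance 1,
# which is why d_TV is uninformative for realizations of a continuous mu_0.
print(d_tv([0.0, 1.0], [0.5, 0.5], [0.25, 0.75], [0.5, 0.5]))  # -> 1.0
```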

Example 11. The Hausdorff metric is a distance between subsets of a metric space (Example 5). It can be turned into a distance for probability measures by “forgetting” the probabilistic weights, i.e.,

$$d_{HD}(\mu,\nu) \stackrel{\text{def}}{=} d_H(\operatorname{supp}\mu,\ \operatorname{supp}\nu). \tag{B.8}$$

If M is a normed vector space, then a subset A ⊂ M and its translation x + A = {x + a | a ∈ A} have Hausdorff distance d_H(A, x + A) = ||x||, which seems natural.

However, Hausdorff distance is unstable against outliers. For example, consider the family of measures defined by P_0 = δ_0 and P_n = (1/n) δ_n + (1 − 1/n) δ_0 for all n > 0. Then d_HD(P_0, P_n) = n.
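The instability can be checked directly. The sketch below (hypothetical helper `d_hd`, for finite supports on the real line; added in editing) computes the Hausdorff distance of the supports:

```python
def d_hd(supp_mu, supp_nu):
    """Hausdorff distance (B.8) between two finite supports on the real
    line, ignoring the probabilistic weights entirely."""
    forward = max(min(abs(x - y) for y in supp_nu) for x in supp_mu)
    backward = max(min(abs(x - y) for x in supp_mu) for y in supp_nu)
    return max(forward, backward)

# P_0 = delta_0 versus P_n = (1/n) delta_n + (1 - 1/n) delta_0: an outlier
# carrying the vanishing mass 1/n still drives the distance to n.
for n in (10, 100, 1000):
    print(d_hd([0.0], [0.0, float(n)]))  # -> 10.0, 100.0, 1000.0
```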

Example 12 (Symmetric pullback distance). Let f : M^n → N be the projection of an ordered n-tuple from M into a single point of a metric space (N, d). Call f symmetric if its value does not depend on the order of its arguments, i.e., if f(x_1, …, x_n) = f(x_{σ(1)}, …, x_{σ(n)}) for all permutations σ from the symmetric group Σ(n) on n elements. Then

$$d_f(X, Y) \stackrel{\text{def}}{=} d(f(X), f(Y)) \tag{B.9}$$


defines a distance between n-element subsets X, Y ⊂ M (the symmetric pullback of the distance in N).

In particular, if M has the structure of a vector space, then each function f : M^n → N can be symmetrized, yielding a symmetric function

$$f_\sigma(x_1,\ldots,x_n) \stackrel{\text{def}}{=} \frac{1}{n!} \sum_{\sigma \in \Sigma(n)} f(x_{\sigma(1)},\ldots,x_{\sigma(n)}). \tag{B.10}$$

For the projection to the first factor,

$$f : M^n \to M, \qquad (x_1,\ldots,x_n) \mapsto x_1, \tag{B.11}$$

this yields the centroid

$$f_\sigma(x_1,\ldots,x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{B.12}$$

with centroid distance d_f(X, Y) = d(X̄, Ȳ). This construction generalizes in the obvious way to finite probability measures µ, ν ∈ P_n(µ_0).

Note, however, that the symmetric pullback distance is only a pseudo-metric: there usually exist many n-subsets X, Y of M with the same image under f, i.e., d_f(X, Y) = 0 does not imply that X = Y.
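A minimal sketch of the centroid distance (B.12), added in editing with illustrative names and data, showing the pseudo-metric property:

```python
import numpy as np

def centroid_distance(X, Y):
    """Symmetric pullback distance (B.9) for the symmetrized projection
    (B.12): the Euclidean distance between the centroids of two point sets."""
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

X = np.array([[0.0, 0.0], [2.0, 0.0]])   # centroid (1, 0)
Y = np.array([[1.0, -5.0], [1.0, 5.0]])  # different set, same centroid (1, 0)
print(centroid_distance(X, Y))  # -> 0.0 although X != Y: only a pseudo-metric
```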

All the above distances have various shortcomings that are not exhibited by the following distance. Let µ, ν be two probability measures on M and consider a cost function c : M × M → R_+. The value c(x, y) represents the cost to transport one unit of (probability) mass from location x ∈ M to some location y ∈ M. We will model the process of transforming measure µ into ν, relocating probability mass, by a probability measure π on M × M. Informally, dπ(x, y) measures the amount of mass transferred from location x to y. To be admissible, the transference plan π has to fulfill the conditions

$$\pi[A \times M] = \mu[A], \qquad \pi[M \times B] = \nu[B] \tag{B.13}$$

for all measurable subsets A, B ⊆ M. We say that π has marginals µ and ν if (B.13) holds, and denote by Π(µ, ν) the set of all admissible transference plans.

Kantorovich’s optimal transportation problem is to minimize the functional

$$I[\pi] = \int_{M \times M} c(x,y)\, d\pi(x,y), \qquad \pi \in \Pi(\mu,\nu), \tag{B.14}$$

over all transference plans Π(µ, ν).


The optimal transportation cost between µ and ν is the value

$$T_c(\mu,\nu) = \inf_{\pi \in \Pi(\mu,\nu)} I[\pi], \tag{B.15}$$

and transference plans π ∈ Π(µ, ν) that realize this optimum are called optimal transference plans.

Since (B.14) is a convex optimization problem, it admits a dual formulation.

Assume that the cost function c is lower semi-continuous, and define

$$J(\varphi,\psi) = \int_M \varphi\, d\mu + \int_M \psi\, d\nu \tag{B.16}$$

for all integrable functions (ϕ, ψ) ∈ L = L¹(dµ) × L¹(dν). Let Φ_c be the set of all measurable functions (ϕ, ψ) ∈ L such that

$$\varphi(x) + \psi(y) \le c(x,y) \tag{B.17}$$

for dµ-almost all x ∈ M and dν-almost all y ∈ M. Then (Villani, 2003, Th. 1.3)

$$\inf_{\pi \in \Pi(\mu,\nu)} I[\pi] = \sup_{(\varphi,\psi)\in\Phi_c} J(\varphi,\psi). \tag{B.18}$$

For measures µ, ν ∈ P_F with representations

$$\mu = \sum_{i=1}^{m} a_i \delta_{x_i} \qquad\text{and}\qquad \nu = \sum_{j=1}^{n} b_j \delta_{y_j}, \tag{B.19}$$

any measure in Π(µ, ν) can be represented as a nonnegative m × n matrix π = (π_ij)_{i,j}, where the source and sink conditions

$$\sum_{i=1}^{m} \pi_{ij} = b_j, \quad j = 1, 2, \ldots, n, \qquad\text{and}\qquad \sum_{j=1}^{n} \pi_{ij} = a_i, \quad i = 1, 2, \ldots, m, \tag{B.20}$$

are the discrete analog of (B.13), and the problem is to minimize the objective function

$$\sum_{ij} \pi_{ij} c_{ij}, \tag{B.21}$$

where c_ij = c(x_i, y_j) is the cost matrix.

Its dual formulation is to maximize

$$\sum_i \varphi_i a_i + \sum_j \psi_j b_j \tag{B.22}$$

under the constraints ϕ_i + ψ_j ≤ c_ij.
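The primal problem (B.20)–(B.21) is an ordinary linear program, so any LP solver applies. The following sketch uses `scipy.optimize.linprog`; the helper name `transport_cost` and the two-point example data are assumptions added in editing, not from the thesis.

```python
import numpy as np
from scipy.optimize import linprog

def transport_cost(a, b, C):
    """Minimize sum_ij pi_ij c_ij (B.21) subject to the source and sink
    conditions (B.20), with pi_ij >= 0, as a linear program."""
    m, n = C.shape
    A_eq = []
    for i in range(m):                        # sum_j pi_ij = a_i
        row = np.zeros((m, n)); row[i, :] = 1.0
        A_eq.append(row.ravel())
    for j in range(n):                        # sum_i pi_ij = b_j
        col = np.zeros((m, n)); col[:, j] = 1.0
        A_eq.append(col.ravel())
    res = linprog(C.ravel(), A_eq=np.array(A_eq),
                  b_eq=np.concatenate([a, b]), bounds=(0, None))
    return float(res.fun)

a = np.array([0.5, 0.5])    # mu: mass 1/2 each at x = 0 and x = 1
b = np.array([0.25, 0.75])  # nu: mass 1/4 at y = 0, 3/4 at y = 1
C = np.abs(np.array([[0.0], [1.0]]) - np.array([[0.0, 1.0]]))  # c = |x - y|
print(transport_cost(a, b, C))  # 0.25: move mass 1/4 from x = 0 to y = 1
```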


Example 13 (Discrete distance). Consider the special cost c(x, y) = 1_{x≠y}, i.e., the distance induced by the discrete topology. Then the total transportation cost is

$$T_c(\mu,\nu) = d_{TV}(\mu,\nu). \tag{B.23}$$
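For weighted point sets on a common support, (B.23) can be checked by hand: with the 0–1 cost, mass min(µ_i, ν_i) stays in place for free and the remainder moves at unit cost. A small sketch added in editing (the helper name `tc_discrete` is an assumption):

```python
def tc_discrete(mu, nu):
    """Optimal transport cost (B.15) for the discrete cost c = 1_{x != y}
    on a common finite support: 1 minus the mass that can stay in place."""
    return 1.0 - sum(min(m, n) for m, n in zip(mu, nu))

mu = [0.50, 0.50, 0.00]
nu = [0.25, 0.25, 0.50]
# Agrees with the sup definition (B.5): sup_A |mu[A] - nu[A]| = 0.5 here.
print(tc_discrete(mu, nu))  # -> 0.5
```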

The Kantorovich problem (B.14) is actually a relaxed version of Monge’s transportation problem. In the latter, it is further required that no mass be split, so the transference plan π has the special form

$$d\pi(x,y) = d\mu(x)\,\delta[y = T(x)] \tag{B.24}$$

for some measurable map T : M → M. The associated total transportation cost is then

$$I[\pi] = \int_M c(x, T(x))\, d\mu(x), \tag{B.25}$$

and the condition (B.13) on the marginals translates as

$$\nu[B] = \mu[T^{-1}(B)] \qquad\text{for all measurable } B \subseteq M. \tag{B.26}$$

If this condition is satisfied, we call ν the push-forward of µ by T, denoted by ν = T#µ. For measures µ, ν ∈ P_F, the optimal transference plans in Kantorovich’s problem (transportation problem) coincide with solutions to Monge’s problem.

A further relaxation is obtained when the cost c(x, y) is a distance. The dual (B.18) of the Kantorovich problem then takes the following form:

Theorem 9 (Kantorovich–Rubinstein; Villani, 2003, Ch. 1.2). Let X = Y be a Polish space¹, and let c be lower semi-continuous. Then

$$T_c(\mu,\nu) = \sup\left\{ \int_X \varphi\, d(\mu-\nu) \;:\; \varphi \in L^1(d|\mu-\nu|),\ \sup_{x\ne y}\frac{|\varphi(x)-\varphi(y)|}{c(x,y)} \le 1 \right\}. \tag{B.27}$$

The Kantorovich–Rubinstein theorem implies that T_d(µ + σ, ν + σ) = T_d(µ, ν), i.e., the invariance of the Kantorovich–Rubinstein distance under addition or subtraction of common mass (Villani, 2003, Corollary 1.16). In other words, the total cost only depends on the difference µ − ν. The Kantorovich problem is then equivalent to the Kantorovich–Rubinstein transshipment problem: minimize I[π] over all nonnegative measures π on M × M such that

$$\pi[A \times M] - \pi[M \times A] = (\mu - \nu)[A]$$

1 A topological space is a Polish space if it is homeomorphic to a complete metric space that has a countable dense subset. This is a general class of spaces that are convenient to work with. Many spaces of practical interest fall into this category.


for all measurable sets A ∈ B(M). This transshipment problem is a strongly relaxed version of the optimal transportation problem. For example, if p > 1 then the transshipment problem with cost c(x, y) = ||x − y||^p has optimal cost zero (Villani, 2003). For this reason, the general transshipment problem is not investigated here.

Example 14 (Assignment and transportation problem). The discrete Kantorovich problem (B.19–B.21) is also known as the (Hitchcock) transportation problem in the literature on combinatorial optimization (Korte and Vygen, 2007). The special case where m = n and all weights in the representation (B.19) are equal, a_i = b_j = 1/n, is the assignment problem. Interestingly, as a consequence of the Birkhoff theorem, the latter is solved by a permutation σ mapping each source a_i to a unique sink b_{σ(i)} (i = 1, …, n); cf. (Bapat and Raghavan, 1997).
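In practice the assignment case is solved in polynomial time, e.g. by the Hungarian method. The sketch below uses `scipy.optimize.linear_sum_assignment`; the one-dimensional data are an illustrative assumption added in editing.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Assignment problem: m = n and a_i = b_j = 1/n in (B.19); by the Birkhoff
# theorem an optimal transference plan is a permutation sigma.
x = np.array([0.0, 1.0, 2.0])
y = np.array([2.1, 0.2, 0.9])
C = np.abs(x[:, None] - y[None, :])   # cost matrix c_ij = |x_i - y_j|
rows, sigma = linear_sum_assignment(C)
print(sigma)                          # -> [1 2 0]
print(C[rows, sigma].sum() / len(x))  # transport cost with weights 1/n
```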

B.3 Optimal transportation distances

Let (M, d) be a metric space and consider the cost function c(x, y) = d(x, y)^p if p > 0, and c(x, y) = 1_{x≠y} if p = 0. Recall that T_c(µ, ν) denotes the cost of an optimal transference plan between µ and ν.

Definition 18 (Wasserstein distances). Let p ≥ 0. The Wasserstein distance of order p is W_p(µ, ν) = T_{d^p}(µ, ν)^{1/p} if p ∈ [1, ∞), and W_p(µ, ν) = T_{d^p}(µ, ν) if p ∈ [0, 1).
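On the real line with p = 1, the distance of Definition 18 is directly available in SciPy, whose `scipy.stats.wasserstein_distance` accepts support points and weights; the example data below are an assumption added in editing.

```python
from scipy.stats import wasserstein_distance

# W_1 between two weighted point sets on the line (Definition 18, p = 1):
# SciPy computes T_d for the cost d(x, y) = |x - y| from values and weights.
mu_x, mu_w = [0.0, 1.0], [0.5, 0.5]
nu_x, nu_w = [0.0, 1.0], [0.25, 0.75]
print(wasserstein_distance(mu_x, nu_x, mu_w, nu_w))  # -> 0.25
```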

Denote by P_p the space of probability measures with finite moments of order p, i.e., such that

$$\int_M d(x_0, x)^p\, d\mu(x) < \infty$$

for some x_0 ∈ M. The following is proved in (Villani, 2003, Th. 7.3):

Theorem 10. The Wasserstein distance W_p, p ≥ 0, is a metric on P_p.

The Wasserstein distances W_p are ordered: p ≥ q ≥ 1 implies, by Hölder’s inequality, that W_p ≥ W_q. On a normed space, the Wasserstein distances are minorized by the distance in means,

$$W_p(\mu,\nu) \ge \left\| \int_X x\, d(\mu-\nu) \right\|, \tag{B.28}$$

and behave well under rescaling:

$$W_p(\alpha\mu, \alpha\nu) = |\alpha|\, W_p(\mu,\nu),$$

where αµ denotes the measure m_α#µ, obtained by push-forward of multiplication by α. If p = 2 we have the additional subadditivity property

$$W_2(\alpha_1\mu_1 + \alpha_2\mu_2,\ \alpha_1\nu_1 + \alpha_2\nu_2) \le \bigl(\alpha_1 W_2(\mu_1,\nu_1)^2 + \alpha_2 W_2(\mu_2,\nu_2)^2\bigr)^{1/2}$$

for convex weights α_1 + α_2 = 1, which follows from the convexity of W_2² in the pair of measures.
