
Master's Thesis Psychology
Methodology and Statistics Unit, Institute of Psychology
Faculty of Social and Behavioral Sciences, Leiden University
Date: October 2018
Student number: 1773305
Supervisor: Dr. F. M. T. A. Busing

Avoiding Local Minima in Nonmetric Multidimensional Scaling by Means of Deterministic Annealing and Iterated Local Search


Acknowledgements

My thesis was a long but very interesting journey, and there are some people I would like to thank for helping me during this process. First, I would like to thank my thesis supervisor Dr. Frank Busing for the instructive appointments, and for his accessibility, patience, and enthusiasm. Thank you, Prof. dr. Willem Heiser and Prof. dr. Patrick Groenen, for taking the time to cast a critical eye on my research.

I would like to thank my dear son Otis for accepting being dropped off more often during busy periods and for giving me the strength and courage to finish my education, so that we can build up our lives even further. I would like to thank my boyfriend Patrick for being so supportive, for boosting my confidence when I needed it, and for helping me out with computer things. Last but not least, I would like to thank my father Ipe for his interest in my thesis subject and his persistent intention to study along with me.


Abstract

The current study aimed to map the seriousness of the local minima problem in nonmetric multidimensional scaling (MDS) by evaluating the results of the single random start procedure (SR). Moreover, deterministic annealing (DA) and iterated local search (ILS) were proposed to avoid local minima, to be tested against multiple random starts (MR) and MDS with a classical scaling start (CS). The performance of these procedures was evaluated primarily by studying the proportion of global minima found, and secondarily by considering the Spearman ρ and running time of the global minima found. In developing DA, it was found to be most effective to base the cooling scheme of temperature T on the tieblocks of the data. Two types of DA procedures were tested: one with a deterministic start and one with a random start. For ILS, perturbance was applied by adding error to the configuration. Concerning the acceptance criterion, two versions of ILS were developed: one accepting only better solutions, and one also accepting worse solutions with a certain probability. A simulation study was performed with r = 100 replications and factors procedure, type of procedure, number of objects, amount of missings, and type of rank.

Several main and interaction effects of number of objects, amount of missings, and type of rank were found when studying the results of the SR procedure. Furthermore, DA with a deterministic start performed better than DA with a random start in terms of the proportion of global minima found. No significant difference was found between the two proposed types of ILS. The performance of the (combined) ILS did not differ significantly from that of the best version of DA, whereas MR and CS performed significantly better. Evaluation of the secondary variable Spearman ρ showed that CS and DA reliably found qualitative solutions over all conditions, whereas ILS and MR tended to find more degenerate solutions as the number of objects increased in combination with (approximately) full rank. As far as running time is concerned, MR and especially CS generally took more time to find the global minimum than DA or ILS. Some recommendations are proposed for future research.


Table of Contents

Acknowledgements
Abstract
1. Introduction
2. Algorithms
   2.1 Multidimensional scaling (MDS)
   2.2 Deterministic annealing (DA)
       2.2.1 Cooling scheme
       2.2.2 Start configuration
   2.3 Iterated local search (ILS)
       2.3.1 Perturbance
       2.3.2 Acceptance criterion
       2.3.3 History
3. Methodology
   3.1 Simulation study
       3.1.1 Factors
       3.1.2 Data generation
       3.1.3 Outcome variables
   3.2 Data analysis
4. Results
   4.1 The local minima problem in nonmetric MDS
   4.2 Types of DA and ILS
   4.3 Global
5. Discussion


1. Introduction

Multidimensional scaling (MDS) (Borg & Groenen, 2005) is an ordination method that helps to visualize similarities or dissimilarities of objects in a clear way, usually in two or three dimensions. It places objects in such a way that similar objects are close to each other and dissimilar objects are further apart. MDS differs from comparable algorithms, like principal component analysis (PCA), in that the dimensions in MDS are meaningless, which means that the configuration of points can be rotated, mirrored, scaled, and translated without changing meaning, just like the handling of a roadmap. The current study focuses on nonmetric MDS, which bases the configuration of points on the rank ordering of the data rather than on their numeric values, as in metric MDS.

Kruskal (1964a) developed an actual loss function (a "badness-of-fit" statistic) to be minimized, called raw stress:

\sigma_r = \sum_{i<j} w_{ij} \left( d_{ij}(X) - \hat{d}_{ij} \right)^2 \qquad (1)

where d_ij(X) is the Euclidean distance between objects i and j, and d̂_ij and w_ij are the corresponding transformed data and weight elements. Raw stress (1) thus represents the amount of divergence between the actual Euclidean distances and the transformed data. This transformation is explained in more detail in the next section.

However, raw stress is sensitive to the size of the configuration, and if there is no constraint preventing the transformations from becoming 0, all objects may end up in the origin. To control for this unwanted behaviour, a normalized version of stress varying between 0 and 1 can be used:

\sigma_n = \frac{\sum_{i<j} w_{ij} \left( d_{ij}(X) - \hat{d}_{ij} \right)^2}{\sum_{i<j} w_{ij} \hat{d}_{ij}^2} \qquad (2)

As for the minimization of normalized stress, a problem of (sub)gradient algorithms is that they might end up in local minima. Even when the stress of the local minimum


solution is low, the practical meaning of a local minimum solution can be misleading (Mathar, 1997; Borg & Mair, 2017). In metric MDS, this problem is found to be most severe in unidimensional scaling (De Leeuw & Heiser, 1977) and not at all severe in full-dimensional scaling (p = n − 1) (De Leeuw, 1993), where only one (global) minimum exists. For dimensionalities between these extremes, the problem becomes less severe as there are more dimensions (Groenen, 1993; Groenen & Heiser, 1996). Global optimization methods for metric MDS proposed in the literature are multiple random starts,

multi-level single-linkage (Timmer, 1984), genetic algorithms (Goldberg, 1989), tabu search (Groenen, 1993), the tunneling method (Groenen, 1993; Groenen & Heiser, 1996), distance smoothing (Groenen, Heiser & Meulman, 1999), simulated annealing (Trejos et al., 2000), and deterministic annealing (Bae, Qiu & Fox, 2010). Borg & Groenen (2005) state that the local minimum problem is worse in nonmetric MDS, but not at all severe in unidimensional scaling. However, supporting literature that explains how severe this problem actually is, is still lacking.

The current study aims to map the seriousness of the local minima problem in nonmetric MDS by evaluating the results of the single random start procedure (SR). Moreover, it aims to avoid local minima without intervening in the MDS algorithm. The current research proposes deterministic annealing (DA) and iterated local search (ILS) to avoid local minima in nonmetric MDS. DA has already been found effective in metric MDS (Bae, Qiu & Fox, 2010), but has not yet been studied in nonmetric MDS. ILS has been applied effectively to combinatorial and scheduling problems (Baum, 1986; Taillard, 1995; Stützle, 1998), but has not been applied to MDS yet. ILS applied to MDS can be considered an adaptation of multiple random starts (MR) that aims to make a biased walk rather than a random walk through the solution space. The performance of the proposed methods will be compared with the performance of MR and MDS with a classical scaling start (CS) (Torgerson, 1952; Gower, 1966).


It is expected that the results of SR yield a good understanding of how serious the local minima problem in nonmetric MDS is. The following trends might be visible: the problem becomes more serious 1) as there are fewer objects, 2) as there are more missings, and 3) as there are fewer categories. It is expected that MR and CS perform well, but take a long time to run. The aim is for DA and ILS to perform as well as MR and CS while outperforming these competing procedures in running time.

In the next section, MDS, DA, and ILS will be explained in more detail. In the third section, the simulation study and the data analysis will be described. The results of the comparison between the performance of the methods are discussed in the fourth section, followed by a general discussion in the last section.


2. Algorithms

2.1 Multidimensional scaling (MDS)

The optimization problem in nonmetric MDS is a two-step procedure that optimizes the Euclidean distances and the transformed data respectively. In the first step, the Euclidean distances are optimized by means of the SMACOF algorithm (De Leeuw, 1977), which applies iterative majorization in multiple dimensions. Since its introduction, the SMACOF algorithm has been preferred over earlier methods, for it is simpler and more powerful due to its monotone decrease of the stress criterion. At each step, SMACOF minimizes a simple quadratic function τ which lies almost entirely above the original stress function and touches it in the so-called supporting point. The majorization inequality looks as follows:

\sigma_r(\hat{d}, X) = \eta_{\hat{d}}^2 + \mathrm{tr}\, X'VX - 2\,\mathrm{tr}\, X'B(X)X \qquad (3)
\leq \eta_{\hat{d}}^2 + \mathrm{tr}\, X'VX - 2\,\mathrm{tr}\, X'B(Z)Z = \tau(\hat{d}, X, Z) \qquad (4)

where η²_d̂ is a constant and V is the matrix with v_ij = −w_ij if i ≠ j and v_ii = Σ_{j≠i} w_ij on the diagonal. B(Z) is the matrix whose off-diagonal elements are 0 when d_ij(Z) = 0 and −w_ij d̂_ij / d_ij(Z) otherwise, with its diagonal elements defined as minus the sum of its off-diagonal row elements. Matrices X and Z represent the current and the previous configuration coordinates, respectively.

The minimum of τ can be obtained by setting its derivative to zero,

\nabla \tau(\hat{d}, X, Z) = 2VX - 2B(Z)Z = 0,

which yields the update formula of the SMACOF algorithm:

X = V^{+}B(Z)Z, \qquad (5)

which is known as the Guttman transform (De Leeuw & Heiser, 1980), where V⁺ is the Moore-Penrose inverse of the V matrix. It yields the new configuration X, which is used in the next iteration until the minimum is attained, no more improvement is detected, or the maximum number of iterations has been reached. In the current study, normalized stress is minimized, which does not affect the location of the stationary points, since the denominator of normalized stress is constant and explicitly standardized to be equal to n(n − 1)/2.
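To make the update concrete, the Guttman transform of equation (5) can be sketched in a few lines of Python. This is an illustrative sketch only (the study's own implementation was written in R and C); the function name `guttman_update` and the NumPy-based construction are my own, following the definitions of V and B(Z) given above.

```python
import numpy as np

def guttman_update(Z, dhat, W):
    """One SMACOF iteration, X = V^+ B(Z) Z (the Guttman transform).
    Z: previous configuration (n x p); dhat, W: symmetric (n x n)."""
    # V: v_ij = -w_ij off the diagonal, minus the off-diagonal row sums on it
    V = -W.copy()
    np.fill_diagonal(V, 0.0)
    np.fill_diagonal(V, -V.sum(axis=1))
    # B(Z): off-diagonal -w_ij * dhat_ij / d_ij(Z), or 0 when d_ij(Z) = 0
    D = np.sqrt(((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2))
    safe = np.where(D > 0, D, 1.0)
    B = np.where(D > 0, -W * dhat / safe, 0.0)
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(axis=1))
    # Moore-Penrose inverse handles the rank deficiency of V
    return np.linalg.pinv(V) @ B @ Z
```

Iterating this update monotonically decreases raw stress, which is the property that made SMACOF preferable to earlier gradient methods.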

In the second step, the transformed data is optimized using Kruskal's up-and-down blocks algorithm, or "pool adjacent violators algorithm" (PAVA) (Kruskal, 1964b), while keeping the Euclidean distances fixed. This ordinal transformation is described by:

\text{if } \delta_{ij} < \delta_{kl}, \text{ then } \hat{d}_{ij} \leq \hat{d}_{kl} \qquad (6)

i.e. if objects i and j are more similar than objects k and l, the distance between i and j should be less than or equal to the distance between objects k and l, which suggests a monotonic relationship. The idea of obtaining a monotonic relationship between the data and the Euclidean distances was brought to attention by Shepard (1962). A monotone regression fits a free-form line to a sequence of data, which is non-decreasing and lies as close to the data as possible, in a least-squares sense.
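The pool-adjacent-violators idea can be illustrated with a compact sketch. This is a generic textbook version, not the thesis implementation; it fits a least-squares non-decreasing sequence by merging adjacent blocks that violate monotonicity into weighted means.

```python
def pava(y, w=None):
    """Pool adjacent violators: least-squares non-decreasing fit to y.
    Returns the fitted values in the original order."""
    if w is None:
        w = [1.0] * len(y)
    # each block holds (weighted mean, total weight, number of elements)
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge backwards while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    out = []
    for m, _, n in blocks:
        out.extend([m] * n)
    return out
```

For example, `pava([1, 3, 2, 4])` pools the violating pair (3, 2) into two fitted values of 2.5, yielding a non-decreasing sequence.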

For the interested reader, a more thorough description of the SMACOF algorithm can be found in Borg & Groenen (2005).


2.2 Deterministic annealing (DA)

Deterministic annealing (DA) is an optimization method used effectively in (pairwise) data clustering (Rose, Gurewitz & Fox, 1990; Hofmann & Buhmann, 1997), graph matching (Gold & Rangarajan, 1996), and metric MDS (Klock & Buhmann, 1997; Bae, Qiu & Fox, 2010). As in simulated annealing (SA), the algorithm involves a temperature parameter T, which is lowered during the process, regulating the balance between diversification and specification. At high temperatures (the diversification stage), the algorithm is not yet greedy to find the best solution, but explores the field of solutions. At lower temperatures, the algorithm has already explored the field and starts to focus on finding the best solution (the specification stage).

In contrast to SA, where optimization is done using a stochastic random approach, DA takes a deterministic approach to get to the global minimum. This is done by gradually introducing information that is available based on the value of T, and optimizing the solution according to that information. For nonmetric MDS, the information that is taken into account for optimization under each T is presented in the Δ̂ matrix (i.e. the expected data matrix), which is specified by:

\hat{\delta}_{ij} = \begin{cases} \delta_{ij} - T\sqrt{2p} & \text{if } \delta_{ij} > T\sqrt{2p} \\ 0 & \text{otherwise} \end{cases} \qquad (7)

The data under each T will be transformed,

\hat{D} = f(\hat{\Delta}) \qquad (8)

and then stress is minimized under each subsequent T:

\sigma_n = \frac{\sum_{i<j} w_{ij} \left( d_{ij}(Z) - \hat{d}_{ij}[T] \right)^2}{\sum_{i<j} w_{ij} \hat{d}_{ij}^2} \qquad (9)


2.2.1 Cooling scheme. In prior research, temperature T was cooled by multiplying it with a cooling parameter between 0 and 1 (Klock & Buhmann, 1999; Bae, Qiu & Fox, 2010). However, in nonmetric MDS the expected data will be ordinally transformed, and only the rank order of the data matters, not the actual values. This means that the solution only changes when a new element becomes nonzero in Δ̂, actually changing the rank order of the data. Let us consider an example data matrix to illustrate this phenomenon:

\Delta = \begin{pmatrix} 0 & 3 & 4 & 3 \\ 3 & 0 & 1 & 5 \\ 4 & 1 & 0 & 6 \\ 3 & 5 & 6 & 0 \end{pmatrix}

with vectorized lower triangle:

\delta = (3, 4, 3, 1, 5, 6)

For simplicity, √(2p) was left out of the formula for obtaining δ̂, so that new data elements become nonzero as soon as T drops below an element's value. In Table 1, the starting T is of such height that only the largest data element becomes visible in δ̂. The temperature is cooled by means of a cooling parameter of 0.95, until the next largest data element is introduced.

Table 1: Change in d̂ during the descent of T.

T = 5.990: δ̂ = (0, 0, 0, 0, 0, 0.010); d̂ = (0.151, 1.409, 0.003, 1.409, 0.148, 1.409)
T = 5.691: δ̂ = (0, 0, 0, 0, 0, 0.310); d̂ = (0.151, 1.409, 0.003, 1.409, 0.148, 1.409)
T = 5.406: δ̂ = (0, 0, 0, 0, 0, 0.594); d̂ = (0.151, 1.409, 0.003, 1.409, 0.148, 1.409)
T = 5.136: δ̂ = (0, 0, 0, 0, 0, 0.864); d̂ = (0.151, 1.409, 0.003, 1.409, 0.148, 1.409)


We can see that d̂ only changes when a new rank order is introduced in δ̂, i.e. when a new data element becomes nonzero in δ̂, which happens at T = 4.88. This means that for nonmetric MDS, it is more convenient to let go of the cooling parameter and instead create a vector T based upon the tieblocks of the data, illustrated in Table 2.

Table 2: T based on descendingly sorted δ.

δ_sorted:  6     5     4     3     3    1
T:         5.99  4.99  3.99  2.99  —    0.99

Consider that T might be of a different length than δ̂, for there can be tieblocks in the data with a length greater than one (like the two 3's in the example). By determining T based on the tieblocks of the data, unnecessary steps are skipped, which reduces the running time of the algorithm. Moreover, running time is especially reduced when dealing 1) with ordinal partially ranked data, for the algorithm needs only 5 rounds for a 5-point Likert scale, and 2) with missing data in fully ranked data, since the length of vector T is reduced even further because there are probably fewer tieblocks.
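The tieblock-based schedule can be sketched as follows. The helper names and the small offset `eps` are assumptions for illustration; the scheme simply takes one temperature per unique dissimilarity value, chosen just low enough to make that tieblock nonzero in the expected data matrix of equation (7).

```python
import math

def tieblock_temperatures(delta, p=2, eps=0.01):
    """Cooling schedule based on the tieblocks of the data: one
    temperature per unique dissimilarity value, chosen just below the
    point where delta_ij > T * sqrt(2p) starts to hold for that value."""
    scale = math.sqrt(2 * p)
    uniq = sorted(set(delta), reverse=True)   # descending tieblock values
    return [(v - eps) / scale for v in uniq]

def expected_data(delta, T, p=2):
    """Expected data under temperature T (eq. 7)."""
    thr = T * math.sqrt(2 * p)
    return [d - thr if d > thr else 0.0 for d in delta]
```

For the example data δ = (3, 4, 3, 1, 5, 6), this yields five temperatures for six elements (the tied 3's share one step), and at the first temperature only the largest element is nonzero in the expected data, matching Tables 1 and 2.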

2.2.2 Start configuration. DA has been applied in combination with a random start configuration (Klock & Buhmann, 1999; Bae, Qiu & Fox, 2010). This is in contradiction with what one would expect from a procedure with "deterministic" in its name. In the current research, a deterministic start configuration is proposed, to be tested besides the random start configuration. In this start configuration, only the objects that are involved in the largest data element(s) are placed apart in a deterministic way. In developing this start configuration, it was found to be important that the points are not placed in a straight line, because the algorithm has difficulty escaping from that single dimension. In the example just proposed, objects 3 and 4 are most different (δ34 is the largest data element); one of these objects is placed 1 up and 0.04 to the left. The rest of the objects are placed in the origin (see Figure 1). A summary of the DA algorithm is presented in Box 1.

Figure 1: Deterministic start



2.3 Iterated local search (ILS)

Iterated local search (ILS) (Lourenço, Martin & Stützle, 2010) is a stochastic search algorithm that aims to find a more optimal solution by "perturbing" the current optimized solution X so that it ideally reaches a point X′ within another "basin of attraction", a set of solutions around a certain minimum (see Figure 2). After optimization, this new minimum X*′ will or will not be accepted according to a pre-chosen acceptance criterion. This process is repeated until the solution cannot be improved any further, or until the algorithm has been yielding the same solution too many times. ILS has been successfully applied to combinatorial problems like the traveling salesman problem (TSP) (Baum, 1986) and the quadratic assignment problem (QAP) (Taillard, 1995), and to scheduling problems like the flow shop problem (FSP) (Stützle, 1998). In the following sections, the specific implementation of perturbance, acceptance criteria, and history for MDS will be described.

Figure 2: Perturbance


2.3.1 Perturbance. In combinatorial problems, perturbation consists of making changes in the sequence of objects. In the current study, perturbation should be applied in more than one dimension. This is achieved by adding error to the objects' coordinates (Wagenaar & Padmos, 1971). A significant matter in the development of ILS is how strong the perturbation should be. On the one hand, when the perturbance is too strong, ILS behaves like MR, and only with low probability will a better solution be found. It is therefore preferable to find a perturbation scheme with which the algorithm makes a biased walk through the solutions, so that running time is reduced compared to MR. On the other hand, perturbations should not be made too small, because otherwise SMACOF will often fall back into the minimum just visited and the diversification of the walk will be very limited.

For complex problems like MDS, perturbations of fixed strength for all points will probably not lead to good performance of ILS. Some objects contribute more to stress than others, and probably need more adjustment. In the current study, the perturbance per point is described by a normal distribution with its standard deviation dependent on:

1. Temperature T, as introduced in section 2.2. The algorithm starts with T = 1, which is cooled in each replication by multiplication with cooling parameter α. The overall configuration is perturbed more at higher temperatures and allowed to explore the entire field of solutions, while as the temperature lowers, the configuration is perturbed less and the search specifies on a certain region;

2. Overall stress (√σ_n), i.e. the sum of the stress per object (σ_n(i)), since it is preferred to perturb the configuration as a whole more when the overall stress is high;

3. Stress per object,
   a. relative to overall stress (√σ_n(i)/σ_n), to perturb each object somewhat more based on its relative contribution to the overall stress;
   b. in an absolute sense (√(n σ_n(i))), to compensate for the fact that the relative stress per object decreases as the number of objects grows.


Thus, for each object a separate normal distribution for the perturbance is specified:

sd_i = T \left( \frac{\sqrt{\sigma_{n(i)}}}{\sigma_n} + \sqrt{\sigma_n} + \sqrt{n\,\sigma_{n(i)}} \right) \qquad (10)

which adapts the perturbation during run-time, implying a reactive search.
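A sketch of this per-object perturbation scheme is given below. Since equation (10) is reconstructed from a garbled source, the exact grouping of its terms is a best-effort reading, and the function name `perturb` is illustrative rather than the study's own code.

```python
import math
import random

def perturb(X, stress_per_object, T, rng=random):
    """Perturb each object's coordinates with normal noise whose sd
    follows the reading of eq. (10):
        sd_i = T * (sqrt(s_i)/s + sqrt(s) + sqrt(n * s_i)),
    where s is overall stress (sum of the per-object stresses s_i).
    The grouping of terms is an assumption from the garbled formula."""
    n = len(X)
    s = sum(stress_per_object)
    out = []
    for xi, si in zip(X, stress_per_object):
        sd = T * (math.sqrt(si) / s + math.sqrt(s) + math.sqrt(n * si))
        out.append([c + rng.gauss(0.0, sd) for c in xi])
    return out
```

Objects with a larger stress contribution receive a larger standard deviation, and lowering T shrinks all perturbations, which is the reactive behaviour described above.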

2.3.2 Acceptance criterion. The acceptance criterion determines whether the new solution X*′ is accepted as the best solution or not. The most straightforward acceptance criterion is to accept the new solution only when its stress is lower than that of the current solution. This criterion clearly favors intensification over diversification. The criterion on the other extreme (in favor of diversification) is the random walk criterion, which always accepts the new solution, even if it has higher stress. Many choices of criteria lie between these extremes. For example, a simulated annealing type of acceptance criterion used in the large-step Markov chains algorithm (Martin, Otto & Felten, 1991, 1992) accepts X*′ as the best solution if it has lower stress than X*; if it has higher stress, it is accepted with a certain probability.

An important side note is that the perturbation scheme and the acceptance criterion should not be seen as independent parameters, for together they define the balance between intensification and diversification. On the one hand, large perturbations are only useful when they can be accepted by an appropriate acceptance criterion that is not too biased toward better solutions. On the other hand, when small perturbations are applied, there is no need for an acceptance criterion that accepts large deteriorations with a certain probability.


In the current study, two acceptance criteria are taken into account: 1) accepting only solutions with lower stress,

\text{Better}(X, X^{*\prime}, \text{history}) = \begin{cases} X^{*\prime} & \text{if } \sigma_{\text{current}} < \sigma_{\text{best}} \\ X^{*} & \text{otherwise} \end{cases} \qquad (11)

and 2) also accepting solutions with higher stress with a certain probability,

\text{Markov}(X, X^{*\prime}, \text{history}) = \begin{cases} X^{*\prime} & \text{if } \sigma_{\text{current}} < \sigma_{\text{best}} \\ X^{*\prime} & \text{with probability } P \text{ if } \sigma_{\text{current}} > \sigma_{\text{best}} \\ X^{*} & \text{otherwise} \end{cases} \qquad (12)

where the probability with which worse solutions are accepted is specified by:

P = \frac{\exp\!\left( -(1.0 - T)^2 / (\text{maxr} \cdot T \cdot \sigma_{\text{best}}) \right)}{2 \cdot 3.14 \cdot \sqrt{\sigma_{\text{current}} / \sigma_{\text{best}} - 0.9}} \qquad (13)

which is built on a normal distribution based on:

1. Best stress (√σ_best); for small best stress, the probability with which a worse solution is accepted is generally lower;

2. The relative difference between the current and best solution (√(σ_current/σ_best − 0.9)); small deteriorations are more likely to be accepted than large deteriorations;

3. Temperature T; at high temperatures it is more probable that a worse solution is accepted than at low temperatures;

4. The maximum number of iterations maxr, to make sure that the maximum probability with which worse solutions are accepted does not approach zero too soon. In the current study, we applied an α of 0.99, which corresponds with a maximum of 448 replications.
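The Markov-type criterion of equations (12) and (13) can be sketched as follows. The grouping of terms in equation (13) is a best-effort reading of the garbled source, and the function name `accept_markov` is illustrative.

```python
import math
import random

def accept_markov(s_current, s_best, T, maxr=448, rng=random):
    """Markov-type acceptance: always accept improvements; accept a
    worse solution with probability P (reading of eq. 13), built on
    best stress, relative deterioration, temperature, and the
    iteration budget maxr. Requires T > 0 and s_current >= s_best
    in the probabilistic branch."""
    if s_current < s_best:
        return True
    num = math.exp(-(1.0 - T) ** 2 / (maxr * T * s_best))
    den = 2 * 3.14 * math.sqrt(s_current / s_best - 0.9)
    return rng.random() < num / den
```

Note how the pieces interact: a small σ_best shrinks the exponent's denominator and thus the numerator, a large deterioration inflates the denominator, and a low T drives the acceptance probability toward zero, matching points 1-4 above.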


2.3.3 History. In the acceptance criterion formulas, the term history appeared. It is important to note here that perturbance is not always applied to the best solution. In the current study, the algorithm is allowed to walk √n perturbances further than the best solution, where n is the number of objects. If the algorithm has not accepted another "best", it returns to the best solution. If after √n times n perturbances no "better" solution has been found, the search is terminated.


3. Methodology

3.1. Simulation study

The development of the algorithms and the data analysis were done in R (R Core Team, 2017), and the simulation study was done in C (Microsoft Visual Studio, 2017).

3.1.1. Factors. The seriousness of the local minima problem will be tested by means of a Monte Carlo simulation study considering between factors procedure, n, p, and missings, and within factors rank and type. The factors are now described in detail.

procedure. Procedures that are taken into account are SR, DA, ILS, MR, and CS.

type. For DA and ILS, two types of each algorithm are proposed: a deterministic start and a random start for DA, and the "Better" and "Markov" acceptance criteria for ILS.

n. It was decided to investigate how the problem changes from a small to a large number of objects n, as was studied earlier in metric MDS (Groenen, 1993). In a pilot simulation study, n was varied over 10, 20, 50, 100, and 200 objects. The severity of the problem did not significantly differ between n = 100 and n = 200. It was therefore decided to exclude n = 200 from the main simulation study.

p. In a pilot study, dimensionalities p = 2 and p = 3 were taken into consideration. The severity of the local minima problem did not significantly differ between these levels of dimensionality, thus only p = 2 was included in the main simulation study. Since p involves only one level, it will not be considered as a factor.

missings. It was decided to implement missings in the current study, for there is no clarity in the literature on how missingness influences the severity of the local minima problem. Also, by introducing missings, there will be more local minima, while the global minimum is still known (the solution with stress equal to 0). In practice, it is desirable to plan an incomplete data design, since the acquisition of judgements on how dissimilar objects are can be very demanding. Moreover, many judgements on the dissimilarity of pairs are actually redundant information (Young & Cliff, 1972). Birchfield & Subramanya (2005) stated that the maximum allowable percentage of missings to regain a stable solution is ((n − p − 1)/n)². For n = 100, the maximum allowable percentage of missings is 94 percent, which seems rather extreme. However, earlier research on MDS applied to n = 833 mobile base stations also contained 98.8% missings (Groenen, Mathar & Trejos, 2000). In the current study, 0%, 50%, and 100% of the maximum allowable missings were implemented.

rank. It is aimed to examine how the local minima problem changes when there are few categories versus when full rank is approached. Included levels are partial ranks with 5, 9, and 25 categories, and full rank.

An overview of the included factors and corresponding levels is presented in Table 3.

Table 3: Factors and corresponding levels of the simulation study.

Factor           Levels
procedure        SR, DA, ILS, MR, CS
type (within)    for DA: deterministic start, random start;
                 for ILS: acceptance criterion "Better", acceptance criterion "Markov"
n                10, 20, 50, 100
missings         0%, 0.5((n − p − 1)/n)²%, ((n − p − 1)/n)²%
rank (within)    ordinal partial 5 categories, ordinal partial 9 categories,
                 ordinal partial 25 categories, ordinal full

3.1.2. Data generation. For each combination of between factors, r = 100 true configurations are generated. The current study makes use of fully ranked and partially ranked data. Ordinal full data is obtained by δij = log(1 + dij), where dij are the Euclidean distances of the true configuration; with this transformation, practically no ties are present. For partial rank, 5-, 9-, and 25-point scales are used to rank pairs, with 1 representing "very similar" and the highest rank number representing "very dissimilar". When partial rank is used, there are many ties, for the same ranking numbers are chosen multiple times. Partial data is created by randomly taking (number of categories − 1) elements δij and ordering these elements from low to high. All elements that fall below the first random threshold fall in category 1, all elements that are equal to or above threshold 1 and below random threshold 2 fall in the second category, and so on. For each category, the elements δij are replaced by the mean of the δij's in that category.
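The categorization step can be sketched as follows; the function name and the use of Python's random module are illustrative, not the study's C implementation.

```python
import random

def partial_rank(delta, n_categories, rng=random):
    """Discretize dissimilarities into ordered categories: draw
    (n_categories - 1) random elements as thresholds, bin every
    element, and replace each element by the mean of its bin."""
    thresholds = sorted(rng.sample(delta, n_categories - 1))

    def category(d):
        # elements below the first threshold fall in the lowest category,
        # elements at or above threshold k (and below k+1) in category k+1
        k = 0
        while k < len(thresholds) and d >= thresholds[k]:
            k += 1
        return k

    cats = [category(d) for d in delta]
    groups = {}
    for c, d in zip(cats, delta):
        groups.setdefault(c, []).append(d)
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    return [means[c] for c in cats]
```

By construction the output preserves the rank order of the input while collapsing it into at most `n_categories` distinct (tied) values, which is exactly what produces the tieblocks discussed in section 2.2.1.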

The missing data elements were implemented by randomly replacing a 1 (available) by a 0 (missing) in weight matrix W. It was made sure that W was still connected, i.e. that all objects are somehow connected by valid nonmissing data. This is required for the irreducibility of the V matrix. In MDS, this condition should be met, for otherwise two or more separate MDS problems exist.
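The connectedness requirement can be checked with a simple breadth-first search over the weight matrix; this sketch treats W as an adjacency matrix of nonmissing pairs.

```python
from collections import deque

def is_connected(W):
    """Check that weight matrix W (1 = available, 0 = missing) links
    all objects through nonmissing entries, so that the MDS problem
    does not decompose into separate subproblems."""
    n = len(W)
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if j != i and W[i][j] and j not in seen:
                seen.add(j)
                queue.append(j)
    return len(seen) == n
```

A weight matrix that fails this check would have to be repaired (or the missingness redrawn) before running SMACOF, since V would then be reducible.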

3.1.3. Outcome variables. Since the generated data in the current study does not contain error, global minimum solutions can be recognized as solutions having zero stress. Solutions with stress greater than zero are considered to be local minima. The resulting outcome variable global (yes/no) is of main interest. Secondary outcome variables are spearman and time, corresponding to the global minimum solution. The spearman rank correlation carries information about the quality of the global minimum solutions. It appeared that some global minimum solutions showed signs of degeneracy, that is, excessively many ties in the distances while maintaining proper order and zero stress contribution. The lower the spearman rank correlation between the true and the solution's distances, the more degenerate the solution is. The running time of the solution contains information on how much effort was needed to find the global minimum. Table 4 provides an overview of the outcome variables.


Table 4: Outcome variables.

Variable    Definition
global      whether a global minimum was found (yes/no)
spearman    rank correlation between the true distances and the solution's distances
time        running time of the solution

3.2 Data analysis

First, the seriousness of the local minima problem in nonmetric MDS is investigated by performing a logistic multilevel analysis on the results of the SR procedure, with global as outcome variable and n, missings, and rank as (within) factors.

To reduce the number of groups to be compared, the best versions of DA and ILS are selected. The performance of the two types of DA and of ILS will be compared empirically, by examining differences in proportions global and the corresponding confidence intervals. If the two versions are suspected to differ significantly, a logistic multilevel analysis will be performed with global as outcome variable and type as within factor, blocking for n, missing, and rank (within) and the two-way interactions among them.

Finally, the performance of the selected versions of DA and ILS will be compared to MR and CS with another logistic multilevel analysis with global as outcome variable and procedure as between factor, blocking for factors n, missing, and rank (within) and the three-way interactions among them.

The secondary variables, spearman and time, will be evaluated empirically, by considering differences in distributions between groups.


4. Results

Out of n = 33592 cases, n = 31643 converged within the maximum number of iterations. It was decided to also include the n = 1949 cases for which the solution had not converged within the maximum number of iterations: these non-converged solutions kept moving the object coordinates by tiny amounts, even when stress was already zero. Furthermore, n = 8 cases were excluded prior to analysis. One of these diverged, meaning that stress ascended during optimization. The other seven were removed due to an unfortunate distribution of missing data in combination with the corresponding data matrix, even though the weight matrix was connected.

4.1 The local minima problem in nonmetric MDS

An idea of the seriousness of the local minima problem is obtained by considering the proportions of global minima found by SR, presented in Table 5. It seems that as the amount of missings increases, the problem becomes more severe. However, for ranks 5 and 9 in combination with small n, the problem becomes less severe as missings are introduced. Furthermore, when there are no missings, the global minimum is found more often as n increases and as rank approaches full rank. But when missing = 1, the opposite trend is visible: the problem becomes more severe as n increases and as rank approaches full rank. The problem for small n is not as severe when there are many missings; however, it might be that the solutions found are very degenerate. Overall, it seems most probable to find the global minimum when studying a large sample with no missings and rank approaching full rank.


A logistic multilevel analysis was executed to study main and interaction effects of n,

missings and rank on global. In Table 6, the significant effects are presented. Levels n =

10, missing = 0, and rank = 5 were taken as reference levels. The effects can be interpreted by considering the odds ratio, which is strictly positive and has no upper boundary. An odds ratio smaller than one indicates that the odds of finding a global minimum for a certain level are lower than the odds for the reference level, whereas an odds ratio larger than one indicates higher odds of finding the global minimum. Some odds ratios in Table 6 stand out. The odds of finding a global minimum are remarkably higher than those of the reference group 1) when the maximum allowable missings are introduced, and 2) for higher levels of rank in combination with higher levels of n (two-way interactions). The odds of finding a global minimum are remarkably lower than those of the reference group 1) when n is large in combination with the maximum allowable missings (two-way interactions), and 2) for higher levels of n in combination with the maximum allowable missings in full rank (three-way interactions).
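Since the models are logistic, a reported odds ratio is simply the exponential of the corresponding regression coefficient on the log-odds scale. A minimal sketch; the coefficient value below is hypothetical and not taken from Table 6:

```python
import math

# Hypothetical log-odds coefficient for some factor level vs. its reference level
beta = 0.588
odds_ratio = math.exp(beta)
# odds_ratio > 1 means higher odds of finding the global minimum than the
# reference level; odds_ratio < 1 means lower odds. Here exp(0.588) is ~1.80.
```

The reported ratios (e.g. 16.71 for MR vs. DA) can thus be mapped back to log-odds coefficients via the natural logarithm.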

4.2 Types of DA and ILS

Comparing the performance of the two types of ILS empirically, no significant difference was found in terms of the proportion global, nor in terms of spearman or time. It was therefore decided to combine the results of the two versions of ILS for further comparisons. The performance of DA with a deterministic start and DA with a random start, however, did seem to differ (see Figure 3). In all conditions, DA with a deterministic start performs as well as or better than DA with a random start. The trends found when studying the SR results, described in the previous subsection, are visible here again.

The difference between the two versions of DA was supported by a logistic multilevel regression analysis, controlling for n, missing, and rank and the two-way interactions among them. Three-way interactions were not considered in this comparison, because that model was too complex for this subset of the data. The odds of finding a global


minimum for DA with a deterministic start were 1.80 times the odds of finding a global minimum for DA with a random start, p < 0.001. It was therefore decided to take only DA with a deterministic start into consideration for further comparisons.


4.3 Global

In Table 7, the proportions of global minima are presented for each procedure in each condition. The table shows that MR and CS generally perform better in terms of global optimization than DA or ILS. However, when no missings are introduced, DA and ILS do almost as well as MR and CS in full rank. DA also seems to perform well in partial rank when n is small, whereas ILS performs well in partial rank when n is large. When the maximum allowable missings are introduced, DA and ILS keep up with MR and CS in case of partial rank with small n. For large n with the highest level of missing, all procedures have difficulty finding the global minimum as rank approaches full rank. The MR and CS procedures yield comparable results, and differ notably only when n = 20 with the maximum number of missings and full rank.

A logistic multilevel regression analysis was performed to determine whether DA, ILS, MR, and CS differed significantly in these proportions, controlling for n, missings, and rank and the three-way interactions among them. Taking DA as the reference, the performance of ILS did not differ significantly. The performance of MR and CS, however, did differ significantly from that of DA. The odds of finding a global minimum for MR were 16.71 times the odds of finding a global minimum for DA, p < 0.001. The odds of finding a global minimum for CS were 12.24 times the odds of success for DA, p < 0.001.


4.4. Secondary outcome variables

Empirical investigation of the performance of DA, ILS, MR, and CS in terms of spearman only revealed a notable difference where the maximum allowable number of missings was introduced, see Figure 4.

Figure 4: Distributions of spearman for the global minimum solutions of DA, ILS, MR, and CS on data with the maximum allowable missings.

The procedures seem to find solutions with a higher spearman as rank approaches full rank. The standard deviation of spearman also seems to shrink as rank approaches full rank. However, these trends are not visible for higher levels of n. When rank = 25, ILS and MR in particular tend to find solutions with a low spearman, with a relatively large standard deviation. For this combination, MR and ILS appear to find degenerate solutions more often, whereas DA seems stable as far as finding high-quality solutions is concerned. In case of full rank, it is not clear how the trend


behaves, since only a few global minima were found in these conditions. For all levels of n, CS finds solutions with the highest spearman, and with a small standard deviation.
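Assuming spearman denotes a Spearman rank correlation (its exact definition follows the thesis's Method section), it can be computed as a Pearson correlation on ranks. A stdlib-only sketch; in this toy version ties share the lowest rank rather than the average rank:

```python
def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks.

    Toy version: ties share the lowest rank instead of the average rank.
    """
    def ranks(v):
        order = sorted(v)
        return [order.index(e) + 1 for e in v]

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A value near 1 indicates that the rank order is well recovered; degenerate solutions show up as low values with large spread, as in Figure 4.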

As far as running time is concerned, the same trends were visible for each level of n. In Figure 5, the running times for n = 100 objects are presented for each missing condition. In these conditions, CS takes so long to run that it generally falls outside the range presented here; apparently, the calculation of the CS start is very time consuming. Of the other three procedures, MR generally takes the longest to run, followed by ILS and DA, respectively. For n = 100 in full rank, however, the order is different: there, DA takes the longest to run, followed by MR and ILS, respectively. This makes sense when one considers that in case of missing = 0 and n = 100 in full rank, DA probably performs 100 · 99/2 = 4950 replications, since in full rank there are probably no ties. Even when a proportion of 0.5 · ((100 − 2 − 1)/100)² = 0.47 missings is introduced, DA still probably needs 0.47 · 4950 ≈ 2327 replications, against a hundred replications in MR and at most 448 replications in ILS.
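The pair counts above can be checked directly; this is just the arithmetic from the text, with the maximum-missings proportion taken from the formula the thesis uses:

```python
n = 100
pairs = n * (n - 1) // 2                     # 100 * 99 / 2 = 4950 dissimilarities
missing_prop = 0.5 * ((n - 2 - 1) / n) ** 2  # maximum allowable missing proportion, ~0.47
da_replications = missing_prop * pairs       # ~2327 DA replications, per the text
```

Against roughly 100 replications for MR and at most 448 for ILS, this explains why DA dominates the running time in this condition.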


5. Discussion

In the current study, the seriousness of the local minima problem in nonmetric MDS was studied for different numbers of objects, different numbers of missings, and different numbers of categories. Moreover, two versions each of DA and ILS were proposed to avoid local minima. For DA, a deterministic start was tested against a random start, whereas for ILS, the “Better” acceptance criterion was tested against the “Markov” acceptance criterion. DA with a deterministic start performed better than DA with a random start in terms of the proportion of global minima found. The two versions of ILS performed equally well. The best version of DA and the (combined) ILS did not differ in performance on global, while MR and CS performed significantly better. Moreover, MR and CS seem to fail only incidentally in finding the global minimum.
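The ILS template described above alternates local search and perturbation, accepting a perturbed solution under the “Better” criterion only when it lowers stress. A hedged sketch of that loop; all names are hypothetical, and the thesis's actual stopping rule is simplified to "stop after a fixed number of consecutive rejections" with a safety cap:

```python
def iterated_local_search(x0, local_search, perturb, stress,
                          max_fail=10, max_iter=1000):
    """Sketch of ILS with the "Better" acceptance criterion.

    A perturbed-and-refined candidate replaces the incumbent only if it
    strictly lowers stress; the loop stops after max_fail consecutive
    rejections (max_iter is a safety cap, not part of the thesis setup).
    """
    best = local_search(x0)
    fails = iters = 0
    while fails < max_fail and iters < max_iter:
        iters += 1
        candidate = local_search(perturb(best))
        if stress(candidate) < stress(best):   # "Better" criterion
            best, fails = candidate, 0
        else:
            fails += 1
    return best
```

Under the “Markov” criterion, a worse candidate would occasionally be accepted as well; the thesis found no performance difference between the two variants.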

In terms of the primary outcome variable global, it may be more effective to make a random walk through the solution space (MR) or to start with coordinates based on the data (CS) than to explore the solution space gradually (as in DA) or make a biased walk (as in ILS). Alternatively, the proposed DA and ILS may need some tweaking in order to perform as well as MR and CS, or even better. Some recommendations are proposed later in this section.

Empirical analysis of the secondary variables spearman and time showed that CS and DA were quite reliable in finding high-quality solutions over all conditions, whereas ILS and MR tended to find more degenerate solutions for higher levels of n as rank approached full rank. As far as time is concerned, MR and especially CS generally took more time to find the global minimum than DA or ILS. For n = 100, CS took so long that it is impractical to use in a simulation study.

As far as DA is concerned, several versions of the deterministic start were tested before the main simulation study. In one of these versions, the two objects involved in the most dissimilar pair were placed exactly equally far from the origin (by placing these objects on a circle). The performance of this start did not seem to differ from that of the deterministic start


proposed in this study, and it made the algorithm slower. For future research, the following topics are proposed:

1. Take a closer look at various start configurations to find out what exactly improves the chance of DA to find a global minimum;

2. Study whether it is effective to combine DA and ILS, i.e. to add perturbance to the solution of each replication of DA; and

3. Examine what causes DA to be so robust in finding high-quality global minima.

In the proposed ILS procedure, the algorithm stops if, after √n × √n = n perturbations, no solution has been accepted. This may be too few for n = 10, for which the algorithm would have to accept another solution within 10 perturbations. Concerning perturbation in ILS, only the addition of random error was proposed, without special adaptations for the nonmetric nature of the problem. Some recommendations for future research are:

1. Study whether a different stopping criterion works better and whether this criterion should be adapted to n. Possibly there is no need for a stopping criterion at all and it is best to set the number of replications beforehand. Here, it would be interesting to investigate whether ILS performs as well as MR when the algorithm is run for as long as MR;

2. Examine whether perturbations applied to the TSP, for example the double bridge move, are of value in MDS too. The double bridge move will probably change the rank order of distances, which might be preferable in the search for possible solutions;

3. Investigate whether perturbation by means of adding error could be improved by making sure that it changes rank orders. Objects close to the origin should probably be perturbed less than objects far from the origin, since near the origin only small displacements are needed to change the rank order; and


4. Involve non-monotonic temperature schedules (Hu, Kahng & Tsao, 1995): when it seems that no more improvement is gained by intensification, one could increase the temperature for a limited time (diversification) and then decrease it again (intensification);

5. Study whether it is also effective for ILS to start from a certain nonrandom start.

The study design of the current study has some limitations. First, only error-free data were taken into account. It was decided not to add error to the data so that it was evident what the global minimum was, i.e. a solution with zero stress. However, this is less realistic, since dissimilarity data based on human judgement will certainly contain error. Second, the manner in which data were assigned to partially ranked data did not guarantee that at least one data element was assigned to every category. This means that for rank = 25, there might in some cases be fewer than 25 categories, which might yield a distorted picture of the results. Third, while examining the results of the Monte Carlo simulation study, it seemed that the maximum number of missings was too high for n = 50 and n = 100 to be recovered, especially in full rank. Finally, little emphasis was laid upon the secondary variable spearman, which might be very interesting to investigate in depth. For future research, it is recommended to:

1. Execute the Monte Carlo simulation study in C, since this will be less time consuming than executing it in R;

2. Find a way to assign at least one data element to each category in case of partial rank;

3. Consider a manner to evaluate whether the global minimum is found when dealing with data with error, and test the procedures on data with and without error;

4. Review the formula for the maximum number of missings used in the current study to make sure that global optimization is still possible;

5. Take a closer look at the quality of the solutions for n = 10 in combination with the maximum number of missings, to assess their degree of degeneracy;


6. Examine what causes degenerate solutions in general, and tweak the algorithms so that the global minimum solutions found are as little degenerate as possible.
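The double bridge move, mentioned among the ILS recommendations above, is the classic TSP perturbation: it cuts a tour into four segments A|B|C|D and reconnects them as A|C|B|D. A hedged sketch on an arbitrary permutation; whether this is a useful perturbation for MDS configurations is exactly the open question raised above:

```python
import random

def double_bridge(order):
    """Classic TSP double-bridge move.

    Cut the sequence into four non-empty segments A|B|C|D at three random
    positions and reconnect them as A|C|B|D. The result is always a
    different permutation of the same elements (requires len(order) >= 4).
    """
    n = len(order)
    i, j, k = sorted(random.sample(range(1, n), 3))  # three distinct cut points
    return order[:i] + order[j:k] + order[i:j] + order[k:]
```

Because it swaps whole segments rather than nudging individual entries, such a move is likely to change the rank order of distances, unlike small additive noise.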

Thus, there is still enough to investigate regarding the topic of avoiding local minima in nonmetric multidimensional scaling by means of DA and ILS.


References

Bae, S., Qiu, J. & Fox, G. C., (2010). Multidimensional Scaling by Deterministic Annealing with Iterative Majorization algorithm. Pervasive Technology Institute, School of Informatics and Computing, Indiana University.

Baum, E.B. (1986). Towards practical “neural” computation for combinatorial

optimization problems. In: J. Denker (ed.), Neural Networks for Computing. AIP conference proceedings, pp. 53-64.

Birchfield, S. T., & Subramanya, A. (2005). Microphone array position calibration by basis-point classical multidimensional scaling. IEEE Transactions on Speech and Audio Processing, 13(5), 1025-1034.

Borg, I. & Groenen, P. J. F. (2005). Modern multidimensional scaling: Theory and

applications. New York, NY: Springer.

Borg, I. & Mair, P. (2017). The Choice of Initial Configurations in Multidimensional Scaling: Local Minima, Fit, and Interpretability. Austrian Journal of Statistics, 46, 19-32.

De Leeuw, J. (1977). Applications of convex analysis to multidimensional scaling. In J. R. Barra, F. Brodeau, G. Romier, & B. van Cutsem (Eds.), Recent developments in

statistics (pp. 133-145). Amsterdam, The Netherlands: North-Holland.

De Leeuw, J. (1993). Fitting distances by least squares. Unpublished manuscript.

De Leeuw, J., & Heiser, W. J. (1977). Convergence of the majorization method for multidimensional scaling. Journal of Classification, 5, 163-180.

De Leeuw, J., & Heiser, W. J. (1980). Multidimensional scaling with restrictions on the configuration. In P. R. Krishnaiah (Ed.), Multivariate analysis, V, pp. 501-522. Amsterdam, The Netherlands: North-Holland.


Gold, S. & Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Trans. Pattern Anal. Machine Intell., 18(4), 377-388.

Goldberg, D. E. (1989). Genetic algorithms in search optimization and machine

learning. Addison-Wesley, Reading.

Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325-338.

Groenen, P. J. F. (1993). The majorization approach to multidimensional scaling:

Some problems and extensions. Leiden, The Netherlands: DSWO.

Groenen, P. J. F., & Heiser, W. J. (1996). The tunneling method for global optimization in multidimensional scaling. Psychometrika, 61, 529-550.

Groenen, P. J. F., Heiser, W. J. & Meulman, J. J. (1999). Global optimization in least squares multidimensional scaling by distance smoothing. Journal of

Classification, 16, 529-550.

Groenen, P. J. F., Mathar, R., & Trejos, J. (2000). Global Optimization Methods for Multidimensional Scaling Applied to Mobile Communication. In W. Gaul, O. Opitz, & M. Schader (Eds.), Data analysis (pp. 459-469). Heidelberg: Springer.

Hofmann, T., & Buhmann, J.M. (1997). Pairwise data clustering by deterministic annealing. IEEE Trans Pattern Anal. Machine Intell., 19(1), 1-14.

Hu, T.C., Kahng, A.B. & Tsao, C.-W.A. (1995). Old bachelor acceptance: A new class of non-monotone threshold accepting methods. ORSA Journal on Computing, 7(4), 417-425.

Klock, H., & Buhmann, J. M. (2000). Data visualization by multidimensional scaling: A deterministic annealing approach. Pattern Recognition, 33(4), 651-669.


Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1-27.

Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: A numerical method.

Psychometrika, 29, 115-129.

Lourenço, H. R., Martin, O. C., & Stützle, T. (2010). Iterated local search: Framework and applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (International Series in Operations Research & Management Science, Vol. 146). New York, NY: Springer.

Martin, O., Otto, S.W. & Felten, E.W. (1991). Large-step Markov chains for the traveling salesman problem. Complex Systems 5(3), 299-326.

Martin, O., Otto, S.W. & Felten, E.W. (1992). Large-step Markov chains for the TSP incorporating local search heuristics. Operations Research Letters, 11, 219-224.

Mathar, R. (1997). A Hybrid Global Optimization Algorithm for Multidimensional Scaling. Institut für Statistik, Aachen, Germany.

Microsoft (2017). Visual Studio. URL https://visualstudio.microsoft.com/vs/whatsnew/.

R Core Team (2017). R: A language and environment for statistical computing. R foundation for Statistical Computing, Vienna, Austria.

URL http://www.R-project.org/ .

Rose, K., Gurewitz, E. & Fox, G. (1990). Statistical mechanics and phase transitions in clustering, Phys. Rev. Lett. 65(8), 945-948.

Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. II. Psychometrika, 27, 125-140; 219-246.

Stützle, T. (1998). Applying iterated local search to the permutation flow shop problem. Technical Report AIDA-98-04, FG Intellektik, TU Darmstadt, August.


Taillard, E.D. (1995). Comparison of iterative searches for the quadratic assignment problem. Location Science, 3, 87-105.

Timmer, G. T. (1984). Global optimization: A stochastic approach. Doctoral dissertation. Erasmus University, Rotterdam.

Torgerson, W. S. (1952). Multidimensional Scaling: I. Theory and Method.

Psychometrika, 17(4): 401-419.

Trejos, J., Castillo, W., González, J., & Villalobos, M. (2000). Application of simulated annealing in some multidimensional scaling problems. In H. A. L. Kiers, J. P. Rasson, P. J. F. Groenen, & M. Schader (Eds.), Data Analysis, Classification, and Related Methods (pp. 297-302). Berlin: Springer.

Wagenaar, W. A., & Padmos, P. (1971). Quantitative interpretation of stress in Kruskal's multidimensional scaling technique. British Journal of Mathematical and Statistical Psychology, 24, 101-110.

Young, F. W. & Cliff, N. (1972). Interactive scaling with individual subjects.
