
On the Mathematical Equivalence, Biological Plausibility, & Sparsity Assumptions of Network Models

University of Amsterdam

Thesis - Research Masters Psychology

Author: Joost Kruis

Supervisor: prof. dr. Gunter Maris

Second Assessor: prof. dr. Han L.J. van der Maas


Contents

1: Rasch in the Bayesian Brain - Maris & Kruis, 2015

The first article in this thesis is titled 'Rasch in the Bayesian Brain', and is written by Gunter Maris and myself. In the first part Maris ties together three important, and theoretically very different, mechanisms for (statistically) explaining the (pair-wise) relations between binary variables. Furthermore, Maris demonstrates how the selection bias mechanism employs Bayesian inference, and can be implemented such that a single algorithm can recover, with suitable input, each of 2^n (with n the number of observed variables) different posterior distributions. In the biological part I (Joost Kruis) argue that the selection bias mechanism is commensurate with the sparse structure of anatomical neural networks, and explains the common finding that functional neural networks are dense. The insight here is that node or neuron differentiation is key to reconciling sparse anatomical and dense functional networks.

2: The Alternative Bet on Sparsity - Kruis & Maris, 2015

The second article in this thesis is titled 'The Alternative Bet on Sparsity'. In the article I describe how current methods to estimate network structures require implausible assumptions about the sparsity of the underlying network. Next, I propose a new framework in which to view the concept of sparsity, and introduce three models that each estimate network structures with only a sparse number of parameters. Finally, I demonstrate these methods of sparsity, and discuss our findings.


Rasch in the Bayesian Brain

Gunter Maris

CITO - University of Amsterdam

Joost Kruis

University of Amsterdam

July 3, 2015

Abstract

We demonstrate how three seemingly different statistical models for explaining relations between binary random variables are mathematically equivalent. Of these three, one particular instance is both computationally powerful and biologically plausible.

We set out to demonstrate how the Rasch model (Rasch, 1960) from psychometrics can be implemented in a neural network in a mathematically elegant, computationally powerful, and biologically plausible manner.

The mathematical part of the paper ties together three important, and theoretically very different, mechanisms for (statistically) explaining the (pair-wise) relations between binary variables. The first of these explains the observed relations through an unobserved or latent variable. That is, conditionally on the unobserved variable all variables are independent, such that marginally (with respect to the unobserved variable) they are dependent. As a prototype of such a model we consider the Rasch model, from the field of psychometrics. The second, in a sense, does not explain the relations, but merely represents them. As a prototype of such a model we consider the Curie-Weiss model, from the field of statistical mechanics. The third mechanism explains the relations as arising from conditioning on a common effect of the observed variables (i.e., selection bias). For this mechanism we introduce a new prototype model. We show that all three types of models are statistically equivalent.


In the computational part of the paper we demonstrate how the selection bias mechanism implements Bayesian inference, and can be implemented such that a single algorithm can recover, with suitable input, each of 2^n (with n the number of observed variables) different posterior distributions.

The biological part argues that the selection bias mechanism is commensurate with the sparse structure of anatomical neural networks, and explains the common finding that functional neural networks are dense. The insight here is that node or neuron differentiation is key to reconciling sparse anatomical and dense functional networks.

The neural implementation of the Rasch model not only has amusement value, but sheds light on how a brain is capable of reconstructing the cause from observing its effect, and ties together three important, and theoretically very different, mechanisms for (statistically) explaining the relations between binary variables.

1 Mathematical elegance

The Rasch (1960) model represents the distribution of a set of binary (+/- coded) random variables $X_i$ conditionally on an unobserved latent (random) variable $\Theta$, commonly referred to as ability, as follows:

$$p(x \mid \theta) = \prod_i \frac{\exp(x_i[\theta + b_i])}{\exp(+[\theta + b_i]) + \exp(-[\theta + b_i])}$$
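As a small illustration (our own sketch, not part of the original article; the function name and parameter values are hypothetical), the conditional Rasch probabilities above can be computed directly in R:

## Minimal sketch of the +/-1-coded Rasch probabilities defined above;
## rasch_p() and the parameter values are our own illustrative choices.
rasch_p <- function(x, theta, b) {
  exp(x * (theta + b)) / (exp(theta + b) + exp(-(theta + b)))
}
b <- c(-1, 0, 1)                                     # hypothetical item parameters b_i
rasch_p(x = c(1, 1, 1), theta = 0.5, b = b)          # per-item P(X_i = +1 | theta)
prod(rasch_p(x = c(1, -1, 1), theta = 0.5, b = b))   # joint p(x | theta) for one pattern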

Historically, the model has been developed for modeling the responses of persons to items, with $X_i$ indicating whether or not the response to item i is correct. Latent variables, such as ability in the Rasch model, are arguably both the most powerful and controversial concept in the whole of psychology. Latent variable models have proved to be extremely successful at fitting observed multivariate distributions. At the same time their theoretical and philosophical underpinning remains problematic (Borsboom, Mellenbergh, & van Heerden, 2004). Graphically, the Rasch model can be represented as a Directed Acyclic Graph (DAG) with the latent variable $\Theta$ as the common cause of the observed response variables $X_i$, as illustrated in Figure 1. This is exactly where the problem with the model resides. The cause is never observed, and, similar to physics around the turn of the 20th century, we are in need of . . .


[Figure 1 panels: latent variable; direct interaction; selection bias (E = 1).]

Figure 1: Representation of the Rasch model as a DAG, the Curie-Weiss model as an undirected graph, and the collider selection bias model as a DAG.

. . . an epistemological programme: to rid the theory of unobservables, that is, of magnitudes inaccessible to experimental observation; to rid it, one might say, of metaphysical elements. (Popper, 2005, p. 211)

Acknowledging that $\Theta$ is an unobserved random variable, we can, by endowing it with a distribution, obtain the corresponding manifest probabilities:

$$p(x) = \int_{-\infty}^{\infty} \prod_i \frac{\exp(x_i[\theta + b_i])}{\exp(+[\theta + b_i]) + \exp(-[\theta + b_i])}\, f(\theta)\, d\theta$$

Although we have an expression for the manifest probabilities of the Rasch model, for almost all choices of the distribution of the latent variable this expression is quite intractable.

As a first step towards realizing our goal, we show that the highly problematic Rasch model is equivalent to the highly implausible Curie-Weiss (neural) network model. The Curie-Weiss model comes from statistical physics, where it is the simplest non-trivial model that exhibits a phase transition. It is considered as being mainly of theoretical interest as it violates fundamental principles of statistical physics (Kac, 1968).

The key result for establishing the connection between these two problematic models has been known for a long time and has also been rediscovered quite a few times in quite diverse fields of science (Kac, 1968; Cox & Wermuth, 1994; Olkin & Tate, 1961; McCullagh, 1994; Anderson & Yu, 2007). Despite this, however, the key result is, to say the least, less than well known, which is why we devote some space to the connection.


The Curie-Weiss model is characterized by the following manifest probabilities, where for simplicity we ignore the influence of temperature on the distribution:

$$p(x) = \frac{\exp\left(\sum_i x_i b_i + \left[\sum_i x_i\right]^2\right)}{Z}$$

with Z being the appropriate normalizing factor (often called the partition function) to make the probabilities sum to one. Graphically, the Curie-Weiss model can be represented as a fully connected graph, with all edges equal, as illustrated in Figure 1. As Kac (1968) notes, the model violates fundamental principles of statistical physics as it is dimension independent with interaction energy depending on the size of the system. We should, and will, not bother too much about what any of this means, except that the Curie-Weiss model is for physicists at least as problematic as the Rasch model is for psychologists. In his Brandeis lecture, Mark Kac (1968) established the relation between the Curie-Weiss model and the Rasch model through an ingenious use of the Gaussian integral:

$$\exp(a^2) = \int_{-\infty}^{\infty} \frac{\exp(2a\theta - \theta^2)}{\sqrt{\pi}}\, d\theta$$

What Kac realized is that whenever you see the exponential of a square, you can replace it with the right-hand-side integral. Applying this to the Curie-Weiss model we hence obtain a latent variable representation:

$$p(x) = \int_{-\infty}^{\infty} \frac{\exp\left(\sum_i x_i b_i + 2\left[\sum_i x_i\right]\theta - \theta^2\right)}{\sqrt{\pi}\, Z}\, d\theta$$

The key insight from this reformulation of the Curie-Weiss model is that, in its latent variable form, the square in the exponential is gone, which allows us to rewrite the model further as

$$p(x) = \int_{-\infty}^{\infty} \prod_i \frac{\exp(x_i[b_i + 2\theta])}{\exp(+[b_i + 2\theta]) + \exp(-[b_i + 2\theta])}\; \frac{\prod_i\left[\exp(+[b_i + 2\theta]) + \exp(-[b_i + 2\theta])\right]\exp(-\theta^2)}{\sqrt{\pi}\, Z}\, d\theta \qquad (1)$$

Looking back at the manifest probabilities for the Rasch model we see that the Curie-Weiss model is nothing but a particular marginal Rasch model, with a slightly peculiar distribution for the latent variable.
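To make this equivalence concrete, the following R sketch (our own, with arbitrary example values) checks numerically for a small system that the Curie-Weiss probabilities coincide with the marginal Rasch probabilities implied by equation (1); the cosh factors of the Rasch denominators cancel against the latent density, leaving a simple integrand.

## Numerical sanity check (our own sketch): Curie-Weiss vs. marginal Rasch.
n <- 4
b <- c(-0.5, 0.2, 0.0, 0.7)                            # arbitrary field parameters
X <- as.matrix(expand.grid(rep(list(c(-1, 1)), n)))    # all 2^n states

## Curie-Weiss: p(x) proportional to exp(x'b + (sum_i x_i)^2)
cw <- exp(X %*% b + rowSums(X)^2)
cw <- cw / sum(cw)

## Marginal Rasch: integrate the Rasch kernel against the latent density; after
## the cosh factors cancel, the integrand is exp(sum_i x_i b_i + 2 theta sum_i x_i - theta^2)
marg <- apply(X, 1, function(x)
  integrate(function(theta) exp(sum(x * b) + 2 * sum(x) * theta - theta^2),
            -Inf, Inf)$value)
marg <- marg / sum(marg)

max(abs(cw - marg))   # should be numerically zero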


For the second step towards realizing our first goal, we consider a set of independent Bernoulli random variables $C_i$ (which we collectively call the cause) together with a single (0/1 coded) dependent binary random variable E (which we call the effect):

$$p(c, e) = \left(\frac{\exp\left(\left[\sum_i c_i\right]^2\right)}{\sup_c \exp\left(\left[\sum_i c_i\right]^2\right)}\right)^{e}\left(1 - \frac{\exp\left(\left[\sum_i c_i\right]^2\right)}{\sup_c \exp\left(\left[\sum_i c_i\right]^2\right)}\right)^{1-e} \prod_i \frac{\exp(b_i c_i)}{\exp(b_i) + \exp(-b_i)} \qquad (2)$$

This distribution can be graphically represented in a DAG and is known in the literature on causality as a collider, or common effect, structure, as illustrated in Figure 1.

It is known from the literature on causality that selection with respect to a collider variable introduces (spurious) correlation among the causes (Heckman, 1979; Greenland, Pearl, & Robins, 1999). We can use this mechanism to relate this particular collider structure to the Curie-Weiss network and hence to the Rasch model. Selection with respect to a collider variable is mathematically represented in the distribution of the causes conditionally on the effect. The specific collider structure (i.e., the distribution of the effect conditionally on the causes) was chosen such that the distribution of the cause conditionally on the effect exactly gives the Curie-Weiss model from statistical physics:

$$p(C = c \mid E = 1) = \frac{\exp\left(c^{\mathsf{T}} b + \left[\sum_i c_i\right]^2\right)}{Z}$$

Formally, we may consider this representation as implementing Bayesian statistical inference, with C acting as a prior distribution on the causes (interpreted here as binary parameters), and $E \mid c$ as the probabilistic model relating the observation (i.e., E) to the parameters (i.e., C), such that $C \mid E = 1$ gives the posterior distribution of the parameters given the observation.

Having studied the three forms of statistical explanation in their simplest non-trivial form, we can swiftly generalize all three to realistic models. We start from the full Ising model, of which the Curie-Weiss model is the simplest form:

$$p(x) = \frac{\exp\left(x^{\mathsf{T}} b + x^{\mathsf{T}} \Sigma x\right)}{Z}$$


where $\sigma_{ij}$ modulates the correlation between $x_i$ and $x_j$.

Using the eigenvalue decomposition of the connectivity matrix $\Sigma$, and the observation that the diagonal values of $\Sigma$ are arbitrary (i.e., do not change the probabilities), we have that

$$\Sigma + K I = \sum_j \lambda_j a_j a_j^{\mathsf{T}}$$

such that we may rewrite the Ising model as:

$$p(x) = \frac{\exp\left(x^{\mathsf{T}} b + \sum_j \lambda_j\left[\sum_i a_{ij} x_i\right]^2\right)}{Z}$$

Through judicious choice of K we can ensure that all eigenvalues are non-negative, and hence we may use the Mark Kac trick for each value of j. This gives a latent variable representation of the full Ising model (Marsman, Maris, Bechger, & Glas, 2015) with as many latent dimensions as there are non-zero eigenvalues. The particular latent variable model can be shown to be the multidimensional two-parameter logistic model (Reckase, 2009), of which the Rasch model is the simplest instance. Similarly, we can introduce (independent) effect variables for each eigenvalue:

$$p(E_j = 1 \mid c) = \frac{\exp\left(\lambda_j\left[\sum_i a_{ij} c_i\right]^2\right)}{\sup_c \exp\left(\lambda_j\left[\sum_i a_{ij} c_i\right]^2\right)}$$

such that $(C \mid E = 1) \sim X$.
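As a sketch of the diagonal-shifting argument (our own illustration with an arbitrary random connectivity matrix), the R code below chooses the smallest K that makes all eigenvalues non-negative and verifies the decomposition above.

## Our own sketch: shift the arbitrary diagonal of Sigma so that all
## eigenvalues are non-negative, then read off the lambda_j and a_j of the text.
set.seed(1)
n <- 6
Sigma <- matrix(rnorm(n^2), n, n)
Sigma <- (Sigma + t(Sigma)) / 2        # symmetric connectivity matrix
diag(Sigma) <- 0

K   <- max(0, -min(eigen(Sigma, symmetric = TRUE)$values))  # smallest admissible shift
eig <- eigen(Sigma + K * diag(n), symmetric = TRUE)
lambda <- eig$values                   # all >= 0 (up to rounding error)
A      <- eig$vectors                  # columns are the a_j

## reconstruction check: Sigma + K*I = sum_j lambda_j a_j a_j'
max(abs((Sigma + K * diag(n)) - A %*% diag(lambda) %*% t(A)))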

We conclude that mathematically and statistically, explanations of observed correlations through marginalization with respect to latent variables, through direct interactions between variables, and through conditioning on common effect variables are indistinguishable. Although the three forms of statistical explanation are observationally equivalent, their theoretical interpretation is radically different.

2 Computational strength

Alexander Pope: Lulled in the countless chambers of the brain, our thoughts are linked by many a hidden chain; awake but one, and lo, what myriads rise!


We demonstrate how the common effect formulation of the Rasch model provides us with a powerful computational tool. For every subset S of the effect variables, we find that $(C \mid E_S = 1_S)$ is a different Ising network model. Hence, if we consider n effect variables and n independent causes, we can represent $2^n$ different Ising networks using only $n^2$ edges. Although the Ising networks are different, they are not unrelated, as they all involve subsets of the eigenvalues and eigenvectors of the complete Ising network $(C \mid E = 1)$. It is important to observe that $E_S = 1_S$ does not imply that $E_{S^c} = 0_{S^c}$; the $E_{S^c}$ are integrated out.

We use this fact by first setting up a Markov chain for which the joint distribution of causes and effects (C, E) is the invariant distribution. As a simple example we consider a Gibbs sampler which in turn updates the state of each of the cause and effect variables conditionally on the current state of all other variables. In such a Markov chain, the marginal probability of observing an effect is the following:

$$p(E_j = 1) = \frac{\mathbb{E}\left(\exp\left(\lambda_j\left[\sum_i a_{ij} C_i\right]^2\right)\right)}{\sup_c \exp\left(\lambda_j\left[\sum_i a_{ij} c_i\right]^2\right)}$$

which will tend to be arbitrarily close to zero for almost any system of non-trivial size, and decreases with increasing $\lambda_j$. That is, we can run this Markov chain, basically, until the sun burns out without ever once observing the effect. This may seem to be a problem and a major disadvantage, but it actually is an important strength of the neural network. Suppose that some outside force (i.e., outside of the neural network) impacts selectively on the effect node. Whilst running our Markov chain, every time the effect node gets selected, its value is put to one. The effect of this outside impact on the system is such that the Markov chain starts to converge to the Ising network model $C \mid E_j = 1$. That is, by forcing the effect to occur, the network automatically recovers the source.

Hence, this network model is capable of inferring the source from observing the effect, which is a computationally very powerful mechanism. From a Bayesian point of view, the neural network samples from the prior distribution of the causes until the effect is observed, after which it converges to sampling from the posterior distribution of the causes conditionally on the effect. The precision with which the source is reconstructed depends entirely on the value of $\lambda_j$.
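The following R sketch (our own minimal implementation, for the single-effect Curie-Weiss collider rather than the general multi-effect case) illustrates these dynamics: with the effect node left free the chain essentially samples the prior over the causes, while clamping the effect to one makes it converge to the Curie-Weiss posterior.

## Sketch (ours) of the Gibbs dynamics for a single-effect collider network.
## Causes are +/-1 coded, the effect is 0/1 coded, and 'clamp' mimics the
## outside force that sets the effect to one whenever it is updated.
gibbs_collider <- function(b, n_iter = 5000, clamp = FALSE) {
  n <- length(b)
  c_state <- sample(c(-1, 1), n, replace = TRUE)
  out <- matrix(NA, n_iter, n)
  for (t in seq_len(n_iter)) {
    ## effect node: p(E = 1 | c) = exp((sum c)^2 - n^2)
    e_state <- if (clamp) 1 else rbinom(1, 1, exp(sum(c_state)^2 - n^2))
    ## cause nodes, conditionally on the other causes and the effect
    for (i in seq_len(n)) {
      s <- sum(c_state[-i])
      w <- if (e_state == 1) {
        exp(b[i] * c(-1, 1) + (c(-1, 1) + s)^2)
      } else {
        exp(b[i] * c(-1, 1)) * (1 - exp((c(-1, 1) + s)^2 - n^2))
      }
      c_state[i] <- sample(c(-1, 1), 1, prob = w / sum(w))
    }
    out[t, ] <- c_state
  }
  out
}

## With clamp = TRUE the chain samples from the Curie-Weiss posterior over the
## causes; with clamp = FALSE the effect is (almost) never observed.
samples <- gibbs_collider(b = rep(0.1, 5), clamp = TRUE)
colMeans(samples)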


We see that a common effect network with n independent causes and m (mutually) independent effects is capable of representing $2^m$ distinct posterior networks with (at most) $nm$ edges. This result is closely related to the way in which bipartite graphs are capable of projecting dense network structures (Zhou, Ren, Medo, & Zhang, 2007).

3 Biological plausibility

Latent variable models have been shown to be very successful in describing variation in observed behavior as a function of a common cause (e.g., ability in IRT models). Nonetheless, the validity of the theoretical concept of a latent variable is heavily disputed (Borsboom et al., 2004), in large part because these latent causes are never directly observed in the human brain.

The Curie-Weiss model has somewhat more merit as an explanation for observed behavior, since it does not invoke this latent variable. However, its characteristic fully connected network structure, with all edges equal, is obviously not an accurate representation of the human brain. More generally, dense network structures of brain activity are only plausible when looking at functional network structures, i.e., structures based on the observed statistical dependencies between nodes. And even though in these functional networks of the brain there can be unconnected nodes, because of marginal independence, the functional connections that are found often exist between anatomically unconnected neurons (Rubinov & Sporns, 2010).

In the anatomical brain network the network structure is based on the connections that are physically present between neurons in the brain. In contrast to functional brain networks, which occupy a virtual (and therefore unlimited) space, the anatomical brain is contained in a limited space, which constrains the number and density of neurons that can exist in this space. As a result, the organization of the anatomical brain network is a trade-off between the minimization of the material and metabolic costs that come with wiring and running the brain, and preserving the required functionality (Bullmore & Sporns, 2012). This is partially achieved by the exhibition of modularity in the brain, i.e., dense connectivity within a module but sparse connectivity between modules (Park & Friston, 2013). In contrast to the Rasch model, which involves an unobserved variable, and the Curie-Weiss model, which involves $n^2$ edges for a network of n nodes, the collider neural network is sparse, involving only $n - 1$ edges for n nodes, which is in correspondence with the anatomical structure of the brain.

Furthermore, the modus operandi of a collider neural network is in line with the Bayesian coding hypothesis, a theory on how information is probabilistically represented by the brain. In Bayesian coding, the function of the brain is to infer the causes of sensory inputs (effects). Carrying out this function in an 'optimal' way requires the representation and handling of uncertainty, since the inference takes place from often noisy and ambiguous input. The nervous system accomplishes this by encoding probabilistic models and updating these by neural processing of sensory information using Bayesian inference. Through the process of updating these probabilistic representations, the brain is able to infer what caused its sensory input (Colombo & Sériès, 2012; Friston, 2003, 2005; Knill & Pouget, 2004; Tenenbaum, Kemp, Griffiths, & Goodman, 2011).

In Bayesian coding, the neuronal network is represented as a probabilistic generative model of incoming sensory input:

$$p(E, \gamma) = \int p(E \mid C, \gamma)\, p(C, \gamma)\, dC$$

This marginal distribution is specified with a prior distribution over the causes $p(C, \gamma)$, and the generative distribution or likelihood of the inputs given the causes $p(E \mid C, \gamma)$. Here $\gamma$ represents the unknown, fixed parameters of the generative model, which correspond to connection strengths in the brain's model of how inputs are caused. These parameters are learned by matching the marginal distribution to the actual input $p(E)$ until the densities of the predicted and observed input match as closely as possible (Friston, 2003). When the prediction error is minimized by the generative model, the causes of the sensory input are inferred by sampling from the posterior density of the causes given the sensory input:

$$p(C \mid E, \gamma) = \frac{p(E \mid C, \gamma)\, p(C, \gamma)}{p(E, \gamma)}$$

It can thus be seen that the common effect formulation of the Rasch model implements the theory of Bayesian coding in the brain. That several psychophysical experiments have revealed that the nervous system indeed performs probabilistic inference, and does so in a nearly optimal way, in a variety of tasks, is compelling evidence in favor of the Bayesian coding hypothesis (Knill & Pouget, 2004; Ernst & Banks, 2002; Körding & Wolpert, 2004; Weiss, Simoncelli, & Adelson, 2002). However, a challenge in Bayesian coding has been to determine how such computations are implemented at the neural level.

In other words, how can the brain actually start sampling from the posterior distribution of the causes given the effect? As discussed in section two, for these kinds of non-trivial dynamics we can use a Markov chain which has the joint distribution of causes and effect as its invariant distribution. In this case the invariant joint distribution has the following form:

$$p(C, E) = p(E \mid C) \prod_i p(C_i)$$

We get this distribution by implementing the Markov chain

$$p(C^{t+1}, E^{t+1}) = \int p(C^{t+1}, E^{t+1} \mid C^{t}, E^{t})\, p(C^{t}, E^{t})\, dC^{t}\, dE^{t}$$

which converges in the limit to the Curie-Weiss network corresponding to the common effect network:

$$p(C \mid E = 1) = \frac{\exp\left(\left[\sum_i a_i C_i\right]^2\right)}{Z}$$

The neural implementation of this process is captured in the architecture and dynamics of the transition kernel $p(C^{t+1}, E^{t+1} \mid C^{t}, E^{t})$ of the Markov chain. This operation might possibly be approximated with a univariate translation of the theory of cue integration, represented by a network of spiking integrate-and-fire neurons. Cue integration postulates that the magnitude of the reaction from multiple sensory neurons caused by multiple sensory stimuli is the sum of the individual reactions from each sensory neuron to the appropriate individual stimuli (Gold & Shadlen, 2001; Knill & Pouget, 2004; Deneve, 2008; Ma, Beck, & Pouget, 2008; Gerwinn, Macke, & Bethge, 2009).

In the univariate case, we can thus present the likelihood of the Markov chain as a single integrate-and-fire neuron that accumulates the activity of the causes over time, and spikes when the effect occurs (E = 1), thereby broadcasting the sum of the accumulated activity in the causes ($a^{\mathsf{T}}c$). Single neurons that exclusively spike after a sensory stimulus have already been detected in several brain regions of animals, for example in the auditory cortex of rats, and the olfactory system of insects (Perez-Orive et al., 2002; DeWeese, Wehr, & Zador, 2003). This behavior is referred to as 'binary coding', as the neuron appears to produce either a 0 or a 1 in response to a stimulus, whereas the probability of spiking overall is very low (Olshausen & Field, 2004).

4 Discussion

In this paper, we demonstrated that mathematically and statistically, explanations of observed correlations through marginalization with respect to latent variables, through direct interactions between variables, and through conditioning on common effect variables are indistinguishable. The common effect formulation of the Rasch model results in a network model that is capable of inferring the source from observing the effect. Not only is this a computationally very powerful mechanism, it is also a biologically plausible model, because it is, like the brain, sparse, and its functionality is comparable to that proposed by the popular and well-supported theory of predictive coding.

The statistical equivalence of the Rasch model, the Curie-Weiss network, and the common effect model has important advantages. In their different fields of application, different aspects of these models have been studied and different methodology has been developed. Through their connection, many of these developments become available to all fields of application.

References

Anderson, C. J., & Yu, H.-T. (2007). Log-multiplicative association models as item response models. Psychometrika, 72(1), 5–23.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111(4), 1061.

Bullmore, E., & Sporns, O. (2012). The economy of brain network organization. Nature Reviews Neuroscience, 13, 336–349.

Colombo, M., & Sériès, P. (2012). Bayes in the brain – on Bayesian modelling in neuroscience. The British Journal for the Philosophy of Science, axr043.

Cox, D. R., & Wermuth, N. (1994). A note on the quadratic exponential binary distribution. Biometrika, 81(2), 403–408.

Deneve, S. (2008). Bayesian spiking neurons I: Inference. Neural Computation, 20(1), 91–117.

DeWeese, M. R., Wehr, M., & Zador, A. M. (2003). Binary spiking in auditory cortex. The Journal of Neuroscience, 23(21), 7940–7949.

Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870), 429–433.

Friston, K. (2003). Learning and inference in the brain. Neural Networks, 16(9), 1325–1352.

Friston, K. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456), 815–836.

Gerwinn, S., Macke, J., & Bethge, M. (2009). Bayesian population decoding of spiking neurons. Frontiers in Computational Neuroscience, 3.

Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Sciences, 5(1), 10–16.

Greenland, S., Pearl, J., & Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, 37–48.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica: Journal of the Econometric Society, 153–161.

Kac, M. (1968). Statistical physics: Phase transitions and superfluidity. In M. Chretien, E. Gross, & S. Deser (Eds.), Brandeis University Summer Institute in Theoretical Physics (Vol. 1, pp. 241–305). New York: Gordon and Breach Science Publishers.

Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends in Neurosciences, 27(12), 712–719.

Körding, K. P., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427(6971), 244–247.

Ma, W. J., Beck, J. M., & Pouget, A. (2008). Spiking networks for Bayesian inference and choice. Current Opinion in Neurobiology, 18(2), 217–222.

Marsman, M., Maris, G., Bechger, T., & Glas, C. (2015). Bayesian inference for low-rank Ising networks. Scientific Reports, 5.

McCullagh, P. (1994). Exponential mixtures and quadratic exponential families. Biometrika, 81(4), 721–729.

Olkin, I., & Tate, R. F. (1961). Multivariate correlation models with mixed discrete and continuous variables. The Annals of Mathematical Statistics, 448–465.

Olshausen, B. A., & Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4), 481–487.

Park, H.-J., & Friston, K. (2013). Structural and functional brain networks: From connections to cognition. Science, 342, 579–587.

Perez-Orive, J., Mazor, O., Turner, G. C., Cassenaer, S., Wilson, R. I., & Laurent, G. (2002). Oscillations and sparsening of odor representations in the mushroom body. Science, 297(5580), 359–365.

Popper, K. (2005). The logic of scientific discovery. Routledge.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research. (Expanded edition, 1980. Chicago: The University of Chicago Press.)

Reckase, M. (2009). Multidimensional item response theory. Springer.

Rubinov, M., & Sporns, O. (2010). Complex network measures of brain connectivity: Uses and interpretations. NeuroImage, 52, 1059–1069.

Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279–1285.

Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal percepts. Nature Neuroscience, 5(6), 598–604.

Zhou, T., Ren, J., Medo, M., & Zhang, Y.-C. (2007). Bipartite network projection and personal recommendation. Physical Review E, 76(4), 046115.


The Alternative Bet on Sparsity

Three methods for constructing networks with only N parameters

Joost Kruis

University of Amsterdam

Gunter Maris

CITO - University of Amsterdam

July 3, 2015


Abstract

Current efforts to estimate network models must adhere to the 'bet on sparsity' principle, that is, assume that the underlying network structure is sparse. We propose an alternative 'bet on sparsity', namely that of sparsity in the number of network parameters to be estimated. We present and demonstrate three different methods for this kind of sparsity in the context of the Ising network model. Furthermore, we show that estimation procedures forcing a network to have a sparse structure can still recover the original data structure by inflating the influence of main effects, even if the true network structure is dense.


The ‘Bet on Sparsity’ Principle

Current efforts to estimate network models are subject to what was termed the 'bet on sparsity' principle by Hastie, Tibshirani, and Friedman (2001).

Use a procedure that does well in sparse problems, since no procedure does well in dense problems.

From the 'bet on sparsity' principle it follows that network structures resulting from these estimation procedures are only plausible if the true model in the population is also sparse. Hence, when estimating network models the assumption of a sparse underlying network must be made. This assumption is at least problematic in some cases, as aptly described by Gelman (2013).

How do we think about the 'bet on sparsity' principle in a world where the truth is dense? I'm thinking here of social science, where no effects are clean and no coefficient is zero.

The current standing leaves us at an impasse: present methods cannot reliably estimate dense network structures, yet the assumption that all network structures are sparse is implausible. In this paper an alternative 'bet on sparsity' is proposed, namely that of sparsity in the number of parameters to be estimated. Taking the Ising network model as a starting point, three methods are proposed that require only a sparse number of parameters to be estimated for construction of the network.

The Ising model (Ising, 1925) is the prototypical form of a pairwise Markov random field (PMRF; Besag, 1974; Lauritzen, 1996) when data are binary ($X \in \{-1, 1\}$), and captures all main effects and pairwise interactions. The structure of the network consists of nodes (V) that represent the random variables, and edges (E) that represent the associations between these variables. Because a PMRF is undirected, a full Ising model for N random variables requires the estimation of N threshold parameters ($\tau_i$; main effects), and $N(N-1)/2$ network parameters ($\omega_{ij}$; pairwise interactions). The Ising model belongs to the exponential family of distributions: the item scores are the sufficient statistic for the magnetic field (thresholds), and the number of identical responses for a pair of items is the sufficient statistic for the pairwise interactions.

The Ising model is characterized by the following distribution:

$$P(X = x) = \frac{1}{Z} \exp\left(\sum_i \tau_i x_i + \sum_{<ij>} \omega_{ij} x_i x_j\right), \qquad Z = \sum_x \exp\left(\sum_i \tau_i x_i + \sum_{<ij>} \omega_{ij} x_i x_j\right)$$

where $\sum_{<ij>}$ takes the sum over all distinct pairs of nodes i and j (j > i), and Z is the normalizing constant, or partition function, that sums over all possible network states to make the probabilities sum to one. The log-likelihood expression for the Ising model is:

$$\mathcal{L}(\tau, \Omega; x) = \sum_i \tau_i x_i + \sum_{<ij>} \omega_{ij} x_i x_j - \ln \sum_x \exp\left(\sum_i \tau_i x_i + \sum_{<ij>} \omega_{ij} x_i x_j\right)$$

It can easily be seen that Z quickly becomes intractable when N goes up. For this reason the pseudo-log-likelihood is often used to estimate the Ising model (Besag, 1974):

$$\mathcal{L}_i(\tau, \Omega; x) = x_i\left(\tau_i + \sum_j \omega_{ij} x_j\right) - \ln \sum_{x_i} \exp\left(x_i\left(\tau_i + \sum_j \omega_{ij} x_j\right)\right)$$

The univariate conditional log-likelihoods are optimized individually, and their sum constitutes the pseudo-log-likelihood that approximates the full log-likelihood of the Ising model:

$$\ln \mathrm{PL} = \sum_{i=1}^{k} \mathcal{L}_i(\tau, \Omega; x)$$

Using a pseudo-log-likelihood approach removes the intractable normalizing constant. However, since the complexity of the model grows when the number of random variables goes up, the number of observations required to estimate the model also becomes large. When the number of observations is limited, regularization must be applied, and the optimization happens over a penalized version of the pseudo-log-likelihood (Epskamp, Maris, Waldorp, & Borsboom, in press; van Borkulo et al., 2014).
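To make the pseudo-likelihood construction concrete, the R sketch below (our own, unpenalized version; IsingFit and related packages add the regularization and model selection on top of this) exploits the fact that, with ±1 coding, the conditional of node i is logistic with intercept 2τ_i and slopes 2ω_ij, so ordinary nodewise logistic regressions recover the parameters.

## Our own unpenalized sketch of nodewise pseudo-likelihood estimation.
pseudo_ising <- function(X) {            # X: n_obs x N matrix coded -1/+1
  N <- ncol(X)
  tau <- numeric(N)
  omega <- matrix(0, N, N)
  for (i in seq_len(N)) {
    ## logistic regression of node i on all other nodes (0/1 response)
    fit <- glm((X[, i] + 1) / 2 ~ X[, -i], family = binomial())
    tau[i]       <- coef(fit)[1] / 2     # intercept equals 2 * tau_i
    omega[i, -i] <- coef(fit)[-1] / 2    # slopes equal 2 * omega_ij
  }
  omega <- (omega + t(omega)) / 2        # average the two estimates of each edge
  list(tau = tau, omega = omega)
}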

Methods of Sparsity

The most common regularization method is $\ell_1$ regularization, also known as the least absolute shrinkage and selection operator (lasso; Tibshirani, 1996). In $\ell_1$ regularization the sum of absolute parameter values of $\omega_{ij}$ is penalized to be under some value, which causes parameter estimates to shrink to exactly zero, and the optimization happens over this penalized version of the pseudo-likelihood.

For the 'alternative bet on sparsity', instead of betting on the sparsity of the true network, the focus here is on the sparsity of the estimation procedure itself. In the remainder of this section three methods of sparsity are presented, in which the pairwise associations between the random variables in the network structure are captured by estimating at most N parameters.

The first method of sparsity is the aforementioned $\ell_1$ regularization, where additionally the number of pairwise associations that are not shrunk to zero is actively constrained to be N:

$$\ell_1: \quad \omega_{ij} \neq 0 \text{ for at most } N \text{ pairs}$$

The second method of sparsity is based on a low-rank approximation to the full connectivity matrix (A) of the Ising model (Marsman, Maris, Bechger, & Glas, 2015). The best approximation of rank r to A uses a latent variable representation of the Ising model (Kac, 1968), and consists of the eigenvalue decomposition in which all but the largest r eigenvalues are equated to zero. As the low-rank approach requires $N \times r$ parameter estimates for the full network, the second method of sparsity uses a rank-1, or, in line with the first method, r1 approach for estimating the network structure:

$$r_1: \quad \omega_{ij} = a_i \times a_j$$

The third method of sparsity originates from the Interaction model (Haberman, 2007), an item response theory model that accounts for interactions between items and the total score, such that the response pattern and person ability are conditionally independent given the sum score. These interactions between the items and the total score are modeled with an additive interaction term ($\delta$) estimated for each item. In the third method of sparsity, the pairwise associations between two random variables (items) in a network are determined by the sum of the interaction parameter estimates of those variables. As the additivity of the parameters is fundamental in this method, and the interaction uses a univariate approach, we termed it the a1 method:

$$a_1: \quad \omega_{ij} = \delta_i + \delta_j + 1$$

The three methods of sparsity we propose are linked to the estimation of the pairwise interactions between the nodes in a network. In addition, each method also estimates N parameters for the main effects, i.e., the threshold parameters in the $\ell_1$ method, and the item easiness parameters in the r1 and a1 approaches.
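As a small illustration (ours; the parameter values are arbitrary, and the a1 expression is mirrored exactly as displayed above), the R code below shows how the r1 and a1 parameterizations expand N estimated parameters into a full N × N interaction matrix.

## Sketch (ours): building the full interaction matrix from N parameters.
N <- 25
a     <- rnorm(N, 0, 0.3)     # hypothetical r1 discrimination parameters a_i
delta <- rnorm(N, 0, 0.1)     # hypothetical a1 interaction parameters delta_i

omega_r1 <- outer(a, a)                    # omega_ij = a_i * a_j
omega_a1 <- outer(delta, delta, "+") + 1   # omega_ij = delta_i + delta_j + 1
diag(omega_r1) <- diag(omega_a1) <- 0      # no self-interactions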

The three proposed methods of sparsity all have a different inferential framework, and find their optimal configuration using a pseudo-, marginal-, and conditional maximum likelihood approach, respectively. As such, comparison of their respective performance is a very interesting, but not trivial, matter. We cannot simply perform a likelihood-ratio test to compare the solutions proposed by each method. Therefore, we suggest comparing the performance of each method by assessing the extent to which new data, generated with the solution of each method, are able to replicate the structure of the initial data. That is, correspondence in the distribution of the item and sum scores, and similarity in the number of identical responses for all item pairs in the data.

As the item scores and the number of identical responses for all item pairs are sufficient statistics in the Ising model, we expect excellent recovery of those in the case of an unconstrained model. We also look at the recovery of the sum-score distribution, as this is often of interest to the researcher. A physicist might ask, for example, why the entire state of the Ising model is +. The psychologist can be concerned with how a set of depression symptoms is distributed in the population. In an educational setting we are often interested in how the sum score is distributed within our sample. The question is thus whether we can explain the sum-score distribution with just the pairwise interactions and thresholds.


Results

Figure 1: Network structure under which the initial data was generated.

Simulation. To illustrate the three methods of sparsity we generated data under the network structure shown in Figure 1 and compared the ability of each method to recover this initial network structure. In Figure 1 the numbered circles represent the nodes, and the colored lines between two nodes represent their respective edge. A green edge represents a positive association between two nodes, whereas a red edge illustrates a negative association. The strength of these associations is captured in the thickness of each edge: the thicker the edge, the stronger the association between two nodes. In Figure 2 the estimated network structure is displayed for all three methods of sparsity. It is readily seen that the a1 method shows excellent recovery of the initial network structure. The r1 method also captures the underlying network structure in large part; only the strong negative edges on the left are not recovered. The reason for this is that in the r1 method the direction (positive/negative) and strength of an edge is established by the product of the estimated a parameters for the respective nodes. This means that a negative edge can only be obtained when a is positive for one node and negative for the other. As such, when three adjoining nodes have a negative association with each other in the true network structure, the r1 method will not be able to recover this. In fact, the maximum number of negative associations the r1 method can recover is

$$\frac{N}{2}\cdot\left(N - \frac{N}{2}\right)$$

The $\ell_1$ method recovers (as expected) only part of the network, starting with the strongest edge and adding existing edges based on strength and connectivity with the already selected edges. If we inspect the correspondence of score distributions and pairwise similarity (Figure 3), we find something interesting. Although the $\ell_1$ method only recovered part of the network, the structure of the data generated under this model is highly similar to that of the initial data.

[Figure 2: estimated network structures for (a) the $\ell_1$ method, (b) the r1 method, and (c) the a1 method.]

Figure 3: Correspondence between initial and generated data; panels (a) ECDF sum-score, (b) ECDF item-score, (c) pairwise similarity.

After some investigation we determined that this was because the threshold parameters alone were able to explain a large amount of the variation in the response patterns. We show this by generating new data under each method where all pairwise connections were set to zero and only the estimated threshold parameters were used. The resulting data structure is displayed in Figure 4 and shows that the $\ell_1$ method was able to recover the data structure largely with the threshold parameters alone.

Figure 4: Correspondence between initial and generated data, using only thresholds; panels (a) ECDF sum-score, (b) ECDF item-score, (c) pairwise similarity.

Further examination showed that the size of the threshold parameters, relative to the strength of the pairwise interactions, determines how well the $\ell_1$ method works if the underlying network structure is dense. When the value of the thresholds is much greater than that of the pairwise interactions, the $\ell_1$ method can recover a lot of the structure with just thresholds. However, as the size of the thresholds becomes closer to that of the pairwise interactions, the performance of the $\ell_1$ method begins to decline rapidly, especially for stronger positive pairwise associations. In contrast, both the a1 and r1 methods recover the network and data structure very well, regardless of the threshold range. This is illustrated in Figure 5, where the recovery of each method is displayed for different sizes of the threshold parameters.

[Figure 5: recovery of each method for threshold ranges (a) $\tau \in \{-2, 2\}$, (b) $\tau \in \{-1, 1\}$, (c) $\tau \in \{-0.5, 0.5\}$, (d) $\tau \in \{-0.2, 0.2\}$.]

Data Example. To investigate which type of sparsity we actually encounter in the population, we consider a test for English language skill that was administered to 2842 students in the last year of secondary education in the Netherlands. Among the questions in the test were 32 binary-scored items, which we used for this analysis. Figure 6 displays the score distributions for each method and shows that while the r1 method approximates the original data best, both the a1 and $\ell_1$ methods are also able to recover the structure of the data quite well.

[Figure 6: correspondence between the empirical and generated data; panels (a) ECDF sum-score, (b) ECDF item-score, (c) pairwise similarity.]

Discussion

We have shown how the structure of a network model can be constructed with three different methods of sparsity, each estimating only N parameters for the pairwise associations, and an equal number for the thresholds, thereby effectively eliminating the need to obtain large amounts of data to estimate a full network. In addition, the a1 and r1 methods do not require the assumption that the true underlying network structure is sparse, and with that open the door for estimation of dense network structures.

Furthermore, estimation procedures that force a network to have a sparse structure are still able to recover the structure of the original data well when threshold parameters are relatively large with respect to pairwise associations. This is an inherent vice of such methods: even when a network structure resulting from these methods is able to generate the original data, we can never be sure whether the underlying network is indeed sparse, or whether the thresholds simply have a greater influence compared to the pairwise associations.

The 'alternative bet on sparsity' is intended as a stepping stone towards a more realistic approach to modeling network structures, and opens the way to new lines of research. For example, given that we do not know the true underlying structure of a network, how can we determine whether it is actually sparse or dense if all three methods of sparsity recover the data well?


References

Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 192–236.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 395–479). Reading, MA: Addison-Wesley.

Epskamp, S. (2015). IsingSampler: Sampling methods and distribution functions for the Ising model [Computer software manual].

Epskamp, S., Maris, G., Waldorp, L. J., & Borsboom, D. (in press). Network psychometrics. In P. Irwing, D. Hughes, & T. Booth (Eds.), Handbook of psychometrics. New York: Wiley.

Gelman, A. (2013, December). Whither the "bet on sparsity principle" in a non-sparse world? http://andrewgelman.com/2013/12/16/whither-the-bet-on-sparsity-principle-in-a-nonsparse-world.

Haberman, S. (2007). The interaction model. In Multivariate and mixture distribution Rasch models (pp. 201–216). New York: Springer.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer.

Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik A Hadrons and Nuclei, 31(1), 253–258.

Kac, M. (1968). Statistical physics: Phase transitions and superfluidity. In M. Chretien, E. Gross, & S. Deser (Eds.), Brandeis University Summer Institute in Theoretical Physics (Vol. 1, pp. 241–305). New York: Gordon and Breach Science Publishers.

Lauritzen, S. L. (1996). Graphical models. Oxford University Press.

Marsman, M., Maris, G., Bechger, T., & Glas, C. (2015). Bayesian inference for low-rank Ising networks. Scientific Reports, 5.

R Core Team. (2014). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org/

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Danmarks Paedagogiske Institut.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58, 267–288.

van Borkulo, C. D., Borsboom, D., Epskamp, S., Blanken, T. F., Boschloo, L., Schoevers, R. A., & Waldorp, L. J. (2014). A new method for constructing networks from binary data. Scientific Reports, 4.


Appendix

Models

Pairwise Markov Random Field

A pairwise Markov random field (PMRF) is an undirected network with the pairwise Markov property: if there is no edge between two nodes, they are conditionally independent given all other nodes:

$$X_i \perp\!\!\!\perp X_j \mid X_{-(i,j)} = x_{-(i,j)} \iff \{i, j\} \notin E$$

The distribution of X for a PMRF can be parameterized as follows:

$$P(X = x) = \frac{1}{Z} \prod_i \phi_i(x_i) \prod_{<ij>} \phi_{ij}(x_i, x_j)$$

with Z as the normalizing constant:

$$Z = \sum_x \prod_i \phi_i(x_i) \prod_{<ij>} \phi_{ij}(x_i, x_j)$$

$\phi_i(x_i)$ and $\phi_{ij}(x_i, x_j)$ represent the (unidentified) functions for the node and pairwise potentials of the network, respectively. To identify these functions, the logarithms of each potential function are often constrained to sum to zero.

Ising Model

The Ising model is the prototypical form of the PMRF when data are binary, which can either be coded $X \in \{-1, 1\}$ or $X \in \{0, 1\}$. In the former configuration we obtain the Ising model distribution through the suitable log-linear representation for the potential functions in the PMRF:

$$\ln \phi_i(x_i) = \tau_i x_i \quad\rightarrow\quad \tau_i = \ln \phi_i(x_i = 1)$$

$$\ln \phi_{ij}(x_i, x_j) = \omega_{ij} x_i x_j \quad\rightarrow\quad \omega_{ij} = \ln \phi_{ij}(x_i = 1, x_j = 1)$$

where $\tau_i$ is the threshold parameter, and $\omega_{ij}$ the network parameter that captures the pairwise interaction between nodes i and j.

The threshold parameter $\tau_i \in \mathbb{R}$ indicates the preference of $X_i$ to be in state $x_i$, and is centered around zero, where zero indicates that the node has no preference for any specific state. Negative values indicate a preference for $X_i = -1$, whereas positive values indicate a preference for $X_i = 1$.

The network parameter $\omega_{ij} \in \mathbb{R}$ indicates the preference of $X_i$ to be in the same state as $X_j$. Negative values indicate a preference for $X_i \neq X_j$, whereas positive values indicate a preference for $X_i = X_j$. A value of zero indicates that the preference of node $X_i$ to be in a certain state is independent of the state of node $X_j$, from which follows:

$$\omega_{ij} = 0 \iff \{i, j\} \notin E$$

It is important to note that the distribution of the Ising model can also be defined as a function of $X \in \{0, 1\}$:

$$P(X = x \mid x \in \{0, 1\}) = \frac{1}{Z} \exp\left(x^{\mathsf{T}} \Sigma x\right) = \frac{1}{Z} \exp\left(\sum_i \sigma_{ii} x_i + \sum_{i \neq j} \sigma_{ij} x_i x_j\right)$$

where $\Sigma$ represents the magnetic field of the Ising model. However, as we will show next, the values of $\Sigma$ are not identical to, but a function of, the $\tau$ and $\Omega$ parameter values in the $X \in \{-1, 1\}$ configuration of the Ising model.

We can rewrite the $X \in \{0, 1\}$ Ising model to the $X \in \{-1, 1\}$ configuration in the following way:

1: Define $Y \in \{-1, 1\}$ as a function of $X \in \{0, 1\}$:

$$x_i = \frac{y_i + 1}{2}$$

2: Write the Ising model as a function of $Y \in \{-1, 1\}$:

$$P(Y = y) = \frac{1}{Z} \exp\left(\sum_i \sigma_{ii}\left[\frac{y_i + 1}{2}\right] + \sum_{i \neq j} \sigma_{ij}\left[\frac{y_i + 1}{2}\right]\left[\frac{y_j + 1}{2}\right]\right)$$

3: The +1 in the first sum cancels out in Z; expand the second sum:

$$= \frac{1}{Z^{*}} \exp\left(\sum_i \left[\frac{\sigma_{ii}}{2}\right] y_i + \sum_{i \neq j} \left[\frac{\sigma_{ij}}{4}\right](y_i y_j + y_i + y_j + 1)\right)$$

4: The +1 in the second sum cancels out in $Z^{*}$; factorize the second sum:

$$= \frac{1}{Z^{**}} \exp\left(\sum_i \left[\frac{\sigma_{ii}}{2}\right] y_i + \sum_{i \neq j}\left(\left[\frac{\sigma_{ij}}{4}\right] y_i y_j + \left[\frac{\sigma_{ij}}{4}\right] y_i + \left[\frac{\sigma_{ij}}{4}\right] y_j\right)\right)$$

5: Rearrange and simplify:

$$P(Y = y) = \frac{1}{Z^{**}} \exp\left(\sum_i \left[\frac{\sigma_{ii}}{2} + \sum_{i \neq j} \frac{\sigma_{ij}}{2}\right] y_i + \sum_{i \neq j}\left[\frac{\sigma_{ij}}{4}\right] y_i y_j\right)$$

We can now derive the transformation of the values in the magnetic field ($\Sigma$) for the $X \in \{0, 1\}$ Ising configuration into the threshold ($\tau$) and network ($\Omega$) parameters in the $X \in \{-1, 1\}$ Ising configuration:

$$\tau_i = \frac{\sigma_{ii}}{2} + \sum_{i \neq j} \frac{\sigma_{ij}}{2} \iff \sigma_{ii} = 2\tau_i - 4\sum_{i \neq j} \omega_{ij}$$

$$\omega_{ij} = \frac{\sigma_{ij}}{4} \iff \sigma_{ij} = 4\omega_{ij}$$
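As a quick numerical check of this transformation (our own sketch, with an arbitrary $\Sigma$), the R code below maps a {0, 1}-coded Ising model to (τ, Ω) and verifies that both parameterizations assign identical probabilities to every state.

## Our own sketch: verify the Sigma -> (tau, Omega) transformation numerically.
set.seed(2)
N <- 4
Sigma <- matrix(rnorm(N^2), N, N)
Sigma <- (Sigma + t(Sigma)) / 2

X01 <- as.matrix(expand.grid(rep(list(0:1), N)))   # all {0,1} states
Y   <- 2 * X01 - 1                                 # the same states, coded {-1,+1}

## {0,1} parameterization: p(x) proportional to exp(x' Sigma x)
p0 <- exp(rowSums((X01 %*% Sigma) * X01))
p0 <- p0 / sum(p0)

## transformed parameters
Omega <- Sigma / 4
diag(Omega) <- 0
tau <- diag(Sigma) / 2 + (rowSums(Sigma) - diag(Sigma)) / 2

## {-1,+1} parameterization: p(y) proportional to exp(y' tau + y' Omega y)
p1 <- exp(Y %*% tau + rowSums((Y %*% Omega) * Y))
p1 <- p1 / sum(p1)

max(abs(p0 - p1))   # should be numerically zero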

We can then rewrite the distribution function for the $X \in \{-1, 1\}$ Ising configuration in terms of matrix multiplication:

$$P(X = x) = \frac{1}{Z^{**}} \exp\left(\sum_i \tau_i x_i + \sum_{<ij>} \omega_{ij} x_i x_j\right) = \frac{1}{Z^{**}} \exp\left(x^{\mathsf{T}}\tau + x^{\mathsf{T}}\Omega x\right)$$

(Constrained) ℓ1 regularization

$\ell_1$ regularization optimizes the Ising model pseudo-likelihood by penalizing the sum of absolute parameter values to be under some value. The following expression is maximized for each node i:

$$\max\left[\mathcal{L}_i(\tau, \Omega; x) - \lambda \sum_{j=1,\, j \neq i}^{k} |\omega_{ij}|\right]$$

where $\sum_{j=1,\, j \neq i}^{k} |\omega_{ij}|$ is the penalty function and $\lambda$ the tuning parameter. Increasing the value of $\lambda$ will force the estimates of $\omega_{ij}$ to shrink to exactly zero.

In constrained $\ell_1$ regularization the range of $\lambda$ values is determined for which the resulting networks contain N $\omega_{ij}$ parameters that are not zero. Although all the models that conform to this constraint have the same number of non-zero edges, it is possible for them to have different neighborhood sets. To select the model with the best set of neighbors, the extended Bayesian Information Criterion (EBIC; van Borkulo et al., 2014) is used.
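A rough sketch of this constrained search is given below. It is our own illustration built on the glmnet lasso routine rather than the modified IsingFit code used in the simulation study, it omits the EBIC step for choosing among admissible models, and the function name and λ grid are arbitrary.

## Rough sketch (ours) of a constrained l1 search via nodewise lasso regressions.
library(glmnet)

constrained_l1 <- function(X, max_edges = ncol(X)) {   # X coded -1/+1
  N <- ncol(X)
  last_ok <- matrix(0, N, N)
  for (lam in 10^seq(1, -3, length.out = 50)) {        # strong -> weak penalty
    omega <- matrix(0, N, N)
    for (i in seq_len(N)) {
      fit <- glmnet(X[, -i], factor(X[, i]), family = "binomial", lambda = lam)
      omega[i, -i] <- as.matrix(coef(fit))[-1, 1] / 2  # slopes are 2 * omega_ij
    }
    omega <- (omega + t(omega)) / 2                    # symmetrize the two estimates
    if (sum(omega[upper.tri(omega)] != 0) > max_edges) break
    last_ok <- omega                                   # last model within the constraint
  }
  last_ok
}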

Low Rank Approximation - r1

In the latent variable representation of the Ising model used by the low-rank approximation, every eigenvector of a connectivity matrix gives rise to a latent variable, such that all variables are independent conditionally on the full set of latent variables:

$$\exists\, \Theta : \ \perp\!\!\!\perp X \mid \Theta$$

These latent variables ($\Theta$) explain all the pairwise interactions in a statistical sense, and the distribution of the observed variables conditionally on the latent variables is known as a multidimensional Item Response Theory (IRT) model in the field of psychometrics:

$$P(X = x \mid \Theta = \theta) = \prod_i \frac{\exp(x_i[b_i + 2a_i\theta])}{\exp(+[b_i + 2a_i\theta]) + \exp(-[b_i + 2a_i\theta])}$$

where $b_i$ is the item easiness parameter, with a similar interpretation as the threshold parameter in the Ising model. $\theta$ is a vector of length r that contains the realization of the latent variable ($\Theta$), while $a_i$ is a vector of length r that contains the discrimination of item i on every latent variable in the multidimensional space. Because r = 1 in the r1 method of sparsity, we are dealing with the well-known two-parameter logistic model (2PL; Birnbaum, 1968).

Interaction Model - a1

The interaction model is a generalization of the Rasch model (RM; Rasch, 1960) for binary responses ($X \in \{0, 1\}$) that does not assume local independence, but instead models interactions between the items and the sum score (S), such that the response pattern and person ability ($\theta$) are conditionally independent given the sum score:

$$P(X = x) = \frac{\exp\left(s\theta + \beta^{\mathsf{T}} x + (s - 1)\,\delta^{\mathsf{T}} x\right)}{Z}$$

where $\beta$ is the vector of item easiness parameters, and $\delta$ the additive interaction term. If all $\delta = 0$ the interaction model reduces to the RM.

Simulation Study

The simulation study from the results section was done in the program R (R Core Team, 2014).

We constructed a network for 25 nodes using the following code for the pairwise associations matrix (see Figure 1 for a graphical display):

omega.i <- rep(seq(-.25, .25, length = 5), 5)   # node-level weights
con.mat <- outer(omega.i, omega.i, "+") / 10    # pairwise association: (omega.i[i] + omega.i[j]) / 10
diag(con.mat) <- 0                              # no self-loops

Initial threshold parameters were set as a sequence with length 25 from -1 to 1. For the part corresponding to Figure 5, this entire sequence was either multiplied or divided by an integer.

25000 response patterns were then sampled from the partial conditionals of the pairwise associations matrix and each configuration of the threshold parameters.
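As an illustration of what sampling from the partial conditionals amounts to, here is our own minimal Gibbs sketch (the thesis itself used the IsingSampler package for the ℓ1 model and custom routines otherwise):

## Our own minimal Gibbs sketch of sampling from the partial conditionals.
sample_ising <- function(n_samples, omega, tau, n_sweeps = 100) {
  N <- length(tau)
  X <- matrix(sample(c(-1, 1), n_samples * N, replace = TRUE), n_samples, N)
  for (s in seq_len(n_sweeps)) {
    for (i in seq_len(N)) {
      eta <- tau[i] + X[, -i, drop = FALSE] %*% omega[-i, i]
      p1  <- 1 / (1 + exp(-2 * eta))          # P(X_i = +1 | rest)
      X[, i] <- ifelse(runif(n_samples) < p1, 1, -1)
    }
  }
  X
}
## e.g. x.sim <- sample_ising(25000, con.mat, seq(-1, 1, length = 25))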

Next, each method of sparsity was applied to the generated data, and from the resulting models new response patterns were sampled.

For estimation of the $\ell_1$ method we used the IsingFit package (van Borkulo et al., 2014), which we modified to search for the $\lambda$ range that resulted in a network containing at most 25 edges that were non-zero. New response patterns under the $\ell_1$-constructed model were generated using the IsingSampler package (Epskamp, 2015).

Estimation and data generation for the r1 method was done with a custom routine created by Maarten Marsman; for the a1 method a routine was created by the authors. For both these methods an R package will be created in the near future and made publicly available.

The pairwise correspondence for the item pair [i, j] in the original and simulated data was recorded as follows (given $X \in \{-1, 1\}$, with k indexing the observations for each pair):

$$C_{i,j} = \sum_k \left(\frac{x_{ik} + x_{jk}}{2}\right)^2$$
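Since $((x_{ik} + x_{jk})/2)^2$ equals one exactly when the two items agree on observation k and zero otherwise, the whole matrix of these statistics reduces to a single cross-product; the short R sketch below (ours) computes it.

## Sketch (ours): all pairwise correspondences C_{i,j} at once.
pairwise_identical <- function(X) {   # X: n_obs x N matrix coded -1/+1
  (nrow(X) + crossprod(X)) / 2        # C[i, j] = sum_k ((x_ik + x_jk)/2)^2
}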

To examine the extent to which each method of sparsity was capable of recovering the original data structure, the values for each item pair were plotted against each other for the original and generated data (as seen in Figure 5).
