
Testing the Full Information Approach for Handling Missing Data in the j2 model for Social Network Analysis

Steven Kuijper
5471 words
Research Internship
Dr. Bonne Zijlstra & Dr. Terrence Jorgensen
11-11-2019


Abstract

Dealing with missing data is a challenge in social research, and especially in social network analysis. Analyzing social networks that contain missing data can lead to biased parameter estimates. Existing statistical techniques for handling missing data in social network analysis involve imputing relational ties and thus constructing complete networks. This simulation study aims to test a different method for handling missing data: the Full Information approach. The j2 model for social network analysis incorporates the Full Information approach and estimates parameters using an adaptive random walk algorithm. Through this approach, the j2 model can estimate parameters with missing data present in the dataset, without imputing relational ties. Networks of 40 and 20 actors were simulated, each with different amounts and patterns of missing data. Results indicated that for networks of 40 actors with 5 and 25 percent missing data, regardless of the missing data pattern, parameter estimates and standard errors were relatively unbiased. Similar results were found for networks of 20 actors, although 25 percent missing data might be the maximum amount for such a network. Results showed that the Full Information approach is quite promising.


Testing the Full Information Approach for Handling Missing Data in the j2 model for Social Network Analysis

A social network can best be thought of as “Individuals […] tied to one another by invisible bonds which are knitted together into a criss-cross mesh of connections, much as a fishing net or a length of cloth is made from intertwined fabrics” (Scott, 1988, p. 109). A simple example of such invisible bonds is that of cliques in schools or classrooms. Peers within these cliques engage and interact with one another in all sorts of ways, which makes them an interesting subject of analysis for social researchers. In particular, hypotheses about specific interaction patterns can be formulated and tested.

With the introduction of social network analysis, researchers became able to formalize such ties between individuals mathematically and to investigate them in a quantitative rather than a qualitative manner. In social network analysis, a social network is commonly represented as a directed graph depicting the absence or presence of a relation for a certain set of actors. Figure 1 displays an example of a social network taken from a study by Cornelissen, McLellan and Schofield (2017), in which collegial ties among school staff were investigated longitudinally. Looking at the structure of these social networks, one can easily see where the fishing net metaphor comes from: much like a fishing net, in which wires are connected by knots, social relations (the wires) connect actors (the knots).


Figure 1: Social network of school staff (Cornelissen, McLellan & Schofield, 2017).

Nodes represent staff members and edges represent relational ties between actors. The size of a node represents the popularity of an actor.

Formalizing social networks involves attaching values to the presence or absence of a relationship between actors. Suppose that the presence of a tie is represented by 1 and the absence of a tie by 0. Consider persons A and B, and suppose that A considers B to be his friend, but B does not consider A to be his friend. Then A has a directed relation with B, but B not with A. The opposite direction of this relationship can also hold: B might consider A to be a friend while A does not, and A and B might also not call each other a friend at all. Drawing from this, there are three main relational configurations for two actors (a dyad): asymmetric relations (A=1, B=0 or A=0, B=1), symmetric relations (A=1, B=1), and null relations (A=0, B=0).
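As a small illustration (not taken from the paper), the following Python sketch classifies each dyad of a toy three-actor network directly from its adjacency matrix:

```python
import numpy as np

# Toy 3-actor network: entry [i, j] is 1 if actor i nominates actor j as a friend.
Y = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]])

def dyad_type(y_ij, y_ji):
    """Classify one dyad as null, asymmetric, or symmetric."""
    if y_ij == 1 and y_ji == 1:
        return "symmetric"
    if y_ij == 0 and y_ji == 0:
        return "null"
    return "asymmetric"

for i in range(len(Y)):
    for j in range(i + 1, len(Y)):
        print(f"dyad ({i}, {j}): {dyad_type(Y[i, j], Y[j, i])}")
# dyad (0, 1): symmetric; dyad (0, 2): null; dyad (1, 2): asymmetric
```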

Missing data

Social network analysis is susceptible to missing data. Kossinets (2006) found that network parameter estimates can become biased when missing values are present in the data. This poses a problem for applied researchers. Although it is desirable to have all observations in a network, the reality is that data are often missing for various reasons. In general, three patterns of missing data can be distinguished: tie missing data, dyad missing data, and actor missing data. Consider tie missing data first. Suppose that A considers B to be a friend, but it is unknown whether B considers A to be a friend. In this case we have no information about one tie within a certain dyad. In the dyad missing data pattern, there is no information on either tie within a dyad: returning to the example of actors A and B, it is unknown whether A considers B to be a friend and it is unknown whether B considers A to be a friend. Actor missing data refers to the absence of one or more actors in the network. Actors could, for instance, be missing from the data because they were not present or because they did not want to participate in the study. Finally, there is one special case that should be mentioned: missing actors in a study that used peer nominations for data collection. When an actor is missing, he or she cannot nominate other actors, but the missing actor can in fact be nominated. In reality, either of these missing data patterns or a combination thereof can be present in the data.
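In matrix terms, the three patterns differ only in which cells of the adjacency matrix are unobserved. A minimal sketch (the function names are illustrative, not from the paper), marking missing entries with NaN:

```python
import numpy as np

def tie_missing(Y, i, j):
    """Tie missing data: only the tie from actor i to actor j is unobserved."""
    Y = Y.astype(float).copy()
    Y[i, j] = np.nan
    return Y

def dyad_missing(Y, i, j):
    """Dyad missing data: both ties within the dyad (i, j) are unobserved."""
    Y = Y.astype(float).copy()
    Y[i, j] = Y[j, i] = np.nan
    return Y

def actor_missing(Y, i):
    """Actor missing data: actor i reports no outgoing ties, but can still be nominated."""
    Y = Y.astype(float).copy()
    Y[i, :] = np.nan
    np.fill_diagonal(Y, 0)  # the diagonal is not a tie and stays fixed at 0
    return Y
```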

Data collection. When collecting data for a social network study, applied researchers have many different data collection techniques to choose from. These methods are all, to a certain degree, susceptible to missing data. Consider a researcher who aims to investigate friendship structures within a classroom. He or she might apply a frequently used method whereby every student in the classroom is asked, for each classmate, whether they consider that person to be their friend. Now suppose that, for whatever reason, a few students fail to report whether they consider some classmates to be their friends. This immediately results in missing data. Now consider that multiple students are not present at the moment of data collection. That too will result in missing data. Hence, the frequently used method of individual peer nomination is very susceptible to missing data.


However, there are also data collection methods that are less prone to missing data. Neal (2008) proposes a method of data collection whereby actors are asked to identify the presence or absence of a relational tie for each dyad of individuals in the system. From these reports researchers can then construct the social network. A somewhat similar approach was suggested by Cairns et al. (1985), who asked a subset of actors to report on groups or cliques of actors in the system who “hung around together a lot”, and to report individual children who were isolated from their peers. Both methods have the advantage over the classic peer nomination method that large gaps in the data due to non-responding actors are less likely: if one actor fails to report on a dyad or a clique, chances are very slim that all other actors also forget to report on that dyad or clique. These methods are therefore less prone to missing data. A disadvantage of these methods is that there is no straightforward technique to analyze the resulting networks and estimate parameters. Moreover, they are quite labor intensive for the researchers. These might be the reasons why data collection for social network analyses is often done with the classic peer nomination method.

Handling missing data. Albeit not impossible, it can be difficult to obtain complete datasets in social network analyses. Follow-up procedures could, for instance, be used to fill the gaps in the data after the initial data collection, but this is quite labor intensive, especially if there are many missing values. Statistical techniques that handle such missing data and provide unbiased parameter estimates are therefore important. One of these techniques is imputation of missing network data, which can be applied to analyze datasets with missing values. Stork and Richards (1992) proposed the imputation by reconstruction technique. With this technique, an incoming tie to an unknown (missing) actor is reciprocated by imputing a reciprocal tie to the known actor who sent the initial tie. The rationale for this imputation technique rests on the assumption that if actor A claims to have a tie to actor B, then actor B will probably also claim to have a tie with actor A. This technique unfortunately does not work for every type of missing data; it only works for tie missing data. In the case of dyad missing data, where no tie information is available, it is not possible to impute a tie based on the presence or absence of an incoming tie.

A different approach was proposed by Burt (1987): unconditional imputation. In this approach, the choice of imputation is based on the overall network density, defined as the ratio of the number of observed ties to the total number of potential ties. If the network density is larger than 0.5, a tie is imputed for all missing values; if the network density is smaller than 0.5, a zero—meaning no tie—is imputed for all missing values.

A more recently proposed technique is imputation by preferential attachment (Huisman & Steglich, 2008). In preferential attachment, the decision whether or not to impute a tie is probabilistic and depends on the actor's connectivity to other actors. This is in sharp contrast to unconditional imputation and imputation by reconstruction, where the decision to impute is not probabilistic but a deterministic, binary decision.
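To make these three rules concrete, the sketch below gives rough Python renderings of each decision rule as described above; these are illustrative readings, not the original authors' procedures, and the preferential attachment probability in particular is a simplified, in-degree-based stand-in:

```python
import numpy as np

def impute_reconstruction(Y):
    """Reconstruction (Stork & Richards, 1992): mirror the tie reported by the other actor."""
    Y = Y.copy()
    miss = np.isnan(Y)
    Y[miss] = Y.T[miss]  # stays NaN when the opposite tie is also missing (dyad missing data)
    return Y

def impute_unconditional(Y):
    """Unconditional imputation (Burt, 1987): impute 1 if observed density exceeds 0.5, else 0."""
    Y = Y.copy()
    obs = ~np.isnan(Y)
    np.fill_diagonal(obs, False)
    density = Y[obs].sum() / obs.sum()
    Y[np.isnan(Y)] = 1.0 if density > 0.5 else 0.0
    return Y

def impute_preferential(Y, seed=0):
    """Preferential attachment (Huisman & Steglich, 2008), loosely: impute a tie to actor j
    with a probability that grows with j's observed in-degree (a simplified stand-in rule)."""
    rng = np.random.default_rng(seed)
    Y = Y.copy()
    in_degree = np.nansum(Y, axis=0)
    p = (in_degree + 1) / (in_degree.max() + 2)  # illustrative probability per receiver
    for i, j in zip(*np.where(np.isnan(Y))):
        Y[i, j] = float(rng.uniform() < p[j])
    return Y
```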

Huisman (2009) compared imputation by reconstruction, unconditional imputation, and preferential attachment as statistical missing data handling techniques in a simulation study. Huisman generated complete directed and undirected networks and then deleted a portion of the ties or actors to generate missing data. Surprisingly, one of the simplest imputation techniques, imputation by reconstruction, generated the least biased parameter estimates in directed networks. Thus, applied researchers have the choice between deleting the missing data and actors, or imputing data. Neither option is ideal: even when only a few observations (ties) are missing, a lot of information potentially has to be removed in order to work with the data, so valuable information is lost in the process of deletion.


Full information data analysis

In statistical analyses other than social network analysis, various missing data techniques exist. One of these is a maximum likelihood (ML) approach for dealing with missing data. In principle, ML approaches use the complete information in the data: they estimate parameters based on the information available, without the necessity of listwise deletion or imputation of data. One application of the ML approach to dealing with missing data is in the context of structural equation modelling and is called Full Information Maximum Likelihood (FIML; see Arbuckle, 1995). The core principle is the same: utilizing all data (and their full information) to estimate parameters. Enders and Bandalos (2001) examined the performance of FIML for different types of missing data using a Monte Carlo study. Results indicated that FIML yielded less biased and more efficient parameter estimates compared to listwise deletion, pairwise deletion, and similar response pattern imputation. The bias was especially low (1%-2%) when the data were missing at random (MAR).

The conceptual idea of FIML is quite elegant because it maximizes the available information in the data. It would also be very attractive to incorporate in social network analysis, where missing data are commonplace. However, an analysis method analogous to FIML has, to date, never been applied to social network analysis. For this reason, a new method of dealing with missing data in social network analysis will be tested through a Monte Carlo simulation study. This method will, analogously to ML and FIML, use all information from the actors in the network, without imputing or deleting information. This new method will be referred to as the Full Information approach.
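For reference, the FIML principle for multivariate normal data can be written compactly in its standard casewise form (a textbook formulation, not specific to this paper):

$$\log L(\mu, \Sigma) = \sum_{i=1}^{N} \left\{ -\tfrac{1}{2} \log \bigl|\Sigma_{(i)}\bigr| - \tfrac{1}{2} \bigl(y_{(i)} - \mu_{(i)}\bigr)^{\top} \Sigma_{(i)}^{-1} \bigl(y_{(i)} - \mu_{(i)}\bigr) \right\} + \text{constant},$$

where $y_{(i)}$ contains only the variables observed for case $i$, and $\mu_{(i)}$ and $\Sigma_{(i)}$ are the corresponding subvector and submatrix of the model-implied mean and covariance. No case is deleted and nothing is imputed; each case simply contributes whatever information it has.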


Statistical models

Before going into the details of this study, a proper overview of some key statistical models in social network analysis is in order.

j1 and j2 model. The p1 (Holland & Leinhardt, 1981) and p2 (van Duijn & Snijders, 1995) models took large leaps in the development of social network models. The p1 model introduced probability functions for the different dyadic outcomes through multinomial models. The p2 model extended the p1 model by introducing random effects for the sender and receiver parameters.

The j1 and j2 social network models are an adaptation of the p1 and p2 models insofar that the reciprocity parameter r and the density parameter m are modelled differently. Zijlstra (2017) first proposed the j1 model, in which the probability functions of the different dyadic outcomes are modeled through the density parameters $m_{ij}$ and the parameter $c_{ij}$, which indirectly models the reciprocity $r_{ij}$:

$$P(Y_{ij}=0,\, Y_{ji}=0) = (1 + c_{ij})/k_{ij}$$
$$P(Y_{ij}=1,\, Y_{ji}=0) = \{\exp(m_{ij}) - c_{ij}\}/k_{ij}$$
$$P(Y_{ij}=0,\, Y_{ji}=1) = \{\exp(m_{ji}) - c_{ij}\}/k_{ij}$$
$$P(Y_{ij}=1,\, Y_{ji}=1) = \{\exp(m_{ij} + m_{ji}) + c_{ij}\}/k_{ij}$$

with

$$k_{ij} = 1 + \exp(m_{ij}) + \exp(m_{ji}) + \exp(m_{ij} + m_{ji}),$$

and

$$\exp(m_{ij}) = \frac{P(Y_{ij}=1)}{P(Y_{ij}=0)}, \qquad \exp(r_{ij}) = \frac{P(Y_{ij}=0, Y_{ji}=0)\, P(Y_{ij}=1, Y_{ji}=1)}{P(Y_{ij}=1, Y_{ji}=0)\, P(Y_{ij}=0, Y_{ji}=1)}.$$

The density parameter $m_{ij}$ models the log odds of a single tie, rather than the log odds of an asymmetric dyad versus a null dyad as the density parameter in the p1 model does. The reciprocity parameter $r_{ij}$ is modeled similarly to the reciprocity parameter in the p1 model. The values for $c_{ij}$ are obtained by solving the expression for $r_{ij}$ for $c_{ij}$. Thus, in the j1 model the probability of a tie $P(Y_{ij} = 1)$ depends only on $m_{ij}$ and is independent of $c_{ij}$, which is a clear difference compared to the p1 model.
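As a numerical illustration of this parameterization (a sketch based on the equations above, not the authors' implementation), the four dyad probabilities can be computed from $m_{ij}$, $m_{ji}$ and $r_{ij}$ by solving numerically for $c_{ij}$:

```python
import numpy as np
from scipy.optimize import brentq

def dyad_probs(m_ij, m_ji, r_ij):
    """Return the j1 dyad probabilities P(0,0), P(1,0), P(0,1), P(1,1)."""
    M1, M2 = np.exp(m_ij), np.exp(m_ji)
    k = 1 + M1 + M2 + M1 * M2

    # c_ij is the value that makes the log odds ratio of the four dyad outcomes equal
    # to r_ij; it must lie in (max(-1, -M1*M2), min(M1, M2)) for valid probabilities.
    def f(c):
        return (np.log1p(c) + np.log(M1 * M2 + c)
                - np.log(M1 - c) - np.log(M2 - c) - r_ij)

    lo, hi = max(-1.0, -M1 * M2), min(M1, M2)
    eps = 1e-9 * (hi - lo)
    c = brentq(f, lo + eps, hi - eps)

    return np.array([1 + c, M1 - c, M2 - c, M1 * M2 + c]) / k

# dyad_probs(0.0, 0.0, 1.0) -> approximately [0.311, 0.189, 0.189, 0.311]:
# the tie probability is 0.5 (since m_ij = 0), while mutual dyads are favoured (r_ij = 1).
```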

The j2 model extends the j1 model analogously to how the p2 model extends the p1 model. The density parameter $m_{ij}$ is modeled by adding random sender effects $A_i$ and receiver effects $B_j$, an overall parameter for the network density $m$, and covariates $X_1$, $X_2$, $X_3$ with corresponding regression weights $\gamma_1$, $\gamma_2$, $\gamma_3$:

$$m_{ij} = m + A_i + B_j + \gamma_1 X_{1i} + \gamma_2 X_{2j} + \gamma_3 X_{3ij},$$

and the reciprocity $r_{ij}$ is modeled with an overall parameter for the reciprocity $r$ and regression coefficient $\gamma_4$ for covariate $X_{4ij}$:

$$r_{ij} = r + \gamma_4 X_{4ij}.$$

The j2 model thus extends the j1 model analogously to how the p2 model extends the p1 model, but with the extra advantage that the reciprocity parameter and the density parameter are not dependent on each other.
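A small sketch of how these linear predictors can be assembled (the covariate handling is illustrative and the function is not taken from the paper); combined with dyad_probs above, it yields the outcome probabilities for every dyad:

```python
import numpy as np

def j2_linear_predictors(A, B, m, r, X1=None, X2=None, X3=None, X4=None,
                         gammas=(0.0, 0.0, 0.0, 0.0)):
    """Build the dyadic density (m_ij) and reciprocity (r_ij) matrices of the j2 model."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    n = len(A)
    g1, g2, g3, g4 = gammas
    X1 = np.zeros(n) if X1 is None else np.asarray(X1)        # sender covariate
    X2 = np.zeros(n) if X2 is None else np.asarray(X2)        # receiver covariate
    X3 = np.zeros((n, n)) if X3 is None else np.asarray(X3)   # dyadic covariate for density
    X4 = np.zeros((n, n)) if X4 is None else np.asarray(X4)   # dyadic covariate for reciprocity
    m_mat = m + A[:, None] + B[None, :] + g1 * X1[:, None] + g2 * X2[None, :] + g3 * X3
    r_mat = r + g4 * X4
    return m_mat, r_mat
```

Each entry m_mat[i, j] equals m + A_i + B_j plus the covariate terms, so feeding m_mat[i, j], m_mat[j, i], and r_mat[i, j] into dyad_probs gives the probabilities of the four outcomes for dyad (i, j).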


Present study

The objective of this study is to assess the degree to which the Full Information approach to dealing with missing data is able to provide unbiased network parameter estimates and standard errors in the presence of varying amounts and patterns of missing data.

Missing data patterns. This study will evaluate the performance of the Full Information approach on four types of missing data: tie missing data, dyad missing data, actor missing data, and random missing data. Consider a 3 by 3 adjacency matrix (Figure 2), where 1 refers to a relational tie, 0 refers to no relation, and NA refers to missing data. Tie missing data (left of Figure 2) implies that it is unknown whether actor 1 has a tie with actor 2, although it is known that actor 2 reports a tie with actor 1. Dyad missing data (middle of Figure 2) implies that for both actors 1 and 2 it is unknown whether they report a tie; in other words, both relational ties of the dyad are unknown. With actor missing data (right of Figure 2), all relational ties reported by an actor are missing (in this case actor 1). Finally, random missing data are generated randomly, with the possibility that any of these missing data patterns, or a combination of them, is present in the data. In every condition the missing data were generated missing completely at random (MCAR).

Overview. First, it will be investigated whether relatively small amounts (5% and 25%, respectively) of random missing data, tie missing data, dyad missing data, and actor missing data yield unbiased network parameter estimates of reciprocity, density, and receiver and sender effects, with unbiased standard errors. In the next stage, it will be investigated whether relatively large amounts of random missing data, tie missing data, dyad missing data, and actor missing data yield unbiased network parameters and correct standard errors. Finally, based on these findings, guidelines will be provided for applied researchers on how to handle missing data.


Figure 2

Tie missing data (left), Dyad missing data (middle), Actor missing data (right).

Simulation Method

Parameter estimation

The j2 social network model will be used to estimate five network parameters: the sender variance $\sigma_A^2$, the receiver variance $\sigma_B^2$, the sender–receiver covariance $\sigma_{AB}^2$, the network density $m$, and the reciprocity $r$. The parameter estimates are obtained with a Markov chain Monte Carlo algorithm, whereby the random sender effects, receiver effects, their corresponding covariance matrix, and all regression parameters are sampled (Zijlstra, 2017). Estimation of the network parameters in the presence of missing values, and thus estimation using the Full Information approach, is carried out with an adaptive random walk algorithm (Zijlstra, van Duijn, & Snijders, 2009).
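The details of that sampler are given by Zijlstra et al. (2009); purely to illustrate the general idea, a generic componentwise adaptive random-walk Metropolis update (not the authors' implementation) could look as follows:

```python
import numpy as np

def adaptive_rw_metropolis(log_post, theta0, n_iter=5000, target_accept=0.44, seed=0):
    """Componentwise random-walk Metropolis with Robbins-Monro adaptation of proposal scales."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    log_scale = np.zeros_like(theta)            # log proposal standard deviation per parameter
    current_lp = log_post(theta)
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        for j in range(theta.size):
            prop = theta.copy()
            prop[j] += np.exp(log_scale[j]) * rng.standard_normal()
            prop_lp = log_post(prop)
            accept = np.log(rng.uniform()) < prop_lp - current_lp
            if accept:
                theta, current_lp = prop, prop_lp
            # nudge the proposal scale toward the target acceptance rate (diminishing adaptation)
            log_scale[j] += (float(accept) - target_accept) / np.sqrt(t + 1)
        draws[t] = theta
    return draws

# e.g. adaptive_rw_metropolis(lambda th: -0.5 * np.sum(th ** 2), np.zeros(2)) samples a
# bivariate standard normal; in the j2 model the log posterior of the observed ties
# plays the role of log_post.
```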

Simulations

To assess the overall performance of the Full Information approach on estimating network parameters, a simulation study was performed. The general setup of this study was as follows:

1. Generate a complete network;

2. Generate missing data in each network by randomly removing a given percentage of outgoing and incoming ties from the data (a sketch of steps 1 and 2 is given below the list);

3. Estimate the network parameters with the j2 model, using the Full Information approach;

4. Analyze the bias of the estimated parameters and variances.
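A compact sketch of steps 1 and 2, reusing the dyad_probs and j2_linear_predictors sketches above (the default values are the population values of the first simulation run described in the next paragraph, but the code itself is illustrative rather than the authors' simulation script):

```python
import numpy as np

def simulate_network(n=40, var_a=1.0, var_b=1.0, cov_ab=0.0, m=-1.0, r=1.0, seed=0):
    """Step 1: generate a complete directed network under the population values."""
    rng = np.random.default_rng(seed)
    AB = rng.multivariate_normal([0.0, 0.0], [[var_a, cov_ab], [cov_ab, var_b]], size=n)
    m_mat, r_mat = j2_linear_predictors(AB[:, 0], AB[:, 1], m, r)   # no covariates in this study
    Y = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            p = dyad_probs(m_mat[i, j], m_mat[j, i], r_mat[i, j])
            outcome = rng.choice(4, p=p)        # 0: (0,0), 1: (1,0), 2: (0,1), 3: (1,1)
            Y[i, j] = float(outcome in (1, 3))
            Y[j, i] = float(outcome in (2, 3))
    return Y

def delete_ties_mcar(Y, proportion=0.05, seed=0):
    """Step 2: set a proportion of the off-diagonal entries to missing, completely at random."""
    rng = np.random.default_rng(seed)
    Y = Y.astype(float).copy()
    cells = np.argwhere(~np.eye(len(Y), dtype=bool))   # all n*(n-1) possible ties
    drop = cells[rng.choice(len(cells), size=int(round(proportion * len(cells))), replace=False)]
    Y[drop[:, 0], drop[:, 1]] = np.nan
    return Y
```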

Population values for the network parameters were chosen to be $N_{\mathrm{actors}} = 40$, $\sigma_A^2 = 1$, $\sigma_B^2 = 1$, $\sigma_{AB}^2 = 0$, $m = -1$, and $r = 1$. In the first run of simulations we began with a relatively small amount of missing data in every condition, namely 5 percent, which amounts to 80 missing observations in a network of 1560 possible ties. The reason for this was to find out whether the Full Information method works at all under mild missing data conditions, before exploring more severe ones. In the second run of simulations, we increased the percentage of missing data to a somewhat more severe 25 percent in each condition.

Finally, as a post hoc analysis after the main analyses, in the third run of simulations we simulated networks of 20 actors (instead of 40) with a missing data percentage of 25%, to investigate whether the estimation method still produces unbiased estimates under severe missing data conditions. The severity of this condition lies in the very limited amount of information available to estimate the parameters from. Values for the j2 parameter estimation were fixed to their default values.

Method of Analysis

Collins et al. (2001) suggested that multiple criteria should be used when summarizing simulation results. For this study, the percentage bias and the coverage rates (including their confidence intervals) will be used as criteria for the analysis of the simulation results.

Percent bias. Parameter point estimates will be evaluated by looking at the percentage bias of all parameters, whereby the percentage bias is defined as the difference between the simulated (population) value and the average parameter estimate over 1000 simulations, divided by the population value. Although it is possible to conduct a significance test for the estimate of the percentage bias, Collins et al. (2001) suggest that with 1000 simulations even small biases can very quickly be found significant. Therefore, the practical rather than the statistical significance of the percentage bias will be interpreted.
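In formula-as-code form, this criterion is simply (a trivial helper, not from the paper):

```python
import numpy as np

def percent_bias(estimates, true_value):
    """Absolute percentage bias of simulation estimates relative to the population value."""
    return 100 * abs(np.mean(estimates) - true_value) / abs(true_value)

# For example, a mean reciprocity estimate of 0.973 against a population value of 1
# (Table A) gives a percent bias of 2.7, the value reported for r in Table 1.
```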

Coverage rates. The rejection rates of all network parameters will be evaluated by looking at the coverage of the 95% credibility intervals, together with the Agresti–Coull confidence interval of each coverage rate. If this confidence interval does not include the nominal 95% coverage, the deviation will be classified as statistically significant. The practical significance will also be evaluated. A high coverage (higher than 95%) implies that the null hypothesis is rejected less often than expected, indicating a more conservative test; a low coverage (lower than 95%) implies that the null hypothesis is rejected more often than expected, indicating a more liberal test.
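The Agresti–Coull interval is the standard adjusted-proportion interval for a binomial rate; a quick sketch (not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def agresti_coull_ci(successes, n, conf=0.95):
    """Agresti-Coull confidence interval for a binomial proportion such as a coverage rate."""
    z = norm.ppf(1 - (1 - conf) / 2)
    n_tilde = n + z ** 2
    p_tilde = (successes + z ** 2 / 2) / n_tilde
    half = z * np.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde - half, p_tilde + half

# A coverage of 932 out of 1000 replications gives roughly (0.915, 0.946), matching the
# bounds reported in Table E; because 0.95 falls outside this interval, that coverage
# rate is flagged as significantly different from the nominal level.
```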

Simulation Results: Main Findings

Table 1 displays the percentage biases of the parameters and standard errors in the 5% and 25% missing data conditions. It is important to note that, after 200 simulations in each condition, the parameter estimates under random missing data and tie missing data were identical, meaning that the random missing data pattern only produced missing ties and no other missing data patterns. We therefore chose to continue with simulating only the tie missing data.

5% missing data. Looking at the estimate biases in the 5% missing data condition, no obvious pattern can be found between the bias and the type of missing data. The values are fairly similar, with little variation. The standard error bias, however, does seem to increase in the dyad missing data pattern compared to the tie missing data pattern. Overall, the estimate bias and the SE bias seem relatively low, considering that the highest percentage bias is 6.5%, for the SE of the density parameter in the 5% missing data condition.


When comparing the percentage biases in Table 1 with the 95% coverage rates of the estimated parameters in Figure 3, it can first of all be seen that in the no missing data condition the 95% Agresti–Coull CIs of the coverage rates of the density, receiver variance, and sender variance parameters do not include the desired 95% coverage. The coverage appears to be significantly lower for these parameters in the no missing data condition. This could possibly be explained by the SE bias for these parameters, because they show a larger bias than the covariance and reciprocity parameters.

Under the missing data patterns, such significantly different 95% coverage rates were not found: no parameter in any missing data condition had a significantly different coverage rate, except for the covariance parameter, which differs significantly in all but the no missing data condition. Because these coverage rates are significantly lower than expected, the null hypothesis is rejected more often than expected, indicating a more liberal test. However, this lower coverage is not likely to have been caused by bias in the SE, since the covariance SE bias is 2% in each condition, including the no missing data pattern, where the 95% coverage rate is not significantly different. Although the percentage estimate bias could not be computed for the covariance parameter, since its population value is zero, the point estimates of the covariance parameter in Table B of the appendix suggest, heuristically, that the bias is not large, since the estimates do not deviate much from zero. Therefore, there does not seem to be a clear explanation for this result.


Table 1

Parameter estimate and SE bias (%), N = 40.

                            5% Missing data                25% Missing data
                      Estimate bias    SE bias       Estimate bias    SE bias
No missing data
  σ_A²                     1.4           0.0              1.4           0.0
  σ_B²                     1.6           2.0              1.6           2.0
  σ_AB²                    -             2.0              -             2.0
  m                        1.4           6.5              1.4           6.5
  r                        2.7           0.0              2.7           0.0
Random
  σ_A²                     0.3           1.0              1.5           2.2
  σ_B²                     0.0           3.1              0.1           1.8
  σ_AB²                    -             2.0              -             1.5
  m                        0.9           1.0              1.2           2.2
  r                        0.2           0.0              0.0           1.9
Dyad
  σ_A²                     0.8           3.1              1.2           2.3
  σ_B²                     0.4           4.8              0.4           5.1
  σ_AB²                    -             2.0              -             1.9
  m                        1.2           1.0              1.4           2.3
  r                        0.1           1.0              0.6           4.9
Actor
  σ_A²                     1.1           2.7              3.0           4.9
  σ_B²                     0.2           5.3              1.2           5.6
  σ_AB²                    -             2.0              -             2.1
  m                        2.3           1.2              1.8           4.0
  r                        1.0           0.8              1.9           3.1

25% missing data. Looking at Table 1, the percentage estimate biases in the 25% missing data condition do not appear to be substantially larger than in the 5% missing data condition. The percentage bias of the SE does seem to have increased. Although this is visible across almost all missing data patterns, it is especially visible under the actor missing data pattern, where the percentage biases of the SE appear larger than in the 5% missing data condition.

With regard to the rejection rates, the sender variance and covariance under the random missing data pattern have significantly lower coverage rates than the desired 95%. However, these parameters do not seem to have a surprisingly large estimate bias or SE bias. It is therefore unclear where the low coverage rate for these parameters originated.

For the actor and dyad missing data patterns, only the covariance parameter in the dyad missing data condition seems to be significantly lower than the desired 95% coverage. The SE bias for this covariance parameter is 2.1%, which is fairly similar to all other covariance SE biases. In all cases where the covariance SE bias is greater than 2%, the coverage rate seems to be significantly lower than 95%. In the single case where the covariance SE bias is lower than 2%, namely the 1.9% under the actor missing data condition, the coverage was not found to be significantly different. Thus, the lower coverage might be due to the SE bias of the covariance parameter.


Figure 3

Plots of the point estimates of the 95% coverage rates with 95% CIs for the no missing data condition and the random, dyad, and actor missing data patterns, with 5% missing data and N = 40.

Figure 4

Plots of the point estimates of the 95% coverage rates with 95% CIs for the no missing data condition and the random, dyad, and actor missing data patterns, with 25% missing data and N = 40.


Results: Post-Hoc Analysis

Following the main simulations, we kept the percentage of missing data constant (25%) and decreased the number of network actors (N = 20) to investigate the network parameters under more severe missing data conditions.

The estimate biases (Table 2) under the random missing data pattern are quite similar to the estimate biases in networks with 40 actors. However, the SE biases under the random missing data pattern are about twice as large in networks with 20 actors as in networks with 40 actors. Looking at the coverage rates for these parameters (Figure 5), only the covariance and sender variance parameters have a significantly different coverage rate. Looking at the estimate and SE biases under the actor missing data condition, the estimate biases appear larger than under the random missing data condition, and the SE biases appear smaller. In terms of coverage rates, the covariance, receiver variance, and sender variance parameters are significantly higher than 95%, indicating a more conservative test whereby the null hypothesis is rejected less often than expected. The coverage of the parameters under the dyad missing data condition is very similar to that under random missing data.


Table 2

Parameter estimate and SE bias (%), N = 20.

                        25% Missing data
                  Estimate bias    SE bias
Random
  σ_A²                 0.0           14.6
  σ_B²                 0.9           13.2
  σ_AB²                -             11.7
  m                    2.7           10.2
  r                    3.6            8.6
Dyad
  σ_A²                 4.5            7.0
  σ_B²                 5.4            5.3
  σ_AB²                -              3.6
  m                    7.2            1.8
  r                    8.1            0.0
Actor
  σ_A²                 9.0            1.9
  σ_B²                 9.9            3.8
  σ_AB²                -              5.8
  m                   11.7            7.8
  r                   12.6            9.9

Figure 5

Plots of the point estimates of the 95% coverage rates under the random, dyad, and actor missing data patterns, with 25% missing data and N = 20.


Discussion

The objective of this simulation study was to assess the performance of the Full Information approach to handling missing data in the j2 social network model. The first goal was to investigate whether the Full Information approach yields unbiased estimates and standard errors under fairly mild missing data conditions.

Summary of results.

Main analysis. After 1000 simulations of each missing data pattern (random, dyad, and actor missing data) in networks of 40 actors with 5% missing data, the j2 model provided very close parameter estimates, with a maximum percentage bias of 2.7%, which would seem very acceptable. The SE bias of these parameters was also fairly acceptable, with a maximum percentage bias of 6.5 percent. The coverage rates also seemed acceptable, since most parameters did not differ significantly from the 95% coverage. However, the covariance coverage was the most deviant compared to the other parameters. In all missing data patterns with 5% missing data, the covariance coverage was significantly lower than 95%. This pattern was also visible with 25% missing data, with the exception of the actor missing data pattern, where the coverage was not significantly different. This implies that the null hypothesis is rejected more often than expected, indicating a liberal test. As suggested in the results section, this might be due to bias in the SE, because the SE bias for these parameters appeared larger than that of the other parameters, whose coverage did not differ significantly from 95%. The other estimate and SE biases in the 25% missing data condition also seemed quite acceptable, with the largest estimate bias being 3.6% and the largest SE bias 5.2%. Thus, in a practical sense, a bias of around 5% would mean a difference in the second decimal of a point or SE estimate, which would seem very acceptable given that missing data are present.

Post-hoc analysis. This step was meant to strain the Full Information approach and investigate whether, under heavy missing data conditions, the method was still able to produce relatively unbiased estimates and SEs. In the random and dyad missing data conditions, the parameter bias did not differ much from that in networks of N = 40. However, the SE bias was substantially larger than in networks of 40 actors. The finding that the bias becomes larger under these conditions is not surprising, since the parameters have to be estimated from very little data. The coverage rates for this missing data pattern were quite the opposite of those for the networks with 40 actors: the covariance, reciprocity, and sender variance had significantly higher coverage rates, which would imply that the null hypothesis is rejected less often than expected, indicating a more conservative test.

Covariance parameter

From these results it becomes clear that the coverage rates for the sender–receiver covariance parameter were the most erratic of all the coverage rates of the estimated parameters. Only in four missing data patterns across the three conditions did the covariance parameter have a coverage rate that was not significantly different. As the results show, for the networks with 40 actors this is probably due to the SE bias, and for the networks with 20 actors due to the estimate bias. The question then is which kind of test (conservative versus liberal) is more preferable or problematic. On the one hand, a more conservative statistical test will less often yield a Type I error, that is, a significant p-value when the null hypothesis is in fact true. On the other hand, researchers will then, by definition, less often find a significant result when the alternative hypothesis is actually true. Compared to this, with a more liberal test a significant p-value will be found more frequently when the null hypothesis is in fact true, and researchers will reach the unjustified conclusion that there is an effect.


However, despite the significantly different coverage rates, the question must be asked whether these statistically different coverage rates are a practical issue as well. Perhaps not, because the largest absolute deviance from the 95% coverage was found to be 4.1%. Although statistically significant, this is still close to the desired 95% value, and therefore unlikely to have serious practical consequences.

Practical implications

The broader scope of these results implies that for networks with at least 40 actors the Full Information approach works quite well under 5% and 25% missing data. Based on these results, the expectation is that for higher percentages of missing data the Full Information approach would also be able to provide relatively unbiased parameter estimates and SEs. However, as the results for networks with 20 actors showed, that missing data condition is too severe to provide unbiased estimates. Therefore, applied researchers should be cautious when using the j2 model with networks of around 20 actors and a missing data percentage of around 25%. The estimates and SEs would then become too biased to provide reliable statistical evidence. Perhaps even smaller amounts of missing data are still problematic and yield biased estimates and SEs; future simulations with lower percentages of missing data will be needed to evaluate the Full Information approach in small samples.

Limitations

This study has two important inherent difficulties that should be addressed. A first and clear difficulty in translating these simulation results to practical research situations is that we have simulated our data under the condition that all missing values are MCAR. In non-simulated situations this assumption might not be valid. It is therefore important to stress that the results of this simulation study cannot directly be generalized to non-MCAR missing data generating processes. A second difficulty is that, due to computation times and time constraints, we were unable to include covariates in our simulations. The j2 model is pre-eminently suited for including covariates, and the fact that we were not able to do so is unfortunate.

However, although these concerns are justified and important to address, the goal of this study was to take a first step in investigating the overall performance of the Full Information approach that the j2 model incorporates, and in that we have succeeded.

Conclusion

The Full Information approach looks very promising. Its ability to estimate parameters purely from the information that is available, without having to impute, is innovative, and first results show that, apart from the covariance parameter, all parameters and SEs are estimated with little bias. For networks with 40 actors or more, the covariance parameter can also be used for statistical inference, keeping in mind that the accompanying statistical test will be somewhat liberal. For networks with 20 actors or fewer, the Full Information approach is not able to provide relatively unbiased estimates and SEs when the missing data percentage is at or around 25%.


References

Arbuckle, J. L. (1995). Amos user’s guide [Computer software]. Chicago: SmallWaters.

Burt, R. S. (1987). A note on missing network data in the general social survey. Social Networks, 9(1), 63-73.

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330.

Cornelissen, F., McLellan, R. W., & Schofield, J. (2017). Fostering research engagement in partnership schools: Networking and value creation. Oxford Review of Education, 43(6), 695-717.

Enders, C. K., & Bandalos, D. L. (2001). The relative performance of full information maximum likelihood estimation for missing data in structural equation models. Structural Equation Modeling, 8(3), 430-457.

Holland, P. W., & Leinhardt, S. (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373), 33-50.

Huisman, M., & Steglich, C. (2008). Treatment of non-response in longitudinal network studies. Social Networks, 30(4), 297-308.

Kantowitz, B. H., Roediger III, H. L., & Elmes, D. G. (2014). Experimental psychology. Nelson Education.

Kossinets, G. (2006). Effects of missing data in social networks. Social Networks, 28(3), 247-268.

Neal, J. W. (2008). “Kracking” the missing data problem: Applying Krackhardt’s cognitive social structures to school-based social networks. Sociology of Education, 81(2), 140-162.

Savalei, V. (2010). Expected versus observed information in SEM with incomplete normal and nonnormal data. Psychological Methods, 15(4), 352.

Scott, J. (1988). Social network analysis. Sociology, 22(1), 109-127.

Stork, D., & Richards, W. D. (1992). Nonrespondents in communication network studies: Problems and possibilities. Group & Organization Management, 17(2), 193-209.

Van Duijn, M., & Snijders, T. A. (1995). The p2 model. Internal publication, VSM, University of Groningen.

Van Duijn, M. A., Snijders, T. A., & Zijlstra, B. J. (2004). p2: A random effects model with covariates for directed graphs. Statistica Neerlandica, 58(2), 234-254.

Zijlstra, B. J. (2017). Regression of directed graphs on independent effects for density and reciprocity. The Journal of Mathematical Sociology, 41(4), 185-192.

Zijlstra, B. J., van Duijn, M. A., & Snijders, T. A. (2009). MCMC estimation for the p2 network regression model with crossed random effects. British Journal of Mathematical and Statistical Psychology, 62(1), 143-166.


Appendix

Table A

Point estimates, SE, SD, 95 and 99 percent coverage rates, estimated with N=40 and no missing data.

            Sim value   Estimate     SE       SD     95% coverage   99% coverage
  σ_A²          1         1.014    0.290    0.290        0.932          0.988
  σ_B²          1         1.016    0.290    0.292        0.930          0.977
  σ_AB²         0         0.004    0.192    0.196        0.941          0.981
  m            -1        -0.986    0.232    0.248        0.932          0.982
  r             1         0.973    0.235    0.235        0.952          0.986


Table B

Point estimates across replications, SE, SD, 95 and 99 percent coverage rates, estimated with N=40 and 5 percent missing data.

            Sim value   Estimate     SE       SD     95% coverage   99% coverage
Random
  σ_A²          1         1.003    0.295    0.287        0.943          0.990
  σ_B²          1         1.000    0.293    0.284        0.952          0.990
  σ_AB²         0         0.011    0.195    0.199        0.926          0.976
  m            -1        -1.009    0.240    0.238        0.938          0.981
  r             1         0.998    0.250    0.250        0.952          0.992
Dyad
  σ_A²          1         1.008    0.293    0.284        0.945          0.989
  σ_B²          1         1.004    0.292    0.278        0.954          0.993
  σ_AB²         0         0.011    0.194    0.197        0.926          0.973
  m            -1        -1.012    0.243    0.237        0.950          0.987
  r             1         1.001    0.243    0.239        0.953          0.994
Actor
  σ_A²          1         1.011    0.301    0.293        0.942          0.987
  σ_B²          1         1.038    0.328    1.000        0.947          0.990
  σ_AB²         0         0.011    0.200    0.204        0.927          0.972
  m            -1        -1.023    0.243    0.246        0.938          0.978
  r             1         1.004    0.249    0.247        0.952          0.990


Table C

Point estimates across replications, SE, SD, 95 and 99 percent coverage rates, estimated with N=40 and 25 percent missing data.

            Sim value   Estimate     SE       SD     95% coverage   99% coverage
Random
  σ_A²          1         0.985    0.327    0.320        0.934          0.988
  σ_B²          1         0.999    0.331    0.325        0.946          0.984
  σ_AB²         0         0.011    0.214    0.210        0.933          0.988
  m            -1        -1.012    0.249    0.246        0.949          0.988
  r             1         0.001    0.324    0.322        0.953          0.992
Dyad
  σ_A²          1         1.012    0.308    0.301        0.949          0.987
  σ_B²          1         1.004    0.307    0.292        0.941          0.990
  σ_AB²         0         0.009    0.202    0.199        0.935          0.985
  m            -1        -1.014    0.248    0.239        0.953          0.984
  r             1         0.994    0.273    0.268        0.956          0.993
Actor
  σ_A²          1         1.030    0.367    0.350        0.947          0.987
  σ_B²          1         1.012    0.358    0.339        0.951          0.993
  σ_AB²         0         0.002    0.240    0.235        0.940          0.987
  m            -1        -1.018    0.277    0.288        0.937          0.982
  r             1         0.981    0.323    0.333        0.948          0.984


Table D

Point estimates across replications, SE, SD, 95 and 99 percent coverage rates, estimated with N=20 and 25 percent missing data.

            Sim value   Estimate     SE       SD     95% coverage   99% coverage
Random
  σ_A²          1         1.000    0.615    0.525        0.977          0.997
  σ_B²          1         0.991    0.607    0.527        0.958          0.998
  σ_AB²         0         0.005    0.392    0.344        0.967          0.995
  m            -1        -1.041    0.372    0.364        0.930          0.985
  r             1         0.925    0.705    0.697        0.957          0.996
Dyad
  σ_A²          1         1.022    0.384    0.393        0.949          0.988
  σ_B²          1         1.000    0.373    0.368        0.932          0.988
  σ_AB²         0         0.012    0.245    0.240        0.943          0.987
  m            -1        -1.020    0.279    0.274        0.950          0.989
  r             1         0.990    0.362    0.406        0.950          0.949
Actor
  σ_A²          1         1.208    0.791    0.703        0.971          1.000
  σ_B²          1         1.160    0.727    0.609        0.990          1.000
  σ_AB²         0         0.010    0.482    0.390        0.977          0.994
  m            -1        -1.072    0.461    0.450        0.962          0.994
  r             1         0.955    0.714    0.677        0.961          1.000


Table E

95% Coverage rates with lower and upper bounds of the 95% confidence interval. N=40

                 95% coverage   Lower 95% CI   Upper 95% CI   95% coverage   Lower 95% CI   Upper 95% CI
No missing data
  σ_A²               0.932          0.915          0.946           -              -              -
  σ_B²               0.930          0.912          0.944           -              -              -
  σ_AB²              0.941          0.923          0.954           -              -              -
  m                  0.932          0.915          0.946           -              -              -
  r                  0.952          0.937          0.964           -              -              -

                            5% Missing data                            25% Missing data
Random
  σ_A²               0.943          0.927          0.956         0.934          0.917          0.948
  σ_B²               0.952          0.937          0.964         0.946          0.930          0.959
  σ_AB²              0.926          0.908          0.941         0.933          0.916          0.947
  m                  0.938          0.921          0.951         0.949          0.933          0.961
  r                  0.952          0.937          0.964         0.953          0.938          0.965
Dyad
  σ_A²               0.945          0.929          0.958         0.949          0.934          0.961
  σ_B²               0.954          0.939          0.966         0.941          0.925          0.954
  σ_AB²              0.926          0.908          0.941         0.935          0.918          0.949
  m                  0.950          0.935          0.962         0.953          0.938          0.965
  r                  0.953          0.938          0.965         0.956          0.941          0.967
Actor
  σ_A²               0.942          0.926          0.955         0.947          0.931          0.959
  σ_B²               0.947          0.931          0.959         0.951          0.936          0.963
  σ_AB²              0.927          0.909          0.942         0.940          0.923          0.953
  m                  0.938          0.921          0.951         0.937          0.920          0.951
  r                  0.952          0.937          0.964         0.948          0.932          0.960


Table F

95% Coverage rates with lower and upper bounds of the 95% confidence interval. N=20

                        25% Missing data
                 95% coverage   Lower 95% CI   Upper 95% CI
Random
  σ_A²               0.977          0.966          0.985
  σ_B²               0.958          0.944          0.969
  σ_AB²              0.967          0.954          0.977
  m                  0.930          0.912          0.944
  r                  0.957          0.943          0.968
Dyad
  σ_A²               0.949          0.933          0.961
  σ_B²               0.932          0.915          0.946
  σ_AB²              0.943          0.927          0.956
  m                  0.950          0.935          0.962
  r                  0.950          0.935          0.962
Actor
  σ_A²               0.971          0.959          0.980
  σ_B²               0.990          0.981          0.995
  σ_AB²              0.977          0.966          0.985
  m                  0.962          0.948          0.972
  r                  0.961          0.947          0.971
