Chapter 1: Introduction

1.1. Background

HIV/AIDS is a leading health problem in the sub-Saharan African region. The need to formulate well-thought-out and effective measures to understand the dynamics of the HIV/AIDS epidemic cannot be emphasized enough.

Seroprevalence data are HIV data collected through blood surveys of expectant mothers visiting antenatal clinics throughout the Republic of South Africa. It is general knowledge that data collected from antenatal seroprevalence surveys tend to overestimate HIV prevalence, because the information is observed from one sector of the population, namely pregnant women. It is also known that women infected with HIV have lower pregnancy rates than uninfected women. Notwithstanding the shortcomings of the antenatal seroprevalence surveys, this tool still ranks highly as a reliable approach to estimating HIV prevalence amongst the entire adult population of a country.

In South Africa, the prevalence of HIV has been used for many years to gauge the spread of the HIV pandemic. The introduction of life-saving anti-retroviral drugs (ARVs) has increased the difficulty of interpreting the prevalence data due to changes in the survival period from infection to death. In that regard, the incidence of HIV infection (i.e. the rate at which new infections are acquired over a defined period of time) is a much more sensitive measure of the current state of the epidemic and of the impact of programs.

Mathematical and statistical models are imperative and essential in enhancing our understanding of the changes in the behavior of the HIV epidemic. On that basis, the aim of any mathematical and statistical modeling methodology is to extract as much useful knowledge as possible from a given database. A number of different models of HIV and AIDS have been developed, ranging from simple extrapolations of past curves to complex transmission models (UNAIDS, 2010).


1.2. Problem statement

The antenatal HIV seroprevalence data comprise the following demographic characteristics for each pregnant woman: age, partner's age, population group, level of education, gravidity, parity, marital status, province, region, HIV status and syphilis status. It is therefore clear that the seroprevalence database presents a wealth of information. As shown in the above modeling techniques and research surveys, very little work has been done to fully understand this vast amount of data. This research will attempt to answer questions such as: what does the antenatal HIV seroprevalence database tell us, and how can this database be used to improve the interventions conducted by the government to curb the spread of the HIV pandemic? This will therefore entail using relevant statistical techniques to fully understand the database (Sibanda & Pretorius 2011).

Central to this research will be the objective of understanding in detail the differential effects of demographic characteristics of pregnant women on their risk of acquiring HIV infection using unorthodox methodologies like design of experiments, artificial neural networks and binary logistic regression. Design of experiments is traditionally a structured, intensive methodology used for finding solutions to problems of an engineering nature. The technique enables the formulation of sound engineering solutions.

Neural networks consist of artificial neurons that process information. In most cases, a neural network is an adaptive system that changes its structure during a learning phase. In that regard, neural networks are used to model complex relationships between inputs and outputs to find a pattern in data. Neural networks have been applied to a wide range of applications such as character recognition, image compression and stock market prediction. This current research will therefore attempt to use neural networks in studying the antenatal HIV seroprevalence data.

Logistic regression is a statistical methodology for inferring the outcomes of a categorical dependent variable, based on one or more predictor variables. In this regard, the probabilities describing the possible outcome of a single event are modeled, as a function of explanatory variables, using a logistic function. Statistically, the categorical outcomes may be binary or ordinal, and the predictor variables may be continuous or categorical. In this research, this will involve modeling the presence or absence of HIV infection using demographic characteristics as predictor variables.
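As a hedged illustration of this kind of model, the sketch below fits a binary logistic regression of HIV status on demographic predictors with scikit-learn; the file name antenatal_survey.csv, the column names and the 0/1 coding of hiv_status are assumptions standing in for the actual antenatal dataset.

```python
# Minimal sketch: binary logistic regression of HIV status on demographic inputs.
# File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("antenatal_survey.csv")
X = pd.get_dummies(df[["age", "partner_age", "education", "gravidity",
                       "parity", "province"]], drop_first=True)
y = df["hiv_status"]          # assumed coding: 1 = HIV positive, 0 = HIV negative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predicted probabilities of infection and overall classification accuracy
probs = model.predict_proba(X_test)[:, 1]
print("test accuracy:", model.score(X_test, y_test))
```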

1.3. Aims and objectives

With the enormous amount of data presented by the annual South African antenatal HIV seroprevalence surveys, it is important to develop powerful techniques to study and understand the data in order to generate valuable knowledge for sound decision making. Descriptive and predictive data mining techniques that involve detailed data characterization, classification and outlier analysis will be used.

Central to this study will be the characterization of differential effects of demographic characteristics on the risk of acquiring HIV amongst pregnant women, using unorthodox statistical methods like design of experiments and artificial neural networks. To validate the modeling results, a binary logistic regression methodology will be used. Most epidemiologists prefer the binary logistic model to study epidemiological data, especially where binary categorical outcomes are involved. In addition, this research further investigated the usefulness of decision trees in understanding the effects of demographic characteristics on the risk of acquiring an HIV infection. The Tree node in SAS Enterprise Miner™ is part of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining tools. The tree represents a segmentation of the data created by applying a series of simple rules. Each rule is applied after another, resulting in a hierarchy of segments within segments that resembles a tree. In addition to nominal, binary and ordinal targets, the tree can be successfully used to predict outcomes for interval targets. It has been widely reported in the scientific literature that the advantage of decision trees over other modeling methodologies, such as neural networks, is that the technique produces a model that can easily be explained. The decision tree has the added advantage of being able to treat missing data as inputs.

Receiver operating characteristic (ROC) curves developed using SAS Enterprise Miner™ (SAS Institute Inc. 2012) were used to compare the classification accuracy of the different modeling methodologies. In general, ROC curves are drawn by varying the cut-off point that determines which event probabilities are considered to predict the event.

1.3.1. Specific Objectives of this project

1.3.1.1. Objective One

This study will attempt to utilize a screening design of experiments (DOE) technique to develop a ranked list of demographic factors, from important to unimportant, that affect the spread of HIV in the South African population (Sibanda & Pretorius 2011a).

1.3.1.2. Objective Two

This research step will explore the application of response surface methodology (RSM) to study the intricate relationships between antenatal data demographic characteristics and the risk of acquiring an HIV infection. RSM techniques allow for the estimation of interaction and quadratic effects (Sibanda & Pretorius 2012a).

1.3.1.3. Objective Three

The third objective will compare results from two response surface methodologies in determining the effect of demographic characteristics on the HIV status of antenatal clinic attendees. The two response surface methodologies to be studied will be the central composite face-centered and Box-Behnken designs. The purpose of this study will be to show that the results obtained in research objective two are not design-specific and thus can be reproduced using a different response surface model (Sibanda & Pretorius 2013).

1.3.1.4. Objective Four

The fourth objective of this research will attempt to validate the response surface methodology results through the use of a binary logistic regression model. This aspect of our research was brought about by recommendations from epidemiologists that the S-shape of the logistic function is most favored for the study of HIV risk amongst antenatal clinic attendees in South Africa. Furthermore, binary logistic regression models are the models of choice in the study of binary categorical data. This step is important, as design of experiment methodologies are not usually used for epidemiological modeling (Sibanda & Pretorius 2012b).

1.3.1.5. Objective Five

This aspect of our research will be focused on writing a scientific review on the application of artificial neural networks to study HIV/AIDS. Traditionally, neural networks have been applied to a broad range of fields such as data mining, engineering and biology. In recent years neural networks have found application in data mining projects for the purposes of prediction, classification, knowledge discovery, response modeling and time series analysis. In this work, an attempt will be made to highlight cutting-edge scientific research that used artificial neural networks to study HIV/AIDS. An attempt in this review will be to cast the spotlight on the latter research as it pertains to human behavior, diagnostic, vaccine and biomedical research (Sibanda & Pretorius 2012c).

1.3.1.6. Objective Six

Objective six of this research will attempt the novel application of multilayer perceptron (MLP) neural networks to further study the effect of demographic characteristics on the risk of acquiring an HIV infection amongst antenatal clinic attendees in South Africa (Sibanda & Pretorius 2011b).

1.3.1.7. Objective Seven

This part of our research will involve the application of receiver operating characteristic (ROC) curves to compare the classification accuracy of the modeling methodologies used in this project, namely the design of experiments, logistic regression, neural networks and decision trees. It is imperative to be able to use a scientifically sound technique to compare the performance of the different classifiers.

1.3.1.8. Objective Eight

To complete this study, a scorecard design was also employed to validate the results from the logistic regression, neural networks and decision trees. Scorecard design is generally a method used in the credit industry to score credit applicants. It is therefore a technique for assessing the relative risk of providing credit to an applicant. For the purposes of this research, a table will be developed comprising a set of demographic characteristics, where each characteristic will consist of various attributes, each assigned a number of points. The points will then be summed and compared to a decision threshold to determine the relative risk of each characteristic. The advantage of the scorecard is the ease with which the information can be interpreted. In addition, the risk factors and the corresponding bins are easy to interpret and are based on expert knowledge. The scorecards can also be made predictive by using logistic regression to combine the risk factors into a predictive scorecard (Viane et al., 2002).
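The sketch below illustrates the scorecard idea described above in Python; every bin boundary, point value and the decision threshold are hypothetical placeholders rather than values derived from the antenatal data.

```python
# Hedged sketch of a points-based scorecard: each attribute of each characteristic
# carries a number of points, the points are summed, and the total is compared
# with a decision threshold. All bins, points and the threshold are hypothetical.
SCORECARD = {
    "age":       [(0, 20, 10), (20, 30, 25), (30, 120, 40)],   # (low, high, points)
    "education": [(0, 8, 15), (8, 12, 25), (12, 20, 35)],
    "parity":    [(0, 1, 30), (1, 3, 20), (3, 20, 10)],
}
THRESHOLD = 75   # hypothetical decision cut-off

def score(record: dict) -> int:
    """Sum the points of the bin that each characteristic value falls into."""
    total = 0
    for name, bins in SCORECARD.items():
        value = record[name]
        for low, high, points in bins:
            if low <= value < high:
                total += points
                break
    return total

applicant = {"age": 24, "education": 11, "parity": 1}
total = score(applicant)
print(total, "high risk" if total < THRESHOLD else "low risk")
```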


1.4. Design of Experiments

1.4.1. Introduction

Design of experiments was invented at Rothamsted Experimental Station in the 1920s. Although the experimental design method was first used in an agricultural context, the method has found successful applications in the military, commerce and other industries.

The fundamental principles in design of experiments are solutions to the problems in experimentation posed by the two types of nuisance factors and serve to improve the efficiency of experiments. Those fundamental principles are randomization, replication, blocking, orthogonality and factorial experimentation.

Randomization is a method that protects against an unknown bias distorting the results of the experiment.

Orthogonality in an experiment results in the factor effects being uncorrelated and therefore more easily interpreted. The factors in an orthogonal experimental design are varied independently of each other. Factorial experimentation is a method in which the effects due to each factor and to combinations of factors are estimated. Factorial designs are geometrically constructed and vary all the factors simultaneously and orthogonally.

The main uses of design of experiments are:
- screening many factors,
- discovering interactions among factors,
- optimizing a process.

1.4.2. Selection of Design of Experiment (DOE)

The choice of a DOE is dependent on the aims of the investigation and the number of variables involved.

1.4.3. Experimental Design Objectives

1.4.3.1. Comparative objective

This approach is tailor-made for an experiment characterized by multiple variables, with the sole purpose of inferring the importance of one variable in the presence of other variables. The overriding objective is to ascertain whether a given variable is important or not. The randomized block design is a typical example of a comparative design.

1.4.3.2. Screening objective

The aim of this approach is to separate the cardinal effects from the large number of insignificant or unimportant effects. The designs are also called "main effects designs". Typical examples of screening designs are the Plackett-Burman, full factorial and fractional factorial designs.

1.4.3.2.1. Plackett-Burman designs

These designs were developed by R.L. Plackett and J.P. Burman in 1946. The goal of these experiments is to determine the dependence of a response on a number of independent variables. In these designs, interactions between factors are considered negligible.

1.4.3.2.2. Factorial Designs

A full factorial design is an experiment that is made up of two or more variables and enables the investigation of the effects of the inputs on a response, as well as facilitating an understanding of the interaction effects between inputs on a selected response.
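As a minimal sketch of a two-level full factorial design, the snippet below enumerates all 2^3 runs for three hypothetical factors in coded units, illustrating how every factor is varied independently (and hence orthogonally) of the others.

```python
# Sketch of a two-level full factorial design in coded units (-1, +1) for
# three hypothetical factors. Every combination of levels appears exactly once.
from itertools import product

factors = ["age", "education", "parity"]      # hypothetical factor names
runs = list(product([-1, +1], repeat=len(factors)))

for i, run in enumerate(runs, start=1):       # 2**3 = 8 runs
    print(i, dict(zip(factors, run)))
```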

1.4.3.3. Response Surface objective

These experiments are developed to investigate the possible synergistic interplay between variables. This provides an insight into the local curvature of the response surface under investigation. Typical examples of response surface methodologies are the central composite and Box-Behnken designs.

1.4.3.3.1. Central Composite designs (CCDs)

These designs are also called Box-Wilson CCDs and comprise a factorial design augmented with centre points. In addition, these designs possess star (axial) points to facilitate the investigation of the curvature of the response surface. There are fundamentally three types of CCDs.

1.4.3.3.2. Central Composite Circumscribed (CCC)

These designs are characterized by circular and spherical symmetry. The models require five levels for each input.

1.4.3.3.3. Central Composite Inscribed (CCI)

The main characteristic of these designs is that the specified factor limits are true limits. Like the CCC designs, these designs also require five levels for each input.

1.4.3.3.4. Central Composite Face-centered (CCF)

For the CCF designs, the star points are positioned at the centre of each face of the factorial space. In contrast to the other central composite designs, these models require only three levels for each factor.

1.4.3.3.5. Box Behnken Designs

These independent quadratic designs are not characterized by an in-built factorial design. Box-Behnken designs are rotatable and require three levels of each factor, and have limited capability for orthogonal blocking compared to the central composite designs.

1.4.4. Regression Model

This is the use of a regression technique to model a response as a mathematical function of the factors.

1.5. Artificial Neural Networks

1.5.1. Introduction

The initial research on perceptrons was conducted by Frank Rosenblatt in 1958. The early perceptrons were comprised of three layers, namely the input layer; the middle layer, whose function was to combine inputs with weights via a threshold function; and finally the output layer.

1.5.2. Classification of neural networks (NN)

Functionally, neural networks are classified into two broad categories namely feed-forward and recurrent networks as shown in Fig. 2.1. This classification is based on the training regime of the NN. Examples of feed-forward networks are single-layer, multi-layer and radial basis neural networks. Typical examples of recurrent NNs are competitive and Hopfield networks.

Fig. 2.1: Classification of neural network architectures

Multilayer neural networks have found increasing application in numerous scientific research areas. In some instances neural networks have been found to be as robust as traditional statistical techniques. However, unlike traditional statistical methods, the multilayer perceptron does not make prior assumptions with regard to the data distribution. The advantages of neural networks include their ability to model highly non-linear functions, as well as being able to accurately generalize on previously unseen data.

1.5.3. The Multilayer Perceptron (MLP) Model

The MLP is made up of a network of interconnected neurons (Fig. 2.2). In turn, the neurons are connected by weights, and an output signal is generated from the sum of the inputs to each neuron.

Fig. 2.2: A schematic representation of an MLP with three layers

The most widely used activation function for the multilayer perceptron is the “logistic function” (Fig. 2.3).

$P(t) = \dfrac{1}{1 + e^{-t}}$   (2.1)

where the variable P stands for the population, e is Euler's number and the variable t is the time.

Fig. 2.3: Logistic transfer function

The outcome produced by a neuron is multiplied by the respective weight and fed forward to become input to the neurons in the next layer of the network. That is why MLPs are referred to as feed-forward NNs. There are many variations of the multilayer perceptron available, mostly characterised by the number of layers within the network.

Research has shown that an appropriate choice of connecting weights and transfer functions is important. Multilayer perceptrons are taught through training and for training to occur there is a need to generate a training dataset. There are two types of training techniques for the multi-layer perceptrons, namely supervised and unsupervised training.

1.5.4. Types of Neural Network Training

1.5.4.1. Supervised Training

In this type of training, MLPs are supplied with a dataset as well as the expected output for each record in the dataset. This is the most widely used training regime for neural networks. The MLP is allowed to undergo a series of epochs until the resulting output closely matches the expected output, with a very low rate of error.

1.5.4.2. Unsupervised Training

On the other hand, in this type of training the MLP is not provided with expected output. This type of training is mostly used in situations where the NN is designed to place inputs into several groups. Just like supervised training, the training entails numerous iterations. The classification groups are revealed as the training of the neural network progresses.

1.5.5. Training a Multilayer Perceptron using the Back-propagation algorithm

As already discussed above, the training of a multilayer perceptron involves the modification and adjustment of weights. Progressively changing the weights and plotting the corresponding changes in error generates an error plot (Fig. 2.4). The central aim of training NNs is to obtain the optimal combination of weights resulting in the least error, and the back-propagation training algorithm uses the gradient descent technique to obtain the least possible error. However, this is not always possible.

Fig. 2.4: A three-dimensional error plot

1.5.6. The back-propagation algorithm

There are fundamentally two implementation methodologies for the back-propagation algorithm. The first one is so-called on-line training, characterized by the modification of the network weights following the presentation of each pattern. The other method is the batch training approach, which involves the summation of the errors over all patterns. In practice, numerous training iterations (sometimes thousands) are needed before an acceptable level of error is attained. As a rule of thumb, training should ideally be terminated when the neural network achieves maximum performance on independent test data. However, that might not be the minimum network error.

Fig. 2.5: The back-propagation algorithm

1.5.7. Validating Neural Network Training

Validating neural networks is important because it allows one to determine whether more training is required. In order to validate a NN, a validation dataset is required.

1.5.8. Determining the Number of Hidden Layers

It has been shown that multilayer perceptrons with only one hidden layer are universal approximators (Hornik et al., 1989). More hidden layers can make the problem easier or harder.

The back-propagation training steps depicted in Fig. 2.5 are as follows:
A. Initialization of the network weights
B. Presentation of an input from the training data to the network
C. Propagation of the input vector through the network
D. Calculation of the error signal by comparing the actual output to the desired output
E. Propagation of the error back through the network
F. Adjustment of the weights to minimize the error
G. Repetition of steps B to F until the error is satisfactory
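The following numpy sketch walks through those steps on the classic XOR problem; the data, layer sizes, learning rate and number of epochs are illustrative choices, not settings used in this thesis.

```python
# Minimal numpy sketch of the steps above: initialise weights, propagate an input
# forward, compare the output with the target, propagate the error back, and
# adjust the weights by gradient descent. XOR data and layer sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A. weight initialisation (2 inputs -> 4 hidden units -> 1 output)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(10000):
    # B/C. forward propagation through the network
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # D. error signal: difference between actual and desired output
    err = out - y

    # E. back-propagate the error (gradients of the squared error)
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # F. adjust the weights down the gradient
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(np.round(out, 2))   # typically approaches [0, 1, 1, 0] after training
```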

Table 2.2: Number of hidden layers

Number of hidden layers    Result
None                       Can only represent linear functions
1                          Can approximate any function
2                          Can represent an arbitrary decision boundary with an appropriate activation function

1.5.9. Number of Neurons in the Hidden Layer

The determination of the number of neurons within a hidden layer is paramount in deciding the final NN architecture. Even though the hidden layers are not directly connected to the external environment, they still greatly influence the final outcome. It is therefore very important to carefully select the number of hidden layers and the number of neurons for the hidden layers. Too few hidden neurons result in under-fitting, whilst too many result in over-fitting.

1.5.10. Activation Functions

The vast majority of neural networks use activation functions, which map the output of a neuron onto an appropriate range. Examples of activation functions include the sigmoid function, the hyperbolic tangent and the linear function.

1.5.10.1. Sigmoid Activation Function

In general, sigmoid activation functions utilize a sigmoid function to attain the desired activation. The sigmoid function is defined as shown in equation 2.2:

$f(x) = \dfrac{1}{1 + e^{-x}}$   (2.2)

A sigmoid curve is S-shaped.

Fig. 2.6: The Sigmoid Function

It is important to note that the sigmoid activation function only returns positive values. Therefore, using the sigmoid function, the neural network will not return negative numbers.

1.5.10.2. Hyperbolic Tangent Activation Function

Unlike the sigmoid activation function, the hyperbolic tangent function does return values less than zero, i.e. it can produce negative numbers. The equation of the hyperbolic tangent activation function (tanh function) is shown below:

$f(x) = \tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$   (2.3)

1.5.10.3. Linear Activation Function

This function is, in reality, not a true activation function, and it is probably the least commonly used. It does not modify a pattern before passing it on. The equation for the linear activation function is

f (x) = x (2.4)

This activation function might be useful in applications where the purpose is to output the whole range of numbers.

Fig.2.8: The linear activation function
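The short sketch below evaluates the three activation functions discussed in this section at a few sample points, confirming that the sigmoid is strictly positive, the hyperbolic tangent can return negative values, and the linear function passes its input through unchanged.

```python
# Sketch of the three activation functions: sigmoid maps into (0, 1),
# hyperbolic tangent into (-1, 1), and the linear function is the identity.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def linear(x):
    return x

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # approx [0.119 0.5   0.881] -> strictly positive
print(tanh(z))      # approx [-0.964 0.    0.964] -> negative values possible
print(linear(z))    # [-2. 0. 2.] -> unchanged
```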

1.6. Logistic Regression

1.6.1. Introduction

1.6.1.1. Binary Data

For each observation i, the response $Y_i$ can take only two values, coded 0 and 1. For this research the coded values stand for HIV positive (1) and HIV negative (0). Assuming $p_i$ is the success probability for observation i, $Y_i$ has a Bernoulli distribution.

1.6.1.2. Binomial Data

Each observation i is a count of $r_i$ successes out of $n_i$ trials. Assuming $p_i$ is the success probability for observation i, $r_i$ has a binomial distribution, $r_i \sim B(n_i, p_i)$. However, a binomial distribution with $n_i = 1$ is a Bernoulli distribution.

1.6.2. Models for Binomial and Binary Data

An important approach for analysing binary response data is the use of a statistical model to describe the relationship inherent between the response and the input variables. This approach is equally applicable to data from experimental studies, where individual experimental units have been randomized to a number of treatment groups, and to observational studies, where individuals have been sampled from some conceptual population by random sampling.

1.6.3. Statistical Modelling

At the centre of any modelling exercise is the need to develop a mathematical representation of the inherent relationship between a response variable and a number of input variables.

1.6.4. Uses of Statistical Modelling

(i) To investigate the possible relationship between a given response and a number of variables;
(ii) To study the pattern of any relationship between a particular response and the variables;
(iii) Modelling may motivate the study of underlying reasons for any model structure;
(iv) To estimate in what way the response would change if certain explanatory variables change.

1.6.5. Methods of Estimation

The process of fitting a model to a dataset involves the determination of the unknown parameters in the model. The two widely used methods of fitting linear models are the least squares and maximum likelihood approaches.

1.6.5.1. The Method of Least Squares

There are two reasons for the use of the method of least squares, namely;

(i) minimization of the difference between observations and their expected values,

(ii) the parameter estimates and derived quantities, such as fitted values, have a number of optimality properties, such as being unbiased and having minimum variance when compared with all other unbiased linear estimators. In addition, if the data are assumed to have a normal distribution, the residual sum of squares on fitting a linear model has a chi-square distribution. This is the basis for the use of F-tests to examine the significance of a regression or to compare two models.

1.6.5.2. The method of maximum likelihood

While the method of least squares is usually adopted in fitting linear regression models, the maximum likelihood method is most frequently used for fitting logistic regression models. This method is based on the construction of the likelihood of the unknown parameters in the model, given the sample data.

1.6.6. Transformation of Binomial Response Data

This involves the transformation of the probability scale from the range (0,1) to (-∞,+∞). Other transformations include;

1.6.6.1. The Logistic Transformations

The logistic transformation of p is $\log\left\{\dfrac{p}{1-p}\right\}$, which is written as logit(p). The ratio $\dfrac{p}{1-p}$ is the odds of a success, and so the logistic transformation of p is the log odds of a success. Plotted against p, logit(p) is a sigmoid-shaped curve that is symmetric about p = 0.5.
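As a small worked example of the transformation (with an arbitrarily chosen probability), for p = 0.8:

```latex
\text{odds} = \frac{p}{1-p} = \frac{0.8}{0.2} = 4,
\qquad
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \ln 4 \approx 1.386 .
```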

1.6.6.2. The Probit Transformation

The probit function is symmetric in p, and for any value of p in the range (0,1), the correspond-ing value of the probit of p will lie between -∞ and +∞. When p=0.5, probit p=0. The probit transformation of p has the same general form as the logistic transformation.

1.6.6.3. Advantages of the Logit over the Probit Transformation

There are three reasons why the logit transformation is preferred to the probit transformation:
1. It has a direct interpretation in terms of the logarithm of the odds of a success. This interpretation is particularly important in the analysis of data from epidemiological studies;
2. Models based on the logistic transformation are particularly appropriate for the analysis of data that have been collected retrospectively;
3. Binary data can be summarized in terms of quantities called sufficient statistics when the logistic transformation is used.

It is for the above reasons that the logistic transformation is going to be used in this study.

1.6.6.4. Goodness-of-fit of a logistic regression

Following the successful fitting of the model to a given dataset, the next step is to compare the accuracy of the predicted values to the observed values; if there is good agreement, then the model is considered to be acceptable. The measure of model adequacy is termed goodness-of-fit, which is described in terms of the deviance, Pearson's chi-square statistic, the Hosmer-Lemeshow statistic and analogues of the R2 statistic.

1.6.6.4.1. Deviance Statistic

The D-statistic, often called the deviance, measures the extent to which a current model (with likelihood $L_c$) deviates from a full model (with likelihood $L_f$). The full model is not useful in its own right, since it does not provide a simpler summary of the data than the individual observations themselves. However, by comparing $L_c$ with $L_f$, the extent to which the current model adequately represents the data can be judged.

To compare $L_c$ and $L_f$, it is convenient to use minus twice the logarithm of the ratio of these likelihoods,

$D = -2 \ln\left(L_c / L_f\right)$   (2.5)

where large values of D will be encountered when $L_c$ is small relative to $L_f$, indicating that the current model is poor.

1.6.6.4.2. Pearson's chi-square statistic

One of the most popular alternatives to the deviance is Pearson's chi-square statistic, defined by

$X^2 = \sum_{i=1}^{n} \dfrac{(O_i - E_i)^2}{E_i}$   (2.6)

where
$X^2$ = Pearson's cumulative test statistic, which asymptotically approaches a $\chi^2$ distribution;
$O_i$ = an observed frequency;
$E_i$ = an expected (theoretical) frequency, asserted by the null hypothesis;
$n$ = the number of cells in the table.

The deviance and Pearson's chi-square statistic have the same asymptotic chi-square distribution when the model is fitted correctly. The numerical values of the two statistics will generally differ, but the difference will seldom be of practical importance. Since the maximum likelihood estimates of the success probabilities maximize the likelihood function for the current model, the deviance is the goodness-of-fit statistic that is minimized by these estimates. On that basis, it is more appropriate to use the deviance than the Pearson chi-square statistic as a measure of goodness-of-fit when fitting linear logistic models.
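The sketch below computes both statistics for grouped binomial data; the group sizes, observed counts and fitted probabilities are hypothetical numbers chosen only to make the calculation concrete.

```python
# Hedged sketch: deviance and Pearson chi-square for grouped binomial data,
# given hypothetical observed successes r out of n trials per group and
# success probabilities p_hat fitted by the current model.
import numpy as np

r     = np.array([12,  30,  45,  70])       # observed HIV-positive counts (hypothetical)
n     = np.array([100, 100, 100, 100])      # group sizes (hypothetical)
p_hat = np.array([0.10, 0.28, 0.50, 0.68])  # fitted probabilities (hypothetical)

fitted = n * p_hat
# deviance: -2 log(Lc / Lf) for the binomial likelihood
deviance = 2 * np.sum(r * np.log(r / fitted)
                      + (n - r) * np.log((n - r) / (n - fitted)))
# Pearson chi-square, summing (O - E)^2 / E over the success and failure cells
pearson = np.sum((r - fitted) ** 2 / (fitted * (1 - p_hat)))

print(round(deviance, 3), round(pearson, 3))
```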

1.6.6.4.3. The Hosmer-Lemeshow statistic

In contrast to the deviance, the Hosmer-Lemeshow statistic is a measure of the goodness-of-fit of a model that can be used in modelling ungrouped binary data. Indeed, if the data are recorded in a grouped form, they must be ungrouped before this statistic can be evaluated.

1.6.7. Strategy for Model Selection

Ideally, the process of modelling should lead to the identification of the input factors to be included in the final statistical model for a given binary response dataset. The model selection strategy depends on the underlying purpose of a study. In this current study the aim is to determine which of the many demographic characteristics have a significant effect on the risk of acquiring an HIV infection amongst pregnant women in South Africa. In a nutshell, therefore, the central aim of any modelling exercise is to evaluate the dependence of the response probability on the variables of interest. When the number of potential explanatory variables, including interactions, non-linear terms and so on, is not too large, it might be feasible to fit all possible combinations of terms, paying due regard to the hierarchical principle. Models that are not hierarchical are difficult to interpret.

1.6.8. Model Checking

This involves verifying whether the model fitted to a given dataset is appropriate and accurate. Indeed, a thorough examination of the extent to which the fitted model provides an appropriate description of the observed data is a vital aspect of the modelling process. Measures of model checking include residual, outlier and influential observation analysis.

1.6.8.1. Residuals

The measure of agreement between an observation on a response variable and the respective fitted value is termed the residual. Therefore, residuals are a measure of the adequacy of a fitted model.

1.6.8.2. Outliers

Observations that are surprisingly distant from the remaining observations in the sample are termed outliers. Such values may be the result of measurement error, i.e. error in reading, calculating, or recording a numerical value; they may be due to an execution error or to an extreme manifestation of natural variability.

1.6.8.3. Influential observations

A given observation is considered to be influential if its omission from a dataset results in disproportionate changes to the model under review. Although outliers may also be influential observations, an influential observation need not necessarily be an outlier.

1.7. Comparison of Models using ROC Curve

In general, a binary classification technique aims to categorise events into two broad classes, namely true and false. This in turn leads to four possible classifications for each event: a true positive, a true negative, a false positive, or a false negative. This scenario is generally summarized in a confusion matrix (Fig. 2.10).

A confusion matrix can be used to calculate various model performance measures, as shown in equations 2.7, 2.8 and 2.9:

$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$   (2.7)

$\text{Precision} = \dfrac{TP}{TP + FP}$   (2.8)

$\text{Recall} = \dfrac{TP}{TP + FN}$   (2.9)

                          Predicted
Observed                  Positive        Negative
Positive                  3361 (TP)       1294 (FP)
Negative                  375 (FN)        2370 (TN)

Fig. 2.10: Format of a Confusion Matrix

Based on Fig. 2.10, Accuracy = 0.77, Precision = 0.72 and Recall = 0.90.

                          Predicted
Observed                  Positive        Negative
Positive                  3361 (TP)       101 (FP)
Negative                  375 (FN)        198 (TN)

Figure 2.11: Effect of changes in false positives and true negatives on the measures of accuracy

Based on Fig. 2.11, Accuracy = 0.88, Precision = 0.97 and Recall = 0.90.
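A small sketch that reproduces the quoted figures directly from the confusion-matrix counts, using equations 2.7 to 2.9:

```python
# Sketch reproducing the accuracy, precision and recall quoted for Fig. 2.10
# and Fig. 2.11 from their confusion-matrix counts (equations 2.7-2.9).
def metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    return accuracy, precision, recall

print([round(m, 2) for m in metrics(tp=3361, fp=1294, fn=375, tn=2370)])
# -> [0.77, 0.72, 0.9]   (Fig. 2.10)
print([round(m, 2) for m in metrics(tp=3361, fp=101, fn=375, tn=198)])
# -> [0.88, 0.97, 0.9]   (Fig. 2.11)
```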

1.7.1. The Basics of ROC Curves

Receiver operating characteristic (ROC) curves are graphs used to indicate the performance of a model over different threshold levels. These graphs were initially developed to determine the best operating points for signal processing apparatus. ROC graphs are drawn by plotting the true positive rate against the false positive rate. Fig. 2.12 shows the various regions covered by the ROC curve.

Fig. 2.12: Different regions of the ROC curve (perfect, liberal, conservative, random and worse-than-random performance), plotted as true positive rate against false positive rate

1.7.2. Methods of model evaluation

The central aim of any modelling technique is to improve predictive accuracy. In the study of risk, a small improvement in predictive capability can lead to a substantial increase in benefit. The important question for an analyst is whether a given model has predictive superiority over another. It is imperative for researchers who utilize predictive models for binary classification to understand the circumstances under which each evaluation method is most appropriate.

(i) Global classification rate

Table 2.3: Global classification

                          True HIV negative    True HIV positive    Total
Predicted HIV negative    x                    m                    x + m
Predicted HIV positive    y                    n                    y + n
Total                     x + y                m + n                (x + m) + (y + n)

The above model might have a global percentage classification rate for HIV negative of

$\text{Global classification rate (HIV negative)} = \dfrac{x}{(x + m) + (y + n)} \times 100$


The global classification rate is ideal provided the underlying "costs associated with each error are known or presumed to be the same". In this regard, the model with the highest classification rate would be chosen.

(ii) Kolmogorov-Smirnov statistic (K-S test)

This is one of the methods used for evaluating predictive binary classification models, and measures the distance between the distribution functions of two classifications. The predictive model generating the largest separability is considered to be superior. A graphical example of a K-S test is shown in Fig. 2.13.

Fig. 2.13: K-S test (the greatest separation of the two score distributions occurs at a score cut-off of approximately 0.7)

The disadvantages of the K-S test include the fact that this methodology assumes that the inherent costs of misclassification errors are equal.
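The sketch below illustrates the K-S idea with scipy; the two score distributions are randomly generated stand-ins for the model scores of the HIV negative and HIV positive groups, not real survey output.

```python
# Hedged sketch of the K-S statistic: the largest vertical distance between the
# cumulative score distributions of the two classes. Scores are simulated stand-ins.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_negative = rng.beta(2, 5, size=1000)   # hypothetical scores, HIV negative
scores_positive = rng.beta(5, 2, size=1000)   # hypothetical scores, HIV positive

ks = ks_2samp(scores_negative, scores_positive)
print(round(ks.statistic, 3))    # maximum separation of the two distributions
```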

(iii) Individual Misclassification

In reality, however, the costs of certain misclassifications are greater than others. A thorough understanding of the situation at hand is required in order to rank the costs of misclassifications. For this current research, the greater mistake might be a false negative, in which a pregnant woman is told that she is uninfected with HIV, resulting in the individual not being enrolled for life-saving anti-retroviral treatment (ARVs). On the other hand, a false positive verdict causes unnecessary emotional distress, as the individual is put on ARVs.

(iv) The Receiver Operating Characteristic (ROC) Curve

The ROC curve plots the sensitivity of a model on the vertical axis against 1 − specificity on the horizontal axis. The area under the ROC curve (AUROC) allows for the comparison of different binary classification models. The technique is ideal in situations where there is a paucity of information on the costs of wrongly classifying events. The AUROC measure is equivalent to the Gini index, the c-statistic and the metric θ (Thomas et al., 2002).

Fig. 2.14: ROC curve illustration, plotting sensitivity against 1 − specificity (θ = area under the curve)

(v) Area under the ROC curve (AUROC)

This statistic is also used for model validation, with an area value of 0.5 suggesting a random model with very minimal discriminative advantage. On the other hand, an area value of 1.0 suggests a perfect model.
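The following sketch compares two hypothetical classifiers by their AUROC values using scikit-learn; the labels and scores are simulated, so the printed areas only illustrate that a stronger score separates the classes better.

```python
# Sketch of an AUROC comparison: the model with the larger area under the ROC
# curve discriminates better. Labels and scores are simulated stand-ins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true   = rng.integers(0, 2, size=500)                   # 1 = HIV positive (simulated)
scores_a = y_true * 0.6 + rng.normal(0, 0.35, size=500)   # stronger classifier
scores_b = y_true * 0.2 + rng.normal(0, 0.35, size=500)   # weaker classifier

print(round(roc_auc_score(y_true, scores_a), 3))   # closer to 1.0 (good separation)
print(round(roc_auc_score(y_true, scores_b), 3))   # closer to 0.5 (near random)
```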

1.7.3. Choosing the Right Model

The SAS Enterprise Miner™ (SAS Inc. 2002) programme can be successfully used to generate a number of model types, including scorecards, decision trees, logistic regression and neural networks. Some of the considerations for selecting the best model include ease of application, understanding and justification. The researcher should also consider the predictive performance of the model when selecting the best model.

1.7.3.1. Scorecards

The scorecard model is one of the traditional forms of scoring models. The scorecard is made up of a table containing characteristics with their corresponding attributes. Points are allocated for each attribute and the points vary depending on whether or not the attribute is high or low risk. More points are granted to attributes that are low risk. The overall score is considered relative to a stipulated threshold number of points.

1.7.3.2. Decision Trees

It is generally believed that a decision tree has the capability to outperform a scorecard model with regard to its ability to accurately predict outcomes. This belief is based on the fact that decision trees are able to analyse interactions between attributes. In that regard, the decision tree does add value to the understanding of the risk levels of different attributes.

1.7.3.3. Neural Networks

In general, neural networks present better accuracy of prediction compared to scorecards and decision trees. The disadvantages of the neural networks are that they are black boxes, and present difficulty in attempting to explain and justify the decisions they arrive at.

1.7.4. Development of a Scorecard

(i) Development of Sample

The input dataset comprised HIV positive and HIV negative individuals. The Data Partition node in SAS Enterprise Miner (SAS Inc.) divided the dataset into 50% training, 25% validation and 25% test data. Models will be compared based on the validation data.

(ii) Classing

This is a procedure that involves placing input variables into bins. Points are assigned to individual attributes on the basis of their relative risk. The relative risk of an attribute is determined by the attribute's weight-of-evidence (WOE). On the other hand, the significance of the characteristic is determined by its coefficient in a logistic regression.

The classing process determines how many points an attribute is worth relative to the other attributes of the same characteristic. After classing has defined the attributes of a characteristic, the characteristic's predictive power (i.e. its ability to separate high risks from low risks) can be assessed with the Information Value (IV) measure. This will aid in the selection of attributes to be included in the scorecard. The IV is the weighted sum of the WOE values of the characteristic's attributes, where the sum is weighted by the difference between the proportions of HIV negative and HIV positive individuals in the respective attribute.

Following the identification of the relative risks of the attributes within a given demographic characteristic, a logistic regression is used to measure the demographic characteristics against each other. A number of selection methods, such as forward, backward and stepwise selection, can be used in the Scorecard node to eliminate the insignificant demographic characteristics.

$\text{WOE}_i = \ln\left(\dfrac{\text{Distribution of HIV Negative}_i}{\text{Distribution of HIV Positive}_i}\right)$   for each group (attribute) i of a characteristic

$\text{IV} = \sum_{i=1}^{L}\left(\text{Distr HIV Negative}_i - \text{Distr HIV Positive}_i\right) \ln\left(\dfrac{\text{Distr HIV Negative}_i}{\text{Distr HIV Positive}_i}\right)$

where L is the number of attributes (levels) of the characteristic.
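A brief numerical sketch of these two formulas for a single hypothetical characteristic binned into three attributes; the counts are invented for illustration.

```python
# Sketch of the weight-of-evidence and information-value calculations for one
# hypothetical characteristic (e.g. women's age) binned into three attributes.
import numpy as np

# counts of HIV-negative / HIV-positive women per attribute (hypothetical)
neg = np.array([400, 700, 900])
pos = np.array([150, 450, 300])

dist_neg = neg / neg.sum()          # distribution of HIV negative across bins
dist_pos = pos / pos.sum()          # distribution of HIV positive across bins

woe = np.log(dist_neg / dist_pos)             # one WOE value per attribute
iv  = np.sum((dist_neg - dist_pos) * woe)     # information value of the characteristic

print(np.round(woe, 3), round(iv, 3))
```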

Table 2.4: Example of a scorecard

Characteristic           Attribute (actual level)    Coded level    Score points
Women's age (years)      <20                         -1             -
                         21-29                        0             -
                         >30                          1             -
Partner's age (years)    <24                         -1             -
                         25-33                        0             -
                         >34                          1             -
Education (grades)       <8                          -1             -
                         9-11                         0             -
                         12-13                        1             -
Parity                   0                           -1             -
                         1                            0             -
                         >2                           1             -

(iii) Logistic Regression

Following the determination of the relative risks for the attributes, a logistic regression is used to calculate the regression coefficients, which in turn are multiplied by the WOE values of the attributes to form the basis for the score points in the scorecard. Table 2.6 shows an example of a scorecard.

(iv) Scorepoints scaling

The scaling of the scorecard points facilitates the attainment of score points that are easy to interpret.

Score points = Weight of Evidence * Regression Coefficient

(v) Scorecard Assessment

SAS Enterprise Miner™ provides various charts that are used to assess the quality of the scorecard.

(a) Scorecard distribution chart: this shows which scores are most frequent and provides insight into whether or not the distribution is normal and whether there are outliers present.
(b) Kolmogorov-Smirnov (KS) statistic
(c) Gini coefficient

(d) Area under the ROC curve (AUROC)

The KS statistic, Gini coefficient and AUROC are used to measure the discriminatory power of the scorecard.

(vi) Model comparison

This involved comparing the predictive accuracy of the neural networks, logistic regression and decision trees using the Model Comparison node in SAS Enterprise Miner™ (SAS Inc. 2012). The AUROC statistic was used for model comparison and the results were validated using the K-S and Gini statistics.


2.1. Introduction

In this chapter, the aspects of the experimental methods, planning and design, as well as the tools and procedures for the analysis, are presented and motivated. Some additional details of the different experimental methodologies are explained in chapter 4 in the context of the experimental results.

2.2. Research Outline

Fig. 3.1: Research Study Plan. The plan comprises the following steps: A. Data Exploration & Classification; B. Screening Design; C. Response Surface Design (Central Composite Function); D. Comparative Study of Two Response Surface Methodologies (RSMs); E. Comparative Study of RSM with Binary Logistic Regression; G. Application of a multilayer perceptron to model demographic characteristics; H. A review of the application of neural networks in modeling HIV/AIDS; I. Model Assessment with ROC Curves: Validation using a Scorecard design; J. Development and validation of an HIV risk scorecard; comparison of all models with a full regression (additional research that was not included in the initial research proposal, added to add value to the research project).

2.2.1. Step One: Data Exploration and Classification

As explained in chapter 2, the methodology of classification will enable the summarization of voluminous and complex datasets, facilitate the detection of relationships and structure within the dataset, allow for more efficient organization and retrieval of information, allow investigators to make predictions or discover hypotheses to account for the structure in the data, and facilitate the formulation of general hypotheses to account for the observed data.

2.2.2. Step Two: Screening Design

This step is undertaken when the experiment has a large number of input variables that have the capacity to influence the response. It is aimed at reducing the number of variables to include only the significant ones.

In this current research project, a screening design is going to be used to rank the importance of demographic characteristics in influencing the risk of acquiring HIV infection. As stated in the introduction to this thesis, each pregnant woman attending an antenatal clinic in South Africa is described using various demographic characteristics, such as population group, level of education, age, partner's age, parity, gravidity etc. In the literature to date, no recorded work has been conducted in attempting to understand whether these demographic characteristics predispose an individual to acquiring HIV.

In other words, this work is geared towards ascertaining whether or not there is a link between demographic characteristics and the risk of acquiring HIV and, if so, applying a screening design to rank the differential effects of these characteristics on the risk of acquiring the HIV infection. However, the screening design has the disadvantage of not being able to effectively characterize possible interactions between demographic characteristics (Sibanda & Pretorius 2011).

2.2.3. Step Three: Response Surface Methodology

As already indicated in the screening objective above, the easiest way of estimating a first-degree polynomial is to use a factorial design, and this technique is sufficient to detect the main effects. However, if interactions between the explanatory variables are suspected, then a more complicated design, such as a response surface methodology, needs to be implemented to estimate a second-degree polynomial model. In this study, a second-order polynomial model, fitted using a central composite face-centered design, was used to estimate the model coefficients of the four selected factors believed to influence the risk of acquiring HIV infection (Sibanda & Pretorius 2012).
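As a hedged sketch of what fitting such a second-order model involves, the snippet below fits a quadratic surface with interaction terms to a small coded face-centred design; the design points and response values are hypothetical, and the least-squares fit via scikit-learn stands in for the dedicated RSM software used in this research.

```python
# Hedged sketch: fitting a second-order (quadratic) response surface with
# interaction terms to a coded design matrix. Design points and responses
# are hypothetical illustrations only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# coded factor settings of a small face-centred design for two factors
X = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1],
              [-1, 0], [1, 0], [0, -1], [0, 1], [0, 0]], dtype=float)
y = np.array([0.22, 0.31, 0.27, 0.40, 0.24, 0.35, 0.26, 0.33, 0.30])  # hypothetical responses

quad = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression().fit(quad.fit_transform(X), y)

print(quad.get_feature_names_out(["x1", "x2"]))   # linear, interaction and quadratic terms
print(np.round(model.coef_, 3), round(model.intercept_, 3))
```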

2.2.4. Step Four: Comparison of two response surface methodologies

This step of the research will be conducted to compare results from two response surface methodologies. This is important, as it confirms whether the results obtained are design-specific and provides a measure of repeatability by using a different RSM technique. A central composite design, as shown in step three, is the most common response surface design and is built on a factorial design; it requires five levels for each factor. On the other hand, Box-Behnken designs use the midpoints of the cube edges instead of the corner points, which results in fewer runs, but, unlike the central composite design, all the runs must be done even if there is no curvature. Furthermore, the Box-Behnken design uses only three factor levels and should be used when the screening experiment indicated curvature to be significant (Sibanda & Pretorius 2013).

2.2.5. Step Five: Comparison of response surface methodology and binary logistic regression results

This step of the research will be conducted to compare the results of modeling the effects of demographic characteristics on the risk of acquiring HIV using a Box-Behnken design and a binary logistic regression. Logistic regression is used in epidemiology to study the relationships between a disease in two modalities (diseased or disease-free) and risk factors, which may be qualitative or quantitative variables. This step of the research is used to benchmark the performance of the design of experiment methodologies, as the latter techniques are not traditionally used in disease modeling (Sibanda & Pretorius 2012).

2.2.6. Step Six: Application of MLPs to model demographic characteristics

MLPs are feed-forward artificial neural networks comprised of several layers that are fully connected to each other. MLPs employ a supervised learning technique called back-propagation. The MLPs will be trained and validated on the given antenatal data and thereafter used to predict or classify new data. Demographic characteristics will be used as input variables while the HIV status will be the response parameter (Sibanda & Pretorius 2011).

2.2.7. Step Seven: A review of the application of neural networks in modeling HIV/AIDS

Neural networks are finding increasing application in various fields ranging from the engineering sciences to the life sciences. This review aims to highlight the use of neural networks in the study of HIV/AIDS (Sibanda & Pretorius 2012).

2.2.8. Step Eight: Model comparison using ROC Curves

Receiver operating characteristic (ROC) curves will be used to compare the classification accuracy of the models (Sibanda & Pretorius 2013).

2.2.9. Step Nine: Development and validation of an HIV risk scorecard

This research paper will cover the development of an HIV risk scorecard using SAS Enterprise Miner™. The project will encompass the selection of the data sample, classing, selection of demographic characteristics, fitting of a regression model, generation of weights-of-evidence (WOE), calculation of information values (IVs), and the creation and validation of an HIV risk scorecard (Sibanda & Pretorius 2013).

2.3. Software tools

The design of experiments, neural network and logistic regression analyses in this study were performed using SAS software (SAS Institute, Cary, NC, USA). SAS Enterprise Miner™ was used to compare results from the three modeling methodologies, namely design of experiments, artificial neural networks and binary logistic regression.
