
Index 3.1.


Academic year: 2021


Chapter 3: Results

Index

3.1. Data Exploration

3.2. Use of Two-level fractional factorial design to model demographic characteristics

3.3. Response Surface Modeling and optimization to elucidate the differential effects of demographic characteristics on HIV prevalence in South Africa.

3.4. Application of Central Composite face-centered and Box-Behnken designs to study the effect of demographic characteristics on HIV risk in South Africa.

3.5. Comparative Study of the application of Box-Behnken design and binary logistic regression (BLR) to study the effect of demographic characteristics on HIV risk in South Africa.

3.6. Novel application of multi-layer perceptrons (MLP) neural networks to model HIV in South Africa using seroprevalence data from antenatal clinics.

3.7. Using ROC curves to compare the classification accuracies of neural networks, logistic regression and decision trees.


3.1. Data Exploration

Fig. 4.1: HIV frequency

Fig. 4.1 shows that 9 750 pregnant women tested positive for HIV infection compared to 23 835 women who tested negative. This translates to 29% of the antenatal clinic attendees having an HIV infection. The total number of individuals tested for HIV infection was 33 585.

Fig. 4.2: Syphilis frequency

Fig. 4.2 shows that 933 pregnant women tested positive for syphilis infection. Syphilis is a disease spread through sexual intercourse and is caused by the bacterium Treponema pallidum. It is also transmitted from mother to foetus during pregnancy or at birth, causing congenital syphilis. In simple terms, Fig. 4.2 indicates that 2.78% of pregnant mothers were found to be infected with syphilis. Syphilis has been found to increase the efficiency of HIV transmission by a factor of three.

[Fig. 4.1 chart: Frequency of HIV Result; HIV Positive (9 750), HIV Negative (23 835)]

[Fig. 4.2 chart: Frequency of Syphilis; syphilis negative 32 652, syphilis positive 933]
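The percentages quoted for Fig. 4.1 and Fig. 4.2 follow directly from the reported counts. The short Python sketch below reproduces that arithmetic; the function name `prevalence` is illustrative and is not part of the original analysis.

```python
def prevalence(positive: int, total: int) -> float:
    """Return the percentage of tested individuals with a positive result."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * positive / total


# Counts reported in the text for Fig. 4.1 and Fig. 4.2
TOTAL_TESTED = 33_585
HIV_POSITIVE = 9_750
SYPHILIS_POSITIVE = 933

hiv_pct = prevalence(HIV_POSITIVE, TOTAL_TESTED)
syphilis_pct = prevalence(SYPHILIS_POSITIVE, TOTAL_TESTED)

print(f"HIV prevalence: {hiv_pct:.0f}%")
print(f"Syphilis prevalence: {syphilis_pct:.2f}%")
```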

Fig. 4.3: Frequency of HIV infection by syphilis status

Fig. 4.3 shows that the great majority of antenatal clinic attendees tested syphilis negative. As stated for Fig. 4.2, approximately 32 652 pregnant mothers tested negative for syphilis, and of these individuals approximately 9 750 women were HIV positive (29%). In addition, Fig. 4.3 shows that of the small number of individuals who had syphilis infection (approximately 933), about half were also HIV positive (50%). Therefore, where syphilis infection exists, there tends to be a higher level of HIV infection. This is hardly surprising, as it has been scientifically established that syphilis infection tends to enhance HIV infection.

Fig. 4.4: Frequency of HIV infection by pregnant women’s ages

Fig. 4.4 shows that the ages of pregnant women attending antenatal clinics in South Africa range from 14 to about 40 years. The greatest number of attendees is observed between 16 and 35 years of age, where the highest frequency of HIV infection is also observed. Women younger than 16 and those older than 40 years exhibit the lowest levels of HIV infection. The highest HIV infection rate within an age group, about 33%, is observed between the ages of 20 and 30. This information is very important to policy makers attempting to curb the spread of the epidemic. More education and awareness campaigns need to be targeted at the younger women, to promote safer sexual practices and to emphasize the importance of delaying sexual debut.

Fig. 4.5: Frequency of HIV infection by gravidity

Fig. 4.5 shows a plot of the frequency of HIV infection against the number of pregnancies a woman has had. The greatest number of antenatal clinic attendees were distributed between pregnancies 1 and 5, with the largest number of HIV positive individuals undergoing their second pregnancy.

Once more this information shows that the younger women presenting at the clinic for their first pregnancy tend to exhibit lower levels of HIV infection and that the risk of HIV infection increases with the number of pregnancies. However, after the fifth pregnancy the risk of HIV infection falls dramatically. This is to be expected, as these are older women, possibly with an improved understanding of the HIV epidemic.

Fig. 4.6: Frequency of HIV infection by parity

The greatest number of antenatal clinic attendees in 2007 had no children, i.e. zero parity. In other words, these women were attending the antenatal clinic for the first time, with their first pregnancy, and were expecting their first-born child. The highest HIV infection rate was observed amongst the women with one child (single parity), who were presenting to the clinic for their second pregnancy. This information confirms the observation made in Fig. 4.5.

Fig. 4.7: Frequency of HIV infection by educational level

(0 = Primary education, 1 = Secondary education and 2 = Tertiary education)

The largest group of antenatal clinic attendees (approximately 50%) had a secondary education. However, 45% of pregnant women with a primary education were found to be HIV positive, compared to 41% and 40% for women with a secondary and a tertiary education respectively.

Fig. 4.8: Frequency of HIV infection by male partner’s age

The largest number of male sexual partners of the pregnant women attending antenatal clinics fall within the age group 26 to 30 years. However, the age group 31 to 35 years exhibited the highest level of HIV infection, at 63%. Young men below the age of 20 years had the lowest rate of HIV infection, at 9.76%.


3.2. Application of Two-level fractional factorial design to determine and optimize the effect of demographic characteristics on HIV prevalence using the 2006 South African annual antenatal HIV and Syphilis seroprevalence data (Sibanda & Pretorius 2011).

3.2.1. Predictive model

The design-of-experiments facility in SAS™ was used to generate a predictive model for HIV that contained all the demographic characteristics, as shown in the equation below:

Table 4.9: Predictive model generated by the screening design

HIV = 0.31 - 0.1125 * Parity - 0.0465 * Gravidity - 0.1434 * Education - 0.031 * Syphilis + 0.2775 * Mother’s age + 0.079 * Father’s age

3.2.2. Coefficient Plot

A convenient way to visualize the results of a regression is a coefficient plot, as shown in Fig. 4.9 below. This plot shows the size and sign of each regression coefficient in the model. In that regard, the mother’s age appears to make the greatest contribution, followed by the educational level of the antenatal clinic attendee (Sibanda & Pretorius 2011).
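The ordering read off a coefficient plot can be checked by sorting the coefficients of the screening model by absolute value. The sketch below does this in Python with the coefficients of Table 4.9; the helper name `rank_by_magnitude` is illustrative only.

```python
# Coefficients of the screening model in Table 4.9 (coded factor levels)
COEFFICIENTS = {
    "Parity": -0.1125,
    "Gravidity": -0.0465,
    "Education": -0.1434,
    "Syphilis": -0.031,
    "Mother's age": 0.2775,
    "Father's age": 0.079,
}


def rank_by_magnitude(coefs):
    """Return factor names ordered from largest to smallest |coefficient|."""
    return sorted(coefs, key=lambda name: abs(coefs[name]), reverse=True)


ranking = rank_by_magnitude(COEFFICIENTS)
for name in ranking:
    print(f"{name:<14} {COEFFICIENTS[name]:+.4f}")
```

Sorting by magnitude rather than by signed value matters here, because negative coefficients such as that of education still represent a large contribution.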

Fig. 4.9: Fractional factorial coefficient plot (x-axis: demographic characteristic, i.e. parity, gravidity, education, syphilis, mother's age and father's age; y-axis: coefficient)

3.2.3. Lenth’s Plot

A Lenth plot (Fig. 4.10) was computed to determine the relative contribution of each demographic characteristic to the HIV risk. A Lenth plot is a bar chart used to identify possibly significant effects. The plot is created using a method, proposed by Lenth in 1989, that computes a simultaneous margin of error (SME) around zero. Effect sizes that exceed the SME are considered to be significant. Lenth uses a pseudo-standard error (PSE) to construct the SME. A preliminary estimate of the standard error is computed as 1.5 times the median of the absolute values of the estimated effects. Only the effects within 2.5 times the preliminary estimate are included in the trimmed median, in an attempt to include only the inactive effects in the estimate. A significance level (alpha) of 0.05 was adopted for the SME. As shown in the Lenth plot (Fig. 4.10), the mother’s age had the greatest influence on the risk of acquiring HIV infection, followed by educational level, parity, father’s age, gravidity and lastly syphilis status (Sibanda & Pretorius 2011:18).
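The two-step pseudo-standard error described above can be sketched in a few lines of Python. The effect estimates themselves are not listed in this chapter, so the Table 4.9 coefficients are used here purely as stand-in inputs (an assumption), and the function name `lenth_pse` is illustrative. The SME additionally requires a t-distribution quantile and is not computed in this sketch.

```python
from statistics import median


def lenth_pse(effects):
    """Lenth's pseudo-standard error for a list of effect estimates."""
    abs_effects = [abs(e) for e in effects]
    # Step 1: preliminary estimate, 1.5 times the median absolute effect
    s0 = 1.5 * median(abs_effects)
    # Step 2: keep only effects within 2.5 * s0 (presumed inactive effects)
    trimmed = [a for a in abs_effects if a < 2.5 * s0]
    if not trimmed:  # all effects are zero
        return 0.0
    # PSE is 1.5 times the median of the trimmed absolute effects
    return 1.5 * median(trimmed)


# Stand-in inputs (assumption): coefficients taken from Table 4.9
stand_in_effects = [-0.1125, -0.0465, -0.1434, -0.031, 0.2775, 0.079]
pse = lenth_pse(stand_in_effects)
print(f"PSE = {pse:.6f}")
```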

3.2.4. Normal plot

A normal plot was constructed by plotting the sorted values of the responses against the theoretical quantiles from a normal distribution. The normal plot is used to determine significant effects. Insignificant effects correspond to points that lie on or near a line whose slope is the standard deviation of the error, while the significant effects correspond to points that depart from the line. The normal plot shown in Fig. 4.11 confirmed the result obtained from the Lenth plot (Fig. 4.10), namely that the mother’s age had the greatest effect on the risk of acquiring HIV (Sibanda & Pretorius 2011:18).

Fig. 4.11: Normal plot

3.2.5. Model adequacy

Residual analysis was used as the main method for assessing the adequacy of the regression model. The methods used for residual analysis were normal probability plot of residuals, plot of residuals versus predicted response and outlier analysis using threshold or cutoff values.

3.2.5.1. Normal Probability Plot of residuals

Fig. 4.12: Normal probability plot of errors

The normal probability plot is a graphical technique for assessing whether or not a data set is approximately normally distributed. The data are plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality (Sibanda & Pretorius 2011:18).
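The coordinates of a normal probability plot can be computed as sketched below in Python. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, since the thesis does not state which convention the software used; the residual values are hypothetical and serve only to illustrate the call.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each sorted residual with its theoretical standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n, an assumed convention
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Hypothetical residuals, for illustration only
points = normal_plot_points([0.2, -0.1, 0.0, 0.1, -0.2])
for q, r in points:
    print(f"{q:+.3f}  {r:+.2f}")
```

If the residuals are approximately normal, these points fall close to a straight line when plotted.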

3.2.5.2. Plot of Residuals against Predicted Response

Plotting residuals versus the value of a fitted response produced a distribution of points scattered randomly about 0, regardless of the size of the fitted value (Sibanda & Pretorius 2011:18).

[Figure annotation: R2 = 0.87]

Fig. 4.13: Plot of residuals against predicted values

3.2.5.3. Plot of Residuals against Experimental Cases

A plot of the residuals against experimental cases is a way of checking that the error terms do not depend on the experimental cases, as shown in Fig. 4.14.

Fig. 4.14: Plot of residuals against experimental cases


3.2.5.4. Constrained Optimization

Constrained optimization was achieved by imposing the following constraints on the mother’s age and her educational level:

-1 < mother’s age < 1
-1 < educational level < 1

Table 4.10 shows the outcomes of the constrained optimization procedure, where the highest prevalence of HIV of 86% was obtained at the lowest level of education and highest maternal age.

Table 4.10: Outcome of constrained optimization

Variable        Level    HIV prevalence risk
Education       -1       86%
Mother’s age    1

The constrained optimization results indicate that as the mother’s age increases, the risk of HIV infection increases significantly. In addition, a decrease in the educational level of the mother results in a marked increase in the risk of HIV infection (Sibanda & Pretorius 2011:19).

3.3. Response surface modeling and optimization to elucidate the differential effects of demographic characteristics using a central composite face-centered design (Sibanda & Pretorius 2012)

3.3.1. Central Composite face-centered

The design had 4 factors and generated 36 runs. The design did not exhibit any outliers or influential observations. The fit statistics of the orthogonal central composite face-centered design are shown in Table 4.11.

Table 4.11: Fit statistics for the central composite face-centered design

                 Master Model    Predictive Model
Mean             0.293           0.293
R-square         77.94%          68.12%
Adj. R-square    11.75%          57.49%

The predictive model for HIV generated by the central composite face-centered design of experiment is shown in Table 4.12.

Table 4.12: Predicted model for HIV using the Central Composite Face-centred design

CCF HIV =    Factors
+0.307       (intercept)
+0.16 *      Mother’s age
+0.0018 *    Father’s age
-0.08 *      Education
+0.005 *     Parity
+0.04 *      Mother’s age * Father’s age
-0.11 *      Mother’s age * Education
-0.02 *      Mother’s age * Parity
-0.02 *      Father’s age * Parity
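To illustrate how the interaction terms of the Table 4.12 model act, the sketch below evaluates the equation at coded factor levels (-1, 0, +1) in Python. The function name `ccf_predict` and the chosen evaluation points are illustrative assumptions; the printed values are simply what the tabulated coefficients give and are not results reported in the thesis.

```python
def ccf_predict(mother_age, father_age, education, parity):
    """Evaluate the Table 4.12 CCF model at coded factor levels."""
    return (
        0.307
        + 0.16 * mother_age
        + 0.0018 * father_age
        - 0.08 * education
        + 0.005 * parity
        + 0.04 * mother_age * father_age
        - 0.11 * mother_age * education
        - 0.02 * mother_age * parity
        - 0.02 * father_age * parity
    )


# Highest maternal age, father's age and parity held at the centre level
low_education = ccf_predict(1, 0, -1, 0)
high_education = ccf_predict(1, 0, 1, 0)

print(f"Education = -1: {low_education:.3f}")
print(f"Education = +1: {high_education:.3f}")
```

Because of the negative mother's age by education term, the predicted risk at the highest maternal age depends strongly on the educational level.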

The coefficient plot derived from the predictive model above is shown in Fig. 4.15.

Fig. 4.15: CCF Coefficient plot

3.3.3.1. Numerical Optimization

The central composite face-centered design confirmed that the mother’s age was the most influential characteristic in determining the risk of acquiring HIV. The interaction of demographic characteristics was also found to be significant in influencing the risk of acquiring an HIV infection.

Fig. 4.16: Surface plot of the relationship between father’s age and mother’s age on HIV

3.3.4. Residual Analysis

There are many statistical tools for model validation, but the primary tool for most process modeling applications is graphical residual analysis. The residual plots assist in examining the underlying statistical assumptions about residuals (Sibanda & Pretorius 2011: 252-10).

Therefore residual analysis is a useful class of techniques for the evaluation of the goodness of a fitted model. One method of residual analysis is the normal plot of residuals, shown in Fig.4.17.

The normal plot of residuals (Fig. 4.17) evaluates whether there are outliers in the dataset. All the points lie on the diagonal, implying that the residuals constitute normally distributed noise. A curved pattern would indicate non-modelled quadratic relations or incorrect transformations (Sibanda & Pretorius 2011: 252-11).

3.4. Novel application of central composite face-centered and Box-Behnken designs to study the effect of demographic characteristics on HIV risk in South Africa (Sibanda & Pretorius 2012).

3.4.1. Model Summary Statistics of the central composite face-centered and Box-Behnken design

Table 4.13: CCF and BBD predictive models

Predictive model    Central composite face-centred    Box-Behnken design
                    orthogonal design
Mean                0.293                             2.823
R-square            68.12%                            86.61%
Adj. R-square       57.49%                            77.69%
RMSE                0.16                              0.237
CV                  54.54                             8.39
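The CV row of Table 4.13 is consistent with the usual definition of the coefficient of variation as the RMSE expressed as a percentage of the mean response. The Python sketch below checks this against the tabulated values; it assumes that definition, and the small difference for the CCF design is attributed to rounding of the tabulated RMSE and mean.

```python
def coefficient_of_variation(rmse: float, mean: float) -> float:
    """Return the coefficient of variation as a percentage: 100 * RMSE / mean."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return 100.0 * rmse / mean


# RMSE and mean as tabulated in Table 4.13
cv_ccf = coefficient_of_variation(0.16, 0.293)
cv_bbd = coefficient_of_variation(0.237, 2.823)

print(f"CCF CV = {cv_ccf:.2f}")
print(f"BBD CV = {cv_bbd:.2f}")
```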

The R-square statistics of the Box-Behnken design were considerably higher than those of the orthogonal central composite face-centred design. Statistically, high R-square values imply that a large proportion of the variation in the observed values is explained by the model. Further tests indicated that the R-square values for the 2-factor interaction models were higher than for the linear models, suggesting that the antenatal data were better modelled by the main and 2-factor interaction models. In addition, adeq. precision, which is used as a measure of the signal-to-noise ratio, indicated that both models could be used to navigate the design space. The orthogonal central composite face-centred and Box-Behnken designs had adeq. precisions of 8.84 and 31.33 respectively, indicating an adequate signal (Sibanda & Pretorius 2012:6).

3.4.2. ANOVA for 2FI response Surface

The orthogonal central composite face-centred and Box-Behnken designs confirmed that the mother’s age has the greatest effect on the HIV status of antenatal clinic attendees. The mother’s educational level was the second most important individual factor. Also of note is the fact that the interaction of the mother’s age with father’s age and educational level significantly affects the HIV status of the antenatal clinic attendees (Sibanda & Pretorius 2012:6).

Table 4.14: ANOVA results

Source                         Sum of Squares      F-value             P-value
                               CCF       BBD       CCF       BBD       CCF       BBD
Model                          0.12      0.014     7.99      88.29     0.001     <0.0001
Mother’s age                   0.035     0.054     18.17     301.92    0.001     <0.0001
Father’s age                   0.0004    0.0004    0.021     2.25      0.888     0.194
Education                      0.012     0.0046    6.09      25.60     0.031     0.004
Parity                         0.0003    0.0009    0.13      5.06      0.725     0.074
Mother’s age * Father’s age    0.015     0.023     7.97      126.75    0.017     <0.0001
Mother’s age * Education       0.057     0.017     29.68     94.10     0         0.0002
Mother’s age * Parity          0.005     0.001     2.35      6.51      0.154     0.051
Father’s age * Parity          0.005     0.0009    2.35      5.06      0.154     0.074
Education * Parity             0.000     0.0004              22.56               0.0051

3.4.3. Model adequacy checking

Model adequacy checking is conducted to verify whether the fitted model provides an adequate approximation to the true system and to verify that none of the least squares regression assumptions are violated. Extrapolation and optimization of a fitted response surface will give misleading results unless the model is an adequate fit. There are many statistical tools for model validation, but the primary tool for most process modelling applications is graphical residual analysis. The residual plots assist in examining the underlying statistical assumptions about residuals. Therefore residual analysis is a useful class of techniques for the evaluation of the goodness of a fitted model (Sibanda & Pretorius 2012:7).

3.4.3.1. Residual Analysis

3.4.3.1.1. Normality Probability Plot of Residuals

A normal probability plot of residuals can be used to check the normality assumption. If the residuals plot approximates a straight line, then the normality assumption is satisfied.

The normal plot of residuals also evaluates whether there are outliers in the dataset. Points lying on the diagonal imply that the residuals constitute normally distributed noise. The normal probability plot of the central composite design approximates a straight line, suggesting normality. However, the normal probability plot of the Box-Behnken design displays marked deviation from normality. This could be the result of a number of factors, such as inadequate transformation.

Fig. 4.18: Normal plots of residuals for the central composite and Box-Behnken design

(a) Normal plot of residuals for the Central composite face-centred design

3.4.3.1.2. Plot of residuals vs fitted response

The residuals should scatter randomly suggesting that the variance of the original observations is constant for all values of the response. However, if the variance of the response depends on the mean level of the response, the shape of the plot tends to be funnel-shaped, suggesting a need for a transformation of the response variable (Sibanda & Pretorius 2012:7).

Fig. 4.19: Plots of Residuals vs Fitted Response for the CCF and Box-Behnken designs

The plots of the residuals vs fitted response for both the central composite face-centred and Box-Behnken designs suggest that the variance of the original observations is constant.

(a) Plot of Residuals vs Fitted Response for the central composite design

3.4.3.1.3. Plot of residuals vs observation order

Non-random patterns on these plots indicate model inadequacy. This might require transformation to stabilize the situation.

Fig. 4.20: Plots of residuals vs observation order for the CCF and Box-Behnken designs

3.4.3.1.4. Influence Diagnosis

Parameter estimates or predictions may depend more on the influential subset than on the majority of the data. It is therefore important to locate these influential points and assess their impact on the model (Sibanda & Pretorius 2012a).

3.4.3.1.4.1. Leverage points

The leverage of points test was used for the influence diagnosis. Leverage is a measure of the disposition of points in the x-space. Some observations tend to have disproportionate leverage on the parameter estimates, the predicted values and the summary statistics. As shown in Fig. 4.21, the leverage of points patterns were similar for both the central composite face-centred and Box-Behnken designs (Sibanda & Pretorius 2012a).

(a) Plot of Residuals vs Observation order for the central composite face-centered design

Fig. 4.21: Plots of leverage of points for the CCF and Box-Behnken designs

(a) Plot of leverage of points for the central composite design

3.4.4. Final equations of the central composite and Box-Behnken designs

Table 4.15: Final equations of the CCF and Box-Behnken designs

CCF HIV =    BBD HIV =    Factors
+0.33        +0.32        (intercept)
+0.15 *      +0.18 *      Mother’s age
+0.002 *     -0.02 *      Father’s age
-0.09 *      -0.07 *      Education
+0.005 *     +0.02 *      Parity
+0.04 *      +0.13 *      Mother’s age * Father’s age
-0.11 *      -0.10 *      Mother’s age * Education
-0.02 *      +0.05 *      Mother’s age * Parity
-0.02 *      +0.02 *      Father’s age * Parity

The coefficient plots (Fig. 4.22) derived from the final response surface equations in Table 4.15 clearly indicate that the mother’s age and her educational level are the most important determinants of the HIV status of an antenatal clinic attendee. Coefficient plots represent the relative importance of each variable in the model equation. In addition, the interaction of the mother’s age with the other demographic characteristics is also an important determinant of the HIV risk (Sibanda & Pretorius 2012:8).


3.4.5. Main effects plot

[Main Effects Plot for HIV risk (data means); panels: Motherage, Fatherage, Education, Parity; x-axis: coded levels -1, 0, 1; y-axis: mean]

Fig. 4.23: Main effects plot

A main effects plot (Fig. 4.23) is a plot of the means of the response variable for each level of a factor, allowing for the determination of which main effects are important. For both the central composite face-centred and Box-Behnken designs, HIV risk increases steeply as the mother’s age and education increase from the low level to the middle level. Thereafter, the HIV risk decreases gradually for the two demographic characteristics (Sibanda & Pretorius 2012:9).

3.4.6. Interaction plot

This research assumes the sparsity-of-effects principle, which states that a system is usually dominated by main effects and low-order interactions. Thus it is most likely that main effects and two-factor interactions are the most significant effects in an experimental design. This means that higher-order interactions, such as three-factor interactions, are rare. This phenomenon is referred to as hierarchical ordering (Sibanda & Pretorius 2012a).

Central composite face-centred and Box-Behnken designs showed that the interaction of the mother’s age with the other demographic characteristics had a significant effect on the HIV risk of pregnant mothers as shown in Fig. 4.24 (Sibanda & Pretorius 2012a).

[Interaction Plot for HIV risk (data means); panels: Motherage, Fatherage, Education, Parity; coded levels -1, 0, 1]

Fig. 4.24: Interactions Plot

3.4.7. 3D Response surface plot

Fig. 4.25 shows the 3D plots of the influences of mother’s age and education on the HIV risk of pregnant mothers. The response surface plots indicate that the HIV risk increases with the age of the mother; however, the increase in HIV risk is lower for educated women than for their less educated counterparts. The latter observation could be attributed to increased HIV/AIDS awareness in the educated groups (Sibanda & Pretorius 2012:9).

Fig. 4.25: 3D Response surface plots of CCF and Box-Behnken designs

a) 3D response surface plot of the central composite face-centered Design


3.5. Comparative Study of the Application of Box Behnken Designs (BBD) and Binary Logistic Regression (BLR) to study the effect of demographic characteristics on HIV risk in South Africa (Sibanda & Pretorius 2012).

3.5.1. Model Fit Statistics

3.5.1.1 Box Behnken Design

3.5.1.1.1. Sequential model sum of squares

This technique shows the effect of adding terms of increasing complexity to the model.

Table 4.16: Sequential model sum of squares for the Box-Behnken design

Source             Sum of Squares    F-value    P-value
Mean vs. Total     1.13
Linear vs. Mean    0.097             5.37       0.014
2FI vs. Linear     0.044             49.83      0.000

The Box-Behnken design has the lowest probability (P-value) for the 2-factor interaction model at a significance level of 0.05. This means that the data is best modelled by a main and 2-factor interactions model as compared to a linear model. Therefore the interaction of factors has a definite effect on the risk of acquiring an HIV infection (Sibanda & Pretorius 2012b).

3.5.1.1.2. Model summary statistics

Table 4.17: Model summary statistics for the Box-Behnken design

Source    Standard Deviation    R2      R2 Adjusted    PRESS    Adeq. precision
Linear    0.067                 0.68    0.56           0.15
2FI       0.013                 0.95    0.98           -        31.33

The R-square statistics of the linear models are considerably lower than those of the two-factor interactions (2FI) models. Therefore, the 2-factor interactions model has the lowest standard deviation, high R-squared and low Predicted Residual Sum of Squares (PRESS), implying that the 2-factor interactions model best fits the data. Statistically, high R-square values imply that a large proportion of variation in the observed values is explained by the model.

Adeq. precision is used to measure the signal-to-noise ratio. A ratio greater than 4 is desirable, indicating that the model can be used to navigate the design space. The BBD design has an adeq. precision of 31.33, indicating an adequate signal (Sibanda & Pretorius 2012b).

3.5.1.2. Binary Logistic regression

3.5.1.2.1. Goodness-of-fit

In logistic regression, Deviance and Pearson’s chi-squared goodness-of-fit are measures used to compare the overall difference between observed and fitted values. In addition, information criteria such as the Akaike Information Criterion (AIC), the Schwarz Criterion (SC) and the negative log-likelihood are used to measure goodness-of-fit for logistic regression models (Sibanda & Pretorius 2012b).

3.5.1.2.1.1. Pearson’s Chi-Square test

Table 4.18: Pearson’s chi-square test

Criterion    Value    DF    Value/DF    Pr > ChiSq.
Pearson      47.14    25    1.89        0.0047

Pearson’s chi-square statistic includes the test for independence in two-way contingency tables. This technique has been extended from generalized linear model theory to test for adequacy of the current fitted model (Sibanda & Pretorius 2012b).

3.5.1.2.1.2. Residual Deviance

Table 4.19: Deviance values

Criterion    Value    DF    Value/DF    Pr > ChiSq.
Deviance     46.75    25    1.87        0.0052

The other goodness-of-fit test is the residual deviance. This is the log-likelihood ratio statistic for testing the fitted model against the saturated model, in which there is a regression coefficient for every observation. The deviance compares the values predicted by the fitted model with those predicted by the most complete model that could be fitted. A very large deviance value is evidence of model lack-of-fit. Under specific regularity conditions, the deviance converges asymptotically to a chi-square distribution with h degrees of freedom, where h is the difference between the number of parameters in the saturated model and the number of parameters in the model under consideration.

Therefore, if the null hypothesis cannot be rejected, it can be concluded that the fit of the model of interest is substantially similar to that of the most complete model that can be built (Sibanda & Pretorius 2012b).
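The Value/DF columns of Tables 4.18 and 4.19 are simply each statistic divided by its degrees of freedom. The Python sketch below reproduces them; the function name `value_per_df` is illustrative, and the p-values in the tables are not recomputed here.

```python
def value_per_df(statistic: float, df: int) -> float:
    """Return a goodness-of-fit statistic divided by its degrees of freedom."""
    if df <= 0:
        raise ValueError("df must be positive")
    return statistic / df


# Values as tabulated in Table 4.18 (Pearson) and Table 4.19 (Deviance)
pearson_ratio = value_per_df(47.14, 25)
deviance_ratio = value_per_df(46.75, 25)

print(f"Pearson / DF  = {pearson_ratio:.2f}")
print(f"Deviance / DF = {deviance_ratio:.2f}")
```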

3.5.1.2.1.3. Akaike Information Criterion (AIC)

Table 4.20: Akaike Information Criterion (AIC)

Criterion    Intercept Only    Intercept and Covariates
AIC          18887.57          18189.01

AIC is used to compare candidate models fitted to the same data. The model with the lowest AIC is considered best, as it minimizes the difference between the given model and the true model. From Table 4.20, it is evident that the model with intercept and covariates fits the data better than the intercept-only model (Sibanda & Pretorius 2012b).

3.5.1.2.1.4. Schwarz criterion (SC)

Table 4.21: Schwarz criterion (SC)

Criterion Intercept Only Intercept and Covariates

The Schwarz criterion was developed in 1978 as a model selection criterion. It was derived from a Bayesian modification of the AIC. Like AIC, SC penalizes for the number of predictors in the model, and the smallest SC is most desirable. Table 4.21 further confirms that the intercept and covariates model is better than the intercept-only model (Sibanda & Pretorius 2012b).

3.5.1.2.1.5. -2logL

Table 4.22: -2logL

Criterion    Intercept Only    Intercept and Covariates
-2Log L      18885.57          18179.01

The -2logL is used in hypothesis testing for nested models. The intercept-only model is the logistic regression estimate when all covariate coefficients in the model are set to zero. As shown in Table 4.22, the model with independent variables and the intercept has a lower -2log L value, indicating that it is better than the intercept-only model.

It should be noted that the values of the three measures (AIC, SC and -2logL) of model-fit are similar (Sibanda & Pretorius 2012b).
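The closeness of the AIC and -2logL values follows from the standard definition AIC = -2logL + 2k, where k is the number of estimated parameters. The Python sketch below checks this relation against the tabulated values (AIC of 18887.57 and 18189.01, -2logL of 18885.57 and 18179.01); the function names are illustrative, and the parameter count it prints is only what the tabulated numbers imply under this definition.

```python
def aic(neg2_log_lik: float, n_params: int) -> float:
    """Akaike Information Criterion: -2 log L plus twice the parameter count."""
    return neg2_log_lik + 2 * n_params


def implied_params(aic_value: float, neg2_log_lik: float) -> float:
    """Parameter count implied by a reported AIC and -2 log L pair."""
    return (aic_value - neg2_log_lik) / 2


# Intercept-only model: one estimated parameter
aic_intercept_only = aic(18885.57, 1)

# Intercept and covariates model: parameter count implied by the tables
k_full = implied_params(18189.01, 18179.01)

print(f"AIC (intercept only) = {aic_intercept_only:.2f}")
print(f"Implied parameters (full model) = {k_full:.0f}")
```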

3.5.2. ANOVA for 2-factor interaction response surface

3.5.2.1. Box Behnken Design

The ANOVA table (Table 4.23) for the Box-Behnken design confirmed the adequacy of the 2-factor interactions model. The model F-value of 88.29 for the Box-Behnken design is significant with only a 0.01% chance of this value being due to noise.

In addition, the Box-Behnken design confirmed that the mother’s age had the greatest effect on the risk of acquiring an HIV infection for pregnant women in South Africa. Once more, the pregnant women’s educational level was the second most important individual factor. Furthermore, the interaction of the mother’s age with father’s age and educational level significantly affects the HIV status of an antenatal clinic attendee (Sibanda & Pretorius 2012b).

Table 4.23: ANOVA Results for the Box-Behnken design

Source                       Sum of Squares    F-value    P-value
Model                        0.014             88.29      <0.0001
Mother’s age                 0.054             301.92     <0.0001
Father’s age                 0.0004            2.25       0.194
Education                    0.0046            25.60      0.004
Mother’s age*Father’s age    0.023             126.75     <0.0001
Mother’s age*Education       0.017             94.10      0.0002
Mother’s age*Parity          0.001             6.51       0.051
Father’s age*Parity          0.0009            5.06       0.074
Education*Parity             0.0004            22.56      0.0051

3.5.2.2. Binary Logistic Regression

3.5.2.2.1. Likelihood Ratio (LR), Wald and Score Tests

Table 4.24: Likelihood Ratio (LR), Wald and Score Tests

Test                Chi-Square    DF    Pr>Chi-Square
Likelihood Ratio    726.07        3     <0.0001
Score               670.25        3     <0.0001
Wald                628.26        3     <0.0001

The likelihood ratio statistic of 726.07 confirms that the fitted model with intercept and covariates is a significant improvement on the basic model with no predictors. The three hypothesis testing techniques (LR, Score and Wald tests) all confirm the effect of adding covariates to the basic intercept-only model (Sibanda & Pretorius 2012b). In general, for large samples the LR, Score and Wald statistics are approximately equal, as shown in Table 4.24.
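The p-values in Table 4.24 can be checked against the chi-square distribution with 3 degrees of freedom, whose upper-tail probability has a closed form. The Python sketch below uses that closed form with the standard library only; it is specific to 3 degrees of freedom, and the function name `chi2_sf_df3` is illustrative.

```python
import math


def chi2_sf_df3(x: float) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with 3 df."""
    if x <= 0:
        return 1.0
    # Closed form for df = 3: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Test statistics as tabulated in Table 4.24 (each with DF = 3)
statistics_df3 = {"Likelihood Ratio": 726.07, "Score": 670.25, "Wald": 628.26}
p_values = {name: chi2_sf_df3(value) for name, value in statistics_df3.items()}

for name, p in p_values.items():
    print(f"{name}: p < 0.0001 is {p < 0.0001}")
```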

3.5.3. Model Adequacy checking

Model adequacy checking is conducted to verify whether the fitted model provides an adequate approximation to the true system and to verify that none of the least squares regression assumptions are violated. Extrapolation and optimization of a fitted response surface will give misleading results unless the model is an adequate fit.

There are many statistical tools for model validation, but the primary tool for most process modelling applications is graphical residual analysis. The residual plots assist in examining the underlying statistical assumptions about residuals. Therefore residual analysis is a useful class of techniques for the evaluation of the goodness of a fitted model (Sibanda & Pretorius 2012b).

3.5.3.1. Residual Analysis

3.5.3.1.1. Box-Behnken design

3.5.3.1.1.1. Normality Probability Plot of Residuals

A normal probability plot of residuals can be used to check the normality assumption. If the residuals plot approximates a straight line, then the normality assumption is satisfied. Furthermore, the normal plot of residuals, as shown in Fig. 4.26, evaluates whether there are outliers in the dataset. Clearly, the plotted points do not lie completely on the diagonal, implying that the residuals are not perfectly normally distributed (Sibanda & Pretorius 2012b).

[Figure: Normal Probability Plot (response is HIV risk); x-axis: Residual; y-axis: Percent; N = 20, AD = 1.259, P-value < 0.005]

Fig. 4.26: Normal plot of residuals for the Box-Behnken design

3.5.3.1.1.2. Plot of residuals vs fitted response

The residuals should scatter randomly suggesting that the variance of the original observations is constant for all values of the response. However, if the variance of the response depends on the mean level of the response, the shape of the plot tends to be funnel-shaped, suggesting a need for a transformation of the response variable (Sibanda & Pretorius 2012b).

[Figure: Versus Fits (response is HIV risk); x-axis: Fitted Value; y-axis: Residual]

Fig. 4.27: Plot of residuals vs fitted response for the Box-Behnken design

The plot of the residuals vs fitted response for the Box-Behnken design (Fig. 4.27) suggests that the variance of the original observations is constant (Sibanda & Pretorius 2012b).

3.5.3.1.1.3. Plot of residuals vs observation order

Non-random patterns on these plots indicate model inadequacy, which might require a transformation to stabilize the situation (Sibanda & Pretorius 2012b).

[Figure: Versus Order (response is HIV risk); x-axis: Observation Order; y-axis: Residual]

3.5.3.1.2. Binary Logistic Regression

3.5.3.1.2.1. Deviance residuals

Observations with a deviance residual in excess of two may indicate lack-of-fit. Fig. 4.29 shows that there is no lack-of-fit (Sibanda & Pretorius 2012b).

Fig. 4.29: Deviance residuals from the logistic regression

3.5.3.1.2.2. Pearson Residuals

The Pearson residual is the raw residual divided by the square root of the variance function. The Pearson residual is the individual contribution to the Pearson chi-square statistic. Pearson residuals less than three are acceptable (Sibanda & Pretorius 2012b).
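The two residual types described above can be sketched for a binary outcome as follows. This is a minimal illustration using the standard definitions for a Bernoulli response. The fitted probabilities are hypothetical values and are not taken from the thesis data. The cut-offs of two (deviance) and three (Pearson) are the ones quoted in the text.

```python
import math

def pearson_residual(y: int, p: float) -> float:
    """Raw residual divided by the square root of the Bernoulli variance p(1 - p)."""
    return (y - p) / math.sqrt(p * (1.0 - p))

def deviance_residual(y: int, p: float) -> float:
    """Signed square root of the observation's contribution to the deviance."""
    log_lik = y * math.log(p) + (1 - y) * math.log(1.0 - p)
    sign = 1.0 if y >= p else -1.0
    return sign * math.sqrt(-2.0 * log_lik)

# Hypothetical (observed outcome, fitted probability) pairs, for illustration only.
examples = [(1, 0.5), (0, 0.3), (1, 0.05)]

for y, p in examples:
    r_p = pearson_residual(y, p)
    r_d = deviance_residual(y, p)
    flagged = abs(r_d) > 2 or abs(r_p) >= 3
    print(f"y={y}, p={p:.2f}: Pearson={r_p:.3f}, deviance={r_d:.3f}, flagged={flagged}")
```

In this sketch only the last pair, an HIV positive outcome with a fitted probability of 0.05, exceeds both cut-offs.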

3.5.3.2. Influence Diagnosis

Parameter estimates or predictions may depend more on an influential subset of points than on the majority of the data. It is therefore important to locate these influential points and assess their impact on the model. Leverage was used for the influence diagnosis (Sibanda & Pretorius 2012b).

3.5.3.2.1. Leverage points of Box-Behnken design

The leverage test was applied to both the logistic regression model and the Box-Behnken design (Sibanda & Pretorius 2012b).
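To illustrate what a leverage value measures, the sketch below computes leverages for the simplest case, a straight-line regression with one predictor, where the leverage of point i is 1/n + (x_i - mean)^2 / Sxx. The data are hypothetical, and the 2p/n cut-off is a common rule of thumb rather than a criterion stated in this research. The designs in this chapter involve several factors, so this is only a one-variable analogue.

```python
def leverages(x):
    """Leverage (hat-matrix diagonal) of each point in a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical predictor values; the last point lies far from the others.
x = [-1, -1, 0, 0, 0, 1, 1, 4]
h = leverages(x)

n_params = 2  # intercept and slope
cutoff = 2 * n_params / len(x)  # rule-of-thumb threshold (an assumption here)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

print("leverages:", [round(hi, 3) for hi in h])
print("high-leverage points (index):", flagged)
```

The leverages always sum to the number of fitted parameters, so a point with a value close to 1 dominates its own fitted value.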

[Figure: Time Series Plot of HI1; x-axis: Index; y-axis: HI1]

Fig. 4.31: Plot of leverage of points for the Box-Behnken design

3.5.3.2.2. Leverage points of binary logistic regression model

3.5.4. Predictive equations

3.5.4.1. Box-Behnken design

Table 4.25: Final equation from Box-Behnken design

HIV =   Factors
+0.32   (constant)
+0.18   Mother’s age
-0.02   Father’s age
-0.07   Education
+0.02   Parity
+0.13   Mother’s age*Father’s age
-0.10   Mother’s age*Education
+0.05   Mother’s age*Parity
+0.02   Father’s age*Parity
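The equation in Table 4.25 can be evaluated directly. The sketch below assumes that each factor is expressed in coded units (-1 for the low level, 0 for the middle level, +1 for the high level), as is usual for a Box-Behnken design. The function name `predict_hiv_risk` is illustrative.

```python
def predict_hiv_risk(mother_age: float, father_age: float,
                     education: float, parity: float) -> float:
    """Evaluate the Table 4.25 response surface equation in coded units."""
    return (0.32
            + 0.18 * mother_age
            - 0.02 * father_age
            - 0.07 * education
            + 0.02 * parity
            + 0.13 * mother_age * father_age
            - 0.10 * mother_age * education
            + 0.05 * mother_age * parity
            + 0.02 * father_age * parity)

# All factors at their middle level: only the constant term remains.
centre = predict_hiv_risk(0, 0, 0, 0)

# Mother's age at the high level, with education at the low and high levels.
older_low_education = predict_hiv_risk(1, 0, -1, 0)
older_high_education = predict_hiv_risk(1, 0, 1, 0)

print(round(centre, 2), round(older_low_education, 2), round(older_high_education, 2))
```

Comparing the last two values shows how the negative Mother’s age*Education term lowers the predicted risk for older, more educated women relative to older, less educated women.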

3.5.4.1.1. Main Effects Model

A main effects plot (Fig. 4.33) is a plot of the means of the response variable for each level of a factor, allowing for the determination of which main effects are important. From the main effects plot, it is evident that HIV risk increases steeply as the mother’s age and her educational level increase from the low level to the middle level (Sibanda & Pretorius 2012:36).

[Figure: Main Effects Plot for HIV risk (data means); panels: Motherage, Fatherage, Education, Parity; y-axis: Mean]

The coefficient plot (Fig. 4.35), derived from the final response surface equation in Table 4.25, clearly indicates that the mother’s age and her educational level are the most important determinants of the HIV status of an antenatal clinic attendee. Coefficient plots represent the relative importance of each variable in the model equation (Sibanda & Pretorius 2012:36).

3.5.4.1.2. Interaction Effects

Assuming the sparsity-of-effects principle, which states that a system is usually dominated by main effects and low-order interactions, an interactions plot as shown in Fig. 4.34 was generated. On the basis of the sparsity-of-effects principle, the research assumed that main effects and two-factor interactions are the most significant effects in this experimental design. In other words, higher-order interactions such as three-factor interactions are rare. This phenomenon is sometimes referred to as the hierarchical ordering principle.

The interactions plot derived from the design shows that the interaction of the mother’s age with the other demographic characteristics has a significant effect on the HIV risk of pregnant mothers. These results are confirmed by the coefficient plot of the main and interaction effects (Fig. 4.34) (Sibanda & Pretorius 2012b).

[Figure: Interaction Plot for HIV risk (data means); factors: Motherage, Fatherage, Education, Parity]

Fig. 4.35: Coefficient plot of main and interaction effects

3.5.5. Logistic Regression

3.5.5.1 Main Effects Model

The main effects model was produced using HIV status as the response variable at two levels, HIV negative (0) and HIV positive (1). The model generated was based on the binary logit, with Fisher’s scoring as the optimization technique. The maximum likelihood technique was employed to develop estimates of the intercept and the model parameters (Sibanda & Pretorius 2012b).

Table 4.26: Maximum likelihood estimates

Parameter      Estimate   Standard Error   Wald χ²   Pr > χ²
Intercept      0.77       0.02             1189      <0.001
Parity         -0.06      0.03             3.65      0.06
Mother’s age   -0.76      0.06             144.48    <0.001
Father’s age   0.07       0.03             5.64      0.018
Education      -0.28      0.05             28.21     <0.001
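Logit coefficients such as those in Table 4.26 are often easier to read as odds ratios, obtained by exponentiating each estimate. The sketch below does only that conversion. How each ratio should be interpreted depends on how the predictors were coded and on which outcome level the logit models, neither of which is assumed here.

```python
import math

# Coefficient estimates as reported in Table 4.26 (intercept omitted).
estimates = {
    "Parity": -0.06,
    "Mother's age": -0.76,
    "Father's age": 0.07,
    "Education": -0.28,
}

# An odds ratio is exp(coefficient): the multiplicative change in the odds
# of the modelled outcome per one-unit increase in the predictor.
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Estimates close to zero, such as parity, give odds ratios close to 1, while the larger coefficients for mother’s age and education move the ratio further from 1.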

The coefficient plot (Fig. 4.36), derived from the maximum likelihood estimates, confirms the results obtained by the Box-Behnken design: the mother’s age and her educational level are the most important determinants of the HIV status of an antenatal clinic attendee (Sibanda & Pretorius 2012b).

Fig. 4.36: Coefficient plot of the main effects

3.5.6. 3D Response surface plot of the Box-Behnken design

Fig. 4.37 shows the 3D plots of the influence of the mother’s age and her educational level on the HIV risk of pregnant mothers. The response surface plots indicate that the HIV risk increases with the age of the mother; however, the increase in HIV risk is lower for educated women than for their less educated counterparts. The latter observation could be attributed to increased HIV/AIDS awareness in the educated groups (Sibanda & Pretorius 2012:39).

[Figure: Surface Plot of HIV risk vs Education, Motherage; hold values: Fatherage = 0, Parity = 0]

Fig. 4.37: 3D Response surface plot of the Box-Behnken design

3.6. Novel Application of Multi-Layer Perceptrons (MLP) Neural Networks to Model HIV in South Africa using Seroprevalence Data from Antenatal Clinics (Sibanda & Pretorius 2011b).

3.6.1. Number of Neurons in the Hidden Layer

The average prediction (test set) percentages for each configuration are represented in Fig. 4.38. For HIV positive individuals, performance increases with the number of neurons in the hidden layer: 66% prediction with one neuron, and 69%, 71%, 72% and 74% prediction for two, four, five and ten hidden neurons respectively. The prediction performance decreased for HIV negative individuals as the number of hidden neurons increased. Because HIV negative and HIV positive individuals responded differently to increases in the number of hidden neurons, this research used only one hidden neuron for prediction purposes (Sibanda & Pretorius 2011b).

Fig. 4.38: Mean performance as a function of the hidden unit

3.6.2. Number of Iterations

The mean square error (MSE) between observed values and values estimated by the network declined very rapidly from a high starting value to about 0.35 after 150 iterations in the training set (Fig. 4.39). In the validation set, a similar variation was observed, with minimum values close to 0.35. Values of MSE stabilized after 150 iterations (Sibanda & Pretorius 2011b).

[Figure: Mean Performance as a function of number of hidden units; x-axis: Number of Hidden Neurons; y-axis: Mean Performance (%); series: HIV -ve, HIV +ve]

Fig. 4.39: MSE as a function of the training iteration number

The percentages of correct classifications increased slowly for HIV positive individuals up to 1 000 iterations (Fig. 4.40). In this study, training of the network was stopped at 150 iterations, to avoid further deterioration in the classification of HIV negative individuals (Sibanda & Pretorius 2011b).

Fig. 4.40: Performance (percentage of correctly classified records) as a function of the training iteration number (epoch).

[Figure: MSE vs Epoch; x-axis: Epoch; y-axis: MSE; series: Training MSE, Cross-validation MSE]

[Figure: Performance as a function of training iteration number; x-axis: Epoch]

3.6.3. Five-fold validation

Five-fold cross-validation was used. Five-fold cross-validation means that the sample set is divided into fifths. One fifth is used as a test set and the neural network is trained on the other four fifths. This is repeated five times, with a different fifth used for testing each time. The average error rate is then taken. Cross-validation results on the predictive performance of neural networks are given in Table 4.27. Across the five small test sub-samples, the overall classification rate of neural networks ranges from 66% to 74% for HIV positive and 46% to 54% for HIV negative individuals using 1 to 10 hidden nodes. For HIV positive prediction, neural networks give an average of 66% across the five sub-samples using only one hidden node, compared to 54% for HIV negative prediction (Sibanda & Pretorius 201:30).
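The splitting scheme described above can be sketched as follows. This is a generic illustration of five-fold partitioning on record indices. The function name `five_fold_indices`, the shuffling seed and the sample size of 20 are assumptions made for the example and do not reflect the software or data used in the study.

```python
import random

def five_fold_indices(n_samples: int, seed: int = 0):
    """Return five (train, test) index splits; each record is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into five folds of near-equal size.
    folds = [indices[i::5] for i in range(5)]
    splits = []
    for k in range(5):
        test = folds[k]
        train = [idx for j, fold in enumerate(folds) if j != k for idx in fold]
        splits.append((train, test))
    return splits

splits = five_fold_indices(20)
for k, (train, test) in enumerate(splits, start=1):
    print(f"fold {k}: train on {len(train)} records, test on {len(test)} records")
```

A model would be trained on each `train` list and scored on the matching `test` list, after which the five scores are averaged.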

Table 4.27: Cross-validation results on predictive performance for the five small subsamples

Hidden   Sub-sample 1   Sub-sample 2   Sub-sample 3   Sub-sample 4   Sub-sample 5   Mean ± SD (HIV)
nodes    HIV-   HIV+    HIV-   HIV+    HIV-   HIV+    HIV-   HIV+    HIV-   HIV+    HIV-       HIV+
1        53     64      54     69      57     65      52     67      53     64      54 ± 1.9   66 ± 2
2        49     70      55     68      52     72      50     70      52     69      52 ± 2.6   69 ± 1.9
3        50     67      51     73      54     68      48     72      51     69      51 ± 2.7   69 ± 3.2
4        48     70      53     72      52     71      47     73      50     71      50 ± 1.6   71 ± 1.9
5        49     68      50     76      51     71      47     73      49     71      49 ± 1.6   72 ± 2.6
6        49     69      59     59      53     70      54     70      54     71      54 ± 0.2   68 ± 0.2
7        46     72      49     77      45     79      46     76      47     77      47 ± 0.2   76 ± 0.2
8        46     73      48     77      48     74      46     75      48     76      47 ± 0.2   75 ± 0
9        47     72      40     82      48     74      45     76      47     79      45 ±       77
10       40     73      46     76      47     76      46     75      46     72      46 ± 0.5   74 ± 1.8
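The summary columns of Table 4.27 can be recomputed from the sub-sample values. The sketch below does this for the first row (one hidden node). It assumes that the reported spread is the sample standard deviation (n - 1 denominator), which reproduces the tabulated values for this row once rounded.

```python
from statistics import mean, stdev

# Percent correctly classified in each of the five sub-samples, one hidden node
# (first row of Table 4.27).
hiv_neg = [53, 54, 57, 52, 53]
hiv_pos = [64, 69, 65, 67, 64]

neg_mean, neg_sd = mean(hiv_neg), stdev(hiv_neg)
pos_mean, pos_sd = mean(hiv_pos), stdev(hiv_pos)

print(f"HIV-: {neg_mean:.1f} +/- {neg_sd:.1f}")
print(f"HIV+: {pos_mean:.1f} +/- {pos_sd:.1f}")
```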

3.6.4. Sensitivity

The sensitivity test showed that mother’s age and the father’s age had the greatest effect on the HIV status of the antenatal clinic attendees. Gravidity and syphilis had the lowest effect as shown in Fig. 4.41 (Sibanda & Pretorius 2011b).

Fig. 4.41: Sensitivity test results

[Figure: Sensitivity about the Mean; x-axis: Demographic characteristics (Mother's age, Education, Gravidity, Parity, Father's age, Syphilis); y-axis: Sensitivity (HIV); bar labels as extracted: 0.033, 0.005, 0.003, 0.015, 0.035, 0.008]

3.7. Using ROC curves to compare neural networks, logistic regression and decision trees for modelling the causality effects of demographic characteristics on the risk of acquiring HIV infection

Fig. 4.42: Comparison of modelling techniques using SAS Enterprise Miner

In this research a number of common evaluation methods were used for the comparison of the different modelling techniques. Each of these methods was used to assess model performance. In addition, classification was studied for both coded and non-coded (raw) data in order to ascertain the effect of data coding.

SAS Enterprise Miner™ (SAS Inc. 2002) streamlines the data mining process to create highly accurate predictive and descriptive models based on large volumes of data from across the enterprise. It offers a rich, easy-to-use set of integrated capabilities for creating and sharing insights that can be used to drive better decisions.

The Interactive Grouping node in SAS Enterprise Miner™ (SAS Inc. 2002) was used for grouping variables into classes. The Interactive Grouping node requires binary target data, such as the HIV positive and HIV negative results found in antenatal HIV seroprevalence data. Three model nodes were used in SAS Enterprise Miner to develop models, namely the regression, decision tree and neural network nodes. The Regression node enables the fitting of both linear and logistic regression models to data. The versatility of this node lies in its ability to use both continuous and discrete variables as inputs. In addition, the Regression node supports the stepwise, forward and backward selection methods.

[Fig. 4.42 diagram nodes: 2007 Antenatal Data, Transform Data, Data Partition, Regression, Tree, Neural Network, Reporter, Data Attributes, Assessment, Insight]

The Tree node enables multi-way splitting of the database, based on nominal, ordinal and continuous variables. This is the SAS implementation of decision trees, which represents a hybrid of the best of the Chi-squared Automatic Interaction Detection (CHAID), Classification and Regression Tree (CART), and C4.5 algorithms. The node supports both automatic and interactive training. The Tree node facilitates the ranking of the input variables by the strength of their contribution to the tree. This ranking can be used to select variables for use in subsequent modelling.

The Neural Network node enables the construction, training and validation of multilayer feed-forward neural networks. By default, the Neural Network node automatically constructs a multilayer feed-forward network that has one hidden layer consisting of three neurons. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network node supports many variations of this general form.

The Assessment node provides a common framework for comparing models and predictions from any of the modelling nodes (Regression, Tree, Neural Network, and User Defined Model nodes). The comparison is based on the expected and actual benefits or losses that would result from implementing the model. The node produces the following charts that help to describe the usefulness of the model: lift, profit, return-on-investment (ROI), receiver operating characteristic (ROC) curves, diagnostic charts, and threshold-based charts.

The Scorecard node processes indeterminate outcome values and enables the selection of specific adverse reason codes to be included in the summary results. Scorecard design is therefore a supervised approach for creating scores. For this research, the scorecard methodology was used to assess the risk of each demographic characteristic. The scorecard was therefore developed as a table with a set of risk characteristics, each consisting of a set of attributes (such as young, middle-aged and older women), with points associated with each attribute. The points are summed and compared with a decision threshold to determine the risk of each pregnant woman contracting an HIV infection. The scorecard is also predictive, as it uses a logistic regression to combine risk factors into a predictive scorecard.

3.7.1. Global classification rate for raw (non-coded) data

Fig. 4.43: Response threshold chart for non-coded (raw) data

The response threshold chart provides the accuracy rate of predicting the HIV positive and HIV negative individuals across different threshold levels for four modelling techniques. The global classification rates were calculated as the mean of the accuracy rates across the different threshold levels.

Table 4.28: Classification

Modelling technique                    Global classification rate (%)
Neural network                         51.58
Decision tree                          28.00
Logistic regression                    28.96
Full Factorial design (main effects)   29.95

If the misclassification costs are known with some confidence to be equal, the global classification rate can be used as an appropriate evaluation method. Using this method, the neural network model outperforms the other methods, with a global classification rate of 51.58%. In HIV research, the costs of misclassification are known with a high degree of certainty, meaning that the best model is selected based on the classification of HIV positive and negative individuals. I propose for this research that a false negative HIV error represents a greater misclassification cost than a false positive result.

It is important to correctly predict HIV positive pregnant women, so as to enrol them on antiretroviral treatment to prevent the transmission of HIV to the unborn child. Medical intervention with antiretroviral (ARV) medications during pregnancy involves two related goals: reduction of perinatal transmission and treatment of maternal HIV disease. It is recommended that all pregnant HIV-infected women receive ARVs, regardless of their CD4 T-lymphocyte count or plasma HIV RNA copy number, to prevent perinatal transmission. On the other hand, wrongly predicting HIV negative subjects to be HIV positive might result in these individuals being enrolled for unnecessary treatment. On balance, I propose that this misclassification is tragic, but considerably less so than the misclassification of HIV positive pregnant women.

Fig. 4.44: Diagnostic charts at 25% threshold level

(a) Neural network Diagnostic Chart

           Predicted 0   Predicted 1   Total
Actual 0   2257          3537          5794
Actual 1   412           1950          2362
Total      2667          5489          8156

Classification rate of true HIV positive and negative = (1950/8156 + 2257/8156) * 100 = 51.58%

(b) Decision Tree Diagnostic Chart

           Predicted 0   Predicted 1   Total
Actual 0   0             5794          5794
Actual 1   2362          0             2362
Total      2362          5794          8156

Classification rate of true HIV positive and negative = (2362/8156 (HIV-) + 0/8156 (HIV+)) * 100 = 28.96%

(c) Regression Diagnostic Chart

           Predicted 0   Predicted 1   Total
Actual 0   0             5794          5794
Actual 1   2362          0             2362
Total      2362          5794          8156

Classification rate of true HIV positive and negative = (2362/8156 (HIV-) + 0/8156 (HIV+)) * 100 = 28.96%
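The classification rate for the neural network panel can be reproduced from its four cell counts, as sketched below. Sensitivity and specificity are added as standard companion measures. They are not reported in the diagnostic charts and are included here only to show how the two kinds of error separate.

```python
# Cell counts from the neural network diagnostic chart at the 25% threshold.
tn, fp = 2257, 3537   # actual HIV negative: predicted negative, predicted positive
fn, tp = 412, 1950    # actual HIV positive: predicted negative, predicted positive

total = tn + fp + fn + tp

classification_rate = 100 * (tp + tn) / total   # correctly classified, both classes
sensitivity = 100 * tp / (tp + fn)              # HIV positive correctly identified
specificity = 100 * tn / (tn + fp)              # HIV negative correctly identified

print(f"total = {total}")
print(f"classification rate = {classification_rate:.2f}%")
print(f"sensitivity = {sensitivity:.1f}%, specificity = {specificity:.1f}%")
```

Under these counts most HIV positive women are identified, at the price of many false positives, which matches the stated preference for avoiding false negative errors.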

A snapshot of the prediction accuracy at a threshold of 25% is given in Fig. 4.44. At this threshold level the neural network model performs better than the decision tree and the logistic regression models. It is important to note, however, that the prediction accuracy of the models depends closely on the threshold level, as shown by Fig. 4.45. In addition, Fig. 4.45 shows that the simultaneous prediction accuracy for both HIV positive and HIV negative individuals is consistently lower than the individual prediction accuracies of the disease states for the three modelling methodologies across the different threshold levels.

Fig. 4.45: Plot of correctly classified individuals across different threshold levels

(a) Correctly classified individuals by neural networks

(b) Correctly classified individuals by decision trees


3.7.2. Global Classification Rate for Coded data

Fig. 4.46: Threshold based accuracy plot

The global classification rates for the coded data were found to be similar to those obtained for the raw, non-coded data. This proved that coding the data had no significant effect on the modelling outcomes for the three classification techniques.

3.7.3. Lift Charts

Lift is a measure of the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the predictive model. A lift chart graphically illustrates the improvement that a mining model provides in comparison to a random guess. The comparison of lift scores for various segments of the dataset and for different statistical models provides a measure of which model performs best.

Fig. 4.47: Cumulative lift charts

[Legend: LogReg., NeuralNet., Tree, FullFactor.]

The cumulative lift chart shows the neural network model to be better than the logistic regression, full factorial and decision tree models. The observations are sorted according to predicted probability from highest to lowest, after which these observations are placed into ordered bins, each including 10% of the whole data.

The cumulative chart depicts the response rate for each decile when the score includes all of the responses for the deciles above it. That is why the lift curves down to the baseline at the 10th decile.
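The binning procedure described above can be sketched as follows: sort by predicted score, cut into ten equal bins, and divide the cumulative response rate by the overall response rate. The scores and outcomes below are hypothetical and serve only to show why the curve reaches the baseline value of 1 at the tenth decile. The function name `cumulative_lift` is illustrative.

```python
def cumulative_lift(scores, labels, n_bins=10):
    """Cumulative lift per bin: response rate in the top k bins / overall rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    sorted_labels = [labels[i] for i in order]
    overall_rate = sum(labels) / len(labels)
    bin_size = len(labels) // n_bins
    lifts = []
    for k in range(1, n_bins + 1):
        # The last bin takes any remainder so that all records are included.
        top = sorted_labels[: k * bin_size] if k < n_bins else sorted_labels
        lifts.append((sum(top) / len(top)) / overall_rate)
    return lifts

# Hypothetical predicted probabilities and outcomes (1 = HIV positive).
scores = [i / 20 for i in range(20, 0, -1)]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

lifts = cumulative_lift(scores, labels)
print([round(lift, 2) for lift in lifts])
```

At the tenth decile every record is included, so the cumulative response rate equals the overall rate and the lift is 1.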

The weakness of the cumulative lift chart is that it obscures the model’s performance at each level of the score. It is therefore important to investigate the noncumulative lift.

Fig. 4.48: Non-cumulative lift charts

The non-cumulative lift chart (Fig. 4.48) reveals that the predictive power of the neural network model drops gradually and is equal to that of both the logistic regression and full factorial models at the third decile (30% threshold level). Between the third and fifth deciles, the full factorial model outperformed both the logistic regression and the neural network models, while the decision tree performed worst. From the eighth decile to the hundredth percentile, the decision tree performed best, followed by the logistic regression and then the full factorial model. In this latter range, the neural network model performed worst. The conclusion that can be drawn from the lift chart is that the prediction accuracy of a modelling technique depends on the threshold level.

3.7.4. ROC Charts

The ROC curves for the model and validation data sets were created and analysed for the 2007 antenatal HIV seroprevalence database. Each curve has 100 points, each corresponding to a threshold. The model data is used to establish the cut-off. However, the superiority of a technique is determined by the model’s ability to classify new data, which is evaluated using the validation data.

Fig. 4.49: ROC curves for a) the model data and b) the validation data set. Of the 32 397 individuals in the database, 16 199 were training individuals, 8 099 were validation individuals and another 8 099 were test individuals. Of the 8 099 validation individuals, 3 300 (41%) were HIV positive and the rest HIV negative.

[Fig. 4.49 legend: AUC for Logistic regression = 0.59; AUC FullFactorial = 0.59; AUC for Neural network = 0.65; AUC for Decision tree = 0.50]

It is difficult to distinguish between the ability of the logistic regression and that of the neural network to predict the HIV positive individuals. The AUROC values for the two modelling techniques are marginally different, with a slight advantage for the neural network. The decision tree has the lowest AUROC and maintains a consistent predictive ability for both the HIV negative and HIV positive individuals. Therefore the neural networks are slightly superior, with an AUC of 0.65 compared to AUC values of 0.59, 0.59 and 0.50 for the logistic regression, full factorial and decision tree models respectively.

3.7.5. Scorecards

3.7.5.1. Performing Weights of Evidence (WOE) based binning, and selecting variables that showed sufficient predictive power.

This research used the Interactive Grouping node in SAS Enterprise Miner™ to analyse the predictive power of each demographic characteristic. Demographic characteristics with a low Information Value (IV) or Gini were removed. Characteristics with IV values greater than 0.05 were then binned so as to maintain a reasonable IV and ensure a logical relationship to the HIV status of the individual.

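The WOE calculation in SAS® Enterprise Miner™ is based on the standard formula below:

$$\mathrm{WOE}_i = \ln\left(\frac{p_i^{-}}{p_i^{+}}\right)$$

where $p_i^{-}$ is the proportion of HIV negative individuals falling in bin $i$ and $p_i^{+}$ is the proportion of HIV positive individuals falling in bin $i$.

For variable selection, the Information Value (IV), which shows the total strength of an input, is calculated using the following formula:

$$\mathrm{IV} = \sum_{i=1}^{n} \left(p_i^{-} - p_i^{+}\right) \times \mathrm{WOE}_i$$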

Table 4.29: Interpretation of Information values (IV) (Siddiqi, 2006).

Information value   Interpretation
<0.02               Not predictive
0.02 – 0.1          Weak predictive strength
0.1 – 0.3           Medium predictive strength
>0.3                Strong predictive strength

As mentioned, using prescribed scorecard scaling formulae and the WOE (Siddiqi, 2006), the posterior probabilities from the logistic regression are translated into an additive scorecard format.

Using the data from the 2007 antenatal HIV seroprevalence survey, the demographic characteristics were binned as shown in Table 4.30.

Table 4.30: Binning of demographic characteristics

Grouped variable   Group   Group label (coded level)   HIV +ve   HIV -ve   Total   WOE       HIV rate (%)
Age_Woman          1       0                           5280      10752     16032   -0.1841   32.93
Age_Woman          2       1                           2698      5410      8108    -0.1995   33.28
Age_Woman          1       -1                          1418      6839      8257    0.6781    17.17
Age_partner        1       0                           4515      9852      14367   -0.1149   31.43
Age_partner        2       1                           3126      6176      9302    -0.2143   33.61
Age_partner        1       -1                          1755      6973      8728    0.4843    20.11
Education          1       0                           2284      8742      11026   0.4469    20.71
Education          1       1                           6761      13570     20331   -0.1986   33.25
Education          1       -1                          351       689       1040    -0.2208   33.75
Parity             0       0                           3556      6864      10420   -0.2376   34.13
Parity             1       1                           2996      5529      8525    -0.2825   35.14
Parity             1       -1                          2844      10608     13452   0.4211    21.14
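The WOE values in Table 4.30 can be recomputed from the counts, which also makes the meaning of "proportion in bin" explicit. The sketch below uses the three Age_Woman rows and treats each proportion as the bin's share of all HIV negative (or all HIV positive) women, an interpretation that reproduces the tabulated WOE values. The IV is then summed over the bins using the standard formula and can be read against the bands in Table 4.29.

```python
import math

# (coded level, HIV positive count, HIV negative count) for Age_Woman, Table 4.30.
bins = [("0", 5280, 10752), ("1", 2698, 5410), ("-1", 1418, 6839)]

total_pos = sum(pos for _, pos, _ in bins)
total_neg = sum(neg for _, _, neg in bins)

woe = []
iv = 0.0
for level, pos, neg in bins:
    dist_neg = neg / total_neg   # share of all HIV negative women in this bin
    dist_pos = pos / total_pos   # share of all HIV positive women in this bin
    w = math.log(dist_neg / dist_pos)
    woe.append(w)
    iv += (dist_neg - dist_pos) * w
    print(f"coded level {level}: WOE = {w:.4f}")

print(f"Information value for Age_Woman = {iv:.3f}")
```

A positive WOE marks a bin in which HIV negative women are over-represented, which is why the lowest-risk age group carries the only positive value.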

3.7.5.2. Building a model using a logistic regression approach and selecting the best model, then transforming it into the scorecard format.

The Scorecard node in SAS Enterprise Miner™ was used to fit a logistic regression model and convert the output into a scorecard. A stepwise iteration method was used to obtain a final model with acceptable p-values (Table 4.31). The scorecard was scaled using scaling parameters of 10 000 : 1 odds. The final scorecard chosen had three characteristics, each with two attributes, relating to the prediction of the HIV status of an antenatal clinic attendee. The selected characteristics are shown in Table 4.32.

Table 4.31: Regression coefficients from the scorecard node

Parameter         Coded level   Parameter estimate   P value
Intercept                       -1.0151              <0.0001
WOE_Age_woman     0             0.2705               <0.0001
WOE_Age_woman     1             0.1153               0.0040
WOE_Age_partner   0             0.0360               0.1895
WOE_Age_partner   1             0.0822               0.0266
Education         -1            0.1573               0.0155
Education         1             0.1331               0.0003

Using the parameter estimates, the WOE and the scorecard scaling formula, a scorecard may be constructed. Scorecards may ease the interpretation of the detection of the HIV status of an individual; however, the output from the logistic regression is a probability score, which can also be used. In the case of a probability score, the threshold would be set as a maximum: entities with a probability score larger than the threshold would be flagged for investigation. The inputs that were included in the scorecard had significant p-values at a significance level of p<0.05. On that basis, the age of the partner coded 0, with a p-value of 0.1895, was not used in the scorecard. As shown in the scorecard below (Table 4.32), the age of the partner coded 0, i.e. males aged between 25 and 33 years, was not significant.

Table 4.32: Selected variables from the final scorecard

Variable               Attributes        Score points                      Score points
                       Actual   Coded    (WOE * regression coefficient)    (x -10 000)
Age_woman (years)      21-29    0        -0.1841 * 0.2705 = -0.0498        498
Age_woman (years)      >30      1        -0.1995 * 0.1153 = -0.0230        230
Age_partner (years)    25-33    0        -0.1149 * 0.036  = -0.0041        41
Age_partner (years)    >34      1        -0.2143 * 0.0822 = -0.0176        176
Educational (grades)   <8       -1       -0.2208 * 0.1573 = -0.0347        347
Educational (grades)   12-13    1        -0.1986 * 0.1331 = -0.0264        264
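The score points in Table 4.32 follow from multiplying each WOE by its regression coefficient and scaling by -10 000. The sketch below reproduces that arithmetic and then sums the points for one hypothetical combination of attributes. The decision threshold against which such a total would be compared is not given in the text and is therefore not assumed here.

```python
# (attribute, WOE from Table 4.30, regression coefficient from Table 4.31)
rows = [
    ("Age_woman 21-29", -0.1841, 0.2705),
    ("Age_woman >30", -0.1995, 0.1153),
    ("Age_partner 25-33", -0.1149, 0.036),
    ("Age_partner >34", -0.2143, 0.0822),
    ("Education <8", -0.2208, 0.1573),
    ("Education 12-13", -0.1986, 0.1331),
]

# Points = WOE * coefficient, scaled by -10 000 and rounded to whole points.
points = {name: round(-10000 * woe * coef) for name, woe, coef in rows}

for name, value in points.items():
    print(f"{name}: {value} points")

# Hypothetical attendee: woman aged 21-29, partner older than 34, grade 12-13.
example_total = (points["Age_woman 21-29"]
                 + points["Age_partner >34"]
                 + points["Education 12-13"])
print("example total:", example_total)
```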

The scorecard shows that women around the age of 30 were most at risk of acquiring the HIV infection. As expected, the older women seem to be least at risk of HIV infection. If the scores are added for each characteristic, it becomes evident that the age of the mother has the most influential effect on the risk of acquiring HIV infection. This observation was independently confirmed by the logistic regression, the screening design of experiments and the response surface methodologies. Amongst men, the age group between 25 and 33 years was found to be most at risk of an HIV infection.

3.7.5.3. Validation and Benchmarking the Scorecard

3.7.5.3.1. Validation of the Scorecard

The HIV risk scorecard was validated using a 25% hold-out sample. Several fit statistics were used for scorecard validation, such as cumulative percent response, cumulative percent of responses captured, the receiver operating characteristic (ROC) curve, empirical odds and Kolmogorov-Smirnov plots.

3.7.5.3.2. Cumulative Percent response curve

This lift curve shows the predictive power of the scorecard model, and provides a score that represents the likelihood of a pregnant woman being HIV positive. The scores are sorted into ten deciles, from left to right along the horizontal axis, with the highest scoring ten percent of individuals on the left edge of the chart.

Fig. 4.50: Cumulative percentage response chart

After training and validation, the scorecard model is able to predict the HIV status of a pregnant woman with an accuracy of 43% for the top 10% of the scores. The accuracy drops to 41% for the top 20% of the scores.

3.7.5.3.3. Cumulative percent of responses captured

Another useful chart compares the cumulative percent of responses captured as each decile is added to the target. In this project, the top two deciles (20%) captured about 30% of the HIV positive individuals, compared to a random baseline where two deciles capture 20% of the HIV positive individuals, which emphasizes the advantage of targeting.

Fig. 4.51: Cumulative percent captured response

3.7.5.3.4. Receiver operating characteristic (ROC) curves

The ROC chart is a graphical plot which shows the performance of the scorecard binary classification model as its discrimination threshold is changed. It is generated by plotting the fraction of true HIV positive individuals out of the positives (sensitivity) against the fraction of false positives out of the negatives (100 - specificity), at various threshold settings.

Fig. 4.52: ROC curves

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between two diagnostic groups, such as HIV positive and HIV negative. Therefore the ROC curve is used as a test to distinguish between HIV positive and HIV negative individuals. The closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test.
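The construction of ROC points and the AUC can be sketched as follows. Each distinct score is used as a threshold, giving one (false positive rate, sensitivity) pair, and the area is accumulated with the trapezoidal rule. The scores and outcomes are hypothetical, rates are expressed as fractions rather than percentages, and the function names are illustrative rather than taken from the software used in the study.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) at each distinct score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical scores and outcomes (1 = HIV positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

points = roc_points(scores, labels)
print("AUC =", auc(points))
```

An AUC of 0.5 corresponds to the diagonal of the chart, that is, to a model that ranks HIV positive and HIV negative individuals no better than chance.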

3.7.5.3.5. Empirical Odds Plot

This is used to evaluate the calibration of the scorecard. The chart plots the actual odds values, as they are found in the validation data, against the scorecard points. The chart is overlaid with a chart of the values that are predicted by the scorecard. The chart therefore determines those score bands where the scorecard is, or is not, sufficiently accurate. The odds are calculated as the logarithm of the number of HIV positive divided by the number of HIV negative individuals for each scorecard bucket range. Thus, a steep negative slope implies that the HIV negative individuals tend to get higher scores than the HIV positive individuals. As the scorecard points increase, so does the number of HIV negative individuals in each score bucket.
