
Prediction modelling for population conviction data


Lay-out & Printing: ProefschriftMaken

Cover illustration: Ilse Ferwerda and Ruben Otte

ISBN: 978-90-393-6724-7


Prediction modelling for population conviction data

Voorspellingsmodellering voor populatiestrafzaakdata (met een samenvatting in het Nederlands)

Proefschrift

to obtain the degree of doctor at Utrecht University on the authority of the rector magnificus, prof. dr. G.J. van der Zwaan, in accordance with the decision of the Doctorate Board, to be defended in public on Friday 24 March 2017 at 10.30 in the morning

by

Nikolaj Tollenaar


Promotoren (supervisors): Prof. dr. P.G.M. van der Heijden, Prof. dr. F.L. Leeuw

Manuscript committee:

Prof. dr. mr. C.C.J.A. Bijleveld (Free University Amsterdam), Prof. dr. S. van Buuren (Utrecht University)


Table of contents

1. Introduction

2. StatRec - performance, validation and preservability of a static risk prediction instrument

3. Comparing predictive performance of statistical, machine learning and data mining predictive models on reconviction data

4. Searching for improvements in predictive performance by applying machine learning in binary outcome and censored recidivism data

5. Using combinations of multiple imputation, propensity score matching and difference-in-differences for estimating the effectiveness of a prolonged incarceration and rehabilitation measure for high-frequent offenders

Bibliography

Nederlandse samenvatting (Summary in Dutch)

Dankwoord (Acknowledgements)


1.1 Introduction

One of the databases that the Scientific Research and Documentation Centre (WODC) of the Dutch Ministry of Justice (now the Ministry of Security and Justice) has developed since 1999 is the complete (anonymized) database of all Dutch offenders adjudicated since 1997 and their criminal justice proceedings. It is formally known as the OBJD ("Onderzoek- en Beleidsdatabase Justitiële Documentatie", i.e. the research and policy database of judicial documentation) and is internationally referred to as the Dutch Offender Index (DOI). This database, created for the exclusive purpose of (applied and more fundamental) scientific research, is regularly fed with data from the authority that stores and maintains the Dutch national criminal record information, the Judicial Information Service (Justid). These criminal record data are stored in the judicial documentation system of the Netherlands, the JDS (Justitieel Documentatiesysteem). The OBJD is updated quarterly and contains the complete criminal case history of all persons who have come into contact with the judiciary, as well as a number of their background characteristics. To date, it contains 8,205,862 natural persons aged 12 years and older who have been adjudicated since 1997. The OBJD was created with recidivism research in mind. To make the data more suitable for offender-oriented research, a new database, the so-called Recidivism monitor database, was created simultaneously (Wartna, 2009; Wartna, Blom & Tollenaar, 2011). In the recidivism database, statutory article information is used to form (main) crime types and maximum penalties, sanction information is summarized, and criminal history variables are formed.

The Recidivism monitor database is regularly remodeled: it is adapted to accommodate changes and extensions of the underlying judicial registration data and to reflect changes in the penal code over the years. Therefore, this database can serve as the foundation of a standard measurement method for evaluating recidivism outcomes using judicial data. In principle, recidivism studies that utilize these data are comparable because of the standardisation in methods, data pre-processing, measurement units and timing definitions (Wartna, 2009).


Firstly, these standardized data make it possible to indicate yearly how well the criminal justice system is doing regarding its 'clients' in terms of recidivism (Wartna et al., 2014). Secondly, interventions aimed at offenders can be evaluated in terms of their effectiveness (i.e. contributing to the reduction of recidivism) using data from this monitor (see e.g. Wermink et al., 2010). Thirdly, the rise of evidence-based justice policies and interventions has also resulted in increased attention to working with risk assessments and benchmarking by judicial institutions like prisons, probation offices, forensic clinics or district courts. These institutions can be benchmarked on their performance on recidivism outcomes. Finally, data from this monitor can also be used to develop estimates of individual recidivism risk to aid decision making or risk categorisation in the criminal justice process.

For attaining the goals associated with the aforementioned aggregate-level uses of the data, descriptive statistics (i.e. raw recidivism rates) will not suffice. For instance, because female offenders tend to have a lower recidivism prevalence than male offenders, an increased inflow of female offenders starting from a certain year will dampen the overall recidivism rate afterwards. This shift in the offender sex ratio is unrelated to offender-targeted criminal justice policy and should be (statistically) corrected for.

Therefore, in all these applications of recidivism data, forms of prediction are required to account for differences in the risk factors of individuals by adjusting effect estimates for different case-mixes in treatment groups.

An optimal prediction can only be obtained by finding a well-fitting model for the data. That is, the model should adequately capture the most important relations between the background characteristics and the outcome. A prediction model can be formalized mathematically as follows:

𝒚 = 𝑓(𝑿|𝜽), where we try to find argmin_𝜽 {𝐿[𝒚, 𝑓(𝑿|𝜽)]},

where 𝜽 is the parameter vector, 𝑿 is the matrix of covariate values, 𝒚 is the vector of observed outcomes, and 𝐿 is a function expressing the difference between the observed and the predicted values. In words: for our data we can specify an appropriate function 𝑓 that relates the inputs 𝑿 to the outcomes 𝒚 with an unknown parameter vector 𝜽. We find the optimal 𝜽 by minimizing a function with respect to 𝜽 that quantifies a criterion related to the prediction error, i.e. the loss function 𝐿. The loss function specifies a form of prediction error severity, the discrepancy between 𝒚 and 𝑓(𝑿|𝜽). Two examples of a loss function are the squared loss for linear regression:

𝐿[𝒚, 𝑓(𝑿|𝜽)] = [𝒚 − 𝑓(𝑿|𝜽)]′[𝒚 − 𝑓(𝑿|𝜽)],

and the logistic loss function (the negative Bernoulli log-likelihood) for logistic regression:

𝐿[𝒚, 𝑓(𝑿|𝜽)] = −𝟏′[𝒚 ∘ log 𝑓(𝑿|𝜽) + (𝟏 − 𝒚) ∘ log(𝟏 − 𝑓(𝑿|𝜽))],

where ∘ denotes the element-wise product.


By optimizing the loss function with respect to the parameters 𝜽, one arrives at the optimal prediction according to that loss function and the chosen functional form.
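As a concrete illustration of this scheme, the following minimal sketch (not taken from the thesis; it assumes only NumPy and SciPy are available, and the variable names and simulated data are purely illustrative) fits a logistic prediction model by numerically minimizing the logistic loss:

```python
import numpy as np
from scipy.optimize import minimize

def f(X, theta):
    """Logistic model: predicted recidivism probabilities for covariate matrix X."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def logistic_loss(theta, X, y, eps=1e-12):
    """Negative Bernoulli log-likelihood of the observed binary outcomes y."""
    p = np.clip(f(X, theta), eps, 1 - eps)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: 1,000 'offenders', an intercept plus two background covariates.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 2))])
y = rng.binomial(1, f(X, np.array([-0.5, 1.0, -0.8])))

# theta_hat = argmin_theta L[y, f(X | theta)]
theta_hat = minimize(logistic_loss, x0=np.zeros(X.shape[1]), args=(X, y), method="BFGS").x
predicted_risk = f(X, theta_hat)  # individual recidivism risk estimates in [0, 1]
```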

In sections 1.1.1 to 1.1.4, I will elaborate on the applications of prediction models in the judicial context.

1.1.1 Modelling recidivism trends

In many countries, trends in recidivism statistics provide feedback to the criminal justice system with respect to how well it is handling perpetrators that have entered the system (see e.g. Ministry of Justice, 2013; Scottish Government Research Group, 2015; Brå, 2014). Although interesting in themselves, when the actual influence of the judicial system on recidivism rates is of interest, simply calculating raw (unadjusted) recidivism trends is not informative enough. Underlying offender populations may change with respect to their background characteristics over the years. These individual characteristics can be strongly related to recidivism (see e.g. Gendreau, Little & Goggin, 1996; Cottle, Lee & Heilbrun, 2001). Therefore, changes in the background characteristics of the offender population can be an important factor that influences recidivism rates, irrespective of the direct influence of the criminal justice system. To adjust for these changes, (statistical) regression techniques are needed that can estimate the expected individual responses given a set of covariates. These models can be used to estimate the expected recidivism rate by fixing the offender population covariate values on those of a specific year (see e.g. Wartna & Tollenaar, 2006).

Another application in this context is the modelling of recidivism with the aim of forecasting the recidivism rate for years that have not yet been observed. As will be evident, in order to measure recidivism, one has to wait several years. This waiting time is at odds with the interests of criminal justice policies, as these policies require statistics as recent as possible in order to react (just) in time, notwithstanding the need for data on longer historical trends. A solution for this incongruence between data recency and information needs is to extrapolate the expected recidivism for a very recent cohort using a prediction model fitted on earlier data, given the background characteristics of this recent cohort.
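The following sketch illustrates both uses on simulated data (it is not the method of Wartna & Tollenaar, 2006; the coefficient values, cohort years and sizes are invented for illustration): a model fitted on earlier cohorts is applied to the case mix of each cohort, yielding case-mix-adjusted expected rates and, for the most recent cohort, a forecast.

```python
import numpy as np

def risk(theta, X):
    """Individual recidivism probabilities under a fitted logistic model."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

# theta_hat: coefficients estimated on earlier, fully observed cohorts
# (for instance with the logistic-loss sketch above); illustrative values here.
theta_hat = np.array([-0.4, 0.9, -0.6])

# Covariate matrix per conviction cohort; the 2014 cohort has no observed recidivism yet.
rng = np.random.default_rng(1)
cohorts = {year: np.column_stack([np.ones(800), rng.normal(loc=0.1 * i, size=(800, 2))])
           for i, year in enumerate((2012, 2013, 2014))}

# Expected (case-mix-only) recidivism rate per cohort: differences between these rates
# and the observed rates reflect change beyond the shifting offender composition;
# for 2014 the expected rate serves as a forecast.
expected = {year: risk(theta_hat, X).mean() for year, X in cohorts.items()}
print(expected)
```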


1.1.2 Estimating the effect of judicial interventions

The criminal justice system continuously imposes regular sanctions and applies (standard) interventions to offenders. One of the goals of sanctions and interventions is to reduce criminal behavior. Government officials responsible for judicial policy regularly devise and implement sanctions and interventions that are potentially better suited to different (sub)populations of offenders. As these implemented sanctions and interventions may be ineffective or less effective than existing approaches, or than is assumed, they are often evaluated in terms of effectiveness, mostly measured through the recidivism rate.


When offenders are randomly allocated to interventions or sanctions, i.e. using a randomized controlled trial (RCT) design, an unbiased and consistent treatment effect estimate can be obtained. RCTs are often seen as the gold standard of research designs or even as a moral imperative (Weisburd, 2003), although others disagree (Berk, 2005). By applying random allocation, the intervention and comparison groups are ensured to be comparable on unobserved characteristics that may be related to the outcome. Therefore, in ideal circumstances, RCTs can provide unbiased estimates of the (relative) treatment effect.

However, in current judicial practice, it can be hard to work with randomized controlled trials due to diverse ethical, political and practical (implementation) barriers (Weisburd, 2000; Shepherd, 2003; Berk, 2005). Some of these are:

• Random allocation to sanctions can be seen as a violation of the principles of criminal sentencing (de Roos, 2007), namely equality, proportionality and individualisation. For instance, in the case where a more severe sanction is administered at random, attorneys might plead for a different sanction than would be administered through randomisation;

• When judiciary officials already have an idea who would benefit from the special treatment, they may perceive denying subjects this treatment as being unjust. They may consequently refuse to assign subjects randomly;

• The implementation of randomisation may fail for various reasons of a more practical nature, like the failure to maintain the cooperation of policy makers and program staff, or diverse logistical problems (see e.g. Asscher, Deković, van der Laan, Prins & van Arum, 2007; Feder, Jolin & Feyerherm, 2000).

Even when an experiment can be organized, there can be issues related to generalisation over different subjects, times, settings, related interventions and related outcomes (Berk, 2005).

Therefore, on many occasions, researchers will end up with a nonequivalent control group that is often formed afterwards, restricting the research design choice to one of the diverse quasi-experimental designs. In this situation, approaches like regression correction, instrumental variable regression (IVR; Angrist & Krueger, 2001) or propensity score matching (PSM; Rosenbaum & Rubin, 1983) are required. All approaches require adequate statistical modelling, although the latter is less dependent on model quality. In PSM, the primary goal is to balance the covariates between the treatment and comparison groups, more so than maximizing predictive power (Heinrich, Maffioli & Vásquez, 2010). When a control group is entirely absent, a population recidivism model can be a solution to obtain an effect estimate. A model developed on the total offender population or the subpopulation of interest can generate so-called expected recidivism rates, against which the observed recidivism rate of the intervention group can be compared (see e.g. Gottfredson, 1987; Schmidt & Witte, 1980).
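A minimal sketch of the PSM idea on simulated data follows (assuming scikit-learn is available; it is not the matching procedure used later in this thesis, and the data-generating choices are arbitrary): estimate each offender's probability of receiving the intervention, match every treated offender to the untreated offender with the nearest propensity score, and compare recidivism between the two matched groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)

# Toy data: 2,000 offenders, three covariates, non-random assignment to an intervention.
X = rng.normal(size=(2000, 3))
treated = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))).astype(bool)
recidivism = rng.binomial(1, 1 / (1 + np.exp(-(0.6 * X[:, 0] + 0.4 * X[:, 2] - 0.3 * treated))))

# 1. Estimate propensity scores: the probability of treatment given the covariates.
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# 2. Match each treated offender to the untreated offender with the closest score.
untreated_idx = np.where(~treated)[0]
nn = NearestNeighbors(n_neighbors=1).fit(ps[untreated_idx].reshape(-1, 1))
_, matches = nn.kneighbors(ps[treated].reshape(-1, 1))
control_idx = untreated_idx[matches.ravel()]

# 3. Compare recidivism between the treated group and its matched control group.
effect = recidivism[treated].mean() - recidivism[control_idx].mean()
print(f"Estimated effect on the recidivism rate: {effect:.3f}")
```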

Another application of recidivism modelling in the judicial field is the estimation of incapacitation effects. When offenders are incarcerated, they are unable to commit new crimes and are said to be incapacitated. This is useful information for policy makers, especially when dealing with high-frequency offenders. By combining incarceration data and conviction¹ data, recidivism frequency can be modelled using models for count data (discrete distribution outcomes) like Poisson regression or negative binomial regression.

¹ The term conviction is usually restricted to penal cases dealt with by a judge. In the Netherlands, a large proportion of the total number of penal cases can be dealt with by the public prosecutor imposing fines, community service orders and training programmes. These disposals are called transacties (i.e. 'transactions') and concern less severe offenses. Since February 1st 2008, the public prosecutor is allowed to


Using these models, the effects of prolonged incarceration can be estimated (e.g. Spelman, 1993; Blumstein et al., 1986).
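As an illustration of such a count model, the sketch below fits a Poisson regression by maximum likelihood (a simplified stand-in, not the incapacitation models of Spelman or Blumstein et al.; it assumes NumPy and SciPy, and the simulated counts and coefficients are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize

def poisson_negloglik(theta, X, y):
    """Negative Poisson log-likelihood with a log link: E[y] = exp(X @ theta)."""
    eta = X @ theta
    return np.sum(np.exp(eta) - y * eta)  # the constant log(y!) term is dropped

# Toy data: intercept plus two covariates, observed conviction counts per offender.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
y = rng.poisson(np.exp(X @ np.array([0.2, 0.5, -0.3])))

theta_hat = minimize(poisson_negloglik, np.zeros(3), args=(X, y), method="BFGS").x
expected_frequency = np.exp(X @ theta_hat)  # expected conviction frequency per offender
```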

1.1.3 Benchmarking judicial institutions

Performance benchmarking in the judicial field can be beneficial. The process can lead to identifying problems as well as areas in which individual institutions excel. Examples of levels at which benchmarking can be applied are probation regions, district courts, prisons or prison divisions (see e.g. Molleman, 2014). When certain organisational units perform better or worse than others, further research can shed light on the underlying causes. The resulting knowledge can be used to improve the performance of all units. An obvious indicator on which these institutions could be benchmarked is recidivism. Therefore, recidivism models would be required to correct for local offender population characteristics. By correcting the local recidivism rates, the institutions are made comparable on this indicator.

1.1.4 Models for optimal risk estimates for individuals

All the previously described uses of recidivism model estimates regard the aggregate or group level. However, the estimation of individual offender recidivism risks is also important for an adequate functioning of the judicial system. Using individual risk estimates can help in the optimal allocation of treatments, sanctions, early release and suspended sentences. For instance, high-risk individuals may benefit more from recidivism reduction interventions. Naturally, the optimality of the allocation is highly dependent on how accurate and predictive the individual risk estimate is.

In the early days of risk assessment, risk factors were identified using cross-classification tables of factors of recidivism risk, so-called risk tables. Their limitation was that only a handful of categorical or categorized continuous predictors could be incorporated. If more predictors were considered, the cells of the cross tables would become too sparse to provide a reliable risk estimate. To remedy the restriction on the number of predictors, one of the first rudimentary composite risk scores was developed by Burgess (1928). The Burgess scale was a sum of indicator variables found to be associated with recidivism, using unit weights. However, using an arbitrary weighting scheme like unit weighting will not result in an optimal risk estimate.


Rice and Cormier, 1998), which is also the case for general recidivism (Howard, Francis, Soothill & Humphreys, 2009).

These instruments are typically applied to aid decision making with respect to individual offenders or to categorize offenders into risk groups. In the first case, when decision making is involved, offenders are classified as recidivists or non-recidivists. For this purpose, a precise estimate of recidivism is of lesser concern; a scale should be able to classify individuals as either expected recidivists or non-recidivists, given a well-chosen cut-off point on the scale. In the second situation (the grouping of risks), precise probabilities or actuarial risks are required. Both risk categorisation and classification are generally used to allocate scarce judicial resources like offender treatments. Other applications are curbing the risks of release, as in pre-trial release, or the informed granting of early release from prison.

For the case of individualized prediction, the notion of static versus dynamic predictors is important. Static predictors are variables of perpetrators that cannot change (e.g. gender, country of birth) or can only change in one direction (e.g. age, number of previous convictions). Dynamic predictors are predictors that can change over time (e.g. having a job, being addicted). The prediction literature shows that static factors tend to be superior for prediction and that dynamic factors do not add predictive power on top of static factors (see e.g. Philipse, Koeter, van der Staak & van den Brink, 2006). From a treatment or supervision perspective, however, static information is less useful. When only static risk assessments are available, there is no way to know whether risks have changed. Static risks will generally only increase or stay constant, whereas dynamic risks are subject to change. Therefore, if changes in risk need to be detected, dynamic predictors are mandatory. Static predictors might then be left out of the equations to allow the dynamic predictors to contribute to the prediction.

1.1.5 Development of prediction modelling

As mentioned in the previous paragraph, the early unweighted risk score of Burgess (1928) was one of the first, though very crude, ways to establish (individualized) recidivism risk. Such unweighted risk scores ignore the intercorrelation of variables and treat each predictor as equally important, leading to biases in prediction. Statistical models decompose the risk by estimating the size of the unique contribution of each predictor, and yield asymptotically unbiased estimates.


models. The differences between the approaches are summarized in Table 1.1 (adapted from Shmueli, 2010).

Table 1.1 Difference in approach of explanatory and predictive models

                              Model type
                              Explanatory                   Predictive
Function f represents         Causation                     Association
Covariate role                Factor                        Predictor
Modelling f                   Theory                        Data
Variable/model selection      Parsimonious                  Predictive
Minimizes                     Bias                          Bias + variance
Focus                         Retrospective                 Prospective
Performance                   Strength of relationships     Holdout sample metrics

In explanatory modelling, causation is the main concern; when coefficients of the factors in the model are large, this is an indication of an adequate model. When there is insight into the causal mechanism(s), it is understood why things work the way they do and manipulation is possible. In explanatory modelling, the most parsimonious model is preferred. However, models should not be too parsimonious, because important left-out factors bias the estimates of the parameters in the model. Estimating unbiased relations is the main concern of explanatory modelling. Standard statistical inference like t-tests for coefficient significance and likelihood-based information criteria are used. Only variables (factors) are admitted that are theoretically linked to the outcome. The theory, in combination with statistical inference and parsimony, will eventually decide which model is best. The quality of the model is determined by the strength of its relationships and by the fit to the theoretical statistical properties of the model.


On the other hand, prediction models can include any variable (predictor), as long as it is associated with (i.e. is predictive of) the outcome and is also available to the one who needs to predict. A theoretical basis for why a variable should cause the outcome is not required. One does, however, have to assume that the relation between variables and outcomes is relatively stable over time. Unlike explanatory modelling, predictive modelling minimizes both bias (systematic deviation) and variance (unsystematic deviation), and may thus accept some bias in exchange for less variance. Rules for standard statistical inference are of lesser importance; textbook statistical advice should not be followed to obtain an optimally predicting model (see e.g. the comment of Efron in Breiman, 2001). For prediction models, statistical model assumptions need not necessarily hold, and statistical significance should not guide inclusion or exclusion of predictors in the model. Significant predictors might deteriorate predictive performance, and statistically insignificant variables may still make a relevant contribution to the predictive performance. Furthermore, parsimony is not required; in principle there is no reason but predictive performance to prefer a simpler model over a complex model. However, including too many variables will result in an overly complex model that does not generalize. The quality of the model is determined by its predictive performance on previously 'unseen' or future data, using criteria that reflect the discrepancy between observed and predicted values.
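A minimal sketch of such holdout evaluation on simulated data follows (assuming scikit-learn; the predictors, sample size and the choice of AUC and the Brier score as criteria are illustrative, not prescribed by the thesis):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss

# Toy recidivism data; in practice X would hold criminal-history and background predictors.
rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 6))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] - 0.7 * X[:, 2]))))

# Hold out part of the data; quality is judged on predictions for these 'unseen' cases.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
p_test = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

print("AUC on holdout data:", roc_auc_score(y_test, p_test))             # discrimination
print("Brier score on holdout data:", brier_score_loss(y_test, p_test))  # probability accuracy
```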

The appearance of non-statistical prediction

Statistics, as a relatively young discipline, was born in a situation in which there were no large data sets. Statistical theory depends heavily on assumed theoretical distributions of random variables, and obtains estimates on the basis of asymptotic (i.e. when n is infinitely large) results. The estimates of the coefficient(s) of the model are based on the likelihood that the data are observed given the presupposed data-generating model, the so-called maximum likelihood estimates. With the rise of computing power and working memory size since the 1980s, classical parametric statistics were augmented with more and more computer-intensive methods, i.e. computational statistics. Examples of these are bootstrapping, nonparametric regression and the Gibbs sampler. At the same time, in a branch of artificial intelligence (AI), non-statistical algorithms were developed further, like the multilayer perceptron, a form of artificial neural network (ANN). These were developed with the intention of enabling computers to induce and make decisions from data instead of having to be explicitly programmed with knowledge and rules. ANNs were inspired by the interconnectivity found in the brain. The field within computer science concerned with learning from data is nowadays known as machine learning. An important attribute of models developed in machine learning is that they minimize prediction error directly instead of maximizing a theoretical likelihood that quantifies the probability of the data given the model.


In the 1990s and 2000s, many different classification (as well as regression and clustering) algorithms for machine learning were developed that basically evolved outside the field of statistics. These were applied to typical AI problems, like computer/robot vision, speech recognition or character recognition. These algorithms all optimize some index of predictive performance, mostly the accuracy (i.e. the percentage classified correctly). They are typically data-oriented instead of model-oriented (Breiman, 2001). From a computer science perspective, one seeks a lossy (i.e. irreversible because of loss of information) compression of the data that discards noise and keeps the most salient features of the data (Rissanen, 1978; Rissanen & Tabus, 2004). With the increasing availability of software to actually implement these algorithms, they were increasingly applied outside the typical AI problems. Moreover, hybrid algorithms appeared that are both statistical and machine learning in nature. Therefore, Friedman, Hastie and Tibshirani (2001) coined the term statistical learning. Nowadays, machine learning and statistics are more and more integrated; machine learning is published in statistical journals and vice versa. The new techniques, basically nonparametric search algorithms, hold the promise of improved predictive performance. Many comparison studies of algorithms have been published, with mixed results on which algorithm performs best over a range of data sets. The criminological research field has yielded only a small number of comparison studies. These were, however, limited in the set of techniques used (mostly neural networks and decision trees) and the performance indices used (mostly accuracy and/or AUC), and were inconclusive on whether the promise of improved prediction by flexible machine learning methods holds for the criminal justice context. The large-scale population data collected by the Dutch judicial system that were described earlier make reliable large-scale research on this topic possible.
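A minimal sketch of such a comparison on synthetic data is given below, assuming the Python library scikit-learn and only two algorithms; the actual analyses in chapter 3 use the Dutch registration data, a much larger set of algorithms and a broader set of performance criteria.

```python
# Minimal sketch: out-of-sample AUC of a classical and a flexible classifier.
# Synthetic data and scikit-learn are assumed; this is illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

models = [("logistic regression", LogisticRegression(max_iter=1000)),
          ("random forest", RandomForestClassifier(n_estimators=500, random_state=1))]
for name, model in models:
    model.fit(X_train, y_train)
    p_test = model.predict_proba(X_test)[:, 1]   # predicted probabilities on unseen data
    print(name, "test AUC:", round(roc_auc_score(y_test, p_test), 3))
```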

Survival analysis


censoring, advantages of regression models for censored data are the ability to estimate the probability of recidivism at any point in time up to the maximum observed follow-up, as opposed to a single time point in models for binary outcomes. Moreover, these models can provide an estimate of the median or mean survival time for individual subjects or for groups. In more recent years, adaptations of machine learning algorithms have been developed that can handle censored recidivism data. It is still unknown how well machine learning algorithms for censored data perform on criminological data, especially compared to classical survival models like the Cox regression model or parametric survival time models.
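For the Cox regression model mentioned above, these quantities follow directly from the standard formulation (generic notation, added for illustration):

$$h(t \mid x) = h_0(t)\,\exp(x^\top \beta), \qquad S(t \mid x) = S_0(t)^{\exp(x^\top \beta)},$$

so that the predicted probability of recidivism within $t$ years is $1 - S(t \mid x)$ for any $t$ within the observed follow-up, and the median survival time is the smallest $t$ for which $S(t \mid x) \le 0.5$.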

1.1.6 Recidivism modelling versus treatment assignment modelling

As already mentioned in paragraph 1.1.4, when the assignment of offenders to interventions is not under the control of the researcher, models can be used to estimate treatment effects, using several approaches. First, the effect can be estimated in an inferential statistical model (i.e. regression correction) with a treatment dummy and recidivism as the outcome. Second, a population recidivism prediction model can be used to generate the expected recidivism rates had the treatment not been administered (Gottfredson, 1987; Schmidt & Witte, 1980). Finally, a control group can be formed post hoc by matching on characteristics of the offenders that are related to the outcome, recidivism.

In the first case, regression correction, a control group is needed. An explanatory regression model is fitted using the treatment indicator as a factor (i.e. the effect parameter), along with possible confounding variables. This is analogous to an analysis of covariance (ANCOVA). The treatment effect is adjusted for initial differences between the experimental and control group on characteristics associated with the treatment outcome, recidivism. This method yields an unbiased estimate of the treatment effect only when all important covariates are included and the model is adequately specified.
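A minimal sketch of this approach is given below, using the Python library statsmodels on synthetic data; the variable names and coefficients are hypothetical and not taken from any of the studies in this thesis.

```python
# Regression correction sketch: logistic regression of recidivism on a treatment
# dummy plus covariates. All data and variable names are synthetic/hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 60, n),
    "n_prior_convictions": rng.poisson(3, n),
    "treated": rng.integers(0, 2, n),
})
# Synthetic outcome: risk decreases with age, increases with priors, lowered by treatment.
lin_pred = -0.5 - 0.03 * (df["age"] - 30) + 0.15 * df["n_prior_convictions"] - 0.4 * df["treated"]
df["reconvicted"] = rng.binomial(1, 1 / (1 + np.exp(-lin_pred)))

# The coefficient of 'treated' is the treatment effect, adjusted for initial
# differences on the covariates (analogous to ANCOVA on the logit scale).
model = smf.logit("reconvicted ~ treated + age + n_prior_convictions", data=df).fit()
print(model.params)
```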


for the subpopulation may be missing, or the relation of the variables with the outcome may be different for the subpopulation.

To overcome these possible modelling deficits, the third approach entails finding matched control subjects for treatment subjects, using covariates related to recidivism. The treatment effect can then be simply estimated by univariate statistics, as if one had done an experiment. When there are many variables to match on, it becomes impossible to find an exact match for every treatment subject (i.e. the curse of dimensionality). A solution to this problem is propensity score matching (PSM; Rosenbaum & Rubin, 1983). Instead of matching on the separate characteristics, matching is performed on the propensity score. Instead of recidivism, the treatment indicator is the dependent variable of the model. The resulting model is known as the propensity model. Its predicted probability of the outcome is the propensity of being in the treatment condition. The modelling step in this method is not susceptible to omitted variable bias because the model is basically a prediction model, although PSM is still not a solution to important hidden confounding factors. The final treatment effect can be estimated in the following three ways:

1. Matching on the propensity score;

2. Weighing treatment and control observations by inverse probability weighting;

3. Fitting a regression on the outcome (recidivism) with the variables, the treatment indicator and the propensity score as an extra covariate.

In the first and second way, the actual outcomes of the treatment and control condition can be used to estimate the observed treatment effect. The third way provides an unbiased estimate of the treatment effect by estimating a treatment parameter. Matching is the most frequently used method in the PSM framework. Unlike the regression correction approach, there is less concern over the ‘correctness’ of the model, because the ultimate test of successful matching is the equivalence of the treatment and control groups on the average background characteristics. In practice, this can be attained by using simple linear models (see e.g. Zhao, 2008).
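A minimal sketch of the first way, matching on the propensity score, is given below on synthetic data, assuming scikit-learn for the propensity model and simple nearest-neighbour matching; chapter 5 describes the procedure actually used in this thesis.

```python
# Propensity score matching sketch: model treatment assignment from covariates,
# then match each treated subject to the control with the closest propensity score.
# All data and variable names are synthetic/hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n = 3000
df = pd.DataFrame({
    "age": rng.integers(18, 60, n),
    "n_prior_convictions": rng.poisson(4, n),
})
# Non-random assignment: offenders with more prior convictions are treated more often.
p_treat = 1 / (1 + np.exp(2 - 0.4 * df["n_prior_convictions"]))
df["treated"] = rng.binomial(1, p_treat)
df["reconvicted"] = rng.binomial(1, 1 / (1 + np.exp(1 - 0.2 * df["n_prior_convictions"])))

# 1. Propensity model: probability of being treated given the covariates.
X = df[["age", "n_prior_convictions"]]
df["ps"] = LogisticRegression(max_iter=1000).fit(X, df["treated"]).predict_proba(X)[:, 1]

treated = df[df["treated"] == 1]
controls = df[df["treated"] == 0]

# 2. Nearest-neighbour matching (with replacement) on the propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(controls[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
matched_controls = controls.iloc[idx.ravel()]

# 3. Simple estimate of the treatment effect after matching (difference in rates).
effect = treated["reconvicted"].mean() - matched_controls["reconvicted"].mean()
print("estimated effect on the reconviction rate:", round(effect, 3))
```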

The assumptions one should be willing to make are that treatment allocation is random given the observed covariates and that there is no hidden (residual) bias.


leaving pre-treatment differences on the t=0 measurements. This, however, requires a different assumption, the parallel slopes assumption. This assumption states that the trend in the pre- and post-treatment outcome measurements of the treatment and control condition would have been equal, had there not been a treatment in the treatment condition.
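Under this assumption, the treatment effect is the difference-in-differences contrast (standard notation, not specific to the study in chapter 5):

$$\hat{\delta}_{DiD} \;=\; \big(\bar{Y}^{\,treatment}_{post} - \bar{Y}^{\,treatment}_{pre}\big) \;-\; \big(\bar{Y}^{\,control}_{post} - \bar{Y}^{\,control}_{pre}\big),$$

which differences out stable pre-existing differences between the groups and attributes any deviation from the common trend in the treatment group to the treatment.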

1.2 Contribution of this thesis

This monograph is concerned with different uses of prediction modelling on (Dutch) judicial registration data. It covers determining the usefulness and preservability of a prediction instrument used in the judicial context, determining the prediction model that optimally predicts different types of recidivism using classical and modern models differing in flexibility, and estimating the effect of an intervention in a context of missing values and observational data.

1.2.1 Performance, validation and preservability of a static risk instrument

Chapter 2 describes the development of a general recidivism risk assessment scale used by the Dutch probation services, the StatRec. It is an instrument based solely on a sparse set of static predictors and is used nationwide for categorizing individuals into risk groups. It is calculated from the parameters of a logistic regression model with four-year reconviction (yes/no) as the outcome. In the chapter the following is demonstrated:

• The performance of the instrument;

• The stability of the estimates over time;

• The performance of the scale in different regions and in different nonrandom subsamples (Schmidt & Witte, 1988);

• The additional predictive power of dynamic risk factors over purely static predictors, demonstrated by utilizing systematically scored criminal case file information.

The study shows that when using only a limited set of predictors, it is possible to obtain a reasonable predictive performance for four-year reconviction on Dutch judicial data. When false positives are considered as severe as false negatives, 70% of the offenders can be classified correctly.
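As an illustration of how such a scale works in principle, the sketch below converts a linear predictor into a reconviction probability and a risk classification; the coefficients and predictor names are hypothetical and are not the StatRec items or weights described in chapter 2.

```python
# Illustration of a static risk scale derived from a logistic regression model.
# Coefficients and predictors below are hypothetical, not the actual StatRec weights.
import math

coefficients = {
    "intercept": -1.2,
    "first_conviction_under_18": 0.6,   # 1 if first conviction before age 18
    "n_prior_convictions": 0.15,
    "male": 0.4,
}

def predicted_risk(offender):
    """Four-year reconviction probability from the logistic linear predictor."""
    lp = coefficients["intercept"]
    for name, weight in coefficients.items():
        if name != "intercept":
            lp += weight * offender[name]
    return 1 / (1 + math.exp(-lp))

offender = {"first_conviction_under_18": 1, "n_prior_convictions": 6, "male": 1}
p = predicted_risk(offender)
# With equal costs for false positives and false negatives, classify at p >= 0.5.
print(round(p, 2), "high risk" if p >= 0.5 else "low risk")
```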


The magnitude of some of the estimated parameters shows a substantial amount of variation over time.

Results indicate that the scale predicts equally well in different court districts, indicating relative independence from local contexts. Furthermore, the results indicate that the inclusion of dynamic risk predictors does not substantially improve the predictive performance. Therefore, the scale is suitable as a benchmark for developing other risk assessment instruments.

1.2.2 Finding the best predicting model for developing three risk scales

Chapter 3 considers the development of three models of recidivism. Apart from the general recidivism model, models for violent and sexual recidivism are described. In order to find an optimal prediction model, the model selection in the development of these models encompassed a range of modern statistical, machine learning and predictive data mining techniques: classification trees, multivariate adaptive regression splines (MARS), flexible discriminant analysis, adaptive boosting (adaBoost), logitBoost, partial least squares (PLS), k-nearest neighbours, single hidden layer artificial neural networks and support vector machines (SVM). It explores the added value of these approaches over classical statistical modelling, specifically logistic regression and linear discriminant analysis. The adequacy of these models is evaluated on a wide range of performance criteria. This study shows that when classical statistical binary outcome models are specified with care, flexible nonparametric models are unable to achieve a substantially better performance on the Dutch registration data. Even when one of the dimensions of predictive performance (discrimination, calibration or clinical usefulness) is considered more important than the others, a simple nonflexible model suffices.
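Two of the criteria referred to here can be written compactly (standard definitions, added for reference): the AUC, which measures discrimination, and the Brier score, which also reflects calibration. With $\hat{p}_i$ the predicted probability and $y_i \in \{0,1\}$ the observed outcome,

$$AUC = P\big(\hat{p}_i > \hat{p}_j \mid y_i = 1,\, y_j = 0\big), \qquad \text{Brier} = \frac{1}{n} \sum_{i=1}^{n} \big(\hat{p}_i - y_i\big)^2 .$$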
