Spatial epidemiology of diseases of the nervous system : A Machine Learning approach


Faculty of Electrical Engineering, Mathematics & Computer Science

Spatial epidemiology of diseases of the nervous system :

A Machine Learning approach

Konstantinos Kotzias
Final Thesis

MSc. Business Information Technology
October 2021

Graduation Committee:

dr. Bukhsh F.A. (Faiza)
dr. Abhishta A. (Abhishta)

Faculty of Electrical Engineering, Mathematics & Computer Science
University of Twente
Drienerlolaan 5
7522 NB Enschede
The Netherlands


Preface

This thesis marks the end of another important chapter of my life so far: being a student at the University of Twente. It is a chapter that started in June 2019, when I was admitted. During this journey I went through many phases: excited, motivated, overwhelmed, tired, amused, stressed, ambitious, desperate and many more. It had a lot of ups and downs, but it helped me realize a lot about myself and my surroundings and improve as a person, and now, two years later, I am proud of and grateful for this experience.

Earlier this year I asked myself, "What do I want my graduation project to be about?" At that moment I still felt I was at a very early phase of this process.

However, step by step, I managed to bring the pieces of that puzzle together and produce this result, which I believe combines some concepts I feel really passionate about: the healthcare domain, neurological disorders and Data Science.

This study could not have been realized without the valuable support of my supervisor, Faiza. It has been a very fruitful collaboration and partnership, which started more than a year ago with my internship and has continued until now. I feel really grateful to have had such supervision, which exceeded my expectations and went well beyond the conventional role of a "supervising professor". I would also like to express my gratitude to my second supervisor, Abhishta, who joined at a quite crucial phase of this project and, with his technical expertise and feedback, made a significant contribution. I am also thankful to all my professors at the UT during those two years for the great job they did under such hard circumstances.

Of course, a big "thank you" to all the members of my family for their support. Even though we are so far away, without them I wouldn't have reached the end of this journey. They have been there for me whenever I needed them. Lastly, to all my friends in Greece, the Netherlands and many more places across the world: for their kind words, their mental and physical support, for making my time in Enschede enjoyable, for motivating me to push my limits and for creating and sharing memories.

Enschede, 09-10-2021
Konstantinos Kotzias


Abstract

The prevalence of diseases of the nervous system has increased over the last few decades.

To this end, it is of utmost importance to determine various non-clinical risk factors for such diseases, as well as the expected number of hospital admissions in any area of interest.

There have been various epidemiological studies which have attempted to address the above two objectives, applying different Machine Learning techniques. However, most of them fail to overcome common challenges of such studies, including handling missing values, dealing with imbalanced data and performing Regression on data where many principal assumptions are violated.

Taking into account the above reasons, we propose a Machine Learning modelling process utilizing spatial analysis. After imputing all missing data, it applies 4 Feature Selection algorithms to select the sociodemographic features with the greatest importance, followed by 14 variations of predictive models based on 6 Regression algorithms to predict the annual number of hospital admissions per 10,000 inhabitants in Dutch municipalities.

It was concluded that features such as the average income, the share of businesses active per sector and the share of electricity may be associated with such diseases.

The model with the best trade-off between simplicity and performance (Sequential Floating Forward Selection followed by Random Forest Regression) has an Adjusted R-squared of 0.3257, a MAPE of 22.18% and an RMSE of 14.33. Furthermore, it is argued that local models may occasionally be preferred over global ones, while clusters of spatial autocorrelation and non-stationarity of certain variables are identified as well.


Contents

Preface ii

Abstract iii

1 Introduction 1

1.1 Problem Statement . . . . 2

1.2 Research Questions . . . . 3

1.3 Contribution . . . . 4

1.4 Outline . . . . 5

2 Background and Context 7
2.1 The National Healthcare System in the Netherlands . . . . 7

2.1.1 National Basic Registration of Hospital Care . . . . 8

2.2 Public Administration in the Netherlands . . . . 10

2.3 Machine Learning . . . . 10

2.3.1 Data Mining . . . . 11

2.3.2 Feature Selection . . . . 11

2.3.3 Regression . . . . 12

2.3.4 Evaluating Regression Models . . . . 15

2.4 State-of-the-art in Application of Machine Learning to Epidemiology . . . . . 17

2.4.1 Methodology . . . . 17

2.4.2 Related Work on Determining Risk Factors and Predicting Hospital Admissions . . . . 19

2.4.3 Related Work on Spatial Epidemiology . . . . 21

2.4.4 Related Work on Epidemiology of Diseases of the Nervous System . . . . 21
2.5 Challenges and Directions . . . . 25

2.5.1 Missing Data . . . . 26

2.5.2 Literature Gap . . . . 28

2.6 Summary . . . . 29

3 Research Methodology 30
3.1 CRISP DM . . . . 30

3.2 CRISP DM for Data Science . . . . 32


3.3 Methodology Overview . . . . 32

3.4 Machine Learning Process Description . . . . 34

4 Data Understanding 36
4.1 Data Extraction . . . . 36

4.2 Data Description . . . . 37

4.3 Data Exploration . . . . 39

4.4 Data Quality . . . . 40

4.5 Conclusion . . . . 43

5 Data Preparation 44
5.1 Data Selection . . . . 44

5.2 Data Cleansing . . . . 45

5.3 Data Transformation . . . . 48

5.4 Data Integration . . . . 48

5.5 Data Reformatting . . . . 49

5.6 Conclusion . . . . 50

6 Modelling 51
6.1 Selection of Modelling Techniques . . . . 51

6.1.1 Regression . . . . 52

6.1.2 Feature Selection . . . . 53

6.2 Experimental and Test Design . . . . 54

6.3 Model Validation . . . . 60

6.4 Model Setup . . . . 64

6.5 Modelling Phase Recap . . . . 67

7 Results 69
7.1 Results from Feature Selection . . . . 69

7.2 Results from Regression . . . . 72

7.3 Model Overview and Evaluation . . . . 83

8 Discussion 86
8.1 Interpretation of Results . . . . 86

8.1.1 Feature Importance . . . . 86

8.1.2 Regression Algorithms Performance . . . . 87

8.1.3 Spatial Variation . . . . 88

8.2 Machine Learning Life Cycle Discussion . . . . 89

8.2.1 Datasets . . . . 89

8.2.2 Handling Missing Values . . . . 89

8.2.3 Treating Imbalanced Data . . . . 90

8.2.4 Outliers . . . . 90


8.3 Applicability to Healthcare Management . . . . 91

8.4 Limitations . . . . 92

9 Conclusion 93
9.1 Answers to Research Objective and Questions . . . . 93

9.2 Contribution . . . . 96

9.2.1 Contribution to Research . . . . 97

9.2.2 Contribution to Practice . . . . 98

9.3 Recommendations & Future Work . . . . 98

10 List of acronyms 101
References 103
Appendices
A Data Overview 112
A.1 Data Description . . . 112

A.2 Multiple Linear Regression Diagnostic Plots . . . 121

B Algorithms Setup 124
C Experimental Results 126
C.1 Data Imputation . . . 126

C.2 Hyperparameter Tuning . . . 129

C.3 Feature Selection . . . 131

C.4 Regression . . . 137


List of Figures

1.1 Workflow Chart . . . . 4

1.2 Thesis Outline . . . . 6

2.1 Feature Selection phases . . . . 12

2.2 Overview of Random Forests Method . . . . 14

2.3 Literature Review Process . . . . 19

3.1 The CRISP DM cycle . . . . 31

3.2 Data Science Trajectory Map . . . . 33

3.3 CRISP DM Trajectory . . . . 34

3.4 Machine Learning Process . . . . 35

4.1 Histograms of hospital admissions per 10,000 inhabitants with Gamma distribution overlay . . . . 40

4.2 Q-Q plots of hospital admissions per 10,000 inhabitants . . . . 41

4.3 Spatial distribution of hospital admissions per 10,000 inhabitants . . . . 42

5.1 Evaluation of Data Imputation Methods . . . . 47

6.1 K-fold Cross Validation . . . . 56

6.2 Grid and Random Search for Hyperparameter Tuning . . . . 57

6.3 Residuals vs fitted plot . . . . 61

6.4 Residuals vs fitted plot, transformed variables . . . . 63

6.5 Map of Local Spatial Autocorrelations . . . . 64

7.1 Elastic Net Regression, RMSE by hyperparameters combination . . . . 72

7.2 Observed vs. predicted values plot, Multiple Linear Regression . . . . 73

7.3 Variable Importance, Elastic Net Regression . . . . 74

7.4 Observed vs. Predicted Values Plot, Elastic Net Regression . . . . 74

7.5 RMSE by number of components, Partial Least Squares Regression . . . . . 77

7.6 Observed vs. predicted values plots, Partial Least Squares Regression . . . 77

7.7 RMSE by number of predictors, Random Forests . . . . 78

7.8 Observed vs. predicted values plots, Random Forest Regression . . . . 79

7.9 Variable importance, Random Forests . . . . 79

7.10 Local coefficients map . . . . 80


7.11 Local p-values map . . . . 81

7.12 Local R-squared values map . . . . 82

7.13 Observed vs. predicted values plot, Spatial Lag Models . . . . 82

7.14 Local MAPE’s map, Spatial Lag Models . . . . 83

A.1 Hospital admissions per 10,000 inhabitants, boxplots . . . 120

A.2 Multiple Linear Regression, Q-Q Plots . . . 121

A.3 Multiple Linear Regression, Residuals vs. fitted Plots . . . 121

A.4 Multiple Linear Regression, Residuals vs. Leverage Plots . . . 122

A.5 Multiple Linear Regression, Scale Location Plots . . . 122

A.6 Multiple Linear Regression, Cook’s D Bar Plot . . . 123

C.1 Variables Coefficients, Multiple Linear Regression . . . 137

C.2 Variables Coefficients, Partial Least Squares Regression . . . 137

C.3 Local Standard Errors map, Basic GWR . . . 138

C.4 Local Residuals map, Basic GWR . . . 138

C.5 Local Residuals map, Spatial Lag Models . . . 139


List of Tables

2.1 Related Work Summary . . . . 22

4.1 Gamma Distribution Parameters per Dataset . . . . 39

4.2 Number of features per category of sociodemographic variables . . . . 41

5.1 Number of municipalities per dataset . . . . 49

6.1 Machine Learning Modelling Techniques Selected . . . . 52

6.2 R Packages Used . . . . 58

6.3 P-values of tests for major assumptions of Linear Regression . . . . 60

7.1 Top-10 features, as selected by Pearson's/Spearman's Correlation Coefficient . . . . 70
7.2 Most optimal features, as selected by Sequential Floating Forward Selection . . . . 71
7.3 Top-10 features, as selected by Elastic Net Regression . . . . 73

7.4 Top-10 features and relative importance, 1st PLSR model variation . . . . 75

7.5 Top-10 features and relative importance, 2nd PLSR model variation . . . . . 76

7.6 Top-10 features and relative importance, 3rd PLSR model . . . . 76

7.7 Regression models variations performance on test data . . . . 84

A.1 Demographics Variables Overview . . . 112

A.2 Hospital admissions, descriptive statistics . . . 120

B.1 K-Nearest Neighbours Imputation Setup . . . 124

B.2 SMOGN Setup . . . 124

B.3 SFFS for Feature Selection Setup . . . 124

B.4 Ant Colony Optimization for Feature Selection Setup . . . 124

B.5 Elastic Net Regression Setup . . . 125

B.6 Partial Least Squares Regression Setup . . . 125

B.7 Random Forest Regression Setup . . . 125

B.8 Bandwidth Estimation for Basic GWR Setup . . . 125

B.9 Basic GWR Setup . . . 125

B.10 Spatial Lag Model Setup . . . 125

C.1 Mean Imputation, RMSE by repetition . . . 126

C.2 Hot Deck Imputation, RMSE by repetition . . . 127


C.3 k-Nearest Neighbours Imputation, RMSE by repetition . . . 127

C.4 MICE Bayesian Linear Regression Imputation, RMSE by repetition . . . 128

C.5 Stochastic Regression Imputation, RMSE by repetition . . . 128

C.6 Elastic Net Regression, Hyperparameter Tuning Results . . . 129

C.7 Ant Colony Optimization for Feature Selection, Hyperparameter Tuning Results . . . 129
C.8 SFFS for Feature Selection, Hyperparameter Tuning Results . . . 130

C.9 Pearson’s/Spearman’s Correlation Coefficient Results . . . 131

C.10 Ant Colony Optimization, Selected Features . . . 133

C.11 Monte Carlo test p-values . . . 140


Chapter 1

Introduction

One of the areas drawing great and continuing interest in the field of healthcare is identifying patients at risk of hospitalization. Early prediction can help healthcare systems act proactively and improve the quality of the services provided by reducing response times, preventing Emergency Department or ICU overcrowding and achieving better resource management.

In addition, it can give an indication of which patients should be prioritized over others.

With the increase of healthcare costs, healthcare providers seek new ways to use predictive modelling for this purpose. In the era of Industry 4.0, the availability of large volumes of data, known as Big Data, has initiated the development of various models whose objective is to provide accurate estimations about the expected workload of healthcare systems.

Sources such as Electronic Health Records (EHR), Electronic Medical Records (EMR), Internet of Things devices and physicians' notes offer plenty of data for scientists to work in this direction. Over the last decades, multiple modelling efforts have taken advantage of them in order to determine the profile of patients at risk. However, a number of challenges are present when examining individual-level medical data. The most notable ones are the privacy and security of patients, followed by others such as the lack of standardization or the complexity of the data.

A discipline of public health that finds many applications, including predictive modelling, is epidemiology. Unlike clinical medicine, it focuses on the analysis of patterns, risk factors and causes of diseases by examining specific population groups. This focus is one of the greatest contributions of epidemiology to the health sciences. Several studies have attempted to determine the association between demographic characteristics and certain diseases.

Epidemiological studies use different levels of focus for causal factors, starting from the age or gender of patients and moving to external ones, such as socioeconomic and environmental factors. Due to its wide variety of factors, the last level provides ample space for research. Examples of factors that comprise causal models include the average income, the migration background, the participation level in sports activities and others. Their analysis can shed light on which dimensions are highly associated with certain diseases and which populations could be defined as high-risk.

One of those categories of diseases is neurological disorders. Nowadays, they are considered amongst the main causes of death and burden of disease and have been showing an increasing trend in their prevalence. The burden of disease is expressed in DALYs (Disability-Adjusted Life Years), a measure of healthy life years that are lost due to disease (years lived with disability) or premature death (years of life lost). In 2016, the total number of DALYs caused by disorders of the nervous system and senses worldwide was 276 million [1], with such disorders being the second leading cause of death.
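The DALY figure combines the two components named above. As a reference, the standard WHO decomposition (not spelled out in the text) is:

```latex
\mathrm{DALY} = \mathrm{YLL} + \mathrm{YLD},
\qquad
\mathrm{YLL} = N \cdot L,
\qquad
\mathrm{YLD} = I \cdot DW \cdot L'
```

where $N$ is the number of deaths, $L$ the standard life expectancy at the age of death, $I$ the number of incident cases, $DW$ the disability weight of the condition and $L'$ the average duration of the disability in years.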

In the Netherlands, the neurological DALYs are expected to rise from 374,300 in 2015 to 519,500 in 2040 [2], showing the greatest increase rate amongst all categories of diseases.

Two of the main causes are the ever-increasing ageing of the population and the large increase in the cases of dementia. Certain types of the latter fall into the category of diseases of the nervous system, since Alzheimer's disease is considered the most common cause of dementia.

From 2005 to 2014, the share of Dutch citizens aged over 80 years increased from 4.5% to 5.4%. As a result, there has been both an absolute and a relative increase in the number of hospital admissions attributed to this age group (from 9.0% in 2005 to 10.6% in 2014) [3].

Different clinical and lifestyle risk factors have been reported in the literature to be associated with the development of neurological disorders. [4] has identified a history of Type 2 diabetes mellitus, chronic renal failure and chronic obstructive pulmonary disease amongst the main causes of Parkinson's disease in elderly people, while, according to [5], Multiple Sclerosis is less prevalent in male, obese and smoking populations. Nevertheless, less research has been done on identifying non-clinical or lifestyle risk factors, or on the epidemiology of neurological disorders.

1.1 Problem Statement

An indicator frequently used in the healthcare domain to measure the burden of a disease or the performance of a hospital is the admission rate, usually expressed as the total number of hospital admissions annually per person. Although as a measure it doesn't always provide an indication of the quality of the health care provided by hospitals (as high hospital admission rates might be associated with poor evaluation or limited access to general practitioners [6]), it is the most common measure of outcome.
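The target variable used later in the thesis, annual admissions per 10,000 inhabitants, follows directly from this definition. A minimal sketch in Python for illustration (the thesis pipeline itself is implemented in R, and the municipality figures below are invented):

```python
def admission_rate(admissions: int, population: int, per: int = 10_000) -> float:
    """Annual hospital admissions per `per` inhabitants."""
    return admissions / population * per

# Hypothetical municipality: 1,250 nervous-system admissions, 160,000 inhabitants
rate = admission_rate(1_250, 160_000)
print(rate)  # → 78.125 admissions per 10,000 inhabitants
```

Normalizing by population size in this way is what makes admission counts comparable across municipalities of very different sizes.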

Therefore, it is of great interest for policy makers to determine the correlation between certain non-clinical factors and diseases of the nervous system, and how this can lead to hospital admissions. This could highlight areas of high and low risk and provide them with the opportunity to predict, prevent or mitigate, to a certain extent, such cases in certain populations. Furthermore, it can contribute towards the design of healthcare systems which can satisfy the hospitalization demand based on the characteristics of their catchment areas.

The prediction of hospitalization rates at any level is a task that can be quite demanding.

One of the main challenges is the properties of the demographic and medical data. Quite often they are incomplete, especially in the case of medical data, which can lead to inaccurate or biased results. What is more, the vast majority of epidemiological studies fail to consider non-stationary variables and local relationships. Usually they examine administrative units as separate entities, without taking into account that the special characteristics of an entity might affect another one within close proximity; two nearby cities are more likely to have a similar average household income than two cities at a greater distance from each other.

With the rapid development of technology, Computer Science has nowadays become an indispensable part of our society. Amongst its fields, Data Science has been gaining increasing popularity and finds multiple applications. One of the most promising areas is that of public health and epidemiology. It can find great applicability, since Data Scientists are not required to be experts in this discipline, while at the same time epidemiologists can use advanced scientific techniques to extract value from large volumes of unstructured or uncertain data. Quick and complex computations, advanced predictive models, data engineering methods and the discovery of various hidden patterns are only some of the contributions of Data Science, and especially Machine Learning and Data Mining, to epidemiology.

On the above basis, the main objective leading this research is to identify sociodemographic, socioeconomic and other non-clinical factors highly associated with hospitalizations attributed to diseases of the nervous system. Leveraging Machine Learning and spatial analysis, we aim to deploy a Feature Selection process to define the risk factors with the greatest importance, followed by the design of Regression models to predict the annual number of hospital admissions. Moreover, another objective is to assess and compare the performance of those models and their generalizability, as well as to evaluate in practice Machine Learning techniques for overcoming common challenges in such problems, like non-linearity, data missingness and imbalance.

1.2 Research Questions

The main objective of this study is to determine the association between non-clinical features and diseases of the nervous system in the Netherlands, while also considering the special characteristics of each spatial unit. Hereby, it can be defined as:

To what extent can Machine Learning and spatial analysis be used to identify the main non-clinical factors associated with development of diseases of the nervous system and predict the number of annual hospital admissions?

To this end, the following Research Questions have been formulated to accomplish the aforementioned research objective and provide room for further discussion, based on the findings.

RQ1: Which of the current Machine Learning techniques and methods are the most appropriate and efficient to apply in the field of epidemiology?

RQ2: Which of the examined factors have the greatest impact on determining the number of annual hospital admissions and can be used for the development of relevant causal models?

RQ3: How can Machine Learning techniques and algorithms be used to predict the hospital admission rates per municipality, and which are the most efficient?


RQ4: How can we utilize and combine demographic and medical data properly with the support of Machine Learning?

RQ5: Can the models developed be generalized to provide accurate predictions at the national level, or are smaller models which focus on specific regions or municipalities required instead?

In Figure 1.1 we present the research plan to extract the required knowledge and insights to achieve this objective and answer the research questions. We distinguish six main phases. In the first phase, we introduce the topic by briefly discussing it and stating the problem, together with the potential aims and objectives. Then, the topic is further explored in its theoretical aspect, while the research methodology and design are defined as well. In the third and fourth phases we implement our practical research, split into two separate steps. Then, the results obtained are presented and explored. In the final phase, they are discussed with the purpose of gaining value from the results and providing additional research directions.

Figure 1.1: Workflow Chart

1.3 Contribution

The contribution of this study to research and practice is as follows:

Contribution to research

1. In this thesis, regression models for predictive purposes in epidemiology that take into account the factors of spatial dependence and autocorrelation, such as Basic Geographically Weighted Regression (GWR) and the Spatial Lag Model (SLM), are developed. To our knowledge, this is one of the few studies that applies the above Machine Learning (ML) techniques in practice to map the spatial epidemiology of diseases of the nervous system.

2. Furthermore, the current thesis proposes and evaluates multiple ML techniques for dealing with missing data at different levels. Very often, Data Scientists come across data characterized by incompleteness or uncertainty in their research, which can cause biased or misleading results. Hence, in this study we expect to extend the knowledge required to overcome this common challenge.

3. Moreover, it is expected to provide insight into which demographic, economic, educational and other factors are highly associated with each other and can be omitted for predictive purposes. This can lead to the development of simpler and less resource-demanding ML models for Regression or Classification, with fewer dimensions.

Contribution to practice

1. To our knowledge, this is the first attempt to associate such factors with diseases of the nervous system in the Netherlands that performs a small-area analysis with the purpose of generalizing outcomes and predicting the hospitalization rate for each municipality separately. Any previous relevant research has focused either on larger spatial units (such as province- or country-wise) or on different categories of diseases.

2. This thesis is one of the few studies that investigates several non-clinical population characteristics as potential risk factors. Unlike the majority of such studies, which focus mostly on basic sociodemographic or socioeconomic indicators, such as age, migration or educational background and income, more than 200 indicators are explored.

They span different categories, from work and income to environment and education. Hence, it could be a starting point for further relevant research and the development of causal models in epidemiology.

3. The forecasting models developed in this study can be utilized by healthcare providers such as hospitals and insurance companies to gain an estimation of the number of hospitalizations for diseases related to the nervous system and senses in any given year.

This can be performed by combining the expected outcomes with other methodologies (e.g. time series analysis). As a result, it can contribute towards reducing overcrowding, allocating personnel and budget properly and determining catchment areas.

1.4 Outline

The rest of the study is structured as follows. Chapter 2 discusses the background, both theoretical and practical, required for the purposes of the current thesis, as well as the outcomes of the literature review of the related work done in the discipline of epidemiology and the directions for further research. In Chapter 3, the methodology followed for the conduct of this research is presented. Chapter 4 describes the main findings of the exploratory analysis of the preliminary data, while in Chapter 5 the main actions undertaken to prepare the above data for the analysis are illustrated. In Chapter 6, a detailed overview of the modelling setup and process of the research is provided. Chapter 7 presents the results of the modelling process. In Chapter 8, the outcomes of the previous chapter are interpreted, followed by a review of the main challenges and limitations faced and suggestions for future work. Chapter 9 discusses the major conclusions drawn from this research.

Figure 1.2: Thesis Outline


Chapter 2

Background and Context

This chapter provides an overview of the business and scientific background required to understand the context of the current study. First, in Sections 2.1 and 2.2 we give a brief description of the Dutch national healthcare and public administration systems respectively, as we will focus on them in the later phases of the research. Afterwards, in Section 2.3 we present the discipline of Machine Learning in Computer Science, on which the scientific concept of this study rests. Next, in Section 2.4 the related work conducted in the field of (spatial) epidemiology, with additional focus on diseases of the nervous system, is given.

The chapter concludes with Section 2.5, where the main challenges identified during the state-of-the-art review, which will be explored further, are elaborated.

2.1 The National Healthcare System in the Netherlands

The healthcare system in the Netherlands consists of three main divisions:

• Primary care. It is offered by general practitioners (huisarts) and/or specialists who reside in hospitals or practice independently. It covers basic and routine health issues, as well as more essential ones which require brief hospital visits or the provision of specialized healthcare services.

• Long-term care. It is provided by hospitals or other institutions and covers chronic diseases.

• Supplementary care. The rest of the healthcare services, which don't belong to either of the above two categories, fall here. For instance, this includes dental care and physiotherapy.

With the 2006 reform of the Dutch healthcare and health insurance system, which aimed to improve the quality of the services provided and reduce waiting times, the system became private but highly regulated. As a result, the funding of the healthcare system mainly comes from private health insurance, which is mandatory in the Netherlands. The Health Insurance Act (Zorgverzekeringswet) covers primary care, and the Long-term Care Act (Wet langdurige zorg - Wlz) covers the expenses for long-term treatment.

General practitioners are the main point of primary health care. In most cases patients visit their general practitioner for physical complaints, minor treatments and preventative medicine; based on the diagnosis, the practitioner decides on the next actions. These can include the prescription of medicines or a further referral to a specialist or hospital. Furthermore, since 2007 some of the GP's responsibilities have been transferred to specialists or nurses.

Hospital care in the Netherlands is provided by three types of hospitals:

• Academic hospitals. They are associated with public universities and offer state-of-the-art and more specialized care. Moreover, they usually have a greater number of physicians and specialists than the rest of the hospitals. Currently, 8 academic hospitals can be found in the Netherlands.

• General hospitals. The majority of Dutch hospitals fall into this category. They provide care for less specialized and complicated issues and diseases. However, they can refer patients for more specialized care to other institutions if required.

• Teaching hospitals. This is a special category of hospitals. They offer specialized treatment and care for more complex issues, although they are not directly associated with the academic ones. They also facilitate the training of nurses and other medical staff. As of 2021, the number of Dutch teaching hospitals is 26.

Hospitals in the Netherlands provide secondary care. In the majority of cases this happens after referral from the GP or in emergency cases under 24-hour emergency ward.

2.1.1 National Basic Registration of Hospital Care

The National Basic Registration of Hospital Care (Landelijke Basisregistratie Ziekenhuiszorg - LBZ) was launched in 2013 as the successor of the National Medical Registration (de Landelijke Medische Registratie - LMR). It is the central database for patient care and organizational data of all healthcare providers in the Netherlands. It has been developed and maintained by DHD and is the main source of hospital data.

Every admission is registered and encoded based on the diagnosis made. The system used is known as ICD-10 (International Statistical Classification of Diseases and Related Health Problems, 10th revision, World Health Organization) and is the most widely used system for the classification of diseases and disorders worldwide. It classifies them into 22 main groups (known as chapters), each one with its own codes (known as blocks). An admission can refer to a diagnosis that concerns more than one disease or disorder. In this case, the details of the registration include both the main and the secondary diagnosis. ICD-10 has been used in more than 150 countries worldwide since its endorsement in 1990 and is expected to be replaced by ICD-11 in 2022. Diseases of the nervous system constitute the 6th chapter of ICD-10 and are denoted by codes that start with the Latin letter "G".
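Since the chapter is identified purely by the code prefix, filtering admissions for nervous-system diagnoses is straightforward. A minimal sketch in Python for illustration (the record layout and values below are hypothetical; the thesis pipeline itself is implemented in R):

```python
# ICD-10 chapter VI (diseases of the nervous system) spans codes G00-G99,
# so an admission belongs to it iff its main diagnosis code starts with "G".
admissions = [
    {"municipality": "Enschede", "main_diagnosis": "G40.9"},  # epilepsy, unspecified
    {"municipality": "Enschede", "main_diagnosis": "J18.9"},  # pneumonia, unspecified
    {"municipality": "Hengelo", "main_diagnosis": "G35"},     # multiple sclerosis
]

nervous_system = [a for a in admissions if a["main_diagnosis"].upper().startswith("G")]
print(len(nervous_system))  # → 2
```

In practice the secondary diagnosis registered with an admission would be checked the same way.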


The International Shortlist for Hospital Morbidity Tabulation (ISHMT) format is used by Eurostat, the WHO and the OECD and classifies the diseases of ICD-10 into 20 groups and 128 sub-groups. It distinguishes diseases of the nervous system and senses into 5 sub-categories, namely Alzheimer's disease, Multiple Sclerosis, Epilepsy, Transient Cerebral Ischemic Attacks and related syndromes, and other relevant diseases (inflammatory diseases, atrophies, sleep disorders, disorders of the peripheral nervous system).

Admissions are split into three categories, based on their severity, the type of treatment and length of stay.

• Day admissions. Most of the hospital admissions are considered day admissions. They concern a short-term visit to a hospital lasting less than 24 hours, for an appointment with a specialist or to receive treatment.

• Clinical admissions. They refer to admissions at hospitals where an overnight stay of one or more days is registered.

• Observations. This is the most recent category, introduced for the first time in 2015. It concerns the stay of a patient in a hospital or clinic for at least four consecutive hours without an overnight stay, for the purpose of undertaking examinations and observing the patient.

Hospital admissions are assigned to the various municipal divisions based on the postal code of the registered address of the patient. This can occasionally cause variations in the regional figures, as patients might be admitted to hospitals located in cities other than that of their main residence.

Moreover, the age of the patient is also registered. At an individual level this doesn’t raise any issue, yet it can make the comparison of different population groups harder and quite misleading. For this reason, a common technique in the discipline of epidemiology for aggregating and displaying results is standardization. Its purpose is to mitigate the inconsistencies caused by comparing epidemiological figures of populations with different distributions of other factors which might affect the frequency and severity of a disease, such as age or gender. By standardizing the disease rates, comparisons between different populations become more accurate, even though the standardized results might deviate from the actual ones.

We distinguish two main standardization methods :

• Direct, where the rates are calculated on the assumption that all populations of the sample have the same composition according to the standardized factor.

• Indirect, where certain rates from a standard population are used to calculate the rates for the rest of the populations. It is more common when rates for specific population groups are unknown.

Direct standardization is preferable over indirect, but indirect is often more feasible and can provide more accurate results in the case of imbalanced population groups.
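To make the direct method concrete, the short sketch below applies each region's age-specific admission rates to one shared standard population and compares the resulting totals. The age bands, rates and population counts are hypothetical, purely for illustration; they are not figures from the LBZ.

```python
# Direct age standardization: apply each population's age-specific
# rates to one shared "standard" population, then compare the totals.
# All figures below are hypothetical, for illustration only.

standard_pop = {"0-39": 500_000, "40-64": 300_000, "65+": 200_000}

# age-specific admission rates per 100,000 inhabitants, per region
rates_region_a = {"0-39": 20.0, "40-64": 80.0, "65+": 300.0}
rates_region_b = {"0-39": 25.0, "40-64": 70.0, "65+": 280.0}

def directly_standardized_rate(rates, standard):
    """Average of the age-specific rates, weighted by the standard
    population's age distribution (result is per 100,000)."""
    total = sum(standard.values())
    return sum(rates[g] * standard[g] for g in standard) / total

print(directly_standardized_rate(rates_region_a, standard_pop))  # → 94.0
print(directly_standardized_rate(rates_region_b, standard_pop))  # → 89.5
```

Because both regions are projected onto the same standard age structure, the two resulting rates are directly comparable, even if the regions' real age distributions differ.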


2.2 Public Administration in the Netherlands

Municipalities (gemeenten) are the second tier of local administrative division in the Netherlands, just below provinces, and third overall if the central government is included. Over the years less populous municipalities have merged, leading to a reduction in their number from 1209 in 1850 to 537 in 2000 and further to 352 in 2021. In addition, there are three special municipalities (bijzondere gemeenten) in three of the six overseas territories of the Dutch Caribbean, namely the islands of Sint Eustatius, Saba and Bonaire.

Since 1848 the Netherlands has been a decentralized state, which means that although a central government exists, municipalities carry out most of the tasks at the local level. It follows that they are responsible for processes such as public housing, construction activities, licensing, environmental policies, waste management and others.

2.3 Machine Learning

ML is a branch of Artificial Intelligence which enables systems to automatically learn, improve and perform actions by studying and analyzing data without human intervention. It was first introduced in 1959 by AI pioneer Arthur Samuel, as “the field of study that gives computers the ability to learn without being explicitly programmed” [7]. The concept of ML is to recognize patterns in data that can be used to train models for different purposes, from classification to natural language processing. It can have three main functionalities : descriptive (where it uses the data to provide answers about what happened), predictive (where it predicts what will happen in the future) and prescriptive (where, based on the data, it not only predicts what will happen, but can also provide suggestions for actions).

Nowadays, there are three main types of ML. Those are :

• Supervised Learning. Being the most popular type of ML, Supervised Learning uses datasets with labels and makes predictions for their outcome. Then it evaluates its prediction and gradually learns to provide more accurate approximations for it.

• Unsupervised Learning. Unlike Supervised Learning, in this type of ML the data is unlabeled. Thus, Unsupervised Learning algorithms are developed in such a way so they can find hidden patterns.

• Reinforcement Learning. Reinforcement Learning is linked to the concept of “trial and error”. Such models are trained by allowing them to perform different actions. At the beginning they are expected to commit many mistakes or provide inaccurate results, till they gradually learn which actions are considered correct or efficient.

The majority of ML algorithms fall into one of the above types. Classification is a subcategory of Supervised Learning. Such algorithms analyze and label existing data and predict the label of new, unknown data. The second subcategory of Supervised Learning is Regression. In Regression, the correlation between different variables is found with the aim to identify how a dependent variable changes corresponding to independent variables. Finally, in Unsupervised Learning we identify Clustering algorithms. Clustering bears similarities with classification, but with the difference that data points are divided into groups based on their proximity in properties to other points of the group.

2.3.1 Data Mining

Data Mining is an interdisciplinary field of study of Computer Science that lies at the intersection of ML and Statistics. It can combine elements from other fields as well, such as Database Systems or Business Intelligence. Similar to ML, its purpose is to process data in order to extract useful insights and identify patterns and trends which can enhance business processes. Consequently, very often those two terms tend to be considered identical or overlapping. The main difference however is that Data Mining is applied to and extracts information from a specific dataset only, while ML has a wider scope, as its models are trained on a dataset and then generalized to produce results on unknown ones too.

2.3.2 Feature Selection

Feature Selection refers to the process of reducing the number of variables used for building a model for predictive purposes. Four major advantages of Feature Selection for regression or classification analysis are the following :

1. It allows the user to train Machine Learning models faster.

2. By reducing the number of variables, the complexity of the model is further reduced and it’s easier to interpret, especially by non-experienced users.

3. It makes the Machine Learning models less sensitive to noise, which can mitigate the risk of the curse of dimensionality and improve their accuracy. Very often the different features of a dataset are highly correlated with each other, hence it would make sense to drop features which do not add any additional value to the model.

4. It reduces overfitting, the issue of a model performing quite well on the known dataset but poorly on new, unseen data.

Feature Selection is similar to Dimensionality Reduction, as both are used to reduce the number of variables. While Feature Selection does this by including and excluding certain features without transforming them, Dimensionality Reduction transforms them into a lower dimension by creating new combinations. The Feature Selection process consists of four main steps, as shown in Figure 2.1 [8].

Its techniques can be classified into two main categories, namely Supervised techniques and Unsupervised techniques. The former are used for labeled data, while the latter are used for unlabeled data and don’t make use of the target variable. Supervised Feature Selection techniques can be further classified into three groups :

Figure 2.1: Feature Selection phases

1. Filter methods. Filter methods are quite fast, don’t require significant computational resources, are not subject to overfitting and can be applied on high-dimensional datasets. They select features before classification and clustering based on their characteristics and try to predict the relationship between the dependent and each of the independent variables separately, assigning a score to the latter and selecting the most highly ranked. Common filter methods are the Chi-Square test, the ANOVA test, the Fisher’s score, the Information gain and Pearson’s or Spearman’s rank correlation coefficient. Compared to the other two approaches, their results are more generalizable.

2. Wrapper methods. Wrapper methods were popularized by [9]. They create multiple subsets of features and evaluate them based on a selected algorithm. They are also known as greedy algorithms, since they examine several possible combinations of features and tend to be slow and demanding. The main difference from filter methods is that they find the optimal subset of features and therefore lead to better predictive accuracy. Examples of Wrapper methods for Feature Selection are Genetic algorithms, Recursive Feature Elimination, Backward Feature Elimination and Sequential Feature Selection amongst others.

3. Embedded methods. Embedded methods combine advantages of both filter and wrapper methods, but Feature Selection is performed during the model training : the learning algorithm itself evaluates candidate features and selects them as part of building the model. There are three types of search strategies in embedded methods : exhaustive, heuristic and randomized. The algorithm complexity ranges from exponential to quadratic [10]. Noteworthy examples of Embedded methods include LASSO Regularization, Ridge Regression, Elastic Net, Random Forests and Decision Trees.
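A minimal filter method can be sketched in a few lines: score every feature by the absolute value of its Pearson correlation with the target and keep the top k. The helper `pearson_filter` and the synthetic data below are illustrative inventions, not part of the thesis pipeline.

```python
import numpy as np

def pearson_filter(X, y, k):
    """Filter method sketch: score each feature by |Pearson r| with
    the target and return the indices of the k best features."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return np.argsort(-np.abs(r))[:k]

# synthetic data: only features 1 and 4 actually drive the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)

selected = pearson_filter(X, y, k=2)
print(sorted(selected.tolist()))   # → [1, 4]
```

Note that, as a filter method, the score of each feature is computed independently of the others and before any model is trained, which is exactly what makes this family fast but blind to feature interactions.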

2.3.3 Regression

Regression is a Supervised Learning technique mainly used for predictions, forecasting and time series analysis, or to find the cause-effect relationship between one variable, known as the target or dependent variable, and one or more independent variables, which are called predictors. In Regression, the different observations and a line or curve that passes through them are plotted in the same graph. The proximity of the line to the observations in terms of vertical distance indicates how strong the relationship between the variables is. Regression analysis leads to the generation of a formula that describes the slope of the line, which can further be used to approximate the dependent variable from the independent variable(s), up to a deviation known as the standard error.

An important aspect to take into account for Regression analysis is that it only reveals the relationship between variables. The researcher should carefully interpret its results and combine them with other Data Science methods to develop predictive models.

We identify different families of regression algorithms in literature and research, based on the assumptions made about the data, the type of the dependent variable and the number/type of the independent ones.

Linear Regression

Linear Regression is one of the simplest, yet widely used regression algorithms. It consists of a dependent variable and one or more independent variables. When more than one independent variable is used, Linear Regression is called Multiple Linear Regression (MLR).

Those two different types of variables are said to be related linearly to each other. The equation that describes Linear Regression can be modeled as :

$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + e$ (2.1)

The best fit line is determined by varying the values of the y-intercept and the regression coefficients to obtain a line on which as many observations as possible lie. This is done by following the concept of Ordinary Least Squares (OLS), which calculates this line by minimizing the sum of the squared vertical distances of the observations from the line. It is based on four principal assumptions : homoscedasticity, independence of residuals, normality of residuals and linearity of residuals.
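The OLS fit of equation 2.1 can be reproduced directly with numpy's least-squares solver. The two-predictor dataset below is synthetic, with made-up coefficients, purely to show the mechanics.

```python
import numpy as np

# Fit y = b0 + b1*x1 + b2*x2 by Ordinary Least Squares.
# The coefficients (1.5, 2.0, -3.0) and noise level are invented.
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

A = np.column_stack([np.ones(n), X])          # design matrix with intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes the sum of squared errors
b0, b1, b2 = coef
print(b0, b1, b2)   # close to 1.5, 2.0 and -3.0
```

With 500 observations and little noise, the recovered coefficients land very close to the true generating values; the residual `e` absorbs what the linear part cannot explain.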

Random Forests

Random Forests (RF) is a Bagging Ensemble Learning method. It splits the dataset into n samples and creates a random Decision Tree for each one, with k records. Those trees are run in parallel and independently and generate a result. Then the final output is calculated by averaging the results. All predictive features in Random Forests are used equally or nearly equally, while the introduction of some randomness during the generation of splits ensures that the model will not be overfitting. Even though Random Forests perform better than many other regression algorithms, they can’t extrapolate to provide estimations for values falling outside the range of the data that is used to train the models.
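The averaging idea, and the extrapolation limit just mentioned, can be illustrated with a deliberately simplified sketch: depth-1 trees ("stumps") fitted on bootstrap samples and averaged. This is a toy version only; real Random Forests grow deep trees and also subsample features at each split.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(x, y):
    """Find the single split threshold minimizing squared error."""
    best_sse, best = np.inf, None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def predict_stump(stump, x):
    t, left_val, right_val = stump
    return np.where(x <= t, left_val, right_val)

x = np.linspace(0, 10, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=80)

stumps = []
for _ in range(200):
    idx = rng.integers(0, len(x), len(x))      # bootstrap sample with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

# final prediction = average over all trees
ensemble_pred = np.mean([predict_stump(s, x) for s in stumps], axis=0)

# No extrapolation: far outside the training range, the averaged
# prediction is still bounded by the training targets.
far_pred = np.mean([predict_stump(s, np.array([100.0])) for s in stumps])
print(far_pred)
```

Since every leaf predicts a mean of training targets, any query, however far from the data, yields a value inside the training range of y, which is the extrapolation limitation described above.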

Elastic Net Regression

Figure 2.2: Overview of Random Forests Method

It is a technique that combines concepts of Ridge and LASSO Regression. Similar to the other two methods, it is quite efficient for datasets where the independent variables are highly correlated. It uses the penalization terms from both methods to regularize the models. It first finds the coefficients of Ridge Regression and afterwards applies the LASSO methodology to shrink them. Consequently, it performs regularization and feature selection simultaneously. It attempts to minimize the following factor :

$\frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda P_\alpha(\beta)$ (2.2)

where

$P_\alpha(\beta) = \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_{j=1}^{p}\left(\frac{1-\alpha}{2}\beta_j^2 + \alpha|\beta_j|\right)$ (2.3)
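Translated literally into code, the objective of equations 2.2 and 2.3 is just a squared-error loss plus a blended L2/L1 penalty. The function and all data values below are arbitrary illustrations, not output of any fitting library.

```python
import numpy as np

def elastic_net_objective(beta0, beta, X, y, lam, alpha):
    """Evaluate the elastic net objective: squared-error loss plus a
    penalty mixing the ridge (L2) and LASSO (L1) terms via alpha."""
    n = len(y)
    residuals = y - beta0 - X @ beta
    loss = (residuals ** 2).sum() / (2 * n)
    penalty = ((1 - alpha) / 2) * (beta ** 2).sum() + alpha * np.abs(beta).sum()
    return loss + lam * penalty

# alpha = 1 recovers the pure LASSO penalty, alpha = 0 the pure ridge one.
beta = np.array([1.0, -2.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -2.0, -1.0])
print(elastic_net_objective(0.0, beta, X, y, lam=0.1, alpha=0.5))  # → 0.275
```

In this toy case the residuals are exactly zero, so the printed value is purely the weighted penalty term, which makes the role of λ and α easy to see.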

Partial Least Squares Regression

Partial Least Squares Regression performs regression on highly correlated variables, after reducing them to a smaller set of uncorrelated components. It is also efficient when the number of independent variables is large. It was developed to overcome the issue that the OLS method can suffer from large variance, and thereupon its application might produce coefficients with high standard errors. It combines elements of two other ML techniques, Principal Component Analysis and (Multiple) Linear Regression. It follows many of the assumptions of the latter but softens them. It doesn’t perform a selection of features but creates new independent variables which are linear combinations of the existing ones, by considering the dependent variables as well. It is more suitable as a predictive than an explanatory technique [11].
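A one-component sketch shows the core idea: unlike PCA, the weight vector is chosen using the dependent variable, so the extracted score has maximal covariance with y. This NIPALS-style, single-component version is a simplification of full PLS, and the collinear data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n)])  # near-duplicate predictors
y = 2.0 * x1 + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)                # PLS operates on centered data
yc = y - y.mean()

w = Xc.T @ yc                          # direction of maximal covariance with y
w /= np.linalg.norm(w)
t = Xc @ w                             # one latent score replaces both predictors
q = (t @ yc) / (t @ t)                 # regress y on the score

y_hat = t * q + y.mean()
print(np.corrcoef(y_hat, y)[0, 1])     # close to 1: one component suffices here
```

Because the two predictors are almost identical, OLS would struggle with unstable coefficients, while the single PLS component absorbs their shared variation cleanly.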

Spatial Regression

Spatial Regression is a special family of regression algorithms aimed to model and explore relationships between data with spatial patterns. They are used quite often in disciplines such as natural sciences, epidemiology and agriculture. Unlike other regression algorithms, Spatial Regression assumes that the value of the dependent variable in a location depends not only on the values of the predictor variables in the same location, but also on those in other locations. Hence, one of the main assumptions of most regression models (independence of observations) is relaxed in Spatial Regression models. Most of the techniques of this family are based on the OLS method. Spatial Regression models often referred to in the literature are the Spatial Lag Model and the Spatial Error Model.
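The key ingredient of a spatial lag model is the spatially lagged dependent variable Wy, built from a row-standardized contiguity matrix. The 4-region adjacency structure and rates below are hypothetical, chosen only to show the construction.

```python
import numpy as np

# hypothetical adjacency: 1 if two regions share a border
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
W = W / W.sum(axis=1, keepdims=True)    # row-standardize: each row sums to 1

y = np.array([10.0, 20.0, 30.0, 40.0])  # e.g. admission rates per region
Wy = W @ y                              # average rate of each region's neighbours
print(Wy)
```

In a spatial lag model, Wy then enters the regression as an additional predictor, y = ρWy + Xβ + ε, so that each region's outcome explicitly depends on its neighbours' outcomes.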

Geographically Weighted Regression

Geographically Weighted Regression is used for geospatial analysis. It assumes that the relationship between the variables doesn’t remain constant, but varies according to the location. This is known as non-stationarity, which is common for environmental or climate variables. Instead of creating a global model, it creates multiple local models which are applied separately. In addition, it addresses the issue of spatial autocorrelation. In line with Tobler’s first law of geography, “everything is related to everything else, but near things are more related than distant things” [12]. Therefore, it accounts for the fact that nearby spatial units are more likely to be correlated than units which are far from each other. Different GWR models have been developed. The simplest one is the Basic GWR (and its extensions, Robust GWR and Mixed GWR), which follows the concept of Linear Regression and OLS.

2.3.4 Evaluating Regression Models

The evaluation of regression methods is one of the most essential parts of the engineering cycle, as it can provide insights on how well they perform not only on the training data, but also on unknown data.

Unlike classification, the performance aspect of accuracy is not used, since the exact calculation of the target variable is difficult. Instead, regression evaluation metrics focus on estimating how close the predicted value is to the actual one. Thus, they are called error metrics.

We distinguish three main evaluation metrics, namely Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Other common metrics include Mean Absolute Percentage Error (MAPE) and R-squared.

Mean Absolute Error

Mean Absolute Error is equal to the average absolute distance of the observations from the regression line. This distance, also known as the error, is the difference between the observed and the predicted value of the dependent variable. It is the simplest evaluation metric, yet not as popular as the rest. It is expressed in the same unit as the dependent variable, and values close to zero suggest a good fit of the model to the data.

$MAE = \frac{\sum_{i=1}^{n}|y_i - x_i|}{n} = \frac{\sum_{i=1}^{n}|e_i|}{n}$ (2.4)

where $y_i$ is the predicted value and $x_i$ the observed value.

Mean Squared Error

Mean Squared Error is a metric similar to MAE. The main difference is that it is calculated as the average of the squared differences between the predicted and observed value. Because of that, outliers have a greater impact on its calculation and large errors are “penalized” more compared to MAE. Like MAE, smaller values of MSE are a sign that the regression line is quite close to the line of good fit.

$MSE = \frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2$ (2.5)

Root Mean Squared Error

Root Mean Squared Error follows the concept of MSE, since it is equal to its square root. By doing so, it has the same unit as that of the target variable. It can be interpreted as the standard deviation of the errors. Like MAE and MSE, lower values indicate a better fit.

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - y_i)^2}$ (2.6)

Mean Absolute Percentage Error

Mean Absolute Percentage Error is the equivalent of MAE expressed as a percentage and shows how far, on average, the predicted values are from the observed ones. The use of a percentage as the unit makes its interpretation easier compared to MAE. On the other hand, it has some disadvantages, as it tends to be biased towards predictions which are lower than the observed value and is not defined for observations where the observed value is zero.

$MAPE = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{x_i - y_i}{x_i}\right|$ (2.7)

R-squared

R-squared, which denotes the coefficient of determination, is a goodness-of-fit measure that expresses the proportion of the variance in the dependent variable that can be predicted from the independent variables. The values it takes range from -∞ to 1, with higher values usually indicating better model performance. A serious limitation of R-squared is that it doesn’t account for bias in models. In addition, it can be inflated because of overfitting, and it always increases when more independent variables are added as predictors. To address those issues, an adjusted alternative metric has been suggested, namely Adjusted R-squared, which penalizes the addition of predictor variables that do not improve the model.
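The metrics of equations 2.4-2.7 plus R-squared can be hand-rolled in a few lines, keeping the thesis notation of $x_i$ for observed and $y_i$ for predicted values. The two example vectors are toy numbers chosen so the results are easy to verify by hand.

```python
import math

# x = observed values, y = predicted values, as in eqs. (2.4)-(2.7)
def mae(x, y):  return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
def mse(x, y):  return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
def rmse(x, y): return math.sqrt(mse(x, y))
def mape(x, y): return 100 * sum(abs((a - b) / a) for a, b in zip(x, y)) / len(x)

def r_squared(x, y):
    mean_x = sum(x) / len(x)
    ss_res = sum((a - b) ** 2 for a, b in zip(x, y))
    ss_tot = sum((a - mean_x) ** 2 for a in x)
    return 1 - ss_res / ss_tot

observed  = [2.0, 4.0, 6.0]
predicted = [1.0, 5.0, 6.0]
print(mae(observed, predicted))        # → 0.6666666666666666
print(mape(observed, predicted))       # → 25.0
print(r_squared(observed, predicted))  # → 0.75
```

Note how MAPE weights the same absolute error of 1.0 differently depending on the size of the observed value (50% for x = 2 but 25% for x = 4), which is exactly its bias discussed above.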


2.4 State-of-the-art in Application of Machine Learning to Epidemiology

In this section we review the body of related work available in the field of Data Analytics and Machine Learning for epidemiology, with special focus on diseases of the nervous system and use of spatial analysis techniques.

2.4.1 Methodology

With the aim of exploring related work on this topic, a systematic literature review inspired by the work and guidelines of [13] was performed. In the upcoming paragraphs, the research method for searching, retrieving and analyzing relevant studies is presented. Since an exhaustive search covering all aspects or keywords of interest for the current topic would not be feasible due to time and resource limitations, certain criteria were set.

Database

Only databases accessible via the University of Twente were considered. The list comprises 105 databases, including some of the most commonly used for multidisciplinary fields such as Scopus and Google Scholar and databases covering various disciplines of Computer Science (Science Direct, IEEE). Scopus was selected as the main source as it fulfills the majority of quality criteria mentioned by [14] and can be used for principal literature review.

Search Terms

Scopus allows the combination of different search terms using logical operators. A common challenge in literature review is defining a search term as inclusive as possible, while at the same time returning results that would neither be irrelevant to the topic nor so numerous that the screening process becomes demanding. Hence, the following search term was generated, which includes many of the main keywords that form the basis of this thesis, such as Machine Learning and epidemiology :

(“Machine Learning”) AND

(“prediction” OR “predictive” OR “forecasting” OR “regression”) AND

(“epidemiology” OR “epidemiological” OR “sociodemographic” OR “socioeconomic”) AND (“admissions” OR “cases”)

To retrieve epidemiological studies utilizing spatial analysis, the following combination of search terms was used :

(“Machine Learning”) AND

(“prediction” OR “predictive” OR “forecasting” OR “regression”) AND

(“epidemiology” OR “epidemiological” OR “sociodemographic” OR “socioeconomic”) AND


(“admissions” OR “cases”) AND

(“spatial” OR “geospatial” OR “geographic”)

Finally, to identify related work focusing specifically on diseases of the nervous system, the combination of search terms that was used was :

(“Machine Learning”) AND

(“demographic” OR “sociodemographic” OR “socioeconomic” OR “epidemiology” OR “epidemiological” OR “association” OR “risk factors” OR “variables”) AND

(“admissions” OR “cases” OR “risk”) AND (“nervous” OR “neurological”)

The search process was performed by querying the title, abstract and keywords for the above terms. To define the boundaries of the study, the following restrictions were used :

• Year of publication. Only publications from 2017 onwards were considered eligible, since older works, especially in the scientific and technological field, tend to become outdated within a short period.

• Type of publication. Only conference papers and articles which are published and not under review were considered.

• Subject area. This includes Medicine, Computer Science, Engineering, Health Professions and Multidisciplinary.

After defining the above three limitations, 161 papers in total were yielded for the first combination of search terms, 23 for the second and 128 for the third.

Selection Process

To further narrow down the results, a series of inclusion and exclusion criteria were set that would be used to find the most relevant and valuable studies. Those criteria are :

• IC1 : The studies should be relevant to the topic. This ensures that they focus on fields such as Machine Learning, epidemiology, predictive modelling and healthcare.

• IC2 : As predictor socioeconomic or demographic variables they use other or additional variables than the four basic demographic variables that the majority of relevant studies use (age, gender, race and years of education).

• IC3 : The studies are in English.

• IC4 : The full text of the studies is available for online access or download.

• IC5 : The paper doesn’t belong to any instantiation of grey literature (reports, white papers, notes and others).


Based on the above criteria, the title, abstract and keywords of all papers were screened.

This resulted in 9 papers meeting the selection criteria, 31 for which there were doubts and 121 being discarded as they didn’t meet one or more of them. For those falling into the second category, a more thorough review process was undertaken by examining additional chapters. After performing this process and evaluating the respective papers, 18 remained as final.

An overview of the process is portrayed in Figure 2.3.

Figure 2.3: Literature Review Process

2.4.2 Related Work on Determining Risk Factors and Predicting Hospital Admissions

Over the last years, there have been several attempts to associate various demographic, sociodemographic, socioeconomic and other factors with the development of certain diseases and the severity they can cause, which could possibly lead to hospitalization as well.

In its work, [15] presents an analysis of various demographic, social and environmental factors aggregated at county level to identify those that are highly associated with COVID-19 in the USA. A semi-parametric approach using SuperLearner fits 19 different Machine Learning algorithms to the data. The authors state that the factors which contribute most to COVID-19 cases or deaths are the population’s literacy, the proportion of black and African-American individuals and the share of individuals frequently using public transportation.

In a similar work, [16] analyses multiple demographic, sociodemographic and environmental variables to understand the main factors of spatial clustering and spread of COVID-19 across China. The study was performed using 2 major techniques, namely the K-means algorithm and Principal Component Analysis, to create different clustering schemes. 3 clustering regions were created, while the results indicate that higher air temperatures and the share of elderly people are the two most important predictors.

[17] performed a study to associate Major Depressive Episodes with Severe Impairment (MDESI) in adolescents with sociodemographic characteristics, with the support of ML. The Logistic Regression model developed successfully identified 66% of the MDESI cases (recall rate) and 72% of the MDESI and non-MDESI cases (accuracy rate) in the cohort of study, with variables such as living in a single-parent household and having a negative school experience increasing the risk of such episodes.

[18] investigated the impact of pollution and environmental variables on tuberculosis, with the purpose of identifying relevant high-risk areas as well. 5 different regression techniques (Gradient Boosting, Random Forest, Decision Tree, Extreme Gradient Boosting and Artificial Neural Networks) were used to predict the annual number of tuberculosis cases. Artificial Neural Networks achieved the best performance on all evaluation metrics. Results indicate that higher air temperature and precipitation, combined with wet climates or increased SO₂ and NO₂ emissions, lead to an increase in tuberculosis cases.

Conceptually similar work has been carried out by [19], which examines the impact that sociodemographic and environmental conditions have. Factors such as the unemployment rate, the average NO₂ concentration, the population density and the social vulnerability are used for the purposes of this study. Random Forests and Gradient Boosting models had the best performance in terms of Area Under the Curve (AUC) and accuracy. The study showed that the risk was significantly increased among census groups with more disabled and elderly population, single-parent households and a higher unemployment rate.

A different study by [20] aims to predict the utilization of the public healthcare system in the USA. A decision-tree based Machine Learning algorithm was used, which achieved an AUC score of 0.84 on the training and 0.83 on the test set at predicting inpatient hospitalization, ED visits or avoidable inpatient hospitalization. 19 different categories of socioeconomic variables were considered, with outcomes indicating that air quality and income have the greatest impact, followed by neighborhood in-migration and transportation.

In its work, [21] develops a predictive model for Crimean-Congo haemorrhagic fever to identify high-risk regions by performing a spatiotemporal analysis. The model is based on Gaussian process regression and predicts the annual number of cases for a given year using the covariates of the previous year, with Pearson’s correlation coefficient and Normalized RMSE as performance metrics. The most explanatory variables for the predictions were found to be demographic (number of settlements with fewer than 25,000 inhabitants), geospatial (longitude and latitude of regions) and climatic (evapotranspiration levels).

[22] applies an Elastic Net regression modelling technique to determine significant factors causing schizophrenia. The study examines 9 sociodemographic factors and their interactions. On the basis of the evidence available, it seems fair to suggest that parental psychosis, degree of urbanization and paternal age are three variables that increase the risk.

Environmental socioeconomic variables associated with involuntary psychiatric hospitalization are also examined by [23]. Major ML techniques used are CART and Chi-square automatic interaction detection. 5 different models were compared and CART with hyperparameter tuning with/without environmental socioeconomic data (ESED) outperformed the others in terms of Area Under the Receiver Operating Characteristic (AUROC). Determining socioeconomic factors include the number of buildings and households per 100 inhabitants and the household size.

2.4.3 Related Work on Spatial Epidemiology

The study of [24] applies several ML methods (Generalized Linear Model, Generalized Boosted Model, Artificial Neural Networks etc.) to explore spatial variation in lymphatic filariasis across Ghana and create probability maps. Random Forests and Generalized Boosted Models achieved very high AUC scores of 0.98 and 0.95 at predicting high-risk regions for transmission of lymphatic filariasis. Variable selection suggests that environmental and sociodemographic factors with moderate or significant importance are the terrain slope, the proximity to water bodies, the poor housing conditions and the high population density.

In [25], the authors present 3 models (based on Random Forests and Artificial Neural Networks) to predict the number of dengue cases, based on environmental, meteorological and sociodemographic indicators. Sociodemographic indicators like high population density, poor quality housing and sanitation management were found to be more significant in long-term than short-term forecasting. In addition, the national model was more accurate at predicting the number of cases than the local model.

In the study of [26] the discussion centers on COVID-19 in Germany. Several potential risk factors are examined using Bayesian Additive Regression Trees. Furthermore, an exploratory spatial modelling process is conducted to identify high or cold risk areas, as well as spatial outliers. 31 out of the initial 366 variables were identified by the model as significant, with the most remarkable ones being the church density, the latitude and longitude, the voter participation, the number of foreign tourists and the youth employment rate.

Likewise, [27] attempts to identify hotspots for the spread of COVID-19 in the United States based on socioeconomic, sociodemographic and geographic factors with the support of Artificial Neural Networks. In total, 217 counties were identified as hotspots by the spatial analysis, mostly in the Northeastern and Southern states. In addition, using the Boruta algorithm and Pearson's correlation coefficient, 34 variables were selected as important for the model (e.g. road density, precipitation, share of Asian population, median household income and total number of nurse practitioners). The correlation coefficients for those variables range between 0.3 and 0.65.
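The correlation filter used in [27] can be sketched as follows: keep only the candidate predictors whose absolute Pearson correlation with the target meets a threshold (0.3, the lower end of the range quoted above). The study pairs this with the Boruta algorithm; only the correlation step is shown, on synthetic data with made-up variable names.

```python
# Sketch: Pearson-correlation feature filter, |r| >= 0.3, synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n = 500
target = rng.normal(size=n)
candidates = {
    "road_density": 0.5 * target + rng.normal(size=n),   # correlated
    "precipitation": 0.4 * target + rng.normal(size=n),  # correlated
    "unrelated_noise": rng.normal(size=n),               # uncorrelated
}

threshold = 0.3
selected = {}
for name, values in candidates.items():
    r = abs(np.corrcoef(values, target)[0, 1])  # Pearson's |r|
    if r >= threshold:
        selected[name] = round(r, 2)
print(selected)  # variables surviving the filter, with their |r|
```

Variables passing this filter would then be handed to the downstream model; Boruta adds a second, model-based check on top of this purely linear one.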

2.4.4 Related Work on Epidemiology of Diseases of the Nervous System

[28] performed a study on the early identification of epilepsy patients who are likely to undergo surgery. The input variables were collected from EHRs, physicians' notes and other sources, and feature selection was performed, followed by the development of predictive models (logistic regression, LASSO regression, gradient boosting machines, Random Forests and others). The latter had the best performance on both study datasets, while no significant association was found for the majority of the demographic characteristics examined.
