Data-efficient Convolutional Neural Networks for Treatment Decision Support in Acute Ischemic Stroke
MSc Artificial Intelligence
Track: Data Science
Master Thesis

Data-efficient Convolutional Neural Networks for Treatment Decision Support in Acute Ischemic Stroke

by
Adam Hilbert
11138955

July 20, 2018
36 EC

Daily Supervisor: Bastiaan S. Veeling
Supervisor: Dr. Henk A. Marquering
Assessor: Prof. Dr. Max Welling

Acknowledgments

I would like to thank my daily supervisor Bas Veeling, who took over my supervision at the beginning of June 2017, after I had to look for a new supervisor. Bas has provided tremendous help in guiding my research with brilliant ideas and open criticism. I very much appreciate all the effort he made to understand the medical background and the goal I have been pursuing for so long. Moreover, I thank Bas for all his help in extracting, fine-tuning and submitting an Extended Abstract of my thesis to the Medical Imaging with Deep Learning (MIDL) 2018 conference, which was accepted for a poster session. I would like to thank my co-supervisor Henk Marquering for his thorough guidance regarding the clinical implications and relevance of my research, as well as for his professional participation in my application for the MRCLEAN Registry dataset and for his critical eye when reviewing all my writings. Henk has played a big role in my professional career since April 2016 and was the first to provide me the opportunity to work with the AMC, for which I am very thankful. In the same line, I would like to thank Nico-lab for sponsoring my work and all my colleagues for their support and helpful discussions. I also thank the MRCLEAN investigators for providing me with the dataset and NVIDIA for granting a TITAN X GPU for my research. Last, but very much not least, I would like to thank my girlfriend Imola Nagy for her infinite support and patience towards me working long hours on this thesis. Gladly, I can finally answer THE question of: When are you going to finish your thesis?

Abstract

In this thesis we explore possibilities of training deep Convolutional Neural Networks (CNNs) from scratch on limited datasets. In particular, many medical image classification tasks – e.g. stroke outcome prediction – involve patient-level labels of images. In this case, breaking the images into numerous patches and re-assigning labels to them does not preserve the validity of the labels. Networks can only be trained with whole images, which makes conventional architectures overfit. In the following work we focus on infusing prior knowledge about imaging features to tackle this challenge. We hypothesize that this improves generalization and mitigates overfitting. We show how the multi-scale and multi-orientation nature of biological receptive fields can be incorporated in CNNs. We propose an extension of Structured Receptive Field Neural Networks, which enables the recognition of imaging features at multiple scales and orientations. We benchmark the proposed methods on the CIFAR10 dataset and on 2D Maximum Intensity Projection images of ischemic stroke patients. The model is evaluated against a conventional CNN baseline on a total of 4 image classification tasks, all of which have prognostic value for stroke decision support. We find that our method achieves equal or higher performance on all benchmarks, while reducing the number of model parameters and overfitting on limited datasets. Furthermore, we combine clinical variables of the same patient population with imaging features extracted by the best CNNs. We evaluate the combination models on the same stroke classification tasks and find no significant improvement over the best CNN models. Finally, we conclude by discussing potential directions for further development.

Contents

Acknowledgments ... 2
Abstract ... 3
Glossary ... 7
Chapter 1 ... 8
Introduction ... 8
1.1 Stroke ... 8

1.1.1 About the disease ... 8

1.1.2 Treatment decision – role of imaging ... 9

1.2 Deep Learning ... 10

1.2.1 The new era of data-processing ... 10

1.2.2 Impact on healthcare ... 10

1.2.3 Potentials and challenges in Ischemic Stroke ... 11

1.3 Contribution ... 12

Chapter 2 ... 14

Related work ... 14

2.1 Image interpretation and functional outcome in Ischemic Stroke ... 14

2.1.1 Infarction recognition, quantification ... 14

2.1.2 Collateral circulation ... 15

2.1.3 Thrombus characteristics ... 17

2.2 CNNs in Medical Domain ... 18

2.2.1 Dimension of learnable features ... 18

2.2.2 Transfer learning ... 18

2.2.3 Unsupervised learning ... 19

2.3 Data-efficient CNN architectures ... 19

2.3.1 Parameter efficient architectures... 20

2.3.2 Decreasing flexibility... 20

2.3.3 Increasing functionality ... 20

Chapter 3 ... 22

Preliminaries ... 22

3.1 Convolutional Neural Networks ... 22

3.2 Maximum Intensity Projection ... 23

Chapter 4 ... 25


4.1.2 Multi-scale representation ... 27

4.1.3 Multi-orientation representation ... 28

4.1.4 Scale / orientation aggregation ... 32

4.2 Architectures ... 33
4.2.1 DenseNet ... 33
4.2.2 CIFAR-Net ... 33
4.2.3 StrokeNet ... 34
4.2.4 Combination of models ... 35
4.3 Initialization ... 35
4.4 Prediction tasks ... 36
Chapter 5 ... 38
Experiments ... 38
5.1 Experimental setup ... 38
5.1.1 Datasets ... 38
5.1.2 Data Preprocessing ... 38
5.1.3 Implementation ... 40
5.1.4 Baselines ... 41
5.1.5 Evaluation ... 41
5.2 Experiments ... 42
5.2.1 CIFAR10 ... 42

5.2.2 MRCLEAN Registry MIP ... 42

5.2.3 MRCLEAN Registry NCCT ... 43

5.2.4 MRCLEAN Registry MIP + clinical variables ... 43

5.3 Results & Evaluation ... 43

5.3.1 CIFAR10 ... 43

5.3.2 MRCLEAN Registry MIP ... 45

5.3.3 MRCLEAN Registry NCCT ... 47

5.3.4 MRCLEAN Registry MIP + clinical variables ... 47

5.3.5 Summary ... 47

5.4 Evaluation ... 49

Chapter 6 ... 50

Discussion ... 50

6.1 Future works and practical considerations ... 50


6.1.5 Multinomial classification ... 51
6.1.6 Image representation ... 51
6.1.7 Interpretation ... 51
6.2 Concluding words ... 52
Chapter 7 ... 53
Appendix ... 53
7.1 Multi-orientation representation in 3D ... 53

7.2 Visualization of 3D Gaussian derivatives... 57

7.3 Example images from the MRCLEAN Registry 2D MIP dataset ... 58

7.4 Poster from an Extended Abstract of this thesis, which got accepted to Medical Imaging with Deep Learning (MIDL) conference 2018 ... 59

Glossary

Anterior Circulation Ischemic Stroke (ACIS): Occlusion in the anterior circulation of the brain, including the anterior communicating artery, middle cerebral artery, internal carotid artery and anterior choroidal artery

Door-to-needle: Time between arrival at the emergency department and the administration of treatment

Baseline imaging: First image acquisition after admission to hospital

Perfusion: The transition of blood through a vascular system

Recanalization / Reperfusion: The process of restoring blood flow to an occluded part of the brain

Hypoattenuation / hypodensity: Sign of less dense tissue (darker on the gray scale)

Hyperattenuation / hyperdensity: Sign of denser tissue (brighter on the gray scale)

Hypoxia: Deficiency in the oxygen supply of tissue

Ischemia: Restriction of the blood supply of tissue

Necrosis: Death of cells in tissue

Infarction: Dead tissue

Penumbra: Salvageable tissue bed around the ischemic core, which is not irreversibly injured

Ischemic core: Area of severe ischemia; most of the tissue in the core is irreversibly injured

Ischemic lesion: The whole area affected by the ischemia, including the ischemic core and penumbra


Chapter 1

Introduction

1.1 Stroke

1.1.1 About the disease

Stroke has remained the 2nd leading cause of death over the last 15 years, accounting for 6.24 million deaths yearly and approximately the same number of permanent disabilities worldwide [1]. Common risk factors are high blood pressure, smoking, diabetes, and high cholesterol levels, which are all increasingly representative attributes of today's population. Stroke can be divided into 2 types: Ischemic and Hemorrhagic. This thesis focuses on Ischemic Stroke, which accounts for 87% of all incidents.

Ischemic Stroke is caused by a blood clot (thrombus) that blocks an intracranial vessel and hinders the necessary blood supply to a part of the brain. Brain ischemia can potentially result in serious hypoxia, as the lack of blood flow starves the neurons of oxygen. Neurons are extremely vulnerable: the complete interruption of supply through a large vessel leads to rapid necrosis, i.e. irreversible damage of brain tissue, and the brain loses as many neurons in 1 hour as it would in 3.6 years of normal aging [2], as shown in Figure 1.1. This conveys a clear message: early recognition of ischemia and quick intervention are crucial. As the well-known phrase of the field reflects: Time is brain!

Symptoms are commonly matched with the FAST scale, which examines weakness or dysfunction of the F-Face, A-Arms and S-Speech, whereas T refers to the importance of time. A patient with any of these symptoms has to be taken to the closest hospital as soon as possible. Ischemic occlusions can be treated within a certain time window. Since 1997, IntraVenous tissue Plasminogen Activator (IV-tPA) has been the only Food and Drug Administration (FDA) approved medical therapy. IV-tPA is a clot-dissolving agent, which aims to recanalize (restore the blood flow in) the occluded artery and to restore perfusion to the brain by dissolving the occlusion. However, in 2013 Intra-Arterial Therapy (IAT, also referred to as Mechanical Thrombectomy or Endovascular Treatment), a well-established procedure for cardiac occlusions, was introduced to stroke care with great potential. The intervention employs a catheter, which is led up to the occluded vessel and removes the clot with a stent retriever. The exact procedure is illustrated in Figure 1.2.

Figure 1.1. Estimated pace of brain damage in the most common large-vessel ischemic stroke [2]

In 2015 the Multi-center Randomized CLinical trial of Endovascular treatment for Acute ischemic stroke in the Netherlands (MRCLEAN) investigated the effect of IAT compared to standard care [3]. This important trial was the first to show a clear benefit of IAT and allowed further analysis of its allocation criteria. The current therapeutic window for IAT lies around 6 hours [4], in contrast to 4.5 hours for IV-tPA [5]. If we take into account that these time windows include the request for an ambulance, transport of the patient, examination, imaging, and treatment selection, the conclusion is clear: no time can be wasted. In this thesis, we focus on the support and development of treatment selection.

1.1.2 Treatment decision – role of imaging

Clinical decision making in ischemic stroke is a complex process embracing the examination of multiple clinical variables and radiological characteristics. Stroke imaging primarily incorporates Non-Contrast Computed Tomography (NCCT), in search of alterations of tissue attenuation, and Computed Tomography Angiography (CTA), for inspection of the vasculature of the brain using a contrast agent. In favor of standardized prognosis, CT scans of both modalities are quantified by various stroke scores relying on specific, visually observable phenomena, which can predict possible outcomes and thus inform treatment allocation.

Such predictors on NCCT are hyper- and hypodensity (aka hyper- and hypoattenuation) signs. Hyperattenuation usually indicates the location of the occlusion, whereas hypoattenuation – also referred to as Early Ischemic Changes (EICs) – can be associated with infarcted brain tissue. For instance, the Alberta Stroke Program Early CT Score (ASPECTS) quantifies the hypoattenuation as described in Box 1.1. An ASPECTS smaller than 6 indicates a severe stroke and is used in current practice as an exclusion criterion for IAT (among others). Although the method sounds straightforward, EICs are usually very subtle and extremely complicated to recognize, leading to low agreement between observers, and even between multiple observations of the same observer.

Parallel to the analysis of NCCT, CTA imaging can be acquired to identify the properties of a thrombus (e.g. location, volume, perviousness) and to examine collateral circulation. Collateral circulation means an alternative blood supply to parts of the brain where the normal flow has been blocked by the occlusion. A common way to quantify the presence of collateral flow is described in Box 1.2. It is common practice to compute the Maximum Intensity Projection (MIP) from CTA, which enhances vessels due to the high intensity of contrast

Initialize: ASPECTS = 10
Input: the territory of the Middle Cerebral Artery (MCA), divided into 10 areas per hemisphere
For every area:
    If contralateral hypoattenuation is observed:
        ASPECTS = ASPECTS - 1
Output: final ASPECTS (the number of unaffected areas)

Box 1.1. ASPECT Scoring
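The scoring in Box 1.1 can be sketched in code. The region names follow the standard ASPECTS subdivision of the MCA territory, while the input (a list of regions judged hypoattenuated) is a hypothetical stand-in for the visual assessment that a radiologist performs:

```python
def aspect_score(hypoattenuated_regions):
    """Box 1.1: start at 10 and subtract 1 for each of the 10 MCA-territory
    regions that shows hypoattenuation relative to the contralateral side."""
    # caudate, insula, internal capsule, lentiform nucleus, cortical M1-M6
    MCA_REGIONS = {"C", "I", "IC", "L", "M1", "M2", "M3", "M4", "M5", "M6"}
    affected = set(hypoattenuated_regions) & MCA_REGIONS
    return 10 - len(affected)
```

For example, hypoattenuation in M1, M2 and the insula would yield a score of 7; a score below 6 would then flag a severe stroke per the exclusion criterion described above.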

0 – Absent collateral flow.
1 – Poor collaterals, < 50% filling.
2 – Moderate collaterals, 50 - 100% filling.
3 – Good collateral flow, 100% filling.

Box 1.2. Collateral Score

Figure 1.2. Procedure of IAT


agent and makes observation easier and more accurate. Several studies have suggested the important role of collateral flow in determining the exact safe treatment window [6], [7], [8]. Due to recent advances in treatment options, clinicians have to deal with great uncertainty around treatment effect. Even though IAT proved to improve functional outcome in a large population [3], individual prognosis varies.
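The Maximum Intensity Projection mentioned above collapses a 3D volume into a 2D image by keeping, for each projection ray, only the maximum voxel intensity, which makes high-intensity contrast-filled vessels stand out. A minimal NumPy sketch on a synthetic volume (shapes and intensity values are made up for illustration):

```python
import numpy as np

# synthetic CTA-like volume with axes (slices, height, width)
volume = np.zeros((32, 64, 64), dtype=np.float32)
volume[10, 20, 30] = 300.0  # one bright, contrast-enhanced "vessel" voxel

# axial MIP: take the maximum intensity along the slice axis
mip = volume.max(axis=0)    # result has shape (64, 64)
```

Projecting along a different axis (`axis=1` or `axis=2`) yields coronal or sagittal MIPs in the same way.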

Functional outcome in stroke is primarily measured by the modified Rankin Scale (mRS) examined 90 days after treatment, a scale running from 0 to 6 indicating the degree of disability (see Box 1.3 for a detailed explanation). Clinical trials continuously seek to identify specific subgroups of patients associated with favorable outcomes. Analyses of potential correlations between baseline scores and outcome measures are, however, usually subject to high inter-observer variability and the subjectivity of clinicians, due to the complexity of image interpretation. We hypothesize that deep learning methodologies can provide a solution and can identify essential predictive characteristics of ischemic stroke outcome.

1.2 Deep Learning

1.2.1 The new era of data-processing

The origins of deep learning date back to the '60s and '70s, when the conceptual model of Artificial Neural Networks (ANNs) emerged. Inspired by biological neural networks, ANNs consist of a great number of inter-connected neurons organized into layers. Consecutive layers extract increasingly higher-level features, leading to a complex yet very compact encoding of the input. ANNs are usually trained by Stochastic Gradient Descent, which comprises iterative adjustments of neural connections driven by error Backpropagation. This procedure is computationally demanding, which in the early stages was the biggest hurdle for deep learning. In 2012, however, AlexNet [9], the only deep-learning-based algorithm participating, won the ImageNet Large-Scale Image Classification Challenge [10] by a huge margin. The now-famous architecture was trained on 2 GPUs and introduced ReLU activations, Dropout [11] and data augmentation to CNNs to avoid overfitting. Since then a new era of machine learning has started, in which advanced learning algorithms [12], deeper architectures [13] and novel techniques [14] have been developed. All these advancements allowed deep neural networks to gradually achieve superhuman performance in a broad range of tasks [15], [16]. Deep learning, termed one of the 10 breakthrough technologies of 2013 [17], is being applied to a wide variety of fields1 and is rapidly proving state-of-the-art performance in most applications.

1.2.2 Impact on healthcare

A substantial part of medical diagnosis relies on imaging, which has gone through great improvements in recent decades, with devices capable of faster acquisition and higher resolution [18]. To date, however, most image interpretations are performed by physicians

1 Some examples: http://www.yaronhadad.com/deep-learning-most-amazing-applications/

0 - No symptoms.
1 - No significant disability. Able to carry out all usual activities, despite some symptoms.
2 - Slight disability. Able to look after own affairs without assistance, but unable to carry out all previous activities.
3 - Moderate disability. Requires some help, but able to walk unassisted.
4 - Moderately severe disability. Unable to attend to own bodily needs without assistance, and unable to walk unassisted.
5 - Severe disability. Requires constant nursing care and attention, bedridden, incontinent.
6 - Dead.

Box 1.3. The modified Rankin Scale (mRS)
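Several of the studies discussed later dichotomize this scale, commonly treating mRS 0-2 as good and 3-6 as poor functional outcome; a trivial sketch of that convention:

```python
def good_outcome(mrs):
    """Dichotomize the 90-day modified Rankin Scale: scores 0-2 are
    commonly treated as good functional outcome, 3-6 as poor."""
    if not 0 <= mrs <= 6:
        raise ValueError("mRS must be between 0 and 6")
    return mrs <= 2
```

Note that the exact cut-point varies between studies; some of the trials cited below use mRS < 3, which coincides with this threshold.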


and are thus subject to subjectivity, variation across interpreters and fatigue. Deep learning, in particular Convolutional Neural Networks (CNNs), could be a key enabler of improving diagnosis and supporting experts' workflow. CNNs exploit the spatial structure of images by utilizing local feature enhancement in hierarchical layers, a process resembling human perception. They learn highly effective descriptors of images, can detect abnormalities, and realize efficient inference, which are essential parts of diagnostic processes. Therefore, it is no surprise that applications of CNNs and other deep learning methodologies are rapidly flooding the medical field; for reference, Figure 1.3. shows the number of published articles concerning deep learning or CNNs listed on PubMed2.

Figure 1.3. Number of articles with "Deep Learning", "Deep Neural Network", "Convolutional Neural Network" or "CNN" in their title. Data for the current year was retrieved on 15 July 2018; the upper part indicates a proportional estimate for the rest of the year.

1.2.3 Potentials and challenges in Ischemic Stroke

Deep learning in general offers great potential to achieve the goal of Precision Medicine: ensuring the right treatment, for the right patient, at the right time. The impact of applying deep learning in stroke care could be tremendous. Minimizing the time elapsed from stroke onset to treatment has become a priority target of current clinical practice [2] and has been proven to directly lead to improved patient outcomes [8], [19]. Median door-to-needle times in North America are typically 60 minutes [20]. As human interpretation of imaging takes up a third of this time [21], the current workflow can be highly optimized by efficient computerized assistance. CNNs can undoubtedly serve as viable solutions for various imaging tasks and, moreover, can explore unknown prognostic properties of images. Nevertheless, there are certain challenges to deal with before deep learning can truly bloom in this field.

The common issues of the medical domain, such as the lack of large-scale annotated datasets and unbalanced class distributions, apply to stroke imaging as well. Furthermore, CT scans are high-resolution 3D images and consequently require models of higher dimensionality than natural images. For parametric classifiers such as CNNs, as the dimensionality of the data increases, the classifier's predictive power decreases [22]; a phenomenon referred to as the curse of dimensionality. All these issues lead to a lack of sufficiently large numbers of training samples, a crucial requirement for traditional CNNs to prosper. Transfer learning, a method of making

2 Online platform of the US National Library of Medicine, comprising more than 27 million citations


use of networks pre-trained on the natural image domain, is commonly used as a workaround. However, this technique restricts architectures to the use of 2D images or ensembles of 2D views, and is thus not a feasible approach for 3D networks. We hypothesize that infusing a priori knowledge about features directly into CNNs alleviates the effects of the mentioned limitations.
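To make the dimensionality remark above concrete, a back-of-the-envelope sketch (the kernel sizes and channel counts are arbitrary examples, not the thesis's settings): moving from a 2D to a 3D convolution multiplies the kernel volume, and hence the parameter count, by the extra kernel extent.

```python
import math

def conv_params(kernel_spatial, in_ch, out_ch):
    # weights per output channel = kernel volume * input channels, plus one bias
    return out_ch * (math.prod(kernel_spatial) * in_ch + 1)

p2d = conv_params((3, 3), 64, 64)      # 36,928 parameters
p3d = conv_params((3, 3, 3), 64, 64)   # 110,656 parameters, roughly 3x as many
```

With the same limited number of patients, every layer of a 3D network therefore has roughly three times as many parameters to fit, which aggravates overfitting.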

On the other hand, when developing algorithms to aid the decision making of human experts, transparency and interpretability have to be taken into account. Accurate prediction systems can provide significantly better assistance when they provide direct insights into their reasoning process, beyond outputting a probability distribution. Classical solutions in medical image classification, such as plotting weight distributions of classifiers [23], [24], are not specific to an input image and their interpretation can be misleading [25]. Inspired by recent work [26], [27], we believe the best way to interpret outcome predictions is to visualize the traits of input scans that led the model to the predicted class. Nevertheless, understanding such complex systems is twofold, and explaining outputs might not bring a sufficient level of transparency, depending on the application area. Deep Neural Networks are usually referred to as black boxes because of their complex structure utilizing millions of parameters, in contrast to classical, structured image processing techniques such as filtering or oriented gradients. Thus, we draw motivation from novel CNNs, which structure image processing to a certain degree and are more transparent, while retaining competitive performance.

1.3 Contribution

In the previous section, we discussed the advantages of precision medicine in ischemic stroke and highlighted the crucial importance of fast decisions, as well as the key role of imaging in realizing them. We identified two main challenges of utilizing CNNs in stroke imaging, namely the limited size and imbalance of datasets and the necessity of model interpretability.

In this thesis, we aim to design and validate a data-efficient CNN architecture tailored for CT imaging, with a focus on treatment outcome prediction for ischemic stroke patients. We hypothesize that replacing the crafted attention of current scores by learning features directly from baseline imaging can lead to comparable predictive power and provides a better understanding of important imaging characteristics. Furthermore, we hypothesize that utilizing receptive fields with extended functionality to recognize features of multiple scales and orientations improves classification performance.

To the best of our knowledge, CNNs have never been used for whole-image classification within the field of stroke using CT imaging.

We contribute a comprehensive literature review of medical imaging employing machine learning methods in Chapter 2. We briefly discuss recent successes in MRI imaging, where the application of deep learning is at a more evolved stage. We find that existing prediction models of functional outcome are limited to the use of clinical stroke scores and other clinical characteristics. Moreover, we focus on other medical imaging tasks beyond stroke that utilize CNNs or other feature learning methods on CT imaging.

Our main contribution lies in addressing the data scarcity issue by rethinking certain components of traditional CNN architectures and refining them to suit CT image analysis. We construct a data-efficient feature learning method inspired by recent work that redefines neural receptive fields [28]. We extend the underlying work [28] by further increasing the discriminative power of CNNs. Our solution reduces the number of model parameters compared to standard CNNs while retaining a sufficient level of flexibility in the encoding. We present our image processing pipeline containing all these elements in Chapter 4.
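A rough sketch of the core idea behind the structured receptive fields of [28], as we understand it: each filter is expressed as a learned linear combination of a fixed basis of Gaussian derivatives, so only the few combination coefficients are trained rather than every kernel weight. The filter size, sigma and order cut-off below are illustrative choices, not the settings used in this thesis.

```python
import numpy as np

def gaussian_derivative_basis(sigma, size, max_order=2):
    # 1D Gaussian and its first two derivatives (analytic forms)
    x = np.arange(size) - size // 2
    g = np.exp(-x**2 / (2.0 * sigma**2))
    g /= g.sum()
    g1 = -x / sigma**2 * g                   # first derivative
    g2 = (x**2 - sigma**2) / sigma**4 * g    # second derivative
    one_d = [g, g1, g2]
    # separable outer products give the 2D basis up to the given total order
    basis = [np.outer(a, b)
             for i, a in enumerate(one_d)
             for j, b in enumerate(one_d)
             if i + j <= max_order]
    return np.stack(basis)                   # shape: (n_basis, size, size)

# a structured filter is a learned linear combination of the fixed basis
basis = gaussian_derivative_basis(sigma=1.0, size=7)   # 6 basis filters
alphas = np.random.randn(basis.shape[0])               # learnable coefficients
structured_filter = np.tensordot(alphas, basis, axes=1)
```

Because the basis is fixed and smooth, a 7x7 filter here costs 6 learnable coefficients instead of 49 free weights, which is the source of the parameter reduction mentioned above.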

We evaluate our model on a common benchmark of natural image classification, as well as on predicting certain stroke metrics and outcomes of ischemic stroke patients, in Chapter 5. We test the individual CNN models trained on 2D Maximum Intensity Projection (MIP) images and combine their imaging features with clinical variables for the same classification tasks. In Chapter 6 we conclude and share thoughts about future work and possible fields of application.


Chapter 2

Related work

In this section we review the significant advances in related research that served as our inspiration. We begin by sketching the current roadmap of ischemic stroke research, discussing automated approaches to stroke image interpretation and presenting studies that correlate imaging patterns with functional outcome. We then provide an overview of CNN applications in the medical domain across MR and CT imaging. In the final subsection we present recent developments in CNN architectures from various aspects.

2.1 Image interpretation and functional outcome in Ischemic Stroke

A comprehensive overview of the important aspects of outcome prediction was given in [29], a relatively early work. The authors discuss the potential of different imaging modalities. Characteristics of the ischemic lesion and vascular circulation were found to improve prediction accuracy over clinical variables.

To identify subgroups of patients who could potentially benefit from treatment, several endovascular clinical trials [30], [31], [32], [3], [33], [34] have sought to further refine the role of imaging. [35] provides a recent roadmap of successful findings and remaining uncertainties in outcome prediction and treatment efficiency. A so-called optimal imaging profile is described by biomarkers that promise the highest confidence in predicting treatment effect. For endovascular treatment these include large vessel occlusions, a small ischemic core, good collaterals and a large penumbra. Consequently, image interpretation has to recognize these phenomena to enable accurate prediction of outcome.

2.1.1 Infarction recognition, quantification

Infarctions can be recognized on several modalities. Diffusion Weighted Imaging (DWI) can be used for accurate quantification. Thanks to its acquisition properties, quantifying algorithms provide ground truth about infarction and have superior sensitivity and specificity over other modalities [36]. However, conventional ASPECTS can have good overall accuracy in predicting a large MRI DWI lesion volume, as shown by [37]. Beyond a certain volume the benefit of IVt-PA may be limited [38], [30], [36], thus a dichotomized estimation of the DWI lesion has clinical relevance in treatment selection.

Figure 2.1. NCCT, CTP and DWI images of the same patient. Red indicates the ischemic core, green the penumbra on CTP [37].


Results of [37] are promising and show the potential of NCCT, despite the small population analyzed (N=59). Figure 2.1. illustrates how differently a lesion appears on NCCT, CT Perfusion (CTP) and DWI modalities.

The ASPECT score has also been shown to relate to the functional status of patients. [39] found a significant association with mRS in the whole population of patients (N=162) who received IVt-PA within 4.5 hours and in the subgroup treated within 3 hours (N=133). Furthermore, [40] reported that ASPECTS can be a good predictor of functional outcome after Intra-Arterial Treatment as well. In their study of 70 patients, ASPECTS >7 had the strongest association with good outcome (mRS<3) amongst baseline predictors. Goyal et al. [4] analyzed a selected cohort (N=1287) of 5 well-known clinical trials [30], [31], [3], [33], [32] that established the benefit of IAT. They also found a significant association between high ASPECTS (subgroups of 6-8 and 9-10) and favourable treatment effect of IAT.

Baseline ASPECTS is a strong predictor of outcomes after both IVt-PA and IAT. However, how to quantitatively assess hypodense areas and whether treatment benefits extend to patients with lower ASPECTS remain uncertain. Patients with ASPECTS of 5-7 could potentially benefit from IAT as suggested in [41], while previous studies dichotomized ASPECTS at 7 or 8. All these clinical studies illustrate the need for accurate hypodensity recognition to assess lesion properties on CT imaging. Automatic approaches to quantify early ischemic changes, i.e. to automate ASPECTS scoring, have been proposed [42], [43] to overcome certain limitations of manual scoring. Moderate correlation with mRS was achieved by the automated method in [42]. Results were reported on a test set of 63 patients. Scores from 2 individual observers and from their consensus correlated moderately with the National Institutes of Health Stroke Scale (NIHSS), but only consensus scores correlated with mRS. It is worth noting that all patients included in this study received only standard IVt-PA. [43] analyzes the performance of the e-ASPECTS algorithm, 3 trainees and 3 expert observers on 34 patients by comparison to a DWI-ASPECTS ground truth defined as the consensus of another 2 independent, non-blinded experts. The algorithm proved more sensitive (46.46% [95% CI 30.8-62.1]) than all trainees and one expert in recognizing affected regions, and comparably specific (94.15% [95% CI 91.7-96.6]) to all 6 physicians. Both methods offer promise for the future of imaging-based treatment selection, despite the small validation size.

2.1.2 Collateral circulation

Sufficient collateral flow can extend treatment windows and ensure better outcome following acute ischemic stroke [35], [44], [45] by slowing down infarct growth. Figure 2.3. depicts an example of how differently the final infarct on MRI develops in the presence and absence of good collateral blood supply.

Figure 2.2: Regions of ASPECTS on the Ganglionic and Supraganglionic level


Intracranial as well as leptomeningeal collaterals influence the prediction of functional outcome of Anterior Circulation Ischemic Stroke (ACIS) treated with IVt-PA, as shown in [46], [47]. Several grading systems were employed by [46], but only one, called Miteff [45], was found to reliably predict good and poor functional outcome. Topographic collaterals were assessed by [47] based on ASPECTS regions. Due to low inter-observer agreement, most regions were excluded from analysis. The M5 regional Collateral Score (CS) had the strongest association with mRS and moderately increased the predictive power of NIHSS.

On the other hand, [48], [49], [50] demonstrate the same association for endovascular treatment in cases of ACIS and basilar artery occlusions. [48] analyzed 84 patients with respect to collateral status and recanalization (defined by the Thrombolysis In Cerebral Infarction (TICI) score). Significant correlation with mRS was only found among those who recanalized (TICI 2b-3) with good or intermediate collaterals at baseline. Instead of pursuing a direct interaction between collaterals and functional outcome, [49] aimed to characterize a malignant profile that could potentially impact treatment decisions, using collateral quantification on CTA for 197 AIS patients. The employed CS definition was able to accurately predict a large DWI lesion, a suggested sign of limited or no treatment effect [51], [52]. In addition to the most common large vessel occlusions, [50] provides an analysis of a small group of patients (N=21) with basilar artery occlusions. Favourable outcome was only achieved by patients with good collateral status, as revealed via analysis of the filling of the posterior communicating artery.

There is great inconsistency in quantifying collateral flow, as pointed out in [53], which provides a wide review of existing practices, including the various grading scales and modalities in use. The authors found 7 different methods using CT imaging, with 5 of them reporting beneficial prognostic significance of a good collateral grade [45], [54], [55], [56], [57]. It seems empirically favourable to dichotomize collateral scores into good or intermediate (moderate) versus poor when analyzing their correlation with outcome measures. Reducing the dimensions of this complex phenomenon to such an extent is, however, highly challenging.

Figure 2.3: Collateral status on CTA MIP with baseline and 24h MRI DWI [48]. The larger infarcts (white areas) and faster development are clearly visible on


2.1.3 Thrombus characteristics

The main goal of CTA imaging is to identify the location of the occlusion and potentially analyze certain properties of the blood clot. The location as well as the length of the thrombus bears crucial information from an interventional point of view in case of IAT, and has been found to predict poor outcome of IVt-PA [58], [59], [60], [61], [62], [63].

Furthermore, Santos et al. recently introduced two quantitative metrics of thrombi, CTA attenuation increase³ (∆) and thrombus void fraction⁴ (𝜀), to estimate the relative blood flow through the thrombus [64]. The authors named this measure thrombus perviousness. The thresholds to categorize a thrombus as pervious were optimized on 199 patients by ROC analysis of predicting functional outcome (mRS). Both dichotomized biomarkers – using the optimal threshold – correlated significantly with mRS, recanalization and final infarct volume. In addition, the same authors present a semi-automated thrombus segmentation framework in [65] to aid quick and accurate assessment of thrombus characteristics.

Table 2.1 serves as a summary of the findings in the referenced studies. We note that all studies were conducted on different populations and that different measures were used to assess the performance of the listed biomarkers. Consequently, results cannot be directly compared to each other; they indicate the relevance of imaging findings rather than establishing the current state of the art of functional outcome prediction.

Table 2.1: Summary of the association of imaging patterns with outcome measures reported in this section. Colors indicate the treatment performed on the analyzed patients (red – IVt-PA, blue – IAT, green – independent of treatment).

AUC: Area Under Curve, OR: Odds Ratio, aOR: adjusted OR, cOR: common OR, CI: Confidence Interval, ρ: Spearman's correlation coefficient, *: 0.05 significance level, SP: Specificity, SV: Sensitivity, RR: Rate Ratio

³ ∆ = ρ(thrombus, CTA) − ρ(thrombus, NCCT), where ρ indicates attenuation in HU.
⁴ 𝜀 = ∆ / ∆_c, where ∆_c indicates the attenuation increase in the same ROI on the contralateral side, i.e. the contrast enhancement of blood.

Method: ASPECTS
– Large DWI lesion: > 70 mL: AUC 0.87 [95% CI 0.78–0.96] [37]
– mRS: ≥ 3: aOR 0.844 [95% CI 0.720–0.990] (IVT < 4.5h), aOR 0.779 [95% CI 0.622–0.976] (IVT < 3h) [39]; < 3: OR 9.63 [95% CI 1.34–69] [40], cOR 2.34 [95% CI 1.68–3.26] (6–8), cOR 2.66 [95% CI 1.61–4.40] (9–10) [4]; Full range: ρ = −0.256* (consensus), ρ = −0.272* (auto) [42]

Method: Collaterals
– Large DWI lesion: > 100 mL: AUC 0.84 [CS=0: SP 97.6%, SV 54.5%] [49]; > 70 mL: AUC 0.80 [CS=0: SP 99.3%, SV 42.9%] [49]
– mRS: ≥ 3: OR 2.592 [95% CI 1.113–6.038] [46]; < 3: OR 3.341 [95% CI 1.241–8.996] [46]; Full range: OR 2.62 [95% CI 1.215–5.682] (M5), AUC 0.752 [95% CI 0.694–0.809] [47], RR 3.8 [90% CI 1.2–12.1] [48]

Method: Thrombus characteristics
– mRS: Full range: AUC_∆ 0.67, OR_∆ 3.2 [95% CI 1.7–6.4] / AUC_ε 0.69, OR_ε 3.3 [95% CI 1.7–6.3] [64]


2.2 CNNs in the Medical Domain

2.2.1 Dimension of learnable features

Initial works focused on applying the standard 2D convolutional framework to medical images. The field soon realized that exploiting the 3D nature of certain modalities can lead to superior performance. The most common approach to date is to formulate 3D classification as an ensemble of 3 orthogonal 2D views. Detection of pulmonary nodules was amongst the first applications of this technique [66]. Results were improved by [67] with a more complex architecture building on the same idea: the multi-view approach is extended to nine symmetric planes of a cube, and three different ways to fuse the features extracted in parallel by the nine 2D CNNs are proposed. [68] further confirms the validity of 2.5D methods by markedly improving the sensitivity of existing Computer-Aided Detection systems while reducing false positives. The proposed framework outperforms the state of the art on sclerotic metastases, lymph node and colonic polyp detection benchmarks by a wide margin.

On the other hand, to fully exploit hierarchical 3D features, one has to employ 3D convolution throughout the CNN. 3D convolution is computationally more intensive than its 2D variant, especially when applied to large medical images. Kamnitsas et al. showed that, with in-depth analysis of the limitations, deep 3D CNNs can be implemented efficiently and adapted to a variety of clinical settings [69]. The state of the art was improved in brain injury, brain tumor and ischemic lesion segmentation on MRI with a dual-channel, 11-layer-deep CNN. We note, however, that these methods excel in patch-wise training frameworks rather than in whole-image classification.

In closer relation to our field of application, CNNs have recently been introduced to stroke imaging by two notable papers. Lisowska et al. proposed 3D networks built on bilateral comparison of extracted patches and achieved encouraging results in the automatic detection of early ischemic changes and dense vessel signs on NCCT scans [70]. Furthermore, in [71] – using the same CNN architecture – an automatic ASPECTS algorithm is presented. The CNN-based score shows moderate agreement with the ground-truth score on a dichotomized (ASPECTS >= 7) patient level (Cohen's κ 0.58), but only minor agreement on a territory level (Cohen's κ 0.39).

2.2.2 Transfer learning

Medical detection or segmentation tasks, which incorporate pixel-level annotations, can be redefined on numerous small patches, a setting in which classical CNN architectures prosper. More related to our problem, in disease classification annotations are given per image or per patient, and the challenge of small datasets arises. A popular and widely applied solution is transfer learning: re-using convolutional filters of CNNs that have been pre-trained on large – typically natural-image – datasets. There are two usage strategies: 1) "freezing" the network's parameters and using it for feature extraction, or 2) fine-tuning the pre-trained network on the medical data at hand. Despite the usually large difference between the pre-training and the actual domain, numerous successful applications have emerged [72], [73], [74], some even approaching human-level performance [75], [76]. On the other hand, transfer learning introduces certain limitations too. In general, when re-using pre-trained weights, one has to employ the same – typically 2D – architecture and thus retains the same discriminative power. Consequently, medical images cannot be used directly and have to be formatted as RGB images, restricted to 2D. The true potential of this technique can be exploited by transferring trained parameters from closer domains, e.g. across the available imaging modalities [77], [78].
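The two strategies can be illustrated with a toy numpy sketch (deliberately not any specific framework's API): freezing simply excludes the pre-trained parameters from the gradient update, while fine-tuning updates everything. The layer names and shapes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: 'conv' stands in for the pre-trained feature
# extractor, 'head' for the task-specific classifier trained on medical data.
params = {"conv": rng.normal(size=(4, 4)),   # pre-trained weights
          "head": rng.normal(size=(4, 2))}   # randomly initialised

def sgd_step(params, grads, lr=0.1, frozen=()):
    """Strategy 1: put 'conv' in `frozen` -> pure feature extraction.
    Strategy 2: frozen=() -> fine-tune all layers on the new data."""
    return {k: w if k in frozen else w - lr * grads[k]
            for k, w in params.items()}

grads = {k: np.ones_like(w) for k, w in params.items()}
frozen_run = sgd_step(params, grads, frozen=("conv",))   # strategy 1
finetune_run = sgd_step(params, grads)                   # strategy 2
```

In real frameworks the same effect is obtained by marking parameters as non-trainable; the mechanics, excluding a subset of weights from the update, are identical.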


2.2.3 Unsupervised learning

During supervised learning, the parameters of CNNs are tuned to minimize a pre-defined error function over target classes. Therefore, such feature extractors learn patterns that discriminate the target classes rather than building an understanding of the underlying concepts – i.e. annotations – and the structure of the images. A practical example is shown in Figure 2.4, where the false classification of a CNN is based merely on the background. In contrast, the goal of unsupervised learning is to learn the latent diversity of the data without any classification of samples involved. This framework also draws a closer analogy to human learning, as we can learn and recognize structures without knowing their specific label. Nevertheless, accurate modelling of data distributions is challenging, hence unsupervised and supervised techniques are often combined. In medical imaging, unlabeled data is usually available to a larger extent than labeled data due to time-consuming and expensive annotation. As a result, unsupervised feature learning prior to supervised tailoring tends to be a popular and powerful tool for increasing the accuracy and generalizability of networks, especially in the medical domain [79].

The application of unsupervised methods such as Deep Belief Networks and Stacked Auto-Encoders to medical image or exam classification dates back to 2013 and focused mostly on neuroimaging [80], [81], [82]. With the advances of CNNs, the focus of the community shifted towards utilizing the same feature extraction mechanism as CNNs through Convolutional Auto-Encoders (CAE). In particular, several works focused on avoiding redundancy of features by imposing sparsity constraints on them. Sparse features in general imply a more efficient and specific representation [83]. Sparsity definitions can serve various purposes, from ensuring stability of learning to guiding the learner towards specific features. For instance, [84] simply forces the mean activation of the receptive fields of a one-layer 3D CAE to be close to zero in order to prevent the over-complete representation from learning the identity function. In contrast, Hou et al. [85] define a more complex sparsity to directly emphasize properties of nuclei in histopathology images.
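A minimal sketch of such a mean-activation sparsity penalty, in the spirit of [84]; the exact loss form and weighting below are our illustrative assumptions, not the paper's:

```python
import numpy as np

def sparse_cae_loss(x, x_rec, activations, lam=0.1):
    """Reconstruction loss plus a sparsity term pushing the mean activation
    of each feature map towards zero (in the spirit of [84]).
    activations: (n_maps, H, W) hidden feature maps of the encoder."""
    recon = np.mean((x - x_rec) ** 2)
    mean_act = activations.mean(axis=(1, 2))   # one mean per feature map
    sparsity = np.sum(mean_act ** 2)           # penalise non-zero means
    return recon + lam * sparsity

x = np.zeros((8, 8))
dense = sparse_cae_loss(x, x, np.ones((3, 8, 8)))    # saturated maps: penalised
sparse = sparse_cae_loss(x, x, np.zeros((3, 8, 8)))  # sparse maps: no penalty
```

The penalty discourages the over-complete code from trivially copying its input, since copying would keep activations dense.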

For further exploration, we refer to [79], which provides a detailed and comprehensive survey of over 300 deep learning applications in medical image analysis. This survey not only lists exciting advances in medical imaging but gives brief explanations of the models, making it an important read for anybody interested in the field. Furthermore, the authors provide an overview of the current state of the art and a highly valuable discussion of open challenges and future directions. In the following, we highlight contributions that supported our architectural design and the construction of our imaging pipeline from different aspects.

2.3 Data-efficient CNN architectures

Advanced technology and improved algorithms have enabled the training of truly deep and very accurate networks for various computer vision tasks. In recent years, the major interest of the community seems to be shifting towards interpretability and, more importantly, data-efficiency of CNNs. Data-efficient CNNs are able to learn effective representations of images while requiring as little data as possible. This makes such models a potential choice for small datasets.

Figure 2.4: (a) Raw data, (b) pixels the CNN used for a false classification [25]. Recognizing the snowy environment of a wolf might lead to accurate discrimination in the dataset, but does not imply understanding of the concept "wolf".

2.3.1 Parameter efficient architectures

As CNNs become increasingly deep, the problem of maintaining the signal and gradient through many layers emerges. Several works have addressed this issue in different ways [86], [87]. One of the most recent is the DenseNet architecture [88]. The proposed idea aims to support maximum information flow through the network by connecting all layers with each other. In contrast to ResNets, feature maps are concatenated instead of summed. As a welcome side effect, this can lead to a naturally reduced number of parameters and can prevent the learning of redundant feature maps. The authors also report an empirically observed regularizing effect of the dense connections, which further implies parameter-efficiency and can make DenseNets suitable for smaller training sets.
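The concatenation idea can be sketched in a few lines of numpy; here 1×1 "convolutions" stand in for full convolutional layers, and all shapes are illustrative rather than taken from [88]:

```python
import numpy as np

def dense_block(x, weights):
    """DenseNet-style block: each layer receives the concatenation of ALL
    previous feature maps along the channel axis (cf. [88]).
    x: (C, H, W); weights[i]: (growth, C + i*growth), acting as 1x1 convs."""
    features = [x]
    for W in weights:
        inp = np.concatenate(features, axis=0)             # all preceding maps
        new = np.maximum(np.tensordot(W, inp, axes=1), 0)  # 1x1 conv + ReLU
        features.append(new)                               # reused by later layers
    return np.concatenate(features, axis=0)

C, growth, L = 4, 2, 3
rng = np.random.default_rng(0)
x = rng.normal(size=(C, 8, 8))
weights = [rng.normal(size=(growth, C + i * growth)) for i in range(L)]
out = dense_block(x, weights)   # channels grow to C + L*growth = 10
```

The channel count grows only linearly with depth (the "growth rate"), which is where the parameter efficiency of dense connectivity comes from.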

2.3.2 Decreasing flexibility

In the presence of sufficiently large datasets, CNNs are able to learn fine-grained and general features thanks to their highly adaptive receptive fields consisting of numerous adjustable parameters. With a small number of training samples, however, it is not possible to model such complex parameter spaces and CNNs quickly overfit. Similar to a CNN's cascade model but with fixed filter parameters, Scattering Networks [89] can offer effective encoding of images. The scattering framework builds on the enhancement of local and global invariance groups by complete, pre-defined wavelet filter sets, which are tailored explicitly to the type of images at hand [90], [91]. Several works have applied this technique and showed promising results in image classification and understanding [92], [93], [94]. In an even more interesting vein, recent contributions were published about hybrid networks, which use scattering transforms for the early layers of a CNN and train the remaining layers in the traditional supervised way [95], [92]. The results imply that low-level filters and well-known transformation groups do not need to be learned. Moreover, building CNNs on top of engineered scattering models can alleviate overfitting on small datasets and retain superior performance on larger datasets while relying on fewer computational resources for training.

In [28] another way of hybridization is presented, which directly infuses the scattering idea into the traditional CNN architecture and hence retains the ability to train networks end-to-end. The proposed Structured Receptive Field Network (RFNN) replaces the fully adaptive convolutional kernels with weighted combinations of a fixed basis set of filters. The authors chose the family of Gaussian derivatives as basis due to their firmly grounded properties in scale-space theory [96], [97], [98]. RFNNs outperformed Scattering Networks on all reported benchmarks, and classic CNNs where training data was limited. Moreover, a 3D variant of the proposed technique improved the state of the art (at the time of publication) on the widely used ADNI⁵ dataset while using an order of magnitude fewer training samples.

2.3.3 Increasing functionality

Another line of research on data-efficient CNNs investigates how to increase their expressiveness by exploiting the equivariance of certain transformation groups inherent in the training data. In image processing terminology, a model 𝑓 is called equivariant to a class of transformations 𝜏 if for every transformation 𝑇 ∈ 𝜏 of the input 𝑥 another transformation 𝑇′ exists such that 𝑓(𝑇𝑥) = 𝑇′𝑓(𝑥) for all 𝑥 [99]. As a specific case, invariance means that 𝑓(𝑇𝑥) = 𝑓(𝑥) for all 𝑇 ∈ 𝜏; hence representations invariant to the class 𝜏 are also equivariant (but not necessarily vice versa). A widely used approach to exploit (approximate) invariance to transformations of the input is data augmentation [100], which however provides no guarantee that the learned representations generalize. Such a guarantee can be ensured by infusing equivariant properties directly into the encoding model.
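The definition can be checked numerically for the translation group, for which convolution is equivariant with 𝑇′ = 𝑇. A small numpy sketch using circular convolution, so that the shift is exact rather than approximate at the boundary:

```python
import numpy as np

def circ_conv(x, w):
    """Circular 1D convolution: translation-equivariant by construction."""
    N = len(x)
    return np.array([sum(w[k] * x[(n + k) % N] for k in range(len(w)))
                     for n in range(N)])

rng = np.random.default_rng(0)
x = rng.normal(size=16)
w = rng.normal(size=3)

shift = lambda v, s: np.roll(v, s)       # T: translate the input by s samples
lhs = circ_conv(shift(x, 4), w)          # f(Tx)
rhs = shift(circ_conv(x, w), 4)          # T'f(x), here T' = T
assert np.allclose(lhs, rhs)             # equivariance: f(Tx) = T'f(x)
```

The same template, with rotations or reflections in place of shifts, is what the group-equivariant architectures cited below generalize.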

One way to realize this arises naturally from the already discussed Scattering framework. While designing the suitable filter-bank for the domain of application, incorporating group-equivariance in the scattering transform definition is straightforward [90], [94], [91]. In a similar line, [101] utilizes Gabor filters to extract rotation-invariant features prior to a standard CNN architecture.

Another way is to extend architectures, for instance by creating multiple feature maps from transformed copies of the kernels or layer inputs while sharing weights across transformations [102], [103]. Pooling – typically max or average – can be applied either after populating each group of feature maps in every layer [104] or exclusively after the last convolutional layer [105] to obtain global equivariance. Motivated by this line of research, we propose to extend the original RFNN framework with rotation sharing and analyze its effect in multiple local pooling/aggregation scenarios.

A more elegant, nevertheless more complex, framework is one that allows networks to learn equivariant properties from the data instead of encoding them directly. This approach requires essential modifications of the standard CNN architecture. Group and representation theory give it a grounded mathematical foundation, which ensures higher interpretability of the models compared to classic CNNs. Although this field of research is very young, promising results are being published [106], [107], [108], [109], [110].


Chapter 3

Preliminaries

Our proposed method is built upon recent advances in deep neural networks. In the following, we briefly introduce the basic concepts needed to fully understand our method.

3.1 Convolutional Neural Networks

Convolutional Neural Networks (CNN) are extensions of Artificial Neural Networks (ANN) capable of exploiting the locality of features in structured data (e.g. images). Neurons of a CNN have restricted connections, only to a designated region of the input called the receptive field. Via sharing the same weights, the receptive fields enhance the same features of the input at different locations. This significantly reduces the number of parameters to learn and exploits translational equivariance at the same time. Moreover, each layer l ∈ [1, L] of a CNN stacks I_l neurons on each region to increase the variety of features extracted. Activations of each neuron are computed throughout the input of the layer, producing the so-called feature maps (FM). The I_l feature maps – each enhancing different patterns – are then grouped and fed to the consecutive layer. Structuring the neurons ensures that spatial correlation of neighboring variables is captured. Sharing weights across receptive fields allows a highly efficient implementation of the computations via convolution. Computing each feature map, as well as single convolution operations, can also be parallelized, making use of valuable properties of modern GPUs.

Figure 3.1 illustrates 2D and 3D convolution, while (3.1) gives the mathematical formulation of the computation of the m-th FM in the l-th layer, with f being the activation function, F the shared weights designated to the m-th FM and b the learnable bias. (∗) refers to the convolution operation.

$$o_m^l = f\left(\sum_{i=1}^{I_{l-1}} F_{i,m}^l * o_i^{(l-1)} + b_m^l\right) \qquad (3.1)$$
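A naive numpy sketch of Equation (3.1), purely illustrative and not our actual implementation; it uses valid-mode filtering and, as is conventional in CNN libraries, computes cross-correlation rather than flipped convolution:

```python
import numpy as np

def conv_layer(o_prev, F, b):
    """Compute o_m^l = f(sum_i F_{i,m}^l * o_i^(l-1) + b_m^l), eq. (3.1).

    o_prev: input feature maps, shape (I, H, W)
    F:      shared weights,     shape (I, M, kh, kw)
    b:      biases,             shape (M,)
    f is a ReLU here; 'valid' filtering, no padding; * is implemented
    as cross-correlation, the usual CNN convention."""
    I, H, W = o_prev.shape
    _, M, kh, kw = F.shape
    out = np.zeros((M, H - kh + 1, W - kw + 1))
    for m in range(M):
        for i in range(I):
            for yy in range(out.shape[1]):
                for xx in range(out.shape[2]):
                    out[m, yy, xx] += np.sum(F[i, m] * o_prev[i, yy:yy+kh, xx:xx+kw])
        out[m] += b[m]
    return np.maximum(out, 0.0)   # f = ReLU

# Toy check: all-ones input and filters -> every output equals 2 * 9 = 18.
feats = conv_layer(np.ones((2, 5, 5)), np.ones((2, 3, 3, 3)), np.zeros(3))
```

Real implementations replace the explicit loops with highly parallel GPU kernels, which is exactly the efficiency gain of weight sharing discussed above.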

Convolutional layers are usually followed by a max-pooling operation, which sub-samples the spatial resolution of the feature maps by taking the maximum activation of sub-regions. This helps to reduce irrelevant and usually harmful positional dependency of local features and makes the model robust to shifts and distortions.


The output of the last convolutional layer represents the – usually – downscaled output feature set of the feature-extractor part of the network. These features can be fed into further processing layers depending on the task at hand, e.g. classification. For classification, fully connected layers are commonly used, which in practice couple an ANN classifier to the convolutional layers. The last fully connected layer, also called a softmax layer, has a neuron for each class, combined with a softmax activation to compute probabilities. Another popular method is global average pooling, introduced in [111]. The idea is to compute the spatial average of each feature map and feed the resulting vector into the softmax layer. This enforces a direct relation between feature maps and output categories and thus provides a more transparent explanation of the predicted class.

All the weights of a CNN – convolutional as well as fully connected – are optimized with Stochastic Gradient Descent (SGD), using backpropagation to compute the gradients of the weights. With large datasets, computing complete gradients over the training set becomes infeasible. Thus, CNNs are trained using mini-batch SGD, which samples random subsets of the training set and estimates the optimum iteratively. This results in faster learning and models that adapt better to different test cases, and makes training on larger datasets possible.
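Global average pooling followed by a softmax layer can be sketched in a few lines of numpy; shapes and weights are illustrative:

```python
import numpy as np

def gap_softmax(feature_maps, W, b):
    """Global average pooling followed by a softmax layer [111].
    feature_maps: (C, H, W); W: (classes, C); b: (classes,)."""
    pooled = feature_maps.mean(axis=(1, 2))    # one scalar per feature map
    logits = W @ pooled + b
    e = np.exp(logits - logits.max())          # numerically stable softmax
    return e / e.sum()

# Each class score is a direct linear combination of per-map averages,
# which is what makes the map-to-class relation transparent.
probs = gap_softmax(np.ones((4, 6, 6)), np.eye(2, 4), np.zeros(2))
```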

As [112] empirically illustrated, receptive fields detect increasingly more relevant and complex features in higher layers as combinations of the patterns enhanced in previous layers. This property of CNNs allows better understanding of model-behavior and makes them a preferable choice for processing structured data such as images.

3.2 Maximum Intensity Projection

Maximum Intensity Projection (MIP) is a method used in medical imaging that projects the voxels with the highest attenuation within defined sections of the 3D volume onto the 2D plane of choice. The size of the sections and the distance between them can be calibrated to adjust the granularity of the compression. In a special case, the whole image can be taken as one section, thus projecting the 3D scan into a single 2D image. MIPs provide a simple way to enhance contrast-material-filled structures, so they are widely used in vascular imaging, including Positron Emission Tomography (PET), MR Angiography (MRA) and CTA. Their primary clinical application in ischemic stroke is two-fold: locating the thrombus and inspecting the collateral circulation on CTA imaging. Figure 3.2 shows an example of how the MIP can suppress noise and enhance the visibility of the 3D vascular structure.
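Both variants, whole-volume and slab-wise MIP, reduce to a max-reduction along one axis; a minimal numpy sketch where the axis choice and section thickness are illustrative:

```python
import numpy as np

def mip(volume, axis=0, section=None):
    """Maximum Intensity Projection along `axis`.

    section=None projects the whole volume into a single 2D image;
    an integer thickness yields one projected image per slab (slab MIP)."""
    if section is None:
        return volume.max(axis=axis)
    v = np.moveaxis(volume, axis, 0)
    slabs = [v[s:s + section].max(axis=0) for s in range(0, v.shape[0], section)]
    return np.stack(slabs)

vol = np.arange(4 * 3 * 3, dtype=float).reshape(4, 3, 3)   # toy "CTA" volume
full = mip(vol)             # single 2D image, shape (3, 3)
slab = mip(vol, section=2)  # two slabs of thickness 2, shape (2, 3, 3)
```

Because high-attenuation contrast material dominates the maximum, the vessels survive the projection while background noise is suppressed.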


Figure 3.2. A slice from an axial MIP (left) and the original CTA (right) image.


Chapter 4

Methods

As highlighted in the introduction, conventional CNNs do not excel in the presence of small datasets, which poses a major challenge for medical applications. To this end, we present a data-efficient CNN architecture, which builds on the structure of biological receptive fields and is thus able to recognize the same pattern at different scales and orientations. In this chapter, we provide the details of the mathematical formulation of our solution, the network architecture we use, and the classification tasks we explore.

4.1 Data-efficient Networks

4.1.1 Structuring receptive fields

To increase the efficiency of the feature extraction mechanism, we replace the fully learnable convolutional kernels by structured receptive fields, a technique proposed by Jacobsen et al. in [28]. The basic idea is to decrease the number of model parameters by introducing prior knowledge about the spatial properties of local features in terms of a fixed filter basis. Contrary to Scattering Networks [89], [91], [90], invariance is still learned from the data, but on the level of effective filter combinations instead of pixels. Thus, the proposed Structured Receptive Field Networks (RFNNs) can be positioned halfway between Scattering Networks and general CNNs from a data-processing point of view. RFNNs appear to outperform both techniques on small and medium-sized datasets, so they can be an ideal choice when data is sparse. Convolutional kernels are constructed as a weighted combination of Gaussian filters up to a certain order of derivatives, as illustrated in Figure 4.1. The weights are trained via backpropagation, whereas the order of derivatives can be determined by parameter selection. The reason for using Gaussian derivatives is their relation to biological visual perception: neurophysiological studies have shown that measured response profiles of the mammalian visual cortex can be well modeled by Gaussian derivatives up to an order of four [113]. In this respect, an idealized artificial model of a visual system should be built on the same concept [97].


For clarity, we shortly review the mathematical formulation of an RFNN convolutional layer, following the notation introduced in [28]. We denote the output of the l-th layer by o^l, the fixed Gaussian basis filters by φ, the effective convolutional kernels by F and the combination parameters being learnt by α. G: ℝ^D × ℝ+ → ℝ is the D-dimensional Gaussian kernel with scale parameter t, used to construct our discrete basis filters:

$$G(\cdot\,; t) = \frac{1}{(2\pi t)^{D/2}}\, e^{-\frac{x_1^2 + \cdots + x_D^2}{2t}}$$

∗ denotes the convolution operation. Equation (4.1) illustrates the computation of the n-th feature map o_n^l of the l-th convolutional layer in a standard CNN, where F_{i,n}^l is a fully learnable filter and i iterates over input channels. We structure these filters by substituting the linear combination of the fixed basis functions in (4.2), where m iterates over all basis filters; note that in this formulation only the α parameters are learnt. Next, we make use of the commutativity of convolution and, in (4.3), write φ in terms of Gaussian derivatives up to M-th order. In (4.4) we arrive at the formulation that we implemented in practice. We note that rearranging Equation (4.4) allows for a more efficient implementation without calculating the effective filters; however, we aim to closely inspect the emerging filters of our networks and thus use Equation (4.4).

$$o_n^l = \sum_{i=1}^{I} F_{i,n}^l * o_i^{(l-1)} \qquad (4.1)$$

$$= \sum_{i=1}^{I} \left( \sum_{m=0}^{M} \alpha_{i,n,m}^l\, \phi_m \right) * o_i^{(l-1)} \qquad (4.2)$$

$$= \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{m=0}^{M} \alpha_{i,n,m}^l\, \phi_m \right) \qquad (4.3)$$

$$= \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{\substack{0 \le j_1 \le M,\, \ldots,\, 0 \le j_D \le M \\ 0 \le j_1 + \cdots + j_D \le M}} \alpha_{i,n,(j_1,\ldots,j_D)}^l\, \frac{\partial^{\,j_1+\cdots+j_D}}{\partial x_1^{j_1} \cdots \partial x_D^{j_D}}\, G(x_1, \ldots, x_D) \right) \qquad (4.4)$$

Figure 4.1. RFNN layer: F^l convolutional filters in layer l, α^l learnt weights of the basis filters, φ_m Gaussian basis filters up to 2nd order of derivative.


It is important to mention that each Gaussian filter is normalized to unit Root Mean Square (RMS) to keep the magnitudes of the filters in approximately the same range. Otherwise, higher-order derivatives would naturally be suppressed when linearly combined with lower orders. This could harm the learning of the optimal combination coefficients, as the gradient updates would not lie in the same range for all derivatives.
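The construction can be sketched in 1D with numpy: sample the Gaussian, obtain its derivatives by finite differences (standing in for the analytic derivatives used in practice), normalize each basis filter to unit RMS as described above, and combine them with learnable α weights as in (4.2). All sizes are illustrative:

```python
import numpy as np

def gaussian_basis(order, t=1.0, radius=6):
    """Sampled 1D Gaussian derivatives up to `order`, each normalised to
    unit RMS so that no derivative order dominates the combination."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * t)) / np.sqrt(2 * np.pi * t)
    basis = [g]
    for _ in range(order):
        basis.append(np.gradient(basis[-1], x))   # finite-difference derivative
    return np.array([f / np.sqrt(np.mean(f**2)) for f in basis])

phi = gaussian_basis(order=2)          # shape (3, 13): G, G', G''
alpha = np.array([0.2, -1.0, 0.5])     # learnable combination weights
effective_filter = alpha @ phi         # structured receptive field, eq. (4.2)
```

Only the α vector would be updated by backpropagation; the basis itself stays fixed throughout training.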

4.1.2 Multi-scale representation

In line with [28], we make use of the grounded properties of Gaussians in scale-space theory [97]. Scale-space theory provides a mathematical framework to study the multi-scale nature of image features. Functions in scale-space are equipped with the so-called self-similarity property: they can change their effective size through a continuous scale parameter while retaining their spatial structure. The purpose of modelling images with such functions originates from the multi-scale appearance of objects in image data. Scale-space allows local features to be described decoupled from their scale, so the same structures can be detected at different scales.

To realize the scale-space representation L: ℝ^D × ℝ+ → ℝ of an arbitrary continuous function f: ℝ^D → ℝ, one needs a suitable kernel and transformation. Scale-space prescribes a set of properties to be met by this transformation and kernel, called the scale-space axioms. Essential ones, amongst others, are linearity, translation invariance and the notion that no new local structures are introduced when the scale is increased. Convolution by the Gaussian kernel was shown to satisfy the axioms and is thus the unique method for generating a scale-space [97], [96]:

$$L(\cdot\,; t) = G(\cdot\,; t) * f(\cdot), \qquad L(\cdot\,; 0) = f(\cdot) \qquad (4.5)$$

where L(·; 0) represents the "zero scale", equal to the original function.

Thus, the representation of a function in scale-space is practically given by Gaussian smoothing parameterized by the variance of the kernel. Using this, we can formulate the scale-space representation of our RFNN layers by substituting the output feature maps in the place of f(·) in (4.5):

$$L(\cdot\,; t)_n^l = G(x_1, \ldots, x_D; t) * o_n^l \qquad (4.6)$$

$$= G(x_1, \ldots, x_D; t) * \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{\substack{0 \le j_1 \le M,\, \ldots,\, 0 \le j_D \le M \\ 0 \le j_1+\cdots+j_D \le M}} \alpha_{i,n,(j_1,\ldots,j_D)}^l\, \frac{\partial^{\,j_1+\cdots+j_D}}{\partial x_1^{j_1} \cdots \partial x_D^{j_D}}\, G(x_1, \ldots, x_D) \right) \qquad (4.7)$$

$$= \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{\substack{0 \le j_1 \le M,\, \ldots,\, 0 \le j_D \le M \\ 0 \le j_1+\cdots+j_D \le M}} \alpha_{i,n,(j_1,\ldots,j_D)}^l\, \frac{\partial^{\,j_1+\cdots+j_D}}{\partial x_1^{j_1} \cdots \partial x_D^{j_D}}\, G(x_1, \ldots, x_D; t) \right) \qquad (4.8)$$

$$= \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{m=0}^{M} \alpha_{i,n,m}^l\, \phi_m(t) \right) \qquad (4.9)$$


Using the associativity and distributivity of convolution, we can propagate the smoothing Gaussian all the way to our fixed Gaussian derivatives, which simply inherit the scale of the smoothing signal. This means that the original definition of an RFNN layer complies with the scale-space representation, where the scale of the effective filters is sampled from the continuous parameter t. Practically, a multi-scale RFNN layer produces T · N feature maps, where N is the output depth of the layer and T is the number of scales used. Feature maps associated with different scales enhance local features at those scales. An illustration of a multi-scale RFNN layer is shown in Figure 4.2.
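The scale-inheritance argument rests on the Gaussian semigroup property G(·; t₁) ∗ G(·; t₂) = G(·; t₁ + t₂), which can be checked numerically; circular convolution via FFT avoids boundary effects. This is a sketch of the property, not our pipeline code:

```python
import numpy as np

def periodic_gauss(t, N):
    """Discrete Gaussian kernel of scale t on a circle of N samples."""
    d = np.minimum(np.arange(N), N - np.arange(N)).astype(float)
    g = np.exp(-d**2 / (2 * t))
    return g / g.sum()

def smooth(x, t):
    """Circular Gaussian smoothing via FFT: x -> G(.; t) * x."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(periodic_gauss(t, len(x)))))

rng = np.random.default_rng(0)
signal = rng.normal(size=128)

twice = smooth(smooth(signal, 2.0), 3.0)   # G(.; 3) * (G(.; 2) * signal)
once = smooth(signal, 5.0)                 # G(.; 2 + 3) * signal
err = np.max(np.abs(twice - once))         # tiny: scales simply add up
```

Because smoothing a Gaussian derivative is again a Gaussian derivative at a larger scale, an RFNN layer can be evaluated at any t without changing its learnable parameters.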

4.1.3 Multi-orientation representation

When looking at Gaussian filters, one can clearly recognize symmetries as well as steerability, a less obvious yet important property. Steerability characterizes special functions whose transformed versions can be expressed as linear combinations of a fixed basis set of functions. For instance, rotated Gaussian derivatives emerge from linear combinations of the derivatives along the x and y (and z in 3D) axes. Steerable filters enable the enhancement of diverse, locally oriented structures in images while suppressing noise, and thus have desirable properties for image analysis. Even though the original formulation of RFNNs allows rotated versions of the basis set to emerge, the exact coefficients of the polynomials are still challenging to learn when training samples are sparse. We propose to design multiple orientations explicitly into the model to facilitate the recognition of complex features in case of sparse data. This approach leans more towards Scattering Networks [91] while retaining the flexibility of CNNs to learn effective combinations of the filters.

Following the guidelines of [97], [114], [115], we formulate the minimal set of x-y (-z in 3D) separable filters for all orders and derive the necessary interpolation functions to rotate them. We present the steps of the 2D formulation here and provide the same derivation for the 3D case in the Appendix.


We discuss the steerability of even- or odd-parity functions of the form of a polynomial times a separable windowing function. Moreover, these functions are assumed to have an axis of rotational symmetry pointing along the direction cosines α and β. More formally,

$$f^{\theta}(\cdot) = W(r)\, P_M(x') \qquad (4.10)$$

describes the version of such a function f rotated by the transformation θ, where W(r) is a radially symmetric function and P_M is an M-th order polynomial in

$$x' = \alpha x + \beta y. \qquad (4.11)$$

Steering theorem: Given the axially symmetric function f(x, y) = W(r)P_M(x), with P_M an even- or odd-symmetry M-th order polynomial, let θ: (α, β) denote the direction cosines of the axis of symmetry of f^θ(x, y) and θ_j: (α_j, β_j) the direction cosines of the axis of symmetry of f^{θ_j}(x, y). Then,

$$f^{\theta}(x, y) = \sum_{j=1}^{J} k_j(\theta)\, f^{\theta_j}(x, y) \qquad (4.12)$$

holds if and only if 1. J ≥ M + 1, and 2. the k_j(θ) satisfy

$$\begin{pmatrix} \alpha^{M} \\ \alpha^{M-1}\beta \\ \vdots \\ \beta^{M} \end{pmatrix} = \begin{pmatrix} \alpha_1^{M} & \alpha_2^{M} & \cdots & \alpha_J^{M} \\ \alpha_1^{M-1}\beta_1 & \alpha_2^{M-1}\beta_2 & \cdots & \alpha_J^{M-1}\beta_J \\ \vdots & \vdots & \ddots & \vdots \\ \beta_1^{M} & \beta_2^{M} & \cdots & \beta_J^{M} \end{pmatrix} \begin{pmatrix} k_1(\alpha, \beta) \\ k_2(\alpha, \beta) \\ \vdots \\ k_J(\alpha, \beta) \end{pmatrix} \qquad (4.13)$$

Considering the need to extend to 3D for medical volumetric data, we aim to design our steerable filters in the most efficient way. Computational costs for non-x-y-separable 3D filters grow as the cube of the filter size, while they grow only linearly in the x-y separable case [114]. Here, we call a function f(x, y) x-y separable if f(x, y) = f(x)f(y). Convolution with such 2D filters can be implemented as a sequence of 1D convolutions, hence the computational benefit. To this end, we formulate steerability using x-y separable basis filters.

By the Steering theorem, J = M + 1 basis functions suffice to steer f(x, y). Let us assume that a basis set of J x-y separable functions exists. There will then be some set of separable basis functions R_j(x)S_j(y) so that

$$f^{\theta}(x, y) = W(r) \sum_{j=1}^{J} k_j(\theta)\, R_j(x)\, S_j(y). \qquad (4.14)$$

To find the interpolation functions k_j(θ), we equate the products of x, y in (4.10) with those in (4.14). Each basis function R_j(x)S_j(y) can contribute only one product of powers of x, y of order M, so that P_M(x') remains a polynomial of order M. Considering only these highest-order products we have

$$R_j(x)\, S_j(y) = c\,(x^{M-j} + \cdots + 1)(y^{j} + \cdots + 1) \qquad (4.15)$$


where c is a constant. Substituting (4.11) into (4.10) yields 𝑀 + 1 coefficients for 𝑥^{𝑀−𝑖}𝑦^{𝑖}, where 0 ≤ 𝑖 ≤ 𝑀, according to the multinomial theorem:

\[ (x')^{M} = \sum_{i=0}^{M} \binom{M}{i} \alpha^{M-i} \beta^{i}\, x^{M-i} y^{i} = \sum_{i=0}^{M} \frac{M!}{(M-i)!\, i!}\, \alpha^{M-i} \beta^{i}\, x^{M-i} y^{i}. \tag{4.16} \]

Thus, the interpolation functions follow:

\[ k_j(\theta) = \frac{(J-1)!}{(J-j)!\,(j-1)!}\, \alpha^{J-j} \beta^{j-1} \tag{4.17} \]

for 1 ≤ 𝑗 ≤ 𝐽 = 𝑀 + 1 (using the indexing convention of (4.14)). To find the basis functions 𝑅𝑗(𝑥)𝑆𝑗(𝑦), we can rewrite the steering equation (4.14) in the form:

\[ \begin{pmatrix} f^{\theta_1}(x, y) \\ f^{\theta_2}(x, y) \\ \vdots \\ f^{\theta_J}(x, y) \end{pmatrix} = W(r) \begin{pmatrix} k_1(\theta_1) & k_2(\theta_1) & \cdots & k_J(\theta_1) \\ k_1(\theta_2) & k_2(\theta_2) & \cdots & k_J(\theta_2) \\ \vdots & \vdots & \ddots & \vdots \\ k_1(\theta_J) & k_2(\theta_J) & \cdots & k_J(\theta_J) \end{pmatrix} \begin{pmatrix} R_1(x)S_1(y) \\ R_2(x)S_2(y) \\ \vdots \\ R_J(x)S_J(y) \end{pmatrix}. \tag{4.18} \]

By inverting the interpolation matrix of 𝑘𝑗’s, the basis functions emerge as a linear combination of rotated replicas of 𝑓; thus, they are steerable themselves as well.

This formulation allows us to efficiently synthesize a multi-orientation filter set from Gaussian derivatives. Any Gaussian derivative of arbitrary order 𝑚 can be obtained from the orthogonal – physicists’ – Hermite polynomials 𝐻𝑚 through pointwise multiplication with a Gaussian windowing function:

\[ G_m(x; \sigma) = \frac{\partial^{m}}{\partial x^{m}} G(x; \sigma) = (-1)^{m} \frac{1}{(\sigma\sqrt{2})^{m}}\, H_m\!\left(\frac{x}{\sigma\sqrt{2}}\right) \circ G(x; \sigma). \tag{4.19} \]

This means that the Gaussian derivative basis of our RFNN layers is steerable with a minimal set of 𝑀 + 1 x-y separable basis functions. Next, we formulate a Multi-Scale and -Orientation RFNN layer, denoted MSO-RFNN layer, by using the steering equation (4.14) in (4.9) for the basis functions 𝜙𝑚. As the combination of steering basis functions results in one oriented derivative per order, we change back to the notation of (4.3) in the summation over orders:

\[ L_n^{l}(\cdot\,; t; \theta) = \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{m=0}^{M} \alpha_{i,n,m}^{l}\, \phi_m(t; \theta) \right) \tag{4.20} \]
\[ = \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{m=0}^{M} \alpha_{i,n,m}^{l} \sum_{j=1}^{J} k_j^{m}(\theta)\, R_j^{m}(x; t)\, S_j^{m}(y; t) \right) \tag{4.21} \]
\[ = \sum_{i=1}^{I} o_i^{(l-1)} * \left( \sum_{m=0}^{M} \alpha_{i,n,m}^{l} \sum_{j=1}^{J} k_j^{m}(\theta)\, G_j^{m}(x, y; t; \theta) \right). \tag{4.22} \]
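The Hermite construction of 1D Gaussian derivatives can be sketched as follows (an illustrative numpy snippet using the physicists' Hermite series evaluator `numpy.polynomial.hermite.hermval`; the scale σ and the sampling grid are arbitrary, and this is not the thesis implementation). The assertions compare the first two orders against their closed-form expressions.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

sigma = 1.5
x = np.linspace(-6, 6, 241)
gauss = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def gaussian_derivative(m, x, sigma):
    """m-th derivative of a 1D Gaussian via physicists' Hermite polynomials."""
    coeffs = np.zeros(m + 1)
    coeffs[m] = 1.0                      # selects H_m in the Hermite series
    u = x / (sigma * np.sqrt(2))
    window = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    return (-1) ** m / (sigma * np.sqrt(2)) ** m * hermval(u, coeffs) * window

# First derivative matches the analytic form -x / sigma^2 * G(x).
d1 = gaussian_derivative(1, x, sigma)
assert np.allclose(d1, -x / sigma**2 * gauss)

# Second derivative matches (x^2 / sigma^4 - 1 / sigma^2) * G(x).
d2 = gaussian_derivative(2, x, sigma)
assert np.allclose(d2, (x**2 / sigma**4 - 1 / sigma**2) * gauss)
```

In an MSO-RFNN layer, 1D filters of this kind play the role of the separable factors of the oriented basis: an oriented 2D Gaussian derivative arises as a k_j(θ)-weighted combination of products of such 1D filters in x and y.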
