
Sparse Redundancy Analysis of High Dimensional Genetic and Genomic Data

A novel implementation of a multivariate statistical tool in omics data analysis

Attila Csala

June 20, 2016, 62 pages

University of Amsterdam

Faculty of Medicine


Scientific Research Project

Title: Sparse Redundancy Analysis of High Dimensional Genetic and Genomic Data
Subtitle: A novel implementation of a multivariate statistical tool in omics data analysis
Location: Academic Medical Center - University of Amsterdam

Department of Clinical Epidemiology, Biostatistics and Bioinformatics
Meibergdreef 15, 1105 AZ Amsterdam Z.O.,

The Netherlands

Period: November 2015 - June 2016

Student: Attila Csala

Student number: 10196986
e-mail: mail@csala.me

Tutor: Dr. F.P.J.M. (Frans) Voorbraak

Academic Medical Center - University of Amsterdam
Department of Medical Informatics
Meibergdreef 15, 1105 AZ Amsterdam Z.O., The Netherlands

Room: J1B-110

Phone: +31 (0)20-56 65853

e-mail: f.p.voorbraak@amc.uva.nl

Mentor: Dr. M. H. P. (Michel) Hof

Academic Medical Center - University of Amsterdam

Department of Clinical Epidemiology, Biostatistics and Bioinformatics
Meibergdreef 15, 1105 AZ Amsterdam Z.O.,
The Netherlands
Room: J1B-207
Phone: +31 (0)20-56 66950
e-mail: m.h.hof@amc.uva.nl

Mentor: Prof. Dr. A.H. (Koos) Zwinderman

Academic Medical Center - University of Amsterdam

Department of Clinical Epidemiology, Biostatistics and Bioinformatics
Meibergdreef 15, 1105 AZ Amsterdam Z.O.,
The Netherlands
Room: J1B-226

Phone: +31 (0)20-56 65820


Contents

Abstract

1 Background
  1.1 The role of genetic and genomic data in biomedical informatics
  1.2 Genetic and molecular biology
  1.3 Statistical challenges for genetic and genomic big data analysis
    1.3.1 High dimensionality leads to computational challenges as well as challenges in interpretation
  1.4 Statistical tools for genetic and genomic big data analysis
    1.4.1 Principal Component Analysis (PCA)
    1.4.2 Canonical Correlation Analysis (CCA)
    1.4.3 Motivation for a novel statistical tool for big genetic and genomic data
    1.4.4 Redundancy Analysis (RDA)
  1.5 Research questions

2 Bioinformatics paper
  2.1 Introduction
  2.2 Methods
    2.2.1 Multivariate techniques for integromics data analysis: CCA and RDA
    2.2.2 RDA for genetic and genomic big data analysis
    2.2.3 Analysis of multiple data sets with RDA
  2.3 Results
    2.3.1 Simulation studies
    2.3.2 Analysis of methylation markers and gene expression data sets
  2.4 Discussion
  2.5 Conclusion

3 Discussion and future perspectives
  3.1 Future perspective
  3.2 Potential users
  3.3 Final conclusion

Bibliography

Appendices
  A List of abbreviations
  B Figures
  C RDA R-code


Abstract

Recent technological developments have enabled the possibility of genetic and genomic integrated data analysis approaches, where multiple omics data sets from various biological levels are combined and used to describe (disease) phenotypic variations. The main goal is to explain and ultimately predict phenotypic variations by understanding their genetic basis and the interaction of the associated genetic factors. Therefore, understanding the underlying genetic mechanisms of phenotypic variations is an ever increasing research interest in biomedical sciences. The current statistical toolbox for the analysis of integrated genetic and genomic data is unfortunately lagging behind. One of the techniques that has been developed is based on Canonical Correlation Analysis (CCA), which optimizes the multivariate correlation coefficients of the various data sets involved. A disadvantage of this method is that it does not make a distinction between the variables which need to be predicted (e.g. phenotypes) and the predictive variables (e.g. the genetic and genomic data). It is possible to make this distinction using a well known multivariate technique called Redundancy Analysis (RDA). However, current implementations of this technique cannot deal with the usual high dimensional nature of genetic and genomic data.

The subject of my Scientific Research Project (SRP) is the development and implementation of a version of RDA, Sparse Redundancy Analysis (sRDA), that can deal with the usual high dimensional nature of omics data sets. I had two general hypotheses, namely that (1) my software implementation of sRDA is reliable in identifying correlated components in high dimensional omics data sets, and that (2) sRDA is computationally faster than CCA. These questions were answered by applying sRDA to simulated omics data of similar size to those occurring in ongoing omics studies: I simulated data from large samples (n=10,000) and small samples (n=100) of patients. Simulations were repeated multiple times and the performance of sRDA and CCA was quantified by the sensitivity and specificity of identifying the simulated pathways. Also, chromosome-wide genetic and methylation data of 450,000 markers were analysed with both sRDA and CCA.

The simulation studies showed that my implementation of the sRDA algorithm is capable of analysing high dimensional data sets and of identifying multiple latent variables that are associated with the explanatory and response data sets. Sensitivity in the simulation studies ranged between 0.70 and 0.99 (increasing with sample size) and specificity was approximately 0.98. The software implementation of sRDA is also compatible with parallel computation, so computational time can be reduced further. The algorithm is also extendable with various penalization methods, which further optimize the analysis procedure. In some cases, applying different penalization methods can lead to a fivefold reduction in computational time without a serious decrease in the algorithm's ability to identify highly correlated variables. The real data analysis showed that sRDA is computationally faster than CCA (the analyses took 267 and 575 seconds, respectively).

The aim of this SRP was to make an RDA implementation that can deal with high dimensional integromics data sets. The main motivation was to improve the interpretability of the analysis results and to reduce computational time compared to sparse CCA in research that aims to explain the variability in phenotypes from other clinical and molecular data sources. The end product of my SRP is an R implementation of an analytic method for big genetic and genomic data analysis that delivers easily interpretable results to researchers both with and without a background in multivariate statistics.

Keywords: Redundancy Analysis (RDA), Canonical Correlation Analysis (CCA), genetic and genomic data, big data, integromics


Dutch summary

Recent technological developments have made it possible to analyse genetic and omics data simultaneously, where multiple omics data sets from different biological levels are combined to describe phenotypic variations. The main goal of such analyses is often to explain, and ultimately predict, phenotypic variations on the basis of genetic factors. As a result, understanding the underlying genetic mechanisms of phenotypic variations is an increasingly common research topic in the biomedical sciences. However, the current statistical toolbox for integrated genetic and omics data analysis lags behind the wishes of its users.

One of the available statistical techniques that is often used is Canonical Correlation Analysis (CCA), in which the multivariate correlation coefficients of the various data sets involved are optimized. A disadvantage of this method is that no distinction is made between the variables that need to be predicted (for example phenotypes) and the variables on which the prediction is based (for example the genetic and omics data). With another well-known multivariate statistical technique, Redundancy Analysis (RDA), this distinction can be made. Unfortunately, the current implementations of this technique cannot handle the high-dimensional character that genetic and omics data commonly have.

The subject of my Scientific Research Project (SRP) is the development and implementation of a version of RDA that can handle the usual high-dimensional character of omics data sets. I call this version of RDA Sparse Redundancy Analysis (sRDA). For my SRP I had two general hypotheses, namely that (1) my software implementation of sRDA can reliably identify the correlated components in high-dimensional omics data sets, and that (2) sRDA is computationally faster than CCA. I tested both hypotheses by applying sRDA to simulated omics data. For this purpose I simulated data from large samples (n = 10,000) and small samples (n = 100) of patients. The simulations were repeated multiple times and the performance of sRDA and CCA was quantified by sensitivity and specificity measures. In addition, I analysed chromosome-wide genetic and methylation data of 450,000 markers with sRDA and CCA.

The simulation studies showed that my implementation of the sRDA algorithm can analyse high-dimensional data sets and that the method is also able to identify multiple latent variables (which are associated with the explanatory and response data sets). Sensitivity in the simulation studies varied between 0.70 and 0.99 (higher with a larger sample) and specificity was approximately 0.98. Furthermore, the software implementation of sRDA is compatible with parallel computing techniques, so that the computation time can be reduced further. The algorithm can also be extended with various penalization methods that further optimize the analysis procedure. In my simulation study I found that in some cases applying different penalization methods led to a fivefold reduction in computation time without a significant deterioration of sensitivity and specificity. The chromosome-wide genetic and methylation data analysis showed that sRDA is computationally faster than CCA (the analyses took 267 and 575 seconds, respectively).

The aim of this SRP was to build an RDA implementation that can analyse high-dimensional integromics data sets. The main motivation was to improve the interpretability of the analysis results and to reduce the computation time compared to penalized CCA. The end product of my SRP is an R implementation of an analytic method for big genetic and genomic data analysis that delivers interpretable results to researchers both with and without a background in multivariate statistics.


Chapter 1

Background

This report is part of my Scientific Research Project (SRP) and forms the Master thesis of the Master program of Medical Informatics at the University of Amsterdam (UvA). For my SRP, I worked on the development and implementation of a multivariate statistical tool for genetic and genomic data analysis. My SRP internship took place at the department of Clinical Epidemiology, Biostatistics and Bioinformatics at the Academic Medical Center (AMC), Amsterdam. The SRP led to the submission of a paper for publication in the Bioinformatics journal. The structure of this SRP report is identical to that of the paper submission (Chapter 2: Bioinformatics paper), except that there is an additional introductory chapter (Chapter 1: Background) and an extended discussion chapter (Chapter 3: Discussion and Future Perspectives).

In this first chapter, my aim is to provide an introduction to the underlying biomedical and statistical concepts that can hopefully help to make the Bioinformatics journal paper informative for the reader. The content of this chapter conveys all the information that I needed when I started to work on the topic of multivariate analyses of genetic and genomic big data eight months ago. At that time, being a major in Medical Informatics, I had elementary proficiency in most of the biomedical concepts (e.g. those from the molecular and cellular processes), and the statistics behind multivariate methods (e.g. Redundancy Analysis RDA, Canonical Correlation Analysis CCA) for genetic and genomic big data analysis looked rather intimidating. By now, hopefully, I can provide an introduction that could help the reader to not only appreciate the emerging importance of the combination of these topics but also see how they are interrelated with each other and can eventually drive the further development of biomedical informatics and health sciences in general.

In the subsections of this chapter, my aim is to explain the role of genetic and genomic data in biomedical sciences. Section 1.1 provides a definition of biomedical informatics and puts genetic and genomic research in perspective. Section 1.2 provides a short introduction to the terminology and underlying principles of genetic and molecular biology that are needed to appreciate why biomedical and genomic data analysis can help us understand (disease) phenotype variations. Section 1.3 introduces the statistical challenges of genetic and genomic data analysis and Section 1.4 describes some of the available statistical tools for genetic and genomic data analysis, including the motivation for the development of a novel statistical tool. I state my research questions in Section 1.5.

Chapter 2 describes in more depth the mathematical-statistical methodology I developed and implemented during my SRP, and Chapter 3 provides an extended discussion of the strengths and limitations of the study.


Figure 1.1: Biomedical informatics perspective (Shortliffe14).

1.1 The role of genetic and genomic data in biomedical informatics

This section aims to explain what the role of genetic and genomic data in biomedical sciences is and how genetic and genomic research is related to medical informatics. Medical Informatics is defined as the "discipline concerned with solving information-related problems in the medical and healthcare sectors"¹. According to (Shortliffe14), the semantics of medical informatics has somewhat changed in recent years: nowadays, biomedical informatics refers to the discipline that used to be coined health informatics. A more formal definition of biomedical informatics is given by the American Medical Informatics Association:

"Biomedical informatics is the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem solving, and decision making, driven by efforts to improve human health" (Kulikowski12).

(Shortliffe14) argues that medical informatics is the area of biomedical informatics that deals with information-related problems and challenges on the individual's (patient and health care professional) level (Figure 1.1), while bioinformatics studies the structure and analysis of information at the genetic and molecular level. One main driving force of biomedical informatics is to translate biomedical and genomic data into clinical care (Shortliffe14).

1.2 Genetic and molecular biology

Understanding the underlying genetic mechanisms of complex (disease) phenotypes is an ever-increasing research interest in biomedical sciences.

¹ https://www.uva.nl/en/education/master-s/master-s-programmes/item/medical-informatics.html (last accessed on June 17, 2016)


Figure 1.2: Central dogma of molecular biology; the arrows indicate the sequential transfer of information at the genetic and molecular level. Adapted from (Rossi03).

According to the central dogma of molecular biology, there is a sequential transfer of information within a biological organism that has a well-described directionality (Crick70). The basic unit that stores information in all known biological organisms is Deoxyribonucleic Acid (DNA), which contains four types of nucleotides (cytosine, guanine, adenine, and thymine). Genetic information is coded both by the sequential order of these nucleotides and by the structure of the DNA double helix on which these nucleotides are situated. Genes are the functional units of a DNA sequence that produce proteins through DNA transcription into Ribonucleic Acid (RNA) and RNA translation into protein (Figure 1.2). Proteins are biomolecules considered to be the building blocks of cells, responsible for functions such as replication of DNA and metabolic reactions. There are about 20,000 different known human genes and they can be found in almost all human cells. A particular gene is composed of only a fraction of the approximately 3,000,000,000 DNA base pair nucleotides. A distinction is made between the study of genes and the study of those DNA base pair nucleotides whose functions are (yet) unknown: genetics is the study of the DNA sequence pairs which are identified as a gene, and genomics is the study of the entire genetic material of a given organism. Bioinformatics is the discipline that studies the representation, structure and analysis of information in biological systems (Figure 1.1) (Shortliffe14).

The directionality of information transfer implies that an organism's genotypic variations (i.e. variation in genetic data) are responsible for its phenotypic variations (i.e. variation in an organism's traits). Therefore, understanding the underlying genetic mechanisms of complex phenotypes is an ever-increasing research interest in biomedical information sciences.

In the past, genetic and genomic data have been associated with (disease) phenotypes on a variable-by-variable basis. This single-data-type analysis approach led to discoveries of basic biological pathways but it did not properly model the complex biological networks and therefore could not reveal the underlying mechanisms of complex traits (Ritchie15).

Recently, however, there has been emerging interest in associating multiple phenotypes with multiple genetic and/or genomic data sets. This area is referred to as "integromics" (Lê Cao09a), which aims at relating clinical and several molecular data sources all at once in an effort to find pathways that are involved in the phenotypes of interest.

In the last decade technological advances have enabled measurement of thousands to millions of genetic and molecular markers in patients (Ritchie15, Adams15). Many medical research projects have established biobanks for data sets from various biological levels, storing both biological tissue and in-silico data of patients on genetic variation, gene- and protein-expression, metabolites, and metagenomes of multiple tissues, as well as multiple disease-phenotypes varying over time, under different environmental exposures and after various treatments (Figure 1.3). At the same time advancements in knowledge databases and the establishment of large biobanks have provided huge data-mining possibilities. National computing services (e.g. SURFsara) and leading data-science companies (e.g. Amazon, IBM: Watson) have developed large-scale computer facilities to enable dealing with very large databases.

Figure 1.3: Information flow between omics domains. Adapted from (Ritchie15).

These developments have unlocked the possibility of integrated data analysis approaches, where multiple data sets from various biological levels (Figure 1.3) can be combined and used as predictive variables to describe phenotypic variations. Such an integrated data analysis approach allows a more holistic modeling and therefore, hopefully, a better understanding of the underlying genetic basis of complex traits (Ritchie15, Gligorijević15). The underlying objective is to explain and ultimately predict complex traits by understanding their genetic basis and the interaction of the associated genetic factors. This is one of the primary motivations behind the concept of integrated data analysis, which has become a major driving force in translational bioinformatics, a field that aims to translate biomedical and genomic data into clinical care (Shortliffe14). Many life-scientists expect that translational bioinformatics will enable some form of personalized medicine in the near future; this is also the principal subject of systems medicine.

Many of these omics data sets (Figure 1.3) are used in research projects at the AMC too. An example is the Helius study, in which phenomic, genomic, metabolomic and metagenomic data of up to 25,000 participants have been sampled (Stronks13). Another example is the EU-funded Biostat-CHF project of 2,500 patients with heart failure, aiming at finding biomarkers for heart failure-treatment efficiency using genetic markers and gene-expression, proteomic and metabolomic measurements (Ouwerkerk14). A third example is the Compare project of 250 patients with Marfan syndrome, aiming at finding biomarkers for aneurysm progression using genetic and methylation markers and gene and metabolite expression measurements (Franken16).


1.3 Statistical challenges for genetic and genomic big data analysis

Translational bioinformatics aims to bridge the gap between biomedical research and applied clinical practice. Technological advances have enabled measurement and storage of thousands up to millions of genetic and molecular markers in patients, paving the road towards integrated data analysis approaches of genetic and genomic data with the objective to improve our understanding of complex phenotypes. The current statistical toolbox for the analysis of integrated genetic and genomic data is however lagging behind (Gligorijević15). The main challenges that one can encounter during analysis of genetic and genomic data include the high dimensional nature of the data and the integration of multiple sources of genetic and genomic data. (Ritchie15) describes these in detail, including the challenges of potential confounding and the quality of the data. A distinction can be made between the statistical challenges of the single-data-type analysis approach of individual omics data sets and the meta-dimensional analysis of integrated omics data.

Most statistical analyses focus on the association between a single disease-phenotype and one marker or a subset of markers from a particular omics data set. This is the main approach of all Genome Wide Association Studies (GWAS). In GWAS, genomic data from different individuals are analysed in order to discover genetic variants. Most commonly, variation in Single-Nucleotide Polymorphisms (SNPs) is compared to traits and other biological markers. SNPs are variations of single nucleotides in the DNA sequence that are present in some part of the population. GWAS thus mainly aim to explain traits through SNPs. There are over 60,000,000 SNPs identified in the human genome. Some traits result from SNPs, and SNPs have also been shown to increase risk for complex traits (Hamosh92, Wolf13).

1.3.1 High dimensionality leads to computational challenges as well as challenges in interpretation

One of the typical characteristics of SNP annotation data (and of genetic and genomic data in general) is its high dimensionality. Current measurements can contain up to millions of variables, which usually exceeds the sample size. For example, there are over 60,000,000 SNPs identified in the human genome, and these variations are usually stored as variables for a number of different patients. In such a scenario, the number of variables vastly exceeds the number of patients.

Among others, there are three major challenges when such high dimensional data is to be analysed. Firstly, simple regression analysis often fails to perform (Johnstone09) because of the presence of linearly correlated variables (i.e. multicollinearity). Regression analysis involves the inversion of the correlation matrix, and inversion is only possible if the correlation matrix is non-singular. In the case of multicollinearity, the correlation matrix becomes singular, thus non-invertible, in which case regression analysis cannot be performed (Boulesteix07); the snippet below illustrates this. The second main challenge concerns the interpretability of the results, which is extremely difficult with such a high number of variables (Witten09a). The third challenge is the speed of computation, which can take up to several days in a full GWAS analysis.
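As a small illustration of the multicollinearity problem, the following toy snippet (ours, not from the thesis) shows in base R that the correlation matrix of a p > n data set is rank-deficient, so it cannot be inverted:

# n = 10 patients, p = 50 markers: the correlation matrix cannot be inverted
set.seed(1)
X <- matrix(rnorm(10 * 50), nrow = 10, ncol = 50)
R <- cor(X)
qr(R)$rank     # at most n - 1 = 9, far below p = 50
try(solve(R))  # error: the system is computationally singular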

1.4 Statistical tools for genetic and genomic big data analysis

1.4.1 Principal Component Analysis (PCA)

A widely used tool for analysing genomic data is Principal Component Analysis (PCA), which reduces the dimensionality of high dimensional data (Ma11). PCA aims to describe the variation in data with the help of latent variables or Principal Components (PCs). These latent variables are linear combinations of the original variables that are extracted from the data set with the goal of explaining the most variance in the data set with the fewest possible variables (i.e. PCs). The latent variables are orthogonal to each other, thus the covariances of the different latent variables with each other are zero, which leads to the property that each latent variable explains a different portion of the variance in the data set. For example, if we have a gene expression data set with 20,000 genes, we can apply PCA on the data set to see which linear combinations of the original variables explain the most variance in the original data set. By doing so, the dimensionality of the data set can often be reduced from 20,000 to a handful of variables (i.e. PCs) that still preserve most of the variance that was observed in the original variables. As a graphical illustration of PCA, consider the case of six patients and two gene expression variables (Figure 1.4A contains the scatter plot of the two gene expression variables).

Figure 1.4: PCA applied on a two dimensional data set. PCA explains the variance in the original data with the help of latent variables (i.e. PCs). In this example, most of the variance in the data can be explained by the 1st PC.

The characteristic equation of PCA is

$$(\mathbf{R} - \lambda_k\mathbf{I})\mathbf{u}_k = 0, \tag{1.1}$$

where $\mathbf{R}$ is the covariance matrix of the data set $\mathbf{X}$, $\lambda_k$ is the $k$-th eigenvalue of $\mathbf{R}$, $\mathbf{I}$ is the identity matrix and $\mathbf{u}_k$ is the $k$-th eigenvector of $\mathbf{R}$ (Legendre12). For example, if there is a two dimensional data set $\mathbf{X}$ of six patients and two gene expression variables,

$$\mathbf{X} = \begin{bmatrix} 2 & 1 \\ 3 & 2 \\ 0 & 0.5 \\ 1.5 & 2 \\ 4 & 3.5 \\ 4.2 & 3 \end{bmatrix},$$

then $\mathbf{R}$, the covariance matrix of $\mathbf{X}$, becomes

$$\mathbf{R} = \begin{bmatrix} 2.575 & 1.64 \\ 1.64 & 1.3 \end{bmatrix}.$$


PCA is applied on the covariance matrix of the original data set and searches for eigenvalues and eigenvectors of this covariance matrix. The covariance matrix of a data set is the matrix which displays in row i and column j the covariance between the i-th and j-th variable. This matrix is symmetric and contains on the diagonal the variances of the variables (it is also known as the variance-covariance matrix). A positive, resp. negative, covariance between variables denotes a positive, resp. negative, correlation between the variables. A zero covariance corresponds with uncorrelated variables.

Applying PCA on the above covariance matrix leads to

$$\left(\begin{bmatrix} 2.575 & 1.64 \\ 1.64 & 1.3 \end{bmatrix} - \begin{bmatrix} \lambda_1 = 3.6970472 & 0 \\ 0 & \lambda_2 = 0.1779528 \end{bmatrix}\right)\begin{bmatrix} -0.8253209 & 0.5646640 \\ -0.5646640 & -0.8253209 \end{bmatrix} = 0.$$

The principal components are then obtained by multiplying the centered data set by U (the matrix of eigenvectors). The principal components also provide the coordinates of the variables on the new principal axes (Legendre12). The scatter plot of the two PCs for the six patients with two variables of Figure 1.4A is shown in Figure 1.4B. Notice that the variance of PC2 is negligible compared to the variance of PC1. Hence, if we ignore PC2 and only analyse PC1, we reduce the dimensionality from two variables to one PC without much loss of information.

Perhaps one of the most intuitive ways to think about PCA is the following: given a data set X, PCA looks for a diagonal matrix that is equivalent to the covariance matrix of X. Rearranging Expression 1.1, PCA can be expressed as

$$\mathbf{R}\mathbf{u}_k = \lambda_k\mathbf{u}_k. \tag{1.2}$$
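The worked example above can be reproduced in a few lines of base R (a sketch added for illustration; the object names are ours, not the thesis code):

# Six patients, two gene expression variables (the matrix X from the text)
X <- matrix(c(2, 1,
              3, 2,
              0, 0.5,
              1.5, 2,
              4, 3.5,
              4.2, 3), ncol = 2, byrow = TRUE)

R  <- cov(X)    # the covariance matrix: 2.575, 1.64 / 1.64, 1.3
ev <- eigen(R)  # eigenvalues 3.6970472 and 0.1779528, eigenvectors U

# Principal components: the centered data projected onto the eigenvectors
PCs <- scale(X, center = TRUE, scale = FALSE) %*% ev$vectors

# prcomp() performs the same decomposition; its squared sdev are the eigenvalues
all.equal(prcomp(X)$sdev^2, ev$values)  # TRUE (eigenvector signs may differ)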

PCA is well understood and many software implementations of it are readily available. Some extended versions of it can be very fast even with millions of genetic markers, but PCA suffers from the disadvantage of using only one source of data (Zou06, Ma11). Thus, when it is applied for multi-set genetic data analysis, after observing a marker with a significant association in one data set, mining other omics data and knowledge bases remains a manual job. Since PCA is an analytic tool that is restricted to a single data set, it is less suitable for genetic and genomic research where the goal is to describe (disease) phenotypic variations with multiple omics data sets from various biological levels.

1.4.2 Canonical Correlation Analysis (CCA)

Canonical Correlation Analysis (CCA) is a symmetrical statistical method for analysing multiple data sets. It is also a dimension reduction method: it provides latent variables called canonical components, which can be interpreted similarly to the principal components from PCA. The characteristic equation of CCA for the case of two data sets (X and Y) is

$$(\mathbf{R}_{XX}^{-1}\,\mathbf{R}_{XY}\,\mathbf{R}_{YY}^{-1}\,\mathbf{R}'_{XY} - \lambda_k\mathbf{I})\,\mathbf{u}_k = 0, \tag{1.3}$$

where $\mathbf{R}_{XX}^{-1}$ is the inverse of the covariance matrix of the data set X, $\mathbf{R}_{YY}^{-1}$ is the inverse of the covariance matrix of the data set Y, $\mathbf{R}_{XY}$ is the covariance matrix of the data sets X and Y, $\mathbf{R}'_{XY}$ is its transpose, $\lambda_k$ is the $k$-th eigenvalue, $\mathbf{I}$ is the identity matrix and $\mathbf{u}_k$ is the $k$-th eigenvector. (Legendre12) discusses the mathematics behind CCA in detail.

Expression 1.3 perhaps does not, at first sight, reveal the working of CCA as intuitively as Expressions 1.1 and 1.2 do for PCA; nevertheless, the working of CCA can be explained by its name. In mathematics, canonical form means the simplest representation of a mathematical object (e.g. a set, a relation, a function) that still preserves the characteristics of the object (without compromising the object's generality) (Legendre12). Correlation here denotes that the analysis is constrained by the data sets involved (i.e. Y constrains X and X constrains Y, too). For the sake of simplicity, one can think of CCA as a multi-set version of PCA, where one seeks linear combinations (canonical variates) of the data sets involved that have the highest possible correlation with each other. A more formal definition of CCA is given in Section 2.2.1.1.
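As a small illustration (not the thesis implementation), generic CCA is available in base R through cancor(). Note that it requires more samples than variables, which is exactly why penalized variants are needed for omics data:

# Toy data: one shared "pathway" between X and Y
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 5), n, 5)
Y <- matrix(rnorm(n * 3), n, 3)
Y[, 1] <- Y[, 1] + X[, 2]

cc <- cancor(X, Y)
cc$cor                                            # canonical correlations
xi  <- scale(X, scale = FALSE) %*% cc$xcoef[, 1]  # first canonical variate of X
eta <- scale(Y, scale = FALSE) %*% cc$ycoef[, 1]  # first canonical variate of Y
cor(xi, eta)                                      # equals cc$cor[1]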

CCA, in its generic form, cannot deal with the usual high dimensionality of genetic and genomic data sets. (Waaijenborg08) developed a method to analyse multiple omics data sets. They analysed the association between disease subtypes, genetic markers and cancer-tissue gene-expression of patients with glioblastoma, and looked at regulation of specific pathways involved in glioblastoma (Waaijenborg09b, Waaijenborg09a). Their method, nonparametric Penalized Canonical Correlation Analysis (pCCA), is a multi-omic extension of CCA and yields sparse correlated canonical components from the various omics data sets. Penalization means that the high dimensionality of the data is reduced by applying some form of constraint on the variables of the data sets during the analysis, and sparsity refers to the sparse elements in the results (i.e. omission of some variables from the canonical components by assigning them zero regression weights, Section 2.2.2.2). For example, one can apply the Least Absolute Shrinkage and Selection Operator (LASSO) during CCA in order to set some of the regression weights to zero. This in turn leads to sparse results where some variables are omitted due to their zero regression weights (Waaijenborg08).

1.4.3 Motivation for a novel statistical tool for big genetic and genomic data

Following (Waaijenborg08), others have developed similar penalized CCA methods with different sparsity schemes (Witten09b, Lê Cao09a). pCCA is also being used at the department of Clinical Epidemiology, Biostatistics and Bioinformatics at the AMC, where, for example, a set of 13,000,000 imputed genetic markers, 1,000 proteomic markers, and 100 disease phenotypes are analysed. The primary analysis took about 2.5 days, which the department found too long; it would like to have a tool available with reduced computational time compared to pCCA. Also, the pCCA tool cannot account for the directionality in the omics sets due to its symmetric nature, which can eventually hamper the interpretability of the results (Figure 2.7).

During my SRP, I developed an alternative to pCCA based on the asymmetric relations of omics data sets. pCCA assumes symmetry of the relations between for instance genetic markers and gene expression, allowing genetic markers to influence gene expression but also allowing gene expression to influence genetic markers. The latter is presumably not possible since the central dogma of molecular biology states that genetic data influences gene expression which then influences protein expression. Biological mechanisms are probably more complex but in many cases asymmetric association can be assumed and I used this asymmetry to develop sparse or penalized Redundancy Analysis (sRDA). Similar to pCCA, I used a penalized version of redundancy analysis to derive sparse components reflecting pathogenic or molecular pathways. sRDA overcomes the problem of multicollinearity of high dimensional omics data sets and improves the interpretability of the results. Additionally, I implemented the sRDA algorithm as an R-tool that can distribute some of its computational tasks to multiple processing units at once (i.e. parallel computing) in order to decrease computational time. The following section describes the main principles of RDA.

1.4.4 Redundancy Analysis (RDA)

Redundancy Analysis (RDA) is yet another dimension reduction method. Suppose we have two data sets, X and Y. RDA can be seen as a special case of PCA where the subject of PCA is the matrix of fitted values we obtain through regressing Y on X:

$$\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b} = \mathbf{X}[\mathbf{X}'\mathbf{X}]^{-1}\mathbf{X}'\mathbf{Y}. \tag{1.4}$$

Similarly to PCA and CCA, we obtain some sort of principal or canonical components, which are also called latent variables. The different annotation denotes somewhat different concepts. The latent variables of RDA are also linear combinations of one data set (the explanatory one), but the motivation behind RDA is to explain the variance in the response data set Y with the help of the linear combinations of the explanatory set X. More technically, the criterion of RDA is to maximize the correlation between a linear combination of the explanatory variables X and the response variables Y (Section 2.2.1.2). RDA can also be seen as an extended multiple linear regression (regressing multiple response variables on multiple explanatory variables); thus the response variables are considered to be dependent on the explanatory variables (which defines the directionality of RDA). Essentially this is the property that distinguishes RDA from CCA and makes RDA an attractive statistical method for multiple omics set data analysis where directionality between the data sets can be presumed (Figure 2.1).

The characteristic equation of RDA is

$$(\mathbf{R}_{XX}^{-1}\,\mathbf{R}_{XY}\,\mathbf{R}'_{XY} - \lambda_k\mathbf{I})\,\mathbf{u}_k = 0, \tag{1.5}$$

where $\mathbf{R}_{XX}^{-1}$ is the inverse of the covariance matrix of the data set X, $\mathbf{R}_{XY}$ is the covariance matrix of the data sets X and Y, $\mathbf{R}'_{XY}$ is its transpose, $\lambda_k$ is the $k$-th eigenvalue, $\mathbf{I}$ is the identity matrix and $\mathbf{u}_k$ is the $k$-th eigenvector.

Expression 1.5 shows that RDA is a special case of PCA: if we apply RDA on two data sets where the response and explanatory matrices are identical, the results will be equal to those of PCA. (Legendre12) can be consulted for the detailed mathematics behind RDA.
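The relation between Expressions 1.4 and 1.5 can be checked numerically with a short base-R sketch (toy data with n > p; the object names are our illustrative assumptions):

set.seed(1)
n <- 50
X <- scale(matrix(rnorm(n * 3), n, 3), scale = FALSE)
Y <- scale(matrix(rnorm(n * 4) + X[, 1], n, 4), scale = FALSE)

# Expression 1.4: fitted values of the multivariate regression of Y on X
Yhat <- X %*% solve(t(X) %*% X, t(X) %*% Y)

# PCA of the fitted values gives the redundancy axes; their variances match
# the eigenvalues of R_XX^{-1} R_XY R'_XY from Expression 1.5
pca_fit <- prcomp(Yhat, center = FALSE)
ev <- eigen(solve(cov(X)) %*% cov(X, Y) %*% t(cov(X, Y)))
round(pca_fit$sdev[1:3]^2, 8)
round(Re(ev$values), 8)       # the same three values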

1.5 Research questions

My SRP concerns the development and implementation of new mathematical-statistical methodology. The methods and approaches are mainly deductive and not those of a classical experimental study where a hypothesis is tested with new data. Nevertheless, I have two general hypotheses, namely that (1) my software implementation of RDA, Sparse Redundancy Analysis (sRDA), is reliable in identifying correlated components in high dimensional omics data sets, and that (2) sRDA is computationally faster than pCCA. These questions are answered by applying sRDA on simulated omics data of similar size to those occurring in ongoing omics studies: I simulated data from large samples (n=10,000) and small samples (n=100) of patients. Simulations were repeated multiple times and the performance of sRDA and pCCA was quantified by their sensitivity and specificity in identifying the simulated pathways. Also, chromosome-wide genetic and methylation data of 450,000 markers were analysed with both sRDA and pCCA. Both the simulations and the real data analysis are described in the Results section of Chapter 2.

Chapter 2 also describes in more depth the mathematical-statistical methodology I developed and implemented during my SRP, and Chapter 3 discusses the strengths and limitations of the study in more detail.


Chapter 2

Bioinformatics paper


Bioinformatics

doi.10.1093/bioinformatics/xxxxxx Advance Access Publication Date:

Original Paper

Genome Analysis

Sparse Redundancy Analysis of High Dimensional Genetic and Genomic Data

Attila Csala¹*, Frans P.J.M. Voorbraak², Aeilko H. Zwinderman¹ and Michel H. Hof¹

¹Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, Amsterdam, 1105 AZ, The Netherlands and ²Department of Medical Informatics, Academic Medical Center, Amsterdam, 1105 AZ, The Netherlands.

*To whom correspondence should be addressed.

Received on ; revised on ; accepted on

Abstract

Motivation

Recently there has been much interest in associating multiple phenotypes with multiple genetic and/or genomic data sets. Current implementations of multivariate statistical methods either cannot deal with the high dimensionality of genetic and genomic data sets or perform sub-optimally during analysis. A technique that has been developed to analyse such data is Canonical Correlation Analysis (CCA), which optimizes the multivariate correlation coefficients of the various data sets involved. A disadvantage of this method is that it does not make a distinction between the variables which need to be predicted (phenotypes) and the predictive variables, which are usually the genetic and genomic data. It is possible to make this distinction using Redundancy Analysis (RDA). However, current implementations of this technique cannot deal with the usual high dimensional nature of genetic and genomic data.

Results

We propose Sparse Redundancy Analysis (sRDA) for high dimensional genetic and genomic data analysis. We built a software implementation of our algorithm and conducted simulation studies on a computer cluster to assess the reliability of sRDA. Both the analysis of simulated data and the analysis of 500K methylation markers and 20K gene-expression values measured in a set of 55 patients with Marfan syndrome show that sRDA is able to deal with the usual high dimensionality of genetic and genomic data and performs better than CCA.

Availability

Our algorithm is implemented in R and available upon request.

Contact

mail@csala.me

Supplementary information


2.1 Introduction

In the past, genetic and genomic data have been associated with (disease) phenotypes on a variable-by-variable basis. Recently, however, there has been much interest in associating multiple phenotypes with multiple genetic and/or genomic data sets. This area is referred to as "integromics", which aims at relating clinical and several molecular data sources all at once in an effort to find shared pathways that are involved in the phenotypes of interest.

One of the techniques that has been developed to analyse genetic and genomic big data is based on Canonical Correlation Analysis (CCA) (Waaijenborg08, Witten09a). CCA is a symmetrical analytic method that searches for linear combinations of the explanatory and response variables such that these linear combinations are maximally correlated.

A common research direction in systems medicine/biology is to explain the variability in the phenotypes of an organism from various molecular data sources. The central dogma of molecular biology states that there is a sequential transfer of information within a biological organism (Crick70). Thus DNA transcription variability can be explained from DNA and its regulation, and RNA translation variability can be further explained from RNA. CCA cannot account for this sort of directionality of the clinical and molecular data sources, since CCA is symmetric and does not make a distinction between the variables which need to be predicted and the predictive variables. As a consequence, applying CCA to explain the variability in the phenotypes from other clinical and molecular data sources has two disadvantages.

Firstly, the symmetric process introduces unnecessary computational costs and leads to sub-optimal computational time. Secondly, since the result is described in terms of highly correlated pairs of linear combinations of variables from the two respective data sets involved, the interpretability of the results is hampered (vandenWollenberg77,Takane07).

In this paper we will show that it is possible to improve both computation time and interpretability by using a well-known non-symmetric multivariate technique called Redundancy Analysis (RDA) (Stewart68, vandenWollenberg77, Fornell88, Johansson81).

Firstly, with RDA, it is possible to make a distinction between predicted and predictive variables. By taking advantage of this directionality, computational time can be reduced. Secondly, by optimizing the multivariate correlation coefficients between a linear combination of the predictive variables and the variables in the predicted data set, RDA improves the interpretability of the results compared to CCA (Takane07,Lambert88).

Unlike CCA, which has been extended to cope with high dimensional data sets (Huopaniemi10, Witten09a, Waaijenborg08), RDA has received little attention in the field of integromics. Although (Takane07) introduced a theoretical framework for regularized redundancy analysis in order to deal with the problem of multicollinearity among explanatory variables, current implementations of this technique cannot deal with the usual high dimensional nature of genetic and genomic data (the so-called p ≫ n problem).

The aim of this paper is to make an RDA implementation that can deal with high dimensional integromics data sets. Our motivation is to improve the interpretability of results and reduce compu-tational time compared to sparse CCA in research that aims to explain the variability in phenotypes from other clinical and molecular data source(s). In general, we would like to provide a descriptive statistical method for big genetic and genomic data analysis that delivers easily interpretable results to researchers both with and without background in multivariate statistics.


2.2 Methods

2.2.1 Multivariate techniques for integromics data analysis: CCA and RDA

2.2.1.1 CCA

Consider we have data from n individuals spread over two matrices, X and Y, where X is an n × p dimensional matrix containing p explanatory (predictive) variables, and Y is an n × q dimensional matrix with q response (predicted) variables. Let ξ be the n × 1 dimensional latent variable of X and η the n × 1 dimensional latent variable of Y. The vectors α and β denote the associated weights of ξ and η, respectively (Figure 2.1).

CCA optimizes the multivariate correlation coefficients of sets of linear combinations of the various data sets involved (Figure 2.1). The sets of linear variates are the canonical components (ξ, η). The criterion of CCA is to maximize the within-pair correlations of the canonical variates (Lambert88),

$$|\operatorname{cor}(\xi, \eta)|. \tag{2.1}$$

CCA actually explains the variance between linear combinations of the predicted and explanatory variables (between ξ and η). However, a high correlation between canonical correlation pairs does not necessarily indicate a high explained variance of the included Y variables. Therefore, interpretation of CCA is difficult (Stewart68, vandenWollenberg77, Lambert88), and this can hamper the interpretation of results of studies that aim to explain the variability in the phenotypes from other clinical and molecular data sources.

Figure 2.1: Differences between RDA and CCA. CCA explains the variance between linear com-binations of the predicted and explanatory variables while RDA explains the variance between the predicted variables and the linear combination of the explanatory variables.

2.2.1.2 RDA

Redundancy Analysis (RDA) describes the amount of variance of a given set of response variables on the basis of a given set of explanatory variables.

The objective of RDA is to define a linear combination of X and α (denoted as ξ) that yields the maximum sum of explained variance between ξ and the Y variables (Israëls92), or, in other words, to maximize the correlation between Y and ξ, i.e.

$$\sum_{k=1}^{q} \operatorname{cor}(\xi, Y_k)^2. \tag{2.2}$$


2.2.2 RDA for genetic and genomic big data analysis

Current implementations of RDA cannot deal with the usual small sample size and high dimensionality (p ≫ n) of genetic and genomic data. To overcome this shortcoming of RDA, we propose Sparse Redundancy Analysis (sRDA). We first describe how RDA can be accomplished through a simple Partial Least Squares (PLS) iterative algorithm, then describe how to extend the algorithm with penalization techniques to be able to analyse data sets where p ≫ n.

2.2.2.1 RDA in the Partial Least Squares (PLS) framework

(Fornell88) showed that RDA can be accomplished in a regression framework via a simple Partial Least Squares (PLS) iterative algorithm. PLS is used to obtain the covariance matrix of X and Y through the use of latent variables (ξ and η) (Wegelin00), and was first developed by Wold (Wold75). Our implementation of RDA is based on Fornell's algorithm (see Algorithm 1 for the pseudo-code). In the following, we describe how this algorithm maximizes the criterion for RDA shown in Expression 2.2.

The objective of RDA is to maximize the correlation between Y and ξ (Expression 2.2). In a regression framework, this can be achieved by first calculating the latent variable η with arbitrary β(0) weights (Operation 6 of Algorithm 1). ξ is calculated with the α weights that are obtained from the multivariable regression of η on X (Operations 7 and 8), and in turn η is re-estimated with the updated β(1) weights calculated by the univariable regression of Y on ξ (Operations 9 and 10). This process is repeated until convergence, where convergence is reached if the summed squared differences of β(1), β(0) and α(1), α(0) are smaller than θ, a predefined small positive tolerance (Operation 11). The resulting latent variables form the first canonical component. The α weights can be interpreted as an indicator of the contribution of the X variables to ξ, and the β values indicate the correlation between the particular Y variables and ξ, the linear combination of the X variables.

Note that this PLS algorithm can easily be turned into a generic CCA algorithm by replacing Operation 9 with a multivariable regression (i.e. regressing ξ multivariately on Y to update β) (Waaijenborg08).

Algorithm 1 Pseudo-code for RDA

1: procedure RDA(X, Y)
2:   Center and scale X and Y
3:   Set α(0) and β(0) to arbitrary vectors [1, 1, ..., 1]' with length p and q, respectively
4:   CRT = 1
5:   while CRT ≥ θ (some small positive tolerance) do
6:     η = Yβ(0)
7:     α(1) = [X'X]⁻¹[X'η]            ▷ update α by regressing η on X
8:     ξ = Xα(1)
9:     β(1)' = [ξ'ξ]⁻¹[ξ'Y]           ▷ update β by regressing Y on ξ
10:    η = Yβ(1)
11:    CRT = Σ(β(1) − β(0))² + Σ(α(1) − α(0))²
12:    α(0) = α(1) and β(0) = β(1)
13:  end while
14:  Return α(0), β(0)
15: end procedure

We can calculate multiple latent variables (ξ) by repeating our RDA analysis using the residuals of the X data set and the original Y data set as

$$\mathbf{X}^{\mathrm{res}}_p = \mathbf{X}_p - \mathbf{X}^{\mathrm{predicted}}_p = \mathbf{X}_p - \xi[\xi'\xi]^{-1}[\xi'\mathbf{X}_p]. \tag{2.3}$$
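In R, this deflation step might be written as follows (a sketch building on the hypothetical rda_pls above):

# Remove from X the part explained by xi, then rerun the analysis on the
# residuals to obtain the next latent variable
deflate <- function(X, xi) X - outer(xi, drop(xi %*% X)) / sum(xi^2)

# fit1 <- rda_pls(X, Y)
# fit2 <- rda_pls(deflate(scale(X), fit1$xi), Y)   # second latent variable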

In general, a maximum number of p latent variables can be obtained with this method. If p < n, a direct analytical solution exists to derive all p latent variables (Johansson81, Fornell88), but we consider the more common situation with integromics that p ≫ n. A criterion for the maximum number of latent variables that we wish to obtain through the analysis might be based on the sum of squared differences of the correlations between the last two latent variables and the response data set, i.e.

$$\sum_{k=1}^{q} \operatorname{cor}(\xi_{i-1}, Y_k)^2 - \operatorname{cor}(\xi_i, Y_k)^2, \tag{2.4}$$

where i denotes the i-th latent variable and q is the number of columns in the response data set. If the sum of squared differences between the explained correlations (Expression 2.4) is smaller than a small predefined limit, we may stop the procedure of obtaining further latent variables. Since the latent variables are orthogonal to each other, each of them explains a different portion of variance in Y. By definition, the first latent variable (ξ₁) has the highest absolute sum of squared correlations with Y (Ma11), and each following latent variable (the (i+1)-th) explains a smaller portion of variance in Y,

$$\sum_{k=1}^{q} \operatorname{cor}(\xi_i, Y_k)^2 > \sum_{k=1}^{q} \operatorname{cor}(\xi_{i+1}, Y_k)^2. \tag{2.5}$$

Figure 2.2: Graphical representation of how the penalization affects the regression weights. The upper plot shows the resulting α weights of ξ from RDA and the bottom plot shows the resulting penalized α weights of ξ from sRDA, for 300 variables.


2.2.2.2 Sparse Redundancy Analysis (sRDA)

The algorithm in this form (Algorithm 1) does not work with high dimensional (p ≫ n) data sets: Step 7 of the algorithm involves an ordinary multivariable regression, which requires n > p. We applied the Elastic Net penalization technique in order to extend the scope of RDA to p ≫ n data sets. Elastic Net combines the Ridge and LASSO penalization techniques (Zou05, Waldmann13). Ridge penalization shrinks the weights (α) towards zero by penalizing the diagonal of the correlation matrix (by the coefficient λ2), while LASSO forces some weights to be zero by penalizing their size (by the coefficient λ1); LASSO therefore improves the interpretability of the results (Tibshirani96).

Since in Elastic Net the Ridge and the LASSO penalties are applied simultaneously, double shrinkage is introduced, which leads to biased results (Zou05). To overcome this bias, (Zou05) proposed a corrected version of Elastic Net (where 1 + λ2 is a scaling factor to prevent double shrinkage),

$$\hat{\alpha}^{(1)} = \underset{\alpha}{\operatorname{argmin}}\;\; \alpha'\left(\frac{\mathbf{X}'\mathbf{X} + \lambda_2\mathbf{I}}{1 + \lambda_2}\right)\alpha - 2\eta'\mathbf{X}\alpha + \lambda_1\lVert\alpha\rVert_1. \tag{2.6}$$

Elastic Net forces some regression weights to be zero (with λ1) and simultaneously shrinks the weights towards zero (with λ2) (Figure 2.2). Note that LASSO, Ridge and Univariate Soft Thresholding (UST) are special cases of the Elastic Net: setting λ2 = 0 yields LASSO, λ1 = 0 yields Ridge and λ2 = ∞ yields UST regression (Zou05, Waldmann13).
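The zeroing effect of the λ1 penalty can be illustrated with the soft-thresholding operator (a toy snippet of ours, not the thesis code):

soft_threshold <- function(b, lambda1) sign(b) * pmax(abs(b) - lambda1, 0)

b <- c(-0.8, -0.05, 0.02, 0.4, 1.2)
soft_threshold(b, lambda1 = 0.1)  # -0.7  0.0  0.0  0.3  1.1: small weights vanish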

In the following we will discuss how to determine the optimal Ridge and LASSO penalty parameters for sRDA.

Figure 2.3: Plot of the mean of absolute correlations from a 10-fold cross-validation simulation with a grid of LASSO (λ1) values of 25, 50, 100, and 200, and Ridge (λ2) values of 1, 10, 100 and 1000.


2.2.2.3 Selection of penalty parameters

The optimal Ridge and LASSO penalty parameters are determined through k-fold cross-validation. The data sets X and Y are divided into k subgroups, where one of these subgroups is taken as the test set while the remaining k−1 subgroups are grouped into a training set.

The α weights are estimated with the training set and are used to calculate the latent variables in the test set. The sum of the absolute correlations is then calculated between the just obtained ξ and the Y of the test set. Repeating this k times ensures that each subgroup of the original data sets has been both training and test set in the cross-validation process. The sums of the absolute correlations are accumulated through the k iterations and are used as an indicator for optimal parameter selection: the optimal LASSO and Ridge parameters are defined as the combination of these parameters that provides the maximum sum of absolute correlations through the k-fold cross-validations.

The cross-validation can be done with r Ridge and l LASSO parameters, respectively, that vary over a grid with a predefined set of values (Figure 2.3). This results in l × r × k iterations for determining the optimal values for the Ridge and LASSO parameters, which leads to a computationally expensive procedure. However, the computational time can be reduced substantially with parallel computing. A sketch of this procedure is given below.
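The cross-validation loop might be sketched in R as follows; for brevity the inner fit uses the Univariate Soft Thresholding special case (λ2 = ∞) mentioned above, and cv_penalty() and ust_weights() are illustrative names rather than the thesis API:

ust_weights <- function(X, eta, lasso) {
  b <- crossprod(X, eta)              # univariate covariances X'eta
  sign(b) * pmax(abs(b) - lasso, 0)   # soft thresholding (UST)
}

cv_penalty <- function(X, Y, lasso_grid, k = 10) {
  folds <- sample(rep(seq_len(k), length.out = nrow(X)))
  scores <- sapply(lasso_grid, function(lam) {
    mean(sapply(seq_len(k), function(f) {
      tr <- folds != f                         # training folds
      eta   <- rowSums(scale(Y[tr, ]))         # initial latent variable (beta = 1)
      alpha <- ust_weights(scale(X[tr, ]), eta, lam)
      if (all(alpha == 0)) return(0)
      xi <- scale(X[!tr, ]) %*% alpha          # latent variable in the test set
      sum(abs(cor(xi, Y[!tr, ])))              # sum of absolute correlations
    }))
  })
  lasso_grid[which.max(scores)]                # parameter with the best CV score
}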

2.2.3 Analysis of multiple data sets with RDA

Our sRDA algorithm is easily extendable to analyse multiple data sets. Suppose there are three data sets X, Y and Z. Assuming that both X and Z have a causal effect on Y, we can combine the contributions of X's and Z's latent variables (ξ1 and ξ2, respectively) into a single latent variable (ξ, Figure 2.4). This combined latent variable (ξ) is then used to obtain the regression weights (β) of the response variables (Y).

Figure 2.4: Directionality of RDA in multiple data sets

The algorithm is described in detail in pseudo-code in Algorithm 2. To maximize the correlation between Y and ξ (Expression 2.2), first the latent variable η is calculated with arbitrary β(0) weights (Operation 6 in Algorithm 2). ξ1 is calculated with the α1(1) weights that are obtained from the multivariable regression of η on X, and ξ2 is calculated with the α2(1) weights obtained from the multivariable regression of η on Z (Operations 7, 8, 9 and 10). The latent variables ξ1 and ξ2 are combined into a single latent variable ξ (Operation 11). The latent variable of the response variables, η, is re-estimated with the updated β(1) weights obtained by the univariable regression of Y on ξ (Operations 12 and 13). This process is repeated until convergence, where convergence is reached if the summed squared differences of β(1), β(0), α1(1), α1(0), α2(1) and α2(0) are smaller than θ, a predefined small positive tolerance (Operation 14).

Algorithm 2 Pseudo-code for multiple data sets RDA

1: procedure multipleSetRDA(X, Z, Y)
2:   Center and scale X, Z and Y
3:   Set α1(0), α2(0) and β(0) to arbitrary vectors [1, 1, ..., 1]' with length p, z, and q, respectively
4:   CRT = 1
5:   while CRT ≥ θ (some small positive tolerance) do
6:     η = Yβ(0)
7:     α1(1) = [X'X]⁻¹[X'η]           ▷ update α1 by regressing η on X
8:     ξ1 = Xα1(1)
9:     α2(1) = [Z'Z]⁻¹[Z'η]           ▷ update α2 by regressing η on Z
10:    ξ2 = Zα2(1)
11:    ξ = ξ1 + ξ2
12:    β(1)' = [ξ'ξ]⁻¹[ξ'Y]           ▷ update β by regressing Y on ξ
13:    η = Yβ(1)
14:    CRT = Σ(β(1) − β(0))² + Σ(α1(1) − α1(0))² + Σ(α2(1) − α2(0))²
15:    α1(0) = α1(1), α2(0) = α2(1) and β(0) = β(1)
16:  end while
17:  Return α1(0), α2(0), β(0)
18: end procedure
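A sketch of Algorithm 2 in R, extending the hypothetical rda_pls shown for Algorithm 1 (again assuming n exceeds the number of columns of both X and Z):

rda_pls_multi <- function(X, Z, Y, theta = 1e-8, max_iter = 1000) {
  X <- scale(X); Z <- scale(Z); Y <- scale(Y)          # step 2
  a1 <- rep(1, ncol(X)); a2 <- rep(1, ncol(Z)); b <- rep(1, ncol(Y))  # step 3
  for (iter in seq_len(max_iter)) {
    eta <- Y %*% b                                     # step 6
    a1n <- drop(solve(t(X) %*% X, t(X) %*% eta))       # step 7: regress eta on X
    a2n <- drop(solve(t(Z) %*% Z, t(Z) %*% eta))       # step 9: regress eta on Z
    xi  <- drop(X %*% a1n + Z %*% a2n)                 # step 11: combined xi
    bn  <- drop(xi %*% Y) / sum(xi^2)                  # step 12: regress Y on xi
    CRT <- sum((bn - b)^2) + sum((a1n - a1)^2) + sum((a2n - a2)^2)  # step 14
    a1 <- a1n; a2 <- a2n; b <- bn                      # step 15
    if (CRT < theta) break
  }
  list(alpha1 = a1, alpha2 = a2, beta = b, xi = xi)
}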


2.3 Results

We tested our method on both simulated and real omics data. In this section we describe first the simulation settings and results, then the results of the omics data analysis.

2.3.1 Simulation studies

We conducted two simulation studies to assess the precision of sRDA. In the first simulation study, we assessed sRDA's ability to select variables from the explanatory data set (X) that are associated (highly correlated) with the response data set Y. In the second simulation study we show the capability of sRDA to identify multiple latent variables that are associated with variables of the explanatory data set. The simulations were repeated 1000 times. We applied Elastic Net penalization using 10-fold cross-validation to find the optimal penalty parameters.

2.3.1.1 Single latent variable

We considered three scenarios to assess the ability of the sRDA algorithm to assign non-zero regression weights to the associated variables of the explanatory data set. We generated two data sets (X and Y). For the explanatory data set X, we generated 1000 "noise" variables that are not associated with the latent variable, with weights sampled from the uniform distribution between −0.4 and 0.4 (α_noise ∼ Unif(−0.4, 0.4)). Additionally, we added 10 variables that are associated with the latent variable (their number denoted by w below), with regression weights α_assocd sampled from the normal distribution with mean µ = 0.9 and standard deviation σ = 0.05 (α_assocd ∼ N(0.9, 0.05)). These associated variables were placed at the first 10 positions among all variables. The response data set was created similarly.
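As an illustration, the data generation for one such scenario might look as follows in R. The generative model shown (each variable equals its weight times a common latent variable plus residual noise, with an illustrative residual scale) is our reading of the setup above, not code taken from the actual simulation scripts.

    set.seed(1)
    n <- 100; p_noise <- 1000; w <- 10
    xi <- rnorm(n)                                         # common latent variable
    alpha_assocd <- rnorm(w, mean = 0.9, sd = 0.05)        # weights of associated variables
    alpha_noise  <- runif(p_noise, min = -0.4, max = 0.4)  # weights of noise variables
    gen_col <- function(a) a * xi + rnorm(n, sd = 0.4)     # illustrative residual scale
    X <- cbind(sapply(alpha_assocd, gen_col),              # associated variables first...
               sapply(alpha_noise,  gen_col))              # ...followed by the noise variables
    Y <- cbind(sapply(rnorm(w, 0.9, 0.05), gen_col),       # response set created similarly
               sapply(runif(p_noise, -0.4, 0.4), gen_col))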

The three scenarios differed in the number of rows in the data sets. In the first scenario, both X and Y contained 100 rows (n = 100, small sample size). The second scenario was designed similarly, but the number of rows in both data sets was increased from 100 to 250 (n = 250, medium sample size). In the third scenario, the number of rows was increased to 10,000 in both data sets (n = 10,000, large sample size).

We assessed the ability of the sRDA algorithm to identify the associated variables by measuring sensitivity and specificity. Since the w associated variables occupy the first w positions among all p variables in the explanatory data set, we defined the measures as follows. Sensitivity (Sens1) is the number of variables at the first w positions that were assigned non-zero regression weights, divided by w. We defined a second sensitivity measure (Sens2) as the number of the w variables with the highest absolute weights that lie at the first w positions, divided by w; this measure is not affected by the LASSO parameter, since it only checks whether the first w variables are among those with the highest weights of all p variables. Specificity (Spec) is the number of variables at the last p − w positions that were assigned zero regression weights, divided by p − w.
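In code, with a fitted weight vector alpha of length p (associated variables first), these three measures can be computed as follows (a sketch; the variable names are ours):

    sens1 <- sum(alpha[1:w] != 0) / w                   # non-zero weights at the first w positions
    top_w <- order(abs(alpha), decreasing = TRUE)[1:w]  # indices of the w largest |weights|
    sens2 <- sum(top_w <= w) / w                        # how many of those lie in the first w positions
    spec  <- sum(alpha[(w + 1):p] == 0) / (p - w)       # zero weights among the last p - w positions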

The results can be found in Table 2.1. Both Sens1 and Sens2 resulted in high values: across the simulation scenarios, Sens1 ranged between 0.70 and 0.83 and Sens2 between 0.86 and 0.99, and both measures increased with the sample size n. Spec was very high, around 0.98, and was hardly affected by the change in n over the three scenarios.

                        Sens1   Sens2   Spec
    Small sample size   0.701   0.863   0.983
    Medium sample size  0.714   0.948   0.984
    Large sample size   0.839   0.992   0.986

Table 2.1: Simulation study results with a single latent variable

2.3.1.2 Multiple latent variables

In the second part of the simulation study, our objective was to assess the ability of sRDA to identify multiple latent variables that are associated with the explanatory and response data sets, by assigning non-zero regression weights to the associated variables.

2.3.1.3 Scenario of 3 latent variables

Similarly to the first part of the simulation, we generated two data sets. For the explanatory data set, we generated 1000 variables that are not associated with the latent variables (α_noise ∼ Unif(−0.3, 0.3)). Additionally, we added m associated variables in total, distributed over 3 groups (g = 3), with α_assocd coefficients sampled from normal distributions with mean µ and standard deviation σ: m1 = 10 with α_assocd ∼ N(0.8, 0.05), m2 = 5 with α_assocd ∼ N(0.7, 0.05), and m3 = 5 with α_assocd ∼ N(0.6, 0.05).

As in the first simulation, these associated variables were placed at the first m positions among all variables. The response data set was created similarly and both data sets had 1000 rows.

2.3.1.3.1 Scenario of 5 latent variables

We created a similar scenario with 1000 variables that are not associated with the latent variables of the explanatory set (α_noise ∼ Unif(−0.3, 0.3)), and added five groups of associated variables (g = 5), m variables in total: m1 = 5 with α_assocd ∼ N(0.5, 0.05), m2 = 5 with α_assocd ∼ N(0.7, 0.05), m3 = 10 with α_assocd ∼ N(0.8, 0.05), m4 = 10 with α_assocd ∼ N(0.6, 0.05), and m5 = 10 with α_assocd ∼ N(0.9, 0.05).

2.3.1.3.2 Scenario of 5 latent variables - Univariate Soft Thresholding

We created a third scenario, identical to the one above. Since Univariate Soft Thresholding (UST) has been shown to be much faster than Elastic Net when the number of variables is large (Zou05), in this third scenario we applied UST penalization instead of Elastic Net; a sketch of the UST update follows.
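The sketch below illustrates the UST update for the α weights: with the Ridge parameter sent to infinity, the penalized regression step reduces to soft-thresholding the univariate covariances between the explanatory variables and η (Zou05). The unit-length rescaling shown is one common convention and an assumption on our part.

    ust_update <- function(X, eta, lambda) {
      covs  <- drop(crossprod(X, eta))                   # univariate covariances x_j' eta
      alpha <- sign(covs) * pmax(abs(covs) - lambda, 0)  # soft-threshold at lambda
      if (all(alpha == 0)) return(alpha)                 # lambda too large: nothing selected
      alpha / sqrt(sum(alpha^2))                         # rescale to unit length (a convention)
    }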

To assess sRDA's precision in latent variable detection, we defined the following sensitivity measure: the number of weight vectors among the first g returned regression weight vectors (the weight vectors of the latent variables, where g denotes the total number of associated variable groups) that had non-zero regression weights among their first m variables (where m denotes the total number of associated variables over all groups), divided by g.
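As a sketch, with the first g returned weight vectors collected column-wise in a p × g matrix A (our notation) and the m associated variables placed first, this measure reads:

    # a weight vector counts if it has any non-zero weight among the first m variables
    sens_lv <- sum(apply(A[1:m, , drop = FALSE] != 0, 2, any)) / g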

We ran each simulation 1000 times, with fixed values for the Ridge and LASSO penalties, 1 and 10, respectively (except for the UST scenario, where the Ridge parameter is set to ∞ by construction). The sensitivity scores presented are the averages over the 1000 simulations (see Table 2.2 for the results). We obtained very high sensitivity scores for both the 3 and the 5 latent variables scenario, 0.996 and 0.987, respectively. Changing the penalization technique from Elastic Net to UST lowered the sensitivity from 0.987 to 0.917.

In terms of computational time, one run of the 5 latent variables scenario took on average 55.22 seconds with Elastic Net, while the same settings took 11.65 seconds with UST.

                         Penalization method            Sensitivity
    3 latent variables   Elastic Net                    0.996
    5 latent variables   Elastic Net                    0.987
    5 latent variables   Univariate Soft Thresholding   0.917

Table 2.2: Simulation study results with multiple latent variables


2.3.2 Analysis of methylation markers and gene expression data sets

In our real data analysis, our aim was to explain the variability in genome-wide gene expression by genome-wide methylation data (i.e. methylation markers). Both data sets included data from the same 55 patients with Marfan syndrome, who participated in the Dutch Compare trial (Groenink13). The DNA methylation data from blood leukocytes were measured with the Illumina 450K technology, and the gene expression data were derived from skin biopsies using Affymetrix Human Exon 1.0 ST Arrays (Radonic12). The explanatory data set (methylation markers) and the response data set (gene expression values) contained 485,512 and 18,424 variables, respectively.

We first analysed these data sets with the Elastic Net variant of sRDA (Figure 2.5). The cross-validation study was run in parallel on the Lisa cluster using 16 nodes and took about 2.5 days.

[Figure: sum of absolute correlations over the grid of Ridge and LASSO parameters]

Figure 2.5: Optimal values obtained through a 10-fold cross-validation for the Elastic Net variant of sRDA. The 10-fold cross-validation resulted in 100 for the LASSO and 1 for the Ridge penalty parameter as the optimal values

Given the size of the data, we chose UST over the Elastic Net version of sRDA, and we ran a 10-fold cross-validation study to identify the optimal LASSO parameter, which resulted in λ = 40 (Figure 2.6). The cross-validation study and the analysis with the optimal LASSO parameter took 8,059 and 189 seconds, respectively. The resulting latent variables are shown in Figure B.1 in the Appendix.


[Figure: sum of absolute correlations as a function of the LASSO penalty, with the optimum at λ = 40]

Figure 2.6: Optimal parameter selection for the LASSO parameter using UST for sRDA

    First latent variable                Second latent variable
    Gene name   Correlation coefficient  Gene name   Correlation coefficient
    ARPC3       0.8093                   BCAS3       0.5902
    C3orf23     0.8093                   CADM1       0.6039
    EXOC6       0.8008                   CDKL5       0.6273
    FAM118B     0.8155                   GPC3        0.6025
    GTPBP8      0.8178                   ING4        0.5977
    HSPH1       0.8073                   LRIG3       0.6026
    KIAA1841    0.8029                   MGEA5       0.5947
    KPNA2       0.8127                   NONO        0.5970
    RCH1        0.8048                   SARNP       0.6047
    RPS6KA3     0.8054                   SUPV3L1     0.6014

Table 2.3: The ten genes with the highest correlation coefficients for the first and second latent variables


We used an over-representation analysis tool to link the highly associated genes (the top 10 with the highest correlation coefficients) from the analysis to existing pathways. Multiple pathways were associated with the analysed genes from the first latent variable. The identified pathways included "Cellular responses to stress" (p-value 2.88×10⁻²), "Regulation of HSF1-mediated heat shock response" (p-value 9.09×10⁻³) and "Senescence-Associated Secretory Phenotype (SASP)" (p-value 7.5×10⁻²) (the red set of genes in Figure B.1).

When analysing the ten genes with the highest correlation coefficients from the second latent variable, the identified pathways included "Negative regulators of RIG-I/MDA5 signaling" (p-value 8.82×10⁻³), "Diseases associated with glycosaminoglycan metabolism" (p-value 3.37×10⁻²) and "Nectin/Necl trans heterodimerization" (p-value 1.66×10⁻²) (the blue set of genes in Figure B.1).

We analysed the same data sets using pCCA in order to compare the required computational time. We used the UST penalization method with a LASSO parameter of 10; for pCCA this means that all but 10 regression weights were set to zero in both data sets (Figure 2.7). A single pCCA analysis of the same data sets took 575 seconds, while sRDA took 267 seconds.


2.4 Discussion

We proposed Sparse Redundancy Analysis (sRDA) for high dimensional genetic and genomic data analysis as an alternative to Canonical Correlation Analysis (CCA). While CCA is a symmetric multivariate statistical technique, meaning that it makes no distinction between explanatory and predicted data sets, sRDA takes advantage of the directionality that is often presumed among omics data sets. We showed that, compared to CCA, sRDA performs better in terms of computational time and provides more easily interpretable results when applied to omics data sets.

With the simulation study, we showed that our implementation of the algorithm can identify multiple latent variables that are associated with the explanatory and response data sets, with high sensitivity and specificity. Our software implementation also supports parallel computation, so computational time can be reduced further. We implemented different penalization methods (Ridge, LASSO, UST) to further optimize the analysis procedure and showed scenarios where choosing a different penalization method leads to roughly five times faster computation without a serious decrease in the algorithm's ability to identify highly correlated variables (analysing 5 latent variables took 55.22 seconds with Elastic Net (sens = 0.987), while UST took 11.65 seconds (sens = 0.917) on the same data).

[Figure: the pCCA-selected genes (ATP6AP2, C18orf8, DNAJC1, DPM1, FAM118B, MRPL51, NDUFAF4, TMEM123, TMEM126B, UQCRB) and methylation markers, linked through the latent variables ξ and η with correlation 0.831]

Figure 2.7: Results of the pCCA analysis of the methylation marker and gene expression data sets. Since CCA searches for linear combinations of both data sets that result in the highest correlation between those linear combinations (ξ and η), and involves only multivariable regression, the regression weights are not interpretable as proportions of the total explained variance.


Through the analysis of 500,000 methylation markers and 20,000 gene expression values measured in a subset of 55 patients with Marfan syndrome, we showed that sRDA is able to deal with the usual high dimensionality of genetic and genomic data and is computationally more efficient than CCA (267 versus 575 seconds).

Although (Takane07) introduced a theoretical framework for regularized redundancy analysis to deal with multicollinearity in the explanatory variables, their framework relies on the Ridge penalization technique, which only shrinks the regression coefficients towards zero but does not set them to zero; this makes the interpretation of results difficult when there are thousands or even millions of variables. An important strength of our proposed method is its ability to set regression coefficients exactly to zero. Hereby one can specify in advance the number of variables one wishes to retain in the analysis (for example, the 10 genes can be selected from the gene expression data set that have the highest correlation coefficients with the linear combination of the corresponding methylation markers, as illustrated in Figure B.1). Interpretability of the results is also improved compared to CCA. Since sRDA uses a univariable regression step to derive the weights of the response variables (β, Operation 9 in Algorithm 1), these regression weights can be interpreted as the proportion of variance that each individual response variable contributes to the total variance in the response data set explained from the explanatory data set. Thus, in our real data scenario, we can interpret the regression weights β (i.e. the regression coefficients of the gene expression data set) as each gene's individual contribution to the total variance that the linear combination of the methylation markers explains in the gene expression data set (Table 2.3). Moreover, the sum of the correlations between Y and ξ (Expression 2.2) equals the sum of these regression weights (β), since in our case each β is a correlation coefficient; a small numeric check is sketched below. This property does not hold for CCA (Figure 2.7).
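This property is easy to verify numerically. A minimal sketch, assuming Y has been centered and scaled and ξ has been standardized to unit variance:

    xi_s <- drop(scale(xi))                            # standardize the latent variable
    beta <- drop(crossprod(Y, xi_s)) / (nrow(Y) - 1)   # univariable regression weights
    all.equal(beta, drop(cor(Y, xi_s)))                # TRUE: each beta_j equals cor(y_j, xi)
    all.equal(sum(abs(beta)), sum(abs(cor(Y, xi_s))))  # sum of |beta| = sum of |correlations|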

Systems medicine aims to explain genotype/phenotype relations in patients and mainly uses high-dimensional genomic/proteomic data to accomplish that. Our sRDA algorithm is easily extendable to analyse multiple omics data sets (Section 2.2.3). Suppose there are three omics data sets X, Y and Z: say, genetic markers (X), methylation markers (Z) and gene expression values (Y). In this case it seems natural to assume that both X and Z may have a causal effect on Y, while the relation between X and Z is unclear. There are at least two options that need to be evaluated: using sRDA while ignoring the association between X and Z, or using a combination of sRDA and CCA. Further research is needed to evaluate these options.


2.5 Conclusion

Systems medicine aims to explain genotype/phenotype relations in patients and mainly uses high-dimensional genomic/proteomic data to accomplish that. Recently there has been much interest in associating multiple phenotypes with multiple genetic and/or genomic data sets. Current implementations of multivariate statistical methods either cannot deal with the high dimensionality of genetic and genomic data sets or perform sub-optimally. We proposed Sparse Redundancy Analysis (sRDA) for high dimensional genetic and genomic data analysis. We built a software implementation of our algorithm and conducted simulation studies on SURFsara's Lisa computer cluster to assess the reliability of sRDA. The analysis of 500,000 methylation markers and 20,000 gene expression values measured in a subset of 55 patients with Marfan syndrome showed that sRDA is able to deal with the usual high dimensionality of genetic and genomic data and is computationally more efficient than CCA. Our sRDA algorithm is easily extendable to analyse multiple omics data sets. Such a generalized version of sRDA could become an optimal statistical method for multi-omics analysis in systems medicine research.
