Algorithms and analysis of human disease genomics Inouye, M.

Citation

Inouye, M. (2010, April 20). Algorithms and analysis of human disease genomics. Retrieved from https://hdl.handle.net/1887/15277

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/15277

Note: To cite this publication please use the final published version (if applicable).

6

GENERAL DISCUSSION

This thesis presents four chapters of analysis and algorithms for genome-wide association and transcriptomic studies. Chapters 2 and 3 detail a genotype calling algorithm and an analysis of whole-genome amplified DNA, respectively, while Chapters 4 and 5 contain a complete GWAS and an integrated analysis of genomic, transcriptomic, and quantitative lipid trait information.

Genotype determination is of central importance for any study that considers genotype-phenotype or population genetic relationships. Before the widespread implementation of high-throughput genotyping platforms that assess thousands of samples at hundreds of thousands of SNPs, genotypes were determined using simple heuristics or manually by eye. For large datasets this approach is impractical and carries a high potential for systematic bias, so the sophistication of genotype calling methods has increased with the quantity of data.

In the past several years, genotype calling methods have converged on mixture models. One of the first to implement this approach was Affymetrix’s Bayesian Robust Linear Model with Mahalanobis distance (BRLMM)¹. BRLMM pooled intensity information from multiple samples for each SNP while also applying methods such as quantile normalization and median polish to normalize the intensity distributions. This processing is important to control for non-biological variation, which typically arises from laboratory effects or batch processing. Another early procedure for Affymetrix data, the Markov chain Monte Carlo based CHIAMO algorithm by the Wellcome Trust Case Control Consortium, implemented a Bayesian hierarchical mixture model². Within this framework, the genotype classes (AA, AB, and BB) are given prior parameters, allowing information to be borrowed across cohorts. However, at the time these early algorithms were not portable to the Illumina genotyping platform. This left researchers little choice but to use Illumina’s proprietary calling algorithm, GenCall, and/or manually curated cluster positions in order to call genotypes for their DNA collections. The question was posed: given Illumina intensity data, how can a calling algorithm be constructed so that it performs as well as or better than the proprietary or manual methods?
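As an illustration of the normalization step mentioned above for BRLMM, the following is a minimal Python sketch of quantile normalization across samples. It is not taken from BRLMM or any Affymetrix software; the function name `quantile_normalize`, the probes-by-samples layout, and the positional tie-breaking are assumptions made for illustration.

```python
import numpy as np


def quantile_normalize(intensities):
    """Give every sample (column) the same empirical intensity distribution.

    intensities: 2-D array, rows = probes, columns = samples (assumed layout).
    Ties within a sample are broken by position, which is a simplification.
    """
    x = np.asarray(intensities, dtype=float)
    # Sort order of the probes within each sample.
    order = np.argsort(x, axis=0, kind="stable")
    # Reference distribution: mean intensity at each rank across samples.
    reference = np.sort(x, axis=0).mean(axis=1)
    normalized = np.empty_like(x)
    for j in range(x.shape[1]):
        # The k-th smallest value in sample j is replaced by reference[k].
        normalized[order[:, j], j] = reference
    return normalized
```

After this step the samples differ only in which probe holds which rank, so sample-wide shifts in intensity of the kind produced by batch or laboratory effects are removed before clustering.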

Chapter 2, A genotype calling algorithm for the Illumina BeadArray Platform, addressed the problem of individual genotype assignment by presenting a statistical method for the classification of Illumina’s fluorescence intensity clusters. The algorithm, ILLUMINUS, operates by pooling normalized intensity values across multiple samples for each SNP. A polar transformation was implemented and the resulting data were modeled using a three-component bivariate mixture model, where each component is a multivariate truncated t-distribution representing one of three genotype classes (assuming an autosomal SNP). The parameters of the mixture model are initialized empirically and fitted using an Expectation-Maximization (EM) algorithm. This differs from both BRLMM, which uses genotype calls made by Dynamic Model mapping (the forerunner of BRLMM) to seed the model, and CHIAMO, where priors are assigned to the genotype class parameters. While these algorithms are arguably more statistically sophisticated than ILLUMINUS, the computational speed advantage of using an EM algorithm should not be underestimated. In the context of a high-throughput genotyping pipeline, this means that ILLUMINUS can play an initial role in assessing the genotyping success of each sample batch as well as reduce the computational burden (in CPU minutes), leaving more resources for the downstream analyses of trait association and population structure. ILLUMINUS was tested on both amplified and unamplified DNA and, relative to the proprietary Illumina algorithm with and without manual curation, showed improved accuracy over a larger number of calls. This was especially true for amplified DNA. While other algorithms have been tried on amplified DNA with varying levels of success, ILLUMINUS has shown robustness in this regard, and it has been used to call thousands more samples of amplified DNA at the Sanger Institute³.
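The general idea of fitting three genotype clusters by EM can be sketched in Python as follows. This is a deliberately simplified illustration and not the ILLUMINUS implementation: it swaps the bivariate truncated t-distributions for one-dimensional Gaussians on a scaled polar angle, and the function names, the fixed starting values, and the 0.95 posterior threshold are all assumptions. It also has no special handling for SNPs where a genotype class is absent.

```python
import numpy as np


def _posteriors(theta, weight, mean, var):
    """Posterior class probabilities under a 1-D Gaussian mixture."""
    dens = np.exp(-0.5 * (theta[:, None] - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
    joint = weight * dens
    return joint / (joint.sum(axis=1, keepdims=True) + 1e-300)


def call_genotypes(x, y, n_iter=50, min_posterior=0.95):
    """Toy three-cluster genotype caller for a single SNP.

    x, y: normalized allele A and allele B intensities, one value per sample.
    Returns (calls, posteriors); calls are 0=AA, 1=AB, 2=BB, -1=no call.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Polar angle scaled so AA lies near 0, AB near 0.5, and BB near 1.
    theta = np.arctan2(y, x) / (np.pi / 2)

    # Assumed starting values; a real caller would initialize from the data.
    mean = np.array([0.1, 0.5, 0.9])
    var = np.full(3, 0.01)
    weight = np.full(3, 1.0 / 3.0)

    for _ in range(n_iter):
        # E-step: class membership probabilities for every sample.
        resp = _posteriors(theta, weight, mean, var)
        # M-step: re-estimate weights, means, and variances.
        total = resp.sum(axis=0) + 1e-12
        weight = total / total.sum()
        mean = (resp * theta[:, None]).sum(axis=0) / total
        var = (resp * (theta[:, None] - mean) ** 2).sum(axis=0) / total + 1e-6

    post = _posteriors(theta, weight, mean, var)
    calls = post.argmax(axis=1)
    # Samples without a confident class assignment are left uncalled.
    calls[post.max(axis=1) < min_posterior] = -1
    return calls, post
```

Each EM iteration is a pair of closed-form updates over the pooled samples, which is the source of the speed advantage over sampling-based fitting noted above.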

The two most popular genotyping platforms, Affymetrix and Illumina, have typically required between 5 and 10 micrograms of DNA per sample in order to achieve successful genotyping. For some sample collections this can lead to a choice of whether or not to amplify rare and/or small quantities of DNA. However, for the current generation of large-scale, genome-wide studies it was unclear how amplification would affect genotyping efficiency. Since the genome is differentially amplified, how does this affect fluorescent signal intensity? How does this translate to the performance of current genotype calling algorithms and what quantitative impact does it have on the genomic coverage of microarrays? Finally, can we use recently developed genotype imputation methods to rescue genomic regions and SNPs that are adversely affected by amplification? These questions are central to maximizing the impact of GWAS or population studies using amplified DNA.

The third chapter, Whole genome-amplified DNA: insights and imputation, explores the effects of DNA amplification. The negative impact of amplification on fluorescent signal, genotype calling confidence, and genomic coverage was evaluated across 6,500 DNA samples from several different populations (British, Gambian, and Vietnamese) drawn from both the Affymetrix and Illumina platforms. Using the algorithm IMPUTE version 1, Chapter 3 also investigated genotype imputation as a possible counterweight to DNA amplification and found that imputation recovered 60-70% of missing data, largely reinstating the power and coverage expected of an unamplified cohort. Overall, Chapter 3 serves as both a cautionary tale for investigators who face the choice of amplification and a guide for those who have no choice but to amplify due to rare or depleted DNA stocks.

While Chapters 2 and 3 presented analysis for the execution of GWAS, Chapter 4 showed a published instance of their implementation, in this case the ILLUMINUS calling algorithm and efforts to minimize false positive findings due to failed assays and batch effects. As an example GWAS, Bone mineral density, osteoporosis, and osteoporotic fractures: a genome-wide association study (Chapter 4) identified common SNPs at two genomic loci which contribute to an osteoporosis phenotype.

SNPs near TNFRSF11B were associated with decreased bone mineral density and increased risk of osteoporosis, while a non-synonymous SNP in the LRP5 gene was associated with decreased bone mineral density, osteoporosis, and osteoporotic fractures. The combined effect of the risk alleles at both loci was similar to that of other environmental osteoporosis risk factors and the loci were estimated to have a higher population frequency; however, this represents only a first step on the road to accurate, meaningful prediction of disease risk. This chapter has led subsequent studies to further delineate the common genetic component of osteoporosis by identifying an additional 18 causative genes⁴,⁵. Chapter 4 also explored the functional consequences of genetic variation at rs4355801 (near TNFRSF11B) using gene expression data from lymphoblast cell lines (LCLs), finding that the presence of the G allele resulted in a two-fold overexpression of TNFRSF11B. While this offers insight into one of the possible functions of the SNP, it was not shown that expression of TNFRSF11B itself was associated with bone mineral density or osteoporosis. Further, the presence of the expression association may be the result of selection bias for LCLs. If we are to assume that the biological pathways leading to complex phenotypes proceed in a linear way from genetic variation to transcriptional variation to phenotypic variation (other levels of variation can also be assumed), then the chain of causation would ideally be shown at each level.

With the deluge of common genomic variants associated with complex phenotypes, a logical next step is to use this knowledge to underpin large-scale functional investigation. In addition to Chapter 4, recent studies show that genetic variation plays a significant role in driving gene expression⁶,⁷, thus leading to the integration of transcriptomes with genomic profiles and clinical measurements⁸⁻¹⁰.

Chapter 5, An immune response pathway associated with blood lipid levels, seeks to uncover the functional consequences of common disease variants as well as explore the potential for co-expression pathways to associate with quantitative traits. The study does this by integrating transcriptional and genetic profiles from over 500 individuals with clinical lipid measurements in blood. The top transcriptional associations with apolipoprotein B, high density lipoprotein, and triglyceride levels are all shown to function as part of the same co-expression sub-network. Further, the expression of the sub-network (the Lipid-Leukocyte module) was shown to be driven by a previously observed GWAS signal for serum IgE levels¹¹. The Lipid-Leukocyte module contains key players in IgE signal transduction, a powerful pathway in allergy and inflammation. These findings indicate a previously unknown link between blood lipids and the immune response as well as provide a potential mechanistic hypothesis for the pathogenesis of coronary artery disease.

While the success of the study in Chapter 5 should provide a measure of optimism for future integrative studies, it is worth noting that we are still only just beginning to understand the complexity of biological networks and how to analyze them. For example, the test in Chapter 5 used to associate a co-expression sub-network with a lipid measurement only incorporated the sample loadings of the first singular vector from a decomposition of a sub-network’s expression matrix¹². While the interpretation of this test is straightforward, it does not take into account the orthogonal variation of additional singular vectors nor does it incorporate information about the amount of variation in expression explained by the first singular vector. This may lead to sub-networks being differentially represented by their summary expression profiles (e.g. only 30% of the variation in one sub-network is tested whereas 90% of the variation is tested in another). Further, genetic variation can be used to orient the edges of a network, thus allowing for the inference of causality¹³⁻¹⁵, a key goal of any biological systems analysis. However, current studies are still underpowered to detect genetic variants both in cis and in trans which drive gene expression. In order to better orient networks, expression quantitative trait loci studies will likely need sample sizes approaching those of GWAS, and even then there is no reason to expect that the genetic component of each gene’s expression will be the same as others.
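The summary-profile limitation described above can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the analysis code of Chapter 5: the function name `module_summary`, the genes-by-samples layout, and the per-gene standardization are assumed, and genes with zero variance are not handled.

```python
import numpy as np


def module_summary(expr):
    """Summarize a sub-network's expression matrix by its first singular vector.

    expr: 2-D array, rows = genes in the sub-network, columns = samples
    (assumed layout). Returns (profile, explained): one loading per sample,
    and the fraction of total variation captured by each singular vector.
    """
    x = np.asarray(expr, dtype=float)
    # Standardize each gene so that no single gene dominates the decomposition.
    x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    # vt[0] holds the sample loadings of the first singular vector.
    return vt[0], explained
```

A trait association test would then correlate `profile` with, for example, a lipid measurement across the same samples. Reporting `explained[0]` alongside that test shows how much of each sub-network's variation the summary profile actually represents, which is the quantity that differs between the 30% and 90% cases.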

Taken together, this thesis offers insight into how far the field of human disease genomics has come and how quickly it is now moving. The successes and failures of GWAS have not as yet been fully realized, and only time will tell as intricate biochemical and molecular studies are done to elucidate the potential mechanisms behind GWAS signals. In the meantime, the field must address the two largely exclusive issues of disease prediction and functional investigation. Both have tangible benefits for human health; however, the most cogent and effective way forward remains to be seen.

References

1. Affymetrix, Inc. (2006) BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500K Array Set.

2. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-678 (2007).

3. Loos, R. J. et al. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat. Genet. 40, 768-775 (2008).

4. Styrkarsdottir, U. et al. New sequence variants associated with bone mineral density. Nat. Genet. 41, 15-17 (2009).

5. Rivadeneira, F. et al. Twenty bone-mineral-density loci identified by large-scale meta-analysis of genome-wide association studies. Nat. Genet. 41, 1199-1206 (2009).

6. Goring, H. H. et al. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat. Genet. 39, 1208-1216 (2007).

7. Stranger, B. E. et al. Population genomics of human gene expression. Nat. Genet. 39, 1217-1224 (2007).

8. Chen, Y. et al. Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429-435 (2008).

9. Emilsson, V. et al. Genetics of gene expression and its effect on disease. Nature 452, 423-428 (2008).

10. Bosco, A., McKenna, K. L., Firth, M. J., Sly, P. D. & Holt, P. G. A network modeling approach to analysis of the Th2 memory responses underlying human atopic disease. J. Immunol. 182, 6011-6021 (2009).

11. Weidinger, S. et al. Genome-wide scan on total serum IgE levels identifies FCER1A as novel susceptibility locus. PLoS Genet. 4, e1000166 (2008).

12. Horvath, S. & Dong, J. Geometric interpretation of gene coexpression network analysis. PLoS Comput. Biol. 4, e1000117 (2008).

13. Schadt, E. E. et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet. 37, 710-717 (2005).

14. Li, R. et al. Structural model analysis of multiple quantitative traits. PLoS Genet. 2, e114 (2006).

15. Aten, J. E., Fuller, T. F., Lusis, A. J. & Horvath, S. Using genetic markers to orient the edges in quantitative trait networks: the NEO software. BMC Syst. Biol. 2, 34 (2008).

SUMMARY

Every living organism has a genome, a set of encoded biological instructions for the recreation of that particular organism. Genomics is the study of genomes and their function. In terms of the classical Nature vs. Nurture view of human traits and behavior, the genome constitutes the vast majority of the Nature side, and it therefore plays a significant role in the development of human disease. In order to understand diseases like diabetes and schizophrenia, we must understand our genome.

This thesis presents current studies and analytical methodologies in human disease genomics. These include the first non-proprietary genotype calling algorithm for the Illumina genotyping platform, the first assessment of whole genome amplified DNA for genome-wide association studies, a genome-wide association study investigating the genetic basis for bone mineral density, osteoporosis, and osteoporotic fractures, and an integrated genomics study of genetic, transcriptional, and blood lipid variation.

Before the design of the ILLUMINUS genotype calling algorithm (Chapter 2), genotypes assayed on the Illumina platform had to be called using a proprietary, black-box algorithm and/or the human eye. ILLUMINUS now allows for faster, automated, and more accurate genotype determination along with freely available source code. ILLUMINUS has been used in dozens of genome-wide association studies, uncovering scores of disease-associated genetic variants, and has stimulated the development of more customized algorithms for the Illumina platform.

Whole genome amplification is a technique that can make all the difference to the feasibility of a study. Many DNA collections are so rare and valuable that only minute quantities of DNA can be used; exploiting the copying capabilities of polymerases is therefore of immense interest. However, the technique is not perfect and its unintended effects can have a significant impact on the power to detect genetic variants associated with disease. This thesis provides the first large-scale assessment of whole genome amplified DNA in genome-wide association studies (Chapter 3) by characterizing the reduction in sample call rate and genome coverage and the increase in signal variation. Further, a potential panacea in genotype imputation is tested and found to recover much of the lost performance.

Multiple analytical techniques are integrated in order to investigate the genetics underlying bone mineral density, osteoporosis, and osteoporotic fractures (Chapter 4). Using a genome-wide association study framework, the association of genomic loci harboring TNFRSF11B and LRP5 is uncovered. This constitutes the first large-scale study of common genetic variation for these osteo-phenotypes, and, with the additional loci discovered by even larger studies, a first step on the road to the accurate prediction of disease status.

With the recent deluge of genetic association studies, systematic functional enquiry has largely lagged behind. This thesis also explores the analysis of large-scale transcriptomic datasets and their integrated use with genetic and clinical data. Since genes which reside in the same biological pathway also tend to have the same expression patterns, networks of gene co-expression can be defined and characterized by their association with clinical measurements like the high-density lipoprotein concentration in circulation. Using these methods, a sub-network containing key mediators of the immune response and inflammation is discovered and found to strongly associate with blood lipid levels (Chapter 5). The sub-network, the Lipid-Leukocyte module, represents a potential molecular hypothesis of the involvement of immunoglobulin E transduction in blood lipid pathways and the risk of atherosclerotic plaque formation.

While these studies have had a direct impact on scientists’ capacity to understand human disease, they constitute only a sliver of the rapidly developing field of genomics.