Algorithms and analysis of human disease genomics Inouye, M.

Citation

Inouye, M. (2010, April 20). Algorithms and analysis of human disease genomics. Retrieved from https://hdl.handle.net/1887/15277

Version: Corrected Publisher’s Version

License: Licence agreement concerning inclusion of doctoral thesis in the Institutional Repository of the University of Leiden

Downloaded from: https://hdl.handle.net/1887/15277

Note: To cite this publication please use the final published version (if applicable).

1

INTRODUCTION AND AIMS OF THE THESIS

The study of the genetic basis for traits and disease began with the rediscovery in 1900 of Gregor Mendel’s seminal work on Pisum sativum by Hugo de Vries and Carl Correns and has progressed rapidly with the recent sequencing of the human genome^{1, 2}. Today, in a span of 4-5 human meioses, there are thousands of scientists investigating how genetic variation correlates with observable phenotypes in a wide range of organisms. The most successful approaches in this endeavor have historically been candidate gene studies and linkage mapping; however, in the past decade the foundation has been laid for an alternative method, association studies. This thesis covers the development of analytical tools needed for association studies and a specific example of their implementation for osteoporosis. Building on the success of association studies, I extend my research to associate phenotypes with gene transcription and co-expression networks while also incorporating information from the functional effects of association study findings. Taken together, this thesis aims to characterize the recent past of trait-disease mapping while also offering a glimpse of the field’s future directions.

Genome-wide linkage mapping has been tremendously successful in identifying the genes which underlie highly penetrant, monogenic (Mendelian) disease³. Since Mendelian diseases are usually quite rare, genetic markers proximal to the causative disease variant will co-segregate within families (Figure 1). Therefore, if a sufficient number of pedigrees can be established, then the genomic interval of the disease variant can be narrowed to the point that sequencing a collection of cases and controls can identify the variant.

Figure 1: Given the co-segregation of the disease variant with genetic markers A and B in the pedigree on the right, a logarithm of the odds of linkage (LOD score) can be calculated. In this example, the causative disease variant lies between A and B.

The power of linkage mapping, however, depends on two crucial assumptions: (a) that the disease variant is highly penetrant and (b) that the disease or trait itself is highly heritable.
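The LOD score mentioned in the caption of Figure 1 can be illustrated with a minimal sketch. It assumes the simplest textbook setting, which is not necessarily the one used for any pedigree in this thesis: phase-known, fully informative meioses, of which a known number are recombinant between the marker and the disease locus. The function names and the grid of recombination fractions are illustrative choices.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD score at recombination fraction theta for phase-known meioses.

    Assumption: every meiosis is fully informative, so the likelihood is
    binomial in the number of recombinants.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must lie in (0, 0.5]")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    k, n = recombinants, meioses
    # log10 likelihood under linkage: theta^k * (1 - theta)^(n - k)
    log_linked = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    # log10 likelihood under free recombination (theta = 0.5): 0.5^n
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked


def max_lod(recombinants: int, meioses: int):
    """Return (best LOD, best theta) over a grid of theta = 0.01 ... 0.50."""
    grid = [i / 100 for i in range(1, 51)]
    return max((lod_score(recombinants, meioses, t), t) for t in grid)


if __name__ == "__main__":
    best_lod, best_theta = max_lod(recombinants=1, meioses=10)
    print(f"max LOD = {best_lod:.2f} at theta = {best_theta:.2f}")
```

By convention a LOD score of 3 or more is usually taken as evidence for linkage, so a single small pedigree of this kind would not suffice on its own, which is why scores are summed over many pedigrees.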

Candidate gene studies (CGS) are largely based on a priori expectations of which genes are involved in disease. These genes are then sequenced in a collection of cases and controls⁴. The success of CGS has been variable as, almost by definition, it depends wholly on the knowledge of investigators, a humbling prospect given that linkage studies commonly exhibit surprising results. However, CGS has had notable success, in particular in uncovering the effects of missense mutations in MC4R on morbid obesity⁵ and of various genes in lipid metabolism^{6-8}.

Why has the success of linkage studies for Mendelian disease not been replicated for common disease? Beyond the observation that common diseases have far less heritability, many reasons have been put forward; however, the most convincing comes from simple natural selection. Within a population, highly deleterious alleles would be subject to strong purifying selection and therefore would not be allowed to rise to the frequencies at which common diseases occur. Further, Reich and Lander⁹ reasoned that since many common diseases tend to manifest themselves in later life (that is, after reproductive age), an allele that contributes fractionally to a disease could reach higher frequencies. This line of reasoning suggested that common disease and Mendelian disease have different genetic architectures, thus leading to the common disease-common variant hypothesis (CDCV). CDCV hypothesized that common disease is polygenic and made up of alleles of intermediate to common frequency in a population, each one contributing a relatively mild amount to overall disease risk.

Genome-wide association studies (GWAS) were conceived as a way of properly testing CDCV; however, in order to make GWAS possible the genetics field had to make several key advances. Variation in the human genome had to be characterized, both in terms of increasing the known sample space through the SNP Consortium¹⁰ and in terms of understanding and measuring its correlated structure through recombination¹¹ and linkage disequilibrium^{12, 13}. The latter allowed for the intelligent selection of a subset of markers and formed the basis for the inference of genotypes at unobserved markers, thereby crucially reducing the number of variants that needed to be tested and leading to the development of a number of imputation algorithms¹⁴ (Li Y, Ding J, Abecasis GR. Markov model for rapid haplotyping and genotype imputation in genome wide studies. Am J Hum Genet. 2007;S79:2290). Just as importantly, GWAS had to be appropriately powered to differentiate alleles of mild effect from random noise and artifacts, thus the genetic variation of large sample sizes needed to be assessed. The goal of accurately assaying large numbers of samples in a timely manner was primarily achieved by private companies (many working in collaboration with academics) who developed highly scalable platforms with microarrays that could assess thousands of samples at hundreds of thousands of genomic loci in a matter of months. In this area, the two most successful companies to date are Affymetrix (Santa Clara, CA, USA) and Illumina (San Diego, CA, USA).
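The correlated structure that makes marker selection and imputation possible is commonly summarized by the pairwise linkage disequilibrium measure r². The sketch below computes r² for two biallelic markers from haplotype and allele frequencies. It assumes those frequencies are already known (in practice they are estimated from phased or unphased genotypes), and the function name is an illustrative choice.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Pairwise r^2 between two biallelic markers.

    p_ab : frequency of the haplotype carrying allele A at marker 1
           and allele B at marker 2
    p_a  : frequency of allele A at marker 1
    p_b  : frequency of allele B at marker 2
    """
    # D measures the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        raise ValueError("both markers must be polymorphic")
    return d * d / denom


if __name__ == "__main__":
    # Alleles A and B always travel together: marker 2 is a perfect proxy.
    print(ld_r_squared(p_ab=0.3, p_a=0.3, p_b=0.3))
    # Alleles assort independently: marker 2 carries no information.
    print(ld_r_squared(p_ab=0.25, p_a=0.5, p_b=0.5))
```

A marker in high r² with an untyped variant can stand in for it on a genotyping array, which is the intuition behind selecting a subset of "tag" markers.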

The general idea behind GWAS is quite simple. In a case/control setting, a set of genetic markers is determined in two groups randomly drawn from the same population, the only difference between them being that one displays a certain condition (e.g. type 1 diabetes) and the other does not (or has not yet manifested the condition). For each genetic marker, we can then perform a statistical test (e.g. a Cochran-Armitage test for trend or a logistic regression) to see if it can significantly predict case/control status. In a quantitative trait setting, the trait we are measuring is continuous (e.g. concentration of low density lipoprotein cholesterol), and we have a similar goal in that we wish to determine if a marker can significantly predict (e.g. through linear regression) the quantity of the trait.
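As an illustration of the single-marker case/control test described above, the following is a minimal sketch of the Cochran-Armitage test for trend on a 2 x 3 table of genotype counts. The additive weights (0, 1, 2), the function name, and the example counts are assumptions made for illustration; the counts are invented and do not come from any study in this thesis.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Return (chi2, p_value) for the Cochran-Armitage trend test.

    cases, controls : genotype counts ordered as (AA, AB, BB)
    weights         : genotype scores; (0, 1, 2) encodes an additive model
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")
    col_totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n_total = n_cases + n_controls
    if n_cases == 0 or n_controls == 0:
        raise ValueError("both groups must contain samples")

    # Trend statistic: weighted difference between case and control counts.
    t_stat = sum(
        w * (r * n_controls - s * n_cases)
        for w, r, s in zip(weights, cases, controls)
    )

    # Variance of the trend statistic under the null hypothesis.
    k = len(weights)
    squares = sum(w * w * c * (n_total - c) for w, c in zip(weights, col_totals))
    cross = sum(
        weights[i] * weights[j] * col_totals[i] * col_totals[j]
        for i in range(k)
        for j in range(i + 1, k)
    )
    variance = (n_cases * n_controls / n_total) * (squares - 2 * cross)
    if variance <= 0:
        return 0.0, 1.0  # monomorphic marker: no evidence either way

    chi2 = t_stat ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = cochran_armitage_trend(cases=(10, 40, 50), controls=(30, 50, 20))
    print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

In a genome-wide scan this test would be repeated at every marker, which is why the choice of significance threshold discussed next matters.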

The complexity of GWAS comes from the execution of these simple tests. There are major considerations as to the design of the study^{15, 16}, the most common being a two-stage design in which stage one is used to identify candidate loci to take forward for replication in a larger (typically broader) collection. There are issues of power and coverage^{17-21} which have significant ramifications for how many samples are tested (and, in a case/control setting, in what ratio) and which genotyping array should be chosen. For example, a study needs to test almost 9,000 samples to have 90% power to detect a variant that has a 20% minor allele frequency and increases disease risk by a factor of 1.2 (assuming a significance threshold of 10^-8)¹⁸. Further, given the many correlated tests employed, there are a number of ways, both frequentist and Bayesian, to assign “genome-wide significance”^{22-25}; however, GWAS seem to have settled on a straight frequentist cutoff of 10^-7 or 10^-8. There exist simulation studies which support this view, as it is predicted that there are approximately one million haploblocks in the human genome (Pe’er I and Daly 2008). Lastly, simple quality control plays an integral role in the success of a GWAS. As samples are sent for genotyping it is essential to control laboratory errors by tracking their progress using DNA fingerprints (typically genotypes at a few dozen high frequency loci); both sample and SNP filtering criteria, e.g. assessing call rate, pair-wise sample identity-by-descent, and a SNP Hardy-Weinberg test, are implemented to remove poor or failed assays; and manual SNP cluster plot inspection of the significantly associated signals is crucial to identify instances of poor genotyping classification (Figure 2), a source of bias especially in case/control designs²⁶.
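Two of the points above lend themselves to a short sketch: the arithmetic linking roughly one million independent haploblocks to a cutoff near 10^-7 to 10^-8, and the per-SNP Hardy-Weinberg test used in quality control. The sketch assumes a plain Bonferroni correction and a 1-degree-of-freedom chi-square goodness-of-fit test; exact tests are often preferred in practice, and the function names and the 0.05 family-wise error rate are illustrative choices.

```python
import math


def bonferroni_threshold(n_independent_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold for a given family-wise error rate."""
    if n_independent_tests <= 0:
        raise ValueError("n_independent_tests must be positive")
    return alpha / n_independent_tests


def hardy_weinberg_p_value(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 df) p-value for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic SNP: the test is uninformative
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    # About one million independent haploblocks gives a cutoff of 5e-8.
    print(bonferroni_threshold(1_000_000))
    # A SNP with no heterozygotes at all is a likely genotyping failure.
    print(hardy_weinberg_p_value(50, 0, 50))
```

A SNP filter would then discard markers whose Hardy-Weinberg p-value falls below a chosen cutoff; that cutoff is a study-level decision and is not specified here.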

Figure 2: Instances of (a) poor and (b) accurate genotype classification. Genotypes are classified as AA (blue), AB (green), BB (red), or no-call (grey). The poor classification favors the A allele for no-calls, leading to an artificially low A allele frequency and, if the no-calling is biased between cases and controls, a potentially false positive association signal.

For any study of SNP variation, two of the most fundamental problems are (a) how is the genotype of a sample determined and (b) how much DNA is required for genotyping.

While the designs of Affymetrix and Illumina microarrays are different, both assess each SNP using fluorescent signal measures for the relative proportions of allele A and allele B in a given sample; the problem of genotype determination, or “genotype calling”, is therefore one of how best to interpret this information. To date, the most successful genotype calling methods have been computational algorithms, and Chapter 2 of this thesis, A genotype calling algorithm for the Illumina BeadArray platform (in collaboration with Yik Y. Teo and Taane G. Clark)²⁷, presents an open-source algorithm which calls genotypes in an accurate and fast manner. Subsequently, the algorithm, ILLUMINUS, has been successfully applied in a number of GWAS^{28-32}. The quantity of DNA required for successful genotyping is non-trivial, typically 5-10 micrograms, therefore protocols that amplify small quantities of DNA can be extremely valuable for laboratories with resources that are finite or below those required. Chapter 3 of this thesis, Whole genome-amplified DNA: insights and imputation (in collaboration with Yik Y. Teo)³³, assesses the effects of the most popular method, multiple displacement amplification³⁴, so that researchers can make an informed decision of whether or not to amplify their sample collections.
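To make the genotype calling problem described above concrete, here is a deliberately naive sketch that calls a genotype from the two allele intensities using fixed cut-offs on their normalized difference. This is not the ILLUMINUS algorithm of Chapter 2, which is model-based; the function name, the contrast definition used here, and all threshold values are hypothetical and chosen only for illustration.

```python
def call_genotype(
    x: float,
    y: float,
    hom_cut: float = 0.6,
    het_cut: float = 0.2,
    min_intensity: float = 0.2,
) -> str:
    """Naive fixed-threshold genotype call from allele A (x) and B (y) signals.

    Assumes non-negative, already normalized intensities. Returns one of
    "AA", "AB", "BB" or "NC" (no-call).
    """
    total = x + y
    if total < min_intensity:
        return "NC"  # too little signal to trust either allele
    contrast = (x - y) / total  # +1 is pure A signal, -1 is pure B signal
    if contrast >= hom_cut:
        return "AA"
    if contrast <= -hom_cut:
        return "BB"
    if abs(contrast) <= het_cut:
        return "AB"
    return "NC"  # ambiguous zone between clusters


if __name__ == "__main__":
    for x, y in [(1.0, 0.05), (0.5, 0.5), (0.05, 1.0), (0.7, 0.3)]:
        print((x, y), call_genotype(x, y))
```

Fixed cut-offs of this kind ignore SNP-to-SNP differences in cluster position, which is one reason per-SNP clustering across many samples is used in practice and why cluster plots such as those in Figure 2 are inspected manually.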

With both the technology and the tools developed, there has been a deluge of significantly associated loci for a wide range of diseases and traits, including type 1 diabetes^{31, 35}, type 2 diabetes^{36, 37}, myocardial infarction^{38, 39}, Crohn’s disease^{40, 41}, celiac disease^{42, 43}, various cancers^{44-46}, obesity^{28, 47}, height⁴⁸, and hair/eye/skin color⁴⁹. Perhaps the most successful of these efforts was that of the Wellcome Trust Case Control Consortium (WTCCC)⁵⁰, a collaboration of over a hundred scientists from the United Kingdom which sought to identify variants underlying seven common diseases (Figure 3).

Figure 3: Manhattan plots for the seven common diseases analyzed by the WTCCC⁵⁰. This figure is reproduced with the permission of Nature Publishing Group.

The fourth chapter of this thesis, Bone mineral density, osteoporosis, and osteoporotic fractures: a genome-wide association study, features a GWAS (in collaboration with J. Brent Richards and Fernando Rivadeneira)³² that identified two biologically plausible, common gene variants which decrease bone mineral density and increase risk of osteoporosis, one of which also increases risk of osteoporotic fractures. In addition to Chapter 4, I have also contributed to the identification of a common variant in the IL2/IL21 region associated with celiac disease⁴³ and a common variant proximal to MC4R that is associated with obesity²⁸ (Figure 4).

Figure 4: Genome-wide association signals in celiac disease and obesity. This figure shows (a) the association signal of rs6822844 (meta-analysis P = 1.3x10^-14) for celiac disease and its regional LD structure and (b) the association signal of rs17782313 (P = 2.9x10^-6 in stage one, replicated in adults at P = 2.8x10^-15) and its regional haplotype blocks as defined by light blue signals of recombination hotspots. These figures are reproduced from van Heel et al.⁴³ and Loos et al.²⁸ with the permission of Nature Publishing Group.

Taken together, GWAS have been both a resounding success and an appreciable disappointment. We now have a catalogue of hundreds of genetic associations⁵¹, many of which were not predicted a priori and some of which are shared across multiple diseases or traits, thus revealing startling overlaps in genetic architectures^{52-54}. Also defying previous expectations, the majority of the associations appear to be in non-coding regions of the genome, indicating a potentially large role for gene regulation. However, functional conclusions are largely premature since many SNPs displaying association are tested because of their correlation with proximal SNPs; it is therefore important to fine map the associated region for the causal SNP⁵⁵ or another genomic event such as a deletion.

What is apparent from the many findings of GWAS is that, with a few notable exceptions^{56, 57}, the effect sizes for common SNPs are typically low, usually conferring an increase in risk by a factor of only 1.1-1.5 per associated allele. Assuming no interactions between loci, the genetic variance for a trait explained by all common loci together is quite low, usually 10% or less^{41, 58}.
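The link between small per-allele effects and low explained variance can be sketched for the simplest case. The example below assumes a quantitative trait standardized to unit variance, purely additive effects, Hardy-Weinberg proportions, and independent loci with no interactions; disease traits require a liability or similar model that is not shown. Under those assumptions a SNP with allele frequency p and per-allele effect beta (in standard deviation units) explains 2p(1-p)beta² of the trait variance. The function names and the example values are illustrative and are not taken from any study cited here.

```python
def variance_explained(allele_freq: float, beta: float) -> float:
    """Fraction of trait variance explained by one additive SNP.

    allele_freq : frequency p of the effect allele (0 < p < 1)
    beta        : per-allele effect in units of trait standard deviations
    """
    if not 0.0 < allele_freq < 1.0:
        raise ValueError("allele_freq must lie strictly between 0 and 1")
    # Genotype variance under Hardy-Weinberg is 2p(1-p); scale by beta^2.
    return 2.0 * allele_freq * (1.0 - allele_freq) * beta ** 2


def total_variance_explained(loci) -> float:
    """Sum over independent, non-interacting loci given as (freq, beta) pairs."""
    return sum(variance_explained(p, b) for p, b in loci)


if __name__ == "__main__":
    one_locus = variance_explained(allele_freq=0.2, beta=0.1)
    twenty_loci = total_variance_explained([(0.2, 0.1)] * 20)
    print(f"one locus: {one_locus:.4f}, twenty such loci: {twenty_loci:.4f}")
```

Even twenty common loci of this hypothetical size would jointly account for only a few percent of the trait variance, which is consistent in spirit with the low figures reported above.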

In response, there are efforts now to expand the original CDCV hypothesis to encompass the full allelic spectrum of genetic variation. As depicted in Figure 5, the general observation of disease-associated variants has been that allele frequency and effect are negatively correlated; thus rare, Mendelian alleles have large effects and common alleles have small effects. So far, we have explored the two extremes of the allelic distribution and now aim to focus on those alleles of intermediate frequencies (~0.01%-1.0%) and presumably intermediate effects. The 1000 Genomes Project⁵⁹ and other trait-specific resequencing studies are underway to identify these loci and should serve to increase the amount of explainable genetic variance.

Figure 5: The allelic spectrum of disease-associated alleles is hypothesized to be somewhat “L” shaped. The common disease-common variant hypothesis (CDCV) has largely captured those alleles of high frequency and low effects while Mendelian diseases are rare but of large effects. The intermediate variation remains to be characterized.

Another approach to understanding the capacity of the human genome to cause disease is to integrate transcriptional information. Given that many of the associated loci uncovered by GWAS appear to be in regulatory regions, the characterization of the regulatory effects of these loci is especially relevant and will likely lead to novel functional insights^{60, 61}. It has been previously shown that SNPs and CNVs cause changes in gene expression^{62-64} and, further, that these changes occur in a tissue-dependent manner⁶⁵. Recent studies have begun to integrate genetic, transcriptional, and disease information in a systems level approach which uses transcriptional variation to define a gene network and its sub-networks. This has led to the identification of the macrophage-enriched metabolic network (MEMN), a sub-network shared by both mouse and man which is enriched for genes associated with obesity^{66, 67}. Murine knockouts of genes within the MEMN confirmed them as novel genes for metabolic syndrome⁶⁶. Chapter 5 of this thesis, An immune response pathway associated with blood lipid levels, attempts to connect the findings of GWAS with those at the systems level. To this end, we show that a sub-network of co-expressed genes, the lipid-leukocyte (LL) module, is not only associated with high density lipoprotein, apolipoprotein B, and triglyceride levels but also associated with variation at a common SNP previously shown to be associated with serum immunoglobulin E (IgE) levels⁶⁸. Given that the LL module harbors key players in the immune response, namely histidine decarboxylase and subunits of the high-affinity IgE receptor, the genetic association is biologically plausible and generates a mechanistic hypothesis for the role the immune system might play in changes to serum lipid concentration.

The success of genetic mapping in human disease is only the beginning of a path which takes us towards better mechanistic understanding and clinical applications. Just as diseases have only a finite heritability, so too does the genome constitute only the first layer of disease models which will eventually incorporate transcriptional, epigenetic, cellular and metabolomic information. As research increases in these areas, so too must the ability to measure and quantify environmental variables like the amount of exposure to ultraviolet radiation or caloric intake. Clinically, the heritability of common variation associated with disease has only given modest returns of up to 10%, yet the proliferation of direct-to-consumer genetic tests (e.g. 23andMe, Inc.) has been somewhat concerning given public difficulty in assessing health risks^{69, 70}. The era of “genomic medicine” will likely depend on the ability of established medical centres to incorporate recently developed technologies (e.g. next-generation sequencing) and findings in genetic associations into everyday clinical practice. To this end, two studies in particular are underway: ClinSeq⁷¹ at the National Human Genome Research Institute and ProjectK at Leiden University Medical Centre. ClinSeq investigates the practical considerations for implementation of next-generation sequencing for genomic medicine by aiming initially to sequence 300-400 genes from 1000 individuals with clinical data. ProjectK is a study of one human genome, from a Dutch clinical geneticist, which has been sequenced to ~20X depth. A complete disease and trait risk profile will be assembled, and, with input from the clinical geneticist herself, ProjectK aims not only to be a proof-of-principle for organizing and implementing genomic medicine at a medical centre but also to investigate questions of individual perception and informed consent.

In the coming decade, genomic medicine will become more clinically prevalent and studies such as ClinSeq and ProjectK will only grow in importance. Just as scientists have taken extraordinary care in the genetic investigation of complex disease, it is likely that a similar care will be needed as this knowledge is transferred to non-scientists.

References

1. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).

2. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304-1351 (2001).

3. Jimenez-Sanchez, G., Childs, B. & Valle, D. Human disease genes. Nature 409, 853-855 (2001).

4. Tabor, H. K., Risch, N. J. & Myers, R. M. Candidate-gene approaches for studying complex genetic traits: practical considerations. Nat. Rev. Genet. 3, 391-397 (2002).

5. Vaisse, C. et al. Melanocortin-4 receptor mutations are a frequent and heterogeneous cause of morbid obesity. J. Clin. Invest. 106, 253-262 (2000).

6. Cohen, J. et al. Low LDL cholesterol in individuals of African descent resulting from frequent nonsense mutations in PCSK9. Nat. Genet. 37, 161-165 (2005).

7. Romeo, S. et al. Population-based resequencing of ANGPTL4 uncovers variations that reduce triglycerides and increase HDL. Nat. Genet. 39, 513-516 (2007).

8. Cohen, J. C. et al. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305, 869-872 (2004).

9. Reich, D. E. & Lander, E. S. On the allelic spectrum of human disease. Trends Genet. 17, 502-510 (2001).

10. Sachidanandam, R. et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928-933 (2001).

11. McVean, G. A. et al. The fine-scale structure of recombination rate variation in the human genome. Science 304, 581-584 (2004).

12. Daly, M. J., Rioux, J. D., Schaffner, S. F., Hudson, T. J. & Lander, E. S. High-resolution haplotype structure in the human genome. Nat. Genet. 29, 229-232 (2001).

13. International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299-1320 (2005).

14. Marchini, J., Howie, B., Myers, S., McVean, G. & Donnelly, P. A new multipoint method for genome-wide association studies by imputation of genotypes. Nat. Genet. 39, 906-913 (2007).

15. Skol, A. D., Scott, L. J., Abecasis, G. R. & Boehnke, M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 38, 209-213 (2006).

16. McCarthy, M. I. et al. Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9, 356-369 (2008).

17. de Bakker, P. I. et al. Efficiency and power in genetic association studies. Nat. Genet. 37, 1217-1223 (2005).

18. Altshuler, D., Daly, M. J. & Lander, E. S. Genetic mapping in human disease. Science 322, 881-888 (2008).

19. Anderson, C. A. et al. Evaluating the effects of imputation on the power, coverage, and cost efficiency of genome-wide SNP platforms. Am. J. Hum. Genet. 83, 112-119 (2008).

20. Barrett, J. C. & Cardon, L. R. Evaluating coverage of genome-wide association studies. Nat. Genet. 38, 659-662 (2006).

21. Spencer, C. C., Su, Z., Donnelly, P. & Marchini, J. Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genet. 5, e1000477 (2009).

22. Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 100, 9440-9445 (2003).

23. Dudbridge, F. & Koeleman, B. P. Efficient computation of significance levels for multiple associations in large studies of correlated data, including genomewide association studies. Am. J. Hum. Genet. 75, 424-435 (2004).

24. Nyholt, D. R. A simple correction for multiple testing for single-nucleotide polymorphisms in linkage disequilibrium with each other. Am. J. Hum. Genet. 74, 765-769 (2004).

25. Wacholder, S., Chanock, S., Garcia-Closas, M., El Ghormli, L. & Rothman, N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J. Natl. Cancer Inst. 96, 434-442 (2004).

26. Clayton, D. G. et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat. Genet. 37, 1243-1246 (2005).

27. Teo, Y. Y. et al. A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics 23, 2741-2746 (2007).

28. Loos, R. J. et al. Common variants near MC4R are associated with fat mass, weight and risk of obesity. Nat. Genet. 40, 768-775 (2008).

29. Sandhu, M. S. et al. LDL-cholesterol concentrations: a genome-wide association study. Lancet 371, 483-491 (2008).

30. Soranzo, N. et al. Meta-analysis of genome-wide scans for human adult stature identifies novel loci and associations with measures of skeletal frame size. PLoS Genet. 5, e1000445 (2009).

31. Barrett, J. C. et al. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nat. Genet. (2009).

32. Richards, J. B. et al. Bone mineral density, osteoporosis, and osteoporotic fractures: a genome-wide association study. Lancet 371, 1505-1512 (2008).

33. Teo, Y. Y. et al. Whole genome-amplified DNA: insights and imputation. Nat. Methods 5, 279-280 (2008).

34. Dean, F. B. et al. Comprehensive human genome amplification using multiple displacement amplification. Proc. Natl. Acad. Sci. U. S. A. 99, 5261-5266 (2002).

35. Todd, J. A. et al. Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nat. Genet. 39, 857-864 (2007).

36. Zeggini, E. et al. Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science 316, 1336-1341 (2007).

37. Zeggini, E. et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat. Genet. 40, 638-645 (2008).

38. Erdmann, J. et al. New susceptibility locus for coronary artery disease on chromosome 3q22.3. Nat. Genet. 41, 280-282 (2009).

39. Myocardial Infarction Genetics Consortium et al. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nat. Genet. 41, 334-341 (2009).

40. Parkes, M. et al. Sequence variants in the autophagy gene IRGM and multiple other replicating loci contribute to Crohn's disease susceptibility. Nat. Genet. 39, 830-832 (2007).

41. Barrett, J. C. et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat. Genet. 40, 955-962 (2008).

42. Hunt, K. A. et al. Newly identified genetic risk variants for celiac disease related to the immune response. Nat. Genet. 40, 395-402 (2008).

43. van Heel, D. A. et al. A genome-wide association study for celiac disease identifies risk variants in the region harboring IL2 and IL21. Nat. Genet. 39, 827-829 (2007).

44. Song, H. et al. A genome-wide association study identifies a new ovarian cancer susceptibility locus on 9p22.2. Nat. Genet. 41, 996-1000 (2009).

45. Rapley, E. A. et al. A genome-wide association study of testicular germ cell tumor. Nat. Genet. 41, 807-810 (2009).

46. Ahmed, S. et al. Newly discovered breast cancer susceptibility loci on 3p24 and 17q23.2. Nat. Genet. 41, 585-590 (2009).

47. Frayling, T. M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889-894 (2007).

48. Weedon, M. N. et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 40, 575-583 (2008).

49. Sulem, P. et al. Genetic determinants of hair, eye and skin pigmentation in Europeans. Nat. Genet. 39, 1443-1452 (2007).

50. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661-678 (2007).

51. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U. S. A. 106, 9362-9367 (2009).

52. Smyth, D. J. et al. Shared and distinct genetic variants in type 1 diabetes and celiac disease. N. Engl. J. Med. 359, 2767-2777 (2008).

53. Wolf, N. et al. Psoriasis is associated with pleiotropic susceptibility loci identified in type II diabetes and Crohn disease. J. Med. Genet. 45, 114-116 (2008).

54. Diabetes Genetics Initiative of Broad Institute of Harvard and MIT, Lund University, and Novartis Institutes of BioMedical Research et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316, 1331-1336 (2007).

55. Lowe, C. E. et al. Large-scale genetic fine mapping and genotype-phenotype associations implicate polymorphism in the IL2RA region in type 1 diabetes. Nat. Genet. 39, 1074-1082 (2007).

56. Klein, R. J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385-389 (2005).

57. Strittmatter, W. J. & Roses, A. D. Apolipoprotein E and Alzheimer's disease. Annu. Rev. Neurosci. 19, 53-77 (1996).

58. Prokopenko, I., McCarthy, M. I. & Lindgren, C. M. Type 2 diabetes: new genes, new understanding. Trends Genet. 24, 613-621 (2008).

59. Kaiser, J. DNA sequencing. A plan to capture human diversity in 1000 genomes. Science 319, 395 (2008).

60. Nica, A. C. & Dermitzakis, E. T. Using gene expression to investigate the genetic basis of complex disorders. Hum. Mol. Genet. 17, R129-34 (2008).

61. Dermitzakis, E. T. From gene expression to disease risk. Nat. Genet. 40, 492-493 (2008).

62. Stranger, B. E. et al. Population genomics of human gene expression. Nat. Genet. 39, 1217-1224 (2007).

63. Stranger, B. E. et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315, 848-853 (2007).

64. Goring, H. H. et al. Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat. Genet. 39, 1208-1216 (2007).

65. Dimas, A. S. et al. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science 325, 1246-1250 (2009).

66. Chen, Y. et al. Variations in DNA elucidate molecular networks that cause disease. Nature 452, 429-435 (2008).

67. Emilsson, V. et al. Genetics of gene expression and its effect on disease. Nature 452, 423-428 (2008).

68. Weidinger, S. et al. Genome-wide scan on total serum IgE levels identifies FCER1A as novel susceptibility locus. PLoS Genet. 4, e1000166 (2008).

69. Pancioli, A. M. et al. Public perception of stroke warning signs and knowledge of potential risk factors. JAMA 279, 1288-1292 (1998).

70. Slovic, P. Perception of risk. Science 236, 280-285 (1987).

71. Biesecker, L. G. et al. The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine. Genome Res. 19, 1665-1674 (2009).