In silico strategies to improve insight in breast cancer Bense, Rico
DOI:
10.33612/diss.101935267
IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.
Document Version
Publisher's PDF, also known as Version of record
Publication date:
2019
Link to publication in University of Groningen/UMCG research database
Citation for published version (APA):
Bense, R. (2019). In silico strategies to improve insight in breast cancer. Rijksuniversiteit Groningen.
https://doi.org/10.33612/diss.101935267
Copyright
Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).
Take-down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.
Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the
number of authors shown on this cover page is limited to 10 maximum.
Arkajyoti Bhattacharya
1* Rico D. Bense
1* Carlos G. Urzúa-Traslaviña1
1Elisabeth G.E. de Vries
1Marcel A.T.M. van Vugt
1Rudolf S.N. Fehrmann
11
University of Groningen, University Medical Center Groningen, Department of Medical Oncology, Groningen, The Netherlands
*These authors contributed equally to this work
Transcriptional effects of copy number alterations and their associations with biological processes
in a large set of human cancers
Submitted
ABSTRACT
Copy number alterations (CNAs) can promote tumor progression by altering gene
expression levels. Due to transcriptional adaptive mechanisms, however, CNAs do not
always translate proportionally into altered expression levels. By reanalyzing >34,000 gene
expression profiles, we revealed the degree of transcriptional adaptation to CNAs in a
genome-wide fashion, which strongly associated with distinct biological processes. We
then developed a platform-independent method – ‘transcriptional adaptation to CNA
profiling’ (TACNA profiling) – that extracts the transcriptional effects of CNAs from gene
expression profiles without requiring paired CNA profiles. TACNA profiling was applied
to >28,000 patient-derived tumor samples to define the landscape of transcriptional effects
of CNAs. The utility of this landscape was demonstrated by the identification of four
genes that are predicted to be involved in tumor immune evasion when transcriptionally
affected by CNAs. In conclusion, we provide a novel tool to gain insight into how CNAs
drive tumor behavior via altered expression levels.
5
The genomic composition of a tumor results from defects in genome maintenance, leading to the genomic instability that characterizes many cancers 1 . Genomic instability can result in the accumulation of structural chromosome aberrancies, including copy number alterations (CNAs). CNAs are important genomic events in the progressive molecular
‘rewiring’ that takes place during tumor cell evolution 2 .
CNAs can promote tumor progression via alteration of expression levels of genes located at the affected genomic regions 3–5 . Due to transcriptional adaptive mechanisms, however, changes in gene copy number at the genomic level do not always translate proportionally into altered gene expression levels (Fig. 1a) 6,7 . The degree of transcriptional adaptation to CNAs is currently determined with the ‘genetical genomics’ approach 8 , which combines paired genomic and gene expression profiles of tumor samples to explore associations between CNAs and gene expression levels. However, we previously demonstrated that this approach is difficult to use for detecting the effects of CNAs on gene expression levels in a gene expression profile 9 . This is because gene expression profiling is often performed on biopsies comprising tumor cells and non-tumor cells from the tumor microenvironment. A gene expression profile therefore measures the average expression of all cell types present in tumor biopsies. This means that the effects of CNAs on gene expression levels, besides being influenced by experimental and other non- genetic factors, are often overshadowed by the effects of non-tumor cells 10,11 . Therefore, to accurately determine the degree of transcriptional adaptation to CNAs using genetical genomics, large numbers of paired genomic and gene expression profiles of tumor samples are required. Unfortunately, few large-scale genetical genomic datasets are available, and many of the publicly accessible gene expression profiles from tumor samples do not have a paired genomic profile. These samples are therefore currently excluded from genetical genomics analyses. Despite the current efforts in genetical genomics, the degree of transcriptional adaptation to CNAs thus remains unclear for most genes.
Unravelling the degree of transcriptional adaptation to CNAs in a genome-wide
fashion would greatly enhance our knowledge of how CNAs drive tumor progression. We
therefore developed a new, platform-independent method: transcriptional adaptation to
CNA profiling (TACNA profiling). TACNA profiling extracts the effects of CNAs on gene
expression levels from a single gene expression profile – generated from a tumor biopsy
– without the need for a paired genomic CNA profile. Using this method, we defined the
landscape of transcriptional effects of CNAs in >34,000 expression profiles. To demonstrate
this landscape’s utility, we determined which genes and biological pathways affected
by CNAs might be associated with tumor-infiltrating immune cell composition. These
insights could ultimately contribute to the development of new therapeutic strategies.
RESULTS Data acquisition
We collected gene expression data of 34,494 samples from four public repositories: Gene Expression Omnibus (GEO, n = 21,592, generated with Aff ymetrix HG-U133 Plus 2.0), Th e Cancer Genome Atlas (TCGA, n = 10,817, generated with RNA-seq), Cancer Cell Line Encyclopedia (CCLE, n = 1,067, generated with Aff ymetrix HG-U133 Plus 2.0) and Genomics of Drug Sensitivity in Cancer (GDSC, n = 1,018, generated with Aff ymetrix HG-U219) (Fig. 1b). In total, 2,085 expression profi les were generated from cell line samples, 28,200 from patient-derived tumor biopsies, and 4,209 from patient-derived tissue biopsies of normal tissue. Th e patient-derived tumor gene expression profi les collected from GEO and TCGA represented 30 tumor types in total (Supplementary Tables 1-4). A detailed description of pre-processing, normalization and quality control is
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
25 15 5 -5 -15 -25 -1 0 1
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
25 15 5 -5 -15 -25 -1 0 1
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
25 15 5 -5 -15 -25 -1 0 1
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
25 15 5 -5 -15 -25 -1 0 1
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
25 15 5 -5 -15 -25 -1 0 1
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
25 15 5 -5 -15 -25 -1 0 1
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
25 15 5 -5 -15 -25 -1 0 1
Genomic mapping
Weightinconsensussource
Extreme-valued region indicator
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
25 15 5 -5 -15 -25 -1 0 1
CCLE - cell lines - microarray data (HG-U133 Plus 2.0)Concensus source Extreme-valued region indicator
CCLE - cell lines - microarray data (HG-U133 Plus 2.0)
GDSC - cell lines - microarray data (HG-U219)
GDSC - cell lines - microarray data (HG-U219) GEO - patient samples - microarray data (HG-U133 Plus 2.0)
GEO - patient samples - microarray data (HG-U133 Plus 2.0)
TCGA - patient samples - RNA-seq data
TCGA - patient samples - RNA-seq data
a
c
b
Gene expression level
Gain No alteration Loss
Low degree of transcriptional adaptation
Gain No alteration Loss
Gene expression level
High degree of transcriptional adaptation Altered gene expression levels Copy number
alterations
Gain Loss
Cancer hallmarks
Evading growth suppressors
Resisting cell death Genome instability
& mutation Sustaining proliferative signaling
Enabling replicative immortality
Avoiding immune destruction Deregulating cellular energetics Inducing angiogenesis
Tumor- promoting inflammation Activating invasion &
metastasis
Figure 1. Data acquisition and decomposition of gene expression profi les. (a) CNAs can promote tumor progression via altering expression levels of genes located at the aff ected genomic regions. However, due to transcriptional adaptation mechanisms, changes in gene copy number at the genomic level do not always translate proportionally into altered mRNA expression levels.
Unravelling the degree of transcriptional adaptation to CNAs for all genes will greatly contribute to our knowledge how CNAs
drive tumor progression. (b) Number of gene expression profi les collected from GEO, TCGA, CCLE, and GDSC. (c) Examples of
CNA-CESs harboring a pattern in which only genes mapping to a specifi c contiguous genomic region had a high absolute weight.
5
provided in the Supplementary Note.
Independent component analysis identifies the effects of CNAs on gene expression levels
In tumor gene expression profiling, the net measured expression level of an individual gene is determined by the combined effects of various transcriptional regulatory factors, including experimental, genetic (e.g. CNAs) and non-genetic factors. To identify the effects of these factors on gene expression levels, we first performed consensus-independent component analysis (consensus-ICA) 12 on each of the four datasets separately (Supplementary Fig. 1, see Supplementary Note). Consensus-ICA is a computational method to separate gene expression profiles into additive consensus estimated sources (CESs), so that each CES is statistically as independent from the other CESs as possible. We hypothesized that each CES describes the effect of a latent transcriptional regulatory factor on gene expression levels. In every CES, each gene has a weight that describes how strongly and in which direction its expression level is influenced by a latent transcriptional regulatory factor.
Ultimately, consensus-ICA resulted in 855 CESs in the GEO dataset, 1,383 CESs in the TCGA dataset, 467 CESs in the CCLE dataset and 466 CESs in the GDSC dataset.
In all four datasets, many CESs showed a consistent pattern in which only genes mapping to a specific contiguous genomic region had a high absolute weight (Fig. 1c).
We hypothesized that these CESs (referred to as CNA-CESs) capture the effects of CNAs on gene expression levels. Using a detection algorithm (see Supplementary Note), we identified 242/855 (28%) of CNA-CESs in the GEO dataset containing such a pattern, 447/1,383 (32%) in the TCGA dataset, 186/467 (40%) in the CCLE dataset and 175/466 (38%) in the GDSC dataset. In these CNA-CESs, 97% of genes present in the GEO dataset, 97% in the TCGA dataset, 95% in the CCLE dataset and 94% in the GDSC dataset were located in at least one genomic region identified by the detection algorithm as having a significant number of genes with a high absolute weight (Supplementary Table 5). These high percentages indicate that almost all genes are affected by CNAs in our datasets.
Inferred profiles describing the transcriptional effects of CNAs highly correlate with paired, independently generated CNA profiles
To continue testing the above hypothesis, we developed a method called transcriptional
adaptation to CNA profiling (TACNA profiling – see Supplementary Note). In TACNA
profiling, the gene expression profile of an individual sample is reconstructed using only
CNA-CESs. We assumed that the resulting TACNA profile would show the effect of CNAs
present in the genome on gene expression levels of that sample, and thus expected to see
a clear correlation between paired TACNA and CNA profiles. We retrieved CNA profiles,
derived from SNP arrays, for a subset of samples in the TCGA dataset (n = 10,620), CCLE dataset (n = 1,011) and the GDSC dataset (n = 962). Examples of TACNA profi les showed that the observed patterns clearly matched those in their paired, independently generated CNA profi les (Fig. 2a). Strong correlations were found between TACNA profi les and paired CNA profi les in all three datasets (Fig. 2b). Low correlations were found for a subset of samples in the TCGA dataset mainly representing normal tissue, which generally does not contain CNAs (see inset Fig. 2b). As expected, we found lower correlations between paired TACNA and CNA profi les in the TCGA dataset for patient-derived samples belonging to the more ‘point mutation-driven’ tumor types (e.g. acute myeloid leukemia) than for samples from tumor types that are more ‘copy-number driven’ (e.g.
ovarian cancer) (Fig. 2c) 13 . Moreover, the borders of marked genomic regions in 82/242
Copy number TACNA signal 1 3
-3 -1 5
-5 -3 -1 1 3
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
Genomic mapping
Copy number TACNA signal 1 3
-3 -1 5
-5 -3 -1 1 3
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
Genomic mapping
Copy number TACNA signal 1 3
-3 -1 5
-5 -3 -1 1 3
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
Genomic mapping
Copy number TACNA signal 1 3
-3 -1 5
-5 -3 -1 1 3
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
Genomic mapping
Copy number TACNA signal 10
-10 -2 5
-5 -3 -1 1 3
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
2 6
-6
Genomic mapping
Correlation between TACNA- and CNA-profile 0
200 400 600 800 1,000
Count
−0.50 −0.25 0.00 0.25 0.50 0.75 1.00
Tumor samples Normal samples
Correlation between TACNA- and CNA-profile
Variance of CNA-profile
−0.25 0.00 0.25 0.50 0.75 1.00
0 0.25
0.5 Tumor samples Normal samples
TCGA TCGA
Copy number TACNA signal 10
-10 -2 5
-5 -3 -1 1 3
1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122
2 6
-6
Genomic mapping
TCGA - individual patient sample TACNA profile CNA profile
TCGA - individual patient sample
CCLE - individual cell line sample CCLE - individual cell line sample
GDSC - individual cell line sample GDSC - individual cell line sample
0 20 40 60 80 100
Count
−0.50 −0.25 0.00 0.25 0.50 0.75 1.00
Correlation between TACNA- and CNA-profile
0 20 40 60 80 100
Count
−0.50 −0.25 0.00 0.25 0.50 0.75 1.00
Correlation between TACNA- and CNA-profile
Correlation between TACNA- and CNA-profile
Variance of CNA-profile
−0.25 0.00 0.25 0.50 0.75 1.00
0 0.25
CCLE 0.5 CCLE
Correlation between TACNA- and CNA-profile
Variance of CNA-profile
−0.25 0.00 0.25 0.50 0.75 1.00
0 0.25
GDSC 0.5 GDSC
−0.25 0.00 0.25 0.50 0.75 1.00
0 0
200
−0.50 −0.25 0.00 0.25 0.50 0.75
Tumor samples Normal samples
Normal Acute myeloid leukemiaThyroid carcinoma Prostate adenocarcinomaThymoma Uterine corpus endometrial carcinomaBreast carcinoma, unknown subtypeER−/HER2−/PR+ breast carcinomaDiffuse large B-cell lymphomaPancreatic adenocarcinomaSkin cutaneous melanomaGastric adenocarcinomaColon adenocarcinomaUrothelial carcinomaLower grade gliomaUveal melanomaMesotheliomaSarcoma Pheochromocytoma and paragangliomaER+/HER2+ breast carcinomaER+/HER2− breast carcinomaPapillary renal cell carcinomaHepatocellular carcinoma Head and neck squamous cell carcinomaOvarian serous cystadenocarcinomaChromophobe renal cell carcinomaCervical squamous cell carcinomaTriple-negative breast carcinomaLung squamous cell carcinomaClear cell renal cell carcinomaER−/HER2+ breast carcinomaTesticular germ cell carcinomaAdrenocortical carcinomaGlioblastoma multiformeRectal adenocarcinomaUterine carcinosarcomaEsophageal carcinomaLung adenocarcinomaCholangiocarcinoma
0.0 0.2 0.4 0.6 0.8 1.0
1st quartile of r [0-0.452) 2nd quartile of r [0.452-0.684) 3rd quartile of r [0.684-0.765) 4th quartile of r [0.765-1]
Proportion (n = 304)498)
142)514) 294)56) 160)156)94) 184)43)77) 526)36) 66) 366)138) 288) 516)544)
166)204)87) 470)11) 406)259) 451)413) 527)178) 80)48) 362)119) 492)504) 166)657) (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n = (n =(n =
a
b c
Figure 2. Transcriptional adaptation to CNA profi ling (TACNA profi ling). (a) Examples of TACNA profi les illustrating that the resulting patterns clearly matched those in their paired, independently generated CNA profi les. (b) Left panel: distribution of Pearson correlations between TACNA profi les and paired CNA profi les for the TCGA, CCLE, and GDSC datasets. Right panel:
variance observed in CNA profi les versus the Pearson correlation between CNA profi les and paired TACNA profi les for the TCGA, CCLE, and GDSC datasets. (c) Quartiles distribution plot of Pearson correlations between TACNA profi les and paired CNA profi les, per tumor type in the TCGA dataset. ER, estrogen receptor; HER2, human epidermal growth factor receptor 2;
PR, progesterone receptor.
5
(33%) CNA-CESs in the GEO dataset and 141/447 (32%) CNA-CESs in the TCGA dataset colocalized with a stringent set of common fragile sites on the human genome 14 , which is compatible with the notion of genomic instability underlying structural abnormalities such as CNAs (Supplementary Table 6).
The degree of inferred transcriptional adaptation is highly concordant across datasets and platforms
Although strong correlations were found between paired TACNA and CNA profiles, not all genes mapping to a marked genomic region in a CNA-CES had equally high weights.
We hypothesized that the absolute weight of an individual gene in a marked genomic region of a CNA-CES represents how much its expression level is affected by the underlying CNA: a lower absolute weight corresponds with a higher degree of transcriptional adaptation. A pair-wise comparison of the weights of genes mapping to marked genomic regions in CNA-CESs between the four datasets showed that many CNA-CESs strongly correlated with another CNA-CES in each of the three remaining datasets (Fig. 3a and Supplementary Table 7). This indicates that the specific patterns observed in many CNA- CESs are robust and independent of platform or dataset. For example, we found a high correlation between two CNA-CESs in the GEO dataset and TCGA dataset in which the oncogene KRAS had a high weight (Fig. 3b, r = 0.66).
The degree of inferred transcriptional adaptation is associated with distinct biological processes
Next, we determined whether the patterns observed in CNA-CESs were associated with
biological processes rather than being mere mathematical artifacts. For each CNA-CES
we transformed the weights of genes mapping to the marked genomic region to a metric
ranging from zero to one, where zero corresponds to the highest absolute weight (i.e.,
low degree of transcriptional adaptation to CNAs) and one corresponds to the lowest
absolute weight (i.e., high degree of transcriptional adaptation to CNAs). Subsequently,
all genes mapping to marked genomic regions in the CNAs-CESs were pooled and ranked
according to their metric. We then performed gene set enrichment analysis (GSEA) on
this ranked gene list using 12 gene set databases from the Molecular Signatures Database
(MSigDB) 15 . The analysis showed strong biological enrichment for all databases, with
high concordance between GSEA results obtained with GEO and TCGA metrics (r
ranging between 0.86 and 0.97, Supplementary Table 8). Genes with a lower degree of
transcriptional adaptation were highly enriched for gene sets involved in proliferation
(Fig. 3c). Conversely, genes with a higher degree of transcriptional adaptation were highly
enriched for immune-related gene sets. This indicates that the expression levels of genes
involved in immune signaling are less affected on average when CNAs occur in their
genomic region. Th e opposite eff ect was seen for genes involved in signaling pathways important for cell proliferation.
Most genes have a moderate to high degree of transcriptional adaptation to CNAs In both the GEO and TCGA dataset, most genes mapped to a marked genomic region in
Genomic mapping
GEO - Consensus Estimated Source 9 KRAS TCGA - Consensus Estimated Source 751
Genomic mapping
1 2 3 4 5 6 7 8 9 10 11 12 1314 15 16 17 18 19 20 2122
Weight in consensus sourceExtreme-valued region indicator
1.00
0.00 40.0
20.0
0.0
-20.0
Weight in consensus source
Extreme-valued region indicator
1.00
0.00 40.0
20.0
0.0
-20.0
Spearman r = 0.66
1 2 3 4 5 6 7 8 9 10 11 12 1314 15 16 17 18 19 20 2122
KRAS
Weight in consensus source Extreme-valued region indicator
a
b
GEO to TCGA 47.52%
0 25 50 75
0.00 0.25 0.50 0.75 1.00
Spearman r
−log10(P)
CCLE to GDSC 42.47%
0 50 100
0.00 0.25 0.50 0.75 1.00
Spearman r
−log10(P)
TCGA to GEO 25.95%
0 25 50 75
0.00 0.25 0.50 0.75 1.00
Spearman r
−log10(P)
GDSC to CCLE 44.57%
0 50 100
0.00 0.25 0.50 0.75 1.00
Spearman r
−log10(P) GEO 57%
(138/242) CCLE 39%
(73/186)
GDSC 44%
(77/175)
TCGA 26%
(118/447)
GEO 30%
(73/242) CCLE 55%
(102/186)
GDSC 47%
(77/175)
TCGA 13%
(58/447)
GEO 49%
(119/242) CCLE 31%
(58/186)
GDSC 31%
(55/175)
TCGA 29%
(131/447)
GEO 31%
(76/242) CCLE 45%
(84/186)
GDSC 58%
(102/175)
TCGA 12%
(54/447)
0 0.5 0.625 0.75 0.875 1
TCGA GEO
Oxidative phosphorylationMYC targets V1 E2F targets DNA repair Unfolded protein responseG2/M checkpoint mTORC1 signalingMYC targets V2 Protein secretionMitotic spindle P53 pathway Heme metabolismGlycolysis ROS pathwayAdipogenesis Fatty acid metabolismPeroxisome
PI3K/AKT/mTOR signaling Wnt/β-catenin signalingTGF-β signaling Androgen response IFN-α response
UV response up Notch signaling UV response down Cholesterol homeostasisApoptosis Estrogen response earlyEstrogen response late
SpermatogenesisApical surface IFN-γ response Complement
IL-6/JAK/STAT3 signaling Apical junctionHedgehog signaling Allograft rejection
Hypoxia
Bile acid metabolism Angiogenesis IL-2/STAT5 signalingXenobiotic metabolism KRAS signaling upPancreas β-cells TNF-α signaling via NF-κB Coagulation Inflammatory response Myogenesis KRAS signaling down EMT transition
Proliferation-related gene sets Immune-related gene sets P = 1 x 10-50 P = 1 x 10-10 P = 0.0001 P = 0.05
Enriched for genes having a lower degree of transcriptional adaptation Enriched for genes having a higher degree of transcriptional adaptation
c
Figure 3. Degree of transcriptional adaptation to CNAs. (a) Citrus plots showing the Spearman correlations between CNA-CESs
in the reference dataset, depicted in bold, with CNA-CESs in the other three datasets. Blue lines indicate r > 0.5. Correlations
are calculated based on the weights of genes in marked genomic regions of the CNA-CESs under investigation. Each scatterplot
with marginal histograms shows the correlations versus their -log
10transformed P values. Th e inset shows correlations > 0.5
having a P value < 0.05. (b) Example of a CNA-CESs in the GEO dataset that is highly correlated with a CNA-CES in the TCGA
dataset with KRAS having a low degree of transcriptional adaptation to CNAs. (c) Enrichment results for the MSigDB Hallmark
collection. A yellow bubble indicates enrichment for genes with a high degree of transcriptional adaptation to CNAs, and a blue
bubble indicates a low degree. Th e size of the bubble corresponds to the signifi cance level. Only CNA-CESs having at least 50
genes in their marked region were included. EMT, epithelial-mesenchymal transition; ROS, reactive oxygen species.
5
only one CNA-CES (Supplementary Table 9). The degree of transcriptional adaptation of these genes (n = 7,641) highly correlated between the GEO and TCGA dataset (Fig. 4a, r = 0.84). Of these 7,641 genes, 786 (10.2%) had low adaptation (metric < 0.25), 3,898 (51.0%) moderate adaptation (metric 0.25 – 0.75), and 2,957 genes (38.7%) high adaptation (metric
> 0.75, Fig. 4b). These results suggest that for most genes the transcriptional effect of an underlying CNA at their genomic position is largely buffered by non-genetic adaptive mechanisms.
Amplification of oncogenes and the resulting enhanced gene expression levels have been implicated in the progression of many cancers. 16 We observed a large variability in the degree of transcriptional adaptation in a set of oncogenes obtained from the Catalogue of Somatic Mutations in Cancer Gene Census (Fig. 4c and Supplementary Table 10) 17 . Whereas some oncogenes in this set had a low degree of transcriptional adaptation (e.g.
a
b
200150
100
50
0
0.0 0.2 0.4 0.6 0.8 1.0
Number of genes
Average degree of transcriptional adaptation
n = 786Low Moderate
n = 3,898 High
n = 2,957 0.00
0.25 0.50 0.75 1.00
0.00 0.25 0.50 0.75 1.00
Degree of transcriptional adaptation of genes occurring once in a CNA-CES (GEO) Degree of transcriptional adaptation of genes occurring once in a CNA-CES (TCGA)
Spearman r = 0.84
0.0 0.2 0.4 0.6 0.8 1.0
Average degree of transcriptional adaptation
c
Figure 4. Degree of transcriptional adaptation to CNAs per individual gene. Only CNA-CESs having at least 50 genes in
their marked region were included. (a) Degree of transcriptional adaptation to CNAs for individual genes mapping to a
marked genomic region in a single CNA-CES in both the GEO and TCGA dataset. (b) Distribution of the average degree of
transcriptional adaptation to CNAs for genes mapping to a marked genomic region in a single CNA-CES in both the GEO and
TCGA dataset. (c) Average degree of transcriptional adaptation to CNA for a set of oncogenes obtained from the Catalogue of
Somatic Mutations in Cancer Gene Census.
RAF1, metric = 0.06), most showed a relatively high degree of transcriptional adaptation to CNAs (e.g. MYC, metric = 0.86). Th is suggests that also transcription of many oncogenes is under the strong control of non-genetic regulatory factors that buff er the eff ects of CNAs.
Transcriptional eff ects of CNAs in >21,000 cancer samples
TACNA profi ling was applied to samples from overlapping tumor types in the GEO dataset (n = 13,180) and TCGA dataset (n = 8,150) to defi ne the landscape of transcriptional eff ects of CNAs. Th is showed a high correlation between the average pan-cancer TACNA profi les in the GEO and TCGA dataset (r = 0.73, Fig. 5a). Th ese average pan-cancer patterns of transcriptional eff ects of CNAs closely resemble average pan-cancer copy number patterns reported in previous studies 18,19 . As expected, high correlations between the average GEO and TCGA TACNA profi les of copy number-driven tumors such as triple-negative breast carcinoma (r = 0.80), colorectal adenocarcinoma (r = 0.74), and ovarian carcinoma (r = 0.73) were observed (Fig. 5b and Supplementary Table 11).
Chromosome
8,150 unrelated primary tumor samples (TCGA)
1 2 3
45 67
89 1011
1213 1415
1617 1819
2021 22 Positive transcriptional effect
Negative transcriptional effect
Acute myeloid leukemiaAdrenocortical carcinoma
ER-/HER2+ breast carcinoma ER+/HER2- breast carcinoma ER+/HER2+ breast carcinoma
Triple-negative breast carcinomaCervical carcinomaCholangiocarcinomaColorectal adenocarcinomaCutaneous melanomaDiffuse large B-cell lymphoma
Esophageal carcinoma Gastric adenocarcinomaGlioblastoma multiformeHepatocellular carcinomaHNSCCLower grade gliomaLung adenocarcinomaOvarian carcinomaPancreatic carcinomaProstate carcinomaChromophobe renal cell carcinoma
Clear cell renal cell carcinoma Papilla
ry renal cell carcinomaThyroid carcinomaUrothelial carcinomaUveal melanoma
a c
b
Genomic mappingSpearman correlation r = 0.73
Average TACNA expression
GEO
TCGA
Pan-cancer
Positive Negative 0.75
0.50 0.25 0.00 -0.25
0.50 0.25 0.00 -0.25
-0.50 1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 21 22
Triple-negative breast carcinoma (n = 146) Adrenocortical carcinoma (n = 79) Colorectal adenocarcinoma (n = 573) Ovarian carcinoma (n = 307) ER+/HER2+ breast carcinoma (n = 140) ER+/HER2- breast carcinoma (n = 554) ER-/HER2+ breast carcinoma (n = 43) Glioblastoma multiforme (n = 166) Papillary renal cell carcinoma (n = 291) Cholangiocarcinoma (n = 36) Hepatocellular carcinoma (n = 373) Cutaneous melanoma (n = 472) Gastric adenocarcinoma (n = 415) Esophageal carcinoma (n = 185) Lung adenocarcinoma (n = 517) Uveal melanoma (n = 80) Diffuse large B−cell lymphoma (n = 48) Cervical carcinoma (n = 306) Prostate carcinoma (n = 498) Clear cell renal cell carcinoma (n = 534) Chromophobe renal cell carcinoma (n = 66) Urothelial carcinoma (n = 408) Thyroid carcinoma (n = 509) Pancreatic carcinoma (n = 179) HNSCC (n = 522) Lower grade glioma (n = 530) Acute myeloid leukemia (n = 173)
Triple-negative breast carcinoma (n = 737) Adrenocortical carcinoma (n = 40) Colorectal adenocarcinoma (n = 2,710) Ovarian carcinoma (n = 187) ER+/HER2+ breast carcinoma (n = 506) ER+/HER2- breast carcinoma (n = 1,678) ER-/HER2+ breast carcinoma (n = 456) Glioblastoma multiforme (n = 390) Papillary renal cell carcinoma (n = 37) Cholangiocarcinoma (n = 4) Hepatocellular carcinoma (n = 364) Cutaneous melanoma (n = 398) Gastric adenocarcinoma (n = 332) Esophageal carcinoma (n = 97) Lung adenocarcinoma (n = 1,019) Uveal melanoma (n = 106) Diffuse large B−cell lymphoma (n = 752) Cervical carcinoma (n = 62) Prostate carcinoma (n = 308) Clear cell renal cell carcinoma (n = 225) Chromophobe renal cell carcinoma (n = 37) Urothelial carcinoma (n = 39) Thyroid carcinoma (n = 97) Pancreatic carcinoma (n = 81) HNSCC (n = 386) Lower grade glioma (n = 158) Acute myeloid leukemia (n = 2,604)
-1 Spearman r 0 1 GEO
TCGA
Figure 5. Th e landscape of transcriptional eff ects of CNAs in a large set of cancer samples. Only CNA-CESs having at least 50
genes in their marked region were included. (a) Average pan-cancer TACNA profi les in GEO and TCGA dataset. (b) Distance
matrix of Spearman correlations between average TACNA profi les in the GEO dataset and TCGA dataset for overlapping
tumor types. Th e size and transparency of a square corresponds to the absolute correlation coeffi cient. HNSCC, head and neck
squamous cell carcinoma. (c) Hierarchical clustering of the landscape of transcriptional eff ects of CNAs in the TCGA dataset
for 8,150 cancer samples of tumor types also present in the GEO dataset.
5
Hierarchical clustering of the landscape of transcriptional eff ects of CNAs in the GEO and TCGA dataset showed clear diff erences and commonalities between tumor types (Fig.
5c and Supplementary Fig. 2), indicating that tumor types share strong transcriptional eff ects of CNAs driving biological pathways relevant for their tumor behavior. Th e landscape of transcriptional eff ects of CNAs can enable transcriptomic-wide association studies (TWASs) to generate strong hypotheses on genes and biological pathways aff ected by CNAs that might be causally linked to tumor phenotypes.
Transcriptional eff ects of CNAs are associated with the inferred activity and composition of immune cells present in the tumor microenvironment
Recently, a higher CNA burden was found to be negatively associated with an expression- based metric describing CD8+ T cell activity in the tumor microenvironment 19,20 . Th e ability to elicit an anti-cancer immune response with immune checkpoint inhibitors depends on
Figure 6. Transcriptional eff ects of CNAs in relation to inferred composition and activity of immune cells present in the tumor microenvironment. Only CNA-CESs having at least 50 genes in their marked region were included. (a) Per tumor type, Spearman correlations between inferred CNA burden and an expression-based metric describing CD8+ T cell activity for each sample. (b) Manhattan plot showing, the enrichment (-log
10P value on the y-axis) for genes with a strong correlation between TACNA expression level and inferred CD8+ T cell abundance per cytogenetic band defi ned according to the MSigDB Positional Gene Sets collection. (c) Left panel: constructed co-functionality network for the top 350 genes with the highest inverse correlation between their TACNA expression levels and CD8+ T cell abundance. Middle panel: cluster in which genes share strong predicted co-functionality (r > 0.8) within the co-functionality network that was enriched with genes predicted to be involved in immunological processes (e.g. complement, infl ammatory response, and IFN-γ response). Right panel: a GBA strategy with >106,000 expression profi les in combination with the gene set collection obtained from the Mammalian Phenotype (MP) Ontology predicted for the genes IRF2, OSTF1, LOC90768, and ZCCHC6 that altered transcription results in a decreased CD8+/αβ T cell number.
Gene IRF2
OSTF1
LOC90768
ZCCHC6
Predicted mammalian phenotype Decreased B cell number Enlarged spleen Enlarged lymph nodes Decreased IFN-γ secretion Increased leukocyte cell number Decreased T cell number Decreased lymphocyte cell number Decreased CD8+/αβ T cell number Enlarged spleen Increased leukocyte cell number Impaired natural killer cell mediated cytotoxicity Abnormal cytokine secretion Decreased CD8+/αβ T cell number Enlarged lymph nodes Abnormal macrophage physiology Decreased T cell proliferation Decreased B cell number Enlarged spleen Decreased T cell number Decreased CD8+/αβ T cell number Increased leukocyte cell number Enlarged lymph nodes Decreased lymphocyte cell number Increased spleen weight Enlarged spleen Increased leukocyte cell number Increased IgG1 level Decreased T cell number Decreased CD4+/αβ T cell number Enlarged lymph nodes Decreased CD8+/αβ T cell number Abnormal cytokine secretion
Distance r 0.630.66 0.660.62 0.630.63 0.610.63
0.540.53 0.580.53 0.520.56 0.510.52
0.480.46 0.470.46 0.450.44 0.460.42 0.570.54 0.590.54 0.510.56 0.530.58
Permuted Z-score 12.61 12.42 11.86 11.74 11.63 10.99 10.52 10.21 10.76 8.898.67 8.548.53 8.468.46 8.39 9.809.09 8.037.90 7.767.49 7.276.92 10.83 9.469.17 8.918.86 8.808.40 8.30
a
c
> 100 75 50 25 0 25 50 75
> 100
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 2122
Genomic mapping -log10(P)
Negatively correlated Positively correlated Gene enrichment per cytoband on correlations between TACNA expression levels and CD8+ T cell abundance
b
Z-transformed P value
3.25 - Unfolded protein response 2.50 - KRAS signaling down 4.25 - MTORC1 signaling 4.20 - Spermatogenesis 7.45 - MYC targets V1 8.68 - G2/M checkpoint
4.86 - Mitotic spindle 8.54 - E2F targets 6.63 - DNA repair 5.48 - MYC targets V2
5.42 - Complement 5.13 - IFN-γ response 5.01 - Inflammatory response 4.58 - Allograft rejection 4.47 - IL2/STAT5 signaling 4.23 - TNF-α signaling via NFκB 3.96 - IL6/JAK/STAT3 signaling 3.66 - Heme metabolism 3.16 - IFN-α response 2.81 - KRAS signaling up 2.88 - Hedgehog signaling
2.10 - Notch signaling 1.95 - Hypoxia 2.69 - Coagulation 3.04 - Apical junction 4.98 - Myogenesis 3.71 - UV response down
1.92 - Estrogen response early 4.63 - EMT
3.45 - Angiogenesis 2.77 - Inflammatory response
2.32 - TNF-α signaling via NFκB 2.85 - IFN-γ response 2.53 - Spermatogenesis 2.27 - G2/M checkpoint 2.69 - E2F targets 3.35 - Allograft rejection 2.95 - KRAS signaling down
2.44 - Myogenesis 2.25 - MYC targets V1
IRF2
OSTF1
LOC90768 ZCCHC6
NINJ1
ACSL1 DAPK1
NFIL3
KLF3
ANXA1 THBD Cholangiocarcinoma Pancreatic carcinoma ER−/HER2+ breast carcinoma Triple−negative breast carcinoma Gastric adenocarcinoma Ovarian carcinoma Lung adenocarcinoma Diffuse large B−cell lymphoma HNSCC Colorectal adenocarcinoma Thyroid carcinoma ER+/HER2+ breast carcinoma Hepatocellular carcinoma Adrenocortical carcinoma Cutaneous melanoma Esophageal carcinoma ER+/HER2− breast carcinoma Cervical carcinoma Clear cell renal cell carcinoma Urothelial carcinoma Uveal melanoma Glioblastoma multiforme Prostate carcinoma Acute myeloid leukemia Lower grade glioma Papillary renal cell carcinoma Chromophobe renal cell carcinoma
−0.8
−0.6
−0.4
−0.2 0.0 0.2 0.4 0.6
Spearman r
n = 100 n = 400 n = 900 n = 1,600 n = 2,500
GEO TCGA