Somatic mobile element detection in pediatric cancer:
Supplementary materials
Student: Ramon van Amerongen (6132375)
First examiner: Jayne Hehir-Kwa
Second examiner: Josephine Daub
University: Utrecht University
Faculty: Faculty of Science
Master: Bioinformatics and Biocomplexity
Contents
Methods
1. Benchmarking: HG002 sample remapping command
2. Benchmarking: Simulated ME characteristics
3. Description of Mobster alterations
3. Jitterbug modifications
4. Overlap algorithm
Figures and tables
1. Distribution of germline mobile element insertions present in HG002
2. TSD types of successfully simulated mobile element insertions
3. TSD length distributions of successfully simulated mobile element insertions
4. Mean squared error of the predicted VAF for both benchmark samples
5. Flow chart of overlap algorithm
6. Shared predicted simulated somatic insertions for the old version of Mobster and somatic mode of xTea
7. Cumulative detection per tool for somatic benchmark
   Mobster
   xTea
   Jitterbug
8. Recall for increasing variant allele fractions of simulated somatic insertions
9. ME detection accuracies of simulated somatic insertions for different ME types
10. Detection accuracies for the old version of Mobster and somatic mode of xTea
11. The predicted and actual VAFs of the simulated somatic insertions
   Mobster
   xTea
   Both
12. Confusion matrices for somatic ME type mismatches compared to the benchmark
   Mobster
   Jitterbug
13. Confusion matrices for somatic TSD type mismatches compared to the benchmark
   Mobster
   xTea
14. Testing for significant differences in MEI counts between tools and cancer types
15. Unverified overlapping MEI counts per cancer type
16. Testing for significant differences in verified MEI counts between cancer types
17. TSD length of all verified insertions
18. ME mismatches except the outlier patient
19. ME types of all tools except the outlier patient
20. TSD mismatches for all insertions
21. TSD types of only matching insertions between xTea and Mobster including the outlier patient
22. ME types of only matching insertions between xTea and Mobster except the outlier patient
23. Calculated and observed probabilities for gene components except the outlier patient
24. Calculated and observed probabilities for gene components for the outlier patient
25. Testing the calculated and observed probabilities for gene components except the outlier patient
26. Testing the calculated and observed probabilities for gene components for the outlier patient
27. Pan-cancer associations of the occurrence of SNVs and SVs and the occurrence of at least one MEI
28. Per cancer associations of the occurrence of SNVs and SVs and the occurrence of at least one MEI
Methods
1. Benchmarking: HG002 sample remapping command
The following command was used to remap the reads from the HG002 Novoalign aligned BAM file to a BWA aligned BAM file:
java -Xmx4G -jar <Picard 2.20.1> \
    SamToFastq \
    INPUT=${input_bam} \
    FASTQ=/dev/stdout \
    INTERLEAVE=true \
    NON_PF=true | \
<bwa 0.7.13> mem -K 100000000 -p -v 3 -t 40 -Y <hg38 genome reference> /dev/stdin - | \
samtools view -1 - > <output>.bam
2. Benchmarking: Simulated ME characteristics.
The following settings were used with the custom script (‘generate_te_insertions.py’) to simulate ME insertions:
All insertions were limited to chromosome 1.
Insertions were placed at least 100 bp apart.
161 insertions were generated for every combination of transposon type (LINE1, Alu or SVA) and allele fraction (0.1 to 0.9 in steps of 0.1).
The insertions were generated from Mobster’s consensus ME sequence library available at:
https://github.com/jyhehir/mobster/blob/master/resources/mobiome/hg19_54_active_mobile_elements.fasta
The sequences were mutated with a random rate between 0 and 5%.
The chance for a target site duplication to occur was set to 0.9, with a length drawn from a normal distribution peaking at 13 bp and truncated at 1 and 30 bp.
The chance for a target site deletion to occur was set to 0.05, with a length drawn from a normal distribution peaking at 7 bp and truncated at 1 and 20 bp.
The chance for an additional polyA sequence to be added was set to 0.25, with a length drawn from a uniform distribution between 1 and 50 bp.
Positions were chosen at random outside the blacklisted regions in the file hg38-blacklist.v2.bed.
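The target site and polyA settings above can be illustrated with a minimal Python sketch. This is not the code of ‘generate_te_insertions.py’. The function names are hypothetical, the standard deviation `sd` is an assumed value because the settings only give the peak and cut-offs, and treating duplication and deletion as mutually exclusive outcomes of a single draw is also an assumption.

```python
import random

def truncated_normal_length(rng, peak, low, high, sd):
    """Draw an integer length from a normal distribution centred on `peak`,
    redrawing until the value falls inside [low, high]."""
    while True:
        value = round(rng.gauss(peak, sd))
        if low <= value <= high:
            return value

def simulate_target_site_event(rng, sd=5.0):
    """Return ((event_type, length), poly_a_length) for one simulated insertion.

    `sd` is an assumed value; it is not given in the settings above."""
    draw = rng.random()
    if draw < 0.90:    # target site duplication, chance 0.9
        event = ("duplication", truncated_normal_length(rng, 13, 1, 30, sd))
    elif draw < 0.95:  # target site deletion, chance 0.05
        event = ("deletion", truncated_normal_length(rng, 7, 1, 20, sd))
    else:              # assumed remainder: neither duplication nor deletion
        event = ("no_tsd", 0)
    # Additional polyA with chance 0.25, uniform length between 1 and 50 bp.
    poly_a = rng.randint(1, 50) if rng.random() < 0.25 else 0
    return event, poly_a
```

Passing a seeded `random.Random` instance keeps a simulated set reproducible.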
3. Description of Mobster alterations
We added two ways of removing germline insertions from tumor insertions to Mobster: (1) When a previous output file is provided, any insertion is removed if it occurs within a configurable number of base pairs of an insertion in that file (default: 90 bp). (2) Alternatively, the BAM file of a normal sample can be used, as was done in our study. In the latter case, discordant and split reads from the tumor and normal samples are used to predict insertions concurrently, and the number of supporting reads that each sample contributes to the predicted MEI event is recorded. Insertions are then classified into three types: a high-confidence somatic insertion, with no supporting reads in the normal sample; a germline insertion, with at least a configurable number of supporting reads in the normal sample (default: 3); and a normal artifact, a somatic insertion whose number of supporting reads in the normal sample is below this threshold. The latter can, for example, be due to tumor contamination in the normal sample. In addition, the insert size metrics are now averaged over all samples instead of being taken from only the first BAM file, as was done in the previous version of Mobster.
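The two germline filters described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Mobster's implementation (Mobster is a separate tool with its own code base). The function names, the label strings and the representation of insertions as (chromosome, position) tuples are all hypothetical.

```python
def near_known_insertion(chrom, pos, known_insertions, window=90):
    """Filter (1): True if `pos` lies within `window` bp of an insertion
    listed in a previous output file, given as (chrom, pos) tuples."""
    return any(c == chrom and abs(pos - p) <= window
               for c, p in known_insertions)

def classify_by_normal_support(normal_supporting_reads, germline_threshold=3):
    """Filter (2): classify an insertion by its read support in the normal sample."""
    if normal_supporting_reads == 0:
        return "somatic"            # high-confidence somatic insertion
    if normal_supporting_reads >= germline_threshold:
        return "germline"
    return "normal_artifact"        # some normal support, but below the threshold
```

For example, an insertion with one supporting read in the normal sample would be flagged as a normal artifact under the default threshold of 3.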
The breakpoint and confidence window predictions were also improved. Based on clipped read positions, an ‘end’ breakpoint was added, which is used in addition to the ‘start’ breakpoint during clustering and filtering. Prediction of the target site duplication was also extended to the case in which discordant read clusters overlap in the absence of split read clusters. Previously, a target site duplication could only be predicted if both split clusters were present.
In addition, prediction of the variant allele fraction (VAF) was included in Mobster. A prediction is made in the region spanning from 200 bp to the left of the breakpoints to 200 bp to their right. In this region, the maximum ratio of the split and discordant read depth to the total depth is used as a proxy for the VAF. The 200 bp window was chosen because it produces a low mean squared error (MSE) of the VAF in the germline and somatic benchmark samples (supplementary figure 4).
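The VAF proxy can be sketched in a few lines of Python. This is a simplified illustration rather than Mobster's code. It assumes that per-position depths are already available as dictionaries, and the names `supporting_depth`, `total_depth` and `estimate_vaf` are hypothetical.

```python
def estimate_vaf(supporting_depth, total_depth, start_bp, end_bp, window=200):
    """Return the maximum ratio of supporting (split + discordant) read depth
    to total read depth in [start_bp - window, end_bp + window].

    Both depth arguments map a genomic position to a read depth."""
    best = 0.0
    for pos in range(start_bp - window, end_bp + window + 1):
        total = total_depth.get(pos, 0)
        if total > 0:
            best = max(best, supporting_depth.get(pos, 0) / total)
    return best
```

With `supporting_depth = {100: 5, 150: 2}` and `total_depth = {100: 20, 150: 4}`, breakpoints at 120 and 130 give a proxy of 0.5, taken from position 150.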
3. Jitterbug modifications
The following modifications were made to Jitterbug to make it usable for our study.
The ME mappings were saved to the database as a String instead of as an object, which became empty. In addition, Venn diagram visualization was disabled so that it could run on the HPC. Input file detection was also changed to allow BAM index files with both ‘.bam.bai’ and ‘.bai’ extensions. The produced GFF3 output also needed to be converted to VCF. For this, we developed a conversion script that takes the soft-clipped positions as the start (POS) and end (END tag) breakpoints. If these were not available, the middle of the prediction borders (left of CIPOS and right of CIPEND) was taken as both the start and end breakpoint.
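The breakpoint choice made during the GFF3 to VCF conversion can be sketched as below. This is not the conversion script itself. The function and argument names are hypothetical, and falling back to the midpoint whenever either soft-clipped position is missing is an assumption.

```python
def vcf_breakpoints(softclip_start, softclip_end, left_border, right_border):
    """Return (POS, END) for one Jitterbug prediction.

    The soft-clipped positions are used when both are available; otherwise
    the middle of the prediction borders serves as both breakpoints."""
    if softclip_start is not None and softclip_end is not None:
        return softclip_start, softclip_end
    middle = (left_border + right_border) // 2
    return middle, middle
```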
4. Overlap algorithm
When overlapping insertions between multiple prediction sets, their breakpoints must occur within 100 bp of each other for them to be grouped together, but their ME types are allowed to differ. Each insertion in a set cannot overlap with more than one insertion in another set. When multiple groups are possible (supplementary figure 5: A, middle), only the one with the lowest score is kept (supplementary figure 5: A, right). The score is determined by first calculating the distance between the start breakpoints and between the end breakpoints for each pair of insertions within a group. Then the absolute start and end differences are summed for each pair of insertions. Finally, the resulting values are summed over all pairings to obtain the score. Whenever this results in an equal score between groups, first the maximum and then the minimum of the start and end differences is taken instead of the sum. All remaining insertions that are not in a group are then iteratively grouped until no new groups are formed (again supplementary figure 5: A, right). In case more than two prediction sets are provided, the insertions are first grouped over all prediction sets (supplementary figure 5: A, three prediction sets). Any remaining insertions that do not overlap in all prediction sets (the non-grayed out MEs in the figure) are subsequently required to overlap in only the number of prediction sets minus one (supplementary figure 5: B, two prediction sets). This continues until only pairs of overlapping insertions are formed between two prediction sets.
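The group score can be illustrated with a short Python sketch. It covers only the distance check and the summed score, not the tie-breaking or the iterative regrouping. The function names are hypothetical, and applying the 100 bp limit to both the start and the end breakpoint of every pair is an assumption.

```python
from itertools import combinations

def within_distance(group, max_distance=100):
    """True if every pair of insertions in `group` has start and end
    breakpoints within `max_distance` bp of each other.

    `group` is a list of (start, end) tuples, one per prediction set."""
    return all(abs(a[0] - b[0]) <= max_distance and
               abs(a[1] - b[1]) <= max_distance
               for a, b in combinations(group, 2))

def group_score(group):
    """Sum the absolute start and end differences over all pairs in the group."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
               for a, b in combinations(group, 2))

def best_group(candidate_groups):
    """Keep the valid candidate group with the lowest score (ties not resolved)."""
    valid = [g for g in candidate_groups if within_distance(g)]
    return min(valid, key=group_score) if valid else None
```

For three insertions at (100, 110), (105, 112) and (98, 111), the pairwise sums are 7, 3 and 8, which gives a score of 18.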
Figures and tables
1. Distribution of germline mobile element insertions present in HG002
Figure 1: The mobile element insertions that are present on chromosome 1 of the HG002 sample of the Genome in a Bottle consortium.
2. TSD types of successfully simulated mobile element insertions
Figure 2: The number and type of (TSD) event occurring at the target site of the simulated insertions in sample HG002. In other figures, ‘no_tsd’ means that either no duplication or deletion occurred or that the TSD type was unknown. As the TSD type is already known for the simulated insertions, ‘no_tsd’ here means that no duplication or deletion occurred at all.
3. TSD length distributions of successfully simulated mobile element insertions
Figure 3: The lengths of target site duplications and deletions of successfully simulated insertions are displayed here.
4. Mean squared error of the predicted VAF for both benchmark samples
Figure 4: The mean squared error between the actual VAF and the VAF predicted by Mobster for the germline and somatic insertions, shown for different prediction windows. This was used to choose the optimal prediction window: 200 bp.
5. Flow chart of overlap algorithm
Figure 5: (A) When overlapping insertions in three prediction sets all possible triplets are determined first. Then, the lowest scoring groups are determined by iteratively looping over all triplets until no new groups can be formed. No insertion can be part of multiple groups. (B) Any insertions that were not part of any triplets (non-grayed out MEs) are then again grouped in pairs and the lowest scoring pairs are kept.
6. Shared predicted simulated somatic insertions for the old version of Mobster and somatic mode of xTea
Figure 6: The number of shared insertions between the tools for the somatic benchmark dataset. Here, the methods differ for two tools: the old version of Mobster and the somatic mode of xTea were used instead of the new version of Mobster and the tumor mode of xTea.
7. Cumulative detection per tool for somatic benchmark
Mobster:
Figure 7a: This figure shows the fraction of simulated somatic insertions that are detected for increasing VAF for only the new version of Mobster.
xTea:
Figure 7b: This figure shows the fraction of simulated somatic insertions that are detected for increasing VAF for only the tumor mode of xTea.
Jitterbug:
Figure 7c: This figure shows the fraction of simulated somatic insertions that are detected for increasing VAF for only Jitterbug.
8. Recall for increasing variant allele fractions of simulated somatic insertions
Figure 8: The recall of the simulated somatic insertions at different VAFs, shown for the newest version of Mobster, the tumor mode of xTea and Jitterbug.
9. ME detection accuracies of simulated somatic insertions for different ME types
Table 9: Here the number of true and false calls and the accuracy scores for detecting the somatic simulated insertions are shown separately for different ME types and tools (the newest version of Mobster, tumor mode of xTea and Jitterbug).
Tools                   Overlap method    ME     TP   FP   FN   Recall    Precision  F1
Mobster                 single tool       ALU    378  16   489  0.435986  0.959391   0.599524
Mobster                 single tool       LINE1  528  30   355  0.597961  0.946237   0.732824
Mobster                 single tool       SVA    436  23   448  0.493213  0.949891   0.649293
xTea                    single tool       ALU    370  20   497  0.426759  0.948718   0.588703
xTea                    single tool       LINE1  309  5    574  0.349943  0.984076   0.516291
xTea                    single tool       SVA    392  9    492  0.443439  0.977556   0.610117
Jitterbug               single tool       ALU    124  335  743  0.143022  0.270153   0.187029
Jitterbug               single tool       LINE1  237  90   646  0.268403  0.724771   0.391736
Jitterbug               single tool       SVA    166  16   718  0.187783  0.912088   0.311445
Mobster+xTea            intersection      ALU    325  0    542  0.374856  1.000000   0.545302
Mobster+xTea            intersection      LINE1  267  0    616  0.302378  1.000000   0.464348
Mobster+xTea            intersection      SVA    387  0    497  0.437783  1.000000   0.608969
Mobster+xTea            union             ALU    423  36   444  0.487889  0.921569   0.638009
Mobster+xTea            union             LINE1  570  35   313  0.645527  0.942149   0.766129
Mobster+xTea            union             SVA    441  32   443  0.498869  0.932347   0.649963
Mobster+Jitterbug       intersection      ALU    114  3    753  0.131488  0.974359   0.231707
Mobster+Jitterbug       intersection      LINE1  220  0    663  0.249151  1.000000   0.398912
Mobster+Jitterbug       intersection      SVA    160  0    724  0.180995  1.000000   0.306513
Mobster+Jitterbug       union             ALU    388  348  479  0.447520  0.527174   0.484092
Mobster+Jitterbug       union             LINE1  545  120  338  0.617214  0.819549   0.704134
Mobster+Jitterbug       union             SVA    442  39   442  0.500000  0.918919   0.647619
xTea+Jitterbug          intersection      ALU    113  0    754  0.130334  1.000000   0.230612
xTea+Jitterbug          intersection      LINE1  119  0    764  0.134768  1.000000   0.237525
xTea+Jitterbug          intersection      SVA    148  1    736  0.167421  0.993289   0.286544
xTea+Jitterbug          union             ALU    381  354  486  0.439446  0.518367   0.475655
xTea+Jitterbug          union             LINE1  427  95   456  0.483579  0.818008   0.607829
xTea+Jitterbug          union             SVA    410  25   474  0.463801  0.942529   0.621683
Mobster+xTea+Jitterbug  intersection      ALU    107  0    760  0.123414  1.000000   0.219713
Mobster+xTea+Jitterbug  intersection      LINE1  114  0    769  0.129105  1.000000   0.228686
Mobster+xTea+Jitterbug  intersection      SVA    147  0    737  0.166290  1.000000   0.285160
Mobster+xTea+Jitterbug  union             ALU    427  367  440  0.492503  0.537783   0.514148
Mobster+xTea+Jitterbug  union             LINE1  582  125  301  0.659117  0.823197   0.732075
Mobster+xTea+Jitterbug  union             SVA    446  48   438  0.504525  0.902834   0.647315
Mobster+xTea+Jitterbug  two out of three  ALU    338  3    529  0.389850  0.991202   0.559603
Mobster+xTea+Jitterbug  two out of three  LINE1  378  0    505  0.428086  1.000000   0.599524
Mobster+xTea+Jitterbug  two out of three  SVA    401  1    483  0.453620  0.997512   0.623639
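The Recall, Precision and F1 columns follow from the TP, FP and FN counts by the standard definitions. The sketch below uses a hypothetical function name and does not guard against zero denominators; it reproduces the values of the Mobster single-tool ALU row.

```python
def detection_metrics(tp, fp, fn):
    """Return (recall, precision, f1) from true/false positive and false negative counts."""
    recall = tp / (tp + fn)        # fraction of simulated insertions that were found
    precision = tp / (tp + fp)     # fraction of calls that were simulated insertions
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

# Mobster, single tool, ALU: TP = 378, FP = 16, FN = 489
recall, precision, f1 = detection_metrics(378, 16, 489)
print(f"{recall:.6f} {precision:.6f} {f1:.6f}")
```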
10. Detection accuracies for the old version of Mobster and somatic mode of xTea
Table 10: Here the number of true and false calls and the accuracy scores for detecting the somatic simulated insertions are shown for the old version of Mobster, somatic mode of xTea and Jitterbug.
Tools                   Comparison        TP    FP   FN    Recall    Precision  F1
Jitterbug               none              527   441  2107  0.200076  0.544421   0.292615
Mobster                 none              1350  130  1284  0.512528  0.912162   0.656296
Mobster+Jitterbug       intersection      498   16   2136  0.189066  0.968872   0.316391
Mobster+Jitterbug       union             1379  555  1255  0.523538  0.713030   0.603765
Mobster+xTea            intersection      964   0    1670  0.365983  1.000000   0.535853
Mobster+xTea            union             1431  130  1203  0.543280  0.916720   0.682241
Mobster+xTea+Jitterbug  intersection      367   0    2267  0.139332  1.000000   0.244585
Mobster+xTea+Jitterbug  two out of three  1103  16   1531  0.418755  0.985702   0.587796
Mobster+xTea+Jitterbug  union             1452  555  1182  0.551253  0.723468   0.625727
xTea                    none              1045  0    1589  0.396735  1.000000   0.568089
xTea+Jitterbug          intersection      375   0    2259  0.142369  1.000000   0.249252
xTea+Jitterbug          union             1197  441  1437  0.454442  0.730769   0.560393
11. The predicted and actual VAFs of the simulated somatic insertions
Mobster:
Figure 11a: The actual and predicted VAFs of the simulated somatic insertions by the newest version of Mobster.
xTea:
Figure 11b: The actual and predicted VAFs of the simulated somatic insertions by the tumor mode of xTea.
Both:
Figure 11c: Scatterplot of the predicted VAFs of both the newest version of Mobster and the tumor mode of xTea
12. ME type predictions of the simulated somatic insertions
Mobster:
Table 12a: Confusion matrix of the ME type predictions made by the newest version of Mobster. “Pred.” refers to the predicted ME types by Mobster in the first column, while “Act” refers to the actual ME types in the first row.
Pred./Act.  ALU  LINE1  SVA
ALU         375  0      0
LINE1       1    526    0
SVA         2    2      436
Jitterbug:
Table 12b: Confusion matrix of the ME type predictions made by Jitterbug. “Pred.” refers to the ME types predicted by Jitterbug, listed in the first column, while “Act.” refers to the actual ME types, listed in the first row.
Pred./Act.  ALU  LINE1  SVA
ALU         121  2      1
LINE1       4    235    13
SVA         0    2      153
13. TSD type predictions of the simulated somatic insertions
Mobster:
Table 13a: Confusion matrix of the TSD type predictions made by the newest version of Mobster. “Pred.” refers to the TSD types predicted by Mobster, listed in the first column, while “Act.” refers to the actual TSD types, listed in the first row. Mobster distinguishes between insertions whose TSD type is ‘unknown’ and insertions that did not have a duplication or deletion (‘no_tsd’).
Pred./Act.   deletion  duplication  no_tsd
deletion     57        2            3
duplication  0         1052         28
no_tsd       6         0            26
unknown      15        143          10
xTea:
Table 13b: Confusion matrix of the TSD type predictions made by the tumor mode of xTea. “Pred.” refers to the TSD types predicted by xTea, listed in the first column, while “Act.” refers to the actual TSD types, listed in the first row. When xTea predicts ‘no_tsd’, it can refer either to insertions whose TSD type is unknown or to insertions that have neither a duplication nor a deletion.
Pred./Act.   deletion  duplication  no_tsd
deletion     54        16           6
duplication  3         1052         34
no_tsd       3         15           1
14. Testing for significant differences in somatic MEI counts between tools and cancer types
Table 14: The statistical comparisons that were performed between the cancer types and tools in terms of the number of MEIs found. Comparisons were made for the different groups in ‘Groups’ that all belonged to the ‘Cancer type/Tool’.
Test            Cancer type/Tool            Groups                                             p-value       Adjusted p-value  Significant
wilcoxon        Nephroblastoma              xTea + Mobster                                     5.388791e-07  0.000002          True
wilcoxon        Ewing sarcoma               xTea + Mobster                                     1.907349e-06  0.000008          True
wilcoxon        Embryonal rhabdomyosarcoma  xTea + Mobster                                     6.103516e-05  0.000244          True
kruskal-wallis  xTea                        Nephroblastoma + Embryonal rhabdomyosarcoma + ...  2.334348e-03  0.001953          True
mann-whitney    Mobster                     Ewing sarcoma + Osteosarcoma                       2.185058e-04  0.002622          True
kruskal-wallis  Mobster                     Nephroblastoma + Embryonal rhabdomyosarcoma + ...  2.011519e-04  0.003653          True
wilcoxon        Osteosarcoma                xTea + Mobster                                     9.765625e-04  0.003906          True
mann-whitney    Mobster                     Nephroblastoma + Osteosarcoma                      4.676780e-04  0.005612          True
mann-whitney    xTea                        Nephroblastoma + Osteosarcoma                      1.144097e-03  0.013729          True
mann-whitney    xTea                        Ewing sarcoma + Osteosarcoma                       1.826373e-03  0.021916          True
mann-whitney    Mobster                     Embryonal rhabdomyosarcoma + Ewing sarcoma         9.282848e-03  0.111394          False
mann-whitney    Mobster                     Nephroblastoma + Embryonal rhabdomyosarcoma        5.021677e-02  0.602601          False
mann-whitney    xTea                        Nephroblastoma + Embryonal rhabdomyosarcoma        5.297484e-02  0.635698          False
mann-whitney    Mobster                     Nephroblastoma + Ewing sarcoma                     9.663498e-02  1.159620          False
mann-whitney    xTea                        Embryonal rhabdomyosarcoma + Osteosarcoma          9.675407e-02  1.161049          False
mann-whitney    xTea                        Embryonal rhabdomyosarcoma + Ewing sarcoma         1.379859e-01  1.655830          False
mann-whitney    xTea                        Nephroblastoma + Ewing sarcoma                     4.799072e-01  5.758887          False
mann-whitney    Mobster                     Embryonal rhabdomyosarcoma + Osteosarcoma          5.333465e-01  6.400158          False
15. Unverified overlapping somatic MEI counts per cancer type
Figure 15: The number of overlapping somatic insertions (within 100 bp) found by both xTea and Mobster, shown per patient and per cancer type. These insertions had not yet been manually verified.
16. Testing for significant differences in verified somatic MEI counts between cancer types
Table 16: The statistical comparisons that were performed between the cancer types in terms of the number of overlapping somatic MEIs found. Comparisons were made for the different groups in ‘Groups’.
Test            Groups                                             p-value   Adjusted p-value  Significant
kruskal-wallis  Nephroblastoma + Embryonal rhabdomyosarcoma + ...  0.000260  0.000260          True
mann-whitney    Nephroblastoma + Embryonal rhabdomyosarcoma        0.108051  0.648306          False
mann-whitney    Nephroblastoma + Ewing sarcoma                     0.017143  0.102861          False
mann-whitney    Nephroblastoma + Osteosarcoma                      0.013615  0.081689          False
mann-whitney    Embryonal rhabdomyosarcoma + Ewing sarcoma         0.048071  0.288426          False
mann-whitney    Embryonal rhabdomyosarcoma + Osteosarcoma          0.486900  2.921402          False
mann-whitney    Ewing sarcoma + Osteosarcoma                       0.001323  0.007938          True
chi-square      Nephroblastoma + Embryonal rhabdomyosarcoma + ...  0.000139  0.000139          True
fisher          Nephroblastoma + Embryonal rhabdomyosarcoma        0.152447  0.914682          False
fisher          Nephroblastoma + Ewing sarcoma                     0.072219  0.433312          False
fisher          Nephroblastoma + Osteosarcoma                      0.003937  0.023622          True
fisher          Embryonal rhabdomyosarcoma + Ewing sarcoma         0.003083  0.018501          True
fisher          Embryonal rhabdomyosarcoma + Osteosarcoma          0.226195  1.357168          False
fisher          Ewing sarcoma + Osteosarcoma                       0.000059  0.000354          True
17. TSD length of all verified insertions
Figure 17: The distribution of the length of the target site duplications and deletions of the verified somatic insertions.
18. ME mismatches for the verified insertions except the outlier patient
Table 18: Confusion matrix of the ME type predictions made for the verified insertions by Mobster and xTea. Insertions of the outlier patient were excluded.
xTea/Mobster  ALU  LINE1  SVA
ALU           1    4      5
LINE1         1    19     0
19. ME types of the verified insertions except the outlier patient
Figure 19: The number of insertions with a certain ME type is shown separately for xTea and Mobster. This does not show which insertions have mismatching ME types. The insertions of the outlier patient were excluded.
20. TSD mismatches for all verified insertions
Table 20: Confusion matrix of the TSD type predictions made by Mobster and xTea for the verified insertions. Mobster’s ‘unknown’ was first converted to ‘no_tsd’, as xTea does not distinguish between insertions for which the TSD type is unknown and insertions that did not have a duplication or deletion (‘no_tsd’). The insertions of the outlier patient are included.
xTea/Mobster  deletion  duplication  no_tsd
deletion      1         2            3
duplication   0         122          12
no_tsd        0         6            8
21. TSD types of only all matching verified insertions between xTea and Mobster
Figure 21: Here the TSD type is only shown for verified insertions whose TSD types match between xTea and Mobster.
Insertions of the outlier patient are included.
22. ME types of only matching insertions between xTea and Mobster except the outlier patient
Figure 22: Here the ME type is only shown for verified insertions whose ME types match between xTea and Mobster.
Insertions of the outlier patient are excluded.
23. Calculated and observed probabilities for MEI affected gene components except for the outlier patient
Figure 23: The calculated and observed probabilities of a verified insertion not from the outlier patient affecting different gene components. The calculated probabilities are based on the proportion of the genomic size of all regions of a single component compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type. Insertions of the outlier patient have been excluded. ‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript. Except for intronic only regions, insertions can be part of multiple other regions.
24. Calculated and observed probabilities for gene components for the outlier patient
Figure 24: The calculated and observed probabilities of a verified insertion from the outlier patient affecting different gene components. The calculated probabilities are based on the proportion of the genomic size of all regions of a single
component compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type. ‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript. Except for intronic only regions, insertions can be part of multiple other regions.
25. Testing the calculated and observed probabilities for gene components except for the outlier patient
Table 25: Binomial tests between the calculated and observed probabilities. The calculated probabilities are based on the proportion of the genomic size of all components of a single type compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type. Insertions of the outlier patient have been excluded. ‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript.
Except for intronic only regions, insertions can be part of multiple other regions.
Genomic component        Calculated probability  Observed probability  p-value   Corrected p-value  Significant
coding_genic             0.422138                0.333333              0.360538  2.163228           False
noncoding_genic          0.188536                0.166667              1.000000  6.000000           False
coding_exonic            0.035025                0.033333              1.000000  6.000000           False
noncoding_exonic         0.016349                0.066667              0.086051  0.516308           False
coding_only_intronic     0.387113                0.300000              0.355968  2.135810           False
noncoding_only_intronic  0.172162                0.100000              0.465498  2.792987           False
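The test and correction in this table can be illustrated with the Python sketch below. It is not the analysis code used in this study. The two-sided definition (summing all outcomes that are at most as likely as the observed count) is one common choice and is assumed here. The function names and the example counts (10 of 30 insertions) are hypothetical. The corrected p-values in the table are consistent with multiplying each p-value by the six components tested, without capping at 1.

```python
from math import comb

def binomial_two_sided(k, n, p):
    """Exact two-sided binomial p-value for k successes in n trials with
    expected probability p: sum of all outcome probabilities that do not
    exceed the probability of the observed outcome."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so that ties are not lost to rounding.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-7)))

def corrected_p(p_value, n_tests):
    """Multiply by the number of tests; not capped at 1, as in the table."""
    return p_value * n_tests

# Hypothetical counts: 10 of 30 insertions in a component with expected
# probability 0.422138.
p_value = binomial_two_sided(10, 30, 0.422138)
print(p_value, corrected_p(p_value, 6))
```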
26. Testing the calculated and observed probabilities for gene components for the outlier patient
Table 26: Binomial tests between the calculated and observed probabilities for the outlier patient. The calculated probabilities are based on the proportion of the genomic size of all components of a single type compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type.
‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript. Except for intronic only regions, insertions can be part of multiple other regions.
Genomic component        Calculated probability  Observed probability  p-value   Corrected p-value  Significant
coding_genic             0.422138                0.441558              0.625515  3.753092           False
noncoding_genic          0.188536                0.214286              0.410110  2.460657           False
coding_exonic            0.035025                0.032468              1.000000  6.000000           False
noncoding_exonic         0.016349                0.019481              0.742962  4.457771           False
coding_only_intronic     0.387113                0.409091              0.619780  3.718678           False
noncoding_only_intronic  0.172162                0.194805              0.454933  2.729599           False
27. Pan-cancer associations of the occurrence of SNVs and SVs in cancer genes and the occurrence of at least one MEI
Table 27: Fisher’s exact tests for associations, across all cancer types, between the occurrence of at least one verified insertion and the occurrence of SNVs and SVs in cancer-related genes.
Gene     p-value   Corrected p-value  Significant
RB1      0.003430  1.070254           False
CNTNAP2  0.009967  3.109685           False
ZFHX3    0.014416  4.497781           False
BRD4     0.014416  4.497781           False
TP53     0.041089  12.819722          False
RUNX1    0.047990  14.972875          False
MAPK1    0.061668  19.240506          False
NCOR1    0.061668  19.240506          False
NEGR1    0.061668  19.240506          False
FHIT     0.061668  19.240506          False
CARD11   0.061668  19.240506          False
GPHN     0.061668  19.240506          False
ZMYM3    0.061668  19.240506          False
GID4     0.061668  19.240506          False
NF1      0.099918  31.174355          False
AGBL4    0.156173  48.725958          False
ROBO1    0.156173  48.725958          False
EYS      0.156173  48.725958          False
BCORL1   0.156173  48.725958          False
28. Per cancer associations of the occurrence of SNVs and SVs in cancer genes and the occurrence of at least one MEI
Table 28: Fisher’s exact tests for associations, for each cancer type, between the occurrence of at least one verified insertion and the occurrence of SNVs and SVs in cancer-related genes.
Cancer type                 Gene   p-value   Corrected p-value  Significant
embryonal_rhabdomyosarcoma  NCOR1  0.142857  10.714286          False
osteosarcoma                LRP1B  0.151515  16.818182          False
osteosarcoma                RB1    0.181818  20.181818          False
nephroblastoma              FGFR3  0.181818  9.636364           False
nephroblastoma              MTOR   0.181818  9.636364           False
nephroblastoma              PHF6   0.181818  9.636364           False
nephroblastoma              BRD4   0.181818  9.636364           False