Academic year: 2023


Somatic mobile element detection in pediatric cancer:

Supplementary materials

Student: Ramon van Amerongen (6132375)
First examiner: Jayne Hehir-Kwa
Second examiner: Josephine Daub

University: Utrecht University
Faculty: Faculty of Science

Master: Bioinformatics and Biocomplexity


Contents

Methods

1. Benchmarking: HG002 sample remapping command
2. Benchmarking: Simulated ME characteristics
3. Description of Mobster alterations
3. Jitterbug modifications
4. Overlap algorithm

Figures and tables

1. Distribution of germline mobile element insertions present in HG002
2. TSD types of successfully simulated mobile element insertions
3. TSD length distributions of successfully simulated mobile element insertions
4. Mean squared error of the predicted VAF for both benchmark samples
5. Flow chart of overlap algorithm
6. Shared predicted simulated somatic insertions for the old version of Mobster and somatic mode of xTea
7. Cumulative detection per tool for somatic benchmark (Mobster, xTea, Jitterbug)
8. Recall for increasing variant allele fractions of simulated somatic insertions
9. ME detection accuracies of simulated somatic insertions for different ME types
10. Detection accuracies for the old version of Mobster and somatic mode of xTea
11. The predicted and actual VAFs of the simulated somatic insertions (Mobster, xTea, both)
12. Confusion matrices for somatic ME type mismatches compared to the benchmark (Mobster, Jitterbug)
13. Confusion matrices for somatic TSD type mismatches compared to the benchmark (Mobster, xTea)
14. Testing for significant differences in MEI counts between tools and cancer types
15. Unverified overlapping MEI counts per cancer type
16. Testing for significant differences in verified MEI counts between cancer types
17. TSD length of all verified insertions
18. ME mismatches except the outlier patient
19. ME types of all tools except the outlier patient
20. TSD mismatches for all insertions
21. TSD types of only matching insertions between xTea and Mobster including the outlier patient
22. ME types of only matching insertions between xTea and Mobster except the outlier patient
23. Calculated and observed probabilities for gene components except the outlier patient
24. Calculated and observed probabilities for gene components for the outlier patient
25. Testing the calculated and observed probabilities for gene components except the outlier patient
26. Testing the calculated and observed probabilities for gene components for the outlier patient
27. Pan-cancer associations of the occurrence of SNVs and SVs and the occurrence of at least one MEI
28. Per cancer associations of the occurrence of SNVs and SVs and the occurrence of at least one MEI


Methods

1. Benchmarking: HG002 sample remapping command

The following command was used to remap the reads from the HG002 Novoalign aligned BAM file to a BWA aligned BAM file:

java -Xmx4G -jar <Picard 2.20.1> \
    SamToFastq \
    INPUT=${input_bam} \
    FASTQ=/dev/stdout \
    INTERLEAVE=true \
    NON_PF=true | \
<bwa 0.7.13> mem -K 100000000 -p -v 3 -t 40 -Y <hg38 genome reference> /dev/stdin - | \
samtools view -1 - > <output>.bam

2. Benchmarking: Simulated ME characteristics.

The custom script (‘generate_te_insertions.py’) was run with the following settings to simulate ME insertions:

• All insertions were limited to chromosome 1.

• Insertions were placed at least 100 bp apart.

• 161 insertions were generated for every transposon type (LINE1, Alu or SVA) and allele fraction from 0.1 to 0.9 in steps of 0.1.

• The inserted sequences were taken from Mobster’s consensus ME sequence library, available at: https://github.com/jyhehir/mobster/blob/master/resources/mobiome/hg19_54_active_mobile_elements.fasta

• The sequences were mutated at a random rate between 0 and 5%.

• The probability of a target site duplication was set to 0.9, with a length drawn from a normal distribution peaking at 13 bp and cut off at 1 and 30 bp.

• The probability of a target site deletion was set to 0.05, with a length drawn from a normal distribution peaking at 7 bp and cut off at 1 and 20 bp.

• The probability of adding an additional polyA stretch was set to 0.25, with a length drawn from a uniform distribution between 1 and 50 bp.

• Positions were chosen randomly outside the blacklisted regions in the file hg38-blacklist.v2.bed.
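The target site and polyA settings listed above can be sketched in Python as follows. This is an illustration only, not the code of ‘generate_te_insertions.py’: the function names are hypothetical, the standard deviations are placeholders (the settings give only the peak and the cut-offs), the cut-offs are implemented by redrawing, and duplication and deletion are assumed to be mutually exclusive.

```python
import random


def truncated_normal_int(rng, mean, sd, low, high):
    """Draw an integer from a normal distribution, redrawing until it lies in [low, high]."""
    while True:
        value = round(rng.gauss(mean, sd))
        if low <= value <= high:
            return value


def sample_target_site_event(rng, sd_dup=5.0, sd_del=3.0):
    """Return (event_type, length_bp) for one simulated insertion.

    sd_dup and sd_del are hypothetical; the source does not state them.
    """
    draw = rng.random()
    if draw < 0.90:  # target site duplication, peak at 13 bp, cut-offs 1 and 30 bp
        return "duplication", truncated_normal_int(rng, 13, sd_dup, 1, 30)
    if draw < 0.95:  # target site deletion, peak at 7 bp, cut-offs 1 and 20 bp
        return "deletion", truncated_normal_int(rng, 7, sd_del, 1, 20)
    return "no_tsd", 0


def sample_polya_length(rng):
    """With probability 0.25 return an extra polyA length of 1 to 50 bp, otherwise 0."""
    return rng.randint(1, 50) if rng.random() < 0.25 else 0
```

Passing a seeded `random.Random` instance makes a simulated set reproducible.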

3. Description of Mobster alterations

We added two ways of removing germline insertions from tumor insertions to Mobster. (1) When a previous output file is provided, any insertion is removed if it lies within a configurable number of base pairs of an insertion in that file (default: 90 bp). (2) Alternatively, the BAM file of a normal sample can be used, as was done in our study. In that case, discordant and split reads from the tumor and normal samples are used to predict insertions concurrently, and the number of supporting reads that each sample contributes to the predicted MEI event is recorded. Each insertion is then assigned to one of three types: a high-confidence somatic insertion, which has no supporting reads in the normal (germline) sample; a germline insertion, which has at least a configurable number of supporting reads in the normal sample (default: 3); and a normal artifact, which has supporting reads in the normal sample but fewer than this threshold. The last type can arise, for example, from tumor contamination of the normal sample. In addition, the insert size metrics are now averaged over all samples, whereas the previous version of Mobster used only the first BAM file.
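The two filtering routes described above can be sketched as follows. This is a minimal illustration in Python rather than Mobster's actual implementation; the function names, the label strings and the `(chromosome, position)` representation of previous calls are assumptions made for the example.

```python
def near_previous_call(chrom, pos, previous_calls, window=90):
    """Route 1: True if the insertion lies within `window` bp of a call
    from a previous output file. `previous_calls` is a list of
    (chromosome, position) tuples (hypothetical representation)."""
    return any(c == chrom and abs(pos - p) <= window for c, p in previous_calls)


def classify_insertion(normal_support, germline_min_support=3):
    """Route 2: label an insertion by its supporting reads in the normal sample."""
    if normal_support == 0:
        return "somatic"          # high-confidence somatic insertion
    if normal_support >= germline_min_support:
        return "germline"
    return "normal_artifact"      # some normal support, but below the threshold
```

For example, with the default threshold an insertion with two supporting reads in the normal sample would be flagged as a normal artifact rather than called germline.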

The breakpoint and confidence window predictions were also improved. Based on clipped read positions, an ‘end’ breakpoint was added, which is used alongside the ‘start’ breakpoint during clustering and filtering. Prediction of the target site duplication was also extended to the case in which discordant read clusters overlap in the absence of split read clusters. Previously, a target site duplication could only be predicted if both split clusters were present.

In addition, prediction of the variant allele fraction (VAF) was added to Mobster. The prediction is made in the region spanning from 200 bp to the left to 200 bp to the right of the breakpoints. Within this region, the maximum ratio of the combined split and discordant read depth to the total depth is taken as a proxy for the VAF. The 200 bp window was chosen because it yields a low mean squared error (MSE) of the VAF in the germline and somatic benchmark samples (supplementary figure 4).
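As a rough illustration of this VAF proxy, the sketch below takes per-position depths and returns the largest supporting-to-total ratio inside the window. It is written in Python for readability and is not Mobster's code; the dictionary inputs, the function name and the choice to evaluate the ratio per position are assumptions.

```python
def estimate_vaf(support_depth, total_depth, start, end, window=200):
    """Maximum ratio of supporting (split + discordant) depth to total depth
    in the region [start - window, end + window].

    support_depth and total_depth map a genomic position to a read depth
    (hypothetical input format).
    """
    best = 0.0
    for pos in range(start - window, end + window + 1):
        total = total_depth.get(pos, 0)
        if total > 0:
            best = max(best, support_depth.get(pos, 0) / total)
    return best
```

Positions without any coverage are skipped, so an uncovered region yields an estimate of 0.0 rather than a division error.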

3. Jitterbug modifications

The following modifications were made to Jitterbug to make it usable for our study.

The ME mappings are saved to the database as a string instead of as an object, which ended up empty. In addition, the Venn diagram visualization was disabled so that Jitterbug could run on the HPC. Input file detection was changed to allow BAM index files with either the ‘.bam.bai’ or the ‘.bai’ extension. The gff3 output also needed to be converted to VCF. For this, we developed a conversion script that takes the soft-clipped positions as the start (POS) and end (END tag) breakpoints. If these are not available, the middle of the prediction borders (the left border of CIPOS and the right border of CIPEND) is used as both the start and the end breakpoint.
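The breakpoint rule of the conversion step can be sketched as below. This is not the conversion script itself: the function name and arguments are hypothetical, and rounding the midpoint down to an integer is an assumption.

```python
def vcf_breakpoints(softclip_start, softclip_end, left_border, right_border):
    """Return (start, end) breakpoints for one Jitterbug prediction.

    Soft-clipped positions are used when both are available; otherwise the
    middle of the prediction borders serves as both breakpoints.
    """
    if softclip_start is not None and softclip_end is not None:
        return softclip_start, softclip_end
    middle = (left_border + right_border) // 2  # assumption: integer midpoint
    return middle, middle
```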

4. Overlap algorithm

When insertions from multiple prediction sets are overlapped, their breakpoints must lie within 100 bp of each other to be grouped together; their ME types are allowed to differ. Each insertion in a set cannot overlap with more than one insertion in another set. When multiple groups are possible (supplementary figure 5: A, middle), only the one with the lowest score is kept (supplementary figure 5: A, right). The score is determined by first calculating, for each pair of insertions within a group, the distance between the start breakpoints and the distance between the end breakpoints. The absolute start and end differences are then summed for each pair. Finally, the resulting values are summed over all pairs to obtain the score. When groups tie on this score, first the maximum and then the minimum of the start and end differences is used instead of the sum. All remaining insertions that are not yet in a group are then grouped iteratively until no new groups are formed (again supplementary figure 5: A, right). When more than two prediction sets are provided, the insertions are first grouped over all prediction sets (supplementary figure 5: A, three prediction sets). Any remaining insertions that do not overlap in all prediction sets (the non-grayed out MEs in the figure) are subsequently required to overlap in only the number of prediction sets minus one (supplementary figure 5: B, two prediction sets). This continues until only pairs of overlapping insertions between two prediction sets are formed.
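The scoring step can be illustrated with a short Python sketch. It covers only the score and its tie-breaking, not the iterative grouping. Insertions are represented as `(start, end)` tuples, all names are hypothetical, and two points are interpretations rather than statements from the text: both breakpoints are required to lie within 100 bp, and the tie-breakers are summed over pairs in the same way as the main score.

```python
from itertools import combinations


def pair_differences(a, b):
    """Absolute start and end breakpoint differences between two insertions."""
    return abs(a[0] - b[0]), abs(a[1] - b[1])


def can_group(a, b, max_distance=100):
    """Assumption: both breakpoints must lie within max_distance bp."""
    d_start, d_end = pair_differences(a, b)
    return d_start <= max_distance and d_end <= max_distance


def group_key(group):
    """Sort key for a candidate group: (score, tie-breaker 1, tie-breaker 2)."""
    diffs = [pair_differences(a, b) for a, b in combinations(group, 2)]
    score = sum(d_start + d_end for d_start, d_end in diffs)
    by_max = sum(max(d_start, d_end) for d_start, d_end in diffs)
    by_min = sum(min(d_start, d_end) for d_start, d_end in diffs)
    return score, by_max, by_min


def best_group(candidates):
    """Keep the candidate group with the lowest score."""
    return min(candidates, key=group_key)
```

For a triplet with breakpoints (100, 110), (105, 112) and (98, 110), the pairwise sums are 7, 2 and 9, giving a score of 18.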


Figures and tables

1. Distribution of germline mobile element insertions present in HG002

Figure 1: The mobile element insertions that are present on chromosome 1 of the HG002 sample of the Genome in a Bottle consortium.

2. TSD types of successfully simulated mobile element insertions

Figure 2: The number and (TSD) type of event occurring at the target site of the simulated insertions in sample HG002. In other figures, ‘no_tsd’ means that either no duplication or deletion occurred or that the TSD type was unknown. As the TSD type is known for the simulated insertions, ‘no_tsd’ here means that no duplication or deletion occurred at all.


3. TSD length distributions of successfully simulated mobile element insertions

Figure 3: The lengths of target site duplications and deletions of successfully simulated insertions are displayed here.

4. Mean squared error of the predicted VAF for both benchmark samples

Figure 4: The mean squared error between the actual VAF and the VAF predicted by Mobster for the germline and somatic insertions, shown for different prediction windows. This was used to choose the optimal prediction window: 200 bp.


5. Flow chart of overlap algorithm

Figure 5: (A) When overlapping insertions in three prediction sets all possible triplets are determined first. Then, the lowest scoring groups are determined by iteratively looping over all triplets until no new groups can be formed. No insertion can be part of multiple groups. (B) Any insertions that were not part of any triplets (non-grayed out MEs) are then again grouped in pairs and the lowest scoring pairs are kept.

6. Shared predicted simulated somatic insertions for the old version of Mobster and somatic mode of xTea

Figure 6: The number of shared insertions between the tools for the somatic benchmark dataset. Here, however, the methods differ for two tools: the old version of Mobster and the somatic mode of xTea were used instead of the new version of Mobster and the tumor mode of xTea.


7. Cumulative detection per tool for somatic benchmark

Mobster:

Figure 7a: This figure shows the fraction of simulated somatic insertions that are detected for increasing VAF for only the new version of Mobster.

xTea:

Figure 7b: This figure shows the fraction of simulated somatic insertions that are detected for increasing VAF for only the tumor mode of xTea.


Jitterbug:

Figure 7c: This figure shows the fraction of simulated somatic insertions that are detected for increasing VAF for only Jitterbug.

8. Recall for increasing variant allele fractions of simulated somatic insertions

Figure 8: The recall of the simulated somatic insertions at different VAFs, shown for the newest version of Mobster, the tumor mode of xTea and Jitterbug.

9. ME detection accuracies of simulated somatic insertions for different ME types


Table 9: Here the number of true and false calls and the accuracy scores for detecting the somatic simulated insertions are shown separately for different ME types and tools (the newest version of Mobster, tumor mode of xTea and Jitterbug).

Tools                   Overlap method    ME      TP   FP   FN  Recall    Precision  F1
Mobster                 single tool       ALU    378   16  489  0.435986  0.959391   0.599524
Mobster                 single tool       LINE1  528   30  355  0.597961  0.946237   0.732824
Mobster                 single tool       SVA    436   23  448  0.493213  0.949891   0.649293
xTea                    single tool       ALU    370   20  497  0.426759  0.948718   0.588703
xTea                    single tool       LINE1  309    5  574  0.349943  0.984076   0.516291
xTea                    single tool       SVA    392    9  492  0.443439  0.977556   0.610117
Jitterbug               single tool       ALU    124  335  743  0.143022  0.270153   0.187029
Jitterbug               single tool       LINE1  237   90  646  0.268403  0.724771   0.391736
Jitterbug               single tool       SVA    166   16  718  0.187783  0.912088   0.311445
Mobster+xTea            intersection      ALU    325    0  542  0.374856  1.000000   0.545302
Mobster+xTea            intersection      LINE1  267    0  616  0.302378  1.000000   0.464348
Mobster+xTea            intersection      SVA    387    0  497  0.437783  1.000000   0.608969
Mobster+xTea            union             ALU    423   36  444  0.487889  0.921569   0.638009
Mobster+xTea            union             LINE1  570   35  313  0.645527  0.942149   0.766129
Mobster+xTea            union             SVA    441   32  443  0.498869  0.932347   0.649963
Mobster+Jitterbug       intersection      ALU    114    3  753  0.131488  0.974359   0.231707
Mobster+Jitterbug       intersection      LINE1  220    0  663  0.249151  1.000000   0.398912
Mobster+Jitterbug       intersection      SVA    160    0  724  0.180995  1.000000   0.306513
Mobster+Jitterbug       union             ALU    388  348  479  0.447520  0.527174   0.484092
Mobster+Jitterbug       union             LINE1  545  120  338  0.617214  0.819549   0.704134
Mobster+Jitterbug       union             SVA    442   39  442  0.500000  0.918919   0.647619
xTea+Jitterbug          intersection      ALU    113    0  754  0.130334  1.000000   0.230612
xTea+Jitterbug          intersection      LINE1  119    0  764  0.134768  1.000000   0.237525
xTea+Jitterbug          intersection      SVA    148    1  736  0.167421  0.993289   0.286544
xTea+Jitterbug          union             ALU    381  354  486  0.439446  0.518367   0.475655
xTea+Jitterbug          union             LINE1  427   95  456  0.483579  0.818008   0.607829
xTea+Jitterbug          union             SVA    410   25  474  0.463801  0.942529   0.621683
Mobster+xTea+Jitterbug  intersection      ALU    107    0  760  0.123414  1.000000   0.219713
Mobster+xTea+Jitterbug  intersection      LINE1  114    0  769  0.129105  1.000000   0.228686
Mobster+xTea+Jitterbug  intersection      SVA    147    0  737  0.166290  1.000000   0.285160
Mobster+xTea+Jitterbug  union             ALU    427  367  440  0.492503  0.537783   0.514148
Mobster+xTea+Jitterbug  union             LINE1  582  125  301  0.659117  0.823197   0.732075
Mobster+xTea+Jitterbug  union             SVA    446   48  438  0.504525  0.902834   0.647315
Mobster+xTea+Jitterbug  two out of three  ALU    338    3  529  0.389850  0.991202   0.559603
Mobster+xTea+Jitterbug  two out of three  LINE1  378    0  505  0.428086  1.000000   0.599524
Mobster+xTea+Jitterbug  two out of three  SVA    401    1  483  0.453620  0.997512   0.623639
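The recall, precision and F1 columns follow from the TP, FP and FN counts by the standard definitions. The Python sketch below shows the arithmetic; the function name is hypothetical and returning 0.0 for an empty denominator is a convention chosen for the example.

```python
def detection_metrics(tp, fp, fn):
    """Return (recall, precision, f1) from true positive, false positive
    and false negative counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```

Applied to the first row of the table (Mobster, single tool, ALU: TP 378, FP 16, FN 489), this gives the listed values after rounding to six decimals.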

10. Detection accuracies for the old version of Mobster and somatic mode of xTea

Table 10: Here the number of true and false calls and the accuracy scores for detecting the somatic simulated insertions are shown for the old version of Mobster, somatic mode of xTea and Jitterbug.

Tools                   Comparison          TP    FP    FN  Recall    Precision  F1
Jitterbug               none               527   441  2107  0.200076  0.544421   0.292615
Mobster                 none              1350   130  1284  0.512528  0.912162   0.656296
Mobster+Jitterbug       intersection       498    16  2136  0.189066  0.968872   0.316391
Mobster+Jitterbug       union             1379   555  1255  0.523538  0.713030   0.603765
Mobster+xTea            intersection       964     0  1670  0.365983  1.000000   0.535853
Mobster+xTea            union             1431   130  1203  0.543280  0.916720   0.682241
Mobster+xTea+Jitterbug  intersection       367     0  2267  0.139332  1.000000   0.244585
Mobster+xTea+Jitterbug  two out of three  1103    16  1531  0.418755  0.985702   0.587796
Mobster+xTea+Jitterbug  union             1452   555  1182  0.551253  0.723468   0.625727
xTea                    none              1045     0  1589  0.396735  1.000000   0.568089
xTea+Jitterbug          intersection       375     0  2259  0.142369  1.000000   0.249252
xTea+Jitterbug          union             1197   441  1437  0.454442  0.730769   0.560393

11. The predicted and actual VAFs of the simulated somatic insertions

Mobster:

Figure 11a: The actual and predicted VAFs of the simulated somatic insertions by the newest version of Mobster.


xTea:

Figure 11b: The actual and predicted VAFs of the simulated somatic insertions by the tumor mode of xTea.

Both:

Figure 11c: Scatterplot of the predicted VAFs of both the newest version of Mobster and the tumor mode of xTea

12. ME type predictions of the simulated somatic insertions

Mobster:

Table 12a: Confusion matrix of the ME type predictions made by the newest version of Mobster. “Pred.” refers to the predicted ME types by Mobster in the first column, while “Act” refers to the actual ME types in the first row.

Pred./Act.  ALU  LINE1  SVA
ALU         375      0    0
LINE1         1    526    0
SVA           2      2  436


Jitterbug:

Table 12b: Confusion matrix of the ME type predictions made by Jitterbug. “Pred.” refers to the ME types predicted by Jitterbug in the first column, while “Act.” refers to the actual ME types in the first row.

Pred./Act.  ALU  LINE1  SVA
ALU         121      2    1
LINE1         4    235   13
SVA           0      2  153

13. TSD type predictions of the simulated somatic insertions

Mobster:

Table 13a: Confusion matrix of the TSD type predictions made by the newest version of Mobster. “Pred.” refers to the TSD types predicted by Mobster in the first column, while “Act.” refers to the actual TSD types in the first row. Mobster distinguishes between insertions whose TSD type is ‘unknown’ and insertions that did not have a duplication or deletion (‘no_tsd’).

Pred./Act.   deletion  duplication  no_tsd
deletion           57            2       3
duplication         0         1052      28
no_tsd              6            0      26
unknown            15          143      10

xTea:

Table 13b: Confusion matrix of the TSD type predictions made by the tumor mode of xTea. “Pred.” refers to the TSD types predicted by xTea in the first column, while “Act.” refers to the actual TSD types in the first row. When xTea predicts ‘no_tsd’, this can refer either to insertions whose TSD type is unknown or to insertions that have neither a duplication nor a deletion.

Pred./Act.   deletion  duplication  no_tsd
deletion           54           16       6
duplication         3         1052      34
no_tsd              3           15       1

14. Testing for significant differences in somatic MEI counts between tools and cancer types

Table 14: The statistical comparisons that were performed between the cancer types and tools in terms of the number of MEIs found. Comparisons were made for the different groups in ‘Groups’ that all belonged to the ‘Cancer type/Tool’.

Test            Cancer type/Tool            Groups                                             p-value       Adjusted p-value  Significant
wilcoxon        Nephroblastoma              xTea + Mobster                                     5.388791e-07  0.000002          True
wilcoxon        Ewing sarcoma               xTea + Mobster                                     1.907349e-06  0.000008          True
wilcoxon        Embryonal rhabdomyosarcoma  xTea + Mobster                                     6.103516e-05  0.000244          True
kruskal-wallis  xTea                        Nephroblastoma + Embryonal rhabdomyosarcoma + ...  2.334348e-03  0.001953          True
mann-whitney    Mobster                     Ewing sarcoma + Osteosarcoma                       2.185058e-04  0.002622          True
kruskal-wallis  Mobster                     Nephroblastoma + Embryonal rhabdomyosarcoma + ...  2.011519e-04  0.003653          True
wilcoxon        Osteosarcoma                xTea + Mobster                                     9.765625e-04  0.003906          True
mann-whitney    Mobster                     Nephroblastoma + Osteosarcoma                      4.676780e-04  0.005612          True
mann-whitney    xTea                        Nephroblastoma + Osteosarcoma                      1.144097e-03  0.013729          True
mann-whitney    xTea                        Ewing sarcoma + Osteosarcoma                       1.826373e-03  0.021916          True
mann-whitney    Mobster                     Embryonal rhabdomyosarcoma + Ewing sarcoma         9.282848e-03  0.111394          False
mann-whitney    Mobster                     Nephroblastoma + Embryonal rhabdomyosarcoma        5.021677e-02  0.602601          False
mann-whitney    xTea                        Nephroblastoma + Embryonal rhabdomyosarcoma        5.297484e-02  0.635698          False
mann-whitney    Mobster                     Nephroblastoma + Ewing sarcoma                     9.663498e-02  1.159620          False
mann-whitney    xTea                        Embryonal rhabdomyosarcoma + Osteosarcoma          9.675407e-02  1.161049          False
mann-whitney    xTea                        Embryonal rhabdomyosarcoma + Ewing sarcoma         1.379859e-01  1.655830          False
mann-whitney    xTea                        Nephroblastoma + Ewing sarcoma                     4.799072e-01  5.758887          False
mann-whitney    Mobster                     Embryonal rhabdomyosarcoma + Osteosarcoma          5.333465e-01  6.400158          False

15. Unverified overlapping somatic MEI counts per cancer type

Figure 15: The number of overlapping somatic insertions found within 100 bp of each other by xTea and Mobster, per patient and per cancer type. These insertions had not yet been manually verified.


16. Testing for significant differences in verified somatic MEI counts between cancer types

Table 16: The statistical comparisons that were performed between the cancer types in terms of the number of overlapping somatic MEIs found. Comparisons were made for the different groups in ‘Groups’.

Test            Groups                                             p-value   Adjusted p-value  Significant
kruskal-wallis  Nephroblastoma + Embryonal rhabdomyosarcoma + ...  0.000260  0.000260          True
mann-whitney    Nephroblastoma + Embryonal rhabdomyosarcoma        0.108051  0.648306          False
mann-whitney    Nephroblastoma + Ewing sarcoma                     0.017143  0.102861          False
mann-whitney    Nephroblastoma + Osteosarcoma                      0.013615  0.081689          False
mann-whitney    Embryonal rhabdomyosarcoma + Ewing sarcoma         0.048071  0.288426          False
mann-whitney    Embryonal rhabdomyosarcoma + Osteosarcoma          0.486900  2.921402          False
mann-whitney    Ewing sarcoma + Osteosarcoma                       0.001323  0.007938          True
chi-square      Nephroblastoma + Embryonal rhabdomyosarcoma + ...  0.000139  0.000139          True
fisher          Nephroblastoma + Embryonal rhabdomyosarcoma        0.152447  0.914682          False
fisher          Nephroblastoma + Ewing sarcoma                     0.072219  0.433312          False
fisher          Nephroblastoma + Osteosarcoma                      0.003937  0.023622          True
fisher          Embryonal rhabdomyosarcoma + Ewing sarcoma         0.003083  0.018501          True
fisher          Embryonal rhabdomyosarcoma + Osteosarcoma          0.226195  1.357168          False
fisher          Ewing sarcoma + Osteosarcoma                       0.000059  0.000354          True
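The correction method is not named here. For the six pairwise Mann-Whitney comparisons, however, the adjusted values equal the raw p-value multiplied by six, which is what a Bonferroni correction without capping at 1 would produce. The Python sketch below shows that arithmetic as an inference from the numbers only; the function name and the significance level of 0.05 are assumptions.

```python
def bonferroni_uncapped(p_values, alpha=0.05):
    """Multiply each p-value by the number of comparisons (no cap at 1)
    and flag values below alpha. alpha = 0.05 is an assumed threshold."""
    n = len(p_values)
    adjusted = [p * n for p in p_values]
    significant = [value < alpha for value in adjusted]
    return adjusted, significant
```

Because the values are not capped, adjusted p-values above 1 (such as 2.921402) can occur; they should be read as "not significant" rather than as probabilities.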

17. TSD length of all verified insertions


Figure 17: The distribution of the length of the target site duplications and deletions of the verified somatic insertions.

18. ME mismatches for the verified insertions except the outlier patient

Table 18: Confusion matrix of the ME type predictions made for the verified insertions by Mobster and xTea. Insertions of the outlier patient were excluded.

xTea/Mobster  ALU  LINE1  SVA
ALU             1      4    5
LINE1           1     19    0

19. ME types of the verified insertions except the outlier patient

Figure 19: The number of insertions with a certain ME type is shown separately for xTea and Mobster. This does not show which insertions have mismatching ME types. The insertions of the outlier patient were excluded.

20. TSD mismatches for all verified insertions

Table 20: Confusion matrix of the TSD type predictions made by Mobster and xTea for the verified insertions. Mobster’s ‘unknown’ was first converted to ‘no_tsd’, as xTea does not distinguish between insertions for which the TSD type is unknown and insertions that did not have a duplication or deletion (‘no_tsd’). The insertions of the outlier patient are included.

xTea/Mobster  deletion  duplication  no_tsd
deletion             1            2       3
duplication          0          122      12
no_tsd               0            6       8


21. TSD types of only all matching verified insertions between xTea and Mobster

Figure 21: Here the TSD type is only shown for verified insertions whose TSD types match between xTea and Mobster. Insertions of the outlier patient are included.

22. ME types of only matching insertions between xTea and Mobster except the outlier patient

Figure 22: Here the ME type is only shown for verified insertions whose ME types match between xTea and Mobster. Insertions of the outlier patient are excluded.


23. Calculated and observed probabilities for MEI affected gene components except for the outlier patient

Figure 23: The calculated and observed probabilities of a verified insertion not from the outlier patient affecting different gene components. The calculated probabilities are based on the proportion of the genomic size of all regions of a single component compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type. Insertions of the outlier patient have been excluded. ‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript. Except for intronic only regions, insertions can be part of multiple other regions.
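The two probabilities described in this caption can be sketched as follows. This is an illustration in Python, not the analysis code: the names are hypothetical, regions are half-open `(start, end)` intervals on a single sequence, and counting overlapping regions only once is an assumption.

```python
def merged_length(intervals):
    """Total number of bases covered by half-open (start, end) intervals,
    counting overlapping stretches once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def calculated_probability(component_intervals, genome_size):
    """Share of the genome covered by all regions of one component type."""
    return merged_length(component_intervals) / genome_size


def observed_probability(n_in_component, n_insertions):
    """Fraction of verified insertions that fall within the component type."""
    return n_in_component / n_insertions
```

Since an insertion can fall in several component types at once, the observed fractions across components do not need to sum to one.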

24. Calculated and observed probabilities for gene components for the outlier patient

Figure 24: The calculated and observed probabilities of a verified insertion from the outlier patient affecting different gene components. The calculated probabilities are based on the proportion of the genomic size of all regions of a single component compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type. ‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript. Except for intronic only regions, insertions can be part of multiple other regions.

25. Testing the calculated and observed probabilities for gene components except for the outlier patient

Table 25: Binomial tests between the calculated and observed probabilities. The calculated probabilities are based on the proportion of the genomic size of all components of a single type compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type. Insertions of the outlier patient have been excluded. ‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript. Except for intronic only regions, insertions can be part of multiple other regions.

Genomic component        Calculated probability  Observed probability  p-value   Corrected p-value  Significant
coding_genic             0.422138                0.333333              0.360538  2.163228           False
noncoding_genic          0.188536                0.166667              1.000000  6.000000           False
coding_exonic            0.035025                0.033333              1.000000  6.000000           False
noncoding_exonic         0.016349                0.066667              0.086051  0.516308           False
coding_only_intronic     0.387113                0.300000              0.355968  2.135810           False
noncoding_only_intronic  0.172162                0.100000              0.465498  2.792987           False

26. Testing the calculated and observed probabilities for gene components for the outlier patient

Table 26: Binomial tests between the calculated and observed probabilities for the outlier patient. The calculated probabilities are based on the proportion of the genomic size of all components of a single type compared to the total genome size. The observed probability is the fraction of verified insertions that fall within the indicated component type. ‘only_intronic’ refers to an insertion that did not occur in exonic regions of any (alternative) transcript. Except for intronic only regions, insertions can be part of multiple other regions.

Genomic component        Calculated probability  Observed probability  p-value   Corrected p-value  Significant
coding_genic             0.422138                0.441558              0.625515  3.753092           False
noncoding_genic          0.188536                0.214286              0.410110  2.460657           False
coding_exonic            0.035025                0.032468              1.000000  6.000000           False
noncoding_exonic         0.016349                0.019481              0.742962  4.457771           False
coding_only_intronic     0.387113                0.409091              0.619780  3.718678           False
noncoding_only_intronic  0.172162                0.194805              0.454933  2.729599           False

27. Pan-cancer associations of the occurrence of SNVs and SVs in cancer genes and the occurrence of at least one MEI

Table 27: Fisher's exact tests to look for associations across all cancer types between the occurrence of at least one verified insertion and the occurrence of SNVs and SVs in cancer-related genes.

Gene     p-value   Corrected p-value  Significant
RB1      0.003430  1.070254           False
CNTNAP2  0.009967  3.109685           False
ZFHX3    0.014416  4.497781           False
BRD4     0.014416  4.497781           False
TP53     0.041089  12.819722          False
RUNX1    0.047990  14.972875          False
MAPK1    0.061668  19.240506          False
NCOR1    0.061668  19.240506          False
NEGR1    0.061668  19.240506          False
FHIT     0.061668  19.240506          False
CARD11   0.061668  19.240506          False
GPHN     0.061668  19.240506          False
ZMYM3    0.061668  19.240506          False
GID4     0.061668  19.240506          False
NF1      0.099918  31.174355          False
AGBL4    0.156173  48.725958          False
ROBO1    0.156173  48.725958          False
EYS      0.156173  48.725958          False
BCORL1   0.156173  48.725958          False

28. Per cancer associations of the occurrence of SNVs and SVs in cancer genes and the occurrence of at least one MEI

Table 28: Fisher's exact tests to look for associations for each cancer type between the occurrence of at least one verified insertion and the occurrence of SNVs and SVs in cancer-related genes.

Cancer type                 Gene   p-value   Corrected p-value  Significant
embryonal_rhabdomyosarcoma  NCOR1  0.142857  10.714286          False
osteosarcoma                LRP1B  0.151515  16.818182          False
osteosarcoma                RB1    0.181818  20.181818          False
nephroblastoma              FGFR3  0.181818  9.636364           False
nephroblastoma              MTOR   0.181818  9.636364           False
nephroblastoma              PHF6   0.181818  9.636364           False
nephroblastoma              BRD4   0.181818  9.636364           False
