
University of Groningen

Quantifying the transcriptome of a human pathogen

Aprianto, Rieza

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date:

2018

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Aprianto, R. (2018). Quantifying the transcriptome of a human pathogen: Exploring transcriptional adaptation of Streptococcus pneumoniae under infection-relevant conditions. Rijksuniversiteit Groningen.

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).


QUANTIFYING THE TRANSCRIPTOME OF A HUMAN PATHOGEN

Rieza Aprianto

Quantifying the transcriptome of a human pathogen

Exploring transcriptional adaptation of Streptococcus pneumoniae under infection-relevant conditions

The scientific studies presented in the thesis were performed in the Molecular Genetics group of the Groningen Biomolecular Sciences and Biotechnology Institute, Faculty of Science and Engineering, University of Groningen, The Netherlands. The studies were financially supported by the European Research Council (ERC) Starting Grant awarded to Jan-Willem Veening. Printing was supported by the Graduate School of Science and Engineering and the University Library of the University of Groningen.

ISBN: 978-94-034-0759-3

ISBN: 978-94-034-0760-9 (ebook)

Printing: Eikon +

Cover & layout: Lovebird design, www.lovebird-design.com

Quantifying the transcriptome of a human pathogen

PhD thesis

to obtain the degree of PhD at the University of Groningen on the authority of the Rector Magnificus Prof. E. Sterken and in accordance with the decision by the College of Deans.

This thesis will be defended in public on Friday, 13 July 2018 at 14.30 hours

by

Rieza Aprianto

born on 12 April 1986 in Bandung, Indonesia

Supervisors

Prof. J.-W. Veening

Prof. O.P. Kuipers

Assessment Committee

Prof. M. Heinemann

Prof. J.M. van Dijl

Prof. N. van Sorge


O Sapientia, quae ex ore Altissimi prodiisti, attingens a fine usque ad finem,

fortiter suaviterque disponens omnia: veni ad docendum nos viam prudentiae.


CHAPTER 1

Introduction

CHAPTER 2

Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39

CHAPTER 3

High-resolution analysis of the pneumococcal transcriptome under a wide range of infection-relevant conditions

CHAPTER 4

Bright fluorescent Streptococcus pneumoniae for live-cell imaging of host-pathogen interactions

CHAPTER 5

Time-resolved dual RNA-seq reveals extensive rewiring of lung epithelial and pneumococcal transcriptomes during early infection

CHAPTER 6

Discussion and future perspective

CHAPTER 7

Summary

Academic summary

Academische samenvatting

Ringkasan akademik

Acknowledgements

1

Introduction

Streptococcus pneumoniae: the ancient scourge of modern society

Five centuries before the Common Era, Hippocrates of Kos (c. 460–367 B.C.), the Father of Medicine, described the diagnosis of and remedy for pneumonia, an infection of the lower respiratory tract [1]. In his influential corpus, he referred to pneumonia as “those which the ancients named”, indicating that the scourge of pneumonia was known to societies even earlier than the Ancient Greeks. Two and a half millennia later, lower respiratory tract infections (LRTIs) are still very much a part of modern society. A recent report showed that LRTIs are the deadliest communicable disease and the fifth most common cause of death worldwide [2]. In addition, LRTIs are a principal cause of loss of healthy life (measured in disability-adjusted life years, a combination of mortality and morbidity), right behind ischemic heart disease [3]. Pneumonia, an infection of the lung alveoli [4], is the most important form of lower respiratory tract infection.

The most prominent etiologic agent of pneumonia is the Gram-positive opportunistic pathogen Streptococcus pneumoniae. This bacterium is responsible for the majority of LRTI cases and single-handedly makes LRTIs the deadliest infectious disease [2]. Aside from pneumonia, S. pneumoniae causes milder infections, such as otitis media and sinusitis, and other severe and lethal infections, including meningitis and septicemia [5]. These pneumococcal infections are distinguished by high mortality rates in young children, with 59% of pneumococcal meningitis cases and 45% of septicemia cases resulting in death. In particular, pneumococcal-related mortality is higher in African children than in children from other continents [6]. Although developing countries tend to bear the brunt of pneumococcal disease, wealthy developed societies have recently reported a high incidence of pneumococcal infections in the elderly population [7], making the pneumococcus a general health issue for all human populations.

In most cases, the pneumococcus resides in the host nasopharyngeal passage without causing symptoms. In fact, S. pneumoniae is part of the typical microbiota of the upper respiratory tract [8–10]. Pneumococcal carriage begins in the first two years of life [11] and colonization rates depend […] people in a household [13], day-care attendance [14], number of other children [15], bed-sharing and malnutrition [16]. This asymptomatic colonization is a prerequisite for further pneumococcal infections [17]. Because of the impact of the pneumococcus on general health, vaccination programs against it have been introduced. Unfortunately, these programs have had limited success, with vaccine-target strains being replaced by non-vaccine strains capable of causing invasive infections [5].

In the 1940s, the treatment of pneumococcal pneumonia benefitted greatly from the introduction of sulfonamides and penicillin into clinical practice [18]. However, pneumococcal resistance to penicillin and other antimicrobials quickly spread worldwide [19]. Soon afterwards, resistance to more than one antibiotic was reported in S. pneumoniae and, more worryingly, half of invasive pneumococcal cases in the United States were resistant to at least one antibiotic [20]. In addition, pneumococcal resistance to a wide range of clinically relevant antibiotics has been reported around the globe [21,22].

Unlikely help: pneumococcus assisting biological research

Five years after Sternberg [23] and Pasteur [24] independently reported the pathogenic potential of S. pneumoniae, Fraenkel [25] called the bacterium the pneumococcus because of its propensity to cause pneumonia. Later, the bacterium was renamed Diplococcus pneumoniae by the Society of American Bacteriologists [26], referring to its characteristic shape under the microscope, which resembles a pair of cocci. Finally, in 1974, the pneumococcus was reclassified under the genus Streptococcus [27]. Since its discovery, the bacterium has been the subject of seminal breakthroughs, including the discovery of Gram staining [28], the demonstration of the antigenic properties of polysaccharides [29], and the first successful penicillin treatment of a clinical infection [30].

The most influential role of the pneumococcus in biological research was in providing conclusive evidence that DNA carries the genetic code (RNA-based viruses were later discovered to be the exception to this rule). Griffith was the first to show that a phenotype, in this case expression of a capsule, can be transferred from a capsule-producing strain to non-encapsulated S. pneumoniae inside a murine host [31]. The result was then verified, expanded and optimized [32–35]. Avery et al. built on these observations and fine-tuned the method to determine that DNA is the material that mediates the phenotypic transfer between the two pneumococcal strains [36]. We now recognize that, in all but one pneumococcal serotype, all known capsule genes are encoded in a single operon, the cps operon, which is located between two conserved genes, dexB and aliA [37]. This genomic arrangement of the capsule genes mediates easy and efficient transfer between strains and, in turn, facilitates capsule switching among different serotypes. We now understand that by exploiting this strategy, S. pneumoniae may escape vaccine-induced immune pressure [38].

Biological systems, genome-wide approach and RNA-sequencing

Reductionism has been a successful and crucial approach in molecular biology because of its ability to connect a gene or set of genes (genotype) to a measurable trait (phenotype). Reductionist studies clarified that the presence of a pneumococcal capsule and its serotype are determined by the presence and precise genotype of the cps operon. Unfortunately, reductionism fails to explain complicated biological phenomena, including interspecies interaction, immunity and infection. In the last decade, reductionism has been giving way to holism, a perspective that considers the interactions between every component of the system. This approach appreciates the continuous interaction between every component, biotic and abiotic, and their simultaneous modification. Furthermore, while the environment constrains biological components, the living component changes the environment to suit its needs [39], emphasizing the transient and dynamic nature of the system. The complexity of this system causes novel properties to emerge, which, in turn, determine the direction and characteristics of the system [40].

Bacterial infection is a classic example of such a complex biological system. At infection sites, the pathogen multiplies and acquires nutrients to fuel its expansion while simultaneously evading the immune response. On the other hand, the host struggles to remove invading pathogens through innate immune responses and specialized cells. The pathogen and the host interact intimately and alter the environment according to their respective needs. In order to elucidate the emergent properties of such a complex system and to obtain a bird’s-eye view of the phenomenon, we need to measure every parameter exhaustively and comprehensively [41].

In addition, the paradigm shift has been spurred by cutting-edge advances, especially in sequencing technology and in the processing of large biological datasets. Transcriptomics, for example, gives researchers unprecedented access to the genome-wide transcriptome and to the way it changes during a specific phenomenon. Over the last four decades, sequencing technology has been refined into its current high-throughput incarnation. When introduced, sequencing was used to decipher the genetic code (genomics); later, coupled with cDNA generation, it was employed to decipher the transcriptome (transcriptomics) through RNA-seq [42]. Because of improvements in library preparation and sequencing efficiency, more bases can be sequenced in less time, driving down cost and justifying sequencing as a routine protocol [43]. Prior to the widespread use of sequencing-based technology, array-based technologies, such as microarrays and tiling arrays, were the platform of choice for transcriptomics studies. Compared to array-based approaches, RNA-seq has a wider dynamic range, resulting in better detection of transcript boundaries and more powerful differential expression analysis [44].

Furthermore, recent reports dispel the myth of the simplicity of the prokaryotic transcriptome. Rather, the bacterial transcriptome is as complex as the eukaryotic transcriptome [45,46]. In addition, sequencing results have permitted the elucidation of genomic architecture and of regulatory structures of gene expression. For example, bacteria employ a wide range of small RNAs, both cis- and trans-acting, to regulate gene expression [47]. Additionally, multiple start sites permit alternative forms of operons [48], further expanding the genomic potential. As an illustration, the genome-wide examination of the bacterium Helicobacter pylori, which has a small genome (1.7 Mbp), showed ubiquitous transcription start sites, inside and opposite coding sequences, and numerous sRNAs [49], generating an ample genomic repertoire. Unbiased genome-wide surveys have identified novel non-coding features and newly annotated regions [50] and specific regulons [51]. Multiple layers of regulation combined with condition-specific regulatory features [52,53] allow specialized and flexible regulation of gene expression.

Additionally, quantitative transcriptomics facilitates genome-wide differential analysis of gene expression. The precise measurement of transcript abundance has enabled the discovery of the effects of gene depletion [54,55], the elucidation of stress responses [56] and the detailed mapping of the expression of pathogenicity islands [57]. In particular, Westermann et al. proposed, as a thought experiment, an approach to measure host and pathogen transcriptomes simultaneously during infection [58]. Later, the approach uncovered a bacterial sRNA important for intracellular survival [59]. The approach has also been extended to whole infected organs [60], thereby highlighting how individual hosts respond to a pathogen. Finally, the combination of the paradigm shift towards a holistic approach, the availability of (sequencing) technology and the discovery of complex, niche-specific bacterial gene regulation has renewed interest in the exploration of the prokaryotic transcriptome.

Subtlety and complexity of the pneumococcal transcriptome

Pathogenicity islands usually carry the virulence determinants of pathogenic bacteria, including toxins and other virulence factors. Unfortunately, identifying pathogenicity islands in S. pneumoniae has proven to be impossible [61–63]. For example, the capsule-encoding cps operon, a well-described pneumococcal virulence factor, is conserved both in clinical strains of S. pneumoniae and in closely related non-pathogenic streptococcal species [64]. The presence of genes and clusters of genes, it seems, does not determine pneumococcal virulence or pathogenicity. Instead, pneumococcal virulence might be determined by more subtle mutations that allow the pneumococcus to precisely regulate the expression of virulence factors in response to environmental signals [65]. For example, mutations in the untranslated regions preceding (5’-UTR) and following (3’-UTR) virulence genes may modify their expression level and, thus, determine overall pneumococcal virulence. In addition, the multiple small non-coding RNAs that have been reported in S. pneumoniae further enrich the pneumococcal regulatory repertoire [66–69], including the regulation of virulence factors.

Applying cutting-edge sequencing technologies to understand pneumococcal biology

In this dissertation, we applied recent advances in high-throughput sequencing (mostly of RNA) to reveal the detailed organization of genetic features and pneumococcal transcription in infection models (Fig. 1). In Chapter 2, we revisited the basic genomic information of S. pneumoniae strain D39 from the Veening lab: the sequence, the annotation and the operon architecture, effectively making it the best-described pneumococcal strain to date. Strain D39 has been a major workhorse of pneumococcal research [70]. Next, we generated a compendium of the pneumococcal transcriptome by exposing the strain to conditions relevant to the bacterial lifestyle of colonization and infection (Chapter 3). In the same chapter, we generated a simple yet powerful gene network in the form of a co-expression matrix. In Chapter 4, we established, inter alia, an infection model that contains the pneumococcus and a live, confluent layer of human lung epithelial cells. Subsequently, we applied a dual transcriptomics approach to this infection model to simultaneously measure the dynamic transcriptional rewiring of epithelial cells and S. pneumoniae during early infection (Chapter 5). Finally, in Chapter 6, we summarized the findings and placed them in the current scientific context.
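The idea of a co-expression matrix, as mentioned for Chapter 3, can be illustrated with a minimal sketch. It assumes, purely for illustration, that co-expression is scored as the Pearson correlation of each gene pair across conditions; this introduction does not specify the measure actually used, and the gene names and expression values below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Assumes both profiles vary across conditions (non-zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_matrix(expression):
    """Return a gene-by-gene matrix (nested dict) of pairwise correlations."""
    genes = list(expression)
    return {
        g: {h: pearson(expression[g], expression[h]) for h in genes}
        for g in genes
    }


# Hypothetical toy data: three genes measured under four conditions.
toy_expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises together with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
}

matrix = coexpression_matrix(toy_expression)
```

In such a matrix, values near 1 suggest genes that are expressed together across conditions, values near -1 suggest opposite behavior, and the matrix is symmetric by construction.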

References

1. Hippocrates. On Regimen in Acute Diseases. (CreateSpace Independent Publishing Platform, 400BC).
2. Troeger, C. et al. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. Lancet Infect. Dis. 17, 1133–1161 (2017).
3. Kassebaum, N. J. et al. Global, regional, and national disability-adjusted life-years (DALYs) for 315 diseases and injuries and healthy life expectancy (HALE), 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. The Lancet 388, 1603–1658 (2016).
4. Finegold, S. M. & Johnson, C. C. Lower respiratory tract infection. Am. J. Med. 79, 73–77 (1985).
5. Henriques-Normark, B. & Tuomanen, E. I. The pneumococcus: epidemiology, microbiology, and pathogenesis. Cold Spring Harb. Perspect. Med. 3, (2013).
6. O’Brien, K. L. et al. Burden of disease caused by Streptococcus pneumoniae in children younger than 5 years: global estimates. Lancet Lond. Engl. 374, 893–902 (2009).
7. Welte, T., Torres, A. & Nathwani, D. Clinical and economic burden of community-acquired pneumonia among adults in Europe. Thorax 67, 71–79 (2012).
8. Bosch, A. A. T. M. et al. Development of upper respiratory tract microbiota in infancy is affected by mode of delivery. EBioMedicine 9, 336–345 (2016).
9. Bosch, A. A. T. M. et al. Nasopharyngeal carriage of Streptococcus pneumoniae and other bacteria in the 7th year after implementation of the pneumococcal conjugate vaccine in the Netherlands. Vaccine 34, 531–539 (2016).

Fig. 1. Overview of thesis. We annotated the genome of S. pneumoniae D39 and defined genome-wide transcriptional units by precisely mapping start and termination sites (Chapter 2). Next, we elucidated transcriptional responses to a wide array of conditions relevant to the pneumococcal lifestyle (Chapter 3). We then established a host-pathogen infection model (Chapter 4) which we exploited to […]

10. Miller, E., Andrews, N. J., Waight, P. A., Slack, M. P. & George, R. C. Herd immunity and serotype replacement 4 years after seven-valent pneumococcal conjugate vaccination in England and Wales: an observational cohort study. Lancet Infect. Dis. 11, 760–768 (2011).
11. Dagan, R., Melamed, R., Muallem, M., Piglansky, L. & Yagupsky, P. Nasopharyngeal colonization in southern Israel with antibiotic-resistant pneumococci during the first 2 years of life: relation to serotypes likely to be included in pneumococcal conjugate vaccines. J. Infect. Dis. 174, 1352–1355 (1996).
12. Adegbola, R. A. et al. Carriage of Streptococcus pneumoniae and other respiratory bacterial pathogens in low and lower-middle income countries: a systematic review and meta-analysis. PloS One 9, e103293 (2014).
13. Reisman, J. et al. Risk factors for pneumococcal colonization of the nasopharynx in Alaska native adults and children. J. Pediatr. Infect. Dis. Soc. 3, 104–111 (2014).
14. Hjuler, T. et al. Perinatal and crowding-related risk factors for invasive pneumococcal disease in infants and young children: a population-based case-control study. Clin. Infect. Dis. Off. Publ. Infect. Dis. Soc. Am. 44, 1051–1056 (2007).
15. Wyllie, A. L. et al. Molecular surveillance on Streptococcus pneumoniae carriage in non-elderly adults; little evidence for pneumococcal circulation independent from the reservoir in children. Sci. Rep. 6, 34888 (2016).
16. Howie, S. R. C. et al. Childhood pneumonia and crowding, bed-sharing and nutrition: a case-control study from The Gambia. Int. J. Tuberc. Lung Dis. Off. J. Int. Union Tuberc. Lung Dis. 20, 1405–1415 (2016).
17. Bogaert, D., De Groot, R. & Hermans, P. W. M. Streptococcus pneumoniae colonisation: the key to pneumococcal disease. Lancet Infect. Dis. 4, 144–154 (2004).
18. Austrian, R. & Gold, J. Pneumococcal bacteremia with especial reference to bacteremic pneumococcal pneumonia. Ann. Intern. Med. 60, 759–776 (1964).
19. Jacobs, M. R. et al. Emergence of multiply resistant pneumococci. N. Engl. J. Med. 299, 735–740 (1978).
20. Whitney, C. G. et al. Increasing prevalence of multidrug-resistant Streptococcus pneumoniae in the United States. N. Engl. J. Med. 343, 1917–1924 (2000).
21. Imai, S. et al. High prevalence of multidrug-resistant pneumococcal molecular epidemiology network clones among Streptococcus pneumoniae isolates from adult patients with community-acquired pneumonia in Japan. Clin. Microbiol. Infect. Off. Publ. Eur. Soc. Clin. Microbiol. Infect. Dis. 15, 1039–1045 (2009).
22. Riedel, S. et al. Antimicrobial use in Europe and antimicrobial resistance in Streptococcus pneumoniae. Eur. J. Clin. Microbiol. Infect. Dis. Off. Publ. Eur. Soc. Clin. Microbiol. 26, 485–490 (2007).
23. Sternberg, G. M. A fatal form of septicaemia in the rabbit produced by the subcutaneous injection of human saliva: an experimental research. (John Murphy & Company, 1881).
24. Pasteur, L. Sur une maladie nouvelle provoquée par la salive d’un enfant mort de la rage. (1881). Available at: https://fr.wikisource.org/wiki/Page:Pasteur_-_%C5%92uvres_compl%C3%A8tes,_tome_6.djvu/12. (Accessed: 2nd February 2018)
25. Watson, D. A., Musher, D. M., Jacobson, J. W. & Verhoef, J. A brief history of the pneumococcus in biomedical research: a panoply of scientific discovery. Clin. Infect. Dis. Off. Publ. Infect. Dis. Soc. Am. 17, 913–924 (1993).
26. Winslow, C. E. et al. The families and genera of the bacteria: final report of the Committee of the Society of American Bacteriologists on characterization and classification of bacterial types. J. Bacteriol. 5, 191–229 (1920).
27. Holt, J. G. Bergey’s manual of determinative bacteriology (7th ed.). Am. J. Public Health Nations Health 54, 544 (1964).
28. Austrian, R. Pneumococcus: the first one hundred years. Rev. Infect. Dis. 3, 183–189 (1981).
29. Heidelberger, M., Aisenberg, A. C. & Hassid, W. Z. Glycogen, an immunologically specific polysaccharide. J. Exp. Med. 99, 343–353 (1954).
30. Epifano, L. D., Brandstetter, R. D. & Brandstetter, R. D. Historical aspects of pneumonia. in The Pneumonias 1–14 (Springer, New York, NY, 1993). doi:10.1007/978-1-4613-9766-3_1
31. Griffith, F. The significance of pneumococcal types. J. Hyg. (Lond.) 27, 113–159 (1928).
32. Alloway, J. L. The transformation in vitro of R pneumococci into S forms of different specific types by the use of filtered pneumococcus extracts. J. Exp. Med. 55, 91–99 (1932).
33. Alloway, J. L. Further observations on the use of pneumococcus extracts in effecting transformation of type in vitro. J. Exp. Med. 57, 265–278 (1933).
34. Dawson, M. H. The interconvertibility of ‘R’ and ‘S’ forms of pneumococcus. J. Exp. Med. 47, 577–591 (1928).
35. Dawson, M. H. & Sia, R. H. P. In vitro transformation of pneumococcal types. J. Exp. Med. 54, 681–699 (1931).
36. Avery, O. T., MacLeod, C. M. & McCarty, M. Studies on the chemical nature of the substance inducing transformation of pneumococcal types. J. Exp. Med. 79, 137–158 (1944).
37. Wyres, K. L. et al. Pneumococcal capsular switching: a historical perspective. J. Infect. Dis. 207, 439–449 (2013).
38. Geno, K. A. et al. Pneumococcal capsules and their types: past, present, and future. Clin. Microbiol. Rev. 28, 871–899 (2015).
39. Galtier, N. & Dutheil, J. Coevolution within and between genes. Genome Dyn. 3, 1–12 (2007).
40. Mazzocchi, F. Complexity in biology. Exceeding the limits of reductionism and determinism using complexity theory. EMBO Rep. 9, 10–14 (2008).
41. Sorek, R. & Cossart, P. Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat. Rev. Genet. 11, 9–16 (2010).
42. Shendure, J. et al. DNA sequencing at 40: past, present and future. Nature 550, 345–353 (2017).
43. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
44. Agarwal, A. et al. Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays. BMC Genomics 11, 383 (2010).
45. Guell, M. et al. Transcriptome complexity in a genome-reduced bacterium. Science 326, 1268–1271 (2009).
46. Selinger, D. W., Saxena, R. M., Cheung, K. J., Church, G. M. & Rosenow, C. Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript degradation. Genome Res. 13, 216–223 (2003).
47. Croucher, N. J. & Thomson, N. R. Studying bacterial transcriptomes using RNA-seq. Curr. Opin. Microbiol. 13, 619–624 (2010).


48. Qiu, Y. et al. Structural and operational complexity of the Geobacter sulfurreducens genome. Genome Res. 20, 1304–1311 (2010).
49. Sharma, C. M. et al. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 464, 250–255 (2010).
50. Perkins, T. T. et al. A strand-specific RNA-seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS Genet. 5, e1000569 (2009).
51. Sittka, A. et al. Deep sequencing analysis of small noncoding RNA and mRNA targets of the global post-transcriptional regulator, Hfq. PLoS Genet. 4, e1000163 (2008).
52. Cho, B.-K. et al. The transcription unit architecture of the Escherichia coli genome. Nat. Biotechnol. 27, 1043–1049 (2009).
53. Yoder-Himes, D. R. et al. Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc. Natl. Acad. Sci. U. S. A. 106, 3976–3981 (2009).
54. Botella, L., Vaubourgeix, J., Livny, J. & Schnappinger, D. Depleting Mycobacterium tuberculosis of the transcription termination factor Rho causes pervasive transcription and rapid death. Nat. Commun. 8, 14731 (2017).
55. Liu, X. et al. High-throughput CRISPRi phenotyping identifies new essential genes in Streptococcus pneumoniae. Mol. Syst. Biol. 13, (2017).
56. Legeret, B. et al. Lipidomic and transcriptomic analyses of Chlamydomonas reinhardtii under heat stress unveil a direct route for the conversion of membrane lipids into storage lipids. Plant Cell Environ. 39, 834–847 (2016).
57. Kroger, C. et al. An infection-relevant transcriptomic compendium for Salmonella enterica Serovar Typhimurium. Cell Host Microbe 14, 683–695 (2013).
58. Westermann, A. J., Gorski, S. A. & Vogel, J. Dual RNA-seq of pathogen and host. Nat. Rev. Microbiol. 10, 618–630 (2012).
59. Westermann, A. J. et al. Dual RNA-seq unveils noncoding RNA functions in host-pathogen interactions. Nature 529, 496–501 (2016).
60. Thanert, R., Goldmann, O., Beineke, A. & Medina, E. Host-inherent variability influences the transcriptional response of Staphylococcus aureus during in vivo infection. Nat. Commun. 8, 14268 (2017).
61. Blomberg, C. et al. Pattern of accessory regions and invasive disease potential in Streptococcus pneumoniae. J. Infect. Dis. 199, 1032–1042 (2009).
62. Obert, C. et al. Identification of a candidate Streptococcus pneumoniae core genome and regions of diversity correlated with invasive pneumococcal disease. Infect. Immun. 74, 4766–4777 (2006).
63. Silva, N. A. et al. Genomic diversity between strains of the same serotype and multilocus sequence type among pneumococcal clinical isolates. Infect. Immun. 74, 3513–3518 (2006).
64. Donati, C. et al. Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol. 11, R107 (2010).
65. Tettelin, H. et al. Genomics, genetic variation, and regions of differences. in Streptococcus pneumoniae: Molecular Mechanisms of Host-Pathogen Interactions (eds. Brown, J., Hammerschmidt, S. & Orihuela, C.) 5, 81–107 (Academic Press, 2015).
66. Acebo, P., Martin-Galiano, A. J., Navarro, S., Zaballos, A. & Amblar, M. Identification of 88 regulatory small RNAs in the TIGR4 strain of the human pathogen Streptococcus pneumoniae. RNA 18, 530–546 (2012).
67. Kumar, R. et al. Identification of novel non-coding small RNAs from Streptococcus pneumoniae TIGR4 using high-resolution genome tiling arrays. BMC Genomics 11, 350 (2010).
68. Mann, B. et al. Control of virulence by small RNAs in Streptococcus pneumoniae. PLoS Pathog. 8, e1002788 (2012).
69. Tsui, H.-C. T. et al. Identification and characterization of noncoding small RNAs in Streptococcus pneumoniae serotype 2 strain D39. J. Bacteriol. 192, 264–279 (2010).
70. Lanie, J. A. et al. Genome sequence of Avery’s virulent serotype 2 strain D39 of Streptococcus pneumoniae and comparison with that of unencapsulated laboratory strain R6. J. Bacteriol. 189, 38–51 (2007).


2

Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39

Jelle Slager (a,#), Rieza Aprianto (a,#) and Jan-Willem Veening (a,b)

a: Molecular Genetics Group, Groningen Biomolecular Sciences and Biotechnology Institute, Centre for Synthetic Biology, University of Groningen, Nijenborgh 7, 9747 AG Groningen, the Netherlands

b: Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Biophore Building, CH-1015 Lausanne, Switzerland

#: The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint first authors

bioRxiv 2018 | https://doi.org/10.1101/283663 | 22 March 2018

Under revision for Nucleic Acids Research

RA designed the research on transcriptional start and terminator sites, including leaderless genes, performed the experiments, generated the strains, analysed the data and wrote relevant sections. RA performed the same contributions for pyrimidine riboswitches


Abstract

A precise understanding of the genomic organization into transcriptional units and their regulation is essential for our comprehension of opportunistic human pathogens and how they cause disease. Using single-molecule real-time (PacBio) sequencing, we unambiguously determined the genome sequence of Streptococcus pneumoniae strain D39 and revealed several inversions previously undetected by short-read sequencing. Significantly, a chromosomal inversion results in antigenic variation of PhtD, an important surface-exposed virulence factor. We generated a new genome annotation using automated tools, followed by manual curation, reflecting the current knowledge in the field. By combining sequence-driven terminator prediction, deep paired-end transcriptome sequencing and enrichment of primary transcripts by Cappable-Seq, we mapped 1,015 transcriptional start sites and 748 termination sites. Using this new genomic map, we identified several new small RNAs (sRNAs), riboswitches (including twelve previously misidentified as sRNAs), and antisense RNAs. In total, we annotated 92 new protein-encoding genes, 39 sRNAs and 165 pseudogenes, bringing the S. pneumoniae D39 repertoire to 2,151 genetic elements. We report operon structures and observed that 9% of operons lack a 5’-UTR. The genome data is accessible in an online resource called PneumoBrowse (https://veeninglab.com/pneumobrowse), providing one of the most complete inventories of a bacterial genome to date. PneumoBrowse will accelerate pneumococcal research and the development of new prevention and treatment strategies.

Introduction

Ceaseless technological advances have revolutionized our capability to determine genome sequences as well as our ability to identify and annotate functional elements, including transcriptional units on these genomes. Several resources have been developed to organize current knowledge on the important opportunistic human pathogen Streptococcus pneumoniae, or the pneumococcus [1–3]. However, an accurate genome map with an up-to-date and extensively curated genome annotation is missing.

The enormous increase of genomic data on various servers, such as NCBI and EBI, and the associated decrease in consistency has, in recent years, led to the Prokaryotic RefSeq Genome Reannotation Project. Every bacterial genome present in the NCBI database was re-annotated using the so-called Prokaryotic Genome Annotation Pipeline (PGAP) [4], with the goal of increasing the quality and consistency of the many available annotations. This Herculean effort indeed created a more consistent set of annotations that facilitates the propagation and interpolation of scientific findings in individual bacteria to general phenomena, valid in larger groups of organisms. On the other hand, a wealth of information is already available for well-studied bacteria like the pneumococcus. Therefore, a separate, manually curated annotation is essential to maintain oversight of the current knowledge in the field. Hence, we generated a resource for the pneumococcal research community that contains the most up-to-date information on the D39 genome, including its DNA sequence, transcript boundaries, operon structures and functional annotation. Notably, strain D39 is one of the workhorses in research on pneumococcal biology and pathogenesis. We analyzed the genome in detail, using a combination of several different sequencing techniques and a novel, generally applicable analysis pipeline (Fig. 1).

Using Single Molecule Real-Time (SMRT, PacBio RS II) sequencing, we sequenced the genome of the stock of serotype 2 S. pneumoniae strain D39 in the Veening laboratory, hereafter referred to as strain D39V. This strain is a far descendant of the original Avery strain that was used to demonstrate that DNA is the carrier of hereditary information [5] (Supplementary Fig. S1). Combining Cappable-seq [6], a novel sRNA detection method and several bioinformatic annotation tools, we deeply annotated the pneumococcal genome and transcriptome.

Finally, we created PneumoBrowse, an intuitive and accessible genome browser (https://veeninglab.com/pneumobrowse), based on JBrowse [7]. PneumoBrowse provides a graphical and user-friendly interface to explore the genomic and transcriptomic landscape of S. pneumoniae D39V and allows direct linking to gene expression and co-expression data in PneumoExpress (Chapter 3). The reported annotation pipeline and accompanying genome browser provide one of the best curated bacterial genomes currently available and may facilitate rapid and accurate annotation of other bacterial genomes. We anticipate that PneumoBrowse will significantly accelerate the pneumococcal research field and hence speed up the discovery of new drug targets and vaccine candidates for this devastating global opportunistic human pathogen.

Results

De novo assembly yields a single circular chromosome

We performed de novo genome assembly using SMRT sequencing data, followed by polishing with high-confidence Illumina reads, obtained in previous studies [8,23]. Since this data was derived from a derivative of D39, regions of potential discrepancy were investigated using Sanger sequencing. In the end, we needed to correct the SMRT assembly in only one location. The described approach yielded a single chromosomal sequence of 2,046,572 base pairs, which was deposited to GenBank (accession number CP027540).

D39V did not suffer disruptive mutations compared to ancestral strain NCTC 7466

We then compared the newly assembled genome with the previously established sequence of D39 [24] (D39W), and observed similar sequences, but with some striking differences (Table 1, Fig. 2A). Furthermore, we cross-checked both sequences with the genome sequence of the ancestral strain NCTC 7466 (ENA accession number ERS1022033), which was recently sequenced with SMRT technology, as part of the NCTC 3000 initiative. Interestingly, D39V matches NCTC 7466 in all gene-disruptive discrepancies (e.g. frameshifts and a chromosomal inversion, see below). Most of these sites are characterized by their repetitive nature (e.g. homopolymeric runs or long repeated sequences). Considering the sequencing technology

[Fig. 1 flowchart, summarized from its labels. Genome: genomic DNA of WT D39V, size selection (> 4 kb), SMRT (PacBio) sequencing, de novo assembly (HGAP3) and DNA methylation analysis; refined assembly using unstranded Illumina paired-end reads of genomic DNA (α); automated annotation (RAST/PGAP); curated annotation using PubMed, BLASTP/BLASTX, UniProtKB, CDD, tRNAscan-SE, ISfinder, BSRD, RegPrecise and MEME Suite. Transcriptome: total RNA (-rRNA) of WT D39V, 5’-enriched (Cappable) or untreated control; stranded Illumina single-end and paired-end sequencing; mapping of reads on the genome (β, bowtie2); enriched and control 5’ bases (start) give transcription start sites (TSS); control 3’ bases (end) give putative end peaks, which are combined with terminator prediction (TransTermHP) into terminators; small RNA features (γ).]

Fig. 1. Data analysis pipeline used for genome assembly and annotation. Left. DNA level, the genome sequence of D39V was determined by SMRT sequencing, supported by previously published Illumina data [8,23]. Automated annotation by the RAST [10] and PGAP [4] annotation pipelines was followed by curation based on information from literature and a variety of databases and bioinformatic tools. Right. RNA level, Cappable-seq [6] was utilized to identify transcription start sites. Simultaneously, putative transcript ends were identified by combining reverse reads from paired-end, stranded sequencing of the control sample (i.e. not 5’-enriched). Terminators were annotated when such putative transcript ends overlapped with stem loops predicted by TransTermHP [20]. Finally, local fragment size enrichment in the paired-end sequencing data was used to identify putative small RNA features. α D39 derivative (bgaA::PssbB-luc; GEO accessions GSE54199 and GSE69729). β The first 1 kbp of the genome file was duplicated at the end, to allow mapping over FASTA boundaries. γ Analysis was performed with only sequencing pairs that map uniquely

Table 1. Differences between old and new genome assembly. The genomic sequences of the old (D39W, CP000410) and new (D39V, CP027540) genome assemblies were compared, revealing 14 SNPs, 3 insertions, and 2 deletions. Additionally, a repeat expansion in pavB, several rearrangements in the hsdS locus and, most strikingly, a 162 kbp (8% of the genome) chromosomal inversion were observed. Finally, both sequences were compared to the recently released PacBio sequence of ancestral strain NCTC 7466 (ENA accession number ERS1022033). For each observed difference between the old and new assembly, the variant matching the ancestral strain is displayed in boldface. α Locus falls within the inverted ter region and the forward strand in the new assembly is therefore the reverse complement of the old sequence (CP000410). β Region is part of a larger pseudogene in the new annotation. γ Only found in one of two D39 stocks in our laboratory.

| D39W coordinate(s) | D39V coordinate(s) | Locus | Change | Consequence | Note |
|---|---|---|---|---|---|
| 80663-83343 | 80663-83799 | SPD_0080 (pavB) | Repeat expansion (6x>7x) | Repeat expansion in PavB (6x>7x) | |
| 174318 | 174774 | SPD_0170 (ruvA) | G>A | RuvA V52V (GTG>GTA) | |
| 297022 | 297479 | SPD_0299-300 | +T | SPD_0299 and SPD_0300 shifted into same coding frame | |
| 303240 | 303697 | SPD_0306 (pbp2x) | A>G | PBP2X N311D (AAT>GAT) | |
| 458088-462242 | 458545-462699 | SPD_0450-55 [13] (hsdRMS, creX) | Multiple rearrangements | HsdS type A>F | |
| 462212 | 458575 | SPD_0453 (hsdS) | A>G | Imperfect > perfect inverted repeat | Inside rearranged region |
| 675950 | 676407 | SPD_0657 → / → SPD_0658 (prfB) | C>A | Intergenic (+163/-51 nt) | In 5’ UTR of prfB |
| 775672 | 776129 | SPD_0764 (sufS) | G>A | SufS G318R (GGA>AGA) | |
| 816157 | 816615 | SPD_0800 β | +G | Frameshift (347/360 nt) | |
| 901217-1062944 | 901675-1063403 | SPD_0889-1037 | Inversion | Swap of 3’ ends of phtB (SPD_1037) and phtD (SPD_0889) | |
| 934443 | 1030177 | SPD_0921 (ccrB) | A>G α | CcrB Q286R (CAG>CGG) | |
| 951536 | 1013083 γ | SPD_0942 | +C α | Frameshift (198/783 nt) | |
| 1035166 | 929453 | SPD_1016 (rexA) | C>A α | RexA A961D (GCT>GAT) | |
| 1080119 | 1080577 | SPD_1050 (lacD) | ΔT | Frameshift (159/981 nt) | |
| 1171761 | 1172219 | SPD_1137 | C>G | H431Q (CAC>CAG) | |
| 1256813 | 1257270 | SPD_1224 (budA) ← / → SPD_1225 | ΔA | Intergenic (-100/-42 nt) | |
| 1256937 | 1257394 γ | SPD_1225 | G>T | R28L (CGC>CTC) | |
| 1672084 | 1672541 | SPD_1660 (rdgB) | G>A | RdgB T117I (ACA>ATA) | |
| 1676516 | 1676973 | SPD_1664 (treP) | C>T | TreP G359D (GGC>GAC) | |
| 1787708 | 1788165 | SPD_1793 | C>T | A2V (GCA>GTA) | |
| 1977728 | 1978185 | SPD_2002 (dltD) | C>A | DltD V252F (GTC>TTC) | |
| 2022372 | 2022829 | SPD_2045 (mreC) | A>G | MreC S186P (TCT>CCT) | |

[Fig. 2 graphic. A: alignment tracks for D39W (Winkler), D39V (Veening), NCTC 7466, SP49, SP61 and SP64 on a 0-2,000 kb axis, with pDP1 presence marked. B: hsdS locus with hsdR, hsdM, creX, hsdS-F and inverted repeats IR1-IR3. C: motif table, reproduced here.]

| Motif (top / complementary strand) | Modification | Sites on genome | Sites modified | Responsible R-M system |
|---|---|---|---|---|
| CACNNNNNNNCTT / GTGNNNNNNNGAA | N6-methyladenosine (m6A) | 796 | 796 | SpnD39IIIF β (HsdR-M-S) |
| TCTAGA / AGATCT | N6-methyladenosine (m6A) | 644 | 643 | SpnD39I (SPV_1259 α-60) |
| TCGAG / AGCTC | N6-methyladenosine (m6A) | 1509 | 1498 | SpnD39II (SPV_1079-80) |

Fig. 2. Multiple genome alignment. A. Multiple genome sequence alignment of D39W, D39V, NCTC 7466, and clinical isolates SP49, SP61, and SP64 [30] reveals multiple ter-symmetrical chromosomal inversions. Identical colors indicate similar sequences, while blocks shown below the main genome level and carrying a reverse arrow signify inverted sequences relative to the D39W assembly. The absence/presence of the pDP1 (or similar) plasmid is indicated with a cross/checkmark. Asterisks indicate the position of the hsdS locus. B. Genomic layout of the hsdS region. As reported by Manso et al. [26], the region contains three sets of inverted repeats (IR1-3), that are used by CreX to reorganize the locus. Thereby, six different variants (A-F) of methyltransferase specificity subunit HsdS can be generated, each leading to a distinct methylation motif. SMRT sequencing of D39V revealed that the locus exists predominantly in the F-configuration, consisting of N-terminal variant 2 (i.e. 1.2) and C-terminal variant 3 (i.e. 2.3). C. Motifs that were detected to be specifically modified in D39V SMRT data. α SPV_1259 (encoding the R-M system endonuclease) is a pseudogene, due to a nonsense mutation. β Manso et al. reported the same motifs and reported the responsible methyltransferases. The observed CAC-N7-CTT motif perfectly matches the predicted putative HsdS-F motif.

used, these differences are likely to be the result of misassembly in D39W, rather than sites of true biological divergence. On the other hand, discrepancies between D39V and the ancestral strain are limited to SNPs, with unknown consequences for pneumococcal fitness. It seems plausible that these polymorphisms constitute actual mutations in D39V, emphasizing the dynamic nature of the pneumococcal genome. Notably, there are two sites where both the D39W and D39V assemblies differ from the ancestral strain. Firstly, the ancestral strain harbors a mutation in rrlC (SPV_1814), one of four copies of the gene encoding 23S ribosomal RNA. It is not clear if this is a technical artefact in one of the assemblies (due to the large repeat size in this region), or an actual biological difference. Secondly, we observed a mutation in the upstream region of cbpM (SPV_1248) in both D39W and D39V.

Several SNPs and indel mutations observed in D39V assembly

Fourteen single nucleotide polymorphisms (SNPs) were detected upon comparison of D39W and D39V assemblies. One of these SNPs results in a silent mutation in the gene encoding RuvA, the Holliday junction DNA helicase, while another SNP was located in the 5’-untranslated region (5’-UTR) of prfB, encoding peptide chain release factor 2. The other twelve SNPs caused amino acid changes in various proteins, including penicillin-binding protein PBP2X and cell shape-determining protein MreC. It should be noted that one of these SNPs, leading to an arginine to leucine change in the protein encoded by SPV_1225 (previously SPD_1225), was not found in an alternative D39 stock from our lab (Supplementary Fig. S1). The same applies to an insertion of a cytosine causing a frameshift in the extreme 3’-end of SPV_0942 (previously SPD_0942; Supplementary Fig. S2L). All other differences found, however, were identified in both of our stocks and are therefore likely to be more widespread. Among these differences are four more indel mutations (insertions or deletions), the genetic context and consequences of which are shown in Supplementary Fig. S2. One of the indels is located in the promoter region of two diverging operons, with unknown consequences for gene expression. Secondly, we found an insertion in the region corresponding to SPD_0800 (D39W annotation). Here, we report this gene to be part of a pseudogene together with SPD_0801 (annotated as SPV_2242). Hence, the insertion probably is of little consequence. Thirdly, a deletion was observed in the beginning of lacD, encoding an important enzyme in the D-tagatose-6-phosphate pathway, relevant in galactose metabolism. The consequential absence of functional LacD may explain why the inactivation of the alternative Leloir pathway in D39 significantly hampered growth on galactose [25]. We repaired lacD in D39V and, as expected, observed restored growth on galactose (Supplementary Fig. S3). Finally, we observed a thymine insertion that caused SPD_0299 and SPD_0300 to be shifted into the same coding frame and form a single 1.9 kb long CDS (SPV_2142). Since the insertion was found in a homopolymeric run of thymines and the assemblies of NCTC 7466 and D39V match, it seems plausible that instead of a true indel mutation, this actually reflects a sequencing error in the D39W assembly.
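The coding consequences listed in Table 1 can be checked by translating the old and new codon of each SNP. The snippet below is a minimal sketch of that check and is not taken from the analysis pipeline of this chapter; the function names are hypothetical, and the codon table is the standard genetic code written in the usual TCAG order.

```python
from itertools import product

# Standard genetic code in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def describe_change(old_codon: str, new_codon: str, residue: int) -> str:
    """Return a label such as 'N311D' for a single-codon substitution."""
    old_aa = CODON_TABLE[old_codon.upper()]
    new_aa = CODON_TABLE[new_codon.upper()]
    return f"{old_aa}{residue}{new_aa}"


def classify_change(old_codon: str, new_codon: str) -> str:
    """Classify a single-codon substitution as silent, nonsense or missense."""
    old_aa = CODON_TABLE[old_codon.upper()]
    new_aa = CODON_TABLE[new_codon.upper()]
    if old_aa == new_aa:
        return "silent"
    if new_aa == "*":
        return "nonsense"
    return "missense"


if __name__ == "__main__":
    # Codon changes and residue numbers copied from Table 1.
    table_1_examples = [
        ("ruvA", "GTG", "GTA", 52),
        ("pbp2x", "AAT", "GAT", 311),
        ("SPD_1225", "CGC", "CTC", 28),
        ("mreC", "TCT", "CCT", 186),
    ]
    for gene, old, new, residue in table_1_examples:
        label = describe_change(old, new, residue)
        print(gene, label, classify_change(old, new))
```

Run on these four rows, the labels agree with the Consequence column of Table 1, including the single silent change in ruvA.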

Varying repeat frequency in surface-exposed protein PavB

Pneumococcal adherence and virulence factor B (PavB) is encoded by SPV_0080. Our assembly shows that this gene contains a series of seven imperfect repeats of 450-456 bps in size. Interestingly, SPD_0080 in D39W contains only six of these repeats. If identical repeat units are indicated with an identical letter, the repeat region in SPV_0080 of D39V can be written as ABBCBDE, where E is truncated after 408 bps. Using the same letter code, SPD_0080 of D39W contains ABBCDE, thus lacking the third repeat of element B, which is isolated from the other copies in SPV_0080. Because D39V and NCTC 7466 contain the full-length version of the gene, we hypothesized that D39W lost one of the repeats, making the encoded protein 152 residues shorter.
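The letter code above lends itself to a small worked check. The sketch below, with hypothetical function and variable names, finds which single repeat unit must be removed from the D39V arrangement to obtain the D39W arrangement, and converts a repeat length into residues. The 456 bp unit length is an assumption chosen to be consistent with the stated 152-residue difference and with the 456 bp difference in pavB end coordinates in Table 1.

```python
def removable_units(longer: str, shorter: str) -> list:
    """Return 0-based indices whose removal from `longer` yields `shorter`."""
    return [
        index
        for index in range(len(longer))
        if longer[:index] + longer[index + 1:] == shorter
    ]


D39V_REPEATS = "ABBCBDE"  # repeat arrangement in SPV_0080 (D39V)
D39W_REPEATS = "ABBCDE"   # repeat arrangement in SPD_0080 (D39W)

# Assumed length of the missing repeat unit in base pairs (see lead-in).
MISSING_UNIT_BP = 456

if __name__ == "__main__":
    indices = removable_units(D39V_REPEATS, D39W_REPEATS)
    for index in indices:
        print(f"unit {D39V_REPEATS[index]} at position {index + 1} is absent in D39W")
    print(f"{MISSING_UNIT_BP} bp correspond to {MISSING_UNIT_BP // 3} residues")
```

Only one removal reproduces the D39W arrangement: the isolated third copy of B, at position 5 of the seven repeats.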

Configuration of variable hsdS region matches observed methylation pattern

A local rearrangement is found in the pneumococcal hsdS locus, encoding a three-component restriction-modification system (HsdRMS). Recombinase CreX facilitates local recombination, using three sets of inverted repeats, and can thereby rapidly rearrange the region into six possible configurations (SpnD39IIIA-F). This process results in six different versions of methyltransferase specificity subunit HsdS, each with its own sequence specificity and transcriptomic consequences [26,27], as defined by single-molecule, real-time (SMRT) sequencing. The region is annotated in the A-configuration in D39W, while the F-configuration is predominant in D39V (Fig. 2B). Moreover, we employed methylation data, intrinsically present in SMRT data [28], and observed an enriched methylation motif that exactly matches the putative SpnD39IIIF motif predicted by Manso et al. (Fig. 2C).
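To make the notion of a degenerate methylation motif concrete, the following sketch locates occurrences of a motif such as CACNNNNNNNCTT on both strands of a sequence. It is an illustration only: the helper names and the toy sequence are hypothetical, and the counting convention used for the site numbers in Fig. 2C is not stated in this chapter, so forward-strand and reverse-strand hits are simply reported separately.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of a DNA string that may contain N."""
    return sequence.translate(COMPLEMENT)[::-1]


def motif_regex(motif: str):
    """Compile a motif with N wildcards; the lookahead allows overlapping hits."""
    return re.compile("(?=(" + motif.replace("N", "[ACGT]") + "))")


def find_sites(sequence: str, motif: str):
    """Return 0-based start positions of the motif on the forward and reverse strand.

    Reverse-strand hits are reported in forward-strand coordinates. For a
    palindromic motif both strands coincide, so only forward hits are returned.
    """
    sequence = sequence.upper()
    forward = [m.start() for m in motif_regex(motif).finditer(sequence)]
    rc_motif = reverse_complement(motif)
    if rc_motif == motif:
        return forward, []
    reverse = [m.start() for m in motif_regex(rc_motif).finditer(sequence)]
    return forward, reverse


if __name__ == "__main__":
    # Toy sequence (hypothetical), containing one site per strand and one TCTAGA.
    toy = "GGCACAAAAAAACTTGGAAGTTTTTTTGTGCCTCTAGA"
    print(find_sites(toy, "CACNNNNNNNCTT"))
    print(find_sites(toy, "TCTAGA"))
```

Applied to a full genome sequence, the same function would give candidate sites whose modification status could then be compared with the SMRT kinetics data.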

A large chromosomal inversion occurred multiple times in pneumococcal evolution

We also observed a striking difference between D39V and D39W: a 162 kbp region containing the replication terminus was completely inverted (Figs. 2A and 3), with D39V matching the configuration of the ancestral NCTC 7466. The inverted region is bordered by two inverted repeats of 1.3 kb in length. We noticed that the xerS/difSL site, responsible for chromosome dimer resolution and typically located directly opposite the origin of replication [29], is asymmetrically situated on the right replichore in D39V (Fig. 3A), while the locus is much closer to the halfway point of the chromosome in the D39W assembly, suggesting that this configuration is the original one and the observed inversion in D39V and NCTC 7466 is a true genomic change, rather than merely a sequencing artefact. To confirm this, we performed a PCR-based assay, in which the two possible configurations yield different product sizes. Indeed, the results showed that two possible configurations of the region exist in different pneumococcal strains; multiple D39 stocks, TIGR4, BHN100 and PMEN-14 have matching terminus regions, while the opposite configuration was found in R6, Rx1, PMEN-2 and PMEN-18. We repeated the analysis for a set of seven and a set of five strains, each related by a series of sequential transformation events. All strains had the same ter orientation (not shown), suggesting that the inversion is relatively rare, even in competent cells. However, both configurations are found in various branches of the pneumococcal phylogenetic tree, indicating multiple incidences of this chromosomal inversion. Interestingly, a similar, even larger inversion was observed in two out of three recently-sequenced clinical isolates of S. pneumoniae [30] (Fig. 2A), suggesting a larger role for chromosomal inversions in pneumococcal evolution.
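The effect of the inversion on genomic coordinates (footnote α of Table 1) can be illustrated with a short calculation. The sketch below mirrors a position within the inverted segment; the function name is hypothetical, and the upstream offsets of 458 and 459 bp are our own reading of Table 1 (the 456 bp pavB repeat plus the +T and +G insertions, and additionally the +C insertion inside the inverted region), not values stated in the text.

```python
def mirror_in_inversion(position: int, inv_start: int, inv_end: int) -> int:
    """Map a 1-based position inside an inverted segment to its mirrored position."""
    if not inv_start <= position <= inv_end:
        raise ValueError("position lies outside the inverted segment")
    return inv_start + inv_end - position


# D39V coordinates of the inverted region, copied from Table 1.
INV_START, INV_END = 901_675, 1_063_403

if __name__ == "__main__":
    # ccrB SNP: D39W coordinate 934,443, shifted by the assumed 458 bp of
    # upstream insertions, then mirrored. Table 1 lists 1,030,177 for D39V.
    print(mirror_in_inversion(934_443 + 458, INV_START, INV_END))

    # rexA SNP: D39W coordinate 1,035,166 lies beyond the +C insertion, so the
    # assumed shift is 459 bp. Table 1 lists 929,453 for D39V.
    print(mirror_in_inversion(1_035_166 + 459, INV_START, INV_END))
```

Under these assumptions, both mirrored positions coincide with the D39V coordinates given in Table 1, which is consistent with a simple inversion of the whole segment between the two border repeats.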

Antigenic variation of histidine triad protein PhtD

Surprisingly, the repeat regions bordering the chromosomal inversion are located in the middle of phtB and phtD (Fig. 3A), leading to an exchange of the C-terminal parts of their respective products, PhtB and PhtD. These are two out of four pneumococcal histidine triad (Pht) proteins, which are surface-exposed, interact with human host cells and are considered to be good vaccine candidates [31]. In fact, PhtD was already used in several phase I/II clinical trials [32,33]. Yun et al. analyzed the diversity of phtD alleles from 172 clinical isolates and concluded that the sequence variation was minimal [34]. However, this conclusion was biased by the fact that inverted chromosomes would not produce a PCR product in their set-up and a swap between PhtB and PhtD would remain undetected. Moreover, after detailed inspection of the mutations in the phtD alleles and comparison to other genes encoding Pht proteins (phtA, phtB and phtE), we found that many of the SNPs could be explained by recombination events between these genes, rather than by random mutation. For example, extensive exchange was seen between D39V phtA and phtD (Fig. 3B). Apparently, the repetitive nature of these genes allows for intragenomic recombination, causing phtD to become mosaic, rather than well-conserved. Finally, immediately downstream of phtE (Fig. 3A), we identified a pseudogene that originally encoded a fifth histidine triad protein and which we named phtF (Fig. 3C). The gene is disrupted by an inserted RUP element (see below) and several frameshifts and nonsense mutations, and therefore does not produce a functional protein. Nevertheless, phtF might still be relevant as a source of genetic diversity. Taken together, these findings raise caution on the use of PhtD as a vaccine target.
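The mosaic pattern highlighted in Fig. 3B can be reproduced in outline by labelling each aligned base of an allele according to which D39V gene it matches. The sketch below is illustrative and not the procedure used to draw the figure: the function name and labels are hypothetical, the three sequences are the D39V phtD, D39V phtA and most frequent (98x) allele fragments shown in Fig. 3B, and "identical" here refers only to these three sequences rather than to all 172 alleles.

```python
from collections import Counter

# Aligned fragments copied from Fig. 3B.
PHTD = "GACCATTATCACTTTATTCCTTATTCACAACTGTCACCTTTGGAAGAAAAATTG"
PHTA = "GATCATTACCACTTCATCCCTTACTCTCAAATGTCTGAATTGGAAGAACGAATC"
ALLELE_98X = "AACCATTACCACTTTATCCCTTATGAACAAATGTCTGAATTGGAAAAACGAATT"


def label_columns(allele: str, ref_d: str, ref_a: str) -> list:
    """Label every aligned base as matching phtD, phtA, both or neither."""
    if not len(allele) == len(ref_d) == len(ref_a):
        raise ValueError("sequences must be aligned to the same length")
    labels = []
    for base, d_base, a_base in zip(allele, ref_d, ref_a):
        if base == d_base == a_base:
            labels.append("identical")
        elif base == d_base:
            labels.append("phtD")
        elif base == a_base:
            labels.append("phtA")
        else:
            labels.append("neither")
    return labels


if __name__ == "__main__":
    counts = Counter(label_columns(ALLELE_98X, PHTD, PHTA))
    for label in ("identical", "phtD", "phtA", "neither"):
        print(label, counts[label])
```

An allele whose variable positions switch between the phtD and phtA labels along the fragment is what one would expect from recombination between the two genes, as opposed to scattered "neither" positions expected from random mutation.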

RNA-seq data and PCR analysis show loss of cryptic plasmid from strain D39V

Since SMRT technology is known to miss small plasmids in the assembly pipeline, we performed a PCR-based assay to check the presence of the cryptic pDP1 plasmid, reported in D39W [24,35]. To our surprise, the plasmid is absent in D39V, while clearly present in the ancestral NCTC 7466, as confirmed by a PCR-based assay (Supplementary Fig. S4). Intriguingly, a BLASTN search suggested that S. pneumoniae Taiwan19F-14 (PMEN-14, CP000921), among other strains, integrated a degenerate version of the

Fig. 3. A large chromosomal inversion unveils antigenic variation of pneumococcal histidine triad proteins. A. Top: chromosomal location of the inverted 162 kb region (orange). Red triangles connect the location of the 1 kb inverted repeats bordering the inverted region and a zoom of the genetic context of the border areas, also showing that the inverted repeats are localized in the middle of genes phtB and phtD. Arrows marked with A, B and C indicate the target regions of oligonucleotides used in PCR analysis of the region. Bottom: PCR analysis of several pneumococcal strains (including both our D39 stocks and a stock from the Grangeasse lab, Lyon) shows that the inversion is a true phenomenon, rather than a technical artefact. PCR reactions are performed with all three primers present, such that the observed product size reports on the chromosomal configuration. B. A fragment of a Clustal Omega multiple sequence alignment of 172 reported phtD alleles [34] and D39V genes phtD and phtA exemplifies the dynamic nature of the genes encoding pneumococcal histidine triad proteins. Bases highlighted in green and purple match D39V phtD and phtA, respectively. Orange indicates that a base is different from both D39V genes, while white bases are identical in all sequences. C. Newly identified pseudogene, containing a RUP insertion and several frameshifts and nonsense mutations, that originally encoded a fifth pneumococcal histidine triad protein, and which we named phtF. Old (D39W) and new annotation (D39V) are shown, along with conserved domains predicted by CD-Search [14].

plasmid into its chromosome. Indeed, the PCR assay showed positive results for this strain. Additionally, we selected publicly available D39 RNA-seq datasets and mapped the RNA-sequencing reads specifically to the pDP1 reference sequence (Accession AF047696). The successful mapping of a significant number of reads indicated the presence of the plasmid in strains used in several studies (SRX2613845 [36]; SRX1725406 [37]; SRX472966 [26]). In contrast, RNA-seq data of D39V [8,23] (Chapter 3) contained zero reads that mapped to the plasmid, providing conclusive evidence that strain D39V lost the plasmid at some stage (Supplementary Fig. S1). Similarly, based on Illumina DNA-seq data, we determined that of the three clinical isolates shown in Fig. 2A, only SP61 contained a similar plasmid [18].
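Counting reads that map to a plasmid reference, as described above, amounts to tallying primary alignments per reference name. The following sketch does this for SAM-formatted text without external libraries. It is an illustration under assumptions: the function name and the toy records are hypothetical, and the reference name in a real alignment depends on the FASTA header used when building the index (here the accession AF047696 from the text is assumed).

```python
UNMAPPED = 0x4         # SAM flag: read is unmapped
SECONDARY = 0x100      # SAM flag: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag: supplementary alignment


def count_primary_alignments(sam_lines, reference_name: str) -> int:
    """Count mapped primary alignments to one reference in SAM text lines."""
    count = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip empty lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        if rname == reference_name:
            count += 1
    return count


if __name__ == "__main__":
    # Toy SAM records (hypothetical reads, minimal mandatory fields).
    toy_sam = [
        "@HD\tVN:1.6",
        "r1\t0\tAF047696\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tAF047696\t50\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "r4\t0\tCP027540\t99\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r5\t256\tAF047696\t70\t0\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    print(count_primary_alignments(toy_sam, "AF047696"))
```

A count of zero for the plasmid reference in a deeply sequenced library, as reported for D39V, is the kind of result that supports plasmid loss.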

Automation and manual curation yield up-to-date pneumococcal functional annotation

An initial annotation of the newly assembled D39V genome was produced by combining output from the RAST annotation engine [11] and the NCBI prokaryotic genome annotation pipeline (PGAP) [4]. We then proceeded with exhaustive manual curation to produce the final genome annotation (see Methods for details). All annotated CDS features without an equivalent feature in the D39W annotation or with updated coordinates are listed in Supplementary Table S3. Examples of the integration of recent research into the final annotation include cell division protein MapZ [38,39], pleiotropic RNA-binding proteins KhpA and KhpB/EloR [40,41] and cell elongation protein CozE [42].

Additionally, we used tRNAscan-SE [43] to differentiate the four encoded tRNAs with a CAU anticodon into three categories (Supplementary Table S4): tRNAs used in either (i) translation initiation or (ii) elongation and (iii) the post-transcriptionally modified tRNA-Ile2, which decodes the AUA isoleucine codon [44].

Next, using BLASTX [12] (Methods), we identified and annotated 165 pseudogenes (Supplementary Table S5), two-fold more than reported previously [24]. These non-functional transcriptional units may be the result of the insertion of repeat regions, nonsense and/or frameshift mutations and/or chromosomal rearrangements. Notably, 71 of 165 pseudogenes were found on IS elements [19], which are known to sometimes utilize alternative coding strategies, including programmed ribosomal slippage, producing

[Fig. 3 graphic, summarized from its labels. A: chromosome map of D39V (2,046,572 bps) showing oriC, xerS/difSL and the 162 kb inverted region, with a zoom on phtA, phtE, phtF, lmb, phtB and phtD and primer sites A, B and C; PCR products: A+B = 1.2 kb, A+C = 2.2 kb. B: alignment fragment (axis ticks 950-990) of D39V phtD, D39V phtA and the phtD alleles, with the number of alleles per sequence:

D39V phtD  GACCATTATCACTTTATTCCTTATTCACAACTGTCACCTTTGGAAGAAAAATTG
D39V phtA  GATCATTACCACTTCATCCCTTACTCTCAAATGTCTGAATTGGAAGAACGAATC
98x        AACCATTACCACTTTATCCCTTATGAACAAATGTCTGAATTGGAAAAACGAATT
33x        AACCATTACCACTTTATCCCTTATGAACAAATGTCTGAATTGGAAGAACGAATT
15x        AACCATTACCACTTTATCCCTTACTCTCAAATGTCTGAATTGGAAGAACGAATT
14x        AACCATTACCACTTTATCCCTTACTCTCAAATGTCTGAATTGGAAAAACGAATT
9x         GACCATTATCACTTTATTCCTTATTCACAACTGTCACCTTTGGAAGAAAAATTG
3x         AACCATTACCACTTTATCCCCTATGAACAAATGTCTGAATTGGAAAAACGAATT

C: new annotation phtF (SPV_2293), coordinates 1,056,401 - 1,058,604 (-) (old coordinates 906,016 - 908,219 (+)); old annotation SPD_0891, SPD_0892 and SPD_0893; RUP insertion, stop codons and frameshifts marked; conserved domain: Streptococcal histidine triad protein.]

a functional protein from an apparent pseudogene. Finally, we annotated 127 BOX elements [16], 106 RUPs [17], 29 SPRITEs [18] and 58 IS elements [19].

RNA-seq coverage and transcription start site data allow improvement of annotated feature boundaries

Besides functional annotation, we also corrected the genomic coordinates of several features. First, we updated tRNA and rRNA boundaries (Supplementary Table S4), aided by RNA-seq coverage plots that were built from deduced paired-end sequenced fragments, rather than from just the sequencing reads. Most strikingly, we discovered that the original annotation of genes encoding 16S ribosomal RNA (rrsA-D) excluded the sequence required for ribosome binding site (RBS) recognition [45]. Fortunately, neither RAST nor PGAP reproduced this erroneous annotation and the D39V annotation includes these sites. Subsequently, we continued with correcting annotated translational initiation sites (TISs, start codons). While accurate TIS identification is challenging, 45 incorrectly annotated start codons could be identified by looking at the relative position of the corresponding transcriptional start sites (TSS, +1, described below). These TISs were corrected in the D39V annotation (Supplementary Table S3). Finally, we evaluated the genome-wide quality of TISs using a statistical model that compares the observed and expected distribution of the positions of alternative TISs relative to an annotated TIS [15]. The developers suggested that a correlation score below 0.9 is indicative of poorly annotated TISs. In contrast to the D39W (0.899) and PGAP (0.873) annotations, our curated D39V annotation (0.945) excels on the test, emphasizing our annotation’s added value to pneumococcal research.

Paired-end sequencing data contains the key to detection of small RNA features

After the sequence- and database-driven annotation process, we proceeded to study the transcriptome of S. pneumoniae. We pooled RNA from cells grown under four different conditions (Chapter 3), to maximize the number of expressed genes. Strand-specific, paired-end RNA-seq data of the control library was used to extract start and end points and fragment sizes of the sequenced fragments. In Fig. 4A, the fragment size distribution of the entire library is shown, with a mode of approximately 150 nucleotides and a skew towards larger fragments. We applied a peak-calling routine to determine the putative 3’-ends of sequenced transcripts. For each of the identified peaks, we extracted all read pairs that terminated in that specific peak region and compared the size distribution of that subset of sequenced fragments to the library-wide distribution to identify putative sRNAs (see Methods). We focused on sRNA candidates that were found in intergenic regions. Using the combination of sequencing-driven detection, Northern blotting (Supplementary Fig. S5), convincing homology with previously validated sRNAs, and/or presence of two or more regulatory features (e.g. TSSs and terminators, see below), we identified 63 small RNA features. We annotated 39 of these as sRNAs (Table 2) and 24 as riboswitches (Supplementary Table S6).

Until now, several small RNA features have been reliably validated by Northern blot in S. pneumoniae strains D39, R6 and TIGR4 [46–50]. Excluding most validation reports by Mann et al. due to discrepancies found in their data, 34 validated sRNAs were conserved in D39V. Among the 63 here-detected features, we recovered and refined the coordinates of 33 out of those 34 sRNAs, validating our sRNA detection approach.

One of the detected sRNAs is the highly abundant 6S RNA (Fig. 4B, left), encoded by ssrS, which is involved in transcription regulation. Notably, both automated annotations (RAST and PGAP) failed to report this RNA feature. We observed two different sizes for this feature, probably corresponding to a native and a processed transcript. Interestingly, we also observed a transcript containing both ssrS and the downstream tRNA gene. The absence of a TSS between the two genes suggests that the tRNA is processed from this long transcript (Fig. 4B, right).

Other detected small transcripts include three type I toxin-antitoxin systems as previously predicted based on orthology [51]. Unfortunately, previous annotations omit these systems. Type I toxin-antitoxin systems consist of a toxin peptide (SPV_2132/SPV_2448/SPV_2450) and an antitoxin sRNA (SPV_2131/SPV_2447/SPV_2449). Furthermore, SPV_2120 encodes a novel sRNA that is antisense to the 3’-end of mutR1 (SPV_0144), which encodes a transcriptional regulator (Fig. 4C) and might play a role in
