University of Groningen
Quantifying the transcriptome of a human pathogen
Aprianto, Rieza
Document Version
Publisher's PDF, also known as Version of record
Publication date:
2018
Citation for published version (APA):
Aprianto, R. (2018). Quantifying the transcriptome of a human pathogen: Exploring transcriptional
adaptation of Streptococcus pneumoniae under infection-relevant conditions. Rijksuniversiteit Groningen.
Quantifying the transcriptome of a human pathogen
Exploring transcriptional adaptation of Streptococcus pneumoniae under infection-relevant conditions
The scientific studies presented in the thesis were performed in the Molecular Genetics group of the Groningen Biomolecular Sciences and Biotechnology Institute, Faculty of Science and Engineering, University of Groningen, The Netherlands. The studies were financially supported by the European Research Council (ERC) Starting Grant awarded to Jan-Willem Veening. Printing was supported by the Graduate School of Science and Engineering and the University Library of the University of Groningen.
ISBN: 978-94-034-0759-3
ISBN: 978-94-034-0760-9 (ebook)
Printing: Eikon +
Cover & layout: Lovebird design.
www.lovebird-design.com
© R. Aprianto, Groningen, the Netherlands, 2018
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, without written permission of the author.
Quantifying the transcriptome of a human pathogen
Exploring transcriptional adaptation of Streptococcus pneumoniae under infection-relevant conditions
PhD thesis
to obtain the degree of PhD at the University of Groningen on the authority of the Rector Magnificus Prof. E. Sterken and in accordance with the decision by the College of Deans.
This thesis will be defended in public on Friday, 13 July 2018 at 14.30 hours
by
Rieza Aprianto
born on 12 April 1986 in Bandung, Indonesia
Supervisors
Prof. J.-W. Veening
Prof. O.P. Kuipers
Assessment Committee
Prof. M. Heinemann
Prof. J.M. van Dijl
Prof. N. van Sorge
O Sapientia, quae ex ore Altissimi prodiisti, attingens a fine usque ad finem,
fortiter suaviterque disponens omnia: veni ad docendum nos viam prudentiae.
Table of Contents
CHAPTER 1
Introduction
CHAPTER 2
Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39
CHAPTER 3
High-resolution analysis of the pneumococcal transcriptome under a wide range of infection-relevant conditions
CHAPTER 4
Bright fluorescent Streptococcus pneumoniae for live-cell imaging of host-pathogen interactions
CHAPTER 5
Time-resolved dual RNA-seq reveals extensive rewiring of lung epithelial and pneumococcal transcriptomes during early infection
CHAPTER 6
Discussion and future perspective
CHAPTER 7
Summary
Academic summary
Academische samenvatting
Ringkasan akademik
Acknowledgements
1
Introduction
Streptococcus pneumoniae: the ancient scourge of modern society
Five centuries before the Common Era, Hippocrates of Kos (c. 460–367 B.C.), the Father of Medicine, described the diagnosis of and remedy for pneumonia, a form of infection of the lower respiratory tract [1]. In his influential corpus, he referred to pneumonia as “those which the ancients named”, indicating that the scourge of pneumonia was known to societies even earlier than the Ancient Greeks. Two and a half millennia later, lower respiratory tract infections (LRTIs) are still very much a part of modern society. A recent report showed that LRTIs are the deadliest communicable disease and the fifth most common cause of death globally [2]. In addition, these infections are a principal cause of loss of healthy life (measured in disability-adjusted life years, a combination of mortality and morbidity), ranking right behind ischemic heart disease [3]. Pneumonia, an infection of the lung alveoli [4], is the most important form of lower respiratory tract infection.
The most prominent etiologic agent of pneumonia is the Gram-positive opportunistic pathogen Streptococcus pneumoniae. This bacterium is responsible for the majority of LRTI cases, single-handedly placing LRTIs as the deadliest infectious disease [2]. Aside from pneumonia, S. pneumoniae causes milder infections, such as otitis media and sinusitis, and other severe and lethal infections, including meningitis and septicemia [5]. These pneumococcal infections are distinguished by high mortality rates in young children, with 59% of pneumococcal meningitis cases and 45% of septicemia cases resulting in death. In particular, pneumococcal-related mortality is higher in African children than in children from other continents [6]. Although developing countries tend to bear the brunt of pneumococcal disease, wealthy developed societies have recently reported a high incidence of pneumococcal infections in the elderly population [7], making the pneumococcus a general health issue for all human populations.
In most cases, the pneumococcus resides in the host nasopharyngeal passage without causing symptoms. In fact, S. pneumoniae is part of the typical microbiota of the upper respiratory tract [8–10]. Pneumococcal carriage begins in the first two years of life [11] and colonization rates depend […] people in a household [13], day-care attendance [14], number of other children [15], bed-sharing and malnutrition [16]. This asymptomatic colonization is a prerequisite for further pneumococcal infections [17]. Because of the impact of the pneumococcus on general health, vaccination programs against it have been introduced. Unfortunately, these programs have had limited success, with vaccine-target strains being replaced by non-vaccine strains capable of causing invasive infections [5].
In the 1940s, the treatment of pneumococcal pneumonia benefitted greatly from the introduction of sulfonamides and penicillin in the clinic [18]. However, pneumococcal resistance to penicillin and other antimicrobials quickly spread worldwide [19]. Soon afterwards, resistance to more than one antibiotic was reported in S. pneumoniae and, more worryingly, half of invasive pneumococcal cases in the United States were resistant to at least one antibiotic [20]. In addition, pneumococcal resistance to a wide range of clinically relevant antibiotics has been reported around the globe [21,22].
Unlikely help: pneumococcus assisting biological research
Five years after Sternberg [23] and Pasteur [24] independently reported the pathogenic potential of S. pneumoniae, Fraenkel [25] called the bacterium the pneumococcus because of its propensity to cause pneumonia. Later, the bacterium was renamed Diplococcus pneumoniae by the Society of American Bacteriologists [26], referring to its characteristic shape under the microscope, which resembles a pair of cocci. Finally, in 1974, the pneumococcus was reclassified under the genus Streptococcus [27]. Since its discovery, the bacterium has been the subject of seminal breakthroughs, including its role in the discovery of Gram staining [28], the demonstration of the antigenic properties of polysaccharides [29], and the first successful case of penicillin treatment of a clinical infection [30].
The most influential contribution of the pneumococcus to biological research is the conclusive evidence that DNA carries the genetic code (RNA-based viruses were later discovered to be the exception to this rule). Griffith was the first to show that a phenotype, in this specific case the expression of a capsule, can be transferred from a capsule-producing strain to a non-encapsulated S. pneumoniae strain inside a murine host [31]. The result was then verified, expanded and optimized [32–35]. Avery et al. built on these observations and refined the method to determine that DNA is the material that mediated the phenotypic transfer between the two pneumococcal strains [36]. We now recognize that, in all but one pneumococcal serotype, all known capsule genes are encoded in a single operon, the cps operon, which is located between two conserved genes, dexB and aliA [37]. This genomic arrangement of the capsule genes allows easy and efficient transfer between strains and, in turn, facilitates capsule switching among different serotypes. We now understand that, by exploiting this strategy, S. pneumoniae may escape vaccine-induced immune pressure [38].
Biological systems, genome-wide approach and RNA-sequencing
Reductionism has been a successful and crucial approach in molecular biology, owing to its ability to connect a gene or set of genes (genotype) to a measurable trait (phenotype). Reductionist studies clarified that the presence of a pneumococcal capsule and its serotype are determined by the presence and precise genotype of the cps operon. Unfortunately, reductionism fails to explain complicated biological phenomena, including interspecies interaction, immunity and infection. In the last decade, reductionism has been giving way to holism, a perspective that considers the interactions between every component of the system. This approach appreciates the continuous interaction between all components, biotic and abiotic, and their simultaneous modification. Furthermore, while the environment constrains biological components, the living component changes the environment to suit its needs [39], emphasizing the transient and dynamic nature of the system. The complexity of this system causes novel properties to emerge, which in turn determine the direction and characteristics of the system [40].
Bacterial infection is a classic example of such a complex biological system. At infection sites, the pathogen multiplies and acquires nutrients to fuel its expansion while simultaneously evading the immune response. The host, on the other hand, struggles to remove invading pathogens through innate immune responses and specialized cells. The pathogen and the host interact intimately and alter the environment according to their respective needs. In order to elucidate the emergent properties of such a complex system and to gain a bird's-eye view of the phenomenon, we need to measure every parameter exhaustively and comprehensively [41].
In addition, the paradigm shift has been spurred by cutting-edge advances, especially in sequencing technology and the processing of large biological datasets. Transcriptomics, for example, gives researchers unprecedented access to the genome-wide transcriptome and to the way it changes during a specific phenomenon. Over the last four decades, sequencing technology has been refined into its current high-throughput incarnation. When introduced, sequencing was used to decipher the genetic code (genomics); later, coupled with cDNA generation, it was employed to decipher the transcriptome (transcriptomics) through RNA-seq [42]. Because of improvements in library preparation and sequencing efficiency, more nucleotides can be sequenced in less time, driving down cost and justifying sequencing as a routine protocol [43]. Prior to the widespread use of sequencing-based technology, array-based technologies, such as microarrays and tiling arrays, were the platforms of choice for transcriptomics studies. Compared to array-based approaches, RNA-seq has a wider dynamic range, resulting in better detection of transcript boundaries and more powerful differential expression analysis [44].
Furthermore, recent reports dispel the myth of the simplicity of the prokaryotic transcriptome. Rather, the bacterial transcriptome is as complex as the eukaryotic transcriptome [45,46]. In addition, sequencing results have permitted the elucidation of genomic architecture and of the regulatory structures of gene expression. For example, bacteria employ a wide range of small RNAs, both cis- and trans-acting, to regulate gene expression [47]. Additionally, multiple start sites permit alternative forms of an operon [48], further expanding its genomic potential. As an illustration, the genome-wide examination of the bacterium Helicobacter pylori, which has a small genome (1.7 Mbp), showed ubiquitous transcription start sites, both inside and opposite coding sequences, and numerous sRNAs [49], generating an ample genomic repertoire. Unbiased genome-wide surveys have identified novel non-coding features and newly annotated regions [50] as well as specific regulons [51]. Multiple layers of regulation combined with condition-specific regulatory features [52,53] allow specialized and flexible regulation of gene expression.
Additionally, quantitative transcriptomics facilitates genome-wide differential analysis of gene expression. The precise measurement of transcript abundance facilitates the discovery of the effects of gene depletion [54,55], the elucidation of stress responses [56] and the detailed mapping of the expression of pathogenicity islands [57]. In particular, Westermann et al. proposed, as a thought experiment, an approach to measure host and pathogen transcriptomes simultaneously during infection [58]. Later, the approach elucidated a bacterial sRNA important for intracellular survival [59]. The approach has also been extended to whole infected organs [60], thereby highlighting how individual hosts respond to a pathogen. Finally, the combination of the paradigm shift towards a holistic approach, the availability of (sequencing) technology and the discovery of complex, niche-specific bacterial gene regulation has spurred renewed interest in the exploration of the prokaryotic transcriptome.
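As a purely illustrative aside, the idea of quantitative differential expression can be sketched in a few lines. The sketch assumes that raw read counts per gene are scaled to counts per million (CPM) to correct for sequencing depth and that a log2 fold change is then computed between two conditions. The gene names, counts and pseudocount are hypothetical, and the studies cited above rely on dedicated, replicate-aware statistical methods that this sketch does not attempt to reproduce.

```python
import math

def counts_per_million(counts):
    """Scale raw read counts so that each library sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def log2_fold_change(control, treatment, pseudocount=1.0):
    """log2(treatment / control) on CPM values; the pseudocount avoids log(0)."""
    cpm_c = counts_per_million(control)
    cpm_t = counts_per_million(treatment)
    return {
        gene: math.log2((cpm_t[gene] + pseudocount) / (cpm_c[gene] + pseudocount))
        for gene in control
    }

# Hypothetical raw counts for three genes in two conditions.
control = {"geneA": 100, "geneB": 300, "geneC": 600}
treatment = {"geneA": 400, "geneB": 300, "geneC": 300}

lfc = log2_fold_change(control, treatment)
for gene, value in sorted(lfc.items()):
    print(f"{gene}\t{value:+.2f}")
```

In this toy example, geneA increases roughly fourfold (log2 fold change close to +2), geneB is unchanged and geneC is roughly halved (close to -1).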
Subtlety and complexity of the pneumococcal transcriptome
Pathogenicity islands usually harbor the genomic potential of pathogenic bacteria, including toxins and other virulence factors. Unfortunately, the determination of pathogenicity islands in S. pneumoniae has proven to be impossible [61–63]. For example, the capsule-encoding cps operon, a well-described pneumococcal virulence factor, is conserved both in clinical strains of S. pneumoniae and in closely related non-pathogenic streptococcal species [64]. The presence of genes and clusters of genes, it seems, does not determine pneumococcal virulence or pathogenicity. On the contrary, pneumococcal virulence might be determined by more subtle mutations that allow the pneumococcus to precisely regulate the expression of virulence factors in response to environmental signals [65]. For example, mutations in the untranslated regions preceding (5’-UTR) and following (3’-UTR) virulence genes may modify their expression levels and thus determine overall pneumococcal virulence. In addition, the multiple small non-coding RNAs that have been reported in S. pneumoniae further enrich the pneumococcal regulatory repertoire [66–69], including the regulation of virulence factors.
Applying cutting-edge sequencing technologies to understand pneumococcal biology
In this dissertation, we applied recent advances in high-throughput sequencing (mostly of RNA) to reveal the detailed organization of genetic features and pneumococcal transcription in infection models (Fig. 1). In Chapter 2, we revisited the basic genomic information of the Veening lab stock of S. pneumoniae strain D39: the sequence, the annotation and the operon architecture, effectively rendering it the best-described pneumococcal strain to date. Strain D39 has been a major workhorse of pneumococcal research [70]. Next, we generated a compendium of the pneumococcal transcriptome by exposing the strain to conditions relevant to the bacterial lifestyle of colonization and infection (Chapter 3). In the same chapter, we generated a simple yet powerful gene network in the form of a co-expression matrix. In Chapter 4, we established, inter alia, an infection model that contains the pneumococcus and a live, confluent layer of human lung epithelial cells. Subsequently, we applied a dual transcriptomics approach to this infection model to simultaneously measure the dynamic transcriptional rewiring of epithelial cells and S. pneumoniae during early infection (Chapter 5). Finally, in Chapter 6, we summarized the findings and placed them in the current scientific context.
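The co-expression matrix mentioned for Chapter 3 can be illustrated with a minimal sketch. It assumes that co-expression is scored as the Pearson correlation between the expression profiles of two genes across conditions; the measure actually used in Chapter 3 is not specified in this introduction, and the gene names and values below are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coexpression_matrix(profiles):
    """Return {gene_i: {gene_j: r}} for every pair of genes."""
    genes = sorted(profiles)
    return {g: {h: pearson(profiles[g], profiles[h]) for h in genes} for g in genes}

# Hypothetical normalized expression of three genes across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises together with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls as geneA rises
}

matrix = coexpression_matrix(profiles)
for gene in sorted(matrix):
    print(gene, [round(matrix[gene][other], 2) for other in sorted(matrix)])
```

Genes whose profiles rise and fall together score close to +1, and genes with opposite behavior score close to -1. Thresholding such a matrix is one simple way to turn it into a gene network.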
References
1. Hippocrates. On Regimen in Acute Diseases. (CreateSpace Independent Publishing Platform, 400BC).
2. Troeger, C. et al. Estimates of the global, regional, and national morbidity, mortality, and aetiologies of lower respiratory tract infections in 195 countries: a systematic analysis for the Global Burden of Disease Study 2015. Lancet Infect. Dis. 17, 1133–1161 (2017).
3. Kassebaum, N. J. et al. Global, regional, and national disability-adjusted life-years (DALYs) for 315 diseases and injuries and healthy life expectancy (HALE), 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. The Lancet 388, 1603–1658 (2016).
4. Finegold, S. M. & Johnson, C. C. Lower respiratory tract infection. Am. J. Med. 79, 73–77 (1985).
5. Henriques-Normark, B. & Tuomanen, E. I. The pneumococcus: epidemiology, microbiology, and pathogenesis. Cold Spring Harb. Perspect. Med. 3, (2013).
6. O'Brien, K. L. et al. Burden of disease caused by Streptococcus pneumoniae in children younger than 5 years: global estimates. Lancet Lond. Engl. 374, 893–902 (2009).
7. Welte, T., Torres, A. & Nathwani, D. Clinical and economic burden of community-acquired pneumonia among adults in Europe. Thorax 67, 71–79 (2012).
8. Bosch, A. A. T. M. et al. Development of upper respiratory tract microbiota in infancy is affected by mode of delivery. EBioMedicine 9, 336–345 (2016).
9. Bosch, A. A. T. M. et al. Nasopharyngeal carriage of Streptococcus pneumoniae and other bacteria in the 7th year after implementation of the pneumococcal conjugate vaccine in the Netherlands. Vaccine 34, 531–539 (2016).
Fig. 1. Overview of thesis. We annotated the genome of S. pneumoniae D39 and defined genome-wide transcriptional units by precisely mapping start and termination sites (Chapter 2). Next, we elucidated transcriptional responses to a wide array of conditions relevant to the pneumococcal lifestyle (Chapter 3). We then established a host-pathogen infection model (Chapter 4), which we exploited to […]
10. Miller, E., Andrews, N. J., Waight, P. A., Slack, M. P. & George, R. C. Herd immunity and serotype replacement 4 years after seven-valent pneumococcal conjugate vaccination in England and Wales: an observational cohort study. Lancet Infect. Dis. 11, 760–768 (2011).
11. Dagan, R., Melamed, R., Muallem, M., Piglansky, L. & Yagupsky, P. Nasopharyngeal colonization in southern Israel with antibiotic-resistant pneumococci during the first 2 years of life: relation to serotypes likely to be included in pneumococcal conjugate vaccines. J. Infect. Dis. 174, 1352–1355 (1996).
12. Adegbola, R. A. et al. Carriage of Streptococcus pneumoniae and other respiratory bacterial pathogens in low and lower-middle income countries: a systematic review and meta-analysis. PloS One 9, e103293 (2014).
13. Reisman, J. et al. Risk factors for pneumococcal colonization of the nasopharynx in Alaska native adults and children. J. Pediatr. Infect. Dis. Soc. 3, 104–111 (2014).
14. Hjuler, T. et al. Perinatal and crowding-related risk factors for invasive pneumococcal disease in infants and young children: a population-based case-control study. Clin. Infect. Dis. Off. Publ. Infect. Dis. Soc. Am. 44, 1051–1056 (2007).
15. Wyllie, A. L. et al. Molecular surveillance on Streptococcus pneumoniae carriage in non-elderly adults; little evidence for pneumococcal circulation independent from the reservoir in children. Sci. Rep. 6, 34888 (2016).
16. Howie, S. R. C. et al. Childhood pneumonia and crowding, bed-sharing and nutrition: a case-control study from The Gambia. Int. J. Tuberc. Lung Dis. Off. J. Int. Union Tuberc. Lung Dis. 20, 1405–1415 (2016).
17. Bogaert, D., De Groot, R. & Hermans, P. W. M. Streptococcus pneumoniae colonisation: the key to pneumococcal disease. Lancet Infect. Dis. 4, 144–154 (2004).
18. Austrian, R. & Gold, J. Pneumococcal bacteremia with especial reference to bacteremic pneumococcal pneumonia. Ann. Intern. Med. 60, 759–776 (1964).
19. Jacobs, M. R. et al. Emergence of multiply resistant pneumococci. N. Engl. J. Med. 299, 735–740 (1978).
20. Whitney, C. G. et al. Increasing prevalence of multidrug-resistant Streptococcus pneumoniae in the United States. N. Engl. J. Med. 343, 1917–1924 (2000).
21. Imai, S. et al. High prevalence of multidrug-resistant pneumococcal molecular epidemiology network clones among Streptococcus pneumoniae isolates from adult patients with community-acquired pneumonia in Japan. Clin. Microbiol. Infect. Off. Publ. Eur. Soc. Clin. Microbiol. Infect. Dis. 15, 1039–1045 (2009).
22. Riedel, S. et al. Antimicrobial use in Europe and antimicrobial resistance in Streptococcus pneumoniae. Eur. J. Clin. Microbiol. Infect. Dis. Off. Publ. Eur. Soc. Clin. Microbiol. 26, 485–490 (2007).
23. Sternberg, G. M. A fatal form of septicaemia in the rabbit produced by the subcutaneous injection of human saliva: an experimental research. (John Murphy & Company, 1881).
24. Pasteur, L. Sur une maladie nouvelle provoquée par la salive d’un enfant mort de la rage. (1881). Available at: https://fr.wikisource.org/wiki/Page:Pasteur_-_%C5%92uvres_compl%C3%A8tes,_tome_6.djvu/12. (Accessed: 2nd February 2018)
25. Watson, D. A., Musher, D. M., Jacobson, J. W. & Verhoef, J. A brief history of the pneumococcus in biomedical research: a panoply of scientific discovery. Clin. Infect. Dis. Off. Publ. Infect. Dis. Soc. Am. 17, 913–924 (1993).
26. Winslow, C. E. et al. The families and genera of the bacteria: final report of the Committee of the Society of American Bacteriologists on characterization and classification of bacterial types. J. Bacteriol. 5, 191–229 (1920).
27. Holt, J. G. Bergey’s manual of determinative bacteriology (7th ed.). Am. J. Public Health Nations Health 54, 544 (1964).
28. Austrian, R. Pneumococcus: the first one hundred years. Rev. Infect. Dis. 3, 183–189 (1981).
29. Heidelberger, M., Aisenberg, A. C. & Hassid, W. Z. Glycogen, an immunologically specific polysaccharide. J. Exp. Med. 99, 343–353 (1954).
30. Epifano, L. D., Brandstetter, R. D. & Brandstetter, R. D. Historical aspects of pneumonia. in The Pneumonias 1–14 (Springer, New York, NY, 1993). doi:10.1007/978-1-4613-9766-3_1
31. Griffith, F. The significance of pneumococcal types. J. Hyg. (Lond.) 27, 113–159 (1928).
32. Alloway, J. L. The transformation in vitro of R pneumococci into S forms of different specific types by the use of filtered pneumococcus extracts. J. Exp. Med. 55, 91–99 (1932).
33. Alloway, J. L. Further observations on the use of pneumococcus extracts in effecting transformation of type in vitro. J. Exp. Med. 57, 265–278 (1933).
34. Dawson, M. H. The interconvertibility of ‘R’ and ‘S’ forms of pneumococcus. J. Exp. Med. 47, 577–591 (1928).
35. Dawson, M. H. & Sia, R. H. P. In vitro transformation of pneumococcal types. J. Exp. Med. 54, 681–699 (1931).
36. Avery, O. T., MacLeod, C. M. & McCarty, M. Studies on the chemical nature of the substance inducing transformation of pneumococcal types. J. Exp. Med. 79, 137–158 (1944).
37. Wyres, K. L. et al. Pneumococcal capsular switching: a historical perspective. J. Infect. Dis. 207, 439–449 (2013).
38. Geno, K. A. et al. Pneumococcal capsules and their types: past, present, and future. Clin. Microbiol. Rev. 28, 871–899 (2015).
39. Galtier, N. & Dutheil, J. Coevolution within and between genes. Genome Dyn. 3, 1–12 (2007).
40. Mazzocchi, F. Complexity in biology. Exceeding the limits of reductionism and determinism using complexity theory. EMBO Rep. 9, 10–14 (2008).
41. Sorek, R. & Cossart, P. Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat. Rev. Genet. 11, 9–16 (2010).
42. Shendure, J. et al. DNA sequencing at 40: past, present and future. Nature 550, 345–353 (2017).
43. Goodwin, S., McPherson, J. D. & McCombie, W. R. Coming of age: ten years of next-generation sequencing technologies. Nat. Rev. Genet. 17, 333–351 (2016).
44. Agarwal, A. et al. Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays. BMC Genomics 11, 383 (2010).
45. Guell, M. et al. Transcriptome complexity in a genome-reduced bacterium. Science 326, 1268–1271 (2009).
46. Selinger, D. W., Saxena, R. M., Cheung, K. J., Church, G. M. & Rosenow, C. Global RNA half-life analysis in Escherichia coli reveals positional patterns of transcript degradation. Genome Res. 13, 216–223 (2003).
47. Croucher, N. J. & Thomson, N. R. Studying bacterial transcriptomes using RNA-seq. Curr. Opin. Microbiol. 13, 619–624 (2010).
48. Qiu, Y. et al. Structural and operational complexity of the Geobacter sulfurreducens genome. Genome Res. 20, 1304–1311 (2010).
49. Sharma, C. M. et al. The primary transcriptome of the major human pathogen Helicobacter pylori. Nature 464, 250–255 (2010).
50. Perkins, T. T. et al. A strand-specific RNA-seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS Genet. 5, e1000569 (2009).
51. Sittka, A. et al. Deep sequencing analysis of small noncoding RNA and mRNA targets of the global post-transcriptional regulator, Hfq. PLOS Genet. 4, e1000163 (2008).
52. Cho, B.-K. et al. The transcription unit architecture of the Escherichia coli genome. Nat. Biotechnol. 27, 1043–1049 (2009).
53. Yoder-Himes, D. R. et al. Mapping the Burkholderia cenocepacia niche response via high-throughput sequencing. Proc. Natl. Acad. Sci. U. S. A. 106, 3976–3981 (2009).
54. Botella, L., Vaubourgeix, J., Livny, J. & Schnappinger, D. Depleting Mycobacterium tuberculosis of the transcription termination factor Rho causes pervasive transcription and rapid death. Nat. Commun. 8, 14731 (2017).
55. Liu, X. et al. High-throughput CRISPRi phenotyping identifies new essential genes in Streptococcus pneumoniae. Mol. Syst. Biol. 13, (2017).
56. Legeret, B. et al. Lipidomic and transcriptomic analyses of Chlamydomonas reinhardtii under heat stress unveil a direct route for the conversion of membrane lipids into storage lipids. Plant Cell Environ. 39, 834–847 (2016).
57. Kroger, C. et al. An infection-relevant transcriptomic compendium for Salmonella enterica Serovar Typhimurium. Cell Host Microbe 14, 683–695 (2013).
58. Westermann, A. J., Gorski, S. A. & Vogel, J. Dual RNA-seq of pathogen and host. Nat. Rev. Microbiol. 10, 618–630 (2012).
59. Westermann, A. J. et al. Dual RNA-seq unveils noncoding RNA functions in host-pathogen interactions. Nature 529, 496–501 (2016).
60. Thanert, R., Goldmann, O., Beineke, A. & Medina, E. Host-inherent variability influences the transcriptional response of Staphylococcus aureus during in vivo infection. Nat. Commun. 8, 14268 (2017).
61. Blomberg, C. et al. Pattern of accessory regions and invasive disease potential in Streptococcus pneumoniae. J. Infect. Dis. 199, 1032–1042 (2009).
62. Obert, C. et al. Identification of a candidate Streptococcus pneumoniae core genome and regions of diversity correlated with invasive pneumococcal disease. Infect. Immun. 74, 4766–4777 (2006).
63. Silva, N. A. et al. Genomic diversity between strains of the same serotype and multilocus sequence type among pneumococcal clinical isolates. Infect. Immun. 74, 3513–3518 (2006).
64. Donati, C. et al. Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol. 11, R107 (2010).
65. Tettelin, H. et al. Genomics, genetic variation, and regions of differences. in Streptococcus pneumoniae: Molecular Mechanisms of Host-Pathogen Interactions (eds. Brown, J., Hammerschmidt, S. & Orihuela, C.) 5, 81–107 (Academic Press, 2015).
66. Acebo, P., Martin-Galiano, A. J., Navarro, S., Zaballos, A. & Amblar, M. Identification of 88 regulatory small RNAs in the TIGR4 strain of the human pathogen Streptococcus pneumoniae. RNA 18, 530–546 (2012).
67. Kumar, R. et al. Identification of novel non-coding small RNAs from Streptococcus pneumoniae TIGR4 using high-resolution genome tiling arrays. BMC Genomics 11, 350 (2010).
68. Mann, B. et al. Control of virulence by small RNAs in Streptococcus pneumoniae. PLoS Pathog. 8, e1002788 (2012).
69. Tsui, H.-C. T. et al. Identification and characterization of noncoding small RNAs in Streptococcus pneumoniae serotype 2 strain D39. J. Bacteriol. 192, 264–279 (2010).
70. Lanie, J. A. et al. Genome sequence of Avery’s virulent serotype 2 strain D39 of Streptococcus pneumoniae and comparison with that of unencapsulated laboratory strain R6. J. Bacteriol. 189, 38–51 (2007).
2
Deep genome annotation of the opportunistic human pathogen Streptococcus pneumoniae D39
Jelle Slager a,#, Rieza Aprianto a,# and Jan-Willem Veening a,b
a Molecular Genetics Group, Groningen Biomolecular Sciences and Biotechnology Institute, Centre for Synthetic Biology, University of Groningen, Nijenborgh 7, 9747 AG Groningen, the Netherlands
b Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Biophore Building, CH-1015 Lausanne, Switzerland
# The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint first authors
bioRxiv 2018 | https://doi.org/10.1101/283663 | 22 March 2018
Under revision for Nucleic Acids Research
RA designed the research on transcriptional start and terminator sites, including leaderless genes, performed the experiments, generated the strains, analysed the data and wrote relevant sections. RA performed the same contributions for pyrimidine riboswitches
Abstract
A precise understanding of the genomic organization into transcriptional units and their regulation is essential for our comprehension of opportunistic human pathogens and how they cause disease. Using single-molecule real-time (PacBio) sequencing we unambiguously determined the genome sequence of Streptococcus pneumoniae strain D39 and revealed several inversions previously undetected by short-read sequencing. Significantly, a chromosomal inversion results in antigenic variation of PhtD, an important surface-exposed virulence factor. We generated a new genome annotation using automated tools, followed by manual curation, reflecting the current knowledge in the field. By combining sequence-driven terminator prediction, deep paired-end transcriptome sequencing and enrichment of primary transcripts by Cappable-Seq, we mapped 1,015 transcriptional start sites and 748 termination sites. Using this new genomic map, we identified several new small RNAs (sRNAs), riboswitches (including twelve previously misidentified as sRNAs), and antisense RNAs. In total, we annotated 92 new protein-encoding genes, 39 sRNAs and 165 pseudogenes, bringing the S. pneumoniae D39 repertoire to 2,151 genetic elements. We report operon structures and observed that 9% of operons lack a 5’-UTR. The genome data is accessible in an online resource called PneumoBrowse (https://veeninglab.com/pneumobrowse) providing one of the most complete inventories of a bacterial genome to date. PneumoBrowse will accelerate pneumococcal research and the development of new prevention and treatment strategies.
Introduction
Ceaseless technological advances have revolutionized our capability to determine genome sequences as well as our ability to identify and annotate functional elements, including transcriptional units on these genomes. Several resources have been developed to organize current knowledge on the important opportunistic human pathogen Streptococcus pneumoniae, or the pneumococcus1–3. However, an accurate genome map with an up-to-date and extensively curated genome annotation is missing.
The enormous increase of genomic data on various servers, such as NCBI and EBI, and the associated decrease in consistency has, in recent years, led to the Prokaryotic RefSeq Genome Reannotation Project. Every bacterial genome present in the NCBI database was re-annotated using the so-called Prokaryotic Genome Annotation Pipeline (PGAP)4, with the goal of increasing the quality and consistency of the many available annotations. This Herculean effort indeed created a more consistent set of annotations that facilitates the propagation and interpolation of scientific findings in individual bacteria to general phenomena, valid in larger groups of organisms. On the other hand, a wealth of information is already available for well-studied bacteria like the pneumococcus. Therefore, a separate, manually curated annotation is essential to maintain oversight of the current knowledge in the field. Hence, we generated a resource for the pneumococcal research community that contains the most up-to-date information on the D39 genome, including its DNA sequence, transcript boundaries, operon structures and functional annotation. Notably, strain D39 is one of the workhorses in research on pneumococcal biology and pathogenesis. We analyzed the genome in detail, using a combination of several different sequencing techniques and a novel, generally applicable analysis pipeline (Fig. 1).
Using Single Molecule Real-Time (SMRT, PacBio RS II) sequencing, we sequenced the genome of the stock of serotype 2 S. pneumoniae strain D39 in the Veening laboratory, hereafter referred to as strain D39V. This strain is a far descendant of the original Avery strain that was used to demonstrate that DNA is the carrier of hereditary information5 (Supplementary Fig. S1). Combining Cappable-seq6, a novel sRNA detection method and several bioinformatic annotation tools, we deeply annotated the pneumococcal genome and transcriptome.
Finally, we created PneumoBrowse, an intuitive and accessible genome browser (https://veeninglab.com/pneumobrowse), based on JBrowse7. PneumoBrowse provides a graphical and user-friendly interface to explore the genomic and transcriptomic landscape of S. pneumoniae D39V and allows direct linking to gene expression and co-expression data in PneumoExpress (Chapter 3). The reported annotation pipeline and accompanying genome browser provide one of the best curated bacterial genomes currently available and may facilitate rapid and accurate annotation of other bacterial genomes. We anticipate that PneumoBrowse will significantly accelerate the pneumococcal research field and hence speed up the discovery of new drug targets and vaccine candidates for this devastating global opportunistic human pathogen.
Results
De novo assembly yields a single circular chromosome
We performed de novo genome assembly using SMRT sequencing data, followed by polishing with high-confidence Illumina reads, obtained in previous studies8,23. Since these data were derived from a derivative of D39, regions of potential discrepancy were investigated using Sanger sequencing. In the end, we needed to correct the SMRT assembly in only one location. The described approach yielded a single chromosomal sequence of 2,046,572 base pairs, which was deposited to GenBank (accession number CP027540).
D39V did not suffer disruptive mutations compared to ancestral strain NCTC 7466
We then compared the newly assembled genome with the previously established sequence of D3924 (D39W), and observed similar sequences, but with some striking differences (Table 1, Fig. 2A). Furthermore, we cross-checked both sequences with the genome sequence of the ancestral strain NCTC 7466 (ENA accession number ERS1022033), which was recently sequenced with SMRT technology, as part of the NCTC 3000 initiative. Interestingly, D39V matches NCTC 7466 in all gene-disruptive discrepancies (e.g. frameshifts and a chromosomal inversion, see below). Most of these sites are characterized by their repetitive nature (e.g. homopolymeric runs or long repeated sequences). Considering the sequencing technology
[Figure 1: flowchart with a Genome branch and a Transcriptome branch. Biological samples: genomic DNA (WT D39V; D39Vα) and total RNA (-rRNA, WT D39V). Treatment: size selection > 4 kb; untreated; 5’-enriched (Cappable); untreated control. Sequencing: SMRT (PacBio); Illumina paired-end unstranded; Illumina single-end stranded; Illumina paired-end stranded. Data analysis: de novo assembly (HGAP3); DNA methylation analysis; refined assembly; automated annotation (RAST/PGAP); curated annotation; mapping reads on genomeβ (bowtie2); enriched 5’ base (start); control 5’ base (start); control 3’ base (end); putative end peaks; terminator prediction (TransTermHP); terminators; transcription start sites (TSS); small RNA featuresγ. Databases & bioinformatic tools: PubMed, BLASTP/BLASTX, UniProtKB, CDD, tRNAscan-SE, ISfinder, BSRD, RegPrecise, MEME Suite.]

Fig. 1. Data analysis pipeline used for genome assembly and annotation. Left. DNA level, the genome sequence of D39V was determined by SMRT sequencing, supported by previously published Illumina data8,23. Automated annotation by the RAST10 and PGAP4 annotation pipelines was followed by curation based on information from literature and a variety of databases and bioinformatic tools. Right. RNA level, Cappable-seq6 was utilized to identify transcription start sites. Simultaneously, putative transcript ends were identified by combining reverse reads from paired-end, stranded sequencing of the control sample (i.e. not 5’-enriched). Terminators were annotated when such putative transcript ends overlapped with stem loops predicted by TransTermHP20. Finally, local fragment size enrichment in the paired-end sequencing data was used to identify putative small RNA features. αD39 derivative (bgaA::PssbB-luc; GEO accessions GSE54199 and GSE69729). βThe first 1 kbp of the genome file was duplicated at the end, to allow mapping over FASTA boundaries. γAnalysis was performed with only sequencing pairs that map uniquely
Table 1. Differences between old and new genome assembly. The genomic sequences of the old (D39W, CP000410) and new (D39V, CP027540) genome assemblies were compared, revealing 14 SNPs, 3 insertions, and 2 deletions. Additionally, a repeat expansion in pavB, several rearrangements in the hsdS locus and, most strikingly, a 162 kbp (8% of the genome) chromosomal inversion were observed. Finally, both sequences were compared to the recently released PacBio sequence of ancestral strain NCTC 7466 (ENA accession number ERS1022033). For each observed difference between the old and new assembly, the variant matching the ancestral strain is displayed in boldface. αLocus falls within the inverted ter region and the forward strand in the new assembly is therefore the reverse complement of the old sequence (CP000410). βRegion is part of a larger pseudogene in the new annotation. γOnly found in one of two D39 stocks in our laboratory.

D39W coordinate(s) | D39V coordinate(s) | Locus | Change | Consequence | Note
80663-83343 | 80663-83799 | SPD_0080 (pavB) | Repeat expansion (6x>7x) | Repeat expansion in PavB (6x>7x) |
174318 | 174774 | SPD_0170 (ruvA) | G>A | RuvA V52V (GTG>GTA) |
297022 | 297479 | SPD_0299-300 | +T | SPD_0299 and SPD_0300 shifted into same coding frame |
303240 | 303697 | SPD_0306 (pbp2x) | A>G | PBP2X N311D (AAT>GAT) |
458088-462242 | 458545-462699 | SPD_0450-55 13 (hsdRMS, creX) | Multiple rearrangements | HsdS type A>F |
462212 | 458575 | SPD_0453 (hsdS) | A>G | Imperfect > perfect inverted repeat | Inside rearranged region
675950 | 676407 | SPD_0657 → / → SPD_0658 (prfB) | C>A | Intergenic (+163/-51 nt) | In 5’ UTR of prfB
775672 | 776129 | SPD_0764 (sufS) | G>A | SufS G318R (GGA>AGA) |
816157 | 816615 | SPD_0800 β | +G | Frameshift (347/360 nt) |
901217-1062944 | 901675-1063403 | SPD_0889-1037 | Inversion | Swap of 3’ ends of phtB (SPD_1037) and phtD (SPD_0889) |
934443 | 1030177 | SPD_0921 (ccrB) | A>G α | CcrB Q286R (CAG>CGG) |
951536 | 1013083 γ | SPD_0942 | +C α | Frameshift (198/783 nt) |
1035166 | 929453 | SPD_1016 (rexA) | C>A α | RexA A961D (GCT>GAT) |
1080119 | 1080577 | SPD_1050 (lacD) | ΔT | Frameshift (159/981 nt) |
1171761 | 1172219 | SPD_1137 | C>G | H431Q (CAC>CAG) |
1256813 | 1257270 | SPD_1224 (budA) ← / → SPD_1225 | ΔA | Intergenic (-100/-42 nt) |
1256937 | 1257394 γ | SPD_1225 | G>T | R28L (CGC>CTC) |
1672084 | 1672541 | SPD_1660 (rdgB) | G>A | RdgB T117I (ACA>ATA) |
1676516 | 1676973 | SPD_1664 (treP) | C>T | TreP G359D (GGC>GAC) |
1787708 | 1788165 | SPD_1793 | C>T | A2V (GCA>GTA) |
1977728 | 1978185 | SPD_2002 (dltD) | C>A | DltD V252F (GTC>TTC) |
2022372 | 2022829 | SPD_2045 (mreC) | A>G | MreC S186P (TCT>CCT) |

[Figure 2A: genome alignment tracks for D39W (Winkler), D39V (Veening), NCTC 7466, SP49, SP61 and SP64, with pDP1 presence marks; genome axis from 0k to 2,000k.]
[Figure 2B: layout of the hsdS locus, labelled SpnD39IIIF, with hsdR, hsdM, hsdS-F, creX, inverted repeats IR1, IR2 and IR3, and segments [1.1-2.1], [1.2-2.3] and [2.2].]
[Figure 2C: modified motifs, shown with both strands. Legend: A, N6-methyladenosine (m6A).]

Motif (top / bottom strand) | Number of sites on genome | Number of sites modified | Responsible R-M system
CACNNNNNNNCTT / GTGNNNNNNNGAA | 796 | 796 | SpnD39IIIFβ (HsdR-M-S)
TCTAGA / AGATCT | 644 | 643 | SpnD39I (SPV_1259α-60)
TCGAG / AGCTC | 1509 | 1498 | SpnD39II (SPV_1079-80)

Fig. 2. Multiple genome alignment. A. Multiple genome sequence alignment of D39W, D39V, NCTC 7466, and clinical isolates SP49, SP61, and SP6430 reveals multiple ter-symmetrical chromosomal inversions. Identical colors indicate similar sequences, while blocks shown below the main genome level and carrying a reverse arrow signify inverted sequences relative to the D39W assembly. The absence/presence of the pDP1 (or similar) plasmid is indicated with a cross/checkmark. Asterisks indicate the position of the hsdS locus. B. Genomic layout of the hsdS region. As reported by Manso et al.26, the region contains three sets of inverted repeats (IR1-3), that are used by CreX to reorganize the locus. Thereby, six different variants (A-F) of methyltransferase specificity subunit HsdS can be generated, each leading to a distinct methylation motif. SMRT sequencing of D39V revealed that the locus exists predominantly in the F-configuration, consisting of N-terminal variant 2 (i.e. 1.2) and C-terminal variant 3 (i.e. 2.3). C. Motifs that were detected to be specifically modified in D39V SMRT data. αSPV_1259 (encoding the R-M system endonuclease) is a pseudogene, due to a nonsense mutation. βManso et al. reported the same motifs and reported the responsible methyltransferases. The observed CAC-N7-CTT motif perfectly matches the predicted putative HsdS-F motif.
used, these differences are likely to be the result of misassembly in D39W, rather than sites of true biological divergence. On the other hand, discrepancies between D39V and the ancestral strain are limited to SNPs, with unknown consequences for pneumococcal fitness. It seems plausible that these polymorphisms constitute actual mutations in D39V, emphasizing the dynamic nature of the pneumococcal genome. Notably, there are two sites where both the D39W and D39V assemblies differ from the ancestral strain. Firstly, the ancestral strain harbors a mutation in rrlC (SPV_1814), one of four copies of the gene encoding 23S ribosomal RNA. It is not clear if this is a technical artefact in one of the assemblies (due to the large repeat size in this region), or an actual biological difference. Secondly, we observed a mutation in the upstream region of cbpM (SPV_1248) in both D39W and D39V.
Several SNPs and indel mutations observed in D39V assembly
Fourteen single nucleotide polymorphisms (SNPs) were detected upon comparison of D39W and D39V assemblies. One of these SNPs results in a silent mutation in the gene encoding RuvA, the Holliday junction DNA helicase, while another SNP was located in the 5’-untranslated region (5’-UTR) of prfB, encoding peptide chain release factor 2. The other twelve SNPs caused amino acid changes in various proteins, including penicillin-binding protein PBP2X and cell shape-determining protein MreC. It should be noted that one of these SNPs, leading to an arginine to leucine change in the protein encoded by SPV_1225 (previously SPD_1225), was not found in an alternative D39 stock from our lab (Supplementary Fig. S1). The same applies to an insertion of a cytosine causing a frameshift in the extreme 3’-end of SPV_0942 (previously SPD_0942; Supplementary Fig. S2L). All other differences found, however, were identified in both of our stocks and are therefore likely to be more widespread. Among these differences are four more indel mutations (insertions or deletions), the genetic context and consequences of which are shown in Supplementary Fig. S2. Firstly, one of the indels is located in the promoter region of two diverging operons, with unknown consequences for gene expression. Secondly, we found an insertion in the region corresponding to SPD_0800 (D39W annotation). Here, we report this gene to be part of a pseudogene together with SPD_0801 (annotated as SPV_2242). Hence, the insertion probably is of little consequence. Thirdly, a deletion was observed in the beginning of lacD, encoding an important enzyme in the D-tagatose-6-phosphate pathway, relevant in galactose metabolism. The consequential absence of functional LacD may explain why the inactivation of the alternative Leloir pathway in D39 significantly hampered growth on galactose25. We repaired lacD in D39V and, as expected, observed restored growth on galactose (Supplementary Fig. S3). Finally, we observed a thymine insertion that caused SPD_0299 and SPD_0300 to be shifted into the same coding frame and form a single 1.9 kb long CDS (SPV_2142). Since the insertion was found in a homopolymeric run of thymines and the assemblies of NCTC 7466 and D39V match, it seems plausible that instead of a true indel mutation, this actually reflects a sequencing error in the D39W assembly.
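The reasoning that an indel inside a homopolymeric run is more likely a sequencing error than a true mutation can be turned into a simple screen. The following is a minimal sketch, not the procedure used in this work: the function names, the minimum run length of 5 and the toy sequence are illustrative assumptions.

```python
import re

def homopolymer_runs(seq, min_len=5):
    """Return (start, end, base) for each single-base run of at least min_len
    (0-based start, end exclusive). min_len=5 is an assumed threshold."""
    runs = []
    for m in re.finditer(r"(A+|C+|G+|T+)", seq.upper()):
        if m.end() - m.start() >= min_len:
            runs.append((m.start(), m.end(), m.group()[0]))
    return runs

def indel_in_homopolymer(seq, pos, min_len=5):
    """True if 0-based position pos lies inside or directly adjacent to a run."""
    return any(start - 1 <= pos <= end
               for start, end, _ in homopolymer_runs(seq, min_len))

toy = "ACGTTTTTTTGCA"  # hypothetical sequence with a run of seven thymines
print(homopolymer_runs(toy))
print(indel_in_homopolymer(toy, 6), indel_in_homopolymer(toy, 12))
```

An indel flagged this way would still need the kind of cross-check described above (agreement between independent assemblies) before being classified as an error.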
Varying repeat frequency in surface-exposed protein PavB
Pneumococcal adherence and virulence factor B (PavB) is encoded by SPV_0080. Our assembly shows that this gene contains a series of seven imperfect repeats of 450-456 bps in size. Interestingly, SPD_0080 in D39W contains only six of these repeats. If identical repeat units are indicated with an identical letter, the repeat region in SPV_0080 of D39V can be written as ABBCBDE, where E is truncated after 408 bps. Using the same letter code, SPD_0080 of D39W contains ABBCDE, thus lacking the third repeat of element B, which is isolated from the other copies in SPV_0080. Because D39V and NCTC 7466 contain the full-length version of the gene, we hypothesized that D39W lost one of the repeats, making the encoded protein 152 residues shorter.
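The letter code makes the comparison easy to check mechanically. The sketch below is illustrative only: the helper name is hypothetical, and the 456 bp unit length is an assumption taken from the upper end of the stated 450-456 bp range because it corresponds to the reported 152 residues.

```python
def lost_repeat_units(full, short):
    """Return every index in `full` whose removal yields `short`."""
    return [i for i in range(len(full)) if full[:i] + full[i + 1:] == short]

d39v_repeats = "ABBCBDE"  # SPV_0080, as written in the text
d39w_repeats = "ABBCDE"   # SPD_0080, as written in the text

indices = lost_repeat_units(d39v_repeats, d39w_repeats)
lost_units = [d39v_repeats[i] for i in indices]

# Assumed unit length of 456 bp; one codon is 3 bp.
residues_lost = 456 // 3

print(indices, lost_units, residues_lost)
```

Only one position can be deleted to turn ABBCBDE into ABBCDE, namely the isolated third copy of element B, which agrees with the description above.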
Configuration of variable hsdS region matches observed methylation pattern
A local rearrangement is found in the pneumococcal hsdS locus, encoding a three-component restriction-modification system (HsdRMS). Recombinase CreX facilitates local recombination, using three sets of inverted repeats, and can thereby rapidly rearrange the region into six possible configurations (SpnD39IIIA-F). This process results in six different versions of methyltransferase specificity subunit HsdS, each with its own sequence specificity and transcriptomic consequences26,27 as defined by single-molecule, real-time (SMRT) sequencing. The region is annotated in the A-configuration in D39W, while the F-configuration is predominant in D39V (Fig. 2B). Moreover, we employed methylation data, intrinsically present in SMRT data28, and observed an enriched methylation motif that exactly matches the putative SpnD39IIIF motif predicted by Manso et al. (Fig. 2C).
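To illustrate what counting occurrences of a motif such as CAC-N7-CTT involves, the sketch below scans a sequence for the motif and for its reverse complement, so that sites on either strand are found. It is a simplified illustration under stated assumptions: only N is supported as an ambiguity code, the function names and toy sequence are hypothetical, and detection of the methylation itself (from SMRT kinetics) is not modelled.

```python
import re

def revcomp(seq):
    """Reverse complement of a sequence made of A, C, G, T and N."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def motif_to_regex(motif):
    """Translate a motif into a regex; only the N wildcard is handled here."""
    return "".join("[ACGT]" if base == "N" else base for base in motif)

def find_sites(genome, motif):
    """0-based start positions of the motif on either strand.
    A set of patterns avoids double counting palindromic motifs."""
    genome = genome.upper()
    hits = []
    for pattern in {motif, revcomp(motif)}:
        # Lookahead so that overlapping occurrences are also reported.
        for m in re.finditer(f"(?=({motif_to_regex(pattern)}))", genome):
            hits.append(m.start())
    return sorted(hits)

toy = "TTCACAAAAAAACTTGG" + "AAGCCCCCCCGTGTT"  # hypothetical sequence
print(find_sites(toy, "CACNNNNNNNCTT"))
print(find_sites("GGTCTAGACC", "TCTAGA"))
```

Applied to a full genome sequence, the length of the returned list corresponds to the number of candidate sites for a motif; whether each site is actually modified has to come from the sequencing data.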
A large chromosomal inversion occurred multiple times in pneumococcal evolution
We also observed a striking difference between D39V and D39W: a 162 kbp region containing the replication terminus was completely inverted (Figs. 2A and 3), with D39V matching the configuration of the ancestral NCTC 7466. The inverted region is bordered by two inverted repeats of 1.3 kb in length. We noticed that the xerS/difSL site, responsible for chromosome dimer resolution and typically located directly opposite the origin of replication29, is asymmetrically situated on the right replichore in D39V (Fig. 3A), while the locus is much closer to the halfway point of the chromosome in the D39W assembly, suggesting that this configuration is the original one and the observed inversion in D39V and NCTC 7466 is a true genomic change, rather than merely a sequencing artefact. To confirm this, we performed a PCR-based assay, in which the two possible configurations yield different product sizes. Indeed, the results showed that two possible configurations of the region exist in different pneumococcal strains; multiple D39 stocks, TIGR4, BHN100 and PMEN-14 have matching terminus regions, while the opposite configuration was found in R6, Rx1, PMEN-2 and PMEN-18. We repeated the analysis for a set of seven and a set of five strains, each related by a series of sequential transformation events. All strains had the same ter orientation (not shown), suggesting that the inversion is relatively rare, even in competent cells. However, both configurations are found in various branches of the pneumococcal phylogenetic tree, indicating multiple incidences of this chromosomal inversion. Interestingly, a similar, even larger inversion was observed in two out of three recently sequenced clinical isolates of S. pneumoniae30 (Fig. 2A), suggesting a larger role for chromosomal inversions in pneumococcal evolution.
Antigenic variation of histidine triad protein PhtD
Surprisingly, the repeat regions bordering the chromosomal inversion are located in the middle of phtB and phtD (Fig. 3A), leading to an exchange of the C-terminal parts of their respective products, PhtB and PhtD. These are two out of four pneumococcal histidine triad (Pht) proteins, which are surface-exposed, interact with human host cells and are considered to be good vaccine candidates31. In fact, PhtD was already used in several phase I/II clinical trials32,33. Yun et al. analyzed the diversity of phtD alleles from 172 clinical isolates and concluded that the sequence variation was minimal34. However, this conclusion was biased by the fact that inverted chromosomes would not produce a PCR product in their set-up and a swap between PhtB and PhtD would remain undetected. Moreover, after detailed inspection of the mutations in the phtD alleles and comparison to other genes encoding Pht proteins (phtA, phtB and phtE), we found that many of the SNPs could be explained by recombination events between these genes, rather than by random mutation. For example, extensive exchange was seen between D39V phtA and phtD (Fig. 3B). Apparently, the repetitive nature of these genes allows for intragenomic recombination, causing phtD to become mosaic, rather than well-conserved. Finally, immediately downstream of phtE (Fig. 3A), we identified a pseudogene that originally encoded a fifth histidine triad protein and which we named phtF (Fig. 3C). The gene is disrupted by an inserted RUP element (see below) and several frameshifts and nonsense mutations, and therefore does not produce a functional protein. Nevertheless, phtF might still be relevant as a source of genetic diversity. Taken together, these findings call for caution regarding the use of PhtD as a vaccine target.
RNA-seq data and PCR analysis show loss of cryptic plasmid from strain D39V
Since SMRT technology is known to miss small plasmids in the assembly pipeline, we performed a PCR-based assay to check the presence of the cryptic pDP1 plasmid, reported in D39W24,35. To our surprise, the plasmid
is absent in D39V, while clearly present in the ancestral NCTC 7466, as confirmed by a PCR-based assay (Supplementary Fig. S4). Intriguingly, a
BLASTN search suggested that S. pneumoniae Taiwan19F-14 (PMEN-14, CP000921), among other strains, integrated a degenerate version of the
plasmid into its chromosome. Indeed, the PCR assay showed positive results for this strain. Additionally, we selected publicly available D39 RNA-seq datasets and mapped the RNA-sequencing reads specifically to the pDP1 reference sequence (Accession AF047696). The successful mapping of a significant number of reads indicated the presence of the plasmid in strains used in several studies (SRX261384536; SRX172540637; SRX4729662626). In contrast, RNA-seq data of D39V8,23 (Chapter 3) contained zero reads that mapped to the plasmid, providing conclusive evidence that strain D39V lost the plasmid at some stage (Supplementary Fig. S1). Similarly, based on Illumina DNA-seq data, we determined that of the three clinical isolates shown in Fig. 2A, only SP61 contained a similar plasmid18.

Fig. 3. A large chromosomal inversion unveils antigenic variation of pneumococcal histidine triad proteins. A. Top: chromosomal location of the inverted 162 kb region (orange). Red triangles connect the location of the 1 kb inverted repeats bordering the inverted region and a zoom of the genetic context of the border areas, also showing that the inverted repeats are localized in the middle of genes phtB and phtD. Arrows marked with A, B and C indicate the target regions of oligonucleotides used in PCR analysis of the region. Bottom: PCR analysis of several pneumococcal strains (including both our D39 stocks and a stock from the Grangeasse lab, Lyon) shows that the inversion is a true phenomenon, rather than a technical artefact. PCR reactions are performed with all three primers present, such that the observed product size reports on the chromosomal configuration. B. A fragment of a Clustal Omega multiple sequence alignment of 172 reported phtD alleles34 and D39V genes phtD and phtA exemplifies the dynamic nature of the genes encoding pneumococcal histidine triad proteins. Bases highlighted in green and purple match D39V phtD and phtA, respectively. Orange indicates that a base is different from both D39V genes, while white bases are identical in all sequences. C. Newly identified pseudogene, containing a RUP insertion and several frameshifts and nonsense mutations, that originally encoded a fifth pneumococcal histidine triad protein, and which we named phtF. Old (D39W) and new annotation (D39V) are shown, along with conserved domains predicted by CD-Search14.
Automation and manual curation yield up-to-date pneumococcal functional annotation
An initial annotation of the newly assembled D39V genome was produced by combining output from the RAST annotation engine11 and the NCBI prokaryotic genome annotation pipeline (PGAP)4. We then proceeded with exhaustive manual curation to produce the final genome annotation (see Methods for details). All annotated CDS features without an equivalent feature in the D39W annotation or with updated coordinates are listed in Supplementary Table S3. Examples of the integration of recent research into the final annotation include cell division protein MapZ38,39, pleiotropic RNA-binding proteins KhpA and KhpB/EloR40,41 and cell elongation protein CozE42.
Additionally, we used tRNAscan-SE43 to differentiate the four encoded
tRNAs with a CAU anticodon into three categories (Supplementary Table S4): tRNAs used in either (i) translation initiation or (ii) elongation
and (iii) the post-transcriptionally modified tRNA-Ile2, which decodes the AUA isoleucine codon44.
Next, using BLASTX12 (Methods), we identified and annotated
165 pseudogenes (Supplementary Table S5), two-fold more than reported
previously24. These non-functional transcriptional units may be the result
of the insertion of repeat regions, nonsense and/or frameshift mutations and/or chromosomal rearrangements. Notably, 71 of 165 pseudogenes were found on IS elements19, which are known to sometimes utilize alternative
coding strategies, including programmed ribosomal slippage, producing
a functional protein from an apparent pseudogene. Finally, we annotated 127 BOX elements16, 106 RUPs17, 29 SPRITEs18 and 58 IS elements19.

[Figure 3A: chromosome map of D39V (2,046,572 bps) showing oriC, xerS/difSL and the 162 kb inverted region, with genes phtA, phtB, phtD, phtE, phtF and lmb and primer target regions A, B and C. PCR product sizes: A+B = 1.2 kb, A+C = 2.2 kb.]
[Figure 3B: alignment fragment, positions 950-990, of D39V phtD, D39V phtA and the reported phtD alleles (number of alleles per sequence given on the left):
phtD  GACCATTATCACTTTATTCCTTATTCACAACTGTCACCTTTGGAAGAAAAATTG
phtA  GATCATTACCACTTCATCCCTTACTCTCAAATGTCTGAATTGGAAGAACGAATC
98x   AACCATTACCACTTTATCCCTTATGAACAAATGTCTGAATTGGAAAAACGAATT
33x   AACCATTACCACTTTATCCCTTATGAACAAATGTCTGAATTGGAAGAACGAATT
15x   AACCATTACCACTTTATCCCTTACTCTCAAATGTCTGAATTGGAAGAACGAATT
14x   AACCATTACCACTTTATCCCTTACTCTCAAATGTCTGAATTGGAAAAACGAATT
9x    GACCATTATCACTTTATTCCTTATTCACAACTGTCACCTTTGGAAGAAAAATTG
3x    AACCATTACCACTTTATCCCCTATGAACAAATGTCTGAATTGGAAAAACGAATT]
[Figure 3C: new annotation phtF (SPV_2293), coordinates 1,056,401 - 1,058,604 (-); old coordinates 906,016 - 908,219 (+); old annotation SPD_0891, SPD_0892 and SPD_0893 with the RUP insertion; conserved domain: Streptococcal histidine triad protein; stop codons and frameshifts are marked.]
RNA-seq coverage and transcription start site data allow improvement of annotated feature boundaries
Besides functional annotation, we also corrected the genomic coordinates of several features. First, we updated tRNA and rRNA boundaries (Supplementary Table S4), aided by RNA-seq coverage plots that were built from deduced paired-end sequenced fragments, rather than from just the sequencing reads. Most strikingly, we discovered that the original annotation of genes encoding 16S ribosomal RNA (rrsA-D) excluded the sequence required for ribosome binding site (RBS) recognition45. Fortunately, neither RAST nor PGAP reproduced this erroneous annotation and the D39V annotation includes these sites. Subsequently, we continued with correcting annotated translational initiation sites (TISs, start codons). While accurate TIS identification is challenging, 45 incorrectly annotated start codons could be identified by looking at the relative position of the corresponding transcriptional start sites (TSS, +1, described below). These TISs were corrected in the D39V annotation (Supplementary Table S3). Finally, we evaluated the genome-wide quality of TISs using a statistical model that compares the observed and expected distribution of the positions of alternative TISs relative to an annotated TIS15. The developers suggested that a correlation score below 0.9 is indicative of poorly annotated TISs. In contrast to the D39W (0.899) and PGAP (0.873) annotations, our curated D39V annotation (0.945) excels on the test, emphasizing our annotation’s added value to pneumococcal research.
Paired-end sequencing data contains the key to detection of small RNA features
After the sequence- and database-driven annotation process, we proceeded to study the transcriptome of S. pneumoniae. We pooled RNA from cells grown at four different conditions (Chapter 3), to maximize the number of expressed genes. Strand-specific, paired-end RNA-seq data of the control library was used to extract start and end points and fragment sizes of the sequenced fragments. In Fig. 4A, the fragment size distribution of the entire library is shown, with a mode of approximately 150 nucleotides and a skew towards larger fragments. We applied a peak-calling routine to determine the putative 3’-ends of sequenced transcripts. For each of the identified peaks, we extracted all read pairs that were terminated in that specific peak region and compared the size distribution of that subset of sequenced fragments to the library-wide distribution to identify putative sRNAs (see Methods). We focused on sRNA candidates that were found in intergenic regions. Using the combination of sequencing-driven detection, Northern blotting (Supplementary Fig. S5), convincing homology with previously validated sRNAs, and/or presence of two or more regulatory features (e.g. TSSs and terminators, see below), we identified 63 small RNA features. We annotated 39 of these as sRNAs (Table 2) and 24 as riboswitches (Supplementary Table S6).
Until now, several small RNA features have been reliably validated by Northern blot in S. pneumoniae strains D39, R6 and TIGR446–50. Excluding most validation reports by Mann et al. due to discrepancies found in their data, 34 validated sRNAs were conserved in D39V. Among the 63 features detected here, we recovered and refined the coordinates of 33 out of those 34 sRNAs, validating our sRNA detection approach.
One of the detected sRNAs is the highly abundant 6S RNA (Fig. 4B, left), encoded by ssrS, which is involved in transcription regulation. Notably, both automated annotations (RAST and PGAP) failed to report this RNA feature. We observed two different sizes for this feature, probably corresponding to a native and a processed transcript. Interestingly, we also observed a transcript containing both ssrS and the downstream tRNA gene. The absence of a TSS between the two genes suggests that the tRNA is processed from this long transcript (Fig. 4B, right).
Other detected small transcripts include three type I toxin-antitoxin systems as previously predicted based on orthology51. Unfortunately, previous annotations omit these systems. Type I toxin-antitoxin systems consist of a toxin peptide (SPV_2132/SPV_2448/SPV_2450) and an antitoxin sRNA (SPV_2131/SPV_2447/SPV_2449). Furthermore, SPV_2120 encodes a novel sRNA that is antisense to the 3’-end of mutR1 (SPV_0144), which encodes a transcriptional regulator (Fig. 4C) and might play a role in