University of Groningen Computational Methods for High-Throughput Small RNA Analysis in Plants Monteiro Morgado, Lionel

(1)

University of Groningen

Computational Methods for High-Throughput Small RNA Analysis in Plants Monteiro Morgado, Lionel

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2018

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Monteiro Morgado, L. (2018). Computational Methods for High-Throughput Small RNA Analysis in Plants. University of Groningen.

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

(2)

Lionel Morgado

Computational Methods for

High-Throughput

(3)

The research described in this thesis was carried out at the Groningen Bioinformatics Centre (GBiC), Faculty of Sciences and Engineering, University of Groningen, The Netherlands.

This thesis should be cited as:

Morgado, L (2017) Computational Methods for High-Throughput Small RNA Analysis in Plants. PhD Thesis, University of Groningen, Groningen, The Netherlands.

No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior permission of the author.

Printing: Ridderprint BV | www.ridderprint.nl ISBN (Print): 978-94-034-0541-4

(4)

Computational Methods for

High-Throughput

Small RNA Analysis in Plants

PhD thesis

to obtain the degree of PhD at the University of Groningen

on the authority of the Rector Magnificus Prof. E. Sterken

and in accordance with

the decision by the College of Deans.

This thesis will be defended in public on

Monday 26 March 2018 at 09.00 hours

by

Lionel Monteiro Morgado born on 24 January 1984 in Saint-Chamond, France

(5)

Supervisor Prof. R. C. Jansen Co-supervisor Prof. F. Johannes Assessment Committee Prof. J. Kok Prof. O. C. M. Sibon Prof. D. de Ridder

(6)

CHAPTER 1

(9)

8

1.1 Small regulatory RNAs: historical milestones

The disclosure of the DNA double helix in 1953 [1] is an important reference for contemporary genetics. It cemented the “one gene-one enzyme” hypothesis formulated in the 1940s [2] and the idea of DNA as a fundamental molecule for life, igniting the hype felt till our times around deciphering the “genetic code”. Although the “one gene-one enzyme” model evolved into a broader concept that encompasses all proteins as “functional” products originated from DNA encoded genes, non-coding regulatory elements have remained widely underestimated. For long time, RNA was mostly seen as a template (messenger RNA) and infrastructural molecule (ribosomal RNA and transfer RNA) matching the archetype of the central dogma of molecular biology defined by Crick in 1958 [3], where genetic information flows from DNA to RNA, and ends translated into a protein.

It was only in the early 1990s that gene silencing was accidentally detected in petunia flowers [4]. Transgenic lines engineered to have a deep purple color as a result of an increased expression of chalcone synthase (a pigment producing gene), resulted in variegated flowers instead. The silencing of the transgene was by then linked to antisense RNA, and because both the transgene and the homologous endogenous gene were affected, the phenomenon was termed co-suppression [4, 5]. A similar mechanism reported in the fungus Neurospora crassa was named quelling [6, 7], and a related phenomenon identified in animals was baptized as RNA interference (RNAi). The molecular processes underlying such observations remained unknown for almost a decade. It was only in 1998 that RNAi was described both in plants [8] and c. elegans [9]. This last work awarded Andrew Fire and Craig Mello the “Nobel Prize in Physiology or Medicine” in 2006, bolstering the importance of RNAi as an experimental tool and its therapeutic potential. In addition, it was found that post-transcriptional silencing (PTS) can be mediated by another class of RNA molecules with a short length but with a distinct biogenesis: the microRNA (miRNA). The first miRNA was identified in studies with c. elegans where lin-4, a gene without protein coding capacity, revealed influence in the timing of larval development through the production of tiny regulatory RNA [10]. Only almost a decade later, miRNAs were identified in plants and shown to be involved in various developmental events including flowering [11–13]. In 1994, transcriptional silencing (TS) guided by short length RNA was discovered, introducing the heterochromatic small interfering RNA (hc-siRNA). By then, it was observed in tobacco plants

(10)

9 that viroid RNA can induce methylation of its own complementary DNA, in a mechanism coined RNA-directed DNA methylation (RdDM) [14]. Given the similarities among the regulatory mechanisms described above and the characteristic short size of the RNA molecules involved, they were included under the general designation of small RNA (sRNA). Research with the model organism Arabidopsis thaliana added two new elements to the list of sRNAs found in plants. In 2004, the plant-specific trans-acting (ta)-siRNA was described as a secondary product of sRNA activity [15, 16]; and the year after, taught us that naturally occurring antisense transcripts independently synthesized in the genome can hybridize and originate natural antisense transcript (nat)-siRNA [17]. In the meantime, key components for sRNA activity, such as Dicer [18] and Argonaute [19], were identified.

The advent of Next Generation Sequencing (NGS) platforms in the turn of the millennium has permitted to access millions of sRNA sequences in a single experiment [20], showing that some sRNAs can be conserved among species, and that many are tissue and lineage specific. The exact number of sRNAs is unknown but the latest studies suggest that innumerous sequences remain to be characterized [21], many of which may be part of intricate regulatory networks [22] that have been largely overlooked to date.

1.2 Endogenous small RNAs in plants

Although sRNAs can have a source external to the organism where they act, most of the sequences found in natural plants are of internal origin. sRNAs have been shown to have key regulatory functions in development, response to biotic and abiotic stressors, genome stability and transposon control [23]. In plants, sRNA sequences are mostly 21 to 24 nucleotides (nt) in length, and result from cleavage of double-stranded RNA substrates by dicer-like (DCL) enzymes (Figure 1.1). The RNA substrates, themselves, can originate either from a single-stranded RNA precursor with a stem-loop conformation, from overlapping RNA segments produced independently of each other, or from the action of a RNA-dependent RNA polymerase (RDR) over single stranded RNA.

sRNA can be broadly divided into 2 main groups: hairpin-derived sRNA and small interfering RNAs (siRNA). The popular miRNA falls in the first group, while secondary siRNA (including ta-siRNA), nat-siRNA and hc-siRNA fall into the second group. To become active, sRNAs must

(11)

10

load into Argonaute proteins, after which they guide silencing complexes to their targets according to sequence pairing principles. Genomic regulation can then be exerted through TS or PTS. Both modes of action have been intensively studied, but PTS mechanisms, such as messenger RNA (mRNA) cleavage and translation inhibition, are better understood.

1.2.1 Hairpin sRNA and microRNA

A hairpin sRNA (hpsRNA) is defined as a sRNA derived from a precursor that has the capacity to fold into a hairpin-like shape. This precursor can then be processed by any DCL to produce sRNA mostly in the 20-24 nt range. Despite miRNA being an extensively studied subclass under the umbrella of hpsRNA, this parent category is not as well characterized; hpsRNA are more heterogeneous and therefore more difficult to describe [24].

Pol II 21-22 nt Pol II 21-22 nt Pol II/IV DCL2/4 RDR6 RDR2 DCL1 dsRNA dsRNA AGO1 AGO1 Translational repression mRNA cleavage AGO1/7 mRNA cleavage AGO4/6/9 Pol II/IV 21-22 nt DCL1/2 AGO1/7 mRNA cleavage AGO1/7 dsRNA 24 nt DCL3 DNA methylation/ Chromatin remodeling M M M Pol IV

Figure 1.1. Main endogenous sRNA pathways in plants.

b) nat-siRNA

a) miRNA c) Secondary siRNA d) hc-siRNA

(12)

11 In plants, a primary transcript (pri-miRNA) is produced in the nucleus by RNA polymerase Pol II and further processed into the hairpin-structure precursor (miRNA) by DCL1. A pre-miRNA can be composed of thousands of nucleotides and incorporates inverted repeats that allow the molecule to acquire the hairpin or stem-loop conformation. After folding into a double-stranded structure, DCL1 produces the mature miRNA duplexes. Typically, only one of the strands (the “guide”) gives rise to an active miRNA, while the opposite strand (the “passenger”) is degraded. Nonetheless, cases where both strands become active have been identified [25, 26].

Although in animals pre-miRNAs typically range from 60 to 80 nt [27], in plants they can be several hundreds of nucleotides in length. Perhaps as a result, the stem-loop structure in plant miRNAs is more variable in size (usually larger) and can contain big bugles. Moreover, in plants the pairing between the arms of the stem-loop conformation shows an overall higher degree of complementarity in the miRNA region [28–30]. Unlike in animals, about 80% of mature plant miRNAs contain a uracil at the 5’ ends, which seems to be essential for proper binding and activation of the RNA-induced silencing complex (RISC).

The role of miRNAs in the directed regulation of gene expression is well-known. Individual mature miRNAs, with a characteristic length of 21-22 nt, can target different transcripts but also several miRNAs can target the same transcript [31, 32]. Interestingly, recent evidence indicates that miRNAs can also regulate the concentration of non-coding RNA, including sRNAs [33]. The targeting of non-coding transcripts has been suggested as a mechanism for negative regulation of miRNA concentration in a process called “target mimicry” [34].

miRNAs are thought to be subject to strong evolutionary constraints and tend to be highly conserved across species [13]. Lineage-specific miRNAs differ significantly from more conserved sequences, lacking targets or using unknown non-canonic target criteria, have low abundance, heterogeneous processing from the precursor and are normally encoded by single genes instead of multiple paralogs [24].

1.2.2 Natural antisense transcript siRNA

The main difference between nat-siRNAs and other sRNAs is that nat-siRNAs originate from a RNA duplex formed after the hybridization of a pair of natural antisense transcripts (NATs).

(13)

12

NATs are RNA transcripts that may have or not protein-coding capacity and share complementarity to other RNA transcripts independently produced within the cell [35–37]. Considering the physical distance between NAT producing loci, two main categories emerge:

cis-NAT and trans-NAT. cis-NATs are transcribed from the same genomic locus but typically

from opposite DNA strands and thus form perfect pairs, while trans-NATs are transcribed from distant genomic locations.

Cis-NAT overlapping regions do not have a characteristic length and can occur in five

orientations [38]:

1. Head-to-head – consists in the interception in the 5' ends of both transcripts; 2. Tail-to-tail – comprises the interception in the 3' ends of both transcripts;

3. Completely overlapping – a transcript in one strand of the genome is overlapped by the entire length of the other transcript in the opposite strand;

4. Nearby head-to-head – nearby transcripts in a head-to-head manner where the 5′-end of a transcript is near the 5′-5′-end of another transcript in the genome;

5. Nearby tail-to-tail – nearby transcripts in a tail-to-tail manner where the 3′-end of a transcript is near the 3′-end of another transcript in the genome.

Research in multiple species indicates that NATs can assume a variety of significant regulatory roles, such as in alternative splicing [39, 40], RNA editing [41, 42], DNA methylation [43], genomic imprinting [44–49] and animal X-chromosome inactivation [50]; but regulation by NATs is not well understood [51]. In general, cis-NATs have very specific targets operating mostly locally and in a one-to-one fashion [52], but trans-NATs can share complementarity with multiple transcripts and form complex regulatory networks in processes such as plant response to stress [17]. NAT sequences shared among species such as rice and Arabidopsis have been identified [53], with cis-NATs showing high positional conservation between different species [53, 54].

While the molecular requirements for nat-siRNA biogenesis remain elusive, it is clear that the mere existence of a NAT pair is, by itself, insufficient for siRNA production [51]. Accumulation of nat-siRNAs was observed under diverse stressors, positioning this category of sRNA in the adaptation to biotic and abiotic stress [51]. Once annealed, NATs can be fragmented into 21 nt segments by DCL1, enabling PTS; but there are also reports of 24 nt segments produced by DCL3 that may guide DNA methylation [51].

(14)

13

1.2.3 Secondary and trans-acting siRNA

Secondary siRNAs seem to be part of a mechanism to amplify silencing signals intended to affect pathways with a large number of genes, such as in disease resistance and development [55–58]. Two steps describe the core of secondary siRNA biogenesis: the assembly of a double stranded precursor prompted by cleavage of a single stranded RNA segment that afterwards is targeted by RDR to synthesize its complementary strand; and the subsequent processing into siRNA by DCL enzymes. The initial cleavage event is facilitated by a sRNA, after which multiple molecular complexes are recruited for downstream processing. Secondary siRNAs come often from RNA precursors processed by DCL enzymes in a sequential manner from its beginning [59], a phenomenon called phasing. Although not mandatory, this pattern is a strong indicator of siRNA production and has thus been used for secondary siRNA loci detection. Multiple-hit target transcripts are more prone to generate secondary siRNA, especially if susceptible to sRNAs loaded to AGO1 and AGO7 [24].

When targeting involves transcripts produced at distant loci, a secondary siRNA can further be classified as a ta-siRNA. Involvement of ta-siRNA in developmental timing and patterning has been shown [60]. They were first identified in Arabidopsis [15, 16] as a new kind of non-coding RNA that shared similarities with miRNA but that had key differences. Unlike miRNA, a ta-siRNA locus (known as a TAS gene) produces a non-protein-coding transcript that evolves into a double-stranded RNA segment assisted by RDR6. This form is then processed by DCL4 or DCL2, after an initial cleavage event typically facilitated by miRNA, resulting respectively in 21 or 22 nt long RNA segments [61], that can subsequently be incorporated into RISC and direct the cleavage of target mRNA [33, 62].

The ta-siRNA pathway appears to be specific to plants. TAS genes were detected in many plants such as maize [63] and rice [64], and secondary siRNA in general show high conservation among species [59].

1.2.4 Heterochromatic siRNA

hc-siRNAs are typically 24 nt long and mostly derive from transposons, repeats and heterochromatic regions. Their biogenesis is primarily connected to the PolIV-RDR2-DCL3 pathway [65, 66]. hc-siRNA is central for RdDM, which is the pathway responsible for de

novo DNA methylation. A hallmark of RdDM is the presence of cytosine methylation in all

(15)

14

DNA sequence contexts (CG, CHG, and CHH, where H can be C, A or T) [67, 68]. DNA methylation that is independent of sRNA, by contrast is generally confined to CG and CHG contexts.

There is emerging evidence that the methylation status of plant genomes is altered in response to attack by pathogens [69–71], and that this relies on hc-siRNA production [72– 74]. Recent reports connect RdDM with plant immune response through priming of defense genes in processes such as antibacterial resistance [75, 76]. Because DNA methylation can be rapidly reversed by biotic stress, it has been proposed that dampening defense gene expression through active RdDM would provide an effective mode of regulation of host defense responses in plants [77], in a process where defense activation is accompanied by up-regulation of defense genes due to loss of methylation in TEs/repeats in their promoter regions. Such stress-induced changes in sRNA activity can sometimes be transmitted to the progeny in a process termed “transgenerational priming”. This sRNA-mediated process can facilitate quick and transiently adaptive responses, which may be particularly beneficial in fluctuating environments.

1.3 Small RNA in plant transgenerational epigenetic inheritance

The traditional view in genetics assumes that all heritable variation between individuals of a population is encoded in DNA. However, many examples have been reported where DNA alone cannot explain the phenotypic differences observed. Epigenetics, which refers to the study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence [78, 79], has brought new insights into the molecular mechanisms that may underlie this missing heritability. DNA methylation and several histone chemical modifications are widely accepted as epigenetic mechanisms. Some authors further include sRNA in this list as experimental evidence emerges for the transgenerational preservation of meaningful patterns in sRNA populations [80]. However, a stringent definition of epigenetics dictates that those marks must be passed not only through cell divisions but for at least two generations [81]. Such extended criterion aims at discerning epigenetics from what is known as “parental effects”, in which the environment of the parents can influence their offspring. For example, it is known that in some plants the mother has a strong effect on the seeds it produces and that at the time of dispersal, the

(16)

15 offspring is still surrounded by the maternal tissues of the seed coat that play a crucial role for example in the hormonal regulation of seed germination. Because sRNAs can travel long distances inside a plant organism [82] and even be transmitted from the maternal tissues of a plant to the offspring [83], they are considered by many specialists as not being a true epigenetic mark. Either way, their close relationship with other epigenetic footprints like DNA methylation [67], gives sRNAs a solid role as mediators of transgenerational epigenetic inheritance. Compared to animals, in plants the inheritance of epigenetic marks seems more stable. The existence of sRNA-mediated molecular pathways for de novo DNA methylation and its maintenance explains such steady transmission. DNA methyltransferases are active during gametogenesis and embryogenesis, allowing plants to evade the reset that happens in animals. Epigenetic inheritance in plants is frequently associated with sRNA originated in transposable elements and that regulates nearby genes [84, 85], which can culminate in inheritable epialleles with long-lasting phenotypic effects. As sessile organisms, plants can benefit by passing to their progeny environmental cues since seeds disperse mostly locally, which increases the chance for new plants to experience environments similar to their ancestors.

As for its relevance for biology and agricultural sciences, transgenerational epigenetic inheritance in plants has attracted much attention, but the vast majority of studies have been focused on DNA methylation, leaving unanswered many questions about the impact of sRNA in such phenomenon.

1.4 Interrelationship between sRNA pathways

If miRNA structure and regulation is quite well studied, the same cannot be said for other sRNA categories. The lack of clear physical features is a significant issue, further aggravated by the fact that many genomic loci can be involved in the biogenesis of sRNA from distinct pathways. It has been noted that some transposon families like Athila, can switch the production of siRNA from 24 nt to 21-22 nt when methylation is lost [86, 87]. This transition starts with the synthesis of transcripts by Pol II that are afterwards degraded into 21-22 nt siRNA. While most of these siRNAs mediate PTS, some can enter a non-canonical RdDM pathway dependent on Pol II, RDR6 and DCL2/4 [87]. This mechanism has the capacity to reestablish TE methylation and correct lost silencing marks that can be further reinforced by

(17)

16

other pathways for DNA methylation. Remarkably, transitions from PTS to TS have also been observed and studied in detail in ÉVADÉ (EVD) [88]. Loss of methylation in this retrotransposon, triggers the production of 21-22 nt siRNA via the action of DCL2 and DCL4. Nevertheless, very high levels of EVD transcripts can saturate the available DCL2 and DCL4, redirecting the siRNA precursors to vacant DCL3 proteins and thus shifting siRNA production from 21-22 nt (mostly) PTS-siRNA to 24 nt hc-siRNA. hc-siRNA can further attenuate the production of PTS-sRNA units via DNA methylation. The numerous cases of epigenetically activated sRNA that have been recently recorded demonstrate clearly the antagonist and complex TS-PTS relationship [22]. It is further known that loss of TE methylation can result in epigenetic activation and consequent transcription, whereby transposon mRNA becomes preferentially targeted by miRNA. To further complicate the equation, the genes for these miRNAs can be controlled by DNA methylation.

It is clear from these examples that plant organisms have large and very intricate sRNA networks connecting genomic loci that may be involved in multiple pathways both as sRNA producers but also as targets. For a clear understanding of sRNA regulation it is therefore fundamental to identify accurately which sequences are involved in which pathways. Only by doing so, we can crack the complex networks formed by these regulatory elements.

1.5 Analyzing sRNA: computational challenges from the “dry lab”

A considerable number of experimental methods for sRNA detection have been developed over the years. Traditional approaches include northern blotting, reverse transcription polymerase chain reaction (RT-PCR), microarrays and more recently NGS protocols like sRNA-seq [89]. Each method has its own limitations, but the use of sRNA-seq has gained momentum in recent years due to its capacity to record in a single run millions of sequences at a genome-wide scale for a relatively low cost. The existence of such large number of sequences makes impractical the application of other experimental assays to determine additional properties for each of the sequences captured under the current technology. For example, gene-specific experimental validation of PTS targets using well established methods like real-time reverse transcription polymerase chain reaction (qRT-PCR), luciferase reporter assays and western blot, are labor intensive and not easily scalable to genome-wide studies [90, 91]. In the case of TS, direct experimental target validation has only been

(18)

17 proposed recently [92] and therefore it is still in its infancy. This is where computational methods come into play, aiding the extraction of additional information from sRNA primary structure, and prioritizing candidates for experimental validation. Nevertheless, the computational analysis of sRNA-seq data is far from being a trivial task. Libraries are composed of mixtures of sRNA sequences with diversified origin, structural properties and involved in distinct modes of silencing. In addition, the identification of a sRNA sequence by itself is not a guarantee of sRNA regulation since cells can produce inactive short-length RNA. Despite the great interest in separating those units which can guide silencing from other sequences, no computational methods for general high-throughput function detection existed prior to the work reported in this thesis. Using the existing software tools, function detection could only be inferred partially for PTS target prediction, implying that a large share of sRNAs which often guides TS remained ignored.

Once functional sRNAs are identified, determining the mode of silencing for each sequence is the next step. Doing so, gives important clues about the duration/transmission of silent states and the type of biological pathways that induce/maintain these states. PTS is interpreted as a faster response than TS, but with a less lasting effect. Again, no computational methods were publically available to determine sRNA-mediated TS by the time the work of this thesis started. In addition, it is well known that tools for PTS target prediction are characterized by high false positive rates, which complicates laboratory validation. Software to distinguish TS- from PTS-sRNA can in principle improve PTS inference. Although the central role of plant Argonaute (AGO) proteins in defining the mode of silencing is widely accepted and frequently mentioned in sRNA research, no computational methods have been devised to explore the potential of a given sRNA sequence to associate with a specific AGO. Determining such capacity has been strictly dependent on sRNA sequencing after AGO immunoprecipitation.

Another limitation for sRNA studies is the high variability found within and across sRNA species, as well their dynamic nature which impacts our ability to capture them in the lab. Indeed, the detection of sRNA can be experimentally difficult as their expression can be low or dependent on specific developmental stages, cell types or stimulation. In silico analysis of artificial sRNA (artsRNA) can give preliminary answers to theoretical models and motivate experimental validation. The exploration of artsRNA is not a new concept, as conservation principles have been used in the past to detect miRNA [93]. This philosophy can be revised

1

(19)

18

to develop programs for general sRNA detection in an era where new genomes are published on a daily basis.

As seen so far, mining sRNA-seq data involves innumerous analytical steps and can even go beyond the inspection of the mature sRNA sequences, precursors and targets. Studies regarding variation in sRNA abundance among samples and many other downstream analyses (e.g.: gene ontology, network analysis, etc.) are common practice. Currently, there are many individual programs for very specific sRNA-related applications dispersed over the internet, but only a small number of integrative frameworks exist. These often emphasize miRNAs and ignore other important sRNA categories. This is especially evident in plants where only a handful of (frankly incomplete) software platforms can be found. Biologists are often not acquainted with programming and lack the necessary computational skills to build the pipelines they need, which culminates in a tremendous time investment to achieve only modest solutions. There is a clear and urgent demand for novel computational tools to empower sRNA researchers. This PhD thesis has aimed to meet this demand by delivering novel computational methods to improve sRNA detection and characterization. These novel tools are embedded in a single computational platform and integrated with existing software programs, thus facilitating comprehensive and flexible analyses of sRNA in plants.

(20)

(21)

(22)

21

CHAPTER 2 Computational tools for plant small RNA

detection and categorization

Lionel Morgado and Frank Johannes

Abstract

Small RNAs (sRNAs) are important short-length molecules with regulatory functions essential for plant development and plasticity. High-throughput sequencing of total sRNA populations has revealed that the largest share of sRNAs remains uncategorized. To better understand the role of sRNA-mediated cellular regulation, it is necessary to create accurate and comprehensive catalogues of sRNAs and their sequence features, a task that currently relies on non-trivial bioinformatic approaches. Although a large number of computational tools have been developed to predict features of sRNA sequences, these tools are mostly dedicated to microRNAs and none integrates the functionalities necessary to describe units from all sRNA pathways thus far discovered in plants. Here we review the different classes of sRNA found in plants and describe available bioinformatics tools that can help in their detection and categorization.

(23)

22

2.1 Introduction

Over the past few years, the scientific community has centered efforts to unravel the complex world of RNA molecules that are not translated into a protein, but that rather have a regulatory function in the cell [94]. Such regulatory RNAs are involved in the control of the concentration of messenger RNA (mRNA) and comprise, among other subclasses, the small RNA (sRNA). sRNAs have been shown to have key regulatory functions in development, response to biotic and abiotic stressors, genome stability and transposon control [23]. With the advent of next-generation sequencing of small RNA (sRNA-seq) it has become feasible to survey entire sRNA populations from diverse plant species, cell types, developmental time-points or from different experimental treatments. The identification and classification of sRNAs from such high-throughput data is a non-trivial computation task, since plants can produce millions of sRNAs from diverse pathways which are collectively captured in a single sequencing experiment. Accurate sorting of sRNA by class requires categorizing sRNA according to their precursors, structural properties of the mature molecules, as well as functional aspects, such as their potential target sites (Figure 2.1). A number of computational tools have been developed to detect known sRNA in newly synthesized sequencing libraries, and to help in the identification of novel candidates. For many biologists a key bottleneck to in silico sRNA analysis is to find software that is tailored to their specific research question and to the data type at hand. In this chapter, we provide an inventory of various computational tools for the identification and categorization of plant sRNA. We start by describing a simplified classification scheme based on structural and functional sRNA properties, which is adopted in subsequent sections to organize currently available computational tools.

2.2 Small RNA in Plants

Plant sRNA biology has been reviewed in the introduction chapter of this thesis. Briefly, in plants, sRNAs are mostly 21 to 24 nucleotides (nt) in length, and result from cleavage of double-stranded RNA substrates by dicer-like (DCL) enzymes. The RNA substrates, themselves, can originate either from a single-stranded RNA precursor with a stem-loop

(24)

23 Figure 2. 1 . A st ra tifi ed cl as sifi ca ti on sch eme for sR N A in p la nt s.

2

(25)

24

conformation, or from a double-helix. If the sRNA originates from a hairpin structure they are referred to as hairpin-derived sRNA (hpsRNA); and if they originate from a double helix they are referred to as small interfering RNA (siRNA). The hpsRNA class can be further considered a microRNA (miRNA) if the hairpin is processed in such a way that it produces only one or very few functional units. siRNA comprises all other classes of known sRNA: secondary siRNA such as trans-acting (ta)-siRNA, natural antisense transcript (nat)-siRNA and heterochromatic (hc)-siRNA.

In the case of secondary siRNA, two non-mutually exclusive groups can be defined: phased siRNA, which is originated from a precursor that is processed in a precise and sequential manner; and ta-siRNA which is a plant-specific sRNA type with targets originated in trans. In the case of nat-siRNA, the precursor double-helix is derived from overlapping RNA segments produced independently of each other, while secondary siRNA and hc-siRNA precursors are preceded by the action of a RNA-dependent RNA polymerase (RDR) over single stranded RNA. Considering the physical distance between NAT producing loci, two main categories emerge: cis-NAT and trans-NAT. cis-NATs are transcribed from the same genomic loci but typically from opposite DNA strands and thus form perfect pairs, while trans-NATs are transcribed from distant genomic locations. Cis-NAT overlapping regions do not have a characteristic length and can occur in five orientations [38]:

1. Head-to-head - consists in the interception in the 5' ends of both transcripts; 2. Tail-to-tail - comprises the interception in the 3' ends of both transcripts;

3. Completely overlapping - a transcript in one strand of the genome is overlapped by the entire length of the other transcript in the opposite strand;

4. Nearby head-to-head - nearby transcripts in a head-to-head manner where the 5′-end of a transcript is near the 5′-5′-end of another transcript in the genome;

5. Nearby tail-to-tail - nearby transcripts in a tail-to-tail manner where the 3′-end of a transcript is near the 3′-end of another transcript in the genome.

To become active in plants, sRNA must load into AGO proteins which guide silencing complexes to their targets according to sequence pairing principles. When associated with AGO, sRNA can regulate genomes at the transcriptional (TS) or post-transcriptional (PTS) level depending on the specific AGO to which the sRNA binds. Both modes of action have been intensively studied, but PTS mechanisms such as mRNA cleavage and translation

(26)

25 inhibition, are better understood. PTS is typically observed for miRNA, secondary siRNA and nat-siRNA; while TS is often associated with the action of hc-siRNA. Functional siRNA characterization is key to identify hc-siRNA since no clear structural features to discriminate between hc-siRNA and other siRNA have been defined to date.

2.3 Computational approaches for sRNA detection and categorization

Software for sRNA categorization is typically designed to deal with sequences as input. Some software tools identify precursors in segments with a length of dozens or even hundreds of nucleotides, while others focus on short matured fragments of about 20 nt, and there are also platforms that combine information from both forms. A comprehensive overview of tools is given in the following sections, and further detailed in Table 2.1. Depending on the application, sRNA analysis can be complex, often involving preliminary steps such as data pre-processing and quality controls, as well as downstream analysis such as gene ontology annotation and pathway discovery. A number of recent computational platforms have tried to integrate various software modules by stringing together existing tools into single analysis pipelines [95–101]. The description of modules that do not directly deal with sRNA identification and categorization are outside of the scope of this chapter and will not be discussed here.

2.3.1 Detecting known sRNA

An obvious choice when trying to identify new sRNA candidates is to search for known sequences that have been experimentally confirmed. Owing to their popularity, a large number of databases of experimentally validated miRNA have been built up, comprising several species and kingdoms. miRBase [102] is the most famous miRNA repository, and provides extensive information on precursors, mature sequences and their targets. Unfortunately, similarly detailed databases for other sRNA categories do not currently exist. To our knowledge, there is only one public database (tasiRNAdb) for secondary sRNA in plants, which is strictly dedicated to ta-siRNA [103]. tasiRNAdb provides information not only about mature ta-siRNA, but also their precursors and targets in a total of 18 plant species. We are not aware of repositories of mature nat-siRNA or hc-iRNA. Still, there is one database

(27)

26

of NATs in plants: PlantNATsDB [104]. This resource contains a large inventory of pre-computed NATs for 70 plant species, but it focuses on genes ignoring extensive intergenic regions.

Sequence aligners such as BLAST [105] are commonly employed to query such databases. In fact, the platforms supporting tasiRNAdb and PlantNATsDB implement their own online BLAST modules. Searches for long precursors can be performed using standard BLAST parameters; however, mature sequences pose additional challenges due to their very small size. When searching for mature sequences, perfect matches reduce the odds of getting hits by chance when compared with the use of mismatches and gaps. On the other hand, allowing for mismatches and gaps is often necessary when studying close inter-species homologues. While single nucleotide polymorphisms are a well-known source of genomic variation, mature sRNA can be subject to further modifications. For example, miRNA variants (i.e. “isomiRs”) have been identified as a result of inaccurate DCL cleavage, sequence editing events and even nucleotide additions to the mature sRNA [106, 107]. To deal with isomiRs, several tools rely on alignment algorithms and a preprocessing scheme that consists on sequence terminal trimming and nucleotide additions to simulate known sequence modifications. This is the case in applications such as seqBuster [108], QuickMIRSeq [109], IsomiRage [110], sRNAbench [111], isomiRex [112] and isomiRID [113]. Since “template isomiRs” are a result of dicing shifts, they can be detected if perfect complementarity between the sRNA candidate and a known miRNA precursor (or pre-miRNA) is observed. On the other hand, simulating “non-template isomiRs” by creating all possible combinations with 1 to 3 nt extensions in the 5’ and 3’ ends of known miRNA, and by trimming canonical mature sequences, has been central for their identification. To reduce false positives, some of these tools perform additional processing steps typically exploring features of sRNA-seq libraries. The simplest procedure consists in using read abundance cutoffs as done in isomiRID. seqBuster employs several filters to eliminate sequences with low read abundance and computes z-scores to distinguish true isoforms from sequencing errors. QuickMIRSeq uses multiple samples simultaneously with the rationale that noisy background reads are not captured consistently in multiple samples unlike true miRNA, even if they show low expression levels.

(28)

27

2.3.2 Computer-aided de novo sRNA categorization

Once known mature sRNAs are identified in sRNA-seq data, the remaining sequences (which usually comprise the large majority of the initial sRNA-seq sets) are typically mapped to a genomic reference (if available) to eliminate sequencing chimeras and artifacts. Mature sequences mapping to previously characterized ribosomal RNA (rRNA), transfer RNA (tRNA), small nucleolar RNA (snoRNA) and small nuclear RNA (snRNA) (from databases like for example Rfam [114]) are filtered out by some computational frameworks [115, 116], since these are thought to be mostly non-DCL fragmentation products that have a low chance of entering functional sRNA pathways [28, 116]. Still, doing so can eliminate true sRNA, since tRNA-, rRNA- and snoRNA-derived sRNAs have been identified in multiple plant species [117]. Removing low complexity and low copy reads is also a common practice to reduce noisy data [115, 118], but again, care must be taken in doing so because important sequences can be missed (e.g., hc-siRNA are typically derived from repetitive regions). If BLAST is a popular mapping tool suitable to work with long precursor sequences, aligners like BWA [119] and bowtie [120] are primary choices when it comes to mapping libraries of short reads to a reference. An accurate mapping is of importance in the sense that meaningful clues for sRNA categorization can be obtained from the sequence and chromatin context of the mapped loci.

The identification and classification of putatively functional sRNA is a challenging computational task. While the majority of computational tools have been tailored to animal data, several of these tools can also be applied to other species, including plants. However, sRNA biology differs considerably between plants and animals, and several plant-specific computational tools have been developed. Existing computational methods can be broadly divided in 5 main groups: i) those that explore conservation principles; ii) those that rely on structural features such as the spatial conformation of the precursor(s); iii) inspired by machine learning; iv) rule-based and v) target-centered. In practice, this distinction can be difficult as most modern tools consist of pipelines involving a mixture of these methods. Below we discuss computational tools specific to each plant sRNA class. Because sRNA biogenesis and function can be treated separately, emphasis is given to each of these facets in distinct sections.

(29)

28

sequences can be missed (e.g. hc-siRNA are typically derived from repetitive regions). If BLAST is a popular Tab le 2.1. M ai n f e atu re s o f co mp u tati o n al to o ls fo r sRN A c h a rac te ri zati o n . Mo d u le s fo r d o w n str eam an alys is o r th at are n o t ap p lic ab le to p la n ts are n o t me n ti o n e d . DF : d eg ra d ati o n frag me n ts ; R L: R N A s eq u en ci n g; SA : sR N A -s e q ali gn me n ts ; SL : sR N A -s eq l ib rar y; SS : sR N A s eq u en ce ; PI : p h as e in iti ato r se q u en ce ; PS : p re cu rs o r se q u e n ce ; TR : tr an sc ri p t; T G : tr an sc ri p t o r g e n o mic se q u en ce . # Too l Type Focus Inpu t A nal ys is Re fe re nc e Local Web serv er P re cur so r M at ur e Targ e t Con serv ation St ruct ur e D ici ng Machi ne le arni ng Cons erv ation Iso mir detect ion Mac hine l earni ng P TS Hai rpi n NAT Pre cise exci sion Phasi ng Com plem ent ari ty De grad ome Machi ne le arni ng Expr ess ion 1 Se qB us te r x m iR N A SL+ TG x x x x P an tan o et a l, 2010 2 Qu ic kM IR Se q x m iR N A SL+ TG x x x x Zha o et a l, 2017 3 Iso m iR age x m iR N A SL+ TG x x x x M ul le r et a l, 2014 4 sR N A be nc h x m iR N A SL+ TG x x x x x B ar tur en et a l, 2014 5 iso m iR ex x m iR N A SL+ TG x x x x Sab lo k et a l, 2013 6 iso m iR ID x m iR N A SL+ TG x x x x D e O liv e ir a et a l, 2013 7 R N A Fo ld x x hp sR N A TG x H o fac ke r et a l, 2003 8 UN A Fo ld x hp sR N A TG x M ar kha m et a l, 2003 9 M ir inh o x m iR N A SL+ TG x x x H ig as hi et a l, 2015 10 m iR N A Fo ld x x hp sR N A TG x Tav et a l, 2016 11 M IR FIND ER x m iR N A SS +T G x x x B o nn et et a l, 2004 12 M IR che ck x m iR N A SS +T G x x x Jo ne s-R ho ad es et a l, 2004 13 m ic ro H ar ve st er x x m iR N A SS +T G x x x D ez ul ian et a l, 2006 14 M iM at che r x m iR N A TS +T G x x Li nd o w et a l, 2005 15 m ir To ur x m iR N A SS +T G x x x M ile v et a l, 2011 16 C -m ii x m iR N A SS +T G x x x x N um na rk et a l, 2012 17 m ir D ee pFi nd er x m iR N A SL+ TG +D F x x x x x X ie et a l, 2012 18 P lan tM iR N A P re d x m iR N A TG x x X ua n et a l, 2011 19 pl an tM ir P x m iR N A TG x x Yao et a l, 2016 20 N O VO M IR x m iR N A TG x x Te un e et a l, 2010 21 H un tM i x m iR N A TG x x G ud ys et a l, 2013 22 m iR N AP re d ic ti o n x m iR N A SS +T G + x Wi lli ams et a l, 2012 23 Spl am iR x m iR N A TG x + x Thi em e et a l, 2011 24 m iP lan tP re M at x m ir R N A TG x x x M eng et a l, 2014 25 M iR P ar a x m iR N A TG x x W u et a l, 2011 26 m iR D up x m iR N A SS +T G x x Le cl er q et a l, 2013 27 m iR du pl exS VM x m iR N A TG x x Kar at ha na si s et a l, 2015 28 M at ur eP re d x m iR N A TG x x X ua n et a l, 2011 29 m iR Lo cat o r x m iR N A TG x x C ui et a l, 2015 30 Sho rt St ac k x hp sR N A /m iR N A /t a-si R N A SA +T G x x x x x A xt el l, 2013

(30)

29 Tab le 2.1. (c o n ti n u e d ) # To ol Type Focus Input A nal ys is R e fe re nc e Local We bse rve r P re cur so r M at ur e Targ e t Cons erv ation St ruct ur e D ici ng Machi ne le arni ng Cons erv ation Iso mir detect ion Machi ne le arni ng P TS Hai rpi n NAT Pre cise exci sion Phasi ng Com plem ent ari ty Deg radom e Machi ne le arni ng Expr ess ion 31 m ir D ee p -P x m iR N A SL+ TG x x x Yan g et a l, 2011 32 m iR P lan t x m iR N A SL+ TG x x x A n et a l, 2014 33 m iR A x m iR N A SA +T G x x Ev er s et a l, 2015 34 P IP m iR x m iR N A SL+ TG x x x B re ak fi el d et a l, 2012 35 M ir -P R EF eR x m iR N A SA +T G x x Le i et a l, 2014 36 m iR C at 2 x m iR N A SL+ TG x x P ai cu et a l, 2017 37 m iR ead er x m iR N A SL x x A sh wani et a l, 2013 38 UE A s R N A W o rk be nc h x m iR N A /t a-si R N A SL+ TG +D x x x x x x x St o ck s et a l, 2012 39 ps sR N A M ine r x ta -si R N A SL+ TG +P I x x D ai et a l, 2008 40 Sho rt ran x m iR N A /t a-si R N A SL+ TG x x x x x G up ta et a l, 2012 41 Tas ExpAna ly si s x ta -si R N A SL+ TG +D x x x x Zha ng et a l, 2014 42 P ha se Tan k x Se co nd ar y/ph as ed si R N A SL+ TG x x G uo et a l, 2015 43 N A ST I-se q/R x nat -si R N A SL x Li et a l, 2013 44 N AT p ip e x n at -s iR N A SL+ TG +D x x x Yu et a l, 2016 45 C le av el an d x PTS SS +T G +D F X x B ro us se et a l, 2014 46 P A R Esn ip x PTS SS +T G +D F X x Fo lk es et a l, 2012 47 So M A R T x m iR N A /t a-si R N A SL+ TG + D F x x x x x X x Li et a l, 2012 48 Se qTar x PTS SL+ TG +D F x Zhe ng et a l, 2012 49 m iR N A D ig ge r x m iR N A SL+ TG +D F x x x x x x Yu et a l, 2016 50 ps R N A Tar ge t x PTS SS +T G X D ai et a l, 2011 51 TA P IR x x PTS SS +T G X B o nn et et a l, 2010 52 Tar ge tf ind er x PTS SS +T G X Fah lg re n et a l, 2010 53 Tar ge t-al ig n x x PTS SS +T G X X ie et a l, 2010 54 P sR o bo t x x hp sR N A /m iR N A SS +T G x x X x W u et a l, 2012 55 p -T A R EF x x m iR N A SS +T G x x Jha et a l, 2011 56 m ic ro R N A -T ar ge t x m iR N A SS +T G x x M eng et a l, 2014 57 P lan tM ir na T x m iR N A SS +R L+ TG +D F x x x x R he e et a l, 2015 58 M Ti d e x m iR N A SL+ TG +D F x x x x x x Zh an g et a l, 2015 59 sPA R TA x PTS SS +T S+ D F x x Kak ran a et a l, 2015 60 im iR TP x m iR N A SS +T S+ D F x x D ing et a l, 2011

2

(31)

30

2.3.2.1 Identification of hairpin structures and miRNA classification

The root concept underlying hpsRNA and miRNA categorization is the biogenesis from a RNA transcript with capacity to fold into a hairpin shaped precursor. Since hpsRNA are not well understood and given the popularity of miRNA, most hairpin detectors were developed having in mind pre-miRNA. In truth, a large number of tools branded as miRNA detectors are no more than hairpin or pre-miRNA predictors since they do not even provide a location for the putative mature miRNA inside the pre-miRNA. These tools must therefore be examined carefully to avoid erroneous conclusions [121].

The inference of RNA secondary structure is central to many computational methods designed to detect hairpin structures. Algorithms such as RNAfold [122] and UNAFold [123], explore thermodynamic principles applied to RNA runners and under the premise that the minimal folding free energy index for miRNA precursors is significantly lower than for other products frequently captured during sequencing such as tRNA, rRNA or mRNA [124]. It is important to recall that plant miRNA stem-loops are more heterogeneous when compared to animals (usually they are larger and can contain big bugles); hence, the parameters of the algorithms need to be adjusted according to whether the input data is derived from plant or animal species [125, 126]. Once calibrated, these ab initio methods can predict hairpin structures without additional knowledge.

Traditional RNA folders are computationally intensive, characterized by a cubic time complexity which is suboptimal for large inputs. Mirinho [127] and miRNAFold [128] are folders with a square time complexity, recently introduced to tackle this problem.

Because homology search is inherently simpler than folding estimation, most software combines conservation principles with RNA secondary structure predictions to decrease processing time, but also to increase accuracy. In the case of MIRFINDER [93], a miRNA reference set from Arabidopsis is used to tune a pipeline similar to the one described above. This tool performs a search for new miRNA by comparing queries against a reference, and explores three principles: (i) the reference miRNA sequence is conserved between the query and the reference species, independently of the rest of the precursor sequence having diverged; (ii) the precursor sequence must be able to form a stem-loop secondary structure; and (iii) for two miRNA orthologs the location on the arm of the stem-loop secondary structures is the same in both species. To fulfill the first condition a search for mature miRNA

(32)

31 is made with BLAST, and the second condition is verified with RNAfold. Other tools that use similar comparative genomics principles include MIRcheck [129], microHarvester [130], MiMatcher [131], miRTour [132], C-mii [133] and miRDeepFinder [115]. For example, miRDeepFinder uses a set of miRNA candidates as queries and searches in a reference for segments with potential to form pre-miRNA. The hit sites are extended (700 nt by default) up and downstream to capture and examine precursor candidates with the miRNA located in one arm of the stem at either the 5′ or 3′ end. After a miRNA candidate is identified, miRDeepFinder extracts the complementary miRNA* sequence considering a 3′ overhang of 2 nt characteristic of the miRNA-miRNA* duplex.

In a slightly different approach, machine learning methods have been used to train classifiers capable of distinguishing plant pre-miRNA from other RNA sequences. One argument in favor of this latter approach is that comparative methods have limited capacity to detect miRNA sequences and precursors with low similarity to the reference set while machine learning models can capture more general features that overcome this weakness. PlantMiRNAPred [134] and plantMirP [135] are part of this list, both of which were developed using support vector machines (SVMs). PlantMiRNAPred uses a classifier trained with data from several plant species. A set of 68 features extracted from pre-miRNA and optimized using information gain and feature similarity criteria, was considered for training the final classifier, including information about the sequence composition, k-mers, secondary structure, energy and thermodynamics related parameters. Interestingly, the authors compared PlantMiRNAPred to triplet-SVM [136] and microPred [137], two tools following a similar philosophy but developed using human data. Indeed, these two methods show discriminative capacity when applied to plants but the accuracy of PlantMiRNAPred is considerably higher, illustrating the need for kingdom-specific tools. Other machine learning algorithms have been applied to pre-miRNA detection, including Markov models in NOVOMIR [138], random forest in HuntMi [139] and C5.0 decision trees in miRNAprediction [140]. In a less usual scheme, SplamiR [141] combines software for detecting primary transcripts that undergo splicing events with a machine learning classification system to identify candidate pre-miRNA among generated putative pre-miRNA.

Searches for genomic sequences with the potential to form fold-back stem-loop structures do not yield high-confidence putatively functional miRNA, since many more inverted repeats can be found than the number of miRNA expected for a given organism. For example, in A.

(33)

32

thaliana 138,864 inverted repeat structures have been identified [129] but less than 1000

miRNA confirmed [102]. To increase miRNA detection accuracy, miPlantPreMat [142] and miRPara [143] feed properties of mature miRNA sequences and their precursors to machine learning models. Both combine SVM classifiers in a hierarchical architecture. miPlantPreMat works with classifiers individually trained to recognize mature and precursor sequences, while miRPara explores inter-kingdom differences.

Although the determinants for miRNA location inside a precursor remain poorly understood, efforts have been made to develop computational procedures for their detection. For example, miRDup [121] was designed to infer the precise positions and length of mature miRNA within a candidate pre-miRNA through random forest classifiers that use sequence and structural features. In addition, tools such as MiRduplexSVM [144], MaturePred [145] and miRLocator [146], use classifiers to extract the position of miRNA duplexes from hairpins.

High confidence miRNA classification requires additional criteria [147]. For example, the precursor must be diced at very specific loci producing only one or a reduced number of mature miRNAs. When that happens, piles of sRNA accumulate at these genomic positions. To inspect this feature, it is necessary to check the layout of sRNAs along the precursor. This can be directly assessed using the short-read alignment patterns from sRNA-seq data. Such a functionality can be found in most modern tools, including Shortstack [118], mirDeep-P [148], mirPlant [149], miRA [150], PIPmiR [151], miR-PREFeR [152] and miRCat2 [153]. MirDeep-P and mirPlant are extensions of the popular tool mirDeep [154]. While mirDeep was developed for animal applications, mirDeep-P and mirPlant were specifically designed for plant-based sRNA analysis. Following the mapping of sRNA reads to a reference genome with bowtie, mirDeep-P extracts RNA segments to further determine secondary structure and checks if sRNA spatial distribution patterns are compatible with dicer activity. The mature candidates and the respective pre-miRNA are then filtered according to plant specific criteria based on known properties of plant miRNA genes. A significant difference between mirPlant and mirDeep-P is that in the latter case, the precursor region is determined based on the genomic region overlapping reads while in miRPlant the precursor region is determined based on the highest expressed read, which is presumably the mature miRNA. The authors of miRPlant argue that this strategy reduces the number of false negatives, as it guarantees that the mature miRNA is located at the end of one arm of the hairpin. In the

(34)

33 case of Shortstack, giving the mapped reads and a reference as input, a de novo sRNA cluster discovery is performed by analyzing local patterns of read coverage. Each genomic region overlapping a sRNA cluster is then subjected to a hairpin-folding analysis with RNAfold. Afterwards, hairpins are annotated either as hpsRNA or miRNA loci, depending on how strong is the evidence for precise excision given by local sRNA patterns. miRA tries to maximize the flexibility in parameter settings to enable a conservation-independent miRNA analysis; the authors argue that the use of standard parameters for all plant species is suboptimal because of the complex and non-homogeneous nature of miRNA precursors in plants. In miR-PREFeR, expression information from multiple sRNA-seq libraries can, in addition, be used to decrease false positives and improve the reliability of the predictions. Another less common solution is miReader [155], which aims at identifying mature miRNA directly from sRNA-seq data thanks to an embedded algorithm for de novo contig assembly using short reads.

2.3.2.2 Detection of secondary siRNA and ta-siRNA

In contrast with miRNA-encoding MIR genes, secondary siRNA precursors such as those encoded by ta-siRNA loci or TAS genes, lack a specific secondary structure, and thus require alternative computational prediction strategies. The computational identification of new secondary siRNA is strongly focused on the detection of phasing patterns. This kind of analysis requires sRNA-seq data and a genomic reference, and can be executed with tools such as UEA small RNA Workbench [96], ShortStack [118], pssRNAMiner [156] and shortran [157], which implement variants of the method described in [55]. In this approach, sRNA clusters are determined from the mapped reads and the occurrence of significant phasing patterns inside these regions (see Figure 2.1) is calculated considering a hypergeometric distribution. The sRNAs thought to be phase-initiators can also be mapped to the reference to help identify the start and stop coordinates of the precursor, and restrict the inspection of secondary siRNA candidates to clusters inside that region [156]. In the “one hit” initiator case, functional siRNA must be searched in both the 5’ and 3’ direction of the initial cleavage coordinate, since phasing is a bidirectional process. To mimic patterns of DCL slicing, both UEA small RNA Workbench and ShortStack, introduce a shift that pushes the start position of the segment located in the opposite strand 2 nt downstream.

(35)

34

The TasExpAnalysis module, available online through the tasiRNAdb [103] platform, combines phasing detection with a search for known TAS and ta-siRNA in user-provided sRNA and sequencing reads from endonucleolytic mRNA cleavage products, also known as degradome. This tool follows a target centered approach, where after mapping sRNA-seq and degradome reads to a TAS candidate, it checks the consistency between the TAS cleavage position and the 5’ end of the degradome fragments. Next, it searches for a sRNA from the provided library that can fit the role of phase initiator. Statistical tests to detect phasing are then performed using the mapped sRNA and assuming a hypergeometric distribution.

PhaseTank [158] implements a slightly different methodology. After defining phased clusters that contain at least 4 phased sRNA in 84 nt regions, a non-statistical phased score is computed to express the chance of a region to be a producer of phased siRNA. This score depends on patterns of sRNA distribution and abundance in the region. The triggering sRNA is then determined following sequence complementarity principles along with the fact that the cleavage site must occur at positions 9-11 nt of the sRNA from its 5’ terminal.

Some tools like UEA small RNA Workbench do not provide an indication of the phase initiator(s). Employing standard tools for PTS target prediction to test diverse sRNA candidates that fall around the initial cleavage site(s) can be a solution. On the other hand, ignoring positional information about the initial cleavage allows a more liberal approach, not restricted to known sRNA but considering potentially unidentified sRNA to be starters of the process [55, 159]. Distinguishing ta-siRNA from other secondary siRNA is done simply by comparing the mapping locations of the siRNA and its target transcript; if in trans, the siRNA is incorporated in the ta-siRNA group.

2.3.2.3 Finding NAT pairs and nat-siRNA

Genome-wide identification of NATs from multiple organisms is nowadays possible using the large collections of sequencing data freely available online. Annotated genomes have been used in combination with other highly abundant expressed sequences. In silico methods for detecting NATs suffer from several shortcomings depending on the source of sequence information [53]. For example, the use of mRNA can come with information about the orientations of the transcripts, but the amount of mRNA sequence information available can

(36)

35 be limited, reflecting specific tissues or development stages [160]. Either way, computational resources and databases dedicated to NATs are scarce.

Current methods for the detection of new NATs are simplistic and based on two main pillars: the sequence complementarity between candidate pairs and the potential for transcripts to hybridize. Although the main criterion for the recognition of NAT pairs is the presence of overlapping transcript clusters, the length of the overlay is a parameter artificially defined and very variable from study to study [38]. Other parameters to define NATs have a very heuristic basis and lack clear standardization. As an example, in [53] cis-NAT pairs from Arabidopsis were studied using annotated and anchored full-length cDNA, by applying the following criteria: (i) cDNAs of both transcripts can be uniquely mapped to the genome with at least 96% sequence identity; (ii) the two transcripts are derived from overlapping loci but opposite strands; (iii) the size of the overlaid fragment must be longer than 50 nucleotides; (iv) the sense and antisense transcripts must have distinct splicing patterns. Other studies implemented comparable but slightly altered approaches for rice [36, 161] and Arabidopsis [162].

The NASTI-seq R package [163] is one of the few computational tools currently available for NAT discovery. This software is specialized in cis-NAT detection using strand-specific RNA sequencing data. It models the probability of finding read enrichment in each strand using a binomial model and identifies cis-NAT conditional on additional spatial criteria such as the location in opposite strands and the proximity in the genome.

To our knowledge, the only tool implementing an engine for generic NAT search (including

trans-NAT) is NATpipe [164]. This is a pipeline for NAT prediction in organisms without a

reference genome. Using transcriptomic data, it performs a BLAST-based search to pre-select NAT pair candidates. Then, the annealing potential of these candidates is explored by RNAplex [165]. The secondary structure is analyzed and instances containing bubbles in the annealed region comprising more than 10% of its length are rejected. If sRNA-seq data is available, NATpipe can perform a search for prospective nat-siRNA by looking to phasing patterns in the annealed region in a similar way to what is done by tools for ta-siRNA detection. To distinguish trans-nat-siRNA from cis-nat-siRNA, it is necessary to keep in mind that the concept of trans implies transcripts not sharing a common genomic location.

(37)

36

2.3.2.4 Detecting hc-siRNA and the respective generating loci

In plants, hc-siRNAs are typically 24 nt long and mostly derive from transposons, repeats and heterochromatic regions. Their biogenesis is primarily connected with the activity of PolIV-RDR2-DCL3 [88, 166]. hc-siRNAs are central for RNA-directed DNA methylation (RdDM), which is the pathway responsible for de novo DNA methylation. A hallmark of RdDM is the presence of cytosine methylation in all DNA sequence contexts (CG, CHG, and CHH, where H can be C, A, or T) [166, 167]. Some transposon families can switch the production of siRNA from 24 nt to 21-22 nt when methylation is lost [86, 87]. This transition starts with the synthesis of transcripts by Pol II that are afterwards degraded into 21-22 nt siRNA. Some of these siRNA can enter non-canonical RdDM dependent on Pol II, RDR6 and DCL2/4 [87]. To date, there are no public tools specifically developed to detect hc-siRNA. This limitation is in part due to the fact that hc-siRNA biology remains unclear, and experimental tests to functionally validate hc-siRNA are difficult to establish. Although certain families of TEs have been described to produce hc-siRNA when epigenetic marks are changed [86, 87], the reason for their involvement in hc-siRNA biogenesis remains poorly understood. So far, the identification of hc-siRNA has focused mostly on the abundance of 24 nt sRNA mapping to differentially methylated regions that arise in RdDM mutants [86–88, 166–169], the correlation with the presence or absence of histone marks [170] and variations in gene expression. However, numerous epigenetically activated sRNAs have been identified, which seem to lack the functional properties to be included in the hc-siRNA category. Rather they show PTS activity [22, 86], but the structural features that separate these from true hc-siRNA are unclear. Although the length of mature sRNA sequences is somewhat predictive, and has been taken as a way to discriminate putative hc-siRNA from other types of siRNA (hc-siRNA are typically 24nt in length), this feature - by itself - can be misleading. For example, miR163 is an example of a 24 nt long miRNA which has an exceptionally long length for a DCL1-dependent sequence, and that despite its length primarily binds to AGO1 exerting post-transcriptional regulation [168].

2.3.3 Function prediction in plants

Established nomenclature for miRNA annotation [28] does not require the identification of a functional target sequence, as target prediction can be notoriously difficult. Moreover,

University of Groningen Computational Methods for High-Throughput Small RNA Analysis in Plants Monteiro Morgado, Lionel

Lionel Morgado

Computational Methods for

High-Throughput

Computational Methods for

High-Throughput

Small RNA Analysis in Plants

PhD thesis

CONTENTS

CHAPTER 1

1.1 Small regulatory RNAs: historical milestones

1.2 Endogenous small RNAs in plants

1.2.1 Hairpin sRNA and microRNA

1.2.2 Natural antisense transcript siRNA

1.2.3 Secondary and trans-acting siRNA

1.2.4 Heterochromatic siRNA

1.3 Small RNA in plant transgenerational epigenetic inheritance

1.4 Interrelationship between sRNA pathways

1.5 Analyzing sRNA: computational challenges from the “dry lab”

1

1

CHAPTER 2

Computational tools for plant small RNA

detection and categorization

2.1 Introduction

2.2 Small RNA in Plants

2

2.3 Computational approaches for sRNA detection and categorization

2.3.1 Detecting known sRNA

2.3.2 Computer-aided de novo sRNA categorization

2

2.3.2.1 Identification of hairpin structures and miRNA classification

2.3.2.2 Detection of secondary siRNA and ta-siRNA

2.3.2.3 Finding NAT pairs and nat-siRNA

2.3.2.4 Detecting hc-siRNA and the respective generating loci

2.3.3 Function prediction in plants