
University of Groningen

Computational Methods for High-Throughput Small RNA Analysis in Plants
Monteiro Morgado, Lionel


Publication date: 2018


Citation for published version (APA):

Monteiro Morgado, L. (2018). Computational Methods for High-Throughput Small RNA Analysis in Plants. University of Groningen.



CHAPTER 3

Learning sequence patterns of AGO-sRNA affinity from high-throughput sequencing libraries to improve in silico functional small RNA detection and classification in plants

Lionel Morgado, Ritsert C Jansen and Frank Johannes

Abstract

The loading of small RNA (sRNA) into Argonaute complexes is a crucial step in all regulatory pathways identified so far in plants that depend on such non-coding sequences. Important transcriptional and post-transcriptional silencing mechanisms can be activated depending on the specific plant Argonaute (AGO) protein to which a sRNA binds. It is known that sRNA-AGO associations are at least partly encoded in the sRNA primary structure, but the sequence features that drive this association have not been fully explored. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-immunoprecipitation experiments to identify features that determine sRNA affinity to specific AGOs. Our SVM models reveal that AGO affinity is strongly determined by complex k-mers in the 5’ and 3’ ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We integrated the learned SVMs in a computational pipeline that can be used for de novo functional classification of sRNA sequences. This tool, called SAILS, is provided as a web portal accessible at http://sails.lionelmorgado.eu.


3.1 Introduction

Small RNAs (sRNAs) are a class of non-coding RNA with significant roles in developmental biology, physiology, pathogen interactions and, more recently, genome stability and transposable element control [194]. Plants have two major classes of sRNA: the microRNA (miRNA), which is processed from imperfectly self-folded hairpin precursors derived from miRNA genes [195]; and the small interfering RNA (siRNA), which is produced from double-stranded RNA [196]. siRNAs can be further divided into three major groups: secondary siRNA, such as trans-acting (ta)-siRNAs, which are promoted by miRNA cleavage of messenger RNA; natural antisense transcript (nat)-siRNAs, derived from the overlapping regions of antisense transcript pairs naturally present in the genome; and heterochromatic (hc)-siRNAs, mostly generated from transposons, heterochromatic and repetitive genomic regions and involved in DNA methylation and heterochromatin formation. Apart from its biogenesis, the mode of action of a given sRNA is tightly related to the Argonaute protein to which it can bind [197]. Argonautes form the core of all sRNA-guided silencing complexes identified so far. Once loaded into Argonaute, the sRNA guides the silencing machinery to targets through base-pairing principles. Argonautes are highly conserved proteins with family members in most eukaryotes [197–199]. Although there are two main subfamilies of Argonautes in eukaryotes, AGO and PIWI, only AGO proteins are found in plants. In plants, AGOs can be grouped into three phylogenetic clades with a highly variable number of members from species to species [198] (Figure 3.1). Arabidopsis has ten members with specialized or redundant functions among them: AGO1, AGO5 and AGO10 form the first clade; AGO2, AGO3 and AGO7 form the second clade; and the third clade is composed of AGO4, AGO6, AGO8 and AGO9 [197]. Members of the first and second clades are involved in post-transcriptional silencing (PTS) by inhibiting translation or by promoting messenger RNA cleavage, and AGOs in the third clade are chromatin modifiers that induce transcriptional silencing (TS) via mechanisms such as DNA methylation [200–202]. Understanding the mechanisms that determine the loading of sRNA onto specific AGOs is essential for predicting their biological function and for identifying their putative silencing targets.

High-throughput sequencing in combination with immunoprecipitation (IP) techniques has made it possible to determine the sequences of sRNA that are bound to different AGO families. AGO-IP experiments have been performed for AGO1, AGO2, AGO4, AGO5, AGO6, AGO7, AGO9 and AGO10. The low expression level of AGO3 and AGO8 suggests that they may not be functionally relevant. Previous analyses have shown that AGO-sRNA associations are partly determined by the 5' terminus and the length of a sRNA sequence [203, 204]. Sequences of approximately 21 nucleotides (nt) tend to be involved in PTS, while 24 nt sRNAs are characteristic of TS. Although sequence length has been widely used to infer the silencing pathway a given sRNA is most likely implicated in, by itself it is an inaccurate predictor, since many sequencing products lack other structural features known to enable AGO loading [203–206]. Contrary to animals, in plants the 5' nucleotide is also recognized as a strong indicator of AGO sorting. Enrichment for sequences starting with pyrimidines is frequent in AGOs from the first clade (AGO1/10: uridine; AGO5: cytosine), whereas adenosine dominates the third clade (AGO4/6/9) as well as AGO2 from the second clade. Furthermore, direct experimental evidence shows that mutating the 5' nucleotide of a sRNA can redirect its AGO destination [207]; however, other relevant features, such as sequence motifs encoded in the primary structure, also appear to play a role [200, 208–211]. While sRNAs that participate in PTS have been intensively studied, leading to the discovery of many structural features that influence activation, much less is known about AGO-associated hc-siRNA in transcriptional silencing. Indeed, most studies that use hc-siRNAs place a strong emphasis on sequence length, ignoring other important aspects of the sRNA sequence that promote an active role in genomic regulation. Studying AGO-bound sRNA is a starting point to fill this gap and to improve our understanding of the relationship between the structure and function of sRNA in plants. Here we train support vector machines (SVMs) on sRNA sequencing data obtained from AGO-IP experiments to identify features that determine sRNA affinity with specific AGOs. Our SVMs reveal that AGO affinity is strongly determined by complex k-mers in the 5' and 3' ends of sRNA, in addition to well-known features such as sRNA length and the base composition of the first nucleotide. Moreover, we find that these k-mers tend to overlap known transcription factor (TF) binding motifs, thus highlighting a close interplay between TF and sRNA-mediated transcriptional regulation. We incorporated the learned SVMs in an online computational pipeline that can be used for de novo sRNA functional classification. The classification pipeline is suitable for individual sRNAs as well as for high-throughput sRNA-seq datasets.

Figure 3.1. Phylogenetic tree for AGO in A. thaliana. Functional groups are indicated: transcriptional silencing (TS) and post-transcriptional silencing (PTS).

3.2 Material and methods

3.2.1 Data sets

Table 3.1 summarizes the A. thaliana deep sequencing sRNA libraries used in this study. For SVM model training and testing we used one Col-0 wild-type sRNA library as well as eight AGO-IP datasets. SVM model validation was afterwards performed on additional AGO-IP data from different tissues and from pathogen-infected plants, in addition to a set of putative miRNA and ta-siRNAs from several plant species. All sRNA-seq datasets were pre-processed and mapped to the Col-0 A. thaliana reference genome. Reads with at least one perfect match were collapsed into unique sRNA sequences and single-copy sRNAs were removed. sRNA sequences in the genome-wide library that were not present in any of the AGO-IP sets were isolated as a new group named "noAGO". The "noAGO" set was used for SVM training in combination with the AGO-IP data to learn discriminative rules that identify sequences with low potential to load onto an AGO and therefore with a small chance of becoming functional sRNA.
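As a minimal illustration of the pre-processing step described above (collapsing perfectly mapped reads into unique sequences and discarding single-copy sRNAs), the following Python sketch works on assumed in-memory read lists rather than on real mapping output; all names are illustrative:

```python
from collections import Counter

def collapse_reads(reads):
    """Count identical read sequences and keep only those observed more than once."""
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n > 1}

# toy usage: the singleton read is removed, the duplicated one is kept with its count
reads = ["TCGGACCAGGCTTCATTCCCC", "TCGGACCAGGCTTCATTCCCC", "AAGCTGGGTTGAGAGGGCGA"]
unique_srnas = collapse_reads(reads)
print(unique_srnas)
```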


Table 3.1. Libraries of sRNA used in the current work. TOT: total sRNA; AGO-IP: AGO immunoprecipitated; NA: Not Applicable. T: used for training; V: used for validation.

# Type Notes Species Tissue Reference T V

1 AGO1-IP - A. thaliana Inflorescence Mi, 2008 Y Y

2 AGO1-IP - A. thaliana Flower Zhu, 2011 N Y

3 AGO1-IP - A. thaliana Flower Wang, 2011 N Y

4 AGO1-IP - A. thaliana Leaf Wang, 2011 N Y

5 AGO1-IP - A. thaliana Root Wang, 2011 N Y

6 AGO1-IP - A. thaliana Seedling Wang, 2011 N Y

7 AGO1-IP - A. thaliana Inflorescence Qi, 2006 N Y

8 AGO1-IP Pseudomonas syringae infection set A. thaliana Leaf Zhang, 2011 N Y

9 AGO2-IP - A. thaliana Inflorescence Montgomery, 2008 Y Y

10 AGO2-IP - A. thaliana Inflorescence Mi, 2008 N Y

11 AGO2-IP Pseudomonas syringae infection set A. thaliana Leaf Zhang, 2011 N Y

12 AGO4-IP - A. thaliana Inflorescence Mi, 2008 Y Y

13 AGO4-IP - A. thaliana Inflorescence Havecker, 2010 N Y

14 AGO4-IP - A. thaliana Inflorescence Qi, 2006 N Y

15 AGO4-IP - A. thaliana Flower Wang, 2011 N Y

16 AGO4-IP - A. thaliana Leaf Wang, 2011 N Y

17 AGO4-IP - A. thaliana Root Wang, 2011 N Y

18 AGO4-IP - A. thaliana Seedling Wang, 2011 N Y

19 AGO5-IP - A. thaliana Inflorescence Mi, 2008 Y Y

20 AGO6-IP - A. thaliana Aerial Havecker, 2010 Y Y

21 AGO7-IP - A. thaliana Inflorescence Montgomery, 2008 Y Y

22 AGO9-IP - A. thaliana Aerial Havecker, 2010 Y Y

23 AGO10-IP - A. thaliana Inflorescence Zhu, 2011 Y Y

24 AGO1-IP HA-AGO1-DAH(Col-0)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y

25 AGO1-IP HA-AGO1-DAH(Col-0)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y

26 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y

27 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV-AS9 A. thaliana Rosette Garcia-Ruiz, 2015 N Y

28 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y

29 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y

30 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y

31 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y

32 AGO1-IP HA-AGO1-DAH(Col)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y

33 AGO1-IP HA-AGO1-DAH(Col)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y

34 AGO10-IP HA-AGO10-DDH(Col)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y

35 AGO10-IP HA-AGO10-DDH(Col)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y

36 AGO10-IP HA-AGO10-DDH(Col)_Mock A. thaliana Rosette Garcia-Ruiz, 2015 N Y

37 AGO10-IP HA-AGO10-DDH(Col)_TuMV A. thaliana Rosette Garcia-Ruiz, 2015 N Y

38 AGO1-IP HA-AGO1-DAH(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y

39 AGO1-IP HA-AGO1-DAH(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y

40 AGO10-IP HA-AGO10-DAH(ago2-1)_Mock A. thaliana Cauline Garcia-Ruiz, 2015 N Y

41 AGO10-IP HA-AGO10-DAH(ago2-1)_TuMV-AS9 A. thaliana Cauline Garcia-Ruiz, 2015 N Y

42 AGO2-IP HA-AGO2-DAD(ago2-1)_Mock A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y

43 AGO2-IP HA-AGO2-DAD(ago2-1)_TuMV A. thaliana Inflorescence Garcia-Ruiz, 2015 N Y

44 miRNA All known miRNA from Arabidopsis A. thaliana NA Kozomara, 2014 N Y

45 miRNA All known miRNA in plants All plants NA Kozomara, 2014 N Y

46 ta-siRNA All known tasiRNA from Arabidopsis A. thaliana NA Zhang, 2014 N Y

47 ta-siRNA All known tasiRNA in plants All plants NA Zhang, 2014 N Y

48 TOT Total sRNA from wild type A. thaliana Inflorescence Slotkin, 2009 Y N


3.2.2 Learning procedure

A supervised machine learning approach rooted in the SVM algorithm was devised to learn classifiers capable of determining AGO-sRNA affinity from the sRNA sequences alone (Supplementary Material). Briefly, the complete inference system comprises three layers (Figure 3.2A): layer 1 includes a binary SVM model that filters out sequences that do not show strong evidence for binding to any of the known plant AGOs and that are therefore expected to be inactive; layer 2 is composed of an ensemble of binary one-vs-one classifiers, each trained to explore the dissimilarities in AGO-bound sRNA sequences in a pairwise fashion; finally, layer 3 consists of a voting system that assigns a single score to each AGO using the decision values produced in the previous layer. Layers 2 and 3 are interconnected, since the outputs of the classifiers from the 2nd layer serve as inputs to the 3rd layer, and the 3rd layer combines them to provide a more informative result. Layer 1 is independent of all other classifiers and can therefore be decoupled if desired. All sRNA sequences used in training, testing or validation were transformed into sets of features comprising:

i. Position specific base composition (PSBC)

One way to convert a string into a numerical representation is to use flags for the presence of a given nucleotide at a given sequence position. This is equivalent to mapping each sequence position to a four-dimensional feature space that represents each of the four possible bases in the DNA alphabet as follows:

A = ⟨1,0,0,0⟩, C = ⟨0,1,0,0⟩, G = ⟨0,0,1,0⟩, T = ⟨0,0,0,1⟩.

Using this format, a sequence can be mapped to a feature space of dimensionality F = A · L, where A is the number of possibilities in the alphabet of nucleic acids and L expresses the dependence on the sequence length. Since the length of the sRNA sequences analyzed here is not constant but varies within an interval with an upper limit here defined as 27 bases, all sRNA must be projected into a space of size 27 × 4 = 108. For sequences shorter than 27 nt, the extra positions in the feature vector simply remain empty for all nucleotides. Although this approach is a reasonable solution to cope with the variation observed in length, it has the disadvantage of introducing noise into the representation that increases as the 3' end of the model sequence is approached.


Figure 3.2. Machine learning architectures employed: (A) architecture of the inference system; (B) cascade SVM implemented. In the cascade scheme, data are split into subsets and SVs are determined for each one. The results are combined two-by-two and used for training in the next cascade layer. This is done sequentially until a single model is obtained in the last layer. The resulting SVs are then tested for global convergence by feeding the result of the last layer into the first layer, together with the non-SVs. FV: feature vectors; DVi: decision values from classifier i; TDi: training data partition i; SVj: SVs produced by optimization j; CLk: cascade layer k.


This happens because the rightmost nucleotides of the real sRNA sequences are projected onto more central positions of the model sequence, blurring 3'-side positional patterns when looking across instances of variable size. To compensate for that effect, the same kind of projection, but starting at the 3' position of the sRNA instead of the 5' end, was additionally considered in a feature set here referred to as PSBC2.
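As a minimal sketch (not the exact implementation used here), the position-specific encoding and its 3'-anchored variant could be computed as follows in Python; the 27 nt upper limit follows the text, while function and variable names and the example sequence are illustrative:

```python
import numpy as np

BASES = "ACGT"
MAX_LEN = 27  # upper length limit used above (feature space of 27 x 4 = 108)

def psbc(seq, anchor="5p"):
    """One-hot, position-specific base composition of an sRNA sequence.

    anchor="5p" counts positions from the 5' end (PSBC); anchor="3p" counts
    them from the 3' end (PSBC2). Positions beyond the sequence length stay zero.
    """
    vec = np.zeros((MAX_LEN, len(BASES)))
    s = seq.upper().replace("U", "T")
    if anchor == "3p":
        s = s[::-1]
    for pos, base in enumerate(s[:MAX_LEN]):
        if base in BASES:
            vec[pos, BASES.index(base)] = 1.0
    return vec.ravel()  # 108-dimensional feature vector

example = np.concatenate([psbc("UUGACUGUUGAAUCUCAUGGCAAC", "5p"),
                          psbc("UUGACUGUUGAAUCUCAUGGCAAC", "3p")])
```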

ii. k-mer composition

Approaches based on k-mers map the occurrences of subwords of a given length in the sRNA sequence onto a feature space that represents all possible k-mers of that length. Taking k-mers of size 2 as an example, there are 4² = 16 possibilities in the 4-letter universe of DNA. The 16-dimensional 2-mer vector for the DNA alphabet "ACGT" would then be:

〈𝐴𝐴, 𝐴𝐶, 𝐴𝐺, 𝐴𝑇, 𝐶𝐴, 𝐶𝐶, 𝐶𝐺, 𝐶𝑇, 𝐺𝐴, 𝐺𝐶, 𝐺𝐺, 𝐺𝑇, 𝑇𝐴, 𝑇𝐶, 𝑇𝐺, 𝑇𝑇〉.

As an example, mapping the sequence "ATGCATG" onto this vector space by counting the occurrences of each of the possible k-mers yields:

〈0,0,0,2,1,0,0,0,0,1,0,0,0,0,2,0〉.
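A small Python sketch of this counting scheme (the fixed alphabetical k-mer ordering follows the vector shown above; the function name is illustrative) reproduces the example:

```python
from itertools import product

def kmer_counts(seq, k):
    """Count occurrences of every possible DNA k-mer in a sequence, in fixed order."""
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]  # AA, AC, AG, AT, CA, ...
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    return [counts[km] for km in kmers]

print(kmer_counts("ATGCATG", 2))
# -> [0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0], matching the example above
```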

It is important to note that this method focuses on the frequency of patterns rather than on their position in the sequence. Here, k-mers of length 1 to 5 were explored.

iii. Shannon entropy scores

Entropy, as a measure of information content, gives an indication of the degree of repetitiveness in a sequence. Among several flavors, Shannon entropy is one of the most popular and consists of a score given by

H(X) = − Σᵢ p(xᵢ) log₂ p(xᵢ),

with X a sequence of length l and p(xᵢ) the frequency, in the sequence, of the character found at position i.
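A minimal Python sketch of this score, interpreting p(xᵢ) as the within-sequence frequency of each nucleotide (the function name is illustrative):

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Shannon entropy of a sequence computed from its nucleotide frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(shannon_entropy("AAAAAAAAAAAAAAAAAAAAA"))  # 0.0: fully repetitive
print(shannon_entropy("ACGUACGUACGUACGUACGU"))   # 2.0: all four bases equally frequent
```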

iv. Sequence length

This feature is simply the number of nucleotides that compose each sRNA. Although simple, it is informative: the enrichment for certain sizes in specific pathways is a consistent observation.


The learning methodology applied to layers 1 and 2 was similar. Prior to model training, highly correlated features (|Pearson correlation| > 0.75) were removed, keeping a single representative randomly selected from each correlated set. The remaining features were normalized to the range between 0 and 1 to avoid dominance effects and numerical difficulties in the downstream calculations [212]. SVM learning with recursive feature elimination (SVM-RFE) [213] was then applied to find models for layers 1 and 2 with a reduced and more informative feature set. To circumvent computational problems that typically arise when standard SVM algorithms are applied to large datasets, a linear kernel was employed with a specialized linear solver [214]. A 5-fold cross-validation procedure was implemented to account for data variation in each feature selection round, and the mean ROC-AUC was calculated to assess the quality of the classifiers. In each round, one third of the features with the lowest contribution to the discriminative model were eliminated, until 10 features were left. From there on, features were excluded one by one until no more features were available. The optimal feature subset for each classifier from layers 1 and 2 was determined with an elbow method applied to the curve formed by the mean cross-validation ROC-AUC values recorded during the feature selection process. The best features were subsequently used to train classifiers with LIBSVM [215] and radial basis function (RBF) kernels, to explore non-linear relationships in the data. A cascade scheme was implemented for this task to tackle computational problems that otherwise would not allow learning with such large datasets (Figure 3.2B). To avoid biases in learning, training was performed with balanced datasets by undersampling the largest class involved in each binary problem. Sequences with the highest read abundance were prioritized, and the remaining spots were filled with instances randomly selected from the remaining pool(s).
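A condensed sketch of these layer 1/2 steps using scikit-learn is shown below; it keeps the thresholds quoted above (|r| > 0.75, 5-fold ROC-AUC, removal of roughly one third of the features per round), but substitutes RFECV for the manual elbow-based selection, keeps the first feature of each correlated group instead of a random one, and uses illustrative names (X, y, layer_model):

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

def drop_correlated(X, threshold=0.75):
    """Keep one representative from each group of highly correlated features."""
    corr = np.abs(np.nan_to_num(np.corrcoef(X, rowvar=False)))
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep]

def layer_model(X, y):
    """Correlation filter -> [0, 1] scaling -> linear-SVM RFE scored by 5-fold ROC-AUC."""
    X = MinMaxScaler().fit_transform(drop_correlated(X))
    svm = LinearSVC(C=1.0, max_iter=10000)
    selector = RFECV(svm, step=0.33, cv=5, scoring="roc_auc", min_features_to_select=10)
    selector.fit(X, y)
    auc = cross_val_score(svm, X[:, selector.support_], y, cv=5, scoring="roc_auc")
    return selector.support_, auc.mean()
```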

In layer 3, a balanced dataset composed of sRNA from all 8 available AGO groups was created. Decision values were computed for each of the sequences using the classifiers obtained for layer 2 and served as input to a 5-fold cross-validation procedure used to train and test the inference scheme applied to the 3rd layer. Three strategies were explored to combine the outputs from layer 2: a voting system, where the winner is the AGO protein with the largest number of decisions in its favor; a weighted rule learned with a linear SVM algorithm; and a weighted rule learned with a non-linear SVM using an RBF kernel. All SVM hyperparameters in layers 1, 2 and 3 were tuned by means of a grid search. A more detailed description of the SVM approach is provided in the Supplementary Material.
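For the simplest of the three combination strategies, the unweighted vote, a small sketch could look as follows (the sign convention and the AGO names in the toy example are assumptions; the weighted variants would instead feed the vector of decision values to another SVM):

```python
from collections import Counter

def vote(decision_values):
    """decision_values: {(ago_a, ago_b): d}, where d > 0 favours ago_a and d < 0 favours ago_b.
    Returns the AGO with the largest number of pairwise wins (layer 3 voting rule)."""
    wins = Counter()
    for (ago_a, ago_b), d in decision_values.items():
        wins[ago_a if d > 0 else ago_b] += 1
    return wins.most_common(1)[0][0]

example = {("AGO1", "AGO2"): 0.8, ("AGO1", "AGO4"): -0.3, ("AGO2", "AGO4"): -1.1}
print(vote(example))  # AGO4: it wins both of its pairwise comparisons
```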


3.3 Results

3.3.1 Detection of high confidence functional sRNA

We explored a series of SVM classifiers to discriminate putatively functional from non-functional sRNA based on various sRNA sequence features. As outlined above (see section Material and Methods), SVMs were trained on sequenced A. thaliana Columbia (Col-0) AGO-IP sRNA libraries in comparison with libraries of Col-0 total sRNA. Specific sRNA sequences contained in the AGO-IP sRNA libraries were labeled “AGO”, whereas sequences contained in the total library (but not in the AGO-sRNA libraries) were labeled “noAGO”. To minimize sequencing noise, single copy sRNA were removed.

5' and 3' k-mers are important in distinguishing functional from non-functional sRNA

We first trained separate SVMs using either only the position-specific base composition (PSBC or PSBC2) of the sRNA sequences, sRNA entropy, sRNA length, or all possible k-mers of size 1 to 5 nucleotides (see section Learning procedure). The SVMs trained on the PSBC or the k-mer features achieved considerable classification accuracy (~65%), while the SVMs trained using sRNA length or entropy performed less well (accuracy: ~50-60%, Figure 3.3). We then joined all the features together in a single classifier. Starting with 1582 features in total, our selection procedure yielded a final classifier containing only 138 features (Figure 3.3), which achieved a classification accuracy of nearly 80%. Of the 138 features, 127 are k-mers and 10 correspond to position specific nucleotides (Figure 3.4).

Figure 3.3. Results for the classifiers trained to detect sRNA from the sets AGO vs noAGO: number of features and performance. Pie charts represent the fraction of features in the complete set (the total number is in the central black section of the pie) that was eliminated during the correlation analysis (red section in the external ring) or kept (grey section in the external ring). The remaining features were subjected to selection with SVM-RFE, of which a subset comprises the optimal features (light blue section). For each initial conglomerate of features, accuracy was measured using all features in a set (grey bars) and after feature selection (light blue bars). The optimized features determined when all feature sets were analyzed together were then subjected to non-linear learning using a cascade SVM scheme (dark blue bar). Each bar plot has on top the standard deviation for the accuracy calculated from the 5-fold cross-validation procedure.

Figure 3.4. Features ranked by absolute value of their weights in the optimized classifier from layer 1 ("AGO" vs "noAGO"). Position specific base composition (PSBC) features are represented in the format LP, where L ∈ {A, C, G, U} and P is the position of the nucleotide in the sequence, starting the count at the 5' end, or at the 3' end in the case of PSBC v2, which is represented by iLP; Shannon: sequence information content given by Shannon entropy.

The k-mers of the final classifier were examined in more detail. Since the k-mers had no positional requirements for inclusion in the model, we asked whether those retained in the final classifier showed any positional bias in the actual sRNA sequences. To do this, each of these k-mers was mapped back to the sRNA sequences contained in the "AGO" and the "noAGO" libraries. We found that the location of the k-mers was strongly biased toward the 5' end of the sRNAs and, to some extent, also toward the 3' end (Figure 3.5), with a visible depletion toward the center of the sRNA sequences.

When looking at the 5' end of sequences with meaningful length, we observed that 24 nt sRNAs from the "AGO" set are dominated by k-mers starting with an adenine (A), while 21 nt sRNAs show a mixture of adenine, uracil (U) and cytosine (C); in both cases 5' guanine (G) is absent (Figure 3.6). This finding is consistent with previous reports showing that the base composition of the first nucleotide of a sRNA is important for AGO affinity [194, 197, 207]. However, the vast majority (125 out of 138) of the features consisted of k-mers longer than one nucleotide, indicating that more complex sequence features guide AGO-sRNA associations.

Figure 3.5. Density distribution of k-mers retained in the final classifier discriminating between functional (AGO) versus non-functional (noAGO) sRNA (layer 1). The projection of k-mers in the AGO set shows a clear bias toward the 5' and 3' ends of the sequences.


Figure 3.6. Frequency of nucleotides in k-mers across the sRNA libraries. This was determined from the distribution of k-mers retained in the classifiers according to their density peaks in 21 nt and 24 nt sRNA: (A) layer 1: AGO vs noAGO; (B) layer 2: mean of the frequencies for classifiers involving a given AGO.


5' and 3' k-mers are enriched for transcription factor binding motifs

We further assessed whether the k-mers correspond to known sequence motifs. To that end, we performed a motif analysis on the "AGO" and the "noAGO" sets with MEME [216], a computational framework for de novo and known motif identification. We then focused on k-mers with a size of 5 nt and mapped them to the motifs retrieved in the previous step. We found that the majority of k-mers (38 of 46) corresponded to segments of known or predicted motifs, and noted a significant enrichment for k-mer matches in "AGO" compared with "noAGO" (100 and 32, respectively). Interestingly, the known motifs mapping to k-mers were consistently related to stress response and development, processes known to be highly regulated by sRNA (Table S3.1).

The TF enrichment was not surprising for 21 nt sRNA, which are known to act on genic sequences that frequently contain TF binding motifs. However, similar TF patterns were also found for 24 nt sRNA, suggesting a role in gene regulation beyond the well-documented function of 24 nt sequences in heterochromatin silencing. We identified the 5-mer "AGAAG" as the k-mer showing the strongest enrichment in 24 nt sequences compared with 21 nt sRNA from the "AGO" set, and noted that this subsequence was associated with 3 motifs: At1g68670, a G2-like protein involved in phosphate homeostasis; SVP, a MADS protein that acts as a floral repressor and functions within the thermosensory pathway; and MYB77, which is expressed in response to potassium deprivation and auxin. All these motifs have been shown to have higher DNA-binding capacity when their targets are unmethylated [217], revealing a possible bridge between 24 nt heterochromatic sRNA, DNA methylation and TF occupancy.

Figure 3.7. Results for the models trained to discriminate sRNA from two different AGO-IP sets (layer 2): features selected, performance and top-3 features. The top three rows of pie charts represent the features kept after the correlation analysis (grey section), a fraction of which comprises the optimal sets obtained by feature selection (light blue section). The three following rows of pie charts represent the composition of the features kept after feature selection, with the number in the middle representing the total number of features that survived SVM-RFE. For each binary model, ROC-AUC is shown for classifiers trained using all non-correlated features (grey bars), after feature selection (light blue bars) and using the optimized feature set in non-linear learning with the cascade SVM scheme (dark blue bars). Each bar plot has on top the standard deviation for the performance calculated from the 5-fold cross-validation procedure. The top-3 features found with the feature selection procedure (features kept in the last 3 SVM-RFE iterations) are indicated in a third section. The bottom section indicates the AGO-IP sets used in each of the 28 models, plus a colour scheme that marks inter- (white) and intra-clade models (red: clade 1; blue: clade 2; green: clade 3).


3.3.2 Classification and validation of specific AGO-sRNA associations

In the previous section, our goal was to build a classifier that can distinguish functional from non-functional sRNA. We achieved this by training on "AGO" versus "noAGO" sRNA. A related problem is to find properties of sRNA that allow inference of their differential loading onto specific AGO proteins, as this determines their particular mode of action. To achieve this, we trained binary one-vs-one classifiers to find sequence features that discriminate between the eight different Col-0 AGO-IP libraries (layer 2, Figure 3.2A). The ensemble of one-vs-one classifiers was then subjected to a voting system that assigns a single score to each AGO (layer 3, Figure 3.2A).

We used a 5-fold cross-validation scheme to determine the accuracy of the inference system at three levels: 1. AGO: the fraction of sequences for which the bound AGO protein was correctly predicted; 2. Clade: the fraction of sequences for which the AGO prediction falls within the correct clade; and 3. Function: the fraction of sequences for which the functional group can be correctly assigned based on the AGO prediction, translating predictions for AGO4/6/9 into potential involvement in TS and assignments to the other AGOs into suggested PTS activity. In addition, we performed a validation using 35 A. thaliana AGO-IP datasets never seen by the classifiers during training, plus miRNA and ta-siRNA libraries from multiple plants. The AGO-IP validation sets were collected from different tissues and experimental conditions, which allowed us to evaluate the robustness of the classifiers to such variation.
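The three evaluation levels reduce to simple label mappings; a small sketch under the clade assignments of Figure 3.1 and the TS/PTS rule stated above (function and variable names are illustrative):

```python
CLADE = {"AGO1": 1, "AGO5": 1, "AGO10": 1,
         "AGO2": 2, "AGO3": 2, "AGO7": 2,
         "AGO4": 3, "AGO6": 3, "AGO8": 3, "AGO9": 3}
# AGO4/6/9 predictions suggest transcriptional silencing (TS); the others suggest PTS
FUNCTION = {ago: ("TS" if ago in ("AGO4", "AGO6", "AGO9") else "PTS") for ago in CLADE}

def accuracies(predicted, truth):
    """Accuracy at the AGO, clade and function level for paired prediction/label lists."""
    n = len(truth)
    ago = sum(p == t for p, t in zip(predicted, truth)) / n
    clade = sum(CLADE[p] == CLADE[t] for p, t in zip(predicted, truth)) / n
    func = sum(FUNCTION[p] == FUNCTION[t] for p, t in zip(predicted, truth)) / n
    return ago, clade, func
```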

Classification accuracy of sRNA at AGO, clade and function level

Results from our 5-fold cross-validation analysis showed that our classifiers achieved high accuracy (Figure 3.8), indicating that differences between sRNA bound to specific AGOs can indeed be detected. Classification accuracy at the level of specific AGO proteins was around 60% on average, ranging from as low as 40% (AGO7) to as high as 85% (AGO5). Since AGO proteins within a clade are highly homologous and similar in function, it is likely that sRNA-AGO binding is promiscuous in nature, which would render the search for discriminative features more challenging. Indeed, classification accuracy at the level of the clade and of the function was considerably higher than at the level of specific AGOs (clade accuracy: 85%; function accuracy: 93%; see Figure 3.8A). Moreover, the final intra-clade classifiers were generally more complex than inter-clade classifiers, containing on average 122 features compared with 75 features, respectively (Figure 3.8C). Looking at the features retained in the final classifiers, intra-clade classifiers contained proportionally more k-mers longer than 1 nt than inter-clade classifiers (86% and 79% of the features in intra-clade and inter-clade models, respectively). Hence, factors that govern AGO-specific affinities within a clade appear to involve more complex sequence determinants. Similar to the "AGO" versus "noAGO" analysis presented above (section Detection of high confidence functional sRNA), we found that the k-mers in the final classifiers showed a strong positional bias toward the 5' and 3' ends of the sRNA sequences (Figure 3.6). This finding indicates that the same sRNA regions that differentiate functional from non-functional sequences also contribute to the binding affinity of sRNA to specific AGO proteins.

To study the nature of the information contained in the k-mers selected by SVM-RFE in more detail, we compared them with a motif analysis performed for each AGO library using the MEME suite [216]. For simplicity, we focused on 5-mers (Supplementary Material). We found that nearly 40% of the 5-mers were identified as motifs or motif-derived segments (Table 3.2), often related to known transcription factors with roles in development and response to stress. For instance, in models involving AGO2 (an AGO linked to ta-siRNAs) we identified 5-mers matching the motif of ETT, an experimentally validated target of an evolutionarily conserved ta-siRNA known as tasiR-ARF [63]. Targeting of ETT by ta-siRNA has been extensively studied and is known to have pleiotropic effects on Arabidopsis flower development through interference with the auxin pathway [218]. These results thus support the conclusion that SVM learning captured sequence information with relevant biological meaning. A full overview of k-mers overlapping known or predicted motifs is provided in Table 3.2 and in the Supplementary Material.

Validation analysis revealed that our classifiers are relatively sensitive, even across biologically very heterogeneous datasets (Figure 3.9). Sensitivity was particularly high at the level of clade and function (clade: 70%; function: 84.3%), and moderate at the level of specific AGOs (AGO: 42%). Interestingly, when looking at the performance obtained for the ta-siRNA and miRNA datasets, we observed high sensitivity values both for the sets restricted to Arabidopsis (~80% and ~90%, respectively) and for the complete plant sets (~95% and ~90%, respectively). In the set containing a mixture of ta-siRNA from diverse plants, the sensitivity is considerably higher than in the one composed only of sequences from Arabidopsis, which supports the idea that the inference system can identify PTS sRNA not only from the species used in the learning phase but is also extensible to other plant species. In conclusion, the inference system is robust to tissue and treatment variation and can recognize sRNA membership independently of the cellular origin, showing the good generalization desired for the discovery of new functional units.

Figure 3.8. Details of the AGO-sorting inference (layers 2+3). Mean accuracy across the 5-fold cross-validation scheme for the voting system used in layer 3, measured: (A) at the AGO, clade and function level; (B) separately for individual AGO sets. (C) The average number of features composing the intra- and inter-clade classifiers from layer 2, and (D) the percentage of those features which are k-mers. Error bars: standard deviation.

Table 3.2. Number of 5-mers in the final classifiers from layer 2 matching motifs found with MEME in each AGO-IP library. The 5-mers are separated by the sign of their weight w in the final classifier. Columns: #, Set a, Set b, number of features in the final classifier, number of 5-mers favouring AGOa (w > 0), number of 5-mers favouring AGOb (w < 0), number of 5-mers mapping to a motif for AGOa and for AGOb, and the percentage of 5-mers mapping to a motif.

1 AGO1 AGO2 92 22 30 2 19 40.4
2 AGO1 AGO4 27 9 3 0 2 16.7
3 AGO1 AGO5 92 20 18 8 8 42.1
4 AGO1 AGO6 41 14 6 7 1 40.0
5 AGO1 AGO7 92 35 16 19 10 56.9
6 AGO1 AGO9 92 27 17 11 0 25.0
7 AGO1 AGO10 138 53 24 14 9 29.9
8 AGO2 AGO4 61 8 6 7 0 50.0
9 AGO2 AGO5 61 16 14 15 6 70.0
10 AGO2 AGO6 92 25 16 18 3 51.2
11 AGO2 AGO7 61 10 5 3 5 53.3
12 AGO2 AGO9 92 19 18 12 3 40.5
13 AGO2 AGO10 61 11 8 8 4 63.2
14 AGO4 AGO5 18 2 1 1 0 33.3
15 AGO4 AGO6 137 48 17 9 1 15.4
16 AGO4 AGO7 92 35 16 7 11 35.3
17 AGO4 AGO9 131 36 17 8 3 20.8
18 AGO4 AGO10 61 2 6 2 5 87.5
19 AGO5 AGO6 41 4 5 3 1 44.4
20 AGO5 AGO7 61 19 16 10 8 51.4
21 AGO5 AGO9 41 5 5 1 1 20.0
22 AGO5 AGO10 207 65 42 31 21 48.6
23 AGO6 AGO7 138 46 41 7 20 31.0
24 AGO6 AGO9 91 8 18 0 0 0.0
25 AGO6 AGO10 92 9 16 1 11 48.0
26 AGO7 AGO9 138 36 43 22 2 30.4
27 AGO7 AGO10 138 38 49 22 23 51.7
28 AGO9 AGO10 61 6 5 0 3 27.3

Figure 3.9. Sensitivity for the validation sets. Bar plots show results for AGO (light grey), clade (light blue) and function (dark blue) inference. The use of each data set during learning, the sampled tissue and the AGO with which it was immunoprecipitated are indicated by colored squares under the bars, followed by the data set reference numbers, following the enumeration in Table 3.1.

3.4 Discussion

To our knowledge, this is the first time that classifiers have been built to infer AGO-sRNA affinity from the sRNA sequence alone. Using adequate solvers and learning architectures, SVMs could be applied to large genomic datasets and discovered highly discriminative rules, both to distinguish sequences that bind to AGOs from other sequencing products and to infer AGO-sRNA kinship. In addition to the known 5' nucleotide composition and the sequence length, feature selection revealed the contribution of other sequence attributes, mostly complex k-mers, in defining sRNA preference for certain AGO proteins. How robust specific sRNA-AGO affinities are to changes in certain sequence properties is an interesting topic for future research and could shed light on evolutionary constraints on sRNA-mediated transcriptional and post-transcriptional silencing pathways. To answer such questions, additional experiments are necessary to manipulate sRNA sequences in a precise and targeted fashion, for example by use of the CRISPR-Cas9 system. Although our inference method appears to be highly accurate in predicting the putative function of sRNA sequences, it is important to keep in mind that the actual biological activity of a given sRNA depends on factors beyond the AGO loading step, such as the degree of complementarity between a sRNA and its target sequence, as well as the presence of specific chromatin states at the target locus.

AGO-sorting information has the potential to decrease the very high number of false positives reported by currently available PTS target prediction tools, but the true impact remains to be evaluated. In any case, the computational framework developed here displays a discriminative power that makes it suitable for early screens of genome-wide sRNA libraries when looking for candidates with the highest chance of having certain functional roles, and when sorting sRNA into TS and PTS classes is needed. The tool is ultimately a more affordable alternative to expensive and laborious IP experiments, since it can derive AGO-sRNA profiles from a single genome-wide sequencing library. Another interesting application of the framework is to explore whether specific sRNA from exogenous sources, such as artificially designed sequences or those derived from pathogens, correspond to functional plant sRNA and which silencing pathways they are likely to target.

More sophisticated models could be developed using, for example, the expression patterns of the AGO proteins at the moment the sRNA sequencing experiments take place. This extra information may improve the capacity to correctly predict AGO affinity under particular conditions, but it also demands additional and more complex inputs, thus limiting the range of applications of the sRNA classifiers.


S3 Supplementary Material

S3.1 Support Vector Machines

The support vector machine (SVM) is a popular machine learning algorithm with strong ties to statistical learning theory. During the training phase, the SVM searches for the decision hyperplane that maximizes the margin for the training set (x₁, y₁), …, (xₙ, yₙ), composed of n training instances xᵢ with labels yᵢ ∈ {−1, 1} in 2-class problems. Maximum-margin classifiers are frequently the ones that guarantee the highest generalization capacity, which often explains the strong performance of models obtained via SVMs compared to other machine learning algorithms [219].

To attain a classifier of the form w · x + b = 0, the learning process requires minimizing the norm ‖w‖ of the normal vector of the decision hyperplane, subject to yᵢ(w · xᵢ + b) ≥ 1 for i = 1, …, n. In the binary case, a classification is then given by the sign of the classifier output.

To accommodate points that lie on the wrong side of the decision frontier, a soft-margin classifier can be trained instead of the hard-margin version, considering a hinge loss of the form ζᵢ = max(0, 1 − yᵢ(w · xᵢ + b)) and minimizing

(1/n) Σᵢ ζᵢ + λ‖w‖²,

subject to yᵢ(w · xᵢ + b) ≥ 1 − ζᵢ and ζᵢ ≥ 0 for all i. The variable λ defines the trade-off between having a larger margin and the number of training points lying on the wrong side of the margin. For λ = 0, the soft-margin formulation becomes the hard-margin version. Most frequently, the trade-off is tuned not through λ but through the directly related cost parameter C = 1/λ. The above states the optimization problem in a form known as the primal. Classically, it can be solved using quadratic programming, but there are other more sophisticated approaches, such as sub-gradient descent or coordinate descent methods, that can reduce the training time by several orders of magnitude when learning from very large datasets.

Alternatively, the optimization problem can be stated in the dual form

maximize f(α₁, …, αₙ) = Σᵢ αᵢ − ½ Σᵢ Σⱼ yᵢ yⱼ αᵢ αⱼ (xᵢ · xⱼ),

subject to Σᵢ αᵢ yᵢ = 0 and 0 ≤ αᵢ ≤ 1/(2nλ) for all i, with the sums running over i, j = 1, …, n; this form can be solved through the Lagrangian method. In this case the decision hyperplane is described in terms of the training instances xᵢ that lie closest to it and for which the Lagrange multiplier αᵢ ≠ 0, with w = Σᵢ αᵢ yᵢ xᵢ. These instances, placed on the margin, are called support vectors (SVs). When more complex relationships between the classes are encoded in the data, it is beneficial to use a search space of higher dimensionality where the discrimination is more feasible. Data can be projected into such a space using non-linear functions classically employed in machine learning, like polynomials, sigmoids or radial basis functions (RBFs) [220, 221]. RBFs are a special and common choice, since other kernel functions such as the linear and the sigmoid can be obtained as special cases of RBFs [220, 221].
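A small sketch on synthetic data illustrating the dual-form relation w = Σᵢ αᵢ yᵢ xᵢ: for a linear kernel, recombining the support vectors with their dual coefficients (scikit-learn stores yᵢαᵢ in dual_coef_) recovers the primal weight vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w = sum_i (alpha_i * y_i) * x_i, using only the support vectors
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True
```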

S3.2 SVM-RFE

Recursive feature elimination (RFE) is a backward feature elimination method that can be defined in three simple steps:

1. Train a classifier (optimization of the weights wᵢ according to an objective function J).

2. Determine the ranking criterion for all features (the cost DJ(i) or wᵢ²).

3. Remove the feature(s) with the smallest ranking criterion.

The whole process is repeated until a stopping condition is met, such as reaching a desired minimum number of features, an empty feature set, or another criterion.

In the case of SVM-RFE, the principles of RFE are combined with SVM model optimization in a wrapper approach.
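A minimal sketch of these three steps with a linear SVM, ranking features by wᵢ² and removing one feature per iteration (the stopping criterion and the function name are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_features_to_keep=10):
    """Backward feature elimination driven by squared SVM weights (SVM-RFE)."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_keep:
        svm = LinearSVC(max_iter=10000).fit(X[:, remaining], y)   # 1. train
        ranking = svm.coef_.ravel() ** 2                          # 2. rank by w_i^2
        worst = int(np.argmin(ranking))                           # 3. drop the weakest
        remaining.pop(worst)
    return remaining  # indices of the surviving features
```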

S3.3 SVM learning in very large datasets

SVM learning is frequently mentioned as one of the most accurate machine learning methods, thanks to the maximum-margin principle it employs. However, searching for the maximum-margin hyperplane in a large dataset comes with a computational burden: large matrices must be loaded and extensive optimization problems solved. This makes traditional SVM formulations, whose training complexity lies between O(n²) and O(n³) in the number n of training samples [222], difficult to use when exploring large sets. The machine learning community has worked to overcome this limitation, mostly by optimizing the mathematical machinery of SVM learning or by presenting "out of the box" solutions based on the theory behind SVM learning and on experimental observations.

Empirical studies have shown that linear SVM learning can in some cases be sufficient, yielding classifiers with accuracy comparable to that achieved with more complex kernel transformations. There are solvers specialized in linear SVM learning, such as LibLinear [214], with a training time complexity of O(n): the training time grows linearly with the number of training instances, drastically reducing learning time compared to other SVM formulations. Additionally, such an approach does not demand optimizing any hyperparameter beyond the cost, further decreasing the number of training rounds necessary to tune the algorithm.
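In scikit-learn terms, this contrast could be sketched as follows (the parameter grids are illustrative): the liblinear-backed LinearSVC exposes only the cost C, whereas a kernel SVM adds gamma to the search and scales much worse with the number of instances.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC, SVC

# Linear SVM via a liblinear-style solver: only the cost C has to be tuned.
linear_search = GridSearchCV(LinearSVC(max_iter=10000),
                             {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)

# Non-linear (RBF) SVM: gamma must be tuned as well, and training itself
# scales much worse with the number of training instances.
rbf_search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
```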

When non-linear SVM learning is necessary, the computational burden can be attenuated by using special learning schemes, such as the cascade configuration [223], which takes advantage of theoretical properties of the SVM algorithm. This architecture breaks the global optimization process into smaller subproblems that are solved separately by multiple SVMs on subsets of data. The partial solutions, composed of the support vectors, can then be joined and reprocessed in a cascade of SVMs that ends in a single model guaranteed to converge to the global optimum after multiple passes through the cascade. Empirical tests showed that a single pass over the cascade can already produce decision rules with good generalization. In addition, this design can be parallelized, further reducing the training time. SVM formulations for multiclass problems are available. However, since the complexity of the problem increases as more classes are included in the same optimization problem, multiclass SVM classification typically uses binary models. Two flavors can be found: one-vs-one and one-vs-all [224]. In the first case, binary classifiers are trained for each of the pairwise combinations among the K classes, giving a total of K(K−1)/2 classifiers [219]. One-vs-all schemes use fewer classifiers, since a classifier is trained to separate each class i from all remaining groups, for a total of K classifiers. A final prediction can be obtained by combining the inferences of the binary classifiers through mechanisms as simple as an unweighted voting system, where the final decision is made in favor of the group that gathers the highest number of positive predictions across all individual rules. Multiclass prediction through binary schemes has been shown to achieve accuracy levels comparable to those of direct multiclass models [219], and it brings extra advantages associated with parallelization of the training process and ease of implementation.

S3.4 Cascade SVM

The cascade SVM [223] is a learning architecture in which training is performed sequentially on subsets of data. Each partial SVM filters uninformative data by finding sub-solutions that are sequentially joined, through the cascade, into a global solution. Earlier SVM sub-solutions, given by their respective SVs, are fed as input to a later SVM that further simplifies the result in case of data redundancy. This hierarchy ends in a single SVM that joins the SVs from the previously optimized problems into a final model. This divide-and-conquer strategy, which exploits principles of SVM learning in the dual form, reduces the number of training points used to find the global solution by dealing with smaller and therefore computationally more manageable subsets of data. The computational problems faced by the SVM algorithm on large datasets, whose training complexity depends on the number n of training samples and lies between O(n²) and O(n³), are thus circumvented, with obvious gains from the smaller subsets.
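A single-pass sketch of this cascade idea (Figure 3.2B), assuming a power-of-two number of splits and roughly class-balanced partitions; the convergence check across passes and the parallelization are omitted, and all names are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def _support_set(X, y):
    """Train an SVM on a data block and return only its support vectors."""
    clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    return X[clf.support_], y[clf.support_]

def cascade_svm(X, y, n_splits=4):
    """One pass of a cascade SVM: train per subset, keep SVs, merge pairwise, retrain."""
    parts = [_support_set(Xs, ys)
             for Xs, ys in zip(np.array_split(X, n_splits), np.array_split(y, n_splits))]
    while len(parts) > 1:
        parts = [_support_set(np.vstack([parts[i][0], parts[i + 1][0]]),
                              np.concatenate([parts[i][1], parts[i + 1][1]]))
                 for i in range(0, len(parts), 2)]
    # final model trained on the support vectors surviving the last cascade layer
    return SVC(kernel="rbf", C=1.0, gamma="scale").fit(*parts[0])
```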

S3.5 Motif analysis

All sRNA sets used to develop the classifiers presented here were subjected to a motif search to clarify the possible presence of conserved patterns in the sRNA sequences. This task used the MEME-ChIP platform [225] and was directed at the lists of unique sRNAs in the "AGO", "noAGO" and AGO-IP sets. First, a de novo motif discovery search was executed for long ungapped motifs using MEME (6 to 27 nt) and for short ungapped motifs with DREME (3 to 8 nt). The motifs found to be significant (e-value ≤ 0.05) were then analyzed for enrichment in terms of their relative position inside the sequence with CentriMo (e-value ≤ 0.05).


Next, all motifs found were aligned to two libraries of known motifs from Arabidopsis thaliana using Tomtom (e-value ≤ 1). The first library contained a set of 113 motifs detected by protein-binding microarrays (PBM; Franco-Zorrilla et al., 2014) and the second is composed of 872 motifs captured by DNA affinity purification sequencing (DAP-seq; O'Malley et al., 2016).

Finally, the motifs from each set were compared pairwise to identify library-specific and shared motifs.

The k-mers selected by SVM-RFE were matched with perfect similarity (no mismatches, no gaps) to subsequences retrieved with FIMO (p-value < 0.0001), using the motifs previously detected, against each sRNA set from which they derived. For simplicity and to avoid ubiquitous matches, only 5 nt k-mers were used and matched against the sRNA subsequences. To get a more reliable set of matches, we compared the probability of obtaining a certain 5-mer by chance against the probability of finding the segment in the motif where that 5-mer matches. The probability of obtaining a given k-mer by chance was determined from the frequency fᵢ of each nucleotide i ∈ {A, C, G, T} in the A. thaliana genome, i.e. its count divided by the total number of nucleotides in the genome. These values were then combined into a probability for each k-mer by taking the product of the probabilities of the letters in the specific 5-mer. The motif-associated probability was determined in a similar way, but using the position-specific probability matrix of the mapping motif. K-mers with a motif probability lower than the random expectation were discarded.
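A sketch of that filtering criterion, comparing the background probability of a 5-mer (from genome-wide nucleotide frequencies) against its probability under a motif's position-specific probability matrix; the frequencies, the toy PPM and the function names are illustrative placeholders:

```python
import numpy as np

# illustrative genome-wide nucleotide frequencies (counts / genome length)
BACKGROUND = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}

def background_prob(kmer):
    """Probability of drawing the k-mer by chance from the background composition."""
    p = 1.0
    for base in kmer:
        p *= BACKGROUND[base]
    return p

def motif_prob(kmer, ppm, offset):
    """Probability of the k-mer under the motif PPM, starting at a given column."""
    order = "ACGT"
    p = 1.0
    for i, base in enumerate(kmer):
        p *= ppm[offset + i][order.index(base)]
    return p

# toy PPM with 5 columns (each row sums to 1); keep the match only if it beats background
ppm = np.array([[0.7, 0.1, 0.1, 0.1]] * 5)
kmer = "AGAAG"
keep = motif_prob(kmer, ppm, 0) > background_prob(kmer)
```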


Table S3.1. Motifs with 5-mer matches found in the "AGO" and "noAGO" sRNA sets.


Table S3.2. PBM motifs found in the AGO-IP libraries: number of motifs per set at each analysis step (long ungapped, MEME; short ungapped, DREME; centrally enriched, CentriMo; known, Tomtom; remapped, FIMO).

Set Long-ungapped Short-ungapped Centrally-enriched Known Remapped
AGO1 1 40 10 25 36
AGO2 0 28 39 28 38
AGO4 0 28 0 21 20
AGO5 0 74 5 57 72
AGO6 1 22 2 14 10
AGO7 3 84 0 53 80
AGO9 1 27 0 15 18
AGO10 1 28 32 22 37


Table S3.3. All PBM motifs found in the AGO-IP libraries.

# Name TAIR ID Short description

1 ARR14 AT2G01760 Cytokinin-activated signaling pathway, phosphorelay signal transduction system.

2 ATHB51 AT5G03790 Bract formation, floral meristem determinacy, leaf morphogenesis, positive regulation of transcription, DNA-templated, regulation of timing of transition from vegetative to reproductive phase.

3 bZIP60 AT1G42990 bZIP60 mRNA is upregulated by the addition of ER stress inducers. It is involved in endoplasmic reticulum unfolded protein response, immune system process, and response to chitin.

4 DAG2 AT2G46590 Expression is limited to vascular system of the mother plant. Recessive mutation is inherited as maternal-effect and expression is not detected in the embryo. Mutants are defective in seed germination and are more dependent on light and cold treatment and less sensitive to gibberellin during seed germination.

5 DEAR3 AT2G23340 Ethylene-activated signaling pathway.

6 DEAR4 AT4G36900 Ethylene-activated signaling pathway, plant defense, freezing stress, dehydration stress.

7 ETT AT2G33860 ettin (ett) mutations have pleiotropic effects on Arabidopsis flower development, causing increases in perianth organ number, decreases in stamen number and anther formation, and apical-basal patterning defects in the gynoecium. The ETTIN gene encodes a protein with homology to DNA binding proteins which bind to auxin response elements. ETT transcript is expressed throughout stage 1 floral meristems and subsequently resolves to a complex pattern within petal, stamen and carpel primordia. ETT probably functions to impart regional identity in floral meristems that affects perianth organ number spacing, stamen formation, and regional differentiation in stamens and the gynoecium. During stage 5, ETT expression appears in a ring at the top of the floral meristem before morphological appearance of the gynoecium, consistent with the proposal that ETT is involved in prepatterning apical and basal boundaries in the gynoecium primordium. It is a target of the ta-siRNA tasiR-ARF.

8 GLK1 AT2G20570 Chloroplast organization, negative regulation of flower development, negative regulation of leaf senescence, positive regulation of transcription, regulation of chlorophyll biosynthetic process, regulation of transcription and signal transduction.

9 HSFB2A AT5G62020 The heat shock transcription factor B2A is involved in response to chitin and gametophyte development. The mRNA is cell-to-cell mobile.

10 HSFC1 AT3G24520 Heat shock transcription factor C1.

11 KAN1 AT5G16560 Involved in abaxial cell fate specification, carpel development, organ morphogenesis, plant ovule development, polarity specification of adaxial/abaxial axis, radial pattern formation, xylem and phloem pattern formation.

12 KAN4 AT5G42630 Encodes a member of the KANADI family of putative transcription factors. Involved in integument formation during ovule development and expressed at the boundary between the inner and outer integuments. It is essential for directing laminar growth of the inner integument. Along with KAN1 and KAN2, KAN4 is involved in proper localization of PIN1 (necessary for proper flowering) in early embryogenesis.

13 MYB52 AT1G17950 Involved in the regulation of secondary cell wall biogenesis and response to abscisic acid.


14 RAP2.6 AT1G43160 Involved in chloroplast organization, ethylene-activated signaling pathway, and positive regulation of transcription. Mediates cellular response to: heat, abscisic acid, cold, jasmonic acid, osmotic stress, salicylic acid, salt stress, water deprivation and wounding.

15 REM1 AT4G31610 Expressed specifically in reproductive meristems. It is a member of a moderately sized gene family distantly related to known plant DNA binding proteins. Necessary for proper flower development.

16 RVE1 AT5G17300 Myb-like transcription factor that regulates hypocotyl growth by regulating free auxin levels in a time-of-day specific manner.

17 SPL7 AT5G18830 Encodes a member of the Squamosa Binding Protein family of transcriptional regulators. SPL7 is expressed highly in roots and appears to play a role in copper homeostasis. Mutants are hypersensitive to copper deficient conditions and display a retarded growth phenotype. SPL7 binds to the promoter of the copper responsive miRNAs miR398b and miR389c.

18 STY1 A member of SHI gene family. Arabidopsis thaliana has ten members that encode proteins with a RING finger-like zinc finger motif. Despite being highly divergent in sequence, many of the SHI-related genes are partially redundant in function and synergistically promote gynoecium, stamen and leaf development in Arabidopsis. STY1/STY2 double mutants showed defective style, stigma as well as serrated leaves. Binds to the promoter of YUC4 and YUC8 (binding site ACTCTAC). It is involved in auxin homeostasis, negative regulation of gibberellic acid mediated signaling pathway, positive regulation of transcription, stigma development, style development, xylem and phloem pattern formation.

19 TOE1 AT2G28550 It is related to AP2.7 (RAP2.7). It is involved in ethylene-activated signaling pathway, multicellular organismal development, organ morphogenesis and vegetative to reproductive phase transition of meristem.

20 TOE2 AT5G60120 Involved in ethylene-activated signaling pathway and multicellular organismal development.

21 WOX13 AT4G35550 Involved in plant development.

22 WRKY12 AT2G44745 Pith secondary wall formation and a promoter of flowering.

23 WRKY38 AT5G22570 Involved in defense response to bacterium and salicylic acid mediated signaling pathway. Arabidopsis FLOWERING LOCUS D influences systemic-acquired resistance-induced expression and histone modifications of WRKY genes. Dehydration stress. Basal Defense in Arabidopsis: WRKYs Interact with Histone Deacetylase HDA19.

24 YAB1 AT2G45190 Abaxial cell fate specification, cell fate commitment, fruit development, inflorescence meristem growth, meristem structural organization, polarity specification of adaxial/abaxial axis, regulation of flower development, regulation of leaf development, regulation of shoot apical meristem development, specification of floral organ identity, specification of organ position.

25 YAB5 AT2G26580 Polarity specification of adaxial/abaxial axis, regulation of leaf development, regulation of shoot apical meristem development.

26 ZAT14 AT5G03510 Zinc finger.


Table S3.4. PBM motifs found in the AGO-IP libraries: motifs only in AGOa.

# AGOa AGOb Motifs only in AGOa

1 AGO1 AGO2 KAN1, WRKY38, ZAT14

2 AGO1 AGO4 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

3 AGO1 AGO5 DEAR3, MYB55_2, WRKY38, ZAT14

4 AGO1 AGO6 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

5 AGO1 AGO7 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

6 AGO1 AGO9 DEAR3, KAN1, MYB55_2, WRKY38, ZAT14

7 AGO1 AGO10 DEAR3, WRKY38, ZAT14

8 AGO2 AGO4 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

9 AGO2 AGO5 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

10 AGO2 AGO6 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1

11 AGO2 AGO7 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

12 AGO2 AGO9 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, MYB55_2, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

13 AGO2 AGO10 bZIP60_2, DEAR3, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, YAB1

14 AGO4 AGO5 -

15 AGO4 AGO6 -

16 AGO4 AGO7 -

17 AGO4 AGO9 -

18 AGO4 AGO10 -

19 AGO5 AGO6 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

20 AGO5 AGO7 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

21 AGO5 AGO9 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

22 AGO5 AGO10 HSFB2A_2, HSFC1_2, RAP2.6_2

23 AGO6 AGO7 -

24 AGO6 AGO9 -

25 AGO6 AGO10 -

26 AGO7 AGO9 -

27 AGO7 AGO10 -

28 AGO9 AGO10 -


Table S3.5. PBM motifs found in the AGO-IP libraries: motifs only in AGOb.

# AGOa AGOb Motifs only in AGOb

1 AGO1 AGO2 bZIP60_2, DEAR4, ETT_2, KAN4, REM1, RVE1_2, SPL7, STY1_2, TOE1, WOX13, WRKY12, YAB1

2 AGO1 AGO4 -

3 AGO1 AGO5 HSFB2A_2, HSFC1_2, RAP2.6_2, YAB5

4 AGO1 AGO6 -

5 AGO1 AGO7 -

6 AGO1 AGO9 -

7 AGO1 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

8 AGO2 AGO4 -

9 AGO2 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

10 AGO2 AGO6 -

11 AGO2 AGO7 -

12 AGO2 AGO9 -

13 AGO2 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, TOE1_2, TOE2, YAB5, ZAT18

14 AGO4 AGO5 HSFB2A_2, HSFC1_2, KAN1, RAP2.6_2, YAB5

15 AGO4 AGO6 -

16 AGO4 AGO7 -

17 AGO4 AGO9 -

18 AGO4 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

19 AGO5 AGO6 -

20 AGO5 AGO7 -

21 AGO5 AGO9 -

22 AGO5 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, ZAT18

23 AGO6 AGO7 -

24 AGO6 AGO9 -

25 AGO6 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

26 AGO7 AGO9 -

27 AGO7 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18

28 AGO9 AGO10 ARR14_2, ATHB51, DAG2, DEAR3_2, GLK1_2, HSFB2A, KAN1, MYB52, MYB55_2, TOE1_2, TOE2, WRKY12, YAB5, ZAT18
