
MSc Artificial Intelligence

Master Thesis

Selectively Encoding Syntactic Information

for Aspect-Based Sentiment Analysis

by

Xiaoxiao Wen

12320323

August 6, 2020

48 EC December 2019 - July 2020

Supervisor:

Dr. Z. Afzal (Elsevier)

Assessor:

Dr. P. Ren (UvA)


Abstract

In the last two decades, sentiment analysis has become a popular research topic that also has a wide range of applications in industry. However, the plain sentiment analysis task only extracts the overall opinion expressed in a sentence, so much detailed information, such as the opinion targets/aspects, is missing. Recently, the fine-grained version of sentiment analysis, known as Aspect-Based Sentiment Analysis (ABSA), has attracted an increasing amount of attention. ABSA focuses on first identifying the different features/aspects that are present in the given sentence and then classifying their corresponding sentiment polarities.

Since the introduction of the topic, much research has used either traditional machine learning methods or neural networks. With regards to representation learning, different attempts using semantic information, syntactic information, or both have been investigated. For encoding semantic information, researchers have explored different neural network architectures such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and attention-based networks, which proved their value under different circumstances. For encoding syntactic information, aside from the above-mentioned architectures, Recursive Neural Networks (ReNN) have also been used. In the latest approaches, given the tree-structured form of the syntactic information of input sentences, Graph Convolutional Networks (GCN) have become popular for learning syntactic representations for the task. A GCN trivially aggregates the information from local connections among nodes in the syntactic tree. However, existing models using GCN treat the syntactic relations universally and do not distinguish between them, leading to partial exploitation of the syntactic information. Hence, we propose an Attentional-Embedding-based Relational Graph Convolutional Network (AERGCN) model that not only utilizes the syntactic relations but also differentiates among different relation types. We base our model on Relational Graph Convolutional Networks (R-GCN) to model the multi-relational syntax in sentences, and we further complement the model with attention-based feature extractors such as pre-trained BERT models and multi-head attention modules. Furthermore, we also investigate the use of Part-Of-Speech (POS) tag embeddings to enrich the syntactic encodings.

We have evaluated our models on the general ABSA subtasks of the SemEval 2016 dataset for both the restaurant and laptop domains, and our results demonstrate the effectiveness and competitive performance of the proposed AERGCN model compared to recent state-of-the-art baselines. Furthermore, we achieved state-of-the-art performance on SemEval 2016 (task 5, subtask 1, slot 3) for both domains with different model variants.


Acknowledgements

I would like to express my appreciation towards my daily supervisor Zubair Afzal and my examiner Pengjie Ren for their tireless help and supervision throughout the whole master thesis project. Furthermore, I would like to thank Elsevier for the full-time support and also thank my colleagues Lea van den Brink, Yury Kashnitsky, Michael Schniepp, Jenny Truong together with the rest of the RCO-DS team for the helpful discussions and feedback on my work. Moreover, I would like to thank my friends Philip Zhang, You Hu, Robbie Luo, and Gongze Cao not only for their advice and suggestions, but also for the leisure time spent with me. Last but not least, I would like to express my deepest love and gratitude towards my girlfriend Freya and my parents, who have wholeheartedly supported me through these hard times.


Contents

1 Introduction
  1.1 Research Question
2 Datasets
3 Literature Review
  3.1 Aspect-Based Sentiment Analysis
  3.2 Encoding Syntactic Relations
4 Methodology
  4.1 Model Architecture
  4.2 Preprocessing
  4.3 Embeddings
    4.3.1 Pre-trained Semantic Embeddings
    4.3.2 Fine-tuning of the Pre-trained Semantic Embeddings
    4.3.3 POS tag Embeddings
  4.4 Relational Graph Convolutional Networks
  4.5 Multi-head Attention Mechanism
  4.6 Target-aware Feature Extraction
  4.7 Training Objective
5 Experiments
  5.1 Dataset
  5.2 Experimental Settings
  5.3 Hyperparameter Selection
    5.3.1 Learning Rate for Fine-tuning BERT models
    5.3.2 R-GCN Weight Regularization
    5.3.3 Number of R-GCN Layers
    5.3.4 Number of Heads for the Multi-head Attentions
    5.3.5 Score Function for the Multi-head Attentions
  5.4 Model Comparisons
  5.5 Module Analysis
6 Results & Discussion
  6.1 Overall Performance
  6.2 Module Analysis
  6.3 Out-of-domain Evaluation
  6.4 Case Study

Chapter 1

Introduction

In the last two decades, with the rapid development of the internet, online review sites and communities have emerged in various forms, for instance the New York Times' Books page, the Rotten Tomatoes review-aggregation website, IMDb's ratings (Pang, Lee, & Vaithyanathan, 2002), etc. Similarly, e-commerce platforms, where products are reviewed by customers, have attracted a lot of attention. Consequently, sentiment analysis emerged as an essential task. Sentiment analysis, or sentiment classification, is about analyzing the overall opinion towards the entity expressed in the context (B. Liu, 2012). For example, the review sentence “How could anyone sit through this movie?” expresses a negative sentiment towards the referred movie (Pang et al., 2002). By aggregating such reviews and analyzing their corresponding sentiments, the analyses not only help prospective users form an overall understanding of the opinions of preceding users about the entity, but also help models and systems build a comprehension of the language used to express these opinions. Hence, sentiment analysis has been one of the most active research areas in the field of Natural Language Processing (L. Zhang, Wang, & Liu, 2018).

In the last decade, e-commerce has expanded even more drastically and generated a huge amount of digital information in the form of natural language. After purchasing products online, users are able to express their opinions about the products and rate their degree of satisfaction along multiple dimensions. For example, in the product review shown in Figure 1.1a, “The quality is excellent but the price is a little too high.”, the user expresses a positive sentiment towards the quality of the product and a negative sentiment towards its price. In the traditional sentiment analysis problem as defined above, the overall opinion of the review sentence cannot be easily determined, as it contains conflicting opinions towards different features/aspects. In order to describe this type of problem, the fine-grained task of Aspect-Based Sentiment Analysis (ABSA) was proposed in Hu and Liu (2004) and B. Liu (2012). Aspect-Based Sentiment Analysis is the extended task of sentiment analysis where, instead of the overall opinion towards the entity, fine-grained opinions towards each attribute of the entity are identified (B. Liu, 2012). In the aforementioned example, quality and price are the attributes of the entity (the product). Given that ABSA is fine-grained and more detailed than general sentiment analysis, it requires the proposed models and systems to have a deep comprehension of natural language, and it serves as a rather challenging task in the research field (Pontiki et al., 2016; Pontiki, Galanis, Papageorgiou, Manandhar, & Androutsopoulos, 2015; Pontiki et al., 2014). There are two essential subtasks of ABSA: identifying the aspect categories, Aspect Category Classification (ACC), and classifying the corresponding polarities, Aspect-level Sentiment Classification (ASC). Aspect-Term Sentiment Classification (ATSC) is a specific subtask of ASC, where sentiment polarity is analyzed towards the explicit aspect terms.


Figure 1.1: Examples of syntactic dependency parse trees with explicit aspects (a) and implicit aspects (b). Syntactic dependencies are annotated below the arrows and the legend shows different Part-of-Speech tags.

Early research in the field (Brun, Perez, & Roux, 2016; Jihan, Senarath, Tennekoon, Wickramarathne, & Ranathunga, 2017; Kumar, Kohail, Kumar, Ekbal, & Biemann, 2016; Toh & Su, 2016) focused on using manually designed features. Later on, in order to deal with the inefficiency and task agnosticity of hand-crafted features, many researchers, as reviewed in Do, Prasad, Maag, and Alsadoon (2019) and C. Zhang, Li, and Song (2020), employed neural networks for task-specific end-to-end learning of the general ABSA task. While many of these papers only utilized the semantic features of the input sequence, Dong et al. (2014), Socher et al. (2013), and Tai, Socher, and Manning (2015) explored the usage of syntactic information, especially the syntactic relations/dependencies, using recursive neural networks. Recently, with the increasing interest in graph convolutional networks (GCN) (Kipf & Welling, 2016), Xiao et al. (2020) and C. Zhang et al. (2020) used GCNs instead of recursive neural networks to encode the syntactic dependencies for ABSA and achieved competitive results; in both works the authors encoded all the syntactic dependencies of an input sentence in one adjacency matrix. However, “not all relations are created equal”, and the differences among the syntactic relations had not been considered before. For example, as shown in Figure 1.1a, among the four different syntactic relations, “nsubj” (nominal subject) and “acomp” (adjectival complement) are intuitively more indicative for identifying the aspects and classifying their sentiments. Moreover, these studies used GCNs to extract syntactic representations based on the semantic embeddings of the input, but these embeddings did not indicate the syntactic roles of the input tokens. Furthermore, Xiao et al. (2020), C. Zhang et al. (2020) and some other recent studies narrowly focus on the ATSC task. The proposed model architectures do not generally support learning the original ACC or ASC tasks, whereas ATSC is merely a specific subtask of ABSA that only considers explicit aspect categories with opinion target expressions (OTEs). Since implicit aspect categories are common in customer reviews, the proposed model should be able to solve the general ABSA task to cover different scenarios.

In this work, we interchangeably use the terms aspect categories and aspects to represent a unique attribute of the entity (B. Liu, 2012). We further refer to aspect expressions, including explicit and implicit aspect/opinion terms, as opinion target expressions (OTEs). We incorporate ACC and ASC in our work, where we also specify that for ASC both the pre-identified OTEs and their corresponding aspect categories are provided. This is also known as Targeted Aspect-Based Sentiment Analysis (TABSA), as introduced and referred to in some recent papers (Saeidi, Bouchard, Ai, Liakata, & Riedel, n.d.; Sun, Huang, & Qiu, 2019). Furthermore, we also define the task Aspect-Term Sentiment Classification (ATSC) as identifying the sentiment polarities towards the OTEs of the existing aspect categories, which has been a popular task in different studies.

In the following sections, we first propose our research question. Then we give a short introduction to the SemEval 2016 Task 5 challenge and its corresponding datasets that we use. Subsequently, we go through the different studies in the field that are related to our work. Afterwards, we propose our model AERGCN in general terms and describe the different modules that lay the foundation for it. Subsequently, we describe the conducted experiments. Finally, we present the results in terms of overall performance, module analysis, out-of-domain evaluation, and a case study, with detailed discussion and explanations, and conclude our work.

1.1

Research Question

In this project, we mainly focus on the research question: given an annotated dataset of customer reviews from a certain domain, how can we develop an automated system that is able to combine both semantic and syntactic information in order to accurately conduct aspect-based sentiment analysis on a review sentence? We propose the novel Attentional-Embedding-based Relational Graph Convolutional Network (AERGCN) model to address this research question. In order to answer the main research question, we need to address the following sub-questions sequentially:

1. How can we extract and exploit the semantic information from the review sentence, extract expressive syntactic encodings via graph convolutional networks and combine the overall semantic and syntactic information?

2. Using the combined semantic and syntactic information, how can we develop a one-stage classifier that identifies the aspect categories existing in the review sentence and classifies the respective sentiments?

Moreover, to improve on the limitations mentioned above, in the proposed AERGCN model we

• model and train POS tag embeddings and utilize a Relational Graph Convolutional Network (R-GCN) (Schlichtkrull et al., 2018) to fully exploit the syntactic information;

• populate the original dataset with meta labels, and train with the objective of Next Sentence Prediction (NSP) to include target information as the next sentence;

• utilize and fine-tune the pre-trained BERT embeddings to create context-aware semantic embeddings for the input sentence;

• conduct experiments on the general ABSA task, consisting of both ACC and ASC of SemEval 2016 (Pontiki et al., 2016), and provide a combined model to address both tasks simultaneously.


Chapter 2

Datasets

SemEval 2016 Task 5 (Pontiki et al., 2016) is the benchmark dataset from the SemEval 2016 workshop that provides 19 training and 20 testing sets of customer reviews for 8 languages (English, Arabic, Chinese, Dutch, French, Russian, Spanish and Turkish) in 7 domains (restaurants, hotels, laptops, mobile phones, digital cameras, telecommunications and museums). The task consists of three subtasks: sentence-level ABSA, text-level ABSA and out-of-domain ABSA.

Sentence-level ABSA contains review sentences, and the subtask is to identify all the potential opinion tuples in the sentences. The opinion tuples contain three slots: Slot1 - Aspect Category, Slot2 - Opinion Target Expression and Slot3 - Sentiment Polarity. An aspect category is a predefined entity E and attribute A pair E#A towards which an opinion is expressed in the review sentence. The opinion target expression is the explicit mention of the aspect category. The sentiment polarity is the polarity label, positive, negative or neutral, assigned to each identified aspect category. The definitions of aspect category and sentiment polarity align with our definitions as described in chapter 1.

Text-level ABSA contains complete reviews and the subtask is to identify all the potential opinion tuples towards the complete reviews. The opinion tuples are formed in the same way as for the sentence-level ABSA subtask.

Out-of-domain ABSA is about testing the system on unknown domains, i.e., domains for which no training data was available to the system.

In our experiments, we focus on sentence-level ABSA (Slot1 and Slot3, corresponding to ACC and ASC) for the English language in the restaurants domain and the laptops domain.

For the restaurants domain, the training set contains 2000 sentences and the test set contains 676 sentences. In order to conduct hyperparameter tuning, we further split the training set into stratified training and validation sets with an 80:20 ratio. We denote this dataset as EN-REST-2016.

For the laptops domain, the training set contains 2500 sentences and the test set contains 808 sentences. Similarly, we also split the training set into stratified training and validation sets with an 80:20 ratio. We denote this dataset as EN-LAPT-2016.


Chapter 3

Literature Review

3.1

Aspect-Based Sentiment Analysis

As mentioned above, the two main subtasks of ABSA are aspect category classification (ACC) and aspect-level sentiment classification (ASC). The ACC task is a multi-label classification task over the predefined aspect categories for a given review sentence, while the ASC task is a multi-class classification task over the given tuple of (aspect category, OTE) for the review sentence, where the OTE is null if there is only an implicit mention of the corresponding aspect category (Pontiki et al., 2016).

Earlier ABSA research focused on using supervised machine learning algorithms with hand-crafted semantic features. For instance, Toh and Su (2016) used several manual features together with additional features extracted by a convolutional neural network (CNN) and achieved the top performance for the ACC task of both EN-REST-2016 and EN-LAPT-2016; subsequent work by Jihan et al. (2017) used manually designed features and not only outperformed the top result from Toh and Su (2016) but also some other deep learning methods at the time. With regards to the ASC task, Brun et al. (2016) used a Linguistic Feature Factory, including various hand-crafted semantic features fed into ensemble models with feedback to interactively select the most effective features for the final classification. They achieved the top performance for the ASC task of EN-REST-2016. For these models, the design of the hand-crafted features played a crucial role in improving the classification performance. Nonetheless, these hand-crafted features were expensive to design and, thus, the quality of the feature engineering became the bottleneck for these supervised machine learning methods. Moreover, due to the multi-label nature of the tasks, these models often used the one-vs-all classification strategy, which also required a lot of computational resources and time as the number of labels increased.

Later, as reviewed by L. Zhang et al. (2018) and Do et al. (2019), many proposed models used recurrent neural network (RNN) modules, such as LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014), for semantic feature extraction from sequences, in combination with various pre-trained word embeddings such as word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) and GloVe (Pennington, Socher, & Manning, 2014). However, the use of RNNs largely increased the computational time for long inputs, since the sequential computation is difficult to parallelize. Also, the use of these fixed pre-trained embeddings for each token of the input sequence was only weakly context-aware, because the word embedding for a token was extracted from a lookup table regardless of the context of the token.

Recently, the focus has mainly shifted to deep learning methods. Hoang, Bihorac, and Rouces (2019) used the pre-trained BERT word embeddings and utilized the sentence pair classification, Next Sentence Prediction (NSP), training objective from BERT (Devlin, Chang, Lee, & Toutanova, 2019). In that paper, preprocessing was done to convert the aspect categories from labels into semantic texts and to concatenate the converted aspect categories to the review sentences as the “next sentence”. Then, by using the pre-trained BERT embeddings, semantic information from both the review sentence and the aspect categories was associated. Finally, a classification was made using the meta labels unrelated, positive, negative and neutral. Although the method is relatively simple, it achieved the state-of-the-art result for EN-REST-2016.

Another recent work (Movahedi, Ghadery, Faili, & Shakery, 2019) utilized a deep learning method based on the attention mechanism for ACC by attending to different parts of the sentence that reflect or contribute to the topics related to the corresponding aspect categories. This method first encoded the review sentence using a bidirectional Gated Recurrent Unit (Cho et al., 2014), then used a topic-attention layer to attend to the parts of the review sentence relevant to the topics, and finally regressed the probability of each aspect category using two fully-connected layers with the non-linear Squash activation (Sabour, Frosst, & Hinton, 2017). The method achieved the second best state-of-the-art performance for ACC on EN-REST-2016.

While the aforementioned two deep learning methods achieve state-of-the-art performance, there is still room for improvement. As described above, even though the models encoded semantic information through the pre-trained word embeddings, NSP or the attention mechanism, neither of the methods took syntactic information into consideration. Nonetheless, for ACC, the different syntactic relation pairs encode information about the aspects/topics in the text and can be exploited to help discriminate different aspect categories; for ASC, no matter how the sentiment towards an aspect is indicated, utilizing syntactic relations can help to pinpoint the correlation between the explicit/implicit reference of the aspect and the sentiment towards it. For example, as shown in Figure 1.1a, given that the aspects are FOOD#QUALITY and RESTAURANT#PRICES with positive and negative polarity labels respectively, we can first observe that the syntactic relation “nsubj” is a strong indicator of the explicit mentions (“quality” and “price”) of the aspects. Secondly, it can also be observed that the syntactic relation “acomp” associates the explicit mentions with their corresponding polarities. Moreover, long-range word dependencies are difficult to capture simply by using word embeddings or the attention mechanism, while a syntactic dependency inherently connects non-consecutive words and helps mitigate the long-range effect.

In both Q. Zhang, Lu, Wang, Zhu, and Liu (2019) and Xiao et al. (2020), syntactic dependency trees were encoded and used with graph convolutional networks (GCN) to extract useful syntactic information from the text. In Q. Zhang et al. (2019), for the GCN layers, encodings of the syntactic dependency tree as both directed and undirected graphs were tested, encoded as an adjacency matrix whose entries indicate whether a connection exists between two words in the sentence. The syntactic features were then processed by an aspect-specific masking layer that masks out the features of the non-aspect words. Finally, attention scores were computed between the semantic features, extracted by a bidirectional LSTM from the sequence of GloVe word embeddings (Pennington et al., 2014), and the syntactic features. In Xiao et al. (2020), on the other hand, only undirected graphs were used for encoding the syntactic dependency trees, in the same fashion as in Q. Zhang et al. (2019), but with an additional point-wise convolution in each layer. The extracted syntactic features were then processed by a multi-head self-attention layer, and a multi-head interactive attention layer combined the semantic and syntactic information by computing the attention scores of the syntactic encoding with the semantic encoding of both the review sentence and the OTE. The semantic encodings were obtained by first applying a bidirectional LSTM to the pre-trained BERT embeddings of the sentence and then computing the multi-head self-attention scores.

In Q. Zhang et al. (2019), GCNs were first used on the syntactic features encoded as graphs, but the potential of the syntactic dependencies was exploited more extensively in Xiao et al. (2020) with the improved GCN model. Simultaneously, the semantic encoding was improved by using the powerful pre-trained BERT embeddings. However, Xiao et al. (2020) increased the complexity of the model drastically with additional heavily parametrized modules, which is critical for inference speed and computational resources. Furthermore, neither of the two models distinguished among the different types of syntactic relations in the adjacency matrix, which ignores the importance of different types of syntactic relations for the tasks. In the example in Figure 1.1a, it can be observed that the syntactic relations “nsubj” and “acomp” play more important roles for the ACC and ASC tasks.

Our proposed model takes advantage of the pre-trained BERT embeddings instead of using RNN models to extract semantic features of the input sequence, as the transformer model is easy to parallelize and the embeddings are context-aware. Furthermore, we also utilize the NSP training objective by creating meta labels in a similar fashion as in Hoang et al. (2019) in order to extract target-aware semantic embeddings.

3.2

Encoding Syntactic Relations

In order to encode the syntactic dependency trees, it is intuitive to use the graph structure, where GCNs (Kipf & Welling, 2016) can be used to find higher-order syntactic correlations among the nodes in the graph (tokens in the sentence) through multiple layers. As mentioned in section 3.1, in Q. Zhang et al. (2019) and Xiao et al. (2020), the syntactic information was encoded using an adjacency matrix indicating whether or not two words in a sentence are connected by any syntactic relation. Similarly, in Huang and Carley (2020), the review sentence is transformed into a dependency graph, and an adjacency matrix is used to represent the syntactic relations between any pair of words. However, by mixing all the syntactic relations into only one adjacency matrix, the learned graph embeddings become universal for each word in the sentence: they can only indicate that words are connected by some syntactic relation, while the different syntactic relations are not distinguished and the importance of certain syntactic relations specific to the task is not learned. For instance, in the example in Figure 1.1b, where the aspects are implicitly mentioned, the “nsubj” and “acomp” relations that appear in the explicit example in Figure 1.1a do not exist, and the model should be able to learn and attach importance to the syntactic relations relevant to the ACC and ASC tasks in an end-to-end fashion. In our proposed work, we differentiate and learn the importance and nuances of different syntactic relations by encoding the syntactic dependency tree of the sentence into an adjacency tensor, which contains a separate adjacency matrix for each syntactic relation.


Chapter 4

Methodology

4.1

Model Architecture

The overall architecture of AERGCN is shown in Figure 4.1. The model takes an input tuple (s, a), consisting of a review sentence s and a potential aspect category a, and makes an inference $\hat{t}$ that predicts the target label t indicating the relation for (s, a): related/unrelated for ACC and positive/negative/neutral for ASC.

The model consists of two parallel branches, the semantic branch and the syntactic branch, where the input (s, a) is encoded into the semantic embeddings $e^{sem} = \{e^{sem}_1, e^{sem}_2, ..., e^{sem}_n\}$ and the syntactic embeddings $e^{syn} = \{e^{syn}_1, e^{syn}_2, ..., e^{syn}_m\}$ respectively by the embedding layers. For the semantic branch, we first use a linear layer to reduce the dimensionality of $e^{sem}$ and then feed the reduced embedding, denoted as $\tilde{e}^{sem}$, into a multi-head self-attention layer to generate the higher-order representation $r^{sem}$. For the syntactic branch, similarly, we use the same dimensionality reduction layer to acquire the reduced embeddings $\tilde{e}^{syn}$ and then feed them into the R-GCN layers, which process the syntactic information to obtain the graph embeddings $\hat{e}^{syn}$. After that, the higher-order representation $r^{syn}$ is obtained by using a multi-head self-attention layer. Then the higher-order semantic representation $r^{sem}$ is correlated with the higher-order syntactic representation $r^{syn}$ through a multi-head interactive attention layer to generate $r^{fuse}$. Finally, both $r^{syn}$ and $r^{fuse}$ are aggregated using average pooling and concatenated to form a final representation vector $r^{fin}$, which is projected onto the label space and activated with a softmax layer.
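The sketch below summarizes this forward pass. It is a minimal illustration under our own naming, not the actual implementation: PyTorch's built-in multi-head attention stands in for the MHSA/MHIA layers of section 4.5 (the thesis uses 6 heads with its own MHA formulation), and a single simplified R-GCN hop replaces the full R-GCN stack of section 4.4.

```python
import torch
import torch.nn as nn


class AERGCNSketch(nn.Module):
    """Minimal sketch of the AERGCN forward pass; module internals are simplified."""

    def __init__(self, d_embed=768, d_hidden=512, n_relations=20, n_labels=2, n_heads=8):
        super().__init__()
        self.reduce = nn.Linear(d_embed, d_hidden)        # shared dimensionality reduction
        self.sem_attn = nn.MultiheadAttention(d_hidden, n_heads, batch_first=True)
        self.syn_attn = nn.MultiheadAttention(d_hidden, n_heads, batch_first=True)
        self.fuse_attn = nn.MultiheadAttention(d_hidden, n_heads, batch_first=True)
        # one linear transform per relation type plus a learned relation coefficient c_k
        self.rel_linear = nn.ModuleList(nn.Linear(d_hidden, d_hidden) for _ in range(n_relations))
        self.rel_coeff = nn.Parameter(torch.ones(n_relations))
        self.classifier = nn.Linear(2 * d_hidden, n_labels)

    def rgcn_hop(self, h, adj):
        # adj: (K, n, n) adjacency tensor, one adjacency matrix per syntactic relation type
        out = torch.zeros_like(h)
        for k, linear in enumerate(self.rel_linear):
            card = adj[k].sum(dim=-1, keepdim=True).clamp(min=1)   # cardinality C_i
            out = out + self.rel_coeff[k] * (adj[k] @ linear(h)) / card
        return torch.relu(out)

    def forward(self, e_sem, e_syn, adj):
        # e_sem: (1, n, d_embed) BERT embeddings, e_syn: (1, m, d_embed) POS tag embeddings
        x_sem = self.reduce(e_sem)
        r_sem, _ = self.sem_attn(x_sem, x_sem, x_sem)             # semantic MHSA -> r_sem
        h_syn = self.rgcn_hop(self.reduce(e_syn), adj)            # R-GCN graph embeddings
        r_syn, _ = self.syn_attn(h_syn, h_syn, h_syn)             # syntactic MHSA -> r_syn
        r_fuse, _ = self.fuse_attn(r_sem, r_syn, r_syn)           # MHIA across the two branches
        r_fin = torch.cat([r_syn.mean(dim=1), r_fuse.mean(dim=1)], dim=-1)
        return self.classifier(r_fin).softmax(dim=-1)             # probabilities over labels


model = AERGCNSketch()
probs = model(torch.randn(1, 12, 768), torch.randn(1, 10, 768), torch.zeros(20, 10, 10))
```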


4.2

Preprocessing

As in Hoang et al. (2019), since we utilize the NSP training objective, we first concatenate s with a to form a sentence pair, where we format a by lowercasing it and substituting the hashtag with a comma to make it sentence-like. Then we create a meta label t to indicate the relation between s and a. Taking the example shown in Figure 1.1a, for the ACC task we generate two data points, “Quality is excellent but price is expensive. food, quality” and “Quality is excellent but price is expensive. restaurant, prices”, for the aspect categories FOOD#QUALITY and RESTAURANT#PRICES with the label related; for the rest of the aspect categories we label the generated data points as unrelated. For ASC, we generate the same two data points but with the labels positive and negative respectively, and we only have related samples by the nature of the subtask. For the combined model, aside from the samples with the different sentiment polarities, we also populate the dataset in the same way as for ACC with numerous unrelated data samples, where we consider the samples with any of the sentiment polarities as related. During loss optimization, we assign each class a weight inversely proportional to its frequency to lower the influence of the overwhelming number of unrelated data samples and to compensate for the class imbalance in the original dataset. A sketch of this data point generation is shown below.
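The following is a minimal sketch of the preprocessing just described; the function names and the shortened aspect inventory are illustrative only.

```python
def make_acc_examples(sentence, gold_aspects, all_aspects):
    """Build NSP-style sentence pairs for ACC: one pair per predefined aspect category."""
    examples = []
    for aspect in all_aspects:
        # "FOOD#QUALITY" -> "food, quality": lowercase and replace the hashtag with a comma
        aspect_text = aspect.lower().replace("#", ", ")
        label = "related" if aspect in gold_aspects else "unrelated"
        examples.append((sentence, aspect_text, label))
    return examples


def make_asc_examples(sentence, gold_polarities):
    """Build sentence pairs for ASC: only the annotated (related) aspects are kept."""
    return [(sentence, aspect.lower().replace("#", ", "), polarity)
            for aspect, polarity in gold_polarities]


# Example from Figure 1.1a (aspect inventory shortened for illustration):
sent = "Quality is excellent but price is expensive."
acc = make_acc_examples(sent, {"FOOD#QUALITY", "RESTAURANT#PRICES"},
                        ["FOOD#QUALITY", "RESTAURANT#PRICES", "SERVICE#GENERAL"])
asc = make_asc_examples(sent, [("FOOD#QUALITY", "positive"), ("RESTAURANT#PRICES", "negative")])
```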

For the syntactic relations, the input sentence is preprocessed offline with the dependency parser from spaCy (Honnibal & Montani, 2017) to encode the adjacency matrices for the selected syntactic relations in the sentence. In this work, to precisely encode the syntactic relations, we strictly follow the direction of the heads pointing to their corresponding dependents/children and, therefore, create a directed acyclic graph (DAG) for each review sentence. Subsequently, we convert the DAG into an adjacency tensor containing one adjacency matrix for each selected syntactic relation. For a specific syntactic relation type k, the adjacency matrix $A^k \in \{0, 1\}^{N \times N}$ is constructed, where N is the length of the review sentence s tokenized according to spaCy. In the adjacency matrix $A^k$ for relation type k, an entry $A^k_{ij}$ is set to 1 if the token at position i in the sentence is the dependency head pointing to the dependent/child token at position j. The adjacency matrices $A^k$ for all relation types form the adjacency tensor A.
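The sketch below shows one way to build such an adjacency tensor with spaCy; the relation inventory is a placeholder for the 20 most frequent relation types that are actually selected (see section 5.2).

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

# Placeholder relation inventory; the thesis selects the 20 most frequent relation types.
RELATIONS = ["nsubj", "acomp", "amod", "dobj", "conj"]
REL2IDX = {rel: k for k, rel in enumerate(RELATIONS)}


def build_adjacency_tensor(sentence):
    """Return A of shape (K, N, N) with A[k, i, j] = 1 if token i heads token j under relation k."""
    doc = nlp(sentence)
    n = len(doc)
    adj = np.zeros((len(RELATIONS), n, n), dtype=np.float32)
    for token in doc:
        k = REL2IDX.get(token.dep_)
        if k is not None and token.head.i != token.i:   # the root points to itself; skip it
            adj[k, token.head.i, token.i] = 1.0          # directed: head -> dependent
    return adj


A = build_adjacency_tensor("Quality is excellent but price is expensive.")
```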

4.3

Embeddings

4.3.1

Pre-trained Semantic Embeddings

As shown in Hoang et al. (2019) and Xiao et al. (2020), the use of pre-trained BERT embeddings gives a significant improvement in performance at the cost of model complexity. As an optimization of BERT, RoBERTa (Y. Liu et al., 2019) improves the performance compared to BERT without introducing additional model complexity. On the other hand, DistilBERT (Sanh, Debut, Chaumond, & Wolf, 2019) improves on the efficiency of BERT by utilizing knowledge distillation and trades off model performance against model complexity. The authors also applied knowledge distillation in a similar manner to RoBERTa, creating a distilled version of RoBERTa, DistilRoBERTa, which serves as a powerful counterpart to DistilBERT.

Model            #params
BERT             110M
DistilBERT       66M
RoBERTa          125M
DistilRoBERTa    82M

Table 4.1: Size of the BERT models.

In our research, we investigated four different variants of the pre-trained semantic embeddings, as shown in Table 4.1, where the base-size model is used for all of these pre-trained semantic embeddings. For each embedding model, we used its corresponding tokenization method. These tokenization methods introduce special tokens to indicate the beginning, the separation and the end of the sentences, which we also take into account in our semantic branch. The embeddings for each token constitute the semantic embeddings $e^{sem}$ for the semantic branch of the model as described above. We implemented the BERT models using Huggingface's Transformers (Wolf et al., 2019).

4.3.2

Fine-tuning of the Pre-trained Semantic Embeddings

In Devlin et al. (2019), the corpora used for training BERT were BooksCorpus (Zhu et al., 2015) and English Wikipedia (16GB). For RoBERTa, the corpora were extended with CC-News (Nagel, 2016) (76GB), OpenWebText (Gokaslan & Cohen, 2019) (38GB) and Stories (Trinh & Le, 2018) (31GB). We can observe that for training BERT, the corpora used are entirely structured and formal English text. For RoBERTa, except for OpenWebText, the rest of the corpora (∼76.4%) are also of a similarly structured form. However, customer reviews are generally casual and informal English text. Hence, it is intuitive to fine-tune the pre-trained word embeddings to better capture the flavor of the dataset.

In order to fine-tune the pre-trained semantic embeddings, we used the annotated datasets described in chapter 2 and trained the model with the NSP training objective in a supervised end-to-end fashion, as done in Hoang et al. (2019). We updated the weights of the embedding layers with a different learning rate than the weights of the rest of the model. With this supervised objective, using the main task of ACC or ASC as the final goal directly guides the weight updates to be more task-specific and to learn more relevant representations for the inputs.
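A sketch of this two-rate setup is shown below. The thesis uses a separate Adam optimizer for the pre-trained weights; for brevity, the sketch expresses the same idea with two parameter groups in a single optimizer, and the stand-in head module is illustrative.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("roberta-base")   # pre-trained embedding layers
head = nn.Linear(768, 2)                              # stand-in for the task-specific modules

# Two parameter groups give the two-rate behaviour; the concrete rates (1e-5 / 2e-5 for
# the embeddings depending on the BERT variant, 1e-4 for the rest) are selected in chapter 5.
optimizer = torch.optim.Adam(
    [
        {"params": encoder.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-3,   # L2 regularization
)

# Fine-tuning is only enabled for part of the training; afterwards the encoder is frozen.
for p in encoder.parameters():
    p.requires_grad_(False)
```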

4.3.3

POS tag Embeddings

We used learned POS tag embeddings as the syntactic embeddings $e^{syn}$ to correlate with the syntactic relations. Because a POS tag indicates the syntactic role of the token in the sentence, it is natural to align it with the syntactic relations in order to completely encode the syntactic information of the sentence.

First, the most likely POS tag is predicted for each token in the input using the out-of-the-box POS tagger from spaCy (Honnibal & Montani, 2017), and the tags are filtered by a predefined set of irrelevant tags. Then, the embedding vector for the respective POS tag is fed into the R-GCN and correlated with the syntactic relations present in the sentence. The embeddings are learned in an end-to-end fashion during training for each of the included POS tags.
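A minimal sketch of this POS tag embedding lookup is given below; the kept tag set shown here is an assumption, as the exact filter list is not reproduced in this section.

```python
import spacy
import torch
import torch.nn as nn

nlp = spacy.load("en_core_web_sm")

# Coarse POS tags kept for the syntactic branch; the exact filter set is an assumption.
KEPT_TAGS = ["NOUN", "PROPN", "PRON", "VERB", "AUX", "ADJ", "ADV", "ADP", "DET", "CCONJ"]
TAG2IDX = {tag: i for i, tag in enumerate(KEPT_TAGS)}
FILTERED = len(KEPT_TAGS)                      # shared index for all filtered-out tags

pos_embedding = nn.Embedding(len(KEPT_TAGS) + 1, 768, padding_idx=FILTERED)


def pos_indices(sentence):
    """Map each token to the index of its predicted coarse POS tag (filtered tags collapse)."""
    return [TAG2IDX.get(tok.pos_, FILTERED) for tok in nlp(sentence)]


e_syn = pos_embedding(torch.tensor([pos_indices("Quality is excellent but price is expensive.")]))
```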

4.4

Relational Graph Convolutional Networks

R-GCN (Schlichtkrull et al., 2018) is a generalized and extended version of GCN used to model relational data. In the directed graph denoted as $G = (V, E, R)$, a node $v_i \in V$ is connected to a node $v_j$ via the edge (relation) $(v_i, r, v_j) \in E$ under relation $r \in R$. In each R-GCN layer, for a node $v_i$, the graph embeddings $h_j$ from the previous layer of its related nodes $v_j$ are accumulated with a weighted normalized sum. In addition to these accumulated graph embeddings, the graph embedding $h_i$ of node $v_i$ itself is also added with a learned weight (self connection).

As discussed in section 3.1, although the previous studies Q. Zhang et al. (2019) and Xiao et al. (2020) utilized GCNs to extract syntactic information from syntactic dependency trees encoded as graphs, the potential of the syntactic relations was not fully exploited. Both of these studies used only one adjacency matrix to embed the syntactic dependency tree without specifying the exact relation between the words.

In order to differentiate the syntactic relation types and learn different weights for them, we integrate R-GCN layers into our model. We construct multiple adjacency matrices $A^k$ for an input sequence, consisting of directed edges $A^k_{ij}$ from token i to token j based on the relation type k, out of K types. We first compute the normalized sum of the propagated graph embeddings under the relational graph for each node and each relation type as

$$h^l_{ik} = \sum_{j=1}^{N} A^k_{ij} W^l_k h^{l-1}_j / C_i,$$

where l is the layer index of the R-GCN, $h^l_{ik}$ is the graph embedding of node i in layer l for relation type k, $h^{l-1}_j$ is the graph embedding of node j in layer l−1, $W^l_k$ is the linear transformation weight matrix shared within the layer, and $C_i$ is the cardinality of node i in the relational graph. This procedure is illustrated in Figure 4.2.

After computing the graph embeddings $h^l_{ik}$ of layer l for each node i with regard to each relation type k, we differentiate among the relation types by aggregating the graph embeddings of all relation types into one as

$$h^l_i = \mathrm{ReLU}\!\left(\sum_{k=1}^{K} c_k\, h^l_{ik}\right),$$

where $c_k$ is the learned importance coefficient for relation type k.

As introduced in Schlichtkrull et al. (2018), we also apply two weight decomposition methods, basis decomposition and block-diagonal decomposition, to regularize the weights in order to prevent overfitting the model on sparse multi-relational data. A minimal sketch of one R-GCN layer with basis decomposition is given below.
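The sketch assumes the hyperparameters chosen later (20 relation types, 16 bases); the parameter names are ours. Each relation weight is composed from B shared bases, $W_k = \sum_b a_{kb} V_b$.

```python
import torch
import torch.nn as nn


class RGCNLayer(nn.Module):
    """One R-GCN layer with basis decomposition, following the equations above."""

    def __init__(self, d_in, d_out, n_relations=20, n_bases=16):
        super().__init__()
        self.bases = nn.Parameter(torch.empty(n_bases, d_in, d_out))     # V_b
        self.comb = nn.Parameter(torch.empty(n_relations, n_bases))      # a_kb
        self.rel_coeff = nn.Parameter(torch.ones(n_relations))           # c_k
        self.self_loop = nn.Linear(d_in, d_out, bias=False)              # self connection
        nn.init.xavier_uniform_(self.bases)
        nn.init.xavier_uniform_(self.comb)

    def forward(self, h, adj):
        # h: (n, d_in) node embeddings, adj: (K, n, n) adjacency tensor
        weights = torch.einsum("kb,bio->kio", self.comb, self.bases)     # W_k per relation
        out = self.self_loop(h)
        for k in range(adj.size(0)):
            card = adj[k].sum(dim=-1, keepdim=True).clamp(min=1)         # cardinality C_i
            out = out + self.rel_coeff[k] * (adj[k] @ (h @ weights[k])) / card
        return torch.relu(out)


layer = RGCNLayer(d_in=512, d_out=512)
h_next = layer(torch.randn(8, 512), torch.zeros(20, 8, 8))
```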


Figure 4.2: Example of an R-GCN layer. The normalized addition implicitly includes the division by the cardinality.

4.5

Multi-head Attention Mechanism

Inspired by Xiao et al. (2020), both multi-head self-attention (MHSA) and multi-head interactive attention (MHIA) are used to obtain higher-level embeddings. Both modules are based on the multi-head attention (MHA) mechanism.

As defined in Song, Wang, Jiang, Liu, and Rao (2019), we first generate the query vectors $q = \{q_1, q_2, ..., q_n\}$ and the key vectors $k = \{k_1, k_2, ..., k_n\}$ for the input embeddings $e = \{e_1, e_2, ..., e_n\}$ using two learned weight matrices $W^Q$ and $W^K$ respectively. Then, different from the definition of the value vectors in Vaswani et al. (2017), the value vectors are defined as $v = k$, which leads to the definition of the attention

$$\mathrm{Attention}(q, k) = \mathrm{softmax}(f_s(q, k))\, k,$$

where $f_s(\cdot)$ is the score function measuring the similarity between the sets of vectors q and k. We use two variants: the additive score (Bahdanau, Cho, & Bengio, 2015) and the scaled dot-product score (Vaswani et al., 2017).

Additive Score The additive score function uses a feed-forward network (or multi-layer perceptron) with a learned weight matrix $W^S$ to compute the similarity based on the concatenation of the two vectors $q_i$ and $k_j$ as

$$f_s(q_i, k_j) = \mathrm{Tanh}([q_i; k_j] \cdot W^S),$$

where ; indicates the concatenation of the two vectors.

Scaled Dot-product Score The scaled dot-product score computes the scaled dot product between the two vectors $q_i$ and $k_j$ to measure the similarity as

$$f_s(q_i, k_j) = \frac{q_i \cdot k_j^T}{\sqrt{d_{\mathrm{hidden}}}},$$

where $d_{\mathrm{hidden}}$ is the hidden dimensionality.

Finally, we define the multi-head attention (MHA) as

$$\mathrm{MHA}(q, k) = [\mathrm{Attention}_1(q, k); \mathrm{Attention}_2(q, k); ...; \mathrm{Attention}_{n_{head}}(q, k)] \cdot W^{\text{multi-head}},$$

where we have $n_{head}$ separate attention modules for q and k to focus on different topics in parallel and a learned weight matrix $W^{\text{multi-head}}$ to aggregate them.

For MHSA, we obtain q and k from the same sequence of features; for MHIA, we obtain q and k from the two sequences of features respectively.
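The following sketch illustrates this MHA variant with the two score functions; the head layout and parameter names are ours and simplified with respect to the actual implementation.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Sketch of the MHA variant above, in which the value vectors are tied to the keys (v = k)."""

    def __init__(self, d_hidden=512, n_heads=6, score="dot"):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.ModuleDict({
                "wq": nn.Linear(d_hidden, d_hidden, bias=False),   # W^Q
                "wk": nn.Linear(d_hidden, d_hidden, bias=False),   # W^K
                "ws": nn.Linear(2 * d_hidden, 1, bias=False),      # W^S for the additive score
            })
            for _ in range(n_heads)
        )
        self.w_out = nn.Linear(n_heads * d_hidden, d_hidden, bias=False)  # W^multi-head
        self.score = score
        self.d_hidden = d_hidden

    def attend(self, head, x_q, x_k):
        q, k = head["wq"](x_q), head["wk"](x_k)
        if self.score == "dot":
            scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_hidden)
        else:  # additive: Tanh([q_i; k_j] W^S) for every query/key pair
            pairs = torch.cat(torch.broadcast_tensors(q.unsqueeze(-2), k.unsqueeze(-3)), dim=-1)
            scores = torch.tanh(head["ws"](pairs)).squeeze(-1)
        return scores.softmax(dim=-1) @ k        # the keys double as the value vectors

    def forward(self, x_q, x_k):
        # MHSA: x_q is x_k; MHIA: x_q and x_k come from the two different branches.
        return self.w_out(torch.cat([self.attend(h, x_q, x_k) for h in self.heads], dim=-1))


mhsa = MultiHeadAttention(score="additive")
out = mhsa(torch.randn(2, 10, 512), torch.randn(2, 10, 512))
```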

4.6

Target-aware Feature Extraction

Similar in spirit to the sentence pair classification method used in Hoang et al. (2019) and the target feature extraction in Xiao et al. (2020), where the target aspect categories are encoded into sentences, our model is also designed to be target-aware, such that semantic correlation can be established between the review sentence and the target aspect category as introduced in section 4.2.

4.7

Training Objective

As described in section 4.1, we project $r^{fin}$ onto the label space through a linear layer to obtain the logits vector l. We then feed the logits into the softmax layer to compute the probability of each label and compute the cross-entropy loss. We can formulate the training objective L as

$$L = \mathrm{CrossEnt}(\hat{y}, y) + \lambda \sum_{\theta \in \Theta} \|\theta\|^2 = -\sum_{c \in C} w_c\, y_c \log \hat{y}_c + \lambda \sum_{\theta \in \Theta} \|\theta\|^2,$$

where $\hat{y} = \mathrm{Softmax}(l)$, y is the one-hot encoded vector of the target label t, C is the set of labels, $w_c$ is the weight assigned to class c to alleviate label imbalance, $\Theta$ is the set of trainable parameters of the model, and $\lambda$ is the L2 regularization parameter.
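A sketch of this weighted objective is shown below; the class counts are example values taken from Table 5.1, and the explicit L2 term can equivalently be realized through the optimizer's weight decay.

```python
import torch
import torch.nn as nn

# One common way to set class weights inversely proportional to the label frequencies;
# the counts below are example values (unrelated vs. related for ACC on EN-REST-16 train).
counts = torch.tensor([16785.0, 1976.0])
class_weights = counts.sum() / (len(counts) * counts)

criterion = nn.CrossEntropyLoss(weight=class_weights)   # softmax + weighted negative log-likelihood


def training_loss(logits, targets, model, l2_lambda=1e-3):
    """Weighted cross-entropy plus an explicit L2 penalty over all trainable parameters."""
    l2 = sum(p.pow(2).sum() for p in model.parameters() if p.requires_grad)
    return criterion(logits, targets) + l2_lambda * l2
```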


Chapter 5

Experiments

5.1

Dataset

As mentioned in chapter 2, we used the SemEval 2016 Task 5 sentence-level dataset (Pontiki et al., 2016) for evaluation. More specifically, we used the EN-REST-16 and EN-LAPT-16 datasets, which contain customer reviews in English for the restaurant domain and the laptop domain respectively. The detailed distribution of the dataset is shown in Table 5.1, where we exclude all the unannotated out-of-scope data samples from the original dataset. As introduced in section 4.2, for ACC we populated the dataset with unrelated samples. There are 12 predefined aspect categories in EN-REST-16 and 88 in EN-LAPT-16, which explains the drastic increase in the size of the dataset after population. As discussed, the class imbalance introduced both by the unrelated samples and by the original labels was addressed by assigning each class a weight inversely proportional to its frequency.

Split   EN-REST-16                          EN-LAPT-16
        ACC             ASC                 ACC             ASC
        unrel    rel    pos    neg   neu    unrel    rel    pos    neg   neu
train   16785    1976   1297   597   82     158741   2214   1189   883   142
val     4201     531    360    152   19     40841    695    448    201   46
test    6961     859    611    204   44     60361    801    481    274   46

Table 5.1: Distribution of the SemEval 2016 Task 5 SB1 dataset after population. Abbreviations: unrel - unrelated, rel - related, pos - positive, neg - negative, neu - neutral.

5.2

Experimental Settings

In this research, we used the original embedding size $d_e$ of 768 for the pre-trained BERT embeddings as well as for the POS tag embeddings. We then defined the hidden size $d_h = 512$, which is the reduced size of the embeddings and of the final representations. We used the Glorot initializer (Glorot & Bengio, 2010) and the Adam optimizer (Kingma & Ba, 2015) with a fixed learning rate of 0.0001 and L2 regularization of 0.001 to initialize and update the weights of the model. For fine-tuning the BERT models, a separate Adam optimizer was used with L2 regularization of 0.001. Fine-tuning was only enabled for the first 1/3 of the training steps, and we used different learning rates depending on the BERT model: 1 × 10−5 for DistilBERT and RoBERTa and 2 × 10−5 for BERT and DistilRoBERTa.

We set the batch size to 16. Furthermore, we selected the 20 most frequent (filtered) syntactic relation types for encoding. We fixed the number of heads of the multi-head attention layers to 6. For the syntactic branch, we stacked 2 R-GCN layers as the feature extractor. We used basis decomposition for weight regularization with the number of bases set to 16.

We trained for 15 epochs for ACC and 50 epochs for ASC on the training split of both EN-REST-16 and EN-LAPT-16. We used the micro-averaged F1 score for ACC and accuracy for ASC to evaluate the performance of the model. We used the validation split for tuning the hyperparameters and conducted only one final experiment on the test split to compare the genuine model performance and prevent overfitting on the validation set. Our hardware setup was a Linux machine with an NVIDIA Tesla M60 GPU.


5.3

Hyperparameter Selection

In general, we first narrowed down the range of hyperparameter choices through qualitative analysis. Then, we conducted grid search experiments to fix the remaining hyperparameters for each setup. We used the ACC task as a proxy and conducted the experiments on EN-REST-16. We applied the same settings to the EN-LAPT-16 experiments, assuming the hyperparameters perform similarly regardless of the domain.

5.3.1

Learning Rate for Fine-tuning BERT models

Due to their different designs, the BERT models can show different sensitivity to the learning rate. We investigated the learning rates for fine-tuning by training for only 5 epochs and comparing the initial performance of the trained models after these 5 epochs to decide on the learning rate for each BERT model.

As can be seen in Table 5.2, the appropriate learning rates are within the lower range [1 × 10−5, 2 × 10−5], indicating that the pre-trained embeddings only need subtle fine-tuning with a supervised end-to-end goal. In the further experiments that fine-tune the pre-trained embeddings, we used the learning rates selected in this section for each model.

Model                lr = 1 × 10−5   2 × 10−5   3 × 10−5   4 × 10−5   5 × 10−5
BERT-uncased         0.7267          0.7311     0.7238     0.7152     0.6441
DistilBERT-uncased   0.7218          0.7095     0.7154     0.6726     0.6967
BERT-cased           0.7313          0.7362     0.7067     0.7168     0.6592
DistilBERT-cased     0.7094          0.7071     0.6961     0.7054     0.6807
RoBERTa              0.7424          0.7265     0.1115     0.5563     0.1095
DistilRoBERTa        0.7194          0.7469     0.7238     0.7177     0.711

Table 5.2: F1-score performance of the BERT models on the ACC task with different fine-tuning learning rates. The best performance for each model is indicated in bold.

5.3.2

R-GCN Weight Regularization

As mentioned above, we have two weight regularization methods for the R-GCN layers: basis decomposition and block-diagonal decomposition. As discussed in Schlichtkrull et al. (2018), block-diagonal decomposition encodes the intuition that node features can be grouped into sets of variables that are more tightly coupled within groups than across groups. However, the syntactic relations in our dataset are relatively sparse, so the model could suffer from the additional sparsity introduced by block-diagonal decomposition and miss crucial information. Therefore, we selected basis decomposition for weight regularization. Furthermore, we set up a grid search experiment to find an appropriate number of bases. Similarly to selecting the learning rates, we conducted experiments for the ACC task and compared the performance for a set of values to fix the number of bases. We used the uncased BERT as the pre-trained embeddings and fixed the other hyperparameters for a fair evaluation.

As shown in Figure 5.1, we observe the best performance when the number of bases is set to 16.

5.3.3

Number of R-GCN Layers

When using L R-GCN layers, the R-GCN module can encode information of nodes within L “hops”, i.e., syntactic relation edges. For instance, in the example in Figure 1.1a, the first “is” token in the sentence is two “hops” away from the “expensive” token through the syntactic relation edges “conj” (conjunct) and “acomp” respectively. By including more “hops” in the representation learning, we can correlate the tokens with their Lth-order neighborhoods, making use of the relations in the graph.

To select an appropriate number of R-GCN layers, we computed statistics on the training splits of both EN-REST-16 and EN-LAPT-16 to find the distribution of Lth-order syntactic relations. As shown in Figure 5.2 and Figure 5.3, 97% of the syntactic relations are within the 5th order; therefore, we evaluate the performance for the number of R-GCN layers in the range [1, 5].

With the other hyperparameters fixed, we conducted experiments with different numbers of R-GCN layers to compare the performance. As shown in Figure 5.4, the best and second best performances occur for 4 and 2 layers of R-GCN respectively. The difference between the best performance at 4 layers and the second best performance at 2 layers is relatively small, but the R-GCN model complexity doubles, adding approximately 8.4M weight parameters. Therefore, we selected 2 R-GCN layers, reducing model complexity at a small cost in performance.

Figure 5.1: Line plot of the F1-score performance of BERT (uncased) using different numbers of bases.

Figure 5.2: Bar plot of the distribution of Lth-order syntactic relations in EN-REST-16 and EN-LAPT-16.

5.3.4

Number of Heads for the Multi-head Attentions

The number of heads of the multi-head attention layers defines the number of latent/abstract “topics” that are considered while attending to the tokens in the sentence. As when selecting the number of bases for R-GCN weight regularization, we conducted experiments for the ACC task and compared the performance for a set of values. We fixed the BERT model to the uncased BERT and fixed the other hyperparameters for a fair evaluation.

As shown in Figure 5.5, we observe the best performance when the number of heads is set to 6.

5.3.5

Score Function for the Multi-head Attentions

As described in section 4.5, we implemented two different score functions for the multi-head attention layers (MHAs): the additive score and the scaled dot-product score. Both score functions measure the alignment or relevance between the tokens in the sentence, but they differ in whether the similarity metric is learned. For the additive score, the metric is learned as a weight matrix; for the scaled dot-product score, the dot product itself measures the relevance. We conducted experiments for the ACC task and measured the F1-score performance using the two score functions. We evaluated the performance with the different BERT models as the pre-trained embeddings and fixed the other hyperparameters for a fair evaluation.

From Table 5.3, we can see that using the additive score attains the better performance for most of the BERT models.


Figure 5.3: Percentage chart of the Lth-order syntactic relations in EN-REST-16 and EN-LAPT-16.

Figure 5.4: Line plot of the F1-score performance of BERT (uncased) using different numbers of R-GCN layers.

5.4

Model Comparisons

We first implemented several variants of the AERGCN model that differ in the type of BERT model used. For each BERT variant, we trained three models: two models separately learning ACC and ASC, and one combined model for both ACC and ASC, denoted COM. For the evaluation of the combined models on the ACC task, we map positive/negative/neutral to related.

Then, we selected several baseline models to compare the performance with the two selected AERGCN model variants for both ACC and ASC. The baseline models are listed below:

• NLANGP (Toh & Su, 2016) achieved the best performance on the EN-REST-16 and EN-LAPT-16 datasets for ACC in SemEval 2016.

• SVM (Jihan et al., 2017) uses an SVM classifier with hand-crafted features and outperformed NLANGP.

• XRCE (Brun et al., 2016) achieved the best performance on EN-REST-16 for ASC in SemEval 2016.

• BERT-NSP (Hoang et al., 2019) uses a classification layer on top of the pre-trained BERT embeddings and has state-of-the-art performance for both ACC and ASC on the EN-REST-16 dataset.

• IIT-TUDA (Kumar et al., 2016) achieved the best performance on EN-LAPT-16 for ASC in SemEval 2016.

• TAN (Movahedi et al., 2019) makes use of a GRU and a multi-head attention layer to extract topic-related features for the ACC task.

5.5

Module Analysis

In order to investigate the importance and performance gain of each module, we conducted a series of experiments on EN-REST-16, in which we sequentially included the different modules of the selected AERGCN model variant and observed their corresponding contributions.

In the first step, we used the pre-trained semantic embeddings from the semantic branch as the base, where we decreased the size of the embeddings, applied average pooling and projected the representation onto the label space for softmax classification. Then, based on this basic setup, we included the semantic MHSA to extract a higher-level semantic representation for the final classification. The third step was to introduce the syntactic representation by applying the R-GCN layers to the reduced version of the pre-trained semantic embeddings; we similarly applied average pooling to the output of the R-GCN module and concatenated it with the semantic representation from the semantic MHSA to form the final representation for classification. The fourth step was to also include an MHSA on the output of the R-GCN module to generate the syntactic representation. The next step was to include the MHIA layer to correlate the semantic representation from the semantic MHSA with the syntactic representation from the syntactic MHSA to form the final syntactic representation, which we concatenated with the semantic representation from the semantic MHSA. Finally, we also included learning the POS tag embeddings to focus on the syntactic roles of the tokens in the input.

Figure 5.5: Line plot of the F1-score performance of BERT (uncased) using different numbers of heads.

Model                Additive   Scaled Dot-product
BERT-uncased         0.7576     0.7834
DistilBERT-uncased   0.7287     0.7359
BERT-cased           0.7455     0.7362
DistilBERT-cased     0.7394     0.724
RoBERTa              0.7988     0.7934
DistilRoBERTa        0.7651     0.7617

Table 5.3: F1-score performance of the BERT models on the ACC task with different score functions for the multi-head attention layers. The best performance for each model is indicated in bold.


Chapter 6

Results & Discussion

6.1

Overall Performance

Model               EN-REST-16          EN-LAPT-16
                    ACC      ASC        ACC      ASC
BERT                0.7364   0.8673     0.3016   0.8215
BERT-COM            0.6932   0.8814     0.298    0.7065
DistilBERT          0.7151   0.8661     0.386    0.8052
DistilBERT-COM      0.6368   0.8608     0.3838   0.7656
RoBERTa             0.7597   0.9057     0.2859   0.8414
RoBERTa-COM         0.7224   0.9011     0.3895   0.7765
DistilRoBERTa       0.7245   0.8615     0.2341   0.8315
DistilRoBERTa-COM   0.7108   0.877      0.2798   0.7417

Table 6.1: Experiment results of the AERGCN model variants on the SemEval 2016 test set. The top two performances are shown in bold.

The experiment results of the different AERGCN model variants are shown in Table 6.1. Firstly, the single task learning model AERGCN-RoBERTa had the best performance overall, achieving the top performance on both tasks on EN-REST-16 and on the ASC task on EN-LAPT-16. The combined model with RoBERTa embeddings, denoted AERGCN-RoBERTa-COM, achieved the best performance for ACC on EN-LAPT-16 and the second best result for ASC on EN-REST-16. As these two model variants performed well across tasks, we selected them as the models to compare with the baselines. For convenience, we refer to the single task learning model and the joint learning model as AERGCN and AERGCN-COM respectively hereinafter.

By comparing the original BERT models with their corresponding distillations, we observe that distillation trades off model performance against model complexity as expected, except for the BERT-based models on EN-LAPT-16. The average performance drop from using the distilled version was 2.49% for BERT (excluding the performance increases on EN-LAPT-16) and 4% for RoBERTa.

Model        EN-REST-16          EN-LAPT-16
             ACC      ASC        ACC      ASC
AERGCN       0.7597   0.9057     0.2859   0.8414
AERGCN-COM   0.7224   0.9011     0.3895   0.7765
NLANGP       0.7303   -          0.5194   -
XRCE         0.687    0.8813     -        -
IIT-TUDA     0.63     0.867      0.439    0.827
SVM          0.7418   -          0.5221   -
BERT-NSP     0.799    0.898      0.517    0.828
TAN          0.7838   -          -        -

Table 6.2: Experiment results of the selected AERGCN model variants and the baseline models on the SemEval 2016 test set. The top two performances are shown in bold.


As can be seen in Table 6.2, compared with the baselines, both AERGCN and AERGCN-COM achieved state-of-the-art performance for ASC on EN-REST-16. Moreover, the single task learning model also achieved state-of-the-art performance for the ASC task on EN-LAPT-16. Compared to ASC, the performance on the ACC task was relatively poor for both models, especially on EN-LAPT-16. We suspect that this is because syntactic relations are mainly indicative for associating tokens in the input sequence, which is inherently more relevant to the ASC task, as it associates the explicit/implicit references of the aspect categories with the tokens expressing their sentiment polarities.

In terms of the performance difference between EN-REST-16 and EN-LAPT-16, we believe it is due to the difference in task complexity. As mentioned in section 5.1, the number of aspect categories in EN-LAPT-16 is far larger than in EN-REST-16, while the number of data samples in the original dataset is of the same scale. Hence, the class distributions of EN-REST-16 and EN-LAPT-16 are significantly different. We suppose that the MHA modules are particularly sensitive to this class distribution, since for a different number of aspect categories, the appropriate number of heads, corresponding to the number of latent topics, is also different. Since the hyperparameters were tuned for the ACC task on EN-REST-16, we assume that the models trained for EN-LAPT-16 are suboptimal in the parameter space, which also explains some of the anomalies mentioned above.

6.2 Module Analysis

Model                   ACC        ASC
AERGCN-COM              0.7656*    0.8941
+ Semantic MHSA         0.7436     0.8975
+ R-GCN                 0.7332     0.8824
+ Syntactic MHSA        0.7620*    0.9010*
+ MHIA                  0.7312     0.8910
+ POS tag Embeddings    0.7224     0.9011*

Table 6.3: Experiment results for the addition of the different modules of the AERGCN-RoBERTa-COM model on the EN-REST-16 test set. AERGCN-COM denotes the plain semantic model with the joint learning setup; each subsequent row includes all additions from the previous rows. The top two performances per column are marked with an asterisk.

As described in section 5.5, we sequentially included the major modules of the model in several steps to observe their additional influence on performance. We take the AERGCN-COM model variant as an example, because it achieved generally good performance on all tasks and can learn both tasks jointly. Overall, as modules are added, the ACC performance drops by 2-5% compared to the plain semantic model, while the ASC performance increases marginally. After including the syntactic MHSA module, we observe that even though neither the ACC nor the ASC performance was the best, the overall performance across both tasks was the strongest. Comparing this step with the others, we conclude that including the syntactic branch and the MHSA layers for both the semantic and the syntactic branch had a positive influence on the tasks relative to the plain semantic model. However, including the MHIA layer and the POS tag embeddings had a clearly negative effect on ACC. We suspect that syntactic information is more beneficial to the ASC task because it can readily associate either the explicit or the implicit mention of an aspect category with its corresponding adjectival or adverbial modifier. Moreover, since the semantic information alone is already sufficient for the ACC task, adding the syntactic branch introduces many additional parameters that potentially cause overfitting for ACC.
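To make the ablation steps concrete, the following is a hedged sketch (not the thesis implementation) of how the modules listed in Table 6.3 could be composed as optional components on top of the plain semantic branch; the layer sizes, the simplified relational-GCN layer, and the fusion scheme are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AblatableAERGCN(nn.Module):
    """Sketch: each ablation row toggles one optional module on or off."""
    def __init__(self, hidden=768, heads=8, num_relations=50, num_pos_tags=20,
                 use_sem_mhsa=True, use_rgcn=True, use_syn_mhsa=True,
                 use_mhia=True, use_pos_emb=True):
        super().__init__()
        self.use_sem_mhsa, self.use_rgcn = use_sem_mhsa, use_rgcn
        self.use_syn_mhsa, self.use_mhia = use_syn_mhsa, use_mhia
        self.use_pos_emb = use_pos_emb
        self.sem_mhsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.syn_mhsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.mhia = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.pos_emb = nn.Embedding(num_pos_tags, hidden)
        # Simplified stand-in for the R-GCN: one weight matrix per dependency relation.
        self.rel_weights = nn.Parameter(0.02 * torch.randn(num_relations, hidden, hidden))

    def rgcn_layer(self, h, edges):
        """edges: iterable of (src, dst, relation) triples over token positions."""
        out = h.clone()
        for src, dst, rel in edges:
            out[:, dst] = out[:, dst] + h[:, src] @ self.rel_weights[rel]
        return torch.relu(out)

    def forward(self, sem_h, pos_ids, edges):
        syn_h = sem_h
        if self.use_pos_emb:            # enrich the syntactic branch with POS tag embeddings
            syn_h = syn_h + self.pos_emb(pos_ids)
        if self.use_sem_mhsa:           # semantic multi-head self-attention
            sem_h, _ = self.sem_mhsa(sem_h, sem_h, sem_h)
        if self.use_rgcn:               # relation-aware message passing over the dependency graph
            syn_h = self.rgcn_layer(syn_h, edges)
        if self.use_syn_mhsa:           # syntactic multi-head self-attention
            syn_h, _ = self.syn_mhsa(syn_h, syn_h, syn_h)
        if self.use_mhia:               # interactive attention: semantic queries attend to syntax
            sem_h, _ = self.mhia(sem_h, syn_h, syn_h)
        return sem_h.mean(dim=1)        # pooled representation for the ACC/ASC classifiers

# Toggling flags off reproduces an ablation row, e.g. the "+ Syntactic MHSA" step:
model = AblatableAERGCN(use_mhia=False, use_pos_emb=False)
tokens = torch.randn(2, 6, 768)                 # BERT-style token encodings (batch, seq, dim)
pos = torch.randint(0, 20, (2, 6))              # POS tag ids
pooled = model(tokens, pos, edges=[(0, 1, 3), (2, 1, 7)])
```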

6.3 Out-of-domain Evaluation

Model EN-REST-16 EN-LAPT-16 ACC ASC ACC ASC AERGCN 0.1817 0.8615 0.0894 0.8015 AERGCN-COM 0.2112 0.8135 0.0423 0.809

Table 6.4: Out-of-domain experiment results of the selected AERGCN model variants on the SemEval 2016 test set; each model was trained on one domain and evaluated on the test set of the other.


To investigate the cross-domain generalization ability of the AERGCN model, we also conducted experiments with the selected AERGCN model variants: we trained the models on the dataset from one domain and evaluated their performance for both ACC and ASC on the dataset from the other domain. As above, we selected the best-performing AERGCN-RoBERTa and AERGCN-RoBERTa-COM variants, corresponding to the single-task learning model and the joint learning model respectively, and we refer to them as AERGCN and AERGCN-COM for convenience. As shown in Table 6.4, the generalization ability of both AERGCN and AERGCN-COM was relatively good for both tasks when they were trained on EN-LAPT-16, compared to when they were trained on EN-REST-16. We suppose this is because the dataset size of EN-LAPT-16 is considerably larger than that of EN-REST-16 after population, which mitigates overfitting.
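The evaluation protocol itself is straightforward; a minimal sketch is given below, where train_fn and eval_fn stand in for whatever training and scoring routines are used and are purely hypothetical names:

```python
# Hedged sketch of the out-of-domain protocol: fit on one domain's training
# split and score ACC/ASC on the other domain's test split.
from itertools import permutations

def cross_domain_eval(datasets, train_fn, eval_fn):
    """datasets maps a domain name to its (train_split, test_split) pair, e.g.
    {"EN-REST-16": (rest_train, rest_test), "EN-LAPT-16": (lapt_train, lapt_test)}."""
    results = {}
    for train_dom, eval_dom in permutations(datasets, 2):
        model = train_fn(datasets[train_dom][0])      # train in-domain
        test_split = datasets[eval_dom][1]            # evaluate out-of-domain
        results[(train_dom, eval_dom)] = {
            "ACC": eval_fn(model, test_split, task="ACC"),
            "ASC": eval_fn(model, test_split, task="ASC"),
        }
    return results
```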

Furthermore, the ASC task is essentially more general, as sentiments are usually expressed in regular semantic and syntactic patterns. This is also indicated by the fact that, when only the ASC performance is taken into consideration, the out-of-domain results remain remarkably good, within 10% of the state-of-the-art performance (0.9011 for EN-REST-16 and 0.8414 for EN-LAPT-16). On the contrary, for the ACC task, we observe that the performance of the out-of-domain models is poor. We assume that this is because identifying the aspect categories requires relatively more domain-related semantic information.

6.4 Case Study

ID | Review                                                                | Aspect               | OTE        | Label | Prediction
1  | Freshest sushi – I love this restaurant.                             | FOOD#QUALITY         | sushi      | pos   | pos ✓
   |                                                                       | RESTAURANT#GENERAL   | restaurant | pos   | pos ✓
2  | I swore never to return for a warm beer and mediocre meal.           | DRINKS#QUALITY       | beer       | neg   | neg ✓
   |                                                                       | FOOD#QUALITY         | meal       | neg   | neg ✓
3  | Food was good and cheap.                                             | FOOD#QUALITY         | Food       | pos   | pos ✓
   |                                                                       | FOOD#PRICES          | Food       | pos   | pos ✓
4  | It’s fresh, welcoming, delicious, and relaxing.                      | FOOD#QUALITY         | -          | pos   | pos ✓
   |                                                                       | AMBIENCE#GENERAL     | -          | pos   | pos ✓
5  | Mercedes restaurant is so tasty, the service is undeniably awesome!  | SERVICE#GENERAL      | service    | pos   | pos ✓
   |                                                                       | FOOD#QUALITY         | -          | pos   | pos ✓
6  | They are not greasy or anything.                                     | FOOD#QUALITY         | -          | pos   | neu ✗
   |                                                                       | FOOD#STYLE_OPTIONS   | -          | unrel | neu ✗
7  | tucked away over by the Beverly Center.                              | LOCATION#GENERAL     | -          | neu   | unrel ✗
   |                                                                       | AMBIENCE#GENERAL     | -          | unrel | pos ✗
   |                                                                       | RESTAURANT#GENERAL   | -          | unrel | pos ✗
8  | Great Pizza, Poor Service                                            | FOOD#QUALITY         | Pizza      | pos   | neg ✗
   |                                                                       | SERVICE#GENERAL      | Service    | neg   | neg ✓
9  | Plan on waiting 30-70 minutes.                                       | SERVICE#GENERAL      | -          | neu   | neg ✗

Table 6.5: Examples of customer reviews for AERGCN-COM evaluated on EN-REST-16. The annotations for OTEs are from the original dataset. Correct predictions are marked with ✓ and incorrect ones with ✗. Abbreviations: unrel - unrelated, rel - related, pos - positive, neg - negative, neu - neutral.

To further understand and explain the behavior of the AERGCN-COM model, we selected several “good” and “bad” examples during evaluation from both domains. The examples are shown in Table 6.5 and Table 6.6.

When observing the correctly predicted examples, we find that one important factor for correct classification is the presence of explicit mentions of the aspect categories, i.e., OTEs. For example, in reviews 1, 2 and 3 from EN-REST-16 and reviews 1 and 2 from EN-LAPT-16, an OTE is present for each aspect category. We believe that, for a correct classification, the model first pinpoints the OTEs for the given aspect categories using the semantic information, then exploits the syntactic information to associate the corresponding sentiment expressions with the identified OTEs, and finally utilizes the semantic information again to classify the sentiment polarity of the associated sentiment expressions. In these cases, the syntactic patterns that associate the OTEs with their sentiment expressions are common and regular, and the model can learn them directly.


ID | Review                                                                      | Aspect                     | OTE     | Label | Prediction
1  | Very great Apple product as expected.                                      | LAPTOP#GENERAL             | product | pos   | pos ✓
   |                                                                             | COMPANY#GENERAL            | Apple   | pos   | pos ✓
2  | But, hey, it’s an Apple.                                                   | COMPANY#GENERAL            | Apple   | pos   | pos ✓
   |                                                                             | LAPTOP#GENERAL             | toy     | pos   | pos ✓
3  | I would run back to computer store and return this expensive toy.          | LAPTOP#GENERAL             | toy     | neg   | neg ✓
   |                                                                             | LAPTOP#PRICE               | -       | neg   | neg ✓
4  | for the price that i paid i feel that i got good value.                    | LAPTOP#PRICE               | price   | neu   | neu ✓
   |                                                                             | LAPTOP#GENERAL             | -       | pos   | pos ✓
5  | Plus you get 500GB which is also a great feature.                          | HARD_DISC#DESIGN_FEATURES  | -       | pos   | unrel ✗
   |                                                                             | LAPTOP#DESIGN_FEATURES     | -       | unrel | pos ✗
   |                                                                             | LAPTOP#GENERAL             | -       | unrel | pos ✗
6  | While the price may seem high, it is worth every penny with its many perks.| LAPTOP#PRICE               | price   | neg   | pos ✗
   |                                                                             | LAPTOP#GENERAL             | -       | pos   | pos ✓
7  | :(                                                                          | LAPTOP#GENERAL             | -       | neg   | pos ✗
   |                                                                             | SUPPORT#QUALITY            | -       | unrel | neg ✗
8  | still have my ibook g4 i bought in 2005.                                   | COMPANY#GENERAL            | -       | unrel | pos ✗
   |                                                                             | LAPTOP#GENERAL             | -       | unrel | pos ✗

Table 6.6: Examples of customer reviews for AERGCN-COM evaluated on EN-LAPT-16. OTEs are manually annotated due to the lack of annotations in the original dataset. Correct predictions are marked with ✓ and incorrect ones with ✗. Abbreviations: unrel - unrelated, rel - related, pos - positive, neg - negative, neu - neutral.

However, there are also cases with explicit mentions of the aspects where the model fails to predict the correct label. In review 8 from EN-REST-16 and review 6 from EN-LAPT-16, OTEs exist for part of the corresponding aspects, yet the predictions show the opposite sentiment to the labels. We attribute these errors to an incorrect association of OTEs with their sentiment expressions. For instance, in review 8 from EN-REST-16, given that the sentiment expression “Great” usually strongly indicates a positive sentiment, the incorrect classification is most likely due to “Pizza” being associated with “Poor”, which is reachable within two hops (“Pizza” to “Service” and “Service” to “Poor”). Moreover, we also suspect that for some sentiment expressions the semantic information is not extracted properly. Taking review 6 from EN-LAPT-16 as an example, “LAPTOP#PRICE” is misclassified as positive because, in the association of “price” with “seem high”, the semantic information of “seem high” is insufficient for the model to bias towards a negative sentiment polarity.
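Such hop counts can be inspected directly with the dependency parser used in this work (spaCy); the snippet below is an illustrative sketch only, and the exact arcs and distances depend on the spaCy model and version (en_core_web_sm is an assumed choice).

```python
# Hedged sketch: inspecting dependency arcs and hop counts with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")          # assumed model; not necessarily the one used here
doc = nlp("Great Pizza, Poor Service")

# Print each token with its head and dependency relation.
for token in doc:
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")

def undirected_hops(a, b):
    """Number of dependency arcs between two tokens, treating the tree as undirected."""
    path_a = [t.i for t in [a, *a.ancestors]]        # indices from the token up to the root
    path_b = [t.i for t in [b, *b.ancestors]]
    lca = next(i for i in path_a if i in path_b)     # lowest common ancestor
    return path_a.index(lca) + path_b.index(lca)

print(undirected_hops(doc[1], doc[3]))       # hops between "Pizza" and "Poor" under this parse
```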

On the other hand, when there are only implicit mentions of the aspects, we observe that the model becomes rather error-prone. In the correctly classified cases, we believe the model relies more on the semantic information of the adjectival/adverbial modifiers, as can be observed in reviews 4 and 5 from EN-REST-16 and reviews 4 and 6 from EN-LAPT-16. Similarly, for the misclassifications in review 6 from EN-REST-16 and reviews 5 and 7 from EN-LAPT-16, the reason is the insufficient semantic information extracted for the adjectival/adverbial modifiers.

In some of the “bad” examples, like review 9 from EN-REST-16 and review 8 from EN-LAPT-16, we find that the content of the review is ambiguous or insufficient for classification. Even though both EN-REST-16 and EN-LAPT-16 are sentence-level datasets, the reviews are usually split into several sentences from the original review paragraphs, so the required semantic information may lie in the context of the preceding sentences.


Chapter 7

Conclusion

In this research, we examined the limitations and challenges of current work in ABSA, as described in chapter 1. We proposed the novel AERGCN model, which selectively encodes syntactic relations, further exploits syntactic information through POS tag embeddings of the input, and combines semantic encodings with syntactic encodings to obtain a multi-dimensional representation of the input. We also broadened the research scope to the general ABSA task, including both ACC and ASC, and provided a combined model setup to learn the tasks jointly. Moreover, our proposed model variants achieved competitive results on these tasks and multiple state-of-the-art results for ASC.

For future work, this study can be improved by first investigating the optimal parameter settings for the EN-LAPT-16 domain, as discussed above. The joint learning model can also be used during hyperparameter tuning for better generalization between the tasks, and more in-depth hyperparameter tuning can be applied to ensure the optimal performance of the model. Secondly, methods to mitigate the class imbalance introduced by preprocessing can be investigated, and introducing label smoothing can improve the robustness of the model and prevent overfitting. Furthermore, we used the out-of-the-box dependency parser and POS tagger from spaCy (Honnibal & Montani, 2017); different methods could be implemented to select the most appropriate tools. Moreover, the BERT models were only fine-tuned in a supervised fashion with an end-to-end objective. However, they could also be fine-tuned on the masked language modeling task with rich, external, domain-related corpora, which would help prevent overfitting. Finally, the research scope can be further extended with aspect term extraction, which requires an additional sequence labeling module.
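As an illustration of this last suggestion on domain-adaptive pre-training, a minimal sketch of masked-language-model fine-tuning with the Hugging Face transformers library is given below; the corpus file name and the training hyperparameters are assumptions, not settings used in this thesis.

```python
# Hedged sketch: masked-language-model fine-tuning on an unlabeled domain corpus
# (e.g. restaurant or laptop reviews, one sentence per line in domain_reviews.txt).
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, LineByLineTextDataset,
                          Trainer, TrainingArguments)

model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="domain_reviews.txt",   # assumed corpus file
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                            mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-domain", num_train_epochs=3,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
model.save_pretrained("mlm-domain")   # could later serve as the AERGCN embedding backbone
```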


References

Bahdanau, D., Cho, K. H., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.

Brun, C., Perez, J., & Roux, C. (2016). XRCE at SemEval-2016 task 5: Feedbacked ensemble modelling on syntactico-semantic knowledge for Aspect Based Sentiment Analysis (Tech. Rep.). Retrieved from http://semeval2016.xrce.xerox.com/ doi: 10.18653/v1/s16-1044

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP 2014 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 1724–1734). Association for Computational Linguistics (ACL). doi: 10.3115/v1/d14-1179

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, 1, 4171–4186. Retrieved from http://arxiv.org/abs/1810.04805

Do, H. H., Prasad, P. W., Maag, A., & Alsadoon, A. (2019). Deep Learning for Aspect-Based Sentiment Analysis: A Comparative Review (Vol. 118). Elsevier Ltd. doi: 10.1016/j.eswa.2018.10.003

Dong, L., Wei, F., Tan, C., Tang, D., Zhou, M., & Xu, K. (2014). Adaptive Recursive Neural Network for target-dependent Twitter sentiment classification. In 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014 - Proceedings of the Conference (Vol. 2, pp. 49–54). Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/P14-2009 doi: 10.3115/v1/p14-2009

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Journal of Machine Learning Research (Vol. 9, pp. 249–256). Retrieved from http://www.iro.umontreal.

Gokaslan, A., & Cohen, V. (2019). OpenWebText Corpus. Retrieved from https://skylion007.github.io/OpenWebTextCorpus/

Hoang, M., Bihorac, O. A., & Rouces, J. (2019). Aspect-Based Sentiment Analysis using BERT. Proceedings of the 22nd Nordic Conference on Computational Linguistics, 187–196. Retrieved from https://www.aclweb.org/anthology/W19-6120

Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. doi: 10.1162/neco.1997.9.8.1735

Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. doi: 10.5281/zenodo.1212304

Hu, M., & Liu, B. (2004). Mining opinion features in customer reviews. In Proceedings of the 19th National Conference on Artificial Intelligence (pp. 755–760). San Jose, California: AAAI Press.

Huang, B., & Carley, K. M. (2020). Syntax-aware aspect level sentiment classification with graph attention networks. In EMNLP-IJCNLP 2019 - Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference (pp. 5469–5477). Association for Computational Linguistics (ACL). Retrieved from http://arxiv.org/abs/1909.02606 doi: 10.18653/v1/d19-1549

Jihan, N., Senarath, Y., Tennekoon, D., Wickramarathne, M., & Ranathunga, S. (2017). Multi-domain aspect extraction using support vector machines. In Proceedings of the 29th Conference on Computational Linguistics and Speech Processing, ROCLING 2017 (pp. 308–322).

Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.

Kipf, T. N., & Welling, M. (2016). Semi-Supervised Classification with Graph Convolutional Networks. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings. Retrieved from http://arxiv.org/abs/1609.02907

Kumar, A., Kohail, S., Kumar, A., Ekbal, A., & Biemann, C. (2016). IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain Dependency and Distributional Semantics Features for Aspect Based Sentiment Analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) (pp. 1129–1135). San Diego, California: Association for Computational Linguistics.
