Aspect-based Sentiment Analysis of Social Media Data with Pre-trained Language Models

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Anina Troya

11154659

MASTER INFORMATION STUDIES Data Science

FACULTY OF SCIENCE UNIVERSITY OF AMSTERDAM

GitHub Repository https://github.com/aninatroya/ABSA_Tweets.git

1st Supervisor: Reshmi Gopalakrishna Pillai, r.gopalakrishnapillai@uva.nl, Faculty of Science, Information Studies

2nd Supervisor: Dr Cristian Rodriguez Rivero, Faculty of Science, Information Studies

External Supervisors: Dr Zulkuf Genc, zulkuf.genc@prosus.com; Dr Subhradeep Kayal, deep.kayal@prosus.com; Dogu Araci, dogu.araci@prosus.com, Prosus


ABSTRACT

There is great scope in utilizing the increasing content expressed by users on social media platforms such as Twitter. This study explores the application of Aspect-based Sentiment Analysis (ABSA) to tweets to retrieve fine-grained sentiment insights. The plant-based food domain is chosen as the area of focus. To the best of our knowledge, this is the first time the ABSA task has been applied to this sector; it is distinct from standard food products because different and controversial aspects arise and opinions are polarized. The choice is relevant because these products can help meet the sustainable development goals and improve the welfare of millions of animals. Pre-trained BERT, 'Bidirectional Encoder Representations from Transformers', is leveraged for this task; it stands out because it was trained to learn from all the words in a sentence simultaneously using transformers. The aim was to develop methods applicable to real-life cases, so lowering the dependency on labeled data and improving performance were the key objectives. This research contributes to existing approaches to ABSA by proposing data processing techniques that adapt social media data for ABSA. The project presents a new method for aspect detection which does not rely on labeled data, using regular expressions (Regex). For aspect sentiment classification, a semi-supervised learning technique is explored. Additionally, Part-of-Speech tags are incorporated into the predictions. The findings show that Regex is a solution for eliminating the dependency on labeled data for aspect detection. For aspect sentiment classification, fine-tuning pre-trained BERT on a small subset of data was the most accurate method to lower the dependency on aspect-level sentiment data.

KEYWORDS

Aspect-Based Sentiment Analysis, ABSA, BERT, Semi-supervised, POS tags, Social Media Data, Plant-Based Domain.

1 INTRODUCTION & RESEARCH QUESTIONS

In the era of digitalization, the rise of social media has changed the way information flows around the world and the way social demands are organized. This transformation began in the early 2000s and has been rapid and profound [19]. In 2004 there were fewer than a million social media users; now, out of a population of around 7.8 billion, at least 3.5 billion are active online [19]. Popular platforms such as Twitter, Facebook and Instagram are a huge source of information containing opinions from varied groups in real time. On Twitter alone there are 330 million monthly users, and 500 million tweets are sent each day [8]. Simultaneously, Natural Language Processing (NLP) and the computational capacities of computers are evolving continuously [35]; NLP allows computers to understand textual data. Consequently, social media has become a rich source for people-centric organizations.

Sentiment Analysis (SA) is a form of NLP and serves as a tool to retrieve information about human emotions. There are three levels of analysis: document level, sentence level and, most recently, aspect level. Much research has been devoted to SA of Twitter data at the sentence level, focusing on the aggregate sentiment of a tweet [14]. Aspect-based Sentiment Analysis (ABSA), on the other hand, is a fine-grained form of SA and has been applied mostly to customer review data sets [22]; its aim is to find the distinct topic features (aspects) mentioned in a text and their respective sentiments. For instance, in the sentence 'I am healthier eating more vegetables, and they are delicious.' the aspects and sentiments are 'health, positive' and 'taste, positive'. Initially, traditional machine learning techniques were applied to perform this task [22]. Later on, more complex deep learning models were used; these accounted for word ambiguity derived from context [10].

With the rise of transfer learning, networks pre-trained on large amounts of data have shown groundbreaking performance [25]. In NLP, pre-trained Bidirectional Encoder Representations from Transformers (BERT) stands out because it uses transformers that learn from words in all positions in the sentence simultaneously [9]. BERT was introduced in 2018 by Google AI and has now achieved state-of-the-art results in many NLP tasks, including ABSA [11].

Additionally, lexical and syntactical features have been incorporated into neural models, improving results. For instance, Shi and Lin (2019) used BERT and incorporated Part-of-Speech (POS) tags and dependency trees for relation extraction. In the case of the ABSA task, approaches combining deep learning models with syntactic, rule-based methods have also led to improvements in predictive capability [23]. When solving the ABSA task, it would be interesting to see how POS tags could improve predictions by establishing relationships between nouns (aspects) and their respective opinions (adjectives).

With respect to model performance, there is a risk that, when using transfer learning with BERT, the same model can perform worse in a different domain [11]. In this case a possible solution is to 'post-train' the model with relevant data, in particular the layers of the network that come after BERT, which sits in the first layers. In practice, semi-supervised learning is a feasible approach to this post-training, since it minimizes the amount of costly fine-grained aspect-level labeled data required and leverages unlabeled data as well [3]. The focus of this research is the "plant-based food sector". In terms of ABSA, it stands out from the conventional food sector in that the content and aspects found in these sentences are more diverse than usual food product features; they are controversial and go beyond the usual scope. For instance, "similarity", "texture", "health benefits", "environment" and "stigma" could be the expected aspects. Another difference is that public sentiment towards these food alternatives to animal meat is strongly polarized. The sector is gaining momentum and is a frequent topic of conversation online. It is chosen because of its immense potential to mitigate climate change, prevent pandemics, and improve the lives of the animals raised in the industrialized food production system predominant around the world [6].

Specifically, Twitter posts containing the hashtag '#BeyondMeat' and/or the mention '@BeyondMeat' will be used as a sample. The '#' in a social media post helps those interested in a topic to find it by searching for a keyword; the '@' is used to tag a particular social media page. Beyond Meat Inc. is a publicly traded company that sells meat-like veggie burgers and simulated chicken, beef and pork sausages. The objective of this company is to emulate the taste and texture while dispensing with processed meats [24]. Beyond Meat is a good candidate for this study because it is the fastest growing food producer in the United States [27], and many opinionated tweets are posted about it daily. An example of such a tweet is: "Not a huge fan of BeyondMeat. But when we're comparing it to real BEEF, it is only the data that matters. PlantBased for the animals, and for the planet. Look at how much water we waste on animal slaughter." Here, for the aspect "taste" of the burger, the sentiment of this user is negative; on the other hand, the sentiment is positive for aspects such as the animals, water waste and the planet.

The above considerations lead to the following research question and sub-questions:

Research Question - How can Aspect-based Sentiment Analysis of tweets be done for the plant-based food domain using pre-trained language models?

Sub-questions - To what extent is the extraction of target aspects using Regex a feasible solution to automatically assign aspect category labels to the sentences? - To what extent can pre-trained language models be leveraged to reduce or eliminate the dependency on labeled data?

- What is a feasible way to incorporate POS-tags into the predictions?

- To what extent does the incorporation of POS tags improve ABSA performance?

- To what extent are semi-supervised learning techniques a feasible solution to address the problem of costly fine-grained labeled data?

The contribution of this research is that it investigates the application of the pre-trained deep learning model BERT to the task of ABSA using semi-supervised learning techniques, with the additional feature of incorporating POS tags into the predictions. To the best of our knowledge, this investigation is the first of its type to implement such an architecture. Moreover, much research has focused on applications of ABSA to product reviews, whereas this research aims to take advantage of the vast number of opinions available in social media posts. This is achieved by setting as the objective the development of a practical model that can be used for ABSA of social media posts in real-life cases.

The approach to solving these questions starts with retrieving data from the Twitter API for mentions of Beyond Meat. Then labels must be generated for a small portion of the tweets. For BERT, ABSA is turned into a sentence-pair classification task by constructing an auxiliary sentence [30]. Additionally, the word embeddings and the POS embeddings must be combined, and the model is post-trained with semi-supervised learning.
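To make the sentence-pair construction concrete, below is a minimal sketch of how a (sentence, target aspect) pair could be turned into the two segments BERT expects. The question-style auxiliary template is an illustrative assumption in the spirit of [30]; it is not necessarily the exact template used in this work.

```python
# Hypothetical sketch: turning (sentence, aspect) into a BERT sentence pair.
def make_sentence_pair(sentence: str, aspect: str):
    """Build the two segments fed to BERT as a sentence-pair input."""
    auxiliary = f"what do you think of the {aspect} of it?"  # assumed template
    return sentence, auxiliary

s, aux = make_sentence_pair(
    "I am healthier eating more vegetables, and they are delicious.", "health")
# A BERT tokenizer would then encode them as:
#   [CLS] <sentence tokens> [SEP] <auxiliary tokens> [SEP]
print(s, "|", aux)
```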

This research is structured as follows: Section 2 presents the related work, positioning this study within existing research. Section 3 is the research methodology, from data acquisition techniques through data preprocessing and feature extraction, until the models are explained and the experiments outlined. Section 4 contains the results of the experiments, followed by Section 5 with the discussion and deeper insights. Finally, Section 6 is the conclusion, which summarizes the research and highlights the main contributions together with suggestions for future work. At the end is the Appendix, which is referenced throughout the paper.

2 RELATED WORK

2.1 Sentiment Analysis for the Plant-based Food Sector

The plant-based food sector is experiencing fast growth and is predicted to continue to do so. Nevertheless, while some groups recognize the value of these products, others are skeptical about them. SA is a tool that can help understand these perceptions.

Zhu (2020) conducted Twitter SA research on the plant-based brand "Impossible Foods" (a competitor). In spite of the company's portrayed image of health and environmental sustainability, it still faces much criticism on Twitter, since a majority of tweets with negative sentiment was found.

Note that SA research for the plant-based food sector is scarce compared to research focused on standard food companies [7]. Additionally, these insights are limited to the aggregated sentiment of each tweet at the sentence level. Public perceptions of these products are influenced by the obscurity that arises from current legislation and mass media. Aspect-level understanding would be a tool for organizations to gain a fine-grained understanding and use targeted efforts to overcome negative bias.

2.2 Aspect Based Sentiment Analysis (ABSA)

In the seminal work of Hu et al. (2004), in addition to the document and sentence levels of Sentiment Analysis (SA), a third level was presented: the aspect level. Formally, two variants of ABSA were presented at the SemEval 2014 competition [22].

They are visualized with this example:

"I think the environment should come first, but on a daily basis I think about practicality when I make decisions"

The first variant (Subtasks 3 and 4, SemEval 2014) works with broad aspect categories:

Subtask 3, Aspect Category Detection (ACD): The aspects detected in the sentences are broader categories which are not explicitly mentioned in the sentence. ACD: "environmental sustainability".

Subtask 4, Aspect Category Sentiment Classification (ASC): Determination of the sentiment polarity expressed about the aspect category. ASC: "conflict".

The second variant (Subtasks 1 and 2, SemEval 2014) works with targets as they are explicitly mentioned:

Subtask 1, Aspect-target Extraction (AE): Target aspects which are explicitly mentioned in the sentence are extracted. AE: "environment".

Subtask 2, Aspect-target Sentiment Classification (ATSC): Determination of the sentiment polarity expressed about the target aspect. ATSC: "conflict".

The focus of this research is on the subtask ASC. This decision was made by analysing the types of aspects and their frequencies across the data set (Section 4.4). For this task a sentence 's' with its respective target aspects 'ta' should be provided as input to the model, and the sentiment polarity 'p' expressed about 'ta' in the given sentence is the prediction to be made. This is applicable for organizations that aim to recognize the public sentiment about target aspects that are relevant to them. Moreover, when working with real-life social media cases it is necessary to devise methods for ACD to generate the 'ta's to be provided as input to the ASC models, as fine-grained labeled data is not available.

Initially, most of the approaches to ABSA used traditional machine learning techniques [22], and the best performing models relied on SVMs using n-grams [31]. They were limited because these methods did not consider ambiguity in the meaning of words that depends on the surrounding words [22].

ABSA has been a fast moving research field. Deep Learning (DL) has evolved to consider such nuances regarding word meaning in context. Recurrent neural networks (RNNs) account for the sequential nature of language, LSTMs account for long-term dependencies [20], bidirectional RNNs account for future and past contextual dependencies, and transformers account for non-linear contextual dependencies [20]. A comparative review of deep learning models used for ABSA explored the performance of deep neural networks (DNNs), RNNs, convolutional neural networks (CNNs), recursive neural networks (RecNNs) and LSTMs [10]. To improve performance, several of these models used fine-tuned and pre-trained word embeddings as well as POS tags to incorporate grammatical factors [10].

DL models pose challenges: they require a substantial amount of labeled data to learn accurately, and they are computationally expensive. Pre-trained models that have previously been trained on large amounts of data and can be fine-tuned for different tasks can be used to overcome this. In this way models such as ULMFit [29], ELMo [4], GPT-2 [5], XLNet [33] and Bidirectional Encoder Representations from Transformers (BERT) have been applied to several NLP tasks with improved performance.

Aspect-Based Sentiment Analysis with BERT. Pre-trained BERT in particular has achieved state-of-the-art performance for ABSA [11]. It stands out because it uses transformer encoders, which interpret words within their context by considering all words in the sentence simultaneously [2]. Initially, studies on ABSA with BERT [32] approached the task by post-training BERT for a review reading comprehension task. Later, ABSA research with BERT was done by constructing auxiliary sentences from the aspect, converting ABSA into a sentence-pair classification task [30]. Hoang et al. (2019) used a similar architecture and proposed a combined model for both aspect and sentiment classification which outperformed previously advanced results [11]. These studies used the human-annotated data sets from SemEval [22] and SentiHood [26] and did not address lowering the method's dependency on labeled data.

2.3 Part of Speech Tagging for ABSA

Again, to the best of our knowledge, BERT models for ABSA have not yet incorporated POS tags into their predictions. Various other deep learning models, combined with rule-based methods that consider syntactic meaning and sentence structure, have shown improved ABSA performance from such an incorporation; for instance, deep CNNs with POS tags, RecNNs with POS tags, dependency tree-based CNNs with POS tags, and several others [10]. Considering BERT models and POS tags, this combination has been used for other NLP tasks: Shi and Lin [28] used BERT models and incorporated lexical and syntactical features with the use of POS tagging and dependency trees. It would be interesting to see how these POS tags could be used to improve ABSA predictions with BERT by establishing relationships between nouns (aspects) and their respective opinions (adjectives).

2.4 Semi-supervised learning

Additionally, when using transfer learning with BERT for a task in a domain different from the one BERT was pre-trained on, there is the possibility to post-train the model with a data set from the relevant domain. BERT itself sits in the first layers of the network, and its complexity makes it nearly impossible to change most of its parameters, so post-training is done in the subsequent layers. A challenge that arises for ABSA is that data labeled at such a fine-grained level is costly to generate; on the other hand, there is a large amount of unlabeled data. Semi-supervised learning is a practical approach to this problem: it allows the use of large amounts of unlabeled data together with some labeled data, thus minimizing the overhead of generating labeled data [3]. Li, Chow and Zhang [17] implemented this methodology of semi-supervised sequence learning over a small set of labeled reviews and a large set of unlabeled reviews from the same domain for ABSA. They conducted experiments on the data sets from the SemEval workshops and achieved pioneering performance.

3 RESEARCH METHODOLOGY

3.1 The Data

3.1.1 Existing Gold Standard Data Sets: Prevalent research on ABSA and TABSA has used the SemEval 2014 Task 4 (Pontiki et al.) and the SentiHood (Saeidi et al.) data sets for training and testing models. The data are customer reviews for the restaurant and laptop domains, and text from a question-answering platform with reviews about different locations in London, respectively. In SentiHood, the 't' for each sentence is a 'target entity-target aspect pair' (t-a pair) or a list of (t-a) pairs for which a sentiment is expressed. In SemEval 2014, each 't' is composed only of an 'a', hence the entity 'e' is implicit. These variations are taken into account for the ACD task (Section 4.4).

3.1.2 Social Media Data: Social media posts depend on each user and, unlike in product reviews where users are asked about particular offerings, each user is free to choose the topics they want to talk about. From this, three scenarios are considered: cases in which users mention different products or food choices 'te' with an opinion but without mentioning the different 'a' about them (here the 'a' can also be considered); cases in which no 'te' is mentioned and users instead talk about their perceptions of 'a' directly, not in relation to any 'te' (a setup similar to SemEval 2014 Task 4); and finally, cases in which sentences contained sentiments about aspects related to entities, forming 't-a' pairs as in SentiHood.

Apart from structure, it is important to consider that social media posts differ from product reviews and from reviews on question-answering platforms in that they are noisy and of unpredictable length. With respect to Twitter, there are three main limitations to the large Twitter data sets available online. Firstly, they are limited to certain topics, among which plant-based food is not included. Secondly, the data were retrieved on earlier dates and are thus less relevant, since sentiments change over time with external events and/or marketing campaigns. Lastly, these data sets have been labeled for Sentiment Analysis tasks and therefore portray aggregated tweet-level sentiments. The current aim is to achieve fine-grained insights about the different sentiments expressed about targets at sentence level. Consequently, it is relevant to have updated data scraped using keywords of a given domain, business or organization.

3.1.3 Data Acquisition: The different data acquisition libraries, Tweepy, Snscrape and Twint, are compared and discussed in Appendix A1, along with the decision process and limitations that led to the final data set. Snscrape was the best option, considering it has fewer restrictions and is in line with Twitter's terms and conditions. The final list of keywords in the search was: 'Beyond Meat', 'Beyond Burger', 'Vegan', 'Veganism' and 'Plant Based'. The result contained 348,114 tweets, of which 295,678 date from 7 August 2020 until 14 December 2020.
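As a rough illustration of this acquisition step, the sketch below uses snscrape's Python interface. The query string, result cap and selected fields are assumptions for demonstration; the actual queries and their limitations are documented in Appendix A1.

```python
# Hedged sketch of scraping tweets with snscrape's Python API.
import snscrape.modules.twitter as sntwitter

query = '"Beyond Meat" since:2020-08-07 until:2020-12-14 lang:en'  # illustrative

rows = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 100:  # cap the example run
        break
    # Older snscrape versions expose the text as `tweet.content`;
    # newer ones use `tweet.rawContent`.
    rows.append({"id": tweet.id, "date": tweet.date, "text": tweet.content})
```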

3.1.4 Domain Relevant Labeled Data: The aspect category detection and sentiment polarity inference procedures are explained in Section 4.3.

3.2 Pre-processing

The unstructured data from the Twitter API must be pre-processed and cleaned. Subsequently, exploratory data analysis was conducted.

3.2.1 Cleaning: Cleaning the data set before further processing was an especially important and laborious task when working with social media data. The '#' and '@' symbols and URLs were removed. Some hashtag words were part of a sentence and semantically relevant; in these cases only the '#' was removed and the consecutive word remained. On the other hand, many words accompanying '#' had no semantic relevance and had to be removed together with the '#'. A baseline rule was developed to distinguish whether or not the '#' word was relevant. The Python module 're' was used for regular expression (regex) operations; patterns of Unicode strings were searched to modify each sentence using the defined patterns [15]. Emojis, given that they are expressions of emotion, were not eliminated; instead they were added as special tokens to the BERT tokens in later stages of the methodology.
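The sketch below shows the kind of regex cleaning described above, using Python's 're' module. The exact patterns and the baseline hashtag-relevance rule used here are not reproduced; the heuristic of always keeping the word after '#' is a simplifying assumption.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # URLs
MENTION_RE = re.compile(r"@\w+")       # @mentions

def clean_tweet(text: str) -> str:
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    # Simplification: keep the word after '#', dropping only the symbol.
    text = re.sub(r"#(\w+)", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("Loving the new #BeyondMeat burger! @BeyondMeat https://t.co/xyz"))
# -> "Loving the new BeyondMeat burger!"
```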

3.2.2 Exploratory Data Analysis: Libraries from the Natural Language Toolkit (NLTK) are used for statistical natural language processing. Hashtags are used to indicate the topic of a post, so hashtag frequencies were compared across the data to obtain insights into which topics were present. The WordCloud library, plotted with Matplotlib, outputs words in different sizes, each representing the importance of the word according to its frequency. This was done separately for the data corresponding to each keyword. These insights were used in later stages of the methodology (Sections 3.5 and 3.6), where the target entities and target aspects were determined and extracted. Figure 1 in Appendix A.2 presents examples of these clouds.

3.2.3 Sentence Segmentation: Each tweet was split into sentences with the NLTK sentence tokenizer for a sentence-level analysis. The tweet id value can be used to trace back which sentences were together in a single tweet.

3.2.4 Subjectivity Analysis: Subjectivity implies expressions of opinions and feelings. Subjectivity analysis serves to determine whether tweets contain sentiments or are objective statements. Tweets that score low on subjectivity are excluded from the analysis. The TextBlob library is used to calculate the sentence subjectivity score, a float between 0 and 1, with 1 being highly subjective and 0 highly objective. Sentences with a score below 0.4 were excluded from the analysis to prevent inputting noise to the models.
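A minimal sketch of this segmentation and subjectivity filter is shown below, assuming NLTK's sentence tokenizer and TextBlob's default subjectivity model; the 0.4 threshold is the one stated above.

```python
import nltk
from nltk.tokenize import sent_tokenize
from textblob import TextBlob

nltk.download("punkt", quiet=True)  # tokenizer data for sent_tokenize

def subjective_sentences(tweet: str, threshold: float = 0.4):
    """Yield only the sentences that score at or above the threshold."""
    for sentence in sent_tokenize(tweet):
        if TextBlob(sentence).sentiment.subjectivity >= threshold:
            yield sentence

print(list(subjective_sentences(
    "Beyond Meat sells plant-based burgers. I think it tastes amazing.")))
# -> ['I think it tastes amazing.']  (the objective sentence is filtered out)
```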

3.3 Linguistic Features: POS tag Embeddings

The sentences were split into word tokens and the most likely POS tag was predicted for each token using the out-of-the-box POS tagger from spaCy [18]. These tags are informative for a syntactic and semantic understanding. For instance, in the sentence "Can I eat those leftovers please?" the word "Can" could have the semantic meaning of a modal for question formation as well as of a food container.
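For reference, the out-of-the-box tagging could look like the sketch below. The exact spaCy pipeline is not named in the text, so the small English model is an assumption.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm  (assumed model)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Can I eat those leftovers please?")
print([(token.text, token.pos_) for token in doc])
# e.g. [('Can', 'AUX'), ('I', 'PRON'), ('eat', 'VERB'), ...]
```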

3.4 Target Extraction

The first step in identifying the 't's for each sentence was establishing the predefined lists of target entities and target aspects. This is congruent with the approach of the gold standard data sets [26] [22], as well as with the objective of finding insights about target aspects of interest for the domain.

3.4.1 Target-entities ('te') in the Data Set: A 'te' is an entity of interest with attributed aspects for which opinions are expressed. For this running example, the set of target entities was chosen using the relative frequencies of different named entities across the sentences as well as the topic hashtags. This was done using the hashtag frequency word clouds from the EDA shown in Appendix A.2, Figure 12. Additionally, the named entities were extracted from the sentences and another frequency word cloud was built. The frequency analysis was done considering relevance to the domain. Finally, the entities were organized in a dictionary that included 31 entity values such as 'Beyond Meat' and 'Plant Based'. This analysis is presented in Appendix A.3.

3.4.2 Target-aspects ('ta') for Plant Based Food: A target aspect 'ta' is an attribute of an entity towards which an opinion is expressed. They should be domain relevant. A comprehensive review of the characteristics of plant-based nutrition that induce a switch to a plant-based diet, presented in Appendix A.4, was taken as guidance for choosing aspects for entities of this domain [1]. The relative word frequencies from the word clouds were used to assess the appearances of the aspects in the data set, and an aspect dictionary was made. The keys were the aspect categories [health, wellness, price, animal welfare, environmental sustainability, deforestation, ocean, pandemic, similarity, taste, values, investment]. The dictionary values were the target aspects as they were explicitly mentioned; they included the keys with their synonyms and other aspects from the category.

3.4.3 Entity and Aspect Extraction: The inputs for the ASC task are 's' and 't'. In a real-life case, manual labeling is not a practical approach. In order to provide the ASC model with 's' and 't', regular expressions (Regex) were used to identify targets as they appear within the sentence and extract them. Sentences were assigned their 't's as labels, while sentences with no 't's were excluded from the analysis. This ACD method was later tested with ground truth (human-annotated) data. Note that the ACD task was chosen instead of AE. This was the feasible approach since, for the AE task, 50 different types of target aspects appeared as explicitly mentioned, 36 of which were contained in fewer than 200 sentences. Using broader target categories resulted in higher frequency distributions across fewer categories, which allowed a larger and more balanced data set; this can be referred to in Section 3.4.5 (Target Selection) and Section A.6 of the Appendix.
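The sketch below illustrates this dictionary-driven Regex extraction. The dictionary fragment and synonym lists are illustrative assumptions, not the actual 31-entity and 12-category dictionaries described above.

```python
import re

ASPECT_DICT = {  # fragment; categories follow Section 3.4.2
    "taste": ["taste", "tastes", "delicious", "flavor"],
    "health": ["health", "healthy", "nutrition"],
    "environmental sustainability": ["environment", "sustainable", "climate"],
}

PATTERNS = {
    category: re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b", re.I)
    for category, words in ASPECT_DICT.items()
}

def detect_aspects(sentence: str):
    """Return the aspect-category labels whose target words appear in the sentence."""
    return [cat for cat, pat in PATTERNS.items() if pat.search(sentence)]

print(detect_aspects("Plant based burgers are delicious and good for the environment."))
# -> ['taste', 'environmental sustainability']
```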

3.4.4 Types of Sentences: In the ASC task, the input to the model is ultimately a sentence 's' with its respective target(s) 't'. Let us consider two possibilities for what 't' entails in TABSA [30]; this depends on the composition of the sentence. The first type of 't' appears when a sentence has target-entity target-aspect pairs ('te-ta' pairs) composing each 't'. This is the case in the SentiHood data set [26]: a list of 50 locations in London was given as potential 'te's and a list of 7 aspect characteristics about the locations (such as price, ambience, etc.) was given as 'ta's for the locations. The second form of 't' is found in sentences where the 'te-ta' pairs become only target aspects 'ta', hence a single (implicit) entity is assumed [30]. This structure is found in the SemEval 2014 Task 4 data sets of product reviews for restaurants and laptops [22]. The following types of sentences were identified in the data set of tweets about the plant-based food topic:

(1) Sentences that contain only target-aspects: "I think this way we are living sustainably and reducing cruelty towards animals". 'Ta's: 'environmental sustainability', 'animal welfare'. Note that there is no mention of a target entity, thus presuming there is only one entity, which is implicit; it can be inferred by checking the topic corresponding to the sentence, assigned according to the search query. In this sentence the topic was 'PlantBased'.

(2) Sentences that contain target entity-aspect pairs: "Choosing veganism is mainly about acknowledging animals but delicious and healthy food is a side benefit!". The target entity-aspect pairs are [veganism, animal welfare], [veganism, taste], [veganism, health].

(3) Sentences that contain only target-entities: "Beyond Meat is a great option". Target entity: Beyond Meat. Target entities by themselves were added as a new type of 't'. In the SentiHood data set [26] sentences with no aspect towards an entity were considered aspect 'general', but in our case, since targets were extracted using regular expressions, the target entity itself is appropriate.

Analyzing 'te-ta' pairs falls outside the scope of this investigation. It is suggested that similar methodologies can be used when taking 't' and 'te-ta' pairs [30], although, considering this distinction, the data sets should be evaluated separately; therefore sentences which included 'te-ta' pairs were excluded from the analysis.

3.4.5 Target Selection: 17,326 sentences which contained only target-aspects were chosen for this analysis. Aspects with frequency counts below the 0.25 quantile of 530 were excluded, leaving 16,287 sentences (94%) across 6 aspects ['animal welfare', 'environmental sustainability', 'taste', 'price', 'wellness' and 'health']. Consequently, sentences with many 't's were transformed into a single observation per sentence with its respective aspects as features, leading to 14,832 individual sentences. Of these, 13,804 (93%) contained one aspect, and the maximum frequency count of any set of aspects found together across the sentences did not surpass 182. This is because tweets themselves have a maximum length of 55 words, and these had previously been split into sentences. The data was filtered to only those sentences which contain one aspect. Regarding the data set with the sentence type that contained only 'te's, 76,666 of these (88%) had ['vegan', 'veganism', 'Beyond Meat', 'plant based'] as 'te's appearing as single targets. These sentences and corresponding targets were included in the analysis. The implicit aspect about these entities was 'general'.

3.5 Domain Relevant Ground Truth Data Sets

The data sets for ACD and ASC varied in size. This was because the labels for ACD were extracted from the sentences using Regex, whereas the labels for ASC posed the constraint of manual labeling. A larger data set helps the models perform better, but an imbalanced data set would result in overfitting to the majority class. Figure 3 in Appendix A.6 shows that the class frequencies for the different targets ranged between 700 and 25,000.

To build and test the ACD models, undersampling was done for the majority classes, arriving at a semi-balanced sample totaling 9,443 sentences, presented in Figure 4, Appendix A.6. This addressed the objective of having as large a data set as possible while preserving balance among the classes. A ground truth (human-annotated) data set was also required to test these models, so a randomly selected holdout set of 370 (4%) sentences was put aside from the rest of the sample. Due to the additional costs of manual labeling and time constraints, a maximum of 1,370 sentences were labeled. For the ASC task, the data set to label was balanced with 100 observations per class, for a total of 1,000 sentences, presented in Figure 5, Appendix A.6.

3.5.1 Interrater reliability: Excel was used as an annotation tool for simplicity. Initially, three annotators were selected for the task, none of them specialists in linguistics. The labeling procedure was conducted in a similar manner to the labeling of the ABSA SentiHood data set [26]. The annotators began by reading guidelines and examples from the SemEval 2014 and SentiHood data sets; afterwards they entered a discussion to resolve any misunderstandings and reach a shared discernment. For ASC, following SemEval 2016 [21], the polarity categories were set to "positive", "negative", "neutral", or "conflict". Including the conflict class was found appropriate for this running example because of the double entendre and/or sarcasm found in several sentences. 10% of the whole data set was randomly selected and annotated by all three annotators. Cohen's Kappa coefficient (K) [34] is used to measure the pairwise agreement between each two annotators. The annotator with the highest aggregated inter-annotator agreement, with a K score of 0.768, interpreted as substantial agreement [16], was selected to annotate all 1,000 sentences. This score was 0.15 and 0.43 higher than those of the other two annotators, respectively. Detailed K scores per category and an analysis of the implications are shown in Appendix A.5 (Annotator agreement evaluation: Cohen's Kappa). The polarity distribution of the labeled sentences is plotted in Figure 6 in Appendix A.6, showing a class imbalance towards the positive class.

Annotations for the ground truth test set for ACD were evaluated with a similar approach. In this case the annotator with the highest K score had 0.811, which was 0.5 and 0.2 higher than the other two annotators, respectively.
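The pairwise agreement computation could be reproduced along the lines of the sketch below, assuming scikit-learn; the label arrays are toy stand-ins for the 10% overlap set annotated by all three annotators.

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotations = {  # toy stand-ins for the shared 10% of the data set
    "A": ["positive", "neutral", "negative", "positive", "conflict"],
    "B": ["positive", "neutral", "neutral",  "positive", "conflict"],
    "C": ["neutral",  "neutral", "negative", "positive", "negative"],
}

for a, b in combinations(annotations, 2):
    k = cohen_kappa_score(annotations[a], annotations[b])
    print(f"kappa({a}, {b}) = {k:.3f}")
```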

3.6 Models

3.6.1 Recurrent Neural Networks: LSTM and GRU: Simple RNN models pose the risk of encountering vanishing or exploding gradients in the backpropagation stage. LSTM and GRU networks are able to reduce this hindrance. LSTM manages the read, write and reset workings of its internal state with output, input and forget gates in a memory cell. The forget gate, at time t with the current input x_t, decides which parts of the information to maintain or offload, also considering the output from the hidden state h_(t-1), to then update the memory cell. GRU, introduced in 2014 by Kyunghyun Cho et al., has fewer parameters and differs from LSTM in that it has only two gates, reset and update; it has no output gate and no memory unit. Figure 8 in Appendix A.7 compares the structure and computation of the hidden units of LSTM and GRU.

3.6.2 BERT: The chosen BERT model, BERT-base, has 12 attention heads and 12 encoder layers (12*12), for a total of 144 distinct attention mechanisms. This summarized model architecture is visualized in Figure 9, Appendix A.7.

• Special pre-processing to fit BERT: Input word ids, an input mask and input type ids are the three inputs expected by BERT. The preprocessing model adjusts the strings into objects in the appropriate format for BERT. Additionally, with a packer, the input is truncated to 128 tokens and organized into tensors before entering BERT. Setting up the pre-trained model starts with WordPiece tokenization from BertTokenizer, which then adds the special tokens [SEP] and [CLS].

• Input representation: The set of tokens is processed through three different embedding layers (token, segment and position) with the same dimensions; the input embeddings are the sum of these three and are passed to the encoder layer (transformer encoder). Transformer architectures are encoder-decoder networks, and BERT is an encoder stack of transformers. BERT does not have a transformer decoder; it is only the encoder with the classification head. The classifier acts as a decoder in masked prediction, trying to identify the word underlying the mask.

• Transformer encoder: Projects words to numerical values. Contextualized word embeddings assign embeddings to words based on their context. BERT stands out because it considers the context of a word by looking at the words to its left and right simultaneously, rather than right-to-left or left-to-right. This is called multi-head attention, since several attention mechanisms (heads) operate simultaneously.

• Classification layer: Fully connected layer + GELU + Norm. Aspect categories: 10 aspect categories. Sentiment polarity classification: 'positive', 'neutral', 'negative' and 'conflict' classes are created.

• Fine-tuning: In deep neural networks the first layers learn the most general patterns and the final layers learn higher-order feature representations of the data. Fine-tuning a pre-trained model can lead to better performance by setting a few top layers of the model as trainable, to repurpose them with the data set of the relevant domain. The added classification layer and the last layers are trained to be relevant for the specific task.

3.7 Experiments

3.7.1 ACD: The experiments for this task start with the automatic extraction of aspects using Regex ('Regex ACD') to generate training labels. The first part is testing the accuracy of the automatically extracted aspects against the ground truth set. Then pre-trained BERT was used to make predictions by unfreezing the final layer, leaving 3,076 trainable parameters; it was fine-tuned with the automatically extracted Regex labels ('BERT ACD'). This allowed assessing and comparing the performance of both without the use of manual labeling. Early stopping and model checkpoints were set up to monitor the models and save them at each epoch.

3.7.2 ASC: The experiments for the task of target sentiment classification consisted of three approaches. The first, 'BERT (ASC)', used pre-trained BERT with the final layer unfrozen and fine-tuned on the 1,000-sentence ground truth data. The second was 'BERT-POS (ASC)', where, instead of using only the sentences and targets as input, POS tags were inserted in each sentence next to their corresponding word; these extended sentences were the input to the model. The third experiment was 'BERT-Semi (ASC)' and consisted of a semi-supervised learning approach in which BERT (ASC) was used to make predictions for the remaining 8,433 sentences of the relevant data set; these predictions were used together with the ground truth (minus the held-out test set) to train the model.
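For the BERT-POS (ASC) input, the extension could look like the sketch below; the exact interleaving format is not specified above, so placing each tag directly after its word is an assumption.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed spaCy pipeline

def extend_with_pos(sentence: str) -> str:
    """Insert each token's POS tag next to the word it tags."""
    return " ".join(f"{tok.text} {tok.pos_}" for tok in nlp(sentence))

print(extend_with_pos("The burger tastes great"))
# -> "The DET burger NOUN tastes VERB great ADJ"
```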

3.7.3 Train-test split: Since the data sets used for the different tasks varied in size, and some experiments required a held-out ground truth set for testing, the training-validation-test split differed between experiments. This was done to limit the test set size.

3.7.4 RNN: LSTM GRU: An NLTK tokenizer is fitted to the words of the tweets in the data set and a corpus is created. Zero padding was applied so that all inputs in the training set had equal dimensions. Each layer of the model returned the sequence output for the next layer, except the second-to-last layer, where the sequence is flattened and returned as a single numerical vector. The number of neuron units per layer ranges from 300 to 50: the top layer has 300, the next 250, and so on. The activation function used was ReLU, with the multiclass categorical cross-entropy loss.
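A hedged Keras sketch of such a stacked recurrent baseline is shown below. The specific mix of LSTM/GRU layers, vocabulary size and sequence length are assumptions; the actual configuration (8 layers, 793,215 trainable parameters) is given in Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 50, 4  # assumed sizes

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(300, return_sequences=True, activation="relu"),
    layers.GRU(250, return_sequences=True, activation="relu"),
    layers.LSTM(150, return_sequences=True, activation="relu"),
    layers.GRU(50),                       # last recurrent layer returns a vector
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",  # multiclass cross-entropy
              metrics=["accuracy"])
model.build(input_shape=(None, MAX_LEN))
model.summary()
```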

3.7.5 BERT: Transfer learning and unfreezing the final layer: Only the final layer was unfrozen, leaving 3,076 trainable parameters. In this way the model output from the previous layers is flattened into one long feature vector that is then connected to the final classification layer with the 10 target categories.
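A minimal sketch of this frozen-BERT setup is shown below, assuming the TF-Hub BERT-base checkpoint and its matching preprocessing model (the exact checkpoint used here is an assumption). A Dense head over the 768-dimensional pooled output gives 768 x 4 + 4 = 3,076 trainable parameters for the four sentiment classes, matching Table 1.

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers ops used by the preprocess model)

preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
    trainable=False)  # keep all ~109.5M BERT parameters frozen

text_in = tf.keras.Input(shape=(), dtype=tf.string)
pooled = encoder(preprocess(text_in))["pooled_output"]  # 768-dim [CLS] vector
out = tf.keras.layers.Dense(4, activation="softmax")(pooled)
model = tf.keras.Model(text_in, out)  # only the Dense head (3,076 params) trains
```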

3.7.6 Other hyperparameters: Since this is a multi-class classification problem, the loss function used was the categorical cross-entropy function. For this, the classes were vectorised for the computation of the appropriate distances between classes.


Table 1: Hyper-parameters of the AE & ASC models

Hyper-parameter        RNN: LSTM & GRU    BERT (ASC)  BERT-POS (ASC)  BERT-Semi (ASC)     BERT (AE)
Train-test split       900/100            900/100     900/100         200/9244 (80/20%)   362/9081 (80/20%)
Layers                 8                  12          12              12                  12
Epochs                 3                  41          37              20                  36
Learning rate          0.001              0.001       0.001           0.001               0.001
Batch size             20                 32          32              32                  32
Trainable parameters   793,215            3,076       3,076           3,076               3,076
Frozen parameters      0                  109.482 M   109.482 M       109.482 M           109.482 M

The optimizer that BERT was originally trained with is "Adaptive Moments" (Adam). It minimizes the prediction loss and performs regularization by weight decay.

3.8 Evaluation Metrics

Following the metrics used for ABSA [30], accuracy and weighted macro-average precision, recall and F1 score are used as the evaluation indices. Accuracy is the proportion of predictions that were correctly classified by the models. The weighted macro average is the mean of each class's metric (F1 score, recall or precision), weighted by class frequency. This averaging is appropriate due to the different number of observations for each sentiment class label. Appendix A.9 presents how these evaluation metrics were mathematically computed.
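Assuming scikit-learn, these metrics could be computed as in the sketch below; the label arrays are toy stand-ins for the holdout ground-truth set.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "neutral", "negative", "positive", "conflict", "neutral"]
y_pred = ["positive", "neutral", "neutral",  "positive", "neutral",  "neutral"]

acc = accuracy_score(y_true, y_pred)
# average="weighted" weights each class's metric by its support,
# which suits the class imbalance described above.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```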

4 RESULTS

Results on the plant-based tweets data set are presented in Tables 2 and 3. For the ACD task, Regex (ACD) had 9.41% higher accuracy than BERT (ACD) and a 6% lower F1 score. For the ASC task, BERT (ASC), without POS tags and without semi-supervised learning, had the best performance of all the models, with an accuracy of 66.77% and an F1 score of 37%.

5 DISCUSSION

This discussion analyzes the results, makes connections with other literature, and includes explanations as to why the results were as they were.

5.1 Answering the sub-questions

• To what extent is the extraction of target aspects using Regex a feasible solution to automatically assign aspect category labels to the sentences?

Table 2: Accuracy and loss metrics for training and validation of the models

Metric                RNN: LSTM & GRU   BERT (ASC)  BERT-POS (ASC)  BERT-Semi (ASC)  BERT (ACD)
Training loss         1.0373            0.8735      0.9712          0.4693           1.2072
Training accuracy     0.00e+00          0.6469      0.6301          0.8491           0.6037
Validation loss       1.1636            0.8065      0.9659          0.3844           1.0898
Validation accuracy   0.00e+00          0.7000      0.6778          0.8901           0.6516

Table 3: Evaluation results for the ASC & ACD models (accuracy & weighted macro-average precision, recall & F1 score)

Metric      BERT (ASC)  BERT-POS (ASC)  BERT-Semi (ASC)  Regex (ACD)  BERT (ACD)
Accuracy    0.6677      0.58            0.1889           0.768        0.6739
Precision   0.44        0.21            0.12             0.779        0.6739
Recall      0.37        0.26            0.34             0.61         0.67
F1 score    0.37        0.20            0.16             0.61         0.67

The extraction of target aspects using Regex proved to be a feasible way to automatically assign aspect category labels for the ACD task, to the extent that 76.8% of the aspect categories detected were accurate. The precision of 77.9% indicates the percentage of the total predicted instances for a class that actually belonged to it. The recall score is lower and indicates that, out of the total true instances for a given class, on weighted macro average, 61% were correctly classified as that class and the rest were misclassified.

• To what extent can pre-trained language models be leveraged to reduce or eliminate the dependency on labeled data? A comparison of pre-trained BERT (ACD) and Regex (ACD) high-lights which of the two methods is performing better without any labeled data to train. BERT (ACD) had an accuracy of 67.39 % and was trained with Regex (ACD) labels. The validation accuracy of BERT (ACD) could be misleading since it shows the performance of the model tested against predictions with Regex (ACD); in this case it is 65,16 % which is the proportion of the models’ predictions that would be correctly classified if Regex (ACD) was the relevant truth. During evaluation both methods were tested with a holdout ground truth human annotated data set. Accuracy for Regex (ACD) was 76,8% before training BERT, this is 9.41% higher than with BERT. An additional benefit to consider using only Regex (ACD) is that it did not require additional computational capacities. On the other hand, the F1 score for BERT (ACD) of 67% was 5% higher than the score for Regex (ACD) of 61%. These experiments reveal that Regex (ACD) and BERT (ACD) are both plausible methods as is inferred from the accuracy and F-1 scores, but BERT requires additional compitational capacities.


For the ASC task, RNN: LSTM GRU (ASC) was built from scratch as a baseline against which to compare BERT (ASC) and assess the advantages of transfer learning with BERT when training on a small labeled sample. RNN: LSTM GRU had 793,215 trainable parameters and failed to learn how to make polarity predictions on the data set, with approximately 0 training and validation accuracy. This model and the other variation experiments that used RNNs failed to learn due to vanishing gradients that prevented the learning of long data sequences. The gradient that carries information through the RNN parameters became increasingly small, and learning was hampered because parameter updates were insignificant; this is why accuracy kept going back to zero. A potential way to solve this problem would have been creating a Residual Network (ResNet), which has residual connections that reuse activations from previous layers. Pre-trained BERT (ASC), on the other hand, with only 3,076 trainable parameters, had an accuracy of 66.77% in this scenario of social media data with a small sample of 1,000 sentences. This shows that the model was successful in achieving reasonable accuracy while relying on a small subset of labeled data. On the other hand, the weighted macro-averaged F1 score, which takes false positives and false negatives for each class into account, was 37%, a relevant metric given the class imbalance across the data. This is further analyzed in the discussion of the last sub-question with the confusion matrix for BERT (ASC).

• What is a feasible way to incorporate POS-tags into the predictions?

The results from the experiments show that adding POS tags inside the sentence next to their corresponding words, although feasible in the sense that it is possible and relatively straightforward, is not a recommendable way to incorporate this semantic information. This is concluded because BERT-POS (ASC) hurt model performance compared to BERT (ASC), and it requires additional computational expense. BERT-POS (ASC) had 58% accurate predictions for the test set, which was 8.7% lower than BERT (ASC). It was also noted that in experiments with different hyperparameters (BERT (ASC) with 5 epochs, a 0.001 learning rate and batch size 16, and BERT-POS (ASC) with 60 epochs, a 1e-06 learning rate and batch size 12), both models had similar performance of 0.587. The F1 score of BERT-POS (ASC), however, was 20%, which was 17% lower than the F1 score of BERT (ASC). This can be seen in the BERT-POS confusion matrix, observing, for instance, how sentences with a true positive sentiment are all classified as neutral.

• To what extent does the incorporation of POS tags improve ABSA performance?

The results show that incorporating POS tags into the predictions by adding the tags directly inside the sentence next to each word does not improve the performance of the model when predicting sentiments towards target aspects. Hence, the POS tags by themselves did not improve the performance of the model. This finding, however, only addresses incorporating POS tags with the current method; this research does not imply that POS tags are categorically not beneficial for predicting sentiment polarity for targeted aspects. Other approaches considered are included in Section 6: Future work.

• To what extent are semi-supervised learning techniques a feasible solution to address the problem of costly fine-grained labeled data?

The experiments show that, for this data set, the semi-supervised learning technique to fine-tune BERT did not improve the model's capacity to make predictions, with an accuracy of 18.89%. An analysis of the confusion matrices (CM) of these two models revealed a pattern. BERT (ASC) was built using the 1,000 ground truth sentences, where the majority class was positive with around 600 sentences (Figure 6, Appendix A.6). The CM of this model (Figure 12, Appendix A.10) revealed the following: 82% of the true positives were incorrectly classified as neutral; 49.48% and 60.82% of the true conflicts and negatives, respectively, were incorrectly classified as neutral; and 89% of the true neutrals were correctly classified.

BERT-Semi (ASC), on the other hand, was built with the data set of the 1,000 human-annotated ground truth sentences and 8,443 sentences with sentiment labels predicted using BERT (ASC). In this data set the majority class was neutral, with around 6,000 sentences (Figure 7, Appendix A.6). This is in line with expectation, since the BERT (ASC) CM (Figure 12, Appendix A.10) revealed that the different categories were being incorrectly classified as neutral; this was especially strong for the majority class of the ground truth data, with 82% of true positives being predicted as neutral. The BERT-Semi (ASC) CM (Figure 13, Appendix A.10) revealed the following: 81% of true neutrals were incorrectly classified as positive; 35% and 49.37% of the true conflicts and negatives, respectively, were incorrectly classified as positive; and 68.52% of true positives were correctly classified. This evaluation shows that BERT-Semi (ASC), which was built with a data set whose majority class was neutral, has a bias towards making wrong predictions, especially for the majority class present in the data set used to build the model, with 81% of true neutrals being classified as positive. Note that in this experiment a holdout set of 200 sentences from the human-annotated ground truth data set was put aside for model evaluation. A train-test split taken from the pseudo-labeled data would result in evaluating the model against predictions made by the previous model, which would not accurately depict model performance. This is why the validation accuracy is 69.12% higher than the test accuracy.

5.2 Answering the main research question

• How can Aspect-based Sentiment Analysis of tweets be done for the plant-based food domain using pre-trained language models?

The proposed method for ABSA addresses the analysis of tweets for the plant-based food domain with a targeted approach using predefined lists of target aspects. The process starts with web scraping; Snscrape is the recommended library, as it overcomes historical data retrieval restrictions while remaining within the Twitter terms and conditions. Subsequently, data cleaning, segmenting, filtering and feature extraction methods were applied to arrive at the appropriate 'sentence, target' input required for the ABSA task. For the ACD task, without using labeled data, Regex (ACD) and BERT (ACD) are both viable methods; Regex (ACD) had the higher accuracy of 76.8% and BERT (ACD) the higher F1 score of 67%. For the ASC task, BERT (ASC), fine-tuned with a small ground truth data set of 1,000 sentences, is the best performing model for this data set, with an accuracy of 66.77% and an F1 score of 37%.

Appendix A.11 presents the validation and training plots for BERT (ASC), BERT-POS (ASC), BERT-Semi (ASC) and BERT (ACD), together with an analysis of the observations and the underlying phenomena.

5.3 Research Validity

5.3.1 Internal Validity. The internal validity of the research refers to the extent to which the outcomes of the experiments were caused by the different treatments being investigated. The treatments here are the use of Regex, the use of pre-trained BERT, the incorporation of POS tags and the semi-supervised learning approach. For high internal validity, elements of the experimental environment should be controlled in order to determine what is causing changes in performance. With respect to the data set, in order to keep the analysis as close to a real-life scenario as possible, real social media data was used. The scraping parameters, followed by additional filtering, allowed control over the list of keywords in the search query, the dates used, the language, and the subjectivity of the tweets in the final data set. On the other hand, it was not possible to determine the internal scraping process; for instance, how the tool passes from tweet to tweet and whether it follows a particular conversation line, which target groups it addresses, what locations it prioritizes, and several other variables that might introduce bias into the sample. Additionally, social media data has high variability, especially when compared to the ABSA gold standard sets [21], where each sentence input to the model was previously analyzed by a human annotator and ensured to contain target aspects with expressed sentiments. This variability might tarnish the models' learning, making it less clear how the treatments in each experiment caused the changes in model performance.

With respect to the models, the experiments were conducted so that all the variables across each experiment run were controlled, in order to then assess the differences in model performance with and without each treatment. In some cases the data set was larger, but the cleaning and filtering were the same. Hyperparameter tests and the computational environment remained the same across experiment instances. Considering the experiment for ACD, the categories detected were dependent on the targets as explicitly mentioned, which had been predefined in the ACD aspect-target dictionary using the methods outlined in Section 3.4. Therefore the experiment's results depended on the content of this target dictionary. It is possible that a different dictionary would have led to different learning by the model and consequently to variations in performance for Regex (ACD), and for BERT (ACD) as well, since it was learning from these labels. In the case of incorporating POS tags into the predictions, only one methodology for including the tags was compared, using one pre-trained model. Clearer insights into how POS tags can be leveraged would require addressing several ways of incorporating POS tags with different pre-trained language models. These comparisons would enable a better understanding of the causes of the behaviour observed in the current experiments.

Perhaps the semi-supervised learning approach would have performed better with a more homogeneous data set such as the gold standard data sets (SemEval, SentiHood), or even with data from review platforms or newspaper articles. The initial BERT model would then potentially have performed better: above a certain accuracy threshold of the initial model, semi-supervised learning would improve accuracy, but below that threshold the model actually learns the wrong predictions. It is possible that the lower performance of this experiment is particular to this data set and not to the method per se. A way to address this would have been applying the same experiment to different types of data sets and comparing the outcomes.

5.3.2 External Validity. External validity refers to the extent to which it is possible to extrapolate the findings, and the extent to which the outcomes of the experiments are interesting to people external to the examined case. Considering first the plant-based food domain: the methods applied throughout the study are applicable to similar research within the domain because the target aspects chosen were based on two factors: prevalent research on the relevant factors that influence these types of food choices, as well as the frequencies of appearance of the aspects across the whole data set of over 300,000 tweets.

Considering the generalized inferences made about sentiment towards different aspects of plant-based food and their use for analytics: since it was not possible to control several variables in the data set, it is advised to consider target groups when using the results for analytics, thus taking into account the different features that came with the tweets, for example date and time, when drawing conclusions. Related to this is the fact that the sentences were filtered to only those containing the target aspects, and the final models can only be applied to sentences that contain these target aspects; this reduced the data set from 194,000 to 75,000 sentences. The remaining sentences could contain information for those within the domain who would like to investigate different aspects; they can do so by applying the same methods with a different target-aspect dictionary.

A central goal of this investigation was to develop methods that could be used in real-life cases to conduct ABSA on social media, therefore reproducibility was a key objective. This was addressed by organizing the methods into scalable functions, for all steps, starting with web scraping, through tweet filtering, cleaning, segmenting and formatting tweets to suit the ABSA task and the models. A Google Colab document is provided explaining each part of the code, written in Python, and including theory relevant to each part. These tools are accessible in the GitHub repository for this project. Researchers then only need to change the different parameters to suit their objective.

There are limitations to this. It is possible to generalize this approach to a data set of tweets from a different domain, for instance tweets about bicycle brands; the key difference would be to change the target dictionary. On the other hand, it might be problematic when working with a different social media platform. The main challenge is that tweets have a limited length of 280 characters, so the resulting segmented sentences were short, and over 92% of them contained only one aspect. This would not be the case with, for instance, blog posts or question-answering platforms (SentiHood). It is therefore concluded that these methods and findings are generalizable to Twitter data.


6 CONCLUSION & FUTURE WORK

This research studied how aspect-based sentiment insights can be derived from social media posts, specifically tweets about the plant based food topic. The objectives were to lower the dependency on labeled data and to improve the model's predictive performance, with the additional objective of transforming the text found on social media to reflect a structure similar to the gold standard data sets used for the ABSA task.

The results show that for the ACD task, extracting the target aspects using Regex is a solution to eliminate the dependency on labeled data, with an accuracy of 76.8% and an F1 score of 61%. These labels can be used directly as input for the ASC task without requiring additional computational capacity, or they can be used to train BERT (ACD) with access to a GPU or TPU, which led to an accuracy of 67.4% and an increase in the F1 score to 67%. It can be concluded that both are appropriate methods for the ACD task.
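As an illustration, the following is a minimal sketch of such Regex-based aspect detection. The dictionary excerpt and patterns are invented examples; the study's actual target aspect dictionary is larger and domain-specific.

```python
import re

# Hypothetical excerpt of a target aspect dictionary; the full dictionary
# in this study was built from domain research and corpus frequencies.
target_aspects = {
    "taste":  r"\b(tast(e|y|es)|flavou?rs?|delicious)\b",
    "price":  r"\b(prices?|expensive|cheap|afford\w*)\b",
    "health": r"\b(health(y|ier)?|nutriti(on|ous)|protein)\b",
}

def detect_aspects(sentence: str) -> list[str]:
    """Return the target aspects whose pattern matches the sentence."""
    found = [aspect for aspect, pattern in target_aspects.items()
             if re.search(pattern, sentence, flags=re.IGNORECASE)]
    return found or ["none"]

print(detect_aspects("The beyond burger is tasty but way too expensive"))
# ['taste', 'price']
```

Sentences labeled this way can feed the ASC step directly, which is what removes the need for manually annotated aspect labels.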

For ASC, the results indicate that pre-trained BERT is a solution to lower the dependency on aspect level labeled social media data, achieving an accuracy of 66% and an F1 score of 37%. Practitioners should consider that incorporating POS tags directly next to each word in the sentence does not lead to improvements in the model's ability to predict aspect level sentiments. Additionally, the experiments revealed that the semi-supervised learning approach BERT-Semi (ASC) strongly decreases the model's potential, because the model learns from misclassified predictions.
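For context, the following is a minimal sketch of how a sentence-aspect pair can be encoded for pre-trained BERT, including the POS-augmented variant evaluated here. The auxiliary-sentence wording, checkpoint and tag placement are illustrative assumptions in the spirit of the auxiliary-sentence method [30], not the exact thesis configuration.

```python
import spacy
from transformers import BertTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def with_pos_tags(sentence: str) -> str:
    """Place each token's POS tag directly next to the token,
    e.g. 'vegan ADJ cheese NOUN is AUX great ADJ'."""
    return " ".join(f"{tok.text} {tok.pos_}" for tok in nlp(sentence))

sentence = "The vegan cheese is great but overpriced"
aspect = "taste"

# Sentence-pair encoding: BERT sees '[CLS] sentence [SEP] auxiliary sentence [SEP]'.
encoding = tokenizer(
    with_pos_tags(sentence),                # or plain `sentence` for the base variant
    f"what do you think of the {aspect}?",  # hypothetical auxiliary sentence
    truncation=True,
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
# `encoding` can then be fed to a BertForSequenceClassification head with
# num_labels=3 (positive / negative / neutral).
```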

To better understand the implications of these results, future studies could address:

- Different ways of incorporating POS tags into the predictions; for instance, by adding syntactic rules to the inferences made by splitting the sentences at conjunctions, thus generating sub-sentences containing the aspects and inputting these sub-sentences to the model, potentially helping the model learn the relationships between the targets and the opinions expressed (a sketch of such splitting follows this list).
- Applying semi-supervised learning techniques to more homogeneous data sets, such as the SemEval or SentiHood data sets, to compare performance and the confusion matrices. The key would be to compare the performance of different BERT-Semi (ASC) models that were fine-tuned using data classified by BERT (ASC) models of varying accuracies and F1 scores, in order to identify the threshold at which the semi-supervised model learns from sufficiently accurate predictions and is in fact able to leverage unlabeled data, rather than learning from misclassifications.

- It would be interesting to see how to generalize the models to different languages. A relatively straightforward way to do this would be to use the Google Translate Ajax API for the tweets that are in other languages, then apply the same methodology and evaluation to the translated sentences (a translation sketch also follows this list).

- Other pre-trained models could be tested on the same experiments for an enriched understanding. For instance, ALBERT and GPT-3 have shown groundbreaking performance on several NLP tasks and, to the best of our knowledge, have not yet been used for ABSA.

- Exploring other ways to leverage subjectivity scores for good quality input; for instance, running experiments with lower and/or higher subjectivity thresholds and comparing overall model performance to find the optimal point.

- Conducting the research tagging target-entity and target-aspect ('TE-TA') pairs as targets, instead of only aspects.
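For the first direction, conjunction-based splitting could look like the following minimal spaCy sketch. It is a simplification: it splits only at coordinating conjunctions and ignores clause structure, which a full implementation would need to handle.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def split_on_conjunctions(sentence: str) -> list[str]:
    """Split a sentence at coordinating conjunctions (e.g. 'but', 'and')
    so each sub-sentence can be paired with the aspect it contains."""
    doc = nlp(sentence)
    parts, current = [], []
    for tok in doc:
        if tok.pos_ == "CCONJ":
            if current:
                parts.append(" ".join(current))
            current = []
        else:
            current.append(tok.text)
    if current:
        parts.append(" ".join(current))
    return parts

print(split_on_conjunctions("The beyond burger tastes great but the price is outrageous"))
# ['The beyond burger tastes great', 'the price is outrageous']
```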
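For the multilingual direction, the googletrans package wraps the Google Translate Ajax API mentioned above. A minimal sketch follows; note that the package relies on an unofficial endpoint, so its behaviour can change between versions.

```python
from googletrans import Translator  # pip install googletrans

translator = Translator()

def to_english(tweet_text: str) -> str:
    """Detect the source language and translate the tweet to English."""
    return translator.translate(tweet_text, dest="en").text

print(to_english("Los productos veganos son cada vez más populares"))
# 'Vegan products are becoming more and more popular'
```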

REFERENCES

All sources were accessed between September 2020 and January 2021; this also applies to the footnotes.

[1] András Fehér, Michał Gazdecki, Miklós Véha, Márk Szakály, and Zoltán Szakály. 2020. A Comprehensive Review of the Benefits of and the Barriers to the Switch to a Plant-Based Diet. Sustainability (2020).

[2] Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063 (2019).

[3] Jamshid Bagherzadeh and Hasan Asil. 2019. A review of various semi-supervised learning models with a deep learning and memory approach. Iran Journal of Computer Science 2, 2 (2019), 65–80.

[4] Samuel R. Bowman, Ellie Pavlick, Edouard Grave, Benjamin Van Durme, Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, and Berlin Chen. 2019. Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling. https://openreview.net/forum?id=Bkl87h09FX

[5] Paweł Budzianowski and Ivan Vulić. 2019. Hello, It's GPT-2 – How Can I Help You? Towards the Use of Pretrained Language Models for Task-Oriented Dialogue Systems. arXiv:1907.05774 [cs.CL]

[6] Bingli Clark Chai, Johannes Reidar van der Voort, Kristina Grofelnik, Helga Gudny Eliasdottir, Ines Klöss, and Federico J. A. Perez-Cueto. 2019. Which diet has the least environmental impact on our planet? A systematic review of vegan, vegetarian and omnivorous diets. Sustainability 11, 15 (2019), 4110.

[7] Anirban Choudhury. 2019. A Deep Dive Analysis of Customer Sentiments in the Food Service Industry | Quantzig's New Success Story. https://apnews.com/press-release/pr-businesswire/58df0387e8dc46479849a6cb3078eb29. Accessed: 2020-11-15.

[8] Jessica Clement. 2019. Twitter: monthly active users worldwide | Statista. https://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/. Accessed: 2020-11-13.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[10] Hai Ha Do, P. W. C. Prasad, Angelika Maag, and Abeer Alsadoon. 2019. Deep learning for aspect-based sentiment analysis: a comparative review. Expert Systems with Applications 118 (2019), 272–299.

[11] Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. 2019. Aspect-based sentiment analysis using BERT. In NEAL Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), September 30–October 2, Turku, Finland. Linköping University Electronic Press, 187–196.

[12] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. 168–177.

[13] Isabel Miguel, Arnaldo Coelho, and Cristela Maia Bairrada. 2021. Modelling Attitude towards Consumption of Vegan Products. Sustainability (2021).

[14] Zhao Jianqiang and Gui Xiaolin. 2017. Comparison research on text pre-processing methods on Twitter sentiment analysis. IEEE Access 5 (2017), 2870–2879.

[15] A.M. Kuchling. [n.d.]. Regular Expression HOWTO. Technical Report. Python.

[16] Mary L. McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica (2012).

[17] Ning Li, Chi-Yin Chow, and Jia-Dong Zhang. 2020. SEML: A Semi-Supervised Multi-Task Learning Framework for Aspect-Based Sentiment Analysis. IEEE Access 8 (2020), 189287–189297.

[18] M. Honnibal and I. Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Zenodo (2017).

[19] Esteban Ortiz-Ospina. 2019. The rise of social media. https://ourworldindata.org/rise-of-social-media. Accessed: 2020-11-12.

[20] Tal Perry. 2020. Context is King! Why Deep Learning matters for NLP. https://www.lighttag.io/blog/context-is-king/. Accessed: 2020-11-09.

[21] Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad Al-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. SemEval-2016 task 5: Aspect based sentiment analysis. In 10th International Workshop on Semantic Evaluation (SemEval 2016).


[22] Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect Based Sentiment Analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). Association for Computational Linguistics, Dublin, Ireland, 27–35. https://doi.org/10.3115/v1/S14-2004

[23] Paramita Ray and Amlan Chakrabarti. 2020. A mixed approach of deep learning method and rule-based method to improve aspect level sentiment analysis. Applied Computing and Informatics (2020).

[24] E. Battaglia Richi, Beatrice Baumer, Beatrice Conrad, Roger Darioli, Alexandra Schmid, and Ulrich Keller. 2015. Health risks associated with meat consumption: a review of epidemiological studies. Int. J. Vitam. Nutr. Res 85, 1-2 (2015), 70–78.

[25] Sebastian Ruder, Matthew E Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials. 15–18.

[26] Marzieh Saeidi, Guillaume Bouchard, Maria Liakata, and Sebastian Riedel. 2016. SentiHood: Targeted Aspect Based Sentiment Analysis Dataset for Urban Neighbourhoods. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, 1546–1556. https://www.aclweb.org/anthology/C16-1146

[27] Masha Shahbandeh. 2020. Beyond Meat Inc. https://www.statista.com/topics/6016/beyond-meat-inc/. Accessed: 2020-11-20.

[28] Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255 (2019).

[29] Sam Shleifer. 2019. Low Resource Text Classification with ULMFit and Backtranslation. arXiv:1903.09244 [cs.CL]

[30] Chi Sun, Luyao Huang, and Xipeng Qiu. 2019. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. arXiv preprint arXiv:1903.09588 (2019).

[31] Bo Wang and Min Liu. 2015. Deep learning for aspect-based sentiment analysis. Stanford University report (2015).

[32] Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. Bert post-training for review reading comprehension and aspect-based sentiment analysis. arXiv preprint arXiv:1904.02232 (2019).

[33] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2020. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv:1906.08237 [cs.CL]

[34] Paul R. Yarnold. 2016. ODA vs. π and κ: Paradoxes of Kappa. Optimal Data Analysis (2016).

[35] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine 13, 3 (2018), 55–75.

[36] Danny Zhu. 2020. Sentiment Analysis for Impossible Burger. https://rpubs.com/DannyZhu/Impossible_Burger. Accessed: 2020-11-15.


A APPENDICES

This section holds, for reference throughout the main body of this report, the data acquisition methods, the exploratory data analysis frequency word clouds, the target entity and target aspect dictionaries, the pairwise intra-labeler agreement analysis, the LSTM and GRU model comparison, and the validation plots.

A.1 Data Acquisition

A comparison of different web scraping approaches was done to prevent data retrieval from becoming a bottleneck. The limitations of each approach with respect to the search query parameters were considered, as well as the time each method took to retrieve tweets. The libraries compared were Tweepy, Snscrape and Twint.

A.1.1 Tweepy.

• Requires a Developer account to obtain Twitter API credentials. The application took three weeks from submission to approval.

• It allows searching a list of keywords simultaneously, but it is not possible to retrieve historical data or to set a time frame.
• It took 30 minutes to retrieve 15 tweets. Twitter has recently tightened its historical data restrictions, which made retrieving tweets with this library a slow process.

• It falls within the Twitter Terms and Conditions Policies.

A.1.2 Snscrape.

• Requires a Developer account for Twitter API credentials

• It allows searching a list of keywords and allows choosing the time interval.

• It can scrape around 100k tweets per day. It overcomes the historical data restrictions because, instead of retrieving the tweets themselves, this method retrieves the URL links to each tweet of interest (determined by a keyword or list). From the tweet URLs the tweet IDs can be extracted, and the tweets are then searched by their IDs.

• It falls within the Twitter Terms and Conditions Policies.

A.1.3 Twint.

• It does not require API credentials to retrieve the data.

• It allows searching one keyword at a time from the list of keywords, and it allows historical data retrieval in the sense that it starts with the most recent tweets and works backwards to older ones.

• It can retrieve over 300k tweets per day since it has no rate limitations.

• Although it is not illegal, it does not fall within the Twitter Terms and Conditions Policies.

Taking the above aspects into account, Snscrape is the recommended option because it overcomes many of the restrictions that delay data retrieval while still remaining within the Twitter Terms & Restrictions.

After the most appropriate methodology is selected, the next step is to arrive at an optimal choice of research parameters: the list of topics/keywords to search by, and a time frame. Considering the practical orientation of the research, it was convenient to retrieve tweets posted near the date of retrieval, so scraping was set to start from the date of retrieval with the most recent tweets and work backwards. The list of keywords depended on the topics relevant to the plant based food domain. Two variations were considered. The first related to brands and products, to gain insights about public perception towards them; publicly traded plant based meat companies such as Beyond Meat and Impossible Meat were included. The other variation takes general topics of the domain, for instance terms such as 'Vegan' and 'Plant Based'. This choice also impacted later stages of the analysis, where target entities and target aspects were defined.
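As an illustration of the recommended approach, the following is a minimal Snscrape sketch following the URL-to-ID workflow described above. The query string, date range and cap are example values, not the exact parameters of the study, and module paths follow snscrape's Python interface at the time of writing.

```python
import snscrape.modules.twitter as sntwitter

# Hypothetical query; keyword and dates mirror the study's setup but are examples.
query = '"beyond meat" since:2020-08-07 until:2020-12-14'

tweet_ids = []
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 1000:  # cap for illustration; the full run retrieved far more
        break
    # Each scraped item carries the tweet URL; the numeric ID is its last
    # path segment (scraped items also expose `tweet.id` directly).
    tweet_ids.append(int(tweet.url.rstrip("/").split("/")[-1]))

# The collected IDs can then be hydrated through the Twitter API, e.g. with Tweepy.
```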

Initially the keywords used were: 'Beyond Meat', 'Beyond Burger', 'Impossible Meat', 'Impossible Burger', 'Vegan', 'Veganism', and 'Plant Based'. It was possible to retrieve only 26 tweets mentioning 'Impossible Meat' and 'Impossible Burger'. This was unlikely to be due to data availability; it was more probably an internal problem of the scraping process. These two keywords were therefore excluded from the analysis.

The final data set contained 348,114 tweets, 295,678 of which date from 7 August 2020 to 14 December 2020.


A.2 Exploratory Data Analysis

Figure 1: Keyword #PlantBased, Word Cloud

Figure 2: Keyword #BeyondMeat, Word Cloud
