
VU Research Portal

The automatic acquisition of a Dutch lexicon for opinion mining

Maks, E.

2018

Document version

Publisher's PDF, also known as Version of Record

Link to publication in VU Research Portal

Citation for published version (APA)

Maks, E. (2018). The automatic acquisition of a Dutch lexicon for opinion mining.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners, and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.

• You may not further distribute the material or use it for any profit-making activity or commercial gain.

• You may freely distribute the URL identifying the publication in the public portal.

Take down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.



Contents

1 Introduction 11

1.1 List of publications . . . 16

1.2 Data and software . . . 17

1.3 Main contributions of this thesis . . . 17

2 Models and Resources for opinion mining 19

2.1 Introduction . . . 19

2.2 Resources in opinion mining . . . 20

2.2.1 Appraisal Framework . . . 20

2.2.2 Computational lexicons for general language . . . 22

2.2.2.1 Princeton WordNet (PWN) . . . 22

2.2.2.2 Cornetto . . . 25

2.2.2.3 FrameNet . . . 26

2.2.3 Polarity lexicons . . . 28

2.2.4 Emotion lexicons . . . 30

2.2.4.1 WordNetAffect . . . 30

2.2.5 Lexicons and multiple affect . . . 31

2.2.5.1 WordNetAffectPlus . . . 31

2.2.5.2 +/- Effect WordNet . . . 32

2.2.6 Lexicons with semantic classification . . . 33

2.2.6.1 SentiFul . . . 33

2.2.7 Corpora with annotated opinions . . . 34

2.2.7.1 MPQA opinion corpus . . . 34

2.2.7.2 Asher: opinions in discourse . . . 36

2.2.8 Summary . . . 37

2.2.8.1 Polarity . . . 38

2.2.8.2 Multiple actor attitude . . . 38

2.2.8.3 Semantic classification . . . 40

2.3 Towards a new lexicon model . . . 42

2.3.1 Polarity . . . 42

2.3.2 Multiple actor attitude . . . 43


2.3.4 Illustrating the model . . . 52

2.4 Conclusions . . . 53

3 Annotation Model 55

3.1 Introduction . . . 55

3.2 Design of the annotation scheme . . . 55

3.2.1 Annotations at word sense level . . . 56

3.2.2 Annotation of attitudes and polarity . . . 56

3.2.2.1 Actor’s attitude (AC) . . . 57

3.2.2.2 Speaker/writer’s attitude (SW) . . . 60

3.2.2.3 SW and AC attitude combined (SW&AC) . . . 61

3.2.2.4 External attitude (extAtt) . . . 62

3.2.2.5 No specific attitude (noAtt) . . . 63

3.2.3 Schematic overview of the annotation scheme . . . 63

3.2.4 Examples of annotations . . . 63

3.2.4.1 Examples with verbs . . . 65

3.2.4.2 Examples with nouns . . . 67

3.2.4.3 Examples with adjectives . . . 68

3.3 Inter-annotator agreement study . . . 70

3.3.1 Composition of a representative sample . . . 70

3.3.2 Annotation task . . . 72

3.3.3 Annotations in numbers . . . 72

3.3.4 Inter-annotator agreement . . . 75

3.3.5 Analysis of disagreements . . . 79

3.3.5.1 Multiple actor attitudes . . . 79

3.3.5.2 External attitude (extATT) . . . 81

3.3.5.3 Positive vs. ’no’ polarity (noPol) . . . 82

3.3.6 Comparison with other studies . . . 83

3.4 Creating the final gold standard . . . 85

3.4.1 Agreement on the simplified annotation schema . . . 86

3.4.2 Agreement on attitude annotations: AC, SW, noAtt . . . 86

3.4.3 Agreement on polarity annotations: positive, negative, noPol . . . 87

3.4.4 Agreement per lexicon dimension . . . 87

3.4.4.1 Lexicon dimensions and polarity . . . 88

3.4.4.2 Lexicon dimensions and attitude . . . 88

3.4.5 Agreement per semantic type . . . 91

3.4.6 Derived gold standard versions . . . 91

3.4.6.1 Gold standards for attitude . . . 93

3.4.6.2 Gold standards for polarity . . . 94

3.5 Conclusions . . . 95

4 Acquisition Methods 99

4.1 Introduction . . . 99

4.2 Background . . . 100

4.2.1 Lexicon-based methods . . . 100

4.2.2 Corpus-based methods . . . 101

4.2.3 Fine-grained classifications . . . 102

4.2.4 Multi- and Cross-lingual Approaches . . . 103

4.3 Evaluation framework . . . 104

4.4 Cross-lingual transfer Method . . . 105

4.4.1 Introduction . . . 105

4.4.2 Datasets . . . 105

4.4.2.1 Sentiwordnet(SWN) . . . 105

4.4.2.2 Dutch WordNet (Cornetto) . . . 106

4.4.2.3 Gold standard . . . 106

4.4.3 Method . . . 107

4.4.4 Results . . . 107

4.4.4.1 Overall results . . . 107

4.4.4.2 Smaller selections . . . 107

4.4.5 Discussion . . . 108

4.5 Wordnet Propagation . . . 109

4.5.1 Introduction . . . 109

4.5.2 Methods . . . 109

4.5.3 Datasets . . . 111

4.5.3.1 Dutch WordNet (Cornetto) . . . 111

4.5.3.2 Seed lists . . . 111

4.5.3.3 Gold standard . . . 112

4.5.4 Results . . . 112

4.5.4.1 Baselines . . . 113

4.5.4.2 Propagation results with different seed lists . . . 113

4.5.4.3 Various wordnet relations . . . 114

4.5.4.4 Various iterations . . . 115

4.5.4.5 Synset-to-word . . . 118

4.5.5 Comparison with other work . . . 119

4.5.6 Discussion . . . 119

4.6 Lexical Feature Approach . . . 121

4.6.1 Introduction . . . 121

4.6.2 Method . . . 121

4.6.3 Data sets and features . . . 122

4.6.3.1 Lexical unit features . . . 122

4.6.3.2 Synset features . . . 125

4.6.3.3 WordNet domain features . . . 126

4.6.3.4 Ontology features . . . 127


4.6.4 Results . . . 128

4.6.4.1 Results on separate features . . . 129

4.6.4.2 Combinations of features . . . 130

4.6.4.3 Results per part-of-speech . . . 131

4.6.4.4 Results on word level . . . 131

4.6.5 Comparison with other work . . . 132

4.6.6 Discussion . . . 134

4.7 Corpus comparison method . . . 135

4.7.1 Introduction . . . 135

4.7.2 Background . . . 135

4.7.3 Datasets . . . 135

4.7.3.1 Corpus composition . . . 135

4.7.3.2 Gold standard . . . 136

4.7.4 Method . . . 136

4.7.4.1 Assumptions . . . 136

4.7.4.2 Lexicon building . . . 137

4.7.5 Results . . . 138

4.7.5.1 Step1: identifying subjective words without distinguishing SW and AC attitude . . . 138

4.7.5.2 Step2: classifying words into SW and AC categories . . . . 139

4.7.6 Discussion . . . 141

4.8 Lexical Pattern method . . . 143

4.8.1 Introduction . . . 143

4.8.2 Background . . . 143

4.8.3 Datasets . . . 144

4.8.3.1 Seed lists . . . 144

4.8.3.2 Dutch N-gram corpus . . . 144

4.8.3.3 Gold standard . . . 144

4.8.4 Method . . . 144

4.8.4.1 Settings . . . 145

4.8.4.2 Finding the best association measure . . . 146

4.8.4.3 Finding the best cut-off point . . . 148

4.8.5 Results . . . 148

4.8.5.1 Results with automatically generated patterns . . . 149

4.8.5.2 Results of high ranking selections . . . 150

4.8.5.3 Results with linguistically motivated patterns . . . 152

4.8.6 Comparison with other work . . . 154

4.8.7 Discussion . . . 155

4.9 Comparing and combining methods . . . 157

4.9.1 Methods for the identification of positive and negative polarity . . . . 157

4.9.2 Methods for the identification of AC and SW attitude . . . 158

4.10 Discussion and conclusions . . . 162

5 Use Cases 167

5.1 Introduction . . . 167

5.2 Polarity in Hotel reviews . . . 168

5.2.1 Introduction . . . 168

5.2.2 Background . . . 168

5.2.3 Hotel review corpus . . . 168

5.2.3.1 Composition of the corpus . . . 168

5.2.3.2 Reviewer ratings and reader ratings . . . 169

5.2.4 Methods . . . 170

5.2.4.1 The dictionary lookup approach . . . 170

5.2.4.2 Machine-learning . . . 171

5.2.5 Results . . . 172

5.2.6 Discussion . . . 174

5.2.7 Conclusions . . . 174

5.3 Finding holders and targets in political news . . . 175

5.3.1 Introduction . . . 175

5.3.2 Background . . . 175

5.3.3 The OPeNER Corpus . . . 175

5.3.3.1 Annotations of opinion entities and relations . . . 176

5.3.4 The use of the lexicons in the opinion mining task . . . 177

5.3.4.1 Lexicon with SW and AC attitude (SWAC-lexicon) . . . 177

5.3.4.2 Polarity lexicon . . . 179

5.3.5 The OPeNER opinion mining system . . . 180

5.3.5.1 Opinion entity extraction . . . 180

5.3.5.2 Opinion relation extraction . . . 180

5.3.6 Results . . . 182

5.3.6.1 Results on identification of entities (Step I) . . . 182

5.3.6.2 Results on identification of relations (Step II) . . . 183

5.3.7 Discussion and conclusions . . . 183

5.4 SW and AC words in lexicon and corpus . . . 187

5.4.1 Introduction . . . 187

5.4.2 Lexical capacity: SW and AC words in the lexicon . . . 187

5.4.3 Lexical usage: SW and AC words in a general language . . . 189

5.4.4 Distribution of SW and AC words in an opinionated corpus . . . 191

5.4.4.1 Corpus composition . . . 191

5.4.4.2 SW/AC distribution in the WNBC corpus . . . 192

5.4.5 Discussion . . . 194

5.5 Discussion and conclusions . . . 196

6 Conclusions 199

7 Bibliography 203


Dankwoord (Acknowledgements) 219

1 Introduction

People have and express views on a nearly infinite variety of subjects. Is Rome the most beautiful city in the world? How do people feel about the Dutch king? What are the best universities in the world? Would Brexit be bad for London's financial centre? For all sorts of reasons, people take a great interest in knowing what views other people hold on subjects like these. The wealth of digitized texts in which people express their opinions and attitudes gives us abundant opportunities to obtain answers to these questions.

The language and style that convey this kind of information are often diverse and complex. Opinions and evaluations come in many forms, such as judgements, allegations, desires, intentions, beliefs and speculations (Wiebe et al. (2005)). Moreover, we can find opinions in many different texts and text genres, such as news, editorials, blogs, forums, reviews and online debates.

Various tools and techniques have been developed for the automatic extraction and interpretation of opinionated information from text. To accomplish this, a method is needed to distinguish between opinionated and non-opinionated pieces of text. Also, a method is needed for classifying expressions into sentiment categories such as positive, negative, and neutral. Consider, for example, the following text in which a reviewer describes his visit to a museum in Rome¹.

(1) A must for both locals and tourists.

This is it! This museum demands time and respect. An extraordinary example of successful conversion of a decommissioned power plant into an amazingly spacious and airy exhibition space, showcasing beautiful ancient sculptures from Rome's imperial times as well as beautiful refined mosaics. Highly recommended, take your time.

The writer clearly wants to show his enthusiasm for the museum described. The review offers a number of expressions that contribute to this purpose, of which successful, amazingly, beautiful, and highly recommended are the most obvious ones. Expressions such as this is it! and demands time and respect surely add to the positive opinion conveyed by this review, but require more context for interpretation. If an automatic analysis relies on the principle of compositionality, that is, considers the review's meaning as the sum of its words, it will certainly be able to classify this review as positive.

¹ https://www.tripadvisor.nl/Attraction_Review-g187791-d19099c0-Reviews-Centrale_
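The compositional view sketched above can be illustrated with a minimal bag-of-words classifier: each word carries a prior polarity taken from a lexicon, and the review's polarity is simply the sum of those priors. The tiny lexicon, the word list, and the zero threshold below are purely illustrative assumptions for this sketch, not the lexicon or method developed in this thesis.

```python
# Minimal illustration of compositionality in polarity classification:
# the text's polarity is taken to be the sum of the prior polarities
# of its individual words. Lexicon entries here are hypothetical.
POLARITY = {
    "successful": 1.0,
    "amazingly": 1.0,
    "beautiful": 1.0,
    "extraordinary": 1.0,
    "recommended": 1.0,
    "decommissioned": -0.5,  # a plausible prior; context overrides it here
}

def classify(text: str) -> str:
    """Sum prior polarities of all lexicon words occurring in the text."""
    words = text.lower().replace(",", " ").replace(".", " ").split()
    score = sum(POLARITY.get(w, 0.0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

review = ("An extraordinary example of successful conversion of a "
          "decommissioned power plant into an amazingly spacious and airy "
          "exhibition space. Highly recommended.")
print(classify(review))  # -> positive
```

Note how the single negative prior for decommissioned is outweighed by the positive words, so the review as a whole is classified as positive, exactly as the compositional principle predicts; multiword cues such as this is it!, which require context, are invisible to this word-level scheme.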
