
Working with Grammar

Using syntactical information to identify grammatical structures and classify texts

P.J. Derichs

MA thesis, 11 January 2016


Summary

In this thesis, automatically parsed information is used to identify grammatical structures. Successful identification of grammatical structures can lead to better and more extensive grammar checks, and it could be useful for improving automatic translation software. The first research question focuses on identifying grammatical structures: Can the identification of grammatical structures in automatically parsed sentences be done? The second research question focuses on using grammatical structures as features to classify texts on text difficulty: Can the classification of texts based on text difficulty be improved using grammatical structures from automatically parsed sentences as features?

The first experiment in this thesis focuses on the accuracy of the identification of the grammatical structures using XPath queries on automatically parsed sentences. These grammatical structures are: long subordinate clauses, long topicalizations, nominalizations, passive voice, discontinuous main clauses caused by adjectives, discontinuous main clauses caused by dependent clauses, and chains of prepositional phrases. The grammatical structures have been chosen by Bureau Taal, a company that specializes in the clear and proper usage of the Dutch language. The experiment is done by analyzing annotated sentences, which have been created and annotated by Bureau Taal. The accuracy of the identification varies greatly per structure: the F1-scores range from 55% for discontinuous main clauses caused by dependent clauses to 95% for nominalizations. The results are promising, but there is no gold standard for comparison. Most of the XPath queries can be improved using the information from the sentences in the test sets; there were insufficient example sentences to create XPath queries that identify grammatical structures very accurately before using the test sets.

The second experiment focuses on classifying texts based on the Common European Framework of Reference for Languages, using the grammatical structures as features. Bureau Taal has provided 125 annotated texts, which have been classified using a Naive Bayes classifier. The combination of the baseline features and the grammatical structures as features led to an accuracy of 55%, an increase of 7.5 percentage points in comparison to the baseline features alone.


Contents

Summary
1. Introduction
2. Literature on Text Readability and Text Classification
2.1 Determining text complexity
2.2 Text classification based on CEFR
2.3 Text classification to grade essays automatically
3. Explaining Tools
3.1 Alpino, a dependency parser
3.2 Dact, multi-purpose tool for parses
3.3 XPath, XML Path Language
4. Defining Grammatical Structures
4.1 Explaining Grammatical structures
4.1.1 Long subordinate clauses
4.1.2 Long topicalization
4.1.3 Nominalization
4.1.4 Passive voice
4.1.5 Discontinuous main clauses caused by adjectives
4.1.6 Discontinuous main clauses caused by dependent clauses
4.1.7 Chain of prepositional phrases
4.2 Identifying grammatical structures using XPath
4.2.1 Common macros
4.2.2 Long subordinate clauses
4.2.3 Long topicalization
4.2.4 Nominalizations
4.2.5 Passive voice
4.2.6 Discontinuous main clauses caused by adjectives
4.2.7 Discontinuous main clauses caused by dependent clauses
4.2.8 Chain of prepositional phrases
5. Detection of Grammatical Structures
5.1 Results of experiment 1
5.2 Discussion of experiment 1
5.2.1 The quality of the test sets
5.2.2 DMC: dependent clauses
5.2.3 DMC: adjectives
5.2.4 Long subordinate clauses
5.2.5 Long topicalization
5.2.6 Nominalization
5.2.7 Chain of prepositional phrases
5.2.8 Passives
6. Text Classification
6.1 Classify texts using grammatical features
6.2 Results
6.3 Discussion
6.3.1 Classifying texts
6.3.2 Informativeness of grammatical features
7. Conclusion
8. References
9. Appendix
9.1 Example sentences
9.2 Macros
9.3 get baseline data script file
9.4 Naïve Bayes classifier script file


1. Introduction

This thesis focuses on identifying grammatical structures in automatically parsed sentences in order to determine text difficulty. Before text difficulty can be determined, the identification of the grammatical structures must first be set up. The first of two experiments in this thesis therefore focuses on identifying grammatical structures. A second experiment will then test whether these grammatical structures can be used as features in text classification.

Two research questions will be discussed in this thesis:

1. Can the identification of grammatical structures in automatically parsed sentences be done?

2. Can the classification of texts based on text difficulty be improved using grammatical structures from automatically parsed sentences as features?

The first research question will be answered by performing an experiment, which uses an automated syntactical analysis to extract grammatical structures. A selection of seven grammatical structures has been made:

- Long subordinate clauses
- Long topicalizations
- Nominalizations
- Passive voice
- Discontinuous main clauses caused by adjectives
- Discontinuous main clauses caused by dependent clauses
- Chains of prepositional phrases

The grammatical structures will be discussed in chapter 4. The accuracy of these grammatical structures will be determined. The results of the experiment will be discussed in more detail in chapter 5.

The second research question will also be answered by an experiment, in which texts are classified using three feature sets:

- Only the baseline features
- Only the grammatical structures as features
- A combination of the baseline features and the grammatical structures as features

The experiment will be discussed in more detail in chapter 6.

Chapter 2 focuses on the literature on text complexity. It also looks at relevant literature on informative features in order to determine a set of baseline features.

This thesis originated from a question by Bureau Taal, a company that specializes in teaching effective communication and the use of the Dutch language. Bureau Taal has created a tool, called Texamen, to determine the language level of a Dutch text. This is currently done with statistical information, such as sentence length and average word length. Bureau Taal wants to expand this by looking into grammatical structures, and would like to be able to give feedback to the user, showing where in the text a grammatical structure should be removed to improve the quality of the text.

These two experiments will give more insight into the use of grammatical information for these purposes.


2. Literature on Text Readability and Text Classification

This chapter discusses relevant previous research. The focus will be on text readability and the determination of text complexity.

2.1 Determining text complexity

Research concerning the determination of text complexity has been done by Pander Maat et al. (2014). Their article focuses on extracting text features that are deemed important by professionals. To extract these features, a program called T-Scan has been created (Pander Maat et al., 2014, p. 53).

The feature classes include lexical complexity, sentence complexity and part-of-speech tagging. Lexical complexity includes word length and the numbers of nominalizations and abbreviations. Sentence complexity focuses on features like sentence length, passives and negations. These features can be used, for instance, to compare the complexity of different types of text. The authors use six texts: three from a Dutch social science journal and three from Flair, a weekly magazine for young women. The goal is to compare the complexity of these texts using the features from T-Scan.

Pander Maat et al. (2014, p. 68) conclude that the broad set of lexical measures makes it possible to distinguish different genres among the Flair texts, even more so than word length differences alone. There is also a distinct difference in the usage of personal and possessive pronouns between the different genres of texts.

The article by Pander Maat et al. offers an interesting view on researching text complexity and identifies several important features. Some of these features, namely nominalizations and passive sentences, are grammatical structures that are used in this thesis as features to classify texts. These grammatical structures were shown to be important features for determining text complexity.

2.2 Text classification based on CEFR

Pilán, Volodina and Johansson (2014) classify texts for the Swedish language based on the CEFR, the Common European Framework of Reference for Languages. The CEFR classifies the language difficulty of a text. A division is made into the categories A, B and C: A stands for a starting language user, B for an independent language user who is more versatile in the language, and C for an advanced language user whose texts approach the quality of a native speaker's. All categories are divided into two subcategories, for instance A1 and A2; a text with an A2 rating is more comprehensive than an A1 text. Pilán, Volodina and Johansson use 28 different features with a linear Support Vector Machine classifier (2014, pp. 177-178). These features include sentence length, type-token ratio, average dependency depth, the number of nouns divided by the number of verbs, nominal ratio and the average number of senses per word according to a lexicon. Their SVM classifier performed better than the baseline, with an accuracy of 71%, 21 percentage points higher than the baseline. The article performs a classification using the CEFR system, similar to what will be done in this thesis. It will, however, be difficult to compare results, because Pilán, Volodina and Johansson only distinguish texts at or below B1 level from texts above B1 level, whereas in this thesis a classifier will classify texts into all CEFR levels.

2.3 Text classification to grade essays automatically

Larkey (1998) uses different text classification methods to automatically grade essays. Larkey's experiment uses manually graded essays on three different subjects.

The experiment used two classifiers. The first is a binary Bayesian independence classifier (Larkey, 1998, p. 2), with feature selection based on the expected mutual information (EMIM) score. The second is a k-nearest-neighbour classifier. The documents were characterized using eleven features (Larkey, 1998, p. 3): the number of characters in a document, the number of words, the number of different words, the fourth root of the number of words, the number of sentences, average word length, average sentence length, and the numbers of words longer than 5, 6, 7 and 8 characters. The Bayesian classifier performed better than the k-nearest-neighbour classifier.


3. Explaining Tools

In order to analyse the different grammatical structures in this thesis, a selection of tools is needed. These tools will now be explained briefly.

3.1 Alpino, a dependency parser

The first tool to be discussed is Alpino, a dependency parser by van Noord (2006). It can be used to determine the roles of words and groups of words in a sentence (van Noord, 2006, p. 21). A sentence can yield more than one parse; a Maximum Entropy disambiguation model judges the quality of the parses and the best one is chosen. The accuracy of Alpino is currently around 90% (van Noord, 2006, p. 38).

In figure 1, an example of a simplified XML file can be found. Many properties have been removed in order to keep the file readable. Sentence 1 is used in the example XML file.

(1) Jan zit in de boom.

<?xml version="1.0" encoding="UTF-8"?>
<alpino_ds version="1.3">
  <node begin="0" cat="top" end="6" id="0" rel="top">
    <node begin="0" cat="smain" end="5" id="1" rel="--">
      <node begin="0" end="1" getal="ev" pt="n" rel="su" word="Jan"/>
      <node begin="1" end="2" pt="ww" rel="hd" word="zit" wvorm="pv"/>
      <node begin="2" cat="pp" end="5" rel="ld">
        <node begin="2" end="3" pt="vz" rel="hd" word="in"/>
        <node begin="3" cat="np" end="5" rel="obj1">
          <node begin="3" end="4" pt="lid" rel="det" word="de"/>
          <node begin="4" end="5" pt="n" rel="hd" word="boom"/>
        </node>
      </node>
    </node>
    <node begin="5" end="6" lcat="punct" pt="let" word="."/>
  </node>
  <sentence>Jan zit in de boom .</sentence>
  <comments>
    <comment>Q#1|Jan zit in de boom .</comment>
  </comments>
</alpino_ds>

Figure 1. An example of an XML file generated by Alpino with the sentence "Jan zit in de boom".

The different nodes contain different properties, like begin, end, word, pt and rel. begin and end show the positions of words, pt shows the POS tag and rel shows the relation of the word within the sentence. Dact can graphically create dependency trees based on this information. Dact will be explained in more detail in the next section.


3.2 Dact, multi-purpose tool for parses

The second tool used in the experiments is Dact, an open-source tool available for multiple platforms (Van Noord et al., 2012, p. 156). Dact provides a graphical visualisation of dependency structures. It also offers the possibility of executing XPath 2.0 queries, highlighting the results. In addition, it can create frequency lists based on selected nodes, and macros can be loaded for use in XPath queries. Figure 2 shows the dependency tree of the XML file in figure 1.


3.3 XPath, XML Path Language

XPath is a language for selecting nodes from XML documents (Berglund et al., 2010). In this thesis, XPath queries are used to search the XML parses produced by Alpino for grammatical structures.


4. Defining Grammatical Structures

The focus of this chapter is on performing an experiment to identify grammatical structures. The accuracy of the identification of these grammatical structures will be determined and discussed. First, the seven grammatical structures will be explained, along with the XPath queries used to identify them.

4.1 Explaining Grammatical structures

The different structures pertain particularly to the Dutch language. Bureau Taal has chosen these grammatical structures based on their experience with Dutch texts. The following grammatical structures will be researched:

- Long subordinate clauses
- Long topicalizations
- Nominalizations
- Passive voice
- Discontinuous main clauses caused by adjectives
- Discontinuous main clauses caused by dependent clauses
- Chains of prepositional phrases

4.1.1 Long subordinate clauses

Long subordinate clauses give extra information and are often related to the main clause. They normally cannot stand alone as a sentence in the English language. An example sentence with a subordinate clause can be found in sentence 2.

(2) [Whenever I go to the grocery store], it starts to rain.

4.1.2 Long topicalization

Long topicalization is a phenomenon in which many words occur before the verb of the main clause. Bureau Taal has advised that a sentence has a long topicalization when four or more words precede this verb. An example sentence with this grammatical structure can be found in sentence 3.

(3) [The joined meetings of the different companies in Europe] cannot be properly planned.

4.1.3 Nominalization

Nominalization is the process of making nouns out of verbs. Examples of nominalizations are 'living', 'contribution' and 'running'. In Dutch, a distinction is made between two types of nominalizations: on the one hand, nominalizations in the infinitive form, like 'het lopen' and 'het fietsen', and on the other hand, derivations like 'wandeling' and 'belasting'. After discussion with Bureau Taal, the decision has been made to focus only on the infinitive forms of nominalizations, because it is not possible to determine in Alpino whether words ending in '-ing' are nominalizations or not.

4.1.4 Passive voice

An example of an active voiced sentence can be found in sentence 4. A passive voiced version of sentence 4 can be found in sentence 5.

(4) I am watering the plants.

(5) The plants are being watered by me.

4.1.5 Discontinuous main clauses caused by adjectives

A discontinuous main clause (DMC) caused by adjectives occurs when the adjective phrase preceding a noun is longer than two words. An example of a DMC caused by adjectives is hard to give in English, because it would not occur when correct grammar is used. Dutch examples can be found in sentences 6 and 7.

(6) De [door de student gewassen] kleding lijkt oud.
The clothes, [which were washed by the student], look old.

(7) Het [door de uitgever vaak uitgestelde] boek is nu eindelijk toch uitgekomen.
The book, [which has often been postponed by the publisher], has finally been released.

4.1.6 Discontinuous main clauses caused by dependent clauses

DMCs caused by dependent clauses are a more common form of DMC in the Dutch language. In this case, a main clause is split in two by one or more dependent clauses. An example of a DMC caused by dependent clauses can be found in sentence 8.

(8) De laatste telefoon van Apple, [die ook bekend is van het maken van computers en laptops], is aangekondigd bij de electronica-conferentie.
The latest phone from Apple, [which is also known for making computers and laptops], was presented at the electronics conference.

4.1.7 Chain of prepositional phrases

A prepositional phrase is a phrase that starts with a preposition. Examples of prepositional phrases are 'on the road', 'at the store' and 'in the box'. These prepositional phrases can be chained, which can lead to long sentences. An example can be found in sentence 9. According to Bureau Taal, a chain of prepositional phrases needs to contain at least two prepositional phrases in order to be relevant.

(9) The man is having a meeting [in the city hall] [in Denver] [with the mayor] [of Denver] [at nine o'clock].

4.2 Identifying grammatical structures using XPath

In order to get a syntactical analysis of the sentences, the dependency parser Alpino will be used. For each sentence entered into Alpino, an XML file will be generated with the parsed information. With the use of XPath, the XML files will be searched to find the relevant grammatical structures. Dact is used to view the parsed sentences and tree structures; XPath queries can also be entered in Dact. Since Dact gives a detailed overview of the tree structure, this program was used first to work out the XPath queries needed for the different grammatical structures. In order to make the XPath queries more legible, macros can be defined. Once a macro has been defined, it can be called by putting its name between percent signs (%).
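As an illustration of this macro notation, a definition and a query calling it could look as follows (the macro name here is only an example, not one of the thesis' macros):

np = """( @cat="np" )"""

//node[%np% and @rel="su"]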

4.2.1 Common macros

First, a few common macros were created. These will be used in several of the XPath queries for the grammatical structures. The following macros have been made to help extract the grammatical structures:

Figure 3. Begin macro. Refers to the starting position of a node.

b = """( number(@begin) )"""

(10) It was a hectic day at work.

The macro in figure 3 takes the begin value of a node. In a sentence like sentence 10, 'It' has a begin value of 0 and 'hectic' has a begin value of 3. If 'a hectic day at work' were one node, that node would have a begin value of 2.


Figure 4. End macro. Refers to the ending position of a node.

e = """( number(@end) )"""

The macro in figure 4 takes the end value of a node. To take the example from above, "It was a hectic day at work": 'It' has an end value of 1 and 'hectic' has an end value of 4. If 'a hectic day at work' were one node, that node would have an end value of 7.

Figure 5. Length macro. Refers to the length of a node.

The macro in figure 5 describes the length of a node. In the same example sentence, 'It' and 'hectic' both have a length of 1, and 'a hectic day at work' has a length of 5.
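The macro body for figure 5 is missing from this copy. Given the begin and end macros above, a plausible reconstruction is the following; the body is an assumption, though the name %lengte% is the one called in later queries:

lengte = """( %e% - %b% )"""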

These macros can be useful for setting a minimum length of nodes, or for checking whether a word is the first word of a node or of the whole sentence. This notion is also used in one of the XPath queries for the grammatical structures.
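For instance, a query using these macros to find nodes of at least four words that start at the beginning of the sentence might look like this (a sketch, not a query from the thesis):

//node[%lengte% > 3 and %b% = 0]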

4.2.2 Long subordinate clauses

The long subordinate clauses have a category of their own in the annotation produced by Alpino. There are several categories in Alpino that describe long subordinate clauses; these categories can be found in the 'bevat_bijzin' macro. According to Bureau Taal, a subordinate clause needs to be at least three words long in order to be relevant.


Figure 6. An example of a parse with a dependent clause. The marked area covers the relative clause.


In XPath, the definitions of these queries can be found in figures 7 and 8.

Figure 7. 'Contains subordinate clause' macro. Used to find nodes with subordinate clauses.

bevat_bijzin = """( ((@cat="rel" or @cat="cp" or @cat="conj" or
    @cat="whrel" or @cat="oti" ) and @rel="mod" ) or
    (@cat="sv1" and @rel="sat" ) )"""

In figure 7, the different categories for a subordinate clause are determined. In some cases, a different category and 'rel' status are given, which makes the disjunction in this macro necessary. Sentence 11 is a sentence where the second part of the disjunction is needed to obtain the subordinate clause.

(11) [Komt er een leerling in aanmerking voor extra ondersteuning in het LWOO], dan vragen wij u in te stemmen met het gebruik van de gegevens.
[When a student is considered for extra support in the LWOO], we ask you to agree with the use of the data.

Figure 8. 'Subordinate clause' macro. Used to find nodes with subordinate clauses that are at least 4 words long.

The XPath query macro in figure 8 identifies long subordinate clauses of 4 or more words.
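The body of the figure 8 macro is not preserved in this copy. A plausible sketch, assuming it combines the category macro with the length macro (the name bijzin is hypothetical; %bevat_bijzin% and %lengte% are attested):

bijzin = """( %bevat_bijzin% and %lengte% > 3 )"""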


In order to distinguish DMCs caused by long subordinate clauses from long subordinate clauses that do not split up the main clause, a macro has been created to avoid identifying a long subordinate clause when the dependent clause splits the main clause. This macro involves another macro, called 'max_tang_bijzinnen_inv', from the DMCs; it will be explained in section 4.2.7, in figure 41.

Figure 9. 'Valid subordinate clause' macro. Used to find nodes that are long subordinate clauses but do not interrupt the main clause.

Executing this macro is fairly slow. In order to speed it up, the query in figure 10 can be used.

Figure 10. ‘Valid subordinate clause’ query. Used to find nodes with long subordinate clauses in a faster way.

4.2.3 Long topicalization

In the case of long topicalization, a more elaborate query is necessary. First, the first word of the sentence needs to be determined. Sentences can, however, start with punctuation, which is why it is not advisable to simply take the first token of the sentence.


Figure 11. A partial parse containing a long topicalization.

In figure 11, a long topicalization can be seen. Everything that is part of the prepositional phrase (pp) branching directly off the main clause (smain) is part of the long topicalization.

Figure 12. 'Word' macro. Refers to a word (a node that is not punctuation).

woord = """( not(@pt="let" ) )"""

Figure 13. 'First word of the sentence' macro. Refers to the first word of the sentence.

eerste_woord_van_de_zin = """( //node[%woord% and
    not(%e% > (//node[%woord%]/%e%)) ] )"""

The combination of the macros in figures 12 and 13 makes it possible to return the first word of a sentence.

Figure 14. 'Daughter of the main clause' macro. Refers to a child node of the main clause.

dochter_van_hoofzin = """( ../@cat="smain" )"""

Figure 15. 'First node of the parent node' macro. Refers to the first node of the parent node.

begint_bij_eerste_node_van_oudernode = """( %b% = ../%b% )"""

The macro in figure 14 ensures that the node of the long topicalization is part of the main clause. The macro in figure 15 ensures that the node is the first node of its parent node. The combination of these two ensures that the long topicalization is the first node of the main clause.

Figure 16. 'Last node is not equal to the last node of the parent node' macro. Refers to a node that is not the last node of the parent node.

laatste_node_niet_gelijk_aan_laatste_node_van_oudernode = """( not(%e% = ../%e%) )"""

The macro in figure 16 determines that the ending number of the node is not equal to the ending number of the parent node. This is necessary to distinguish the long topicalization from the whole main clause.

Figure 17. 'Grandchild of the main clause' macro. Refers to a node that is a grandchild of the main clause.


Figure 18. 'First node of the grandparent node' macro. Refers to a node that starts at the first node of the grandparent node.

begint_bij_eerste_node_van_grootoudernode = """( %b% = ../../%b% )"""

Figure 19. 'Last node is not equal to the last node of the grandparent node' macro. Refers to a node that is not the last node of the grandparent node.

laatste_node_niet_gelijk_aan_laatste_node_van_grootoudernode = """( not(%e% = ../../%e%) )"""

The macros in figures 17, 18 and 19 are similar to the ones in figures 14, 15 and 16; the difference is that there is one extra node between the examined nodes and the main clause, as some parses had an extra node in between.

Figure 20. 'Long topicalization, parent node' macro. Combines the parent node macros into one macro for readability.

Figure 21. 'Long topicalization, grandparent node' macro. Combines the grandparent node macros into one macro for readability.


The macros in figures 20 and 21 each combine the separate macros based on parent or grandparent nodes into a single macro, in order to improve readability.
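The bodies of the figure 20 and 21 macros are not preserved here. Given the component macros above and the names called in figure 22, a plausible reconstruction is the following; the composition is an assumption, and %kleindochter_van_hoofdzin% stands in for the unnamed figure 17 macro:

lange_aanloop_onderdeel_van_oudernode = """( %dochter_van_hoofzin% and
    %begint_bij_eerste_node_van_oudernode% and
    %laatste_node_niet_gelijk_aan_laatste_node_van_oudernode% )"""

lange_aanloop_onderdeel_van_grootoudernode = """( %kleindochter_van_hoofdzin% and
    %begint_bij_eerste_node_van_grootoudernode% and
    %laatste_node_niet_gelijk_aan_laatste_node_van_grootoudernode% )"""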

Figure 22. 'Long topicalization' macro. Refers to a long topicalization.

lange_aanloop = """( (%lange_aanloop_onderdeel_van_oudernode% or
    %lange_aanloop_onderdeel_van_grootoudernode% ) and
    not(.//node[@pt="ww" and @pos="verb" and
        not(ancestor::node[@cat="np"] ) ] ) )"""

The long topicalization macro in figure 22 combines the macros above to capture the first part of a sentence. This first part is not the whole main clause and contains four words or more. The macro also ensures that the long topicalization is not part of a verb or a noun phrase.

In some cases, long topicalizations also contain partial results. In order to get the full long topicalization, the macro in figure 23 is used.

Figure 23. 'Maximal long topicalization' macro. Refers to a long topicalization without partial results.

max_lange_aanloop = """( %lange_aanloop% and
    not(ancestor::node[%lange_aanloop%] ) )"""


4.2.4 Nominalizations

In the Dutch language, nominalizations are usually written in the infinitive verb form, though other forms exist as well. As mentioned in section 4.1.3, the focus will be on the infinitive verb forms only.

Figure 24. A partial parse with a nominalization. The nominalization is ‘omwisselen’.

In figure 24, a partial parse with a nominalization can be found. The nominalization is accompanied by a dependent word, in this case 'het'.

Figure 25. 'Infinitive is the main word' macro. Refers to a verb that is an infinitive and the main word of its node.

infinitief_is_hoofdonderdeel = """( @pt="ww" and @wvorm="inf" and
    @rel="hd" )"""


Figure 26. 'Part of a noun phrase' macro. Checks whether a node is part of a noun phrase.

The macro in figure 25 selects verbs in the infinitive form that are the main part of a node; the macro in figure 26 ensures that the node is part of a noun phrase.
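The figure 26 macro body is missing here. A minimal sketch, assuming it simply checks for a noun phrase among the node's ancestors (the name is hypothetical):

onderdeel_van_np = """( ancestor::node[@cat="np"] )"""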

Figure 27. 'Stand-alone nominalizations' macro. Refers to nominalizations without accompanying words.

To find nominalizations that are not accompanied by other words, the macro in figure 27 is used. It searches for an infinitive verb form that has the function of a subject, object or predicate. Sentences 12 and 13 are examples of nominalizations. Sentence 12 shows a nominalization without accompanying words; it is found by the second part of the disjunction in the macro in figure 28. Sentence 13 shows a nominalization accompanied by another word; it is identified by the first part of the disjunction in the macro in figure 28.

(12) [Wachten] is vervelend.
It is annoying [to wait].

(13) [Lang wachten] is vervelend.
It is annoying [to wait for a long time].


Figure 28. 'Nominalizations' macro. Refers to nominalizations.

The combination of these three XPath queries identifies nominalizations.
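The figure 28 body is also missing. A plausible sketch of the combination described above; apart from %infinitief_is_hoofdonderdeel%, the names are hypothetical:

nominalisaties = """( (%infinitief_is_hoofdonderdeel% and %onderdeel_van_np%) or
    %opzichzelfstaande_nominalisaties% )"""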

4.2.5 Passive voice

Figure 29. A partial parse with an auxiliary verb in a passive sentence.

In figure 29, a partial parse with an auxiliary verb can be found. The sentence is passive, and the 'sc' attribute of the node for 'wordt' has the value 'passive'.

This makes passive sentences very easy to obtain. The XPath query in figure 30 can be used to find them.

Figure 30. 'Passive' macro. Refers to auxiliary verbs in sentences in the passive voice.
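The figure 30 body is missing here. Based on the description above, a minimal sketch would test the 'sc' attribute directly (the macro name and the exact attribute test are assumptions):

passief = """( @pt="ww" and @sc="passive" )"""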


4.2.6 Discontinuous main clauses caused by adjectives

The main part of this grammatical structure is an adjective phrase of at least a certain length, which splits up the main clause and thereby creates a DMC. The limit has been set to at least four words; this limit has been determined by Bureau Taal.

Figure 31. A partial parse with an adjective, causing a DMC.

Figure 32. ‘DMC caused by adjectives categories’ macro. Refers to nodes with the right categories for DMC caused by adjectives.


Figure 33. 'DMC by adjective' macro. Refers to adjectives that cause DMCs.

tang_bijv_nw = """( ancestor::node[@cat="np" and .//@rel="det" and
    not(.//@cat="rel" ) and %lengte% > 5 and not(@rel="cnj" ) and
    node[@pt="n" and
        (some $top in ancestor::node[(@cat="np")]
         satisfies ($top/%e% = %e%) ) ] ]
    and %tang_bijv_categorieen% )"""

The exclusion of some categories in the macro in figure 33 was needed to avoid false positives in the results. The third part of the macro is an existential quantifier and makes sure that the node is part of the noun phrase, which is always the case for an adjective in this grammatical structure. The XPath query in figure 33 may return multiple results where one result is expected: if three adjectives precede a noun, three results are given, each incrementing in size. In that case, the result with the largest number of words is needed. To achieve this, the XPath query macro in figure 34 has been used.

Figure 34. 'Maximum DMC by adjectives' macro. Refers to adjectives causing DMCs without partial results.

max_tang_bijv_nw = """( %tang_bijv_nw% and
    not(ancestor::node[%tang_bijv_nw%] ) )"""


4.2.7 Discontinuous main clauses caused by dependent clauses

This category entails DMCs caused by dependent clauses: when a dependent clause gets too long, it interrupts the main clause, making the sentence unnecessarily lengthy.

Figure 35. A partial parse with a dependent clause, causing a DMC.


Figure 36. 'DMC by dependent clauses categories' macro. Refers to nodes with the right categories for a DMC caused by dependent clauses.

tang_bijzinnen_categorieen = """( @cat="cp" or @cat="rel" or @cat="whrel" or
    @cat="whsub" or @cat="oti" or
    (@cat="conj" and ./node[@cat="rel" and @rel="cnj"] ) )"""

Figure 37. 'Is within a main clause' macro. Checks whether a node is in the middle of a main clause.

valt_binnen_een_hoofdzin = """( (some $top in ancestor::node[(@rel="--" or
    @cat="smain" ) ] satisfies ($top/%b% < %b% and $top/%e% > %e% ) ) )"""

Figure 38. 'Contains a noun within the group' macro. Checks whether there is a noun node within the group.

bevat_zelfstandig_nw_binnen_groep = """( (some $np in ../node[@pt="n" ]
    satisfies ($np/%e% < %e% ) ) )"""


Figure 39. 'Is not part of a noun phrase but does fall within the main clause' macro. Refers to a node that does not have a noun phrase as an ancestor node, but does have a main clause as an ancestor node.

bevat_geen_np_binnen_hoofdzin = """( ancestor::node[@rel="--" or @cat="smain" ] and
    not(ancestor::node[@cat="np" ] ) )"""

In order to make the overall macro more readable, it has been divided into several macros. To find a dependent clause that discontinues the main clause, an existential quantifier is used; this can be seen in figures 37 and 38.

The macro in figure 37 makes sure that the dependent clause's ancestor is a main clause, and that this main clause has a starting position before the dependent clause and an ending position after it. The macro in figure 38 ensures that there is a noun node within the found node.

Figure 40. 'DMC by dependent clauses' macro. Checks whether a node is a dependent clause that splits up a main clause.

tang_bijzinnen_inv = """( %tang_bijzinnen_categorieen% and
    %valt_binnen_een_hoofdzin% and
    (%bevat_zelfstandig_nw_binnen_groep% or %bevat_geen_np_binnen_hoofdzin% ) )"""

In some cases the dependent clause belongs to multiple categories simultaneously, yielding more than one result for a single dependent clause. To solve this issue, the macro in figure 41 has been made.


Figure 41. 'Maximum DMC by dependent clauses' macro. Identifies dependent clauses causing DMCs without partial results.

max_tang_bijzinnen_inv = """( %tang_bijzinnen_inv% and
    not(ancestor::node[%tang_bijzinnen_inv%] ) )"""

4.2.8 Chain of prepositional phrases

Figure 42. A partial parse with a chain of prepositional phrases.

In figure 42, an example of a chain of prepositional phrases can be found. This distinctive pattern appears almost every time a chain of prepositional phrases is present. To identify chains of prepositional phrases, the prepositional phrases among a node's child nodes are counted. This is done by the macro in figure 43.



Figure 43. 'Chain of prepositional phrases' macro. Refers to chains of prepositional phrases.

voorzetselketens = """( @cat="pp" and
    count(.//node[@cat="pp" ] ) > 1 )"""

This macro may return multiple answers for a single chain: if there are three prepositional phrases, the count of nearby prepositional phrases is computed for each of them separately. To avoid this, and to get all of the prepositional phrases in one result, the macro in figure 44 has been created:

Figure 44. 'Maximum chain of prepositional phrases' macro. Refers to chains of prepositional phrases without partial results.

max_voorzetselketens = """( %voorzetselketens% and
    not(ancestor::node[%voorzetselketens%] ) )"""


5. Detection of Grammatical Structures

In order to test the accuracy of the XPath queries, a set of 40 sentences has been created for each category by Bureau Taal. Twenty of these sentences contain the category and the other twenty do not. First, all sentences were parsed by Alpino and saved as XML files. These XML files were converted to corpus files using Dact. The process of calculating scores has been automated as much as possible using Python scripts and a command-line version of Dact. Finally, the accuracy, recall, precision and F1-score have been calculated for each category. The accuracy has been calculated by dividing the correctly identified instances, that is the true positives and the true negatives, by the total number of instances. The F1-score is a more trustworthy measure of performance than accuracy alone.
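As an illustration of how these scores follow from the counts, a minimal sketch in Python (not the thesis' own script; the counts in the usage line are made up):

def scores(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(scores(tp=18, fp=2, tn=19, fn=1))  # illustrative counts only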

Unfortunately, there is no gold standard to compare these results to. This makes it hard to determine how good the results actually are.

5.1 Results of experiment 1

Table 1. The accuracy, precision, recall and F1-score for the seven grammatical structures.

            DMC:           DMC:         Dep.      Long     Nominal-    Chain of        Passives
            dep. clauses   adjectives   clauses   top.     izations    prep. phrases
Accuracy    65.22%         65.96%       53.19%    80.49%   94.23%      78.57%          93.48%
Precision   62.50%         61.11%       53.33%    92.86%   93.55%      86.67%          100.00%
Recall      50.00%         55.00%       34.78%    65.00%   96.67%      65.00%          88.46%
F1-score    55.56%         57.89%       42.11%    76.47%   95.08%      74.29%          93.88%

Table 1 shows the performance for the seven grammatical structures. The accuracies of the two DMC categories are between 65% and 66%; their F1-scores are lower, at 56% and 58% respectively. The dependent (dep.) clauses have an accuracy of 53%. The long topicalizations have an accuracy of 80%; their precision is relatively high (93%) in comparison to the recall (65%). The nominalizations and the passive sentences have accuracies of 94% and 93%. The chains of prepositional phrases have an accuracy of 79%.


5.2 Discussion of experiment 1

In this section, the different categories will be discussed individually, with a look at the sentences of the test sets. It is important to determine whether the accuracy of a category can be improved.

5.2.1 The quality of the test sets

The test sets and the example sentences, which were used to create the XPath queries, were developed by two different persons. In most cases this should not pose a problem, as long as the sentences contain the described categories. In some cases, however, it can cause problems. This will be discussed in more detail per category.

A grammatical structure can be interpreted differently by a student, like the author of this thesis, and by a language expert, like Bureau Taal; this makes it hard to fully comprehend and understand a grammatical structure. This will also become clearer in the discussion of the categories.

Another problem is that some sentences are not parsed correctly by Alpino, which can have multiple causes. If a sentence is grammatically incorrect, it may cause issues in the parsing process. Alpino will always try to deliver a parse, and it is never certain that a parse is correct. Some of the incorrectly parsed sentences, however, receive a typical 'flat' structure with certain identifiable nodes. It would be possible to identify these parses with an XPath query, but this has not been tested in this experiment.

5.2.2 DMC: dependent clauses

This category is subcategorized under the name dependent clauses; it includes more clause types, such as independent clauses and subordinate clauses. One of these types is the apposition. Appositions were not properly identified, but are, according to Bureau Taal, relevant to this category. An example can be found in sentence 14. This is a different type of clause from what was seen in the example sentences, which is obvious to a language expert like Bureau Taal, but was not to the author of this thesis, who focused solely on the example sentences. The XPath query should be adjusted to include these types of clauses, thus increasing the accuracy; by incorporating these other types of clauses, the accuracy can be increased by 20-25%. Appositions are marked as such by Alpino, with @rel="app".


(14) The big question is whether Dirk Kuijt, [the man who was disregarded by many], will be able to score a hat-trick again.

5.2.3 DMC: adjectives

The query for this category was at first designed to identify a complete noun phrase. After some discussion with the language experts of Bureau Taal, it was adjusted to identify only the adjective part of the relevant noun phrases. In this category, a lot of partial answers were found, and 45% of the answers were not found at all. This is related to parses where a part of the adjective phrase is split from the rest. An example can be found in figure 45, where the grey node depicts the found node, in this case "in abominabel slechte staat". "verkerende" is also part of this adjective phrase, but is not correctly identified. Sentence 15 is used for the parse in figure 45.

(15) Het [in abominabel slechte staat verkerende] schilderij werd voor een recordbedrag geveild.
The painting, [which was in abominably bad condition], was auctioned for a record amount.


Figure 45. An example of a parse where a partial adjective, causing a DMC, is found.

As a test, the test set was altered to show the noun phrases as answers, and the old query that identified the whole noun phrase as the adjective was used. The results can be found in table 2. These results are better and imply that the query should be improved to increase the accuracy.

Table 2. The accuracy scores of adjectives causing a DMC as a noun phrase.

            DMC: adjectives (np)
Accuracy    80.95%
Precision   87.50%
Recall      70.00%
F1-score    77.78%

5.2.4 Long subordinate clauses


The example sentences for this category covered only a few types of these clauses. The test set, however, covers more types, which made it harder to identify all the clauses correctly. It is, however, easy to adapt the query to include these sentence types as well, which would increase the accuracy of this category. An example of this is sentence 16.

(16) De rechter verklaart de stichting in meer dan 100 gevallen niet-ontvankelijk, [maar de voorzitter zegt desondanks tevreden te zijn met het vonnis].
The judge has declared the institute inadmissible in over 100 cases, [but the chairman has said to be content with the verdict nevertheless].

5.2.5 Long topicalization

The recall of this XPath query is relatively low, at 65%. It is not quite clear why some of the long topicalizations are not correctly identified; further research into this issue could greatly increase the accuracy.

It may seem easier to simply look at the position of the verb in a sentence to determine whether it contains a long topicalization. It is, however, necessary that the whole first part of the sentence, up to the verb, is part of the main clause, which resulted in a rather large XPath query. Further investigation into parses like the one in figure 46 should be done to increase the accuracy of this XPath query.

5.2.6 Nominalization


(17) Dasmag is een nieuwe uitgeverij die het uitgeven van boeken lucratiever wil maken voor jonge schrijvers die nog lol hebben in het [schrijven] van boeken.
Dasmag is a new publisher that wants to make the publishing of books more lucrative for young writers who still enjoy the [writing] of books.

Figure 46. A parse of a nominalization as a noun instead of a verb.

5.2.7 Chain of prepositional phrases


Figure 47. A parse of a chain of prepositional phrases as sibling nodes.

Improving the XPath query so that it takes these sibling chains of prepositional phrases into account, in addition to the child chains, will greatly increase the recall.

5.2.8 Passives

This XPath query has very good scores and identifies passive sentences very well, with an accuracy of 93.48%, which closely resembles the accuracy of Alpino itself. The missed passive auxiliary verbs do not carry the label 'passive'; in sentence 18, for instance, the word 'is' is incorrectly labeled as a copula instead of a passive auxiliary.


6. Text Classification

In the previous experiment, the focus was on the identification of grammatical structures. Using that information, grammatical structures can now be used as features to classify texts. In this experiment, the grammatical features will be used along with baseline features taken from Larkey's (1998) research.

6.1 Classify texts using grammatical features

To classify texts, Bureau Taal has provided 125 manually classified texts. Each of these texts has been given a language level on the CEFR scale; from low to high, the levels are A1, A2, B1, B2, C1 and C2. The idea is that the texts become more complex as the level rises. The texts were presented in one large document file. The first 104 texts had a notation to distinguish titles and paragraphs; the last nineteen texts did not. The last nineteen texts also included lists, which made the automatic conversion more difficult. Most lists are part of the sentence preceding them, which is why the lists have been appended to the preceding sentence. This would sometimes result in long, ungrammatical sentences; of these last nineteen texts, four had this issue. These few wrongly prepared texts could decrease the classification accuracy. This experiment will check whether the grammatical features are informative enough for a proper classification. Bureau Taal has requested that the texts be kept confidential, so their content will not be presented in this thesis.

In order to be able to process the texts in Alpino, the texts had to be prepared: each sentence had to be placed on a separate line. A Python script has been made to accomplish this.
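The preparation script itself is not reproduced here; a minimal sketch of the idea in Python, assuming a naive punctuation-based sentence split (the thesis' actual script may differ):

import re

def one_sentence_per_line(text):
    # Split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return "\n".join(s for s in sentences if s)

with open("Text 001.txt") as f:  # file naming follows the appendix convention
    prepared = one_sentence_per_line(f.read())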

All 125 texts have been passed through Alpino to obtain the XML files for each line of each text. Afterwards, the XML files have been searched with the XPath queries in Dact to find the number of occurrences of each grammatical structure per text. As a reference, the features from Larkey's (1998) research have been used as a baseline. The baseline features are:

- The number of characters in the document
- The number of words in the document
- The number of different words in the document
- The number of sentences in the document
- Average word length
- Average sentence length
- Number of words longer than 5 characters
- Number of words longer than 6 characters
- Number of words longer than 7 characters
- Number of words longer than 8 characters

The classifier used is the Naive Bayes classifier. To implement, train and test it, NLTK, a natural language toolkit for Python, is used. Both feature sets will be tested using ten-fold cross-validation.
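A minimal sketch of this setup with NLTK; the feature names and the fold logic are illustrative, not the thesis' exact script:

import nltk

# data: a list of (feature_dict, cefr_level) pairs, one per text, e.g.
# ({'nominalisaties': 11, 'passief': 3, 'avg_sentence_length': 14.2}, 'B2')

def ten_fold_accuracy(data):
    fold_size = len(data) // 10
    accuracies = []
    for i in range(10):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        classifier = nltk.NaiveBayesClassifier.train(train)
        accuracies.append(nltk.classify.accuracy(classifier, test))
    return sum(accuracies) / len(accuracies)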

6.2 Results

                                                   Mean     Standard deviation   Standard error
Baseline features                                  47.50%   16.55%               5.52%
Grammatical features                               44.17%   18.02%               5.70%
Combination of baseline and grammatical features   55.00%   14.80%               4.68%

Table 3. The results of ten-fold cross-validation on the feature sets of the baseline features and the grammatical features using a Naive Bayes classifier.


6.3 Discussion

6.3.1 Classifying texts

In order to calculate the baseline and grammatical features, a Python script was created. A Naive Bayes classifier was trained and tested with each feature set in a ten-fold cross-validation; this was done separately so the results could be compared. With the baseline feature set, an accuracy of 47.50% was obtained. The grammatical feature set performed with an accuracy of 44.17%, which means that the grammatical features on their own still offer a great amount of information. With the combination of both feature sets, an accuracy of 55.00% is obtained, an increase of 7.5 percentage points over the baseline features. This is a good result, and it means that the grammatical features are very suitable for classifying texts. It is also noticeable that the combination of both feature sets leads to a lower standard deviation and standard error, making the results more trustworthy.

6.3.2 Informativeness of grammatical features


7. Conclusion

The focus of this thesis was mainly on determining whether it is possible to identify grammatical structures using automatically parsed syntactical information. The extraction of grammatical structures with XPath queries on parses from Alpino is complex and yielded accuracies between 53% and 94%. Long subordinate clauses are the hardest to extract, with an accuracy of 53%. Passive sentences and nominalizations can be identified best, with accuracies of 93% and 94% respectively. Some of the XPath queries, like those for the DMCs caused by dependent clauses and adjectives, can still be optimized greatly to increase the score. The DMC queries can probably be improved considerably, but that requires a better understanding of the different components that can cause DMCs.

The first research question was as follows: Can the identification of grammatical structures in automatically parsed sentences be done? The grammatical structures can be identified, but the accuracy differs between structures. Passive sentences and nominalizations can be determined very reliably, but long subordinate clauses cannot be determined with great accuracy at this time. Further research into this category should yield more reliable identification.

As an extension of the use of the information from the XPath queries, a classification has been done using the grammatical structures as features. In order to assess the performance of this classification, a baseline has been set to compare the grammatical features against. These grammatical features hold a lot of information for classifying texts. The combination of both feature sets leads to an increase in accuracy of 7.5 percentage points, reaching an accuracy of 55.00%. This means that the grammatical features can be used very well to classify texts based on language level. It might also be interesting to test other grammatical structures in a language. It is not certain whether improving the XPath queries will lead to better classification, as the informativeness of the individual features is still unclear.


For future research, it will be interesting to look further into improving the XPath queries to get more reliable results. It will also be interesting to see how informative the grammatical features are individually.


8. References

Berglund, A., Boag, S., Chamberlin, D., Fernández, M.F., Kay, M., Robie, J. and Siméon, J., 2010. XML Path Language (XPath) 2.0 (Second Edition). [online] Available at: <http://www.w3.org/TR/xpath20/> [Accessed 20 September 2015].

Larkey, L.S., 1998. Automatic Essay Grading Using Text Categorization Techniques. In: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. New York: ACM. pp. 90-95.

Noord, G.J.M. van, Bouma, G., Eynde, F. van, Kok, D. de, Linde, J. van der, Schuurman, I., Tjong Kim Sang, E. and Vandeghinste, V., 2012. Large Scale Syntactic Annotation of Written Dutch: Lassy. In: P. Spyns and J. Odijk, eds. Essential Speech and Language Technology for Dutch. Berlin: Springer. pp. 147-164.

Noord, G.J.M. van, 2006. At Last Parsing Is Now Operational. In: P. Mertens, C. Fairon, A. Dister and P. Watrin, eds. Verbum Ex Machina. Actes de la 13e conférence sur le traitement automatique des langues naturelles. Leuven: TALN. pp. 20-42.

Pander Maat, H., Kraf, R., Bosch, A. van den, Dekker, N., Gompel, M. van, Kleijn, S., Sanders, T. and Sloot, K. van der, 2014. T-Scan: a new tool for analyzing Dutch text. Computational Linguistics in the Netherlands Journal, 4. [online] Available at: <http://www.clinjournal.org/sites/clinjournal.org/files/05-PanderMaat-etal-CLIN2014.pdf> [Accessed 3 October 2015].

Pilán, I., Volodina, E. and Johansson, R., 2014. Rule-based and machine learning approaches for second language sentence-level readability. In: Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications. Baltimore: Association for Computational Linguistics.


9. Appendix

9.1 Example sentences

In 'example sentences.xlsx' the different example sentences per category can be found. The XPath queries have been based on these sentences.

This file can be found at: http://www.let.rug.nl/vannoord/Scripties/PatrickDerichs/example sentences.xlsx

9.2 Macros

In 'macros.txt' a formatted plain text file with the macros can be found. The formatting follows the format used in section 4.2.

This file can be found at: http://www.let.rug.nl/vannoord/Scripties/PatrickDerichs/macros.txt

9.3 get baseline data script file

In 'getBaselineData.py' one can find the Python script that calculates the baseline features per text. The text file must have the same name as its folder; for example, in folder 'Text 001', the text file must be named 'Text 001.txt'. The script creates a file with baseline features for each folder; in the case of 'Text 001', the file 'Text 001 baseline features.txt' would be created.

This file can be found at: http://www.let.rug.nl/vannoord/Scripties/PatrickDerichs/getBaselineData.py

9.4 Naïve Bayes classifier script file

The classifier script reads one feature file per text. The file needs to look as follows, where the name and the value are separated by a tab:

language_level	B2
tangconstructie_bijzinnen	3
tangconstructie_bijvnw	0
bijzinnen	6
lange aanloop	2
nominalisaties	11
voorzetselketens	1
passief	3
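A minimal sketch of reading such a file back into a feature dictionary and class label (a hypothetical helper, not the appendix script itself):

def read_features(path):
    features = {}
    with open(path) as f:
        for line in f:
            name, value = line.rstrip("\n").split("\t")
            features[name] = value
    label = features.pop("language_level")  # the class label; the rest are features
    return features, label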
