
Research Internship report (ReMa Language & Cognition)


Academic year: 2021




Report of Internship at IT University of Copenhagen

September 2020 – January 2021

Course code: LTR000M25
Credits: 25 ECTS
Student name: Anouck Braggaar
Student number: S2672863
Internship institution: IT University of Copenhagen
Period: 01-09-2020 – 15-01-2021
External supervisor: Dr. Rob van der Goot
Internal supervisor: Dr. Gosse Bouma
Date of report: 18-01-2021


Table of contents

Introduction
Placement organization
Internship Activities
  Preceding the placement
  Position in organization
  Tasks
    Annotations
    Experiments
  Output
Evaluation of Learning Outcomes of Internship
Overall Evaluation
  Place in program
  Knowledge & Skills
Conclusion


Introduction

With this report, my internship period at the IT University of Copenhagen has come to an end.

Unfortunately, due to the COVID-19 situation, my internship went somewhat differently than I had in mind when I started my search for a placement. The original plan was to spend three months of the internship in Copenhagen. Just four days before my departure the travel advice changed, and I was no longer able to go to Copenhagen. I was very happy that my supervisors both agreed to do the internship online, so that I could continue working on my project. Luckily the initial project plan was also online-proof, so we did not have to change much about the project itself.

This report focuses on my evaluation of the internship. In it you will find a short description of the placement-providing organization, a brief discussion of the project itself and my activities, and finally an evaluation based on the learning outcomes.

Placement organization

The internship took place at the IT University of Copenhagen, which was established in 1999.¹ As the name suggests, its main focus is IT research, offering for example bachelor's and master's programs in Data Science and Software Development.

My internship took place at the Natural Language Processing research group in the Computer Science department. The group is led by Barbara Plank and consists of several postdocs and PhD students. My supervisor on the project was postdoc Rob van der Goot. The interests of the group are very diverse and there are several different projects, ranging from domain adaptation to transfer learning to speech processing.²

Every week there are lab meetings where the members of the team give updates on their work so far and what they are planning to do. They also use this meeting to discuss new literature and give each other feedback on their plans and papers. There are also biweekly meetings where the bigger NLP-North group meets to discuss literature over lunch.

Unfortunately, due to COVID-19 I was not able to visit the ITU myself. I had hoped to at least visit for a short period, but this was not possible during my internship. Luckily, the team made it possible for me to attend all their meetings, as these were all hosted online (even when the others could all be at the ITU themselves). This way I was still able to attend everything. Of course, you miss the networking and small talk a bit more in this situation, but they certainly made me feel part of the team!

¹ https://en.itu.dk


Internship Activities

In the next paragraphs I will briefly describe the research project and my activities. A short summary can be found in Table 3, which is taken from the Appendix of the ReMa Internship Rules.

Preceding the placement

At the start of 2020 I began thinking about my internship. As it happened, just as I was considering possible projects and places, Rob emailed me about PhD positions at the ITU. These positions were about parsing low-resource languages, and he mentioned that working on Frisian was also a possibility. As Rob had been my supervisor for my bachelor's thesis, which was about Frisian-Dutch code-switch detection, he knew I was interested in this kind of research. So I asked him if I could do a placement at the ITU on this subject, and luckily this was possible. The next step was to come up with a research plan. This plan changed slightly during the internship, as we focused more on data selection than on, for example, few-shot approaches.

Position in organization

For this internship I wrote my own research proposal, which means that my supervisor and I were the only ones working on this project. Luckily, other members of the team were working on similar projects and were able to provide feedback on my work.

Tasks

In the next section I will briefly discuss the project and the tasks I performed. Full results and descriptions of the experiments can be found in the two abstracts we submitted to the workshops. Broadly speaking, we worked on annotating a small set of sentences and on creating a model for dependency parsing of the low-resource language Frisian. We wanted to see if automatic data selection at the instance level (from existing treebanks) could outperform the best single treebank for a new domain/language. We worked on Frisian-Dutch code-switched data from the Fame corpus created by Yilmaz et al. (2016).

Annotations

The first part of the project focused on annotating a set of sentences, both for development and test data. First, Rob and I each annotated 150 sentences in batches of 50. After every batch we discussed our annotations and decided what the best option was. Table 1 shows the agreement scores between the rounds. For the other 250 sentences I did the annotations and Rob checked them. Again, we discussed difficult cases; when we did not know the best solution we sometimes checked with other people. Annotation took up quite a lot of time, especially in the beginning, when I did not have much experience with annotating.
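The kind of agreement score shown in Table 1 can be computed by comparing the two annotators' trees token by token, in the style of UAS/LAS. The sketch below is my own illustration with made-up toy annotations, not our actual data or scripts:

```python
def attachment_agreement(ann_a, ann_b):
    """UAS/LAS-style agreement between two annotations of the same
    sentences: each annotation is a list of sentences, each sentence a
    list of (head, deprel) pairs per token. Returns (unlabeled, labeled)
    agreement as fractions of tokens."""
    total = same_head = same_head_label = 0
    for sent_a, sent_b in zip(ann_a, ann_b):
        for (head_a, rel_a), (head_b, rel_b) in zip(sent_a, sent_b):
            total += 1
            if head_a == head_b:
                same_head += 1
                if rel_a == rel_b:
                    same_head_label += 1
    return same_head / total, same_head_label / total

# Toy example: two annotators, one three-token sentence; they agree on
# all heads but disagree on one dependency label.
a = [[(2, "nsubj"), (0, "root"), (2, "obj")]]
b = [[(2, "nsubj"), (0, "root"), (2, "obl")]]
print(attachment_agreement(a, b))  # (1.0, 0.6666666666666666)
```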

Annotation was especially hard because the data consisted of spoken Frisian, and there are not yet many treebanks for spoken and code-switched data. There is some previous work that focuses on the creation of a treebank for spoken code-switched Turkish-German (Çetinoğlu and Çöltekin, 2019), a treebank for an Arabic dialect that contains code-switching (Seddah et al., 2020), and a treebank for spoken Komi-Zyrian with switches to Russian (Partanen et al., 2018). We tried to use the existing guidelines as much as possible, and in some cases adapted them or made up our own. We found that we had few issues with the code-switched nature of the data. Figure 1 shows an example of one of the sentences that we annotated. As can be seen, this sentence stops after "en dan", while normally after these words the sentence would continue. Because of the spoken nature of our data this occurs often.

Table 1: Scores between annotators.

At the end of the annotation process I already began some of the experiments to make the work a bit more diverse.

Experiments

We decided to work with Latent Dirichlet Allocation (LDA, topic modelling) to select instances from existing treebanks. We also tried Gaussian Mixture Models (GMMs) at the beginning, but found that this algorithm took much longer to run than LDA, so we decided to go with LDA. Previous work on selection methods focuses on domains (Plank and Van Noord, 2011) or on parser selection (Litschko et al., 2020). Therefore, we thought that our method was quite novel in the context of dependency parsing.

We tried 24 existing treebanks in single-treebank experiments. The eight highest-scoring ones (on LAS) were used for LDA. The best single treebank was Dutch Alpino; this was our baseline, together with a run on the eight treebanks simultaneously. We ran LDA with 8, 16, 32, 64 and 128 components/topics, and selected the 1000, 2000 or 4000 most similar sentences (to see if the amount of data has an influence). The sentences were ranked by their Euclidean distance from the Frisian data (for the LDA model we used Frisian sentences that were not in our development/test set).
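The selection procedure described above can be sketched as follows. This is a minimal illustration using scikit-learn with made-up toy sentences; the actual experiments used full CoNLL-U treebanks and different preprocessing:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def select_instances(candidates, target_sents, n_topics=8, k=2):
    """Rank candidate treebank sentences by their Euclidean distance
    (in LDA topic space) to the centroid of the target-language
    sentences, and keep the k closest ones."""
    vec = CountVectorizer()
    X = vec.fit_transform(candidates + target_sents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topics = lda.fit_transform(X)  # document-topic distributions
    cand_topics = topics[:len(candidates)]
    centroid = topics[len(candidates):].mean(axis=0)
    dists = np.linalg.norm(cand_topics - centroid, axis=1)
    return [candidates[i] for i in np.argsort(dists)[:k]]

# Toy demonstration with made-up Dutch/English/Frisian-like sentences.
candidates = ["de hond rent snel", "the cat sleeps",
              "it wetter is kâld", "a dog runs"]
target = ["it wetter is djip", "de hûn rint hurd"]
print(select_instances(candidates, target, n_topics=4, k=2))
```

In the real setup one would fit LDA on bag-of-words representations of all candidate treebank sentences plus the held-out Frisian sentences, then take the 1000–4000 closest candidates as training data.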

We ran the experiments using MaChAmp (Van der Goot et al., 2020), a deep biaffine parser initialized with mBERT. Table 2 shows our best model, which used 128 components and 2000 sentences, compared to both baselines. As can be seen, our model outperforms the baselines on LAS and UAS, but unfortunately not on POS.

We also did some extra experiments to see if we could improve our best scores. Muller et al. (2020) have shown that transliteration to the script of a related high-resource language can be quite helpful. Although Dutch and Frisian share the same script, we saw that Frisian contains a lot of diacritics, which resulted in many unseen wordpieces and differences in tokenization. Therefore, we decided to remove the diacritics. On development data this achieved a LAS of 55.8, which was lower than our best model. We also tried to make our training data (the sentences selected using LDA) more similar to our Frisian data (like the example in Figure 1) by cropping some sentences and adding orphan relations; Vania et al. (2019) have shown that similar methods of data augmentation can be very helpful. Our modifications resulted in a LAS of 56.5 on development, also lower than the best model. Lastly, we replaced mBERT with XLM-R, which resulted in a LAS of 57.0 on development data, again slightly lower than our best model.
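The diacritic-removal step can be implemented with Unicode decomposition: split each accented character into a base character plus combining marks, then drop the marks. A minimal sketch, not our exact script:

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD) and drop combining marks, so that
    e.g. Frisian 'ú' and 'â' fall together with plain 'u' and 'a',
    reducing unseen wordpieces in an mBERT-style tokenizer."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("it wie dúdlik en kâld"))  # prints: it wie dudlik en kald
```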

In the papers we submitted to the workshops you can find a full discussion of the results, including more results on development data and an analysis of the errors.

Output

We made two submissions to workshops. The first was a two-page abstract for Resourceful-2020,³ a workshop focusing on the creation of resources for low-resource languages. Our paper was called "Creating a Universal Dependencies Treebank of Spoken Frisian-Dutch Code-switched Data". This abstract was accepted and we gave a presentation about it at the workshop.

³ https://gu-clasp.github.io/resourceful-2020/index.html

Table 2: Results of the best model (128 components/2000 sentences) versus the baselines. Dev averaged over 5 random seeds, test with the best random seed on dev.


A second paper of four pages was submitted to the Adapt-NLP (2021) workshop and is still pending. This paper focuses less on the annotation process and more on the experiments. Because this is an anonymous submission and still under review, I will not mention the title here.

⁴ https://universaldependencies.org

- Meetings (lab meetings, individual meetings): I attended the weekly Monday meetings of the research group at the ITU. Every week we discussed some papers and our plans for the next week. I also attended the biweekly meetings of the lab, where we also discussed papers. Every week I had a meeting with my supervisor Rob van der Goot to discuss the progress of the project.

- Material preparation for experiment(s): For our experiments we first needed to annotate some data with POS tags and Universal Dependencies. We used the data from the Fame corpus by Yilmaz et al. (2016), so we did not need to collect data.

- Learning a specific research technique: I worked with a BERT-based parser named MaChAmp for our main experiments. As I had not worked with such a parser before, I learned quite a lot from this.

- Learning specific analyses: We analyzed the outputs of our experiments and also did significance testing on our runs.

- Corpus construction: We annotated a set of sentences and will submit them for the next version of Universal Dependencies.⁴

- Data analysis: This touches upon the specific analyses mentioned above. We analyzed our data to see how we could improve the parser and what the most common mistakes were. We did something similar during the creation of our annotations: after batches of 50 sentences we discussed and revised the annotation guidelines.

- Software development: For the project I had to develop small programs, such as a topic-modelling program to select the sentences most similar to our Frisian data and a program meant to make those sentences more similar to our Frisian data.

- Presentations at internship institute: I presented a paper and our own project at the weekly meetings.

- Abstract preparation for conference: We submitted an abstract to Resourceful-2020 and were accepted for presentation.

- Presentation at conference or workshop: I presented our abstract at Resourceful-2020 in a ten-minute presentation with a short Q&A session afterwards (all online).

- Article preparation: We submitted a paper to Adapt-NLP 2021 (still pending).
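As an illustration of the significance testing mentioned under "Learning specific analyses": a common choice in parsing work is a paired bootstrap over sentence-level scores. The sketch below uses made-up numbers and is not necessarily the exact test we ran:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=10_000, seed=0):
    """Paired bootstrap test: resample sentences with replacement and
    count how often system A's total score beats system B's. Returns an
    approximate p-value for the hypothesis 'A is not better than B'."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_samples

# Made-up per-sentence LAS values for two parsers.
a = [0.82, 0.75, 0.91, 0.68, 0.88, 0.79]
b = [0.80, 0.70, 0.85, 0.66, 0.84, 0.77]
print(paired_bootstrap(a, b, n_samples=2000))  # 0.0: A wins on every sentence
```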


Evaluation of Learning Outcomes of Internship

Here I will discuss the learning outcomes as they were discussed in the placement work plan.

1 Knowledge and Understanding

1.1 Have a thorough knowledge of at least one theoretical and methodological approach within linguistics: dependency parsing for a low-resource language

During this internship I learned a lot about annotating and parsing low-resource languages. As I had only a very basic understanding of parsing and of annotating Universal Dependencies, I certainly learned a lot. Especially because we were annotating spoken data, I think I learned to think a bit more outside the box than when annotating more "standard" sentences. I also learned a lot more about parsing. Developments in parsing move very fast, and in my bachelor's I had mainly focused on older techniques such as SVMs and clustering. During my master's I did get an introduction to neural networks, but this was also limited. During my internship I was able to look into more recent models such as mBERT.

2 Applying Knowledge and Understanding

2.1 Be able to formulate an academic problem independently, and in so doing, to select, apply and where necessary adapt an adequate theoretical framework and one or more relevant research methods.

For this project we worked on annotating and parsing a low-resource language, a field that is still very much in progress. We deviated a bit from the original research proposal and found that data selection/adaptation had not been done in this context yet. My supervisor was very helpful in finding new literature, as I was not very aware of the most recent work on this topic.

2.2 Be able to make an original contribution to knowledge in at least one subdiscipline of linguistics

I think we did make contributions to annotating and parsing Frisian as a low-resource language. Especially for annotation we had to develop guidelines and revise existing ones. For parsing we looked into data selection, which had not been done yet for such a task.

2.3 Be able to independently formulate a research proposal

Before starting the internship, I created a small research proposal, which was attached to the internship contract. During the internship our focus shifted slightly towards data selection, as we saw similar methods being used in different papers.

3 Making Judgements

3.1 To make use of the research results of others and evaluate these critically

For this project I read a lot of papers, and we adapted some of the methods from these papers for our own project (such as domain adaptation and data augmentation). I think the weekly and biweekly meetings were also very useful, as we discussed a broad range of literature that was very relevant for my project. I also attended the workshop on Treebanks and Linguistic Theories (TLT 2020; online) to watch talks relevant to my topic.

3.2 Be able to make connections between their own specialist knowledge of a subdiscipline of linguistics and other related disciplines, for example psychology, neurology or information science

I think that this project was a very good mix of syntax and computational linguistics. In the annotation process I learned a lot about Universal Dependencies and their relation to syntax. The experimental part of the project involved data selection and parsing. For data selection we used LDA, which was a new method for me. Parsing was done with MaChAmp (Van der Goot et al., 2020), a deep biaffine parser with mBERT embeddings. Although it was not essential for the project to fully understand such a parser, I would like to broaden my knowledge of such models.


4 Communication

4.1 Be able to participate actively in a research group working on an academic project

4.2 Be able to work with other students and lecturers on an academic project

Communicating with my supervisor went quite well, although it would have been easier to have been in Denmark. Apart from working with my supervisor on the project, I did not work together much with other students or lecturers, because we were doing a separate project. I did attend the weekly meetings and got feedback from the others. I really enjoyed the weekly meetings, as we also discussed new literature, which was very useful.

4.3 Be able to participate in international academic debate in the chosen area of specialization and to present an academic problem convincingly in English, both orally and in writing

We were able to submit to two workshops; one of these submissions is still pending at the moment. I presented the other one at the Resourceful workshop, where I also got questions and the opportunity to talk to other researchers in the field. I do think that my English speaking skills could be improved. It was also quite difficult to present in front of a camera, as the workshops were held online.

5 Learning Skills

5.1 Be able to keep abreast of the latest developments in linguistics and broaden and deepen their own knowledge and understanding

I think that I am now more aware of recent developments and of how to stay up to date. Especially the group meetings were very helpful. This has helped me to critically read and evaluate papers.

5.2 Be able to reflect on the implications of one’s work for the development of linguistic theories

Our annotations will hopefully be included in the Universal Dependencies treebanks, which means that they will be open for anyone to use. Our results were somewhat better than a single-treebank model trained on Dutch, but not significantly better. It does show that selecting fewer instances can be a fruitful method. Future research can take these results into account to reach better performance for low-resource languages.

Overall Evaluation

Place in program

I think I learned a lot from this internship, and I am very happy with the topic and with how the internship went overall. In this internship I used the skills that I obtained from different courses I took in the master's, ranging from linguistic analysis to methodology and statistics to learning from data. I am also keen to explore the topics of multilingual models and dependency parsing further. Therefore, I am planning to take the course Natural Language Processing next semester, and I will look into this topic for my thesis.

Knowledge & Skills

As becomes clear from the learning outcomes, I have learned a lot during this internship. Besides those points, I have gained a clearer understanding of what it means to do research. This internship has shown me that I do want to pursue a career in academia.

Unfortunately, I did the entire internship from home. This has shown me that I am able to work from home, but also that in some cases this can be quite difficult. At some points it was quite hard to stay motivated. I liked my research topic and the team, but I still missed the personal contact. The contact we did have was quite nice and helped me stay motivated. In the past I sometimes got the feedback that I should ask for help more. In this case I feel like I could have asked for less help and maybe tried more things without asking first.

Next to this, I also think that I could work in a more structured way. I think it was all right, but it could be better. It would help if I documented versions of my programs and so on more carefully. This I will take with me when writing my thesis next semester.

Conclusion

Although the internship was unfortunately entirely online, I did learn a lot. My programming skills were "updated", I learned about the troubles of annotating, and I learned more about parsing. Next to this, I have gained experience in doing research and learned more about what it means to do research. I think it was overall a very nice experience, and I would like to thank my supervisors and everybody from NLP-North. Thanks to the members of the NLP-North team I did feel part of the team, although I have not seen any of them in person!

References

Çetinoğlu, Ö., & Çöltekin, Ç. (2019). Challenges of Annotating a Code-Switching Treebank. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019) (pp. 82-90).

Van der Goot, R., Üstün, A., Ramponi, A., & Plank, B. (2020). Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP.

Litschko, R., Vulić, I., Agić, Ž., & Glavaš, G. (2020). Towards Instance-Level Parser Selection for Cross-Lingual Transfer of Dependency Parsers. arXiv preprint arXiv:2004.07642.

Muller, B., Anastasopoulos, A., Sagot, B., & Seddah, D. (2020). When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models. arXiv preprint arXiv:2010.12858.

Partanen, N., Blokland, R., Lim, K., Poibeau, T., & Rießler, M. (2018). The First Komi-Zyrian Universal Dependencies Treebanks. In Second Workshop on Universal Dependencies (UDW 2018), November 2018, Brussels, Belgium (pp. 126-132).

Plank, B., & Van Noord, G. (2011, June). Effective measures of domain similarity for parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 1566-1576).

Seddah, D., Essaidi, F., Fethi, A., Futeral, M., Muller, B., Suárez, P. J. O., ... & Srivastava, A. (2020, July). Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 1139-1150).

Vania, C., Kementchedjhieva, Y., Søgaard, A., & Lopez, A. (2019, November). A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (pp. 1105-1116).

Yilmaz, E., Andringa, M., Kingma, S., Dijkstra, J., Kuip, F., Velde, H., ... & van Leeuwen, D. A. (2016). A longitudinal bilingual Frisian-Dutch radio broadcast database designed for code-switching research.
