
Internship report

Research on alignment

Saioa Cipitria Iturria S4100387

MA Applied Linguistics July 2020

Internship supervisor: Marije Michel

Internship supervising lecturer: Rasmus Steinkrauss


Table of contents

1. Introduction

1.1. University of Groningen (RUG): Applied Linguistics & Research

1.2. Research on alignment

2. Assignments and projects during the internship

3. Evaluation of the learning goals

4. Reflection on the process and learning goals

References

Appendices


1. Introduction

1.1. University of Groningen (RUG): Applied Linguistics & Research

The University of Groningen (RUG), founded in 1614, is an academic institution ranked among the top universities for research and education (University of Groningen, n.d.). The quality of the university's education, as well as the opportunity to study in English, are important factors in one's decision to study here.

The Master’s programme of Applied Linguistics (AL) attracted me while I was doing my Erasmus+ programme at the RUG. I attended the presentation of the programme in the Master’s week, and the overview of the courses offered motivated my choice to enrol in this programme.

The AL programme offered the possibility of taking an internship as an elective course. From the very beginning, I found this opportunity highly interesting for my developing career, as I wanted to add some hands-on research experience to my personal development.

One of the staff members of the AL programme, Marije Michel, offered an internship in research, which looked highly interesting to me. We had a meeting to talk about the project and about what was expected of me. I was accepted to collaborate on her research project, together with Dr. Christine Appel (Open University of Catalonia), on alignment in teletandem interactions between learners of English and Spanish. My internship was expected to last from the beginning of February to mid-April. I was highly motivated to participate in this project, as I believed that I could contribute to it by virtue of my background as a highly advanced learner of English and a native speaker of Spanish. In addition, I appreciated the idea of gaining hands-on experience in research, as this could help me determine my future career.

1.2. Research on alignment

The study of alignment is a recent field in which speakers' tendency to repeat one another's linguistic (or non-linguistic) choices is examined (Costa, Pickering & Sorace, 2008). Alignment can occur at many linguistic levels, of which morphosyntactic and lexical alignment were analyzed here.

In the study by Michel and Appel (in preparation), teletandem task-based interactions between L1 and L2 speakers were analyzed by means of lexical and morphosyntactic trigrams. Specifically, the interactions involved native speakers of English and Spanish who were learning each other's language. Half of the conversations were conducted in English, and the other half in Spanish.

2. Assignments and projects during the internship

During my internship, I familiarized myself with the project, transcribed and coded all the data, and presented the results in an Excel spreadsheet. This section explains these tasks in detail.

Firstly, during the first week, I became acquainted with Dr. Christine Appel, a professor at the Open University of Catalonia (Spain) and a collaborator on the project. I was already familiar with the Computerized Language Analysis (CLAN) programme (MacWhinney & Snow, 1990) from the Research Methodology Language Development course.


As some of the interactions had previously been transcribed following the Conversation Analysis (CA) standards, I listened to the recordings and manually changed the text's format to the Codes for the Human Analysis of Transcripts (CHAT) transcription standards. Some extra information, which was not available in the previous document, was added; this included, for example, phrases that the first transcriber had not understood and a more thorough indication of overlapping dialogue. The interactions that had not been worked on before were also transcribed.

Once the transcripts were ready, we split them so that they could be analyzed by the 'MOR' grammar in CLAN. Each interaction was divided into two parts: English and Spanish. The main language of each part (4-5 lines) of the interaction determined into which language file that part was placed.

Subsequently, the '%mor' tier was run, from which the morphology of each word could be derived. I corrected all the 'L2' and '?' results in the '%mor' tier of each transcription file, as we were not interested in whether a word belonged to one language or the other, but rather in its morphology. Thereafter, all the files were put back together, and these were later divided into subtasks.

Different codes were tested in order to find the output that provided an analyzable outcome. When we obtained the 'ideal' code, the coronavirus outbreak hit, which made us pause the analysis of the interactions. The codes we decided to use were the following (morphological and lexical, respectively):

cooccur -t* +t% +t*ST001 +sm|*,o% +s"*" +n3 +o +b 01_001_101_1_0_Questions.cha

cooccur -t% +t*ST001 +sm|*,o% +s"*" -s"&-*" -s"&+*" -s"&=*" +n3 +o +b


The internship was resumed in an online form in May. The codes provided a list of all the trigrams used in a given interaction, which was copy-pasted into an Excel file in order to examine the interactants' between-speaker alignment ratio.

Finally, an extensive methodology report was written (see Appendix A) so that the researchers could retrace what had been done and use it to write the methodology section of their future paper (Michel & Appel, in preparation). Appendix A thoroughly describes the procedures followed and includes visualizations to help understand them.

We held weekly meetings in which I would update Marije and Christine, and they would indicate the direction in which they would like to continue the analysis. In addition, I passed on this information, as well as the knowledge I acquired throughout the process, to a student assistant who needed to analyze text-chat data for another of Marije's projects. For instance, I helped him set up CLAN and showed him the general transcription rules of the CHAT standards.

3. Evaluation of the learning goals

Through this internship, I have obtained the skills needed to work on a research project. These include the capacity to transcribe and code data, as well as knowledge of the procedures to follow when running a research project.

I have mainly worked on transcribing and analyzing the data obtained from the interactions between the English and Spanish language learners. This has contributed to my knowledge of handling data in order to obtain analyzable material, in addition to my


In the aforementioned weekly meetings, some decisions regarding the analysis were taken, such as the decision on the 'cooccur' code to be used. In addition, I pointed out possible problems with some of the ideas proposed (within the limits of my knowledge), and I always tried to help the researchers choose the best option possible. Furthermore, I have developed the ability to propose a number of available options for continuing the analysis. For instance, I tried out different combinations in the 'cooccur' code, with slightly different outputs, until we obtained the exact code that we finally used. These little decisions contributed to my decision-making and problem-solving skills, which are valuable aptitudes in research.

Granted the freedom to choose my own working schedule, I have responsibly arranged my own timetable. Certainly, when the office was still open, I arrived at approximately the same time daily, creating a pattern that I consider to have been beneficial for my productivity during the internship. The concept of not having a tight schedule was new to me; yet, I have been responsible for this aspect of the job.

4. Reflection on the process and learning goals

Reflecting on what I have learned thanks to this internship project, I could highlight different aspects, from simple tasks, such as working both on my own and in a research group, to more complex ones, such as the decisions taken to analyze the data.

I have learned to work quite independently while still asking for help when needed, which is also relevant in a research group. We held weekly meetings in which we discussed how to move forward, and any possible doubts were clarified. The researchers were always very helpful, taking the time needed to resolve any doubts or questions I had.


In addition, I am now aware of all the decisions that one has to take when doing research. Even the smallest change in the decisions taken during the process could change the outcome of the study, which makes data handling a big responsibility. Many of these decisions were discussed with the research project leaders; yet, some freedom was provided. After this experience, I will take a closer look at these minor decisions, as I have observed that they may be of the utmost importance for major papers.

Additionally, thanks to the help provided by Christine Appel, I have managed to create tasks in the SpeakApps platform (SpeakApps) for my thesis project. Many factors needed to be taken into account: the slightly different pictures, the instructions that each of the learners could see, the cognitive load of each subtask, etc. This has been really beneficial, since it provided some practical experience in task creation, complementing the theory on tasks from the Teaching Methodology course. This new skill could help me create more tasks in possible future projects.

I exceeded my expectations in my ability to use CLAN and Excel, programmes that I had only used for a limited number of analyses beforehand. For example, in Excel, the three of us decided on the formulas used for the morphological types, and I adapted these formulas myself in order to fit them to the morphological tokens analysis. I had limited experience in using both CLAN and Excel; thus, I have certainly enhanced my skills with both programmes.

As a downside, it could be noted that the time constraints were not strictly met. I believed the data could be analyzed (including statistics) by the end of the internship period, which proved not to be possible, even after working all the hours that we had signed for. The Covid-19 outbreak might have had an impact on this, since, despite working the same number of hours, I was more productive in the office environment. In order to work from home, it would have been beneficial if the university could have provided me with a laptop with keyboard settings matching those of the RUG desktop. Yet, I quickly adapted to the settings of my own computer. My adaptation to these new situations has been fairly easy, as I always felt comfortable with my work and its environment.

Perhaps I could sometimes have worked more independently; however, I wanted to ensure that I was going in the right direction, and that the researchers agreed with the methodology followed.

All things considered, this internship has enormously developed my research skills, which may be beneficial for a future career in academia. Due to the learning benefits that this internship has provided me, I have decided to continue working on the project (with the approval of both Marije Michel and Christine Appel).


References

Costa, A., Pickering, M. J., & Sorace, A. (2008). Alignment in second language dialogue. Language and Cognitive Processes, 23(4), 528-556. doi:10.1080/01690960801920545

MacWhinney, B., & Snow, C. (1990). The child language data exchange system: An update. Journal of Child Language, 17(2), 457-472. doi:10.1017/S0305000900013866

Michel, M., & Appel, C. (in preparation). Lexical and syntactic alignment during Spanish-English teletandem meetings: Looking at task and language effects.

SpeakApps. (n.d.). SpeakApps. http://www.speakapps.eu/

University of Groningen. (n.d.). University of Groningen. https://www.rug.nl


Appendices

Appendix A

This document can also be found at https://docs.google.com/document/d/1h3pE9vWbNpeIre8BfR0pHg8ededOKrdqkrrYhVtkUG8/edit (it can be viewed by anyone at the University of Groningen). The appendices of that document are included here as 'A.X' (A.A, A.B, ...) in order to clearly distinguish the appendices of each document.

Methodology - Report

For the purpose of studying lexical and morphosyntactic alignment in L1-L2 telecollaboration meetings, the overlapping three-grams were examined, both lexically and morphosyntactically.

Transcription

Eight out of the twelve conversations had previously been transcribed following the Conversation Analysis (CA) transcription guidelines, which did not provide as much detail as the Codes for the Human Analysis of Transcripts (CHAT) transcription (see the basic rules followed in Appendix A.A). For instance, the direction of overlaps was never indicated, which sometimes complicated the reading, especially when several overlaps followed each other (Image 1). In addition, the dialogues were saved in a Word document format, which had to be converted to the Computerized Language Analysis (CLAN) format from the Child Language Data Exchange System (CHILDES) project. The recordings were listened to in order to adjust the transcripts to the CHAT standards in CLAN. Other adjustments were also made, such as turning the first word of each sentence to lowercase, capitalizing 'I' and adding chunks that had previously not been understood. All the interactions were listened to in the Voicewalker app, as this allowed us to play the audio in loops, repeating short chunks of the conversation (5 seconds) three times, which lessened the need to rewind manually. Likewise, following the CHAT standards, the following tiers were written at the beginning and end of the transcript (Table 1) (example taken from 01_001_101_1_0_Questions):

Table 1.​ Tiers in the beginning and end of the transcript, divided by a line.

@Begin

@Languages: eng, spa

@Participants:ST101 St_101 Participant, ST001 Participant

@ID: eng, spa|change_corpus_later|ST101|||||Participant||English_speaker|

@ID: eng, spa|change_corpus_later|ST001|||||Participant||Spanish_speaker|

@Transcriber: TO

@Reviewer: SCI

@Task: 1_0_Questions

@End

1. Languages: The existing languages in the transcript. In this case, 'eng' for English and 'spa' for Spanish.

2. Participants: All the information about the participants. In this case, their number, the languages they speak ('eng' and 'spa'), and their native language as an additional note.

3. Transcriber and reviewer: Initials of the people who handled the transcript. Code added by changing the depfile.

4. Task: Task number. In this case, task 1, subtask 0, called 'Questions'. Code added by changing the depfile.

5. Note: Some of the tiers are compulsory: 'Begin', 'Participants', 'ID' and 'End'.

Image 1. ​Interaction between ST002 and ST102, discussing the colours they see in the ‘Shiny balls’ task. Transcription by the CA standards (left) vs. transcription by the CHAT guidelines (right).

After adjusting the eight aforementioned transcriptions, the four remaining interactions were transcribed in CLAN, once again following the CHAT guidelines. In doing so, all the interactions were ultimately transcribed. All the transcripts were checked by the programme (Esc + L) to look for any possible mistakes. Indeed, it is highly recommended to check every transcript after making even the slightest change to it, before saving the file.

The 'Coder' tier was set up by unlocking the university computer. This proved useful, as we had more freedom to manipulate the programme. See Appendix A.B for further detail on this.

The %mor tier: Division by language

In order to be able to efficiently run the %mor tier, the files were first divided by language; that is, two files were created per transcript. Thereby, the English file was mostly in English and the Spanish file was primarily in Spanish. Some words from the other language were sometimes included, namely code-switching instances and short answers in the other language. The points at which a piece of the interaction did not directly follow the previous one, i.e., where a piece belonging to the file of the other language intervened, were marked, since our aim was to build the conversation up again after running the %mor tier.

It is noteworthy that some parts of the interaction could not be divided in such a straightforward way. For example, some participants decided that each of them would speak in their L2; that is, the native Spanish speaker would talk in English, whereas the native English speaker would talk in Spanish. This made it difficult to divide the interaction per language, as the lines alternated between languages. Thus, in this case, the transcript was analyzed by the %mor code in one of the languages. This language was chosen based on the amount of text in the interaction, i.e., if there was slightly more text in English, the English %mor was run; yet, it was not always clear-cut. The rest of the interaction, which was not automatically analyzed, was handled manually using the transcriber's knowledge of both languages. For instance, when there was a Spanish word in a predominantly English interaction (a code-switch), this word was not analyzed by CLAN, as the output would only show a question mark. This question mark was removed and the morphology of the Spanish word was added.

The %mor analysis was run per language file, using the grammars incorporated in CLAN. The English grammar (ENG) was employed for the English fragments of the interaction, whereas the Spanish grammar (SPA) was used for the Spanish sections. The following code was used to run the %mor tier, where the transcript '01_001_101_English' is used by way of illustration:

mor +d1 01_001_101_English.cha

This provided us with the morphological analysis of the interactions. In the event that a Spanish word was used in the English transcript, CLAN would label it as 'L2'. The morphology of these words, in addition to that of the words that CLAN did not identify (labelled as '?'), was thus manually specified. It should be noted that, in the Spanish '%mor' tier, all the words were automatically translated into English (e.g., pro:sub|yo=I). For instance, see Image 2 below, where the %mor tier is added in the English language file; a small illustrative script for locating such labels is sketched after the image.


Image 2.​ The %mor tier is added to this interaction in English between the speakers ST010 and ST110.
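These 'L2' and '?' labels were corrected by hand in CLAN. Purely as an illustration, a small helper along the following lines could list the items that still need attention; the file name and the exact label format are my own assumptions, not part of the project:

def unparsed_mor_items(chat_path):
    # List %mor items that the grammar could not analyze
    # (assumed to show up as '?' or with an 'L2' label).
    # Note: the utterance-final '?' mark would also be caught,
    # so the resulting list may need a quick manual filter.
    hits = []
    with open(chat_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if line.startswith("%mor:"):
                for item in line.split()[1:]:
                    if item == "?" or item.startswith("L2"):
                        hits.append((lineno, item))
    return hits

# Example (hypothetical file name):
# for lineno, item in unparsed_mor_items("01_001_101_English.cha"):
#     print(lineno, item)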

Once the %mor tier was run, the transcripts in different languages of each interaction were brought together, i.e., copy-pasted into a new CLAN file, ensuring that the conversation followed the same order as in its first version. This resulted in a complete transcription (see Image 3).


Image 3.​ The entire transcription (English and Spanish) with the %mor tier. Interaction between ST009 and ST109.

Division into sub-tasks

Once all the transcripts were whole again and contained the %mor tier, they were divided into tasks. That is, five different files would arise from an interaction consisting of a single task, these comprising the free dialogue parts and each of the four subtasks. This would later allow us to compare the proportion of alignment by task type. See Figures 1, 2 and 3 below for a visualization, and Appendix A.C for screenshots of the different tasks.


Figure 1.​ Division of sub-tasks in Task 1. Conversation 01_001_101 as an example. Conversation number _ Student 1 number _ Student 2 number _ Task number _ Subtask number _ Name of the task

In this case, it is the conversation pair number 01, with students 001 and 101, performing task 1, with the subtasks 0_Questions, 1_Car, 2_Dogbowl, 3_Shinyballs, 4_Umbrella and 5_Wrapup. The numbers reflect the order in which the subtasks appeared, thus, this number and the name of the task (e.g. 2 and Dogbowl) would always go together provided that it was part of Task 1. The names of the tasks, often based on the difference itself, were written to help us identify which task the students were performing.

The first digit of the student numbers, i.e., 001 and 101, indicated the native language of the speaker: all the students whose number started with 0 were native speakers of Spanish, and all of those starting with 1 were native speakers of English.


Figure 2.​ Division of sub-tasks in Task 2. Conversation 06_006_104 as an example.

Following the same rules as in Figure 1. In this case, conversation number 06, between speakers 006 and 104, performing task 2 and the following number referring to each of the subtasks.

Figure 3. Division of sub-tasks in Task 3. Conversation 07_006_106 as an example.

Following the same rules as in Figure 1. In this case, we can see conversation number 07, with the Spanish native speaker 006 and the English native speaker 106, accomplishing task number 3, and the following number referring to each of the subtasks.

Morphological analysis

Before starting our analysis, all the commas (cm|cm) were removed from the %mor tier by means of the 'Find & Replace' option (replacing them with nothing), in order to get more accurate morphological three-grams, as the commas were not important to our analysis.

A new file was opened in which, after setting the right directory, i.e., the one where the transcript being analyzed is located (e.g., x:\MyDesktop\Internship\CLAN\01_001_101), the following code was run (the transcript 01_001_101_1_0_Questions is used as an example):

cooccur -t* +t% +t*ST001 +sm|*,o% +s"*" +n3 +o +b 01_001_101_1_0_Questions.cha

Meaning of each part:

● cooccur → Search for anything occurring together.
● -t* → Exclude data on the main speaker tiers from the analysis.
● +t% → Include data from the dependent (%) tiers in the analysis.
● +t*ST001 → Include data from a given speaker (e.g., ST001).

● +sm​|* → +sm automatically targets the %mor tier, thus we are looking for everything (*) occurring in the %mor tier.

● o% →

● +s"*" → Find word sequences, in this case, any sequence. If a specific sequence is needed, we could type: +s"in the tree" (example from CLAN).

● +n3 → Three-grams.
● +o → Output ordered by descending frequency of occurrence.


● +b → Match words specified by +/-s only at the beginning of a cluster.
● 01_001_101_1_0_Questions.cha → File to be analyzed.

This code searched for all the three-grams (n3) produced by the speaker 'ST001' in the transcript of the Questions subtask, in the instance provided. The output in CLAN showed all the three-grams ordered by the number of times they were used by that same speaker (see Image 4).

Image 4.​ CLAN output after running the ‘cooccur’ code.

As can be observed, in this specific transcript, there were 10 instances of a preposition being followed by a determiner article and a noun; 6 occurrences of a modal followed by a personal pronoun and a verb; 4 instances each of n prep det:art, n prep n and pro:obj (e.g., him) v inf; and so on.
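For readers who want to follow the logic without CLAN at hand, the counting that 'cooccur +n3' performs on the %mor tier can be approximated with a short Python sketch. This is purely illustrative and was not part of the project; it assumes a simplified %mor format and keeps only the part-of-speech label of each item, which matches the 'prep det:art n' style of three-grams shown above (the real cooccur output may retain more detail).

from collections import Counter

def mor_trigrams(chat_path, speaker):
    # Rough stand-in for 'cooccur +n3' on the %mor tier of one speaker.
    # Assumes every %mor line belongs to the speaker line directly above it.
    counts = Counter()
    current = None
    with open(chat_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("*"):
                current = line[1:].split(":", 1)[0]          # e.g. 'ST001'
            elif line.startswith("%mor:") and current == speaker:
                # 'prep|in det:art|the n|tree' -> ['prep', 'det:art', 'n']
                tags = [item.split("|", 1)[0] for item in line.split()[1:]]
                for i in range(len(tags) - 2):
                    counts[" ".join(tags[i:i + 3])] += 1
    return counts

# Example call (file name taken from the report):
# print(mor_trigrams("01_001_101_1_0_Questions.cha", "ST001").most_common(10))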

Morphological types

Subsequently, this output was copied into Excel, where a column was created for each speaker of a given transcript. The frequency numbers were, at first, eliminated, as our purpose was to find the three-gram types that were used and aligned in a between-speaker fashion. To examine this, all the constructions were copied to a common column and sorted alphabetically. This way, by means of a formula that allowed for comparison (=IF(F2=F3; 1; 0)), we could easily spot the three-grams that were aligned. Finally, the alignment ratio was calculated by dividing the number of aligned constructions by all the existing types in the given transcription. Image 5 displays the resulting file.

Image 5. Excel sheet resulting from the morphological analysis on types. Interaction 01_001_101_1_0_Questions.

1. Column A: Name of the transcript being analyzed.

2. Column B: Three-grams produced by ST001, ordered by frequency.

3. Column D: Three-grams produced by ST101, ordered by frequency.

4. Column F: Sum of all the three-grams produced by both speakers, alphabetically ordered.

5. Column G: Aligned constructions (1), non-aligned constructions (0). Formula used (in G2): =IF(F2=F3; 1; 0). The column has conditional formatting in which cells equal to 1 are filled with a yellow colour.

6. Column H: Total number of between-speaker aligned constructions (H2); the possibilities of alignment, i.e., all the non-aligned three-grams used in the interaction (H4). Formulas used: (in H2) =SUM(G1:G470) (470 being the total number of constructions); (in H4) =469-H2 (469 being 470-1, as the first cell does not contain a three-gram).

7. Column I: Alignment ratio. Formula used: =H2/H4.

Once the morphological types in all the transcriptions had been analyzed, all the alignment ratios were put into a common Excel sheet, where we were able to calculate their mean and SD values. See Image 6 for an illustration; a short illustrative sketch of this calculation is given below Image 6.


Image 6.​ All the alignment ratios on morphological types from real couples compiled in a single Excel sheet.
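The type-level bookkeeping described in this section can also be expressed in a few lines of Python: the resulting ratio amounts to the number of shared three-gram types divided by the number of distinct types used by either speaker. The sketch below is only an illustration of that calculation (the project itself used Excel), and the example lists are invented:

def type_alignment_ratio(trigrams_a, trigrams_b):
    # Mirrors the Excel sheet: the aligned types (the =IF(F2=F3; 1; 0) column, summed)
    # divided by the remaining possibilities of alignment.
    types_a, types_b = set(trigrams_a), set(trigrams_b)
    aligned = len(types_a & types_b)
    possibilities = len(types_a) + len(types_b) - aligned   # distinct types used by either speaker
    return aligned / possibilities

# Invented example lists standing in for the cooccur output of two speakers:
st001 = ["prep det:art n", "mod pro v", "n prep n"]
st101 = ["prep det:art n", "pro v adv"]
print(type_alignment_ratio(st001, st101))   # 1 shared type out of 4 distinct -> 0.25
# In the project, one such ratio per subtask file was collected in a common
# sheet, and the mean and SD were then calculated over all files.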

Morphological tokens

Further, we were interested in the alignment ratio regarding the morphological tokens, since, this way, their alignment would be better reflected. Indeed, if the two speakers used a particular construction more than once, counting the tokens would reflect the number of times this construction was used. Therefore, the following code was run again, resulting in the output we saw in Image 4:

cooccur -t* +t% +t*ST001 +sm|*,o% +s"*" +n3 +o +b 01_001_101_1_0_Questions.cha

The list was copy-pasted into Excel; yet, this time, the numbers were not eliminated. As we aimed at having the number separated from the three-gram, i.e., in another column, we first looked for any feature that could divide these. We found that there were two space characters between the beginning of the cell, the number and the three-gram (Space Space Number Space Space Three-gram). Therefore, by using the 'Find & Replace' function, we replaced two space characters with a semicolon. Numbers that contained two digits only had one space before the number; thus, a semicolon was manually added to these instances, since we wanted all the numbers to be in a common column. All in all, this allowed us to divide the data into different columns with the 'Text to Columns' function in Excel. In its options, we chose the 'delimited' data type, with the semicolon as the delimiter.

Once the numbers and the three-grams occupied different cells, the instances of both participants were copied into common columns, resulting in two identical lists. One of the lists was ordered from largest to smallest, that is, the three-gram that occurred most often was on top, whereas the other list was ordered alphabetically, i.e., from A to Z. Both columns were ordered with the 'Sort & Filter' function available in Excel.

Next to the alphabetically ordered list, we used the same IF function as above; yet, this time, when two three-grams were identical, the number of their occurrences was counted. The formula used for this was =IF(L2=L3; K2+K3; 0). This formula was extended throughout the whole list of three-grams. The column had conditional formatting in which values higher than 0,9 were highlighted in red. The sum of this list was calculated with the formula =SUM(M1:M470), just as for the morphological types above. To calculate the non-aligned tokens, the number resulting from the last formula was deducted from the total number of instances of three-grams, including the times in which they were used. Finally, the alignment ratio was calculated by dividing these two values: =N2/N4. See Image 7 for the resulting document; a short illustrative sketch of the same calculation follows the column overview below.


Image 7.​ Excel sheet resulting from the morphological analysis on tokens.

1. Column A: Name of the transcript being analyzed. In this case, interaction 01_001_101, performing the Questions task.

2. Column B: Number of instances of each morphological three-gram by speaker ST001.

3. Column C: Morphological three-grams by speaker ST001.

4. Column E: Number of instances of each morphological three-gram by speaker ST101.

5. Column F: Morphological three-grams by speaker ST101.

6. Columns H and I: All three-grams, by both speakers, ordered in a 'Largest to Smallest' fashion.

7. Columns K and L: All three-grams, by both speakers, ordered alphabetically.

8. Column M: Identifying the aligned constructions. Formula: =IF(L2=L3; K2+K3; 0).

9. Cell N2: The sum of all the aligned constructions. Formula: =SUM(M1:M470). Note: 470 in this particular case, as there were 470 different constructions. This number varies per analysis.

10. Cell N4: The non-aligned constructions. Formula: =566-N2. Note: 566 in this particular case, as it was the sum of Column K. This number varies per analysis.

11. Column O: Alignment ratio. Formula: =N2/N4.
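The same token-level result can be reached without the Find & Replace and Text to Columns steps by parsing the cooccur output directly. The following Python sketch is again only an illustration of the calculation, under the assumption that each output line consists of a count followed by the three-gram; the example lines are invented:

def parse_cooccur(lines):
    # Turn lines such as '  10  prep det:art n' into {'prep det:art n': 10}.
    counts = {}
    for line in lines:
        parts = line.strip().split(None, 1)      # first field = count, rest = three-gram
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] = int(parts[0])
    return counts

def token_alignment_ratio(counts_a, counts_b):
    # Aligned tokens: the occurrences (by both speakers) of the three-grams they share,
    # i.e. the =IF(L2=L3; K2+K3; 0) column summed; divided by the non-aligned tokens.
    shared = set(counts_a) & set(counts_b)
    aligned = sum(counts_a[t] + counts_b[t] for t in shared)
    total = sum(counts_a.values()) + sum(counts_b.values())
    return aligned / (total - aligned)

# Invented example output for two speakers:
st001 = parse_cooccur(["  10  prep det:art n", "  6  mod pro v"])
st101 = parse_cooccur(["  4  prep det:art n", "  3  pro v adv"])
print(token_alignment_ratio(st001, st101))   # (10+4) / (23-14), roughly 1.56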

Lexical analysis

Lexical tokens

In order to make sense of the three-grams we had observed until now, we ran another code, which addressed the actual words that the interactants used, i.e., the lexicon. Before running the code, all the '¿' and '¡' symbols (needed in Spanish before a question or an exclamation) were eliminated from the tasks performed in Spanish (i.e., 2_Dogbowl and 4_Umbrella), as well as from the 'mixed' tasks (i.e., 0_Questions and 5_Wrapup) up until interaction number 03_003_103, as the remaining files did not contain any of those. Indeed, these symbols counted as a word in our analysis, which caused some three-grams to consist of only two actual words. Subsequently, the following code was run:

cooccur -t% +t*ST001 +sm|*,o% +s"*" -s"&-*" -s"&+*" -s"&=*" +n3 +o +b 01_001_101_1_0_Questions.cha

● cooccur → Search for anything occurring together.
● -t% → Exclude data from the % tiers from the analysis.
● +t*ST001 → Include data from a given speaker (e.g., ST001).
● +sm|* → +sm automatically targets the %mor tier, thus we are looking for everything (*) occurring in the %mor tier.
● -s"&-*" → Exclude fillers (&-ah, &-eh, etc.) from appearing as words in our output.
● -s"&+*" → Exclude unfinished words from our output, e.g., '&+um umbrella'.
● -s"&=*" → Exclude actions, such as &=laughs, from appearing as words in our output.
● +n3 → Three-grams.
● +o → Output ordered by descending frequency of occurrence.
● +b → Match words specified by +/-s only at the beginning of a cluster.
● 01_001_101_1_0_Questions.cha → File to be analyzed.

The output list was copy-pasted to Excel, where we followed the same steps as for morphological tokens. Aligned constructions were highlighted in green. See Image 8 for the resulting document.

Image 8.​ Excel sheet resulting from the lexical analysis on tokens.

1. Column A: Name of the transcript being analyzed. In this case, interaction 01_001_101, performing the Car task.

2. Column B: Number of instances of each lexical three-gram by speaker ST001.

3. Column C: Lexical three-grams by speaker ST001.

4. Column E: Number of instances of each lexical three-gram by speaker ST101.

5. Column F: Lexical three-grams by speaker ST101.

6. Columns H and I: All three-grams, by both speakers, ordered in a 'Largest to Smallest' fashion.

7. Columns K and L: All three-grams, by both speakers, ordered alphabetically.

8. Column M: Identifying the aligned constructions. Formula: =IF(L2=L3; K2+K3; 0).

9. Cell N2: The sum of all the aligned constructions. Formula: =SUM(M1:M250). Note: 250 in this particular case, as there were 250 different constructions. This number varies per analysis.

10. Cell N4: The non-aligned constructions. Formula: =288-N2. Note: 288 in this particular case, as it was the sum of Column K. This number varies per analysis.

11. Column O: Alignment ratio. Formula: =N2/N4.

Lexical types

Lexical types were also examined, as this would later allow us to compare them to the alignment ratio for lexical tokens. Thus, the Excel sheets used for the lexical tokens were copy-pasted into new folders, where we eliminated the numbers. As we did for the morphological types, the formula in Column G was =IF(F2=F3; 1; 0). See Image 9 for the resulting Excel sheet.


Image 9.​ Excel sheet resulting from the lexical analysis on types.

1. Column A: Name of the transcript being analyzed. In this case, interaction 01_001_101, performing the Car task.

2. Column B: Lexical three-grams by speaker ST001.

3. Column D: Lexical three-grams by speaker ST101.

4. Column F: All three-grams, by both speakers, ordered alphabetically.

5. Column G: Identifying the aligned constructions. Formula: =IF(F2=F3; 1; 0).

6. Cell H2: The sum of all the aligned constructions. Formula: =SUM(G1:G250). Note: 250 in this particular case, as there were 250 different constructions. This number varies per analysis.

7. Cell H4: The non-aligned constructions. Formula: =249-H2. Note: 249 in this particular case, i.e., the 250 constructions in Column F minus 1. This number varies per analysis.

8. Column O: Alignment ratio. Formula: =H2/H4.


Fake couples

Fake couples, i.e., pairs of two speakers who never talked to each other, were created in order to analyze their alignment ratio. These couples consisted of NS-NS or NS-NNS pairs. By doing so, we could determine whether the alignment ratios in the real couples reflected actual alignment, or whether it was merely the task that elicited certain constructions more than others. This was done for all the analyses described above, that is, morphological and lexical types and tokens.
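To illustrate the idea, the ratio functions sketched above could be reused on such re-combined pairs. The pairing logic below is an invented Python example (the speaker codes and pairs are hypothetical), not the actual procedure or data used in the project:

from itertools import combinations

def type_alignment_ratio(trigrams_a, trigrams_b):
    # Same type-level ratio as in the earlier sketch.
    a, b = set(trigrams_a), set(trigrams_b)
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def fake_couple_ratios(trigrams_by_speaker, real_pairs):
    # Baseline: compute the ratio for every pair of speakers that never talked.
    real = {frozenset(pair) for pair in real_pairs}
    return {
        (s1, s2): type_alignment_ratio(trigrams_by_speaker[s1], trigrams_by_speaker[s2])
        for s1, s2 in combinations(sorted(trigrams_by_speaker), 2)
        if frozenset((s1, s2)) not in real
    }

# Invented toy data:
trigrams = {
    "ST001": ["prep det:art n", "mod pro v"],
    "ST002": ["prep det:art n", "n prep n"],
    "ST101": ["prep det:art n", "pro v adv"],
}
print(fake_couple_ratios(trigrams, real_pairs=[("ST001", "ST101")]))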

Identifying Excel sheets by colour

As you may have noticed, the aligned constructions of each analysis (morphological / lexical, types / tokens) carried a different colour. Table 2 shows the colour used for each of the analyses. This was done to easily identify which type of analysis was performed in each sheet.

Table 2. The different colours used to identify the different Excel sheets at first glance.

              Morphology    Lexicon
Types         yellow
Tokens        red           green


References

MacWhinney, B., & Snow, C. (1990). The child language data exchange system: An update. Journal of Child Language, 17(2), 457-472. doi:10.1017/S0305000900013866


Appendix A.A

Main CHAT transcription rules and codes

Slightly adapted from Rasmus Steinkrauss - Research Methodology course

@tiers

● At the beginning and end of the whole transcript:
○ @Begin
○ @End

● The following three tiers may be inserted automatically using the menu ​Tiers > ID Headers​:

○ @Languages: →

■ Followed by three-letter code for language(s) used in the transcript. See p. 31 in the CHAT manual for all language codes.

○ @Participants: →

■ Followed by the speaker’s number + their role in the transcript (in this case, participant)

○ @ID: →

■ Automatically inserted through Tiers > ID Headers.
○ @Comment: →

■ Inserted wherever needed


● *SPEAKER: →

● The utterance has to end on . or ! or ?

○ These characters cannot happen within the utterance.
● Only names and the pronoun "I" start with a capital letter
○ The start of an utterance is a lower case letter
● No numbers within the utterance → Write them out
● Overlap: [>] and [<] (Image A1)

○ [>] means the overlap is with the next utterance, [<] means it is with the previous utterance.

○ Mark with < > the chunks that are in overlap

Image A1​. Interaction between ST008 and ST108 with several overlaps.

As can be observed, the two participants say ‘el tiempo’ at the same time; ST108 says ‘todos los días’ while the other one laughs; and participant ST108 says ‘llueve’ while ST008 says ‘raining’.

● Interruption: +/.

○ If an utterance is interrupted, write +/. at the end -- Only at the end of the utterance


● Unfinished word: &+word
● Partly pronounced words: ( )
○ *ST101: (be)cause I thought…

● Pause: (.)

● Code switches: @s -- word@s

● Focus on language: @l -- letter@l
○ Used when spelling letters.
○ *ST101: You spell window like w@l i@l n@l d@l o@l w@l.

● Unintelligible word(s): xxx

● Laughter: &=laughs -- More of these in the CHAT manual (p. 64)
○ &=coughs, &=gasps...
● Filled pauses: &- -- &-eh, &-ah, &-oh, &-ehm, &-hm
○ Also: confirmation -- &-huhuh.


Appendix A.B

Unlocking the university computer

For the purpose of having the transcriber and the task number clearly indicated in each of the CLAN files, we wanted to add the '@Coder' and '@Task' tiers. However, the software installed on the computers at the University of Groningen prevents students from making changes to programs, so we had to unlock the computer from this system, since it did not allow us to make any changes to the 'depfile', where these tiers can be added.

The ICT staff at the university helped us unlock the computer at the office, where we discovered that, in order to unlock a computer, admin rights were needed, i.e., the student assistant needed a p-number to unlock the computer. Once she had obtained a p-number, the computer was unlocked from the university's software. During this installation, several programmes that were regarded as possibly important were also installed, such as the Office package and Skype (for communicating with each other). When the computer was ready, CLAN was downloaded from its website (https://talkbank.org/software/).

Subsequently, we located the depfile (e.g., x:\Documents\CLAN\lib) and made the changes we had aimed for, i.e., changing the depfile by adding the '@Coder' and '@Task' tiers. The programme worked perfectly fine after this.


Appendix A.C

Screenshots of the different tasks

Task 1

Subtask: 0_Questions
Student A (UOC)
Student B (OU / ICD)

Subtask: 1_Car
Student A (UOC)
Student B (OU / ICD)

Subtask: 2_Dogbowl
Student A (UOC)
Student B (OU / ICD)

Subtask: 3_Shinyballs
Student A (UOC)
Student B (OU / ICD)

Subtask: 4_Umbrella
Student A (UOC)
Student B (OU / ICD)

Task 2

Subtask: 0_Questions
Student A (UOC)
Student B (OU / ICD)

Subtask: 1_Sizeballs
Student A (UOC)

Subtask: 2_Camelbirds
Student A (UOC)

Subtask: 3_Fastfood
Student A (UOC)

Subtask: 4_Monkey
Student A (UOC)

Task 3

Subtask: 0_Questions
Student A (UOC)

Subtask: 1_Blueball
Student A (UOC)
Student B (OU / ICD)

Subtask: 2_T-shirt
Student A (UOC)
Student B (OU / ICD)

Subtask: 3_Girls
Student A (UOC)
Student B (OU / ICD)

Subtask: 4_RabbitBear
Student A (UOC)
Student B (OU / ICD)
