
Research internship report

Dialectometric analysis and automating processing of dialect recordings

February 2020

AUTHOR Raoul Buurke

COURSE Res. Master’s Res Internship Linguistics (LTR011M20)

SUPERVISING LECTURER prof. dr. Martijn Wieling

INTERNSHIP SUPERVISOR Martijn Bartelds, MA

STUDY PROGRAM ReMa Language & Cognition

STUDENT NUMBER S2507501

ORGANIZATION Centrum Groninger Taal & Cultuur

ADDRESS Cascadeplein 4, Groningen

PHONE NR. 050 5992060


Contents

1 Introduction
1.1 Organization
1.2 Dialectometry
1.3 Personal background
1.4 The position
1.5 Learning outcomes

2 Summary per month
2.1 September 2019
2.2 October 2019
2.3 November 2019
2.4 December 2019
2.5 January 2020

3 Evaluation
3.1 Summary of acquired skills
3.2 Learning outcomes
3.3 Contributions to projects
3.4 Supervision
3.5 Career development

4 Conclusion


1 Introduction

1.1 Organization

The Centrum Groninger Taal & Cultuur (Knowledge Centre for Groningen’s Language and Culture;

henceforth CGTC) is an organization focusing on the preservation and promotion of cultural heritage in the province of Groningen. It is the result of a 2017 merger between the former Bureau Groninger Taal en Cultuur (Bureau for Groningen's Language and Culture) and the Huis van de Groninger Cultuur (House of Groningen's Culture). It is funded by both the provincial government and the University of Groningen (UG). Its main objective is to conduct research into the Groningen language (colloquially known as Gronings) and culture that can be presented to the public in various ways[1], ranging from festivals to several smartphone applications. Its two most prominent public activities are the Dag van de Groninger Geschiedenis (Day of Groningen History) and the Dag van de Grunneger Toal (Day of the Groningen Language), which focus on talks and demos about Groningen's culture and language respectively, but also include live music and food. As can be expected from such a set-up, these activities usually draw hundreds to thousands of visitors, which reflects the organization's success at achieving its goals.

The CGTC team is a mixture of prominent artists, managers, and UG staff members. Research within the CGTC is carried out by or under supervision of prof. dr. Goffe Jensma, prof. dr. Martijn Wieling, or dr. Wilbert Heeringa. Goffe Jensma's own research focuses on the cultural heritage of Friesland, but he spearheads the Woordwaark research project, an ongoing endeavor to produce a digital interactive database of the Groningen language. Elements already included or planned are an online dictionary between the Groningen language and Dutch, an interactive map of which words are considered valuable enough by the older generation to be passed on (the Van Old noar Jong/From Old to Young project), and recordings of Gronings words. The latter database is developed within the Stemmen van Grunnen (Voices of Groningen) project, which is where I was placed.

Stemmen van Grunnen is based on an earlier project for Frisian, Stimmen fan Fryslân[2] (Voices of Friesland; see Hilton, 2019). In both cases speech data is collected through a quiz-like web application, which is based on research carried out by Leemann et al. (2016). They introduced the idea of estimating where in the province a speaker comes from on the basis of their 'accent', which is determined by letting participants select their own variant of a word in their standard language. Participants are given a list of options determined on the basis of earlier phonetic research, in this case the Goeman-Taeldeman-Van Reenen project (Taeldeman & Goeman, 1996). In a decision-tree-like fashion the application settles on the most likely geographic source of a speaker's 'accent', i.e. the unique combination of selected variants. Specifically for Stemmen van Grunnen there are 10 standard Dutch words on the list (see Table 1). Note that the standard Dutch word was also always given as an option, so the list of options consisted of the Low Saxon variants plus the standard form, yielding between 3 and 16 options for each phrase. This may seem like too little data to account for the number of possible places of origin across all participants, but note that in principle 459,841,536 unique combinations can be obtained by multiplying the options for each phrase. In reality the number of possible locations will be much lower, because some combinations are impossible, but theoretically there is no reason to be concerned about granularity. After completing the quiz, participants are prompted for basic personal information[3], such as age range (given in 10-year intervals) and gender. Finally, a map of the Netherlands is shown with the three most likely places of origin. Ideally, the participant then also marked their actual place of origin by clicking on the interactive map, but this has not been the case for all participants.

[1]https://www.cgtc.nl/missie/

[2]https://www.cgtc.nl/stemmenvangrunnen/

[3] Regrettably, due to an unexpected technical failure, much of this information was lost for the data that had been collected at the time of writing. These problems have now been fixed for future sessions, but the lost information could not be analyzed within the span of the internship.
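As a sanity check of the combinatorics above, the product of the per-phrase option counts (the Low Saxon variants listed in Table 1 below, plus one for the standard Dutch form) can be computed directly. A minimal Python sketch:

from math import prod

# Options per phrase: the number of dialect variants in Table 1 plus
# one for the standard Dutch form itself.
options_per_phrase = {
    "aarde": 5 + 1, "bed": 3 + 1, "briefje": 11 + 1, "eitje": 10 + 1,
    "huis": 2 + 1, "kinderen": 11 + 1, "muis": 3 + 1, "twee": 6 + 1,
    "vier": 8 + 1, "wij kloppen": 15 + 1,
}

print(prod(options_per_phrase.values()))  # 459841536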


Table 1

The list of words presented in the Stemmen van Grunnen application with their Dutch and Low Saxon variants in IPA.

Standard Dutch word [IPA] Available dialect variants [IPA]

aarde [a:rdə] [o:ədə, ɪ:ədə, ɪ:rdə, ɪət, o:rdə]

bed [bɛt] [bɛ:rə, bɛ:r, bɛdə]

briefje [bri:fjə] [brɑɪfkə, brifkə, brɑɪfki, brɑɪfi, bri:fi, bri:fin, bri:fɪs, brɑɪfin, brefkə, bʀe:əfkən, brefkəs]

eitje [eɪt͡ɕə] [ɑ:ɪkə, ɛɪkə, a:ɪgə, ɑ:ɪxi, ɛɪχi, løʏt͡ɕə æɪ, ɛɪxin, ɛɪxis, ɛɪkən, ɛɪəs]

huis [hœʏs] [hus, hys]

kinderen [kɪndərən] [kɪndər, kɪnər, kɪndə, kinə, kindərs, kindəs, ke:ndər, kɪnn̩, ʋɪxtər, ke:ndə, kinda]

muis [mœʏs] [mus, mys, moʊzə]

twee [tʋeɪ] [tʋæɪ, tsʋæɪ, tʋi, tʋæɪə, tʋɪjə, tʋeɪə]

vier [fi:ər] [fɑɪər, fa:rə, fe:r, fæ:rə, fi:rə, fi:ə, fɛə, fɑ:ɪrə]

wij kloppen [ʋɛɪ klɔpən] [ʋi klɔʔm̩, ʋi klɔp̚m̩, ʋɑɪ klɔpt, ʋeɪ klɵpt, ʋi klɔpən, wulə klɔpt, ʋi klɒpt, ʋi klɒpts, ʋɛɪ klɑʔm, ʋɔɪ klɔp̚m, ʋɪlə klɔp, ʋʏli klɔpə, ʋi klɒp, ʋʏli klɒp̚m̩, ʋɪlə klɔp̚m̩]

1.2 Dialectometry

The field of dialectometry arose in the 1970s and 80s as a result of seminal work by Séguy (1973) and Goebl (1982). The former put forth the idea that it is possible to 'measure dialects', as opposed to arbitrarily deciding how much detail is informative; this was especially a response to prior dialectological studies detailing dialects to such an extent that general patterns in the data were obscured. Séguy decided to instead quantify dialect distance by counting how many lexical items differed between varieties. Goebl took up this idea, but approached the problem by approximating similarity rather than measuring differences, and consequently and successfully put forward clustering techniques. Dialectometric analyses are relatively rare in linguistics proper, chiefly because they are demanding on the researcher: analyses typically require both adeptness at statistical modeling and thorough knowledge of sociolinguistics, a combination that is only infrequently found.

At the UG, dialectometry has been a fruitful field of study due to the combined efforts of prof. dr. John Nerbonne (now professor emeritus), prof. dr. Martijn Wieling, dr. Wilbert Heeringa, and their colleagues over the years. Dialect differences have frequently been investigated by computing Levenshtein distances between words of different language varieties. The Levenshtein distance (LD) was introduced as a general algorithm by Levenshtein (1966) to determine how many changes are required to turn one string into another, although it should be noted that Kruskal (1983) was among the first to suggest its application in linguistics. This straightforward method has been applied numerous times within dialectometry (e.g. Kessler, 1995; Heeringa, 2004; Wieling, 2012), although not always in the same fashion. Kessler used the most 'basic' version of the Levenshtein distance, in which every difference between two strings is uniformly valued. This means that the different operations to get from one string to another (e.g. substitution, deletion, or insertion) are all assigned the value 1. In the scope of Table 2: all variations are computed to have an LD of 1, except [hyzə], because 'changing' it into [hys] requires two operations. Heeringa considered using weighted Levenshtein distances, for example on the basis of several phonological feature bundling systems or acoustic features (e.g. Bark-scaled spectrograms, cochleagrams, and formant tracks). Strikingly, these more gradual methods did not uniformly yield a more sensitive measure of dialect differences, as indicated by the latter group of approaches correlating less with perceptual data.
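To make the algorithm concrete, here is a minimal Python sketch of the basic, uniformly valued Levenshtein distance described above, checked against the examples of Table 2. This is an illustration, not the pipeline's actual implementation:

def levenshtein(a, b):
    # Dynamic-programming Levenshtein distance in which every
    # insertion, deletion, and substitution uniformly costs 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

# The variations of [hys] from Table 2:
assert levenshtein("hys", "ys") == 1    # deletion
assert levenshtein("hys", "hus") == 1   # substitution
assert levenshtein("hys", "hysə") == 1  # insertion
assert levenshtein("hys", "hyzə") == 2  # insertion + substitution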

One may be skeptical as to what degree perception is indeed the 'gold standard' for indexing the success of dialectometric measures, but it has been confirmed to be useful as such by others. Wieling et al. (2014) showed that LD correlates strongly (r = -0.81) with non-native-likeness as well, which moreover indicates the usefulness of LD as an indicator of perceptual variation.


Table 2

Example string operations and corresponding Levenshtein Distance (LD) for variations of [hys].

String | Operation | LD
h y s | (reference) | –
y s | deletion | 1
h u s | substitution | 1
h y s ə | insertion | 1
h y z ə | insertion + substitution | 2

Gooskens & Heeringa (2004) furthermore found high correlations (also around 0.80) between their perceptual and Levenshtein distances of Norwegian dialects, which even more directly shows the usefulness for dialect differentiation. It is especially surprising that the acoustic approaches, which were proposed by Heeringa to reflect perceptual intuitions better than string-based methods, do not outperform the phone-based Levenshtein distances.

1.3 Personal background

Leaving aside the curious effectiveness of the Levenshtein distance in dialectometry, there is no doubt that it is useful. A great benefit is that the algorithm is intuitive and straightforward to execute, which makes it excellent for students to apply in research. This is where I came in. From my own BA background I had little experience with the Levenshtein distance, or for that matter with language variation in general. I was only familiar with the algorithm insofar as I knew of historical linguistics studies that used it to generate phylogenetic trees of language families.

My interest in sociolinguistics was mostly sparked in or shortly after theoretical syntax and semantics classes, in which the main aim was to devise a model that would explain the underlying structure of all languages, thereby explaining everything with as little as possible. Venerable as this practice may be, I have often thought that the richness of language lies not only in its efficient communicative power, but also in the many forms in which it exists and changes. Dialects are especially interesting in this respect, because the multi-levelled microvariation between them has no intrinsic communicative goal. Given this intriguing fact and my longstanding affection for Groningen dialects, I gradually became interested in dialect research.

One last piece of motivation stems from my affinity with and interest in statistical methods and quantitative methodology. As noted before, the combination of statistics and dialect research is spearheaded at the UG. Considerable amounts of data were already available alongside ongoing projects, which provided me with a unique opportunity to investigate different parts of the dialect research process.

1.4 The position

Combining the aforementioned interests virtually entails dialectometry. Prof. dr. Martijn Wieling is actively involved with Stemmen van Grunnen and introduced me to Martijn Bartelds, who supervised my internship. After an initial meeting to discuss possible angles of research, we agreed that it would be beneficial for both the project and Bartelds' PhD research to automate processing of the Stemmen data as much as possible. This neatly aligned with my personal interests and gave me the opportunity to develop a new skill set, which is outlined and evaluated in the following sections.

1.5 Learning outcomes

The learning outcomes, in the form of general tasks, as originally stated at the beginning of the internship were the following:


1. a considerable amount of self-study (i.e. reading books on similarity measures and programming)

2. programming and evaluating the similarity measures (i.e. Levenshtein distances)

3. ideally also programming a web interface for the pipeline output (and the reading required for it)

4. making phonetic transcriptions alongside coding the similarity measures

5. writing up the accompanying documentation and any scientifically related works (e.g. abstracts for conference presentation, or a scientific paper)

The underlying idea was that these tasks would contribute to the Stemmen project, to future parts of Martijn Bartelds' PhD project, and to personal skill development. The first task was to allow me to delve deeply into both programming and dialectometry on my own terms. The second task was especially useful for the Stemmen data, as the results allowed for gauging how well the web application succeeds in what it aims to do: collecting speech data in a rapid and decentralized fashion, by allowing participants to record themselves in uncontrolled environments. The third task was a personal request from Martijn Bartelds, which would be particularly useful for his PhD research, because it allows for indexing the reliability of the data that he plans to use. The fourth task is less interesting in and of itself, but it is simply required and enables the intended analyses. The last task is particularly useful from a career perspective, whether academic or non-academic. Having publications helps in obtaining a PhD position, while having written and documented computer programs is useful on virtually any job market. The degree to which all these learning outcomes have been successfully fulfilled is discussed in the next section.

2 Summary per month

The following subsections are a summary of key activities during the internship.

2.1 September 2019

In the first month of the internship I was introduced to the data through the Woordwaark API. After an initial meeting with my supervisor it became clear that the prime objective was going to be to inspect, order, clean, and combine all relevant data. The form in which this was to be done was relatively free, but it should in the end be a single "pipeline" process that could be repeated daily on a server.

The data comes from two sources: metadata from the API, and audio recordings from a webserver accessed through SSH (a cryptographic network protocol). Combining these two streams of information was a challenge, as it required me to build a web scraper for the first time. I familiarized myself with the concept through internet searches and decided that a combination of two Python libraries was the way to go: requests and BeautifulSoup4.

While coding the web scraper, I read the documentation of BeautifulSoup4[4], as well as Mitchell (2018). This book proved very useful, chiefly because it explained how to traverse HTML trees, which was crucial to successfully retrieving plain text from the different structures contained in the API webpages. Moreover, it showed how useful regular expressions (a ubiquitous concept in programming, which goes back as far as Kleene, 1951) are for retrieving relevant information from complex pieces of text. This, in turn, led me to read Friedl (2006), which enabled building a reliable web scraper and summarizing everything in one single file of metadata.

The web scraper generally functions as follows. First, a BeautifulSoup session is initiated and given the credentials of the API. The main page of the API is then used as a starting point, and all hyperlinks are extracted from the web pages containing participant data. This results in a list of HTML pages, sorted from earliest to latest session. Traversing down each HTML tree, the relevant data is extracted using so-called CSS selectors, which exploit the page structure to pinpoint the relevant text contained in the tables on each web page.

[4]https://www.crummy.com/software/BeautifulSoup/bs4/doc/


When available, the coordinates of the participant's self-selected place of origin were also extracted from the API. Finally, these data were concatenated and exported to a single output text file.
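For illustration, a minimal sketch of this scraping approach is given below. The URL, credentials, and CSS selectors are hypothetical placeholders, since the actual structure of the Woordwaark API pages is project-specific:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.org/api/"  # placeholder for the API root

session = requests.Session()
session.auth = ("user", "password")    # API credentials (placeholder)

# Index the main page and collect hyperlinks to participant pages.
main = BeautifulSoup(session.get(BASE_URL).text, "html.parser")
links = sorted(a["href"] for a in main.select("a[href*='session']"))

rows = []
for link in links:  # earliest to latest session
    page = BeautifulSoup(session.get(BASE_URL + link).text, "html.parser")
    # CSS selectors pinpoint the table cells holding the metadata.
    cells = [td.get_text(strip=True) for td in page.select("table td")]
    rows.append(";".join(cells))

# Concatenate and export to a single output text file.
with open("metadata.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(rows))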

Next to building the web scraper, I gained access to the project's web server via the SSH protocol. Upon inspection of the audio files it became clear that they were not uniformly processed. While the audio container was always WAVE, the audio codecs within them differed depending on the device and browser used by participants. The audio files were therefore often treated as broken by media players. To resolve this, they were converted to proper 16-bit mono WAVE files using the popular FFmpeg tool[5]. The pipeline, at the end of September, looked roughly as in Figure 1.

Figure 1
A schema of the processing pipeline as of September 2019. Metadata stream: index API participant pages → extract text from participant pages → extract place-of-origin coordinates (if available) → export concatenated data to text file. Audio stream: download files from webserver → convert files to WAVE file format.
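The conversion step shown in Figure 1 could be scripted roughly as follows; the paths are illustrative, and the FFmpeg flags select 16-bit signed PCM and a single channel:

import subprocess
from pathlib import Path

for wav in Path("recordings").glob("*.wav"):
    out = Path("converted") / wav.name
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav),
         "-acodec", "pcm_s16le",   # 16-bit signed PCM
         "-ac", "1",               # mono
         str(out)],
        check=True,
    )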

2.2 October 2019

This month was more focused on cleaning up the audio files than on aggregating the metadata. Several ideas were discussed during meetings with my supervisor as well as during the regular speech lab meetings. Initially we settled on first normalizing and subsequently cutting off silences from the audio recordings using Linux command line tools, e.g. SoX[6]. This proved unsatisfactory for the data, because there was so much variation between recordings that the default cutoff values were too lenient. The result was that in many cases a significant amount of noise remained in the recordings, so another, more flexible approach was needed.

I decided to try forced alignment on the audio recordings. I had been introduced to the concept shortly before starting the internship and had been working with it for a while in a different project at the university. Forced alignment is a task within computational linguistics, or more specifically the field of speech recognition. The goal is to temporally segment an audio recording into annotated parts of speech, usually on the phone or word level. To do so, a forced aligner is trained on speech data from a particular language, and the resultant acoustic model is compared to speech in the target recording. For the comparison itself, the trained aligner requires two inputs: the aforementioned model and a conventional transcription (typically orthographic or phonetic), which constrains the possible output transcriptions of the model to (ideally) the relevant ones. For an intuitive and more detailed explanation, especially of the statistical models usually used, see Yuan et al. (2018).

The justification for using forced alignment in the pipeline is as follows. If the process is successful, it means that the algorithm has located the onset and offset of speech within a recording.

The most significant advantage is that it would also have found the relevant sound, assuming no highly similar words are pronounced in the recording, which is a dynamic solution to the variability in the data.

[5]https://www.ffmpeg.org/

[6]http://sox.sourceforge.net/


To accomplish this, one needs the transcription of what is to be aligned and a suitable acoustic model. Both of these were available: the former was in any case provided in Stemmen, and for the latter the WebMAUS aligner (Kisler et al., 2017) for Dutch could be used.

Unsurprisingly, there was no aligner available that was trained on Low Saxon, as this would require a significant amount of data. The rationale for using the Dutch version of WebMAUS was that Dutch likely comes closest to Low Saxon in terms of phonetics, and that training an aligner ourselves would require resources we did not possess. In order to draw the former conclusion, I e-mailed the developer of the aligner and was told the acoustic model was trained on data from the Corpus Gesproken Nederlands (CGN; see Taalunie, 2014). Investigating its metadata revealed that of the 1803 verified Dutch speakers (i.e. speakers whose birthplace was in the Netherlands as opposed to Flanders) about 30% come from provinces in which Low Saxon is spoken. This may be taken as a rough estimate of how many speakers may be influenced by or are possible speakers of Low Saxon variants. This number is not particularly high, but also not devastatingly low. Given the scarcity of alternatives, this was taken as acceptable.

2.3 November 2019

My supervisor was on holiday for most of this month, so I mostly discussed my progress and options with prof. dr. Martijn Wieling. While most of my time had been dedicated to coding the pipeline up to that point, I was made aware that dialectometry had been pushed to the background. To remedy this, I phonetically transcribed 10% of the recordings available at the time (n=377) for evaluation purposes and for an abstract submission to Methods XVII (August 3–7, 2020 in Mainz, Germany).

We were interested in how well the approach of Stemmen works for dialect data collection.

If, for example, participants consistently select an option given in the application (e.g. [mys]) but actually pronounce something different (e.g. an unlisted variant or an alternative listed one), this can be taken as evidence that the method is not reliable. To that end, we computed Levenshtein distances between the manual transcription and the selected one. The average Levenshtein distance was 0.5 (SD = 0.4), whereas the average normalized (over alignment length) Levenshtein distance was 0.09 (SD = 0.06), which indicates there was usually only a small difference between the transcriptions.

Next to this, we wanted to know whether participants who pronounced something different from what they selected had opted for the closest available option. Levenshtein distances between the listed options and the manual transcription were therefore ranked and subsequently normalized from 0 to 1 (0 representing the 'best alternative', and 1 the most different one). In 16% of all cases, participants produced a form not present on the list. In a majority of these cases (72%) participants had selected an optimal variant from the list. Similarly, in the 84% of cases when the pronounced variant was on the list, participants selected an optimal variant about 90% of the time. The effectiveness of the Stemmen approach is moreover confirmed by the mean rank-normalized distance of 0.04 (SD = 0.05). These positive results were put into the abstract and submitted.
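A sketch of this rank-normalization, reusing the levenshtein() function from the earlier example; the option list and transcriptions are illustrative, and ties are ignored for simplicity:

def rank_normalized(manual, options, selected):
    # Rank the listed options by their distance to the manual
    # transcription, then scale the selected option's rank to [0, 1]:
    # 0 is the 'best alternative', 1 the most different one.
    ranked = sorted(options, key=lambda opt: levenshtein(manual, opt))
    return ranked.index(selected) / (len(options) - 1)

# Hypothetical case: the participant said [mus] but selected [mys].
print(rank_normalized("mus", ["mus", "mys", "moʊzə", "mœʏs"], "mys"))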

Tentative testing of a possible MAUS implementation started late this month. The aligner has many parameters, which may drastically influence the results. Again, I had extensive e-mail correspondence with the developer of MAUS, Florian Schiel[7]. Next to explaining the quite specific format in which MAUS requires the data, he also made me aware that it was possible to feed MAUS a list of alternative transcriptions. Assuming Stemmen's list of words to be relatively exhaustive[8], this made the procedure promising.

Incorporating the parameters then required two files next to the audio file: the selected transcription, and the list of alternatives from Stemmen. The former was provided in a UTF-8 encoded CSV file.

[7] See https://www.phonetik.uni-muenchen.de/institut/mitarbeiter/schiel/Schiel.html. Florian has been extremely helpful and patient in answering my mails, and I am truly grateful for this.

[8] The earlier analysis already showed that in 16% of cases participants pronounced something not on the list. This is, in my opinion, a small and acceptable number given the variability in the data.


Figure 2
A schema of the processing pipeline as of November 2019. Metadata stream: index API participant pages → … → export concatenated data to text file → generate NRUL + CSV files. Audio stream: download files from webserver → convert files to WAVE file format → run WebMAUS and save TextGrid.

Table 3

Manual transcription replacements to accommodate MAUS’s acoustic model for Dutch.

Original Stemmen phone | MAUS replacement (SAM-PA) | Reason
æ | ɛ (E) | Original not included, closest alternative chosen
ʋ | w (w) | Original not included, closest alternative chosen
ɒ | ɔ (O) | Original not included, closest alternative chosen
ɵ | ə (@) | Original not included, closest alternative chosen
ʊ | : (:) | Only occurs in [moʊzə], so [o] is lengthened
ʀ | r (r) | Original not included, closest alternative chosen
n̩, m̩, p̚ | n, m, p (n, m, p) | No diacritics available, so they are removed
t͡ɕ | tʃ (tS) | Original not included, closest alternative chosen
ɕ | ʃ (S) | Original not included, closest alternative chosen

This CSV file contains text in the following format: 'XXX;word'. The selected words and options were extracted from the scraped metadata and from Stemmen itself. The 'XXX' designates a required orthographic transcription of the word next to it, but it can be set to null by providing it as such, which leaves the rest of the procedure intact. The list of options is provided in a proprietary NRUL file, which contains lines of the format 'word>alternative'. The number of lines is therefore equal to the number of alternative options minus one ('word>word' is unnecessary). The difference between so-called NRUL files and RUL files for MAUS lies in how they deal with the conditional transition probabilities from the word to the alternative: NRUL files give all alternatives the same likelihood, while in RUL files the likelihood of each variant can be specified. NRUL files were opted for, because there is no obvious way to extract or compute likelihoods for each alternative. See Figure 2 for the updated schema.

Note that the output of the audio data stream is now a Praat TextGrid (Boersma & Weenink, 2020) with phonetic annotation on the phone level.
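A sketch of how the two input files could be generated for a single recording, following the format description above. The variant strings are illustrative SAM-PA, and the assumption that writing 'XXX' literally nulls the orthographic column is based on the description above:

# Illustrative values for one recording of 'muis' (SAM-PA).
selected = "mys"                    # the participant's selected variant
options = ["mus", "mys", "mo:z@"]   # the listed variants

# CSV: 'XXX;word', with the orthographic column set to null ('XXX').
with open("recording.csv", "w", encoding="utf-8") as f:
    f.write(f"XXX;{selected}\n")

# NRUL: one 'word>alternative' line per alternative; 'word>word' is
# unnecessary, so the file has one line fewer than there are options.
with open("recording.nrul", "w", encoding="utf-8") as f:
    for alt in options:
        if alt != selected:
            f.write(f"{selected}>{alt}\n")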

Two final remarks are in order. First, while the CSV files can contain transcriptions in IPA format, the NRUL files cannot. They require SAM-PA (an ASCII-based alternative to IPA) transcriptions, and therefore the CSV files are coded in SAM-PA as well for consistency. Secondly, phonetic symbols that are not in the Dutch acoustic model[9] break the alignment process. This means that manual modifications were necessary in some cases. They are reported in Table 3 and implemented as a Python dictionary. Multi-symbol strings are checked first in order to avoid errors, specifically for the cases of [t͡ɕ] and [ɕ]. Vowel lengthening and shortening modifications are not included for brevity (e.g. [a:] is in the inventory, but not [a]).

[9]https://clarin.phonetik.uni-muenchen.de/BASWebServices/services/runMAUSGetInventar?LANGUAGE=nld-NL
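A sketch of the replacement mapping of Table 3 as an ordered Python dictionary. Multi-symbol entries come first so that, e.g., [t͡ɕ] is rewritten before the bare [ɕ] rule can touch it (insertion order is preserved in Python 3.7+); the vowel-length modifications are omitted here, as in the table:

MAUS_REPLACEMENTS = {
    "t͡ɕ": "tS",  # multi-symbol entries first
    "ɕ": "S",
    "æ": "E",
    "ʋ": "w",
    "ɒ": "O",
    "ɵ": "@",
    "ʊ": ":",    # only occurs in [moʊzə], so [o] is lengthened
    "ʀ": "r",
    "n̩": "n", "m̩": "m", "p̚": "p",  # diacritics removed
}

def to_maus(transcription):
    # Apply the Table 3 replacements in insertion order.
    for src, dst in MAUS_REPLACEMENTS.items():
        transcription = transcription.replace(src, dst)
    return transcription

print(to_maus("ʋi klɒp̚m̩"))  # wi klOpm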


2.4 December 2019

Some initial testing with MAUS showed mixed results, because the onset of words was often aligned much too early due to sound artifacts, such as pops, hisses, or cross-talk. I proposed two solutions for the pipeline: (1) an algorithm for adaptive noise reduction (NR) of the audio files, and (2) a procedure to estimate 'sound contours' in the audio and select the one most likely to be the relevant phrase.

The adaptive NR procedure worked as follows, using the LibROSA library[10] in Python. The recording was cut up into non-silent segments according to a threshold based on the greatest peak in the audio, and silent segments were then determined by selecting the sound bounded by the non-silent ones. Afterwards, the 'silent' segments were concatenated (if n>1; otherwise the single segment was used). A noise profile was determined from this material and applied using SoX's default settings.
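A sketch of this adaptive procedure, assuming LibROSA's split-on-silence utility and SoX's noiseprof/noisered effects; the top_db threshold and file names are illustrative, and the actual pipeline settings may differ:

import subprocess
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("recording.wav", sr=None)

# Non-silent intervals, thresholded relative to the greatest peak.
nonsilent = librosa.effects.split(y, top_db=30)

# The 'silent' segments are the stretches bounded by non-silent ones.
silent, prev_end = [], 0
for start, end in nonsilent:
    if start > prev_end:
        silent.append(y[prev_end:start])
    prev_end = end
if prev_end < len(y):
    silent.append(y[prev_end:])

if silent:
    # Concatenate the silent material (or keep the single segment) and
    # let SoX compute and apply a noise profile with default settings.
    noise = np.concatenate(silent) if len(silent) > 1 else silent[0]
    sf.write("noise.wav", noise, sr)
    subprocess.run(["sox", "noise.wav", "-n", "noiseprof", "noise.prof"],
                   check=True)
    subprocess.run(["sox", "recording.wav", "cleaned.wav",
                    "noisered", "noise.prof"], check=True)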

The second procedure was devised to assist the first one. There is no literature on this approach as far as I am aware, but it is based on intuitions about the recordings, mainly the fact that problems previously encountered in MAUS output mostly had to do with either short bursts of sound or prolonged segments of cross-talk. To kill two birds with one stone, I argued that the temporal segmentation of LibROSA could be combined with two parameters of each resultant segment: the mean 'intensity' and its length. The former was estimated by computing a Short Time Fourier Transform for each segment (after normalizing the audio using SoX), which yielded a spectrum. Root Mean Square (RMS) sound pressure values were then computed for each sample of the spectrum representation and averaged to a single grand mean RMS. This value was subsequently multiplied by the segment length in seconds. The procedure therefore yields a set of values, and the segment with the maximum value is taken to be the one in which the word is most likely spoken.
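A sketch of this segment-scoring procedure; the threshold is illustrative, and the exact STFT parameters of the pipeline are not known here:

import numpy as np
import librosa

y, sr = librosa.load("normalized.wav", sr=None)
segments = librosa.effects.split(y, top_db=30)

def score(start, end):
    seg = y[start:end]
    S = np.abs(librosa.stft(seg))               # magnitude spectrum
    mean_rms = librosa.feature.rms(S=S).mean()  # grand mean RMS
    return mean_rms * (end - start) / sr        # weighted by length (s)

# The highest-scoring segment is taken to contain the target word.
best_start, best_end = max(segments, key=lambda seg: score(*seg))
word = y[best_start:best_end]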

The rationale for the procedure above is as follows. One can assume that participants do not say complete sentences, because Stemmen's instructions clearly say to pronounce only the relevant word. Should there be any noise in the form of short bursts, then these segments will have low values due to the short segment length. Should there be any cross-talk, and (crucially) it is not too close to the microphone, then it should have a lower value than the pronounced word due to a lower mean RMS. A few problematic scenarios are implied by this, e.g. when (a) someone speaks relatively softly in comparison to the surroundings, (b) cross-talk is relatively loud, or (c) someone pronounces full sentences surrounding the word. There is no obvious reason to assume any of these is a structural issue across all recordings, however, so this needs to be tested. The pipeline at the end of the month was as in Figure 3.

Figure 3
A schema of the processing pipeline as of December 2019. Metadata stream: index API participant pages → … → export concatenated data to text file → generate NRUL + CSV files. Audio stream: download files from webserver → convert files to WAVE file format → normalize audio to 0 dB → adaptive noise profiling → RMS method + extraction → run WebMAUS and save TextGrid.

[10]https://librosa.github.io/librosa/


Table 4

Accuracy rates for two language settings by full and relative transcription matching.

Word | Independent: full match | Independent: relative match | NL: full match | NL: relative match
aarde | 0.34 | 0.75 | 0.47 | 0.84
bed | 0.28 | 0.78 | 0.48 | 0.84
briefje | 0.25 | 0.76 | 0.30 | 0.76
eitje | 0.02 | 0.43 | 0.07 | 0.64
huis | 0.93 | 0.97 | 0.93 | 0.97
kinderen | 0.03 | 0.57 | 0.33 | 0.78
muis | 0.95 | 0.98 | 0.79 | 0.83
twee | 0.11 | 0.60 | 0.12 | 0.60
vier | 0.31 | 0.72 | 0.42 | 0.73
wij kloppen | 0.19 | 0.73 | 0.35 | 0.82
Total | 0.34 | 0.73 | 0.42 | 0.78

2.5 January 2020

The last month mostly consisted of evaluating parameters for MAUS. These analyses are summarized below.

The noise reduction step in the pipeline was evaluated using the audio recordings available on the server at that time (n = 3728). Transcriptions were generated for these files using the Dutch acoustic model, both with and without the noise reduction step. Comparing the two sets of transcriptions, it became clear that in only 32 cases there was a difference in the produced transcription. Manual inspection of these 32 cases and the corresponding audio showed that in 13 cases the result was better (i.e. closer to the human annotation) with noise reduction enabled, while in 6 cases the disabled version was more accurate. The rest were either unfiltered noise (n=8; the latest version of the pipeline filters out these recordings) or a word that was not on the list and therefore incomparable. Given these results, it was decided that the noise reduction step was not highly effective, and it was therefore left out of the pipeline in order to minimize processing time.

The language parameter was also tested. Through correspondence with Florian Schiel, it became clear that MAUS also has capacities for dealing with unspecified languages. The acoustic model for this parameter setting is a meta-model consisting of all the other ones in the database. Languages with the most training data available are ranked higher, so that if the vowel /a:/ appears in 5 languages, the /a:/ from the top-ranking language is taken for the acoustic model. From both an intuitive standpoint (it seems logical to use a language-independent model to allow for as much variation as possible) and a practical one (not having to rely on manually adapted transcriptions to fit the Dutch model), it was fruitful to test this parameter. To do so, the same subset of recordings as for the abstract submission was used (n=377). Levenshtein distances were computed between the transcriptions produced by MAUS and the manual ones.

The accuracy rates for both language settings are summarized in Table 4. Note that for all values the analysis only includes words that were on the list, in order to ensure a fair comparison. For each language parameter two sets of values are presented: (1) 'full match', i.e. whether or not the MAUS transcription perfectly matches the manual one, and (2) 'relative match', i.e. how much of the MAUS transcription matches the manual one. For the former, each word is assigned either 0 (mismatch) or 1 (match), while values for the latter can range from 0 to 1, as they pertain to a success rate over the alignment length. It can be concluded that the Dutch acoustic model in general outperforms the language-independent model for our use case. The results for 'twee' are an interesting exception, but it should be noted that the options for it are highly similar (recall Table 1), which likely causes the unreliable results. It should also be noted that the accuracy of MAUS is currently not particularly high when it comes to full matching.

Figure 4
Scatter plots of annotation boundary timings with (LOESS-fitted) 95% confidence intervals. Y-axis values represent the timings from manually annotated recordings (n=354); X-axis values represent those from the pipeline. Upper left and right: onset and offset values (RMS method enabled). Lower left and right: onset and offset values (RMS method disabled).

The picture is more promising in terms of relative success for each word, but the accuracy is still not impressively high.
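A sketch of the two measures, reusing levenshtein() from the earlier example. 'Relative match' is interpreted here as the matching proportion over an alignment length approximated by the longer string, which may differ slightly from the report's exact normalization:

def full_match(maus, manual):
    # 1 for a character-perfect match, 0 otherwise.
    return int(maus == manual)

def relative_match(maus, manual):
    # Proportion of matching material over the alignment length.
    alignment_length = max(len(maus), len(manual))
    return 1 - levenshtein(maus, manual) / alignment_length

print(full_match("mys", "mus"))      # 0
print(relative_match("mys", "mus"))  # ~0.67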

Lastly, and perhaps most crucially, it was evaluated how well MAUS detects the same onsets and offsets of speech as a human annotator. To this end, a set of 354 recordings was manually annotated for the onset and offset of target words. These were then correlated with the timings produced by the pipeline, both with the RMS method enabled and disabled. In the former setting, Pearson's correlation was 0.85 (p<0.01) for both the onsets and the offsets. Repeating the process with the RMS method disabled yields correlations of 0.30 (p<0.01) for the onsets and 0.35 (p<0.01) for the offsets. See also the scatter plots in Figure 4. These results show, on the one hand, that using the RMS method improves the accuracy of the aligner output by a considerable amount, especially when it comes to onsets. They moreover show that the full process is adequately accurate, e.g. for employment on a server. Clearly, there are still outliers: for example, the alignment tends to be too early for onsets, and in one case it was even more than 2 seconds too late. It is not entirely clear what happens in these cases as of yet, but it is likely that they indicate problematic recordings.

The current state of the pipeline is shown in Figure 5. Note that audio files with a length of less than 100 ms are now automatically removed and marked in the metadata; see Section 3.3 for an explanation why.

Figure 5
A schema of the most recent version of the pipeline. Metadata stream: index API participant pages → … → export concatenated data to text file → generate NRUL + CSV files. Audio stream: download files from webserver → convert files to WAVE file format → remove recordings with length < 100 ms → normalize audio to 0 dB → RMS method + extraction → run WebMAUS and save TextGrid.

3 Evaluation

3.1 Summary of acquired skills

The following are skills I gained or strengthened during my internship, as well as useful experiences.

Note that the list is unordered and not exhaustive:

• Substantial experience with programming in Python, with applications in e.g.:

– Text manipulation, e.g.

* Regular expressions.

* Translating between IPA and SAM-PA.

– Sound manipulation, e.g.

* Batch processing of sound files (normalizing and conversion)

* Implementing noise reduction (techniques).

* Computing Fourier transforms.

* Computing Root Mean Squares.

* Extracting sound segments.

– Web scraping, e.g.

* Batch processing HTML pages in a structured manner.

* Traversing HTML pages to extract textual information.

– Basic statistical procedures required for dialectometric analyses.

• Capacity to carry out basic dialectometric analyses, e.g.

– Computing Levenshtein distances (using Python libraries)

– Working with different types of Levenshtein distances (e.g. normalized and rank-normalized)

• Exploring dialectometric literature and learning more about (the history of) the field.

• Experience with collecting dialect data at, for example, outreach events.

• Writing up an abstract for a relevant conference.


• Making phonetic transcriptions of dialect recordings (and experiencing how challenging it is).

• Considerable experience with the (Web)MAUS forced aligner, e.g.

– Reading through its extensive documentation to consider its many parameters.

– Applying it to the Stemmen data with various settings (such as different languages or types of pre-processing).

– Evaluating forced alignments in different manners, such as in onset matching, but also by comparing different output sets by Levenshtein distances.

• Batch processing Praat TextGrids using Python (usually done with Praat's own, considerably less flexible, programming language).

• Learning the Gronings dialect by merit of ten lessons at the CGTC (with thanks to Martijn Wieling).

• Gaining valuable insights into motives of dialect speakers to preserve their language, or to not do so.

3.2 Learning outcomes

The degree to which the learning outcomes in Section 1.5 have been met is mostly implied by Section 3.1, but a few points are worth mentioning. First, it is clear I did not get to develop the web interface that was stated in the learning outcomes. This is mostly because coding the web scraper was a greater challenge than planned, and evaluating the different parameters of the forced alignment also took longer than expected. This is not all that regrettable to me; it was mostly something Martijn Bartelds suggested we could go on to do if we had time left. It was essentially optional, and therefore perhaps should not have been put in the learning outcomes per se, but in the beginning we did not anticipate working so much on forced alignment.

This endeavor took more time than planned, although it caused no dissatisfaction on either side to my knowledge.

In the same spirit: while at first I anticipated doing more dialectometry than pipeline programming, I did have the opportunity to do some of it. This means that currently, i.e. for my thesis, I have to catch up more on my dialectometry knowledge, but at the same time I have written a program that satisfactorily provides (better) data. Moreover, since I worked so intensively with the recordings, I have gained such a familiarity with them that I know what to be aware of when analyzing them.

Lastly, one point that was somewhat displaced by programming was the writing of scientifically related works. While I wrote two abstracts during my internship, one of which has so far been accepted, and I wrote some documentation for the code on GitHub, I have not written a paper about my findings. This is somewhat more regrettable, but I am certain that there are enough findings to write about. They will simply have to be written up after the internship.

3.3 Contributions to projects

I have contributed to several ongoing projects during the internship, to different degrees. Most of my time was dedicated to Stemmen, for which I have shown that the type of data collection, i.e. the design of the web application, is reliable (recall the analyses for the conference abstract). This may be helpful in the future (e.g. in securing funding) for different dialect areas.

I have chiefly developed a pipeline that deals with the highly variable sound recordings. It is impossible to test whether a perfectly clean recording rolls out of it every single time, but there is evidence that it does so reliably. This is particularly useful for my supervisor Martijn Bartelds, as he needs audio recordings that are as clean as possible to continue ongoing projects on his side (e.g. dialectometric measures that rely completely on acoustic data), but also for future studies that wish to use this data and can then proceed more efficiently.


Moreover, the pipeline provides a clearer view of how many useful recordings there are in general, e.g. by cutting away recordings that are shorter than 100 ms or for which the WebMAUS procedure does not return valid TextGrids (indicating faulty audio). This is actually crucial, because simply counting the number of audio files on the server gives the wrong impression: there are hundreds of audio files on the web server that are in fact not relevant recordings, e.g. short ones that were generated during testing of the application, which was often necessary. One last change to be implemented in the pipeline is to not download the audio files from the server directly, but to use the metadata as a starting point instead. This has the significant advantage that when sessions are deleted in the API, they are also automatically ignored by the rest of the pipeline. Currently, when sessions are deleted in the API, their audio files remain on the server.

Lastly, of course, I also helped with collecting more data at outreach events. Even with a design as flexible and undemanding (on the researcher) as Stemmen, data collection remains an important part of the research process. Ideally, the web application allows for a kind of snowball effect, so that when one dialect speaker goes through the quiz, it is passed on to close acquaintances. This did not seem to be happening, so it proved necessary both to gain media attention and to actively promote the application at events.

3.4 Supervision

During my internship I had weekly meetings with my supervisor Martijn Bartelds, and I e-mailed my progress at the end of each week. These meetings were always insightful and helpful, especially when I tended to linger too long on, or get carried away by, a particular aspect of the work. Had I not had this implicit assistance in keeping to the schedule, the internship would have been much less productive. As an example of such focusing assistance: I had thought of (too) many parameters to test and implement in the pipeline, but Martijn repeatedly reminded me that every single step had to be tested for its usefulness and that a cleaner, more minimal pipeline was preferred. This led to abandoning the noise reduction step in the pipeline, to which I was initially opposed, but his arguments for it were valid. Martijn was also indispensable when the evaluation measures had to be devised, because each parameter in the pipeline had its own conceptual reason for being set one way or another. I had little experience with evaluating such measures, but Martijn suggested creative solutions for every case. For these insights and a great deal more I am grateful, especially taking his busy schedule into account. It is impressive to do all of this while retaining a signature sense of humor.

Lastly, I was both advised and supervised by prof. dr. Martijn Wieling on account of the study program. While his schedule is no less busy than Martijn Bartelds’, I could rely on his wide-ranging expertise whenever I wanted to. Instead of doing biweekly meetings separately, we mostly kept each other up to date in regular Speech Lab meetings. Due to a holiday in December and temporarily fewer lab meetings, I think Martijn may have lost sight of what I was doing at some point during the later stages of the process. To remedy this, and on Martijn Bartelds’ advice, I wrote the activities summary more extensively than I usually would have. Leaving aside these minor setbacks, I can only be grateful for Martijn’s input, and especially so when Martijn Bartelds was away on holiday in November.

3.5 Career development

Like many other research master students, I am considering pursuing a PhD. This internship has, thanks to the independence I enjoyed, allowed me to do many things I consider useful for such a career path. These include writing up the abstract and reading up on background literature, but also attending the many required meetings. Being a researcher at the university is clearly a demanding job, but perhaps an equally rewarding one. The aforementioned independence moreover allowed me to dedicate almost all my time to doing actual research and working with the data at hand. Moreover, my master's thesis


will be on Low Saxon as well, and I will use the data we collected during the internship, which has perhaps the most direct influence on my career.

The familiarity with data as difficult to organize as the Stemmen data will be especially invaluable for future research endeavors, which is a reassuring thought given the somewhat monumental challenge it at times was. Next to this, I have carried out numerous tasks and analyses, which can in all likelihood be published in a journal when properly organized, or presented at conferences.

In fact, part of the work has already been accepted for the Anéla/VIOT Juniorendag 2020 and will likely also be presented at the TABU Dag 2020, both of which are organized by the university. All the aforementioned merits can advance a career in academia.

Nonetheless, it should be noted that I have considered for a while pursuing a career outside academia as well. This does not come out of nowhere, of course. In fact, when I interviewed Martijn Wieling once about how he attained his PhD, he more or less warned me about a career in academia with convincing arguments. While it did not scare me away from the path, it did allow me to notice other ones. I have also noticed during my internship that I enjoy working with data of any kind.

This is why the substantial amount of time dedicated to programming did not demotivate me at any point in the process, as programming allows for just that. While I can hardly call myself a proficient programmer (especially since I am surrounded by talented ones), I have truly become at the very least as proficient as I had hoped to be when designing the internship. Martijn validly pointed out that learning to program is extremely useful for being seen as desirable for work at the university.

Adding to that, it can easily be said that being able to code is useful in virtually all areas of the job market, so dedicating most of my internship to it feels like time well spent.

4 Conclusion

I noticed it is difficult to summarize a whole internship in a single report, because there are simply too many experiences to include. The most important ones have been reported here, but there have been numerous lesser ones that contributed equally to my internship. In any case, I can conclude with great satisfaction that I actively notice how I improved (in terms of skills and knowledge) over the course of the internship. Before starting the internship I was somewhat reluctant to write programs and only did so when necessary; now I tend to write programs even for menial tasks. In fact, I wrote a web scraper to check an extremely busy housing listings page, and I am fairly certain that it resulted in obtaining my current home.

Other than programming, I had the opportunity to be involved in local language and culture activities. These include the weekly Gronings lessons with the Speech Lab members and meeting proud native speakers. It is a unique experience to find so many new fascinations in an area in which I have lived virtually my entire life, while most people look further from home for such delights; I think it may be a shared feeling among those of us who study dialects.

Lastly, I think it is fair to say that this report does not finalize my work. Several things are still to be implemented in or tested for the pipeline, such as rewriting the code so that it uses the metadata as a starting point for the process, as opposed to the audio files from the web server. I remain active in the Stemmen project until my studies are at an end, and, if the future is kind, at least a little longer. I express my final gratitude to the speakers of Low Saxon for not leaving the language decisively moribund, and to the members of the CGTC and Groningen Speech Lab I have not explicitly named in this report: thank you.

References

Boersma, P. & Weenink, D. (2020). Praat: doing phonetics by computer [Computer program, Version 6.1.09]. http://www.praat.org/. Retrieved 26 January 2020.

Friedl, J. (2006). Mastering regular expressions. O’Reilly Media, Inc.


Goebl, H. (1982). Dialektometrie: Prinzipien und Methoden des Einsatzes der numerischen Taxonomie im Bereich der Dialektgeographie. Wien: Österreichische Akademie der Wissenschaften.

Gooskens, C. & Heeringa, W. (2004). Perceptive evaluation of Levenshtein dialect distance measurements using Norwegian dialect data. Language Variation and Change, 16(3), 189–207.

Heeringa, W. J. (2004). Measuring dialect pronunciation differences using Levenshtein distance. PhD thesis, University of Groningen.

Hilton, N. (2019). Smartphone sociolinguistics in minority language areas: ‘Stimmen’. Paper pre- sented at International Conference on Language Variation in Europe (ICLaVE) 10, Leeuwarden.

Kessler, B. (1995). Computational dialectology in Irish Gaelic. In Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics, (pp. 60–66). Morgan Kaufmann Publishers Inc.

Kisler, T., Reichel, U., & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech & Language, 45, 326–347.

Kleene, S. (1951). Representation of events in nerve nets and finite automata. Technical report, RAND Project Air Force Santa Monica CA.

Kruskal, J. (1983). An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM Review, 25(2), 201–237.

Leemann, A., Kolly, M.-J., Purves, R., Britain, D., & Glaser, E. (2016). Crowdsourcing language change with smartphone applications. PLoS ONE, 11(1), e0143060.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady, volume 10, (pp. 707–710).

Mitchell, R. (2018). Web scraping with Python: Collecting more data from the modern web. O’Reilly Media, Inc.

Séguy, J. (1973). La dialectométrie dans l’Atlas linguistique de la Gascogne. Société de linguistique romane.

Taalunie (2014). Corpus Gesproken Nederlands - CGN [Data set, Version 2.0.3]. http://hdl.handle.net/10032/tm-a2-k6. Available at the Dutch Language Institute.

Taeldeman, J. & Goeman, A. (1996). Fonologie en morfologie van de Nederlandse dialecten: Een nieuwe materiaalverzameling en twee nieuwe atlasprojecten. Taal en Tongval, 48, 38–59.

Wieling, M. (2012). A quantitative approach to social and geographical dialect variation. PhD thesis, University of Groningen.

Wieling, M., Bloem, J., Mignella, K., Timmermeister, M., & Nerbonne, J. (2014). Measuring foreign accent strength in English: Validating Levenshtein distance as a measure. Language Dynamics and Change, 4(2), 253–269.

Yuan, J., Lai, W., Cieri, C., & Liberman, M. (2018). Using forced alignment for phonetics research. In Huang, Jin, & Hsieh (Eds.), Chinese Language Resources and Processing: Text, Speech and Language Technology. Springer.
