
Understanding Unconstrained User Speech Behavior for Eyes-free Dictation

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Ines Camara

11850752

MASTER INFORMATION STUDIES
HUMAN-CENTERED MULTIMEDIA
FACULTY OF SCIENCE
UNIVERSITY OF AMSTERDAM

24/08/2018

1st Supervisor: Dr Frank Nack, Faculty of Science, University of Amsterdam
2nd Supervisor: Dr Can Liu, School of Creative Media, City University of Hong Kong


Understanding Unconstrained User Speech Behavior for

Eyes-free Dictation

Ines Camara
University of Amsterdam
inescamaracalzas@gmail.com

ABSTRACT

Eyes-free dictation is a suitable text entry method when our hands and eyes are busy, and also for the visually impaired. However, current dictation systems are not designed for eyes-free use. The interaction with those systems is based on the use of commands and a fixed syntax, requiring users to memorize and repeat keywords and exact phrases. This can be challenging when users do not have any visual feedback. Relaxing current technical limitations would be beneficial for users in the context of eyes-free dictation. The current research is focused on understanding how users perform eyes-free composition and revision when not constrained by the use of a fixed syntax. A Wizard of Oz study was conducted with the purpose of identifying natural interaction strategies and patterns. The key findings are presented in five themes that emerged from the qualitative analysis of the collected speech data: specificity level, editing strategies, target location strategies, switching between composition and revision, and system response. This thesis contributes a set of recommendations that can be used to inform the design of future systems allowing for eyes-free use and unconstrained speech.

1 INTRODUCTION

Voice-based interfaces are an emerging trend opening up new possibilities. Virtual assistants using speech as input/output, like Siri¹, Alexa², or Google Assistant³, allow us to interact with our devices when our hands and eyes are busy, for instance, during cooking or driving. At the same time, those interfaces are becoming more and more conversational, allowing for a 'natural' use of speech. For example, the newest version of Google Assistant allows for continued conversation, which eliminates the need for wake-up words (e.g. OK, Google) when a user wants to issue multiple commands in a row.

However, voice-based systems used for text generation have not advanced that far. Current dictation systems like Dragon⁴ or Google's Voice Typing⁵ still require users to memorize and speak a fixed syntax of commands to perform editing operations (e.g. select Monday, change to Friday). Because of that, it is difficult to use them without any visual feedback. Eyes-free dictation can be desirable in certain contexts. For example, when we are walking, visually engaging with our device to reply to an email or message can be challenging or dangerous. Furthermore, eyes-free dictation is a suitable text entry method for the visually impaired.

¹ https://www.apple.com/sg/ios/siri/
² https://developer.amazon.com/alexa
³ https://assistant.google.com/
⁴ https://www.nuance.com/dragon.html
⁵ https://support.google.com/docs/answer/4492226?hl=en

In order to facilitate eyes-free dictation and design user-centered interfaces, we should loosen the constraint of a command-based interaction. Therefore, it is valuable to investigate how users would interact with dictation systems when not constrained by current technical limitations, to guide the design of future systems.

This research is concerned with the following question: how do users naturally perform eyes-free dictation tasks without the constraint of using a fixed syntax? To be able to answer this question, a number of sub-questions need to be addressed. For example, how do users speak to the system, and how do they expect the system to operate and respond? Or, how do they switch between composition and revision operations?

To answer these questions, a Wizard of Oz study was conducted. A system was simulated allowing for eyes-free dictation with an unconstrained use of speech, in order to elicit users' natural interaction strategies and patterns. Prior to the study, a pilot study was conducted in order to identify the crucial system operations and responses for completing the tasks of eyes-free text composition and revision.

This paper first examines previous literature on dictation systems and speech patterns, both relevant for the study of users' interaction with eyes-free unconstrained dictation. Next, the paper describes the methodology used for the purpose of this research, presenting the design and findings of the pilot and main studies conducted. Then, the results of the main study are discussed, and recommendations are proposed with the purpose of informing the design of future systems. Finally, potential future research is discussed.

2 RELATED WORK

2.1 Dictation systems

Dictation systems are used to transcribe speech. Examples of dictation systems are Dragon NaturallySpeaking⁶, Google Docs Voice Typing⁷, or Apple Dictation. Current dictation systems allow for continuous speech, which means users can speak like they do in normal conversation, instead of dictating single words. The interaction with current dictation systems is based on the use of voice commands and a specific grammar, required to operate on the text (e.g. select, delete, insert...).

With a focus on the interaction between users and those systems, rather than on the computational side of speech recognition, this section examines state-of-the-art dictation applications. Even though conversational interfaces are an emerging trend, there is a lack of academic research focusing on facilitating a "natural" or unconstrained interaction with dictation systems. Previous research on dictation systems has mainly focused on improving the processes of error detection and correction [5, 10, 12], which have been proven the most challenging and time-consuming tasks for users of speech recognition [1].

⁶ https://www.nuance.com/dragon.html

Eyes-free dictation is still not a trend but has already been studied by some researchers. Ghosh et al. [3], after evaluating the limitations of current dictation systems in an eyes-free context, designed a voice-based interface for eyes-free text processing. Their study proved that users can accurately complete text revision tasks using spoken commands (e.g. delete, insert, etc.) without any visual feedback. However, their study design was trigger-based (e.g. participants were asked to highlight a line when they heard a certain word), so the findings do not necessarily reflect how users would naturally perform editing operations. Behaviors such as fumbling or thinking out loud are not considered in the study design. The research presented in this thesis has principally been motivated by the work of Ghosh et al. [3]. Continuing with the same line of research, this thesis is interested in understanding what happens when users are allowed to speak based on an open grammar.

Eyes-free dictation has also been explored in the driving context by Labsky et al. [7]. They proposed a dictation interface to be used by drivers to compose messages, with the purpose of minimizing visual distractions. The authors compared drivers’ performance when dictating and when typing, showing that the former method has significantly better results. Their evaluation also proved that dictating was faster than typing, although the quality of the message produced by voice was worse. This work shows the potential of eyes-free dictation as a text entry method when hands and eyes are busy, for instance, during driving. However, it is not mentioned in the article whether users are required to use a fixed command syntax to edit the messages.

The work of Vertanen and Kristensson [14] took a significant step towards relaxing current technical constraints. The proposed voice-based technique for correcting speech recognition output does not require the use of commands. This technique enables error correction by just re-dictating the intended text, as opposed to the common approach in which users need to first select the target, and then speak the modification, using specific commands. This study proved that recognition accuracy is higher when correct words of context are spoken in the re-dictation (words to the left or right of the target). Notwithstanding, this technique still requires users to speak in a certain manner and does not take into account natural behaviors like fumbling. For example, to change the day of the week in the sentence “On Monday I will go to the dentist”, users would have to say: “On Friday” or “On Friday I”. Furthermore, this technique was not intended for eyes-free use, but only for hands-free use.

Also focusing on error correction is the work by Feng and Sears [2]. The authors of this article proposed a voice-based technique for improving navigation in dictation systems. Navigating to the target of modification is the first step of error correction, taking up one third of users' time, as reported in the article. To reduce the time users spend issuing navigational commands, confidence scores (level of certainty) are used to define anchor words in the text. Those anchors can be used to navigate the text more efficiently with commands like go to previous/next or move up/down. This approach was proven to reduce the number of commands needed for navigation and the command failure rate. However, it is not appropriate for eyes-free use, as the anchors need to be visually identified, and it requires the use of a fixed syntax.

Multimodality has been explored by Suhm [13] to make the repair process more efficient in dictation systems. The multimodal interface presented in this paper allows users to choose between keyboard and mouse, speech, gesture, and handwriting to correct the recognition output. This study provided evidence that more than one modality is more efficient than only re-speaking. However, most of these modalities are not suitable for hands-free or eyes-free use.

2.2 User behavior in speech recognition

Speech is a very natural communication method for humans, but it is not so easy to understand for computers. To design conversational voice interfaces, it is necessary to understand how people naturally speak. This section examines existing literature on how people speak when interacting with voice interfaces and with other humans. Even if not specifically focused on dictation, the study of speech patterns is relevant to the purpose of the research presented in this thesis, as it aims at understanding how to facilitate a more "natural" interaction with speech recognition.

In the context of speech patterns, disfluencies have been largely studied. The idea is that if we understand the characteristics of speech disfluencies, we will be able to build better speech recognition systems that can identify them, and therefore correct them. Self-repair is an example of a speech disfluency that occurs when speakers immediately correct their own speech errors: "A central theme is gender equa- whether women and men are equal". Nakatani and Hirschberg [9] elaborated a predictive model, the Repair Interval Model (RIM), that used speech cues (lexical, prosodic and acoustic) to detect and correct self-repair in spontaneous speech. They posited that repair events follow a particular structure and identified three consecutive intervals: the reparandum, or portion to be repaired; the disfluency interval, or interruption; and the repair, or modification. The RIM has been a common approach to the detection of speech repair [6]. To elaborate this model, the authors used the Air Travel Information System (ATIS) dataset, which contains spontaneous speech of subjects in travel planning scenarios. Notwithstanding, the current research is interested in a different domain, which is understanding spontaneous speech in the context of text generation.
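To make the RIM terminology concrete, the self-repair example above can be segmented into the three intervals the model describes. The short Python sketch below is purely illustrative (it is not Nakatani and Hirschberg's implementation, and the RepairAnnotation structure is a hypothetical name introduced here for clarity):

```python
from dataclasses import dataclass

@dataclass
class RepairAnnotation:
    """Illustrative segmentation of a self-repair into RIM's three intervals."""
    reparandum: str           # the portion to be repaired
    disfluency_interval: str  # the interruption (cut-off, pause, filler, cue phrase)
    repair: str               # the corrected material

# "A central theme is gender equa- whether women and men are equal"
example = RepairAnnotation(
    reparandum="gender equa-",
    disfluency_interval="",  # here the word cut-off itself marks the interruption
    repair="whether women and men are equal",
)
print(example.repair)
```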

Yang Liu et al. [15] addressed other speech disfluencies in their work about enriching speech recognition. They studied, for example, the appearance of repetitions in human-to-human speech: "On Monday I- On Monday I am going to the doctor"; and the use of fillers like so, anyway, I mean. They proposed a computational framework to extract information from speech about disfluencies and sentence boundaries, in order to annotate the recognition output and improve its readability. For example, this approach is useful to add punctuation signs to the speech recognition text output. However, this work is focused on human communication, while the work of the current thesis is focused on human-machine interaction for the purpose of dictation.

Hyperarticulation has also been studied in the context of speech recognition. Hyperarticulation is usually motivated by recognition errors. Users of speech recognition systems sometimes elongate utterances and pauses, or alter their pitch, when trying to recover from errors. System failure to identify hyperarticulated speech can result in error cascading. Oviatt et al. [11] elaborated a model, the Computer-elicited Hyperarticulate Adaptation Model (CHAM), to predict hyperarticulated speech and improve the error resolution process. This study is based on speech data collected in a lab environment. Thus, in a real mobile scenario, variability in the speech signal could increase due to, for instance, the level of noise.

Apart from speech disfluencies, other behaviors have been studied. In the context of human-machine communication, the patterns reported by Large et al. [8] suggested that people speak to intelligent agents as if they were human. In their study of the interaction between drivers and the simulation of an intelligent driving assistant, it was observed that users were, for example, polite in their communication. Politeness was reported as a possible strategy to soften expressions or reduce the impression of authoritativeness towards the listener. In the study, it was also observed that users were vague or imprecise at times, for instance, when using deictic references, which are common in human conversation. Utterances such as "Take that one" or "Move there" are considered vague since they need to be disambiguated by computers with the help of contextual information. However, the simulated system in this work is a driving assistant that can be used to operate the car's music system or request information about the route, for example. Thus, the domain is different from the one studied by the current research.

3 PILOT STUDY

Prior to the main study, an exploratory study was conducted that allowed for the identification of crucial system operations and responses needed for completing the tasks of eyes-free text composition and revision. Twenty-two participants volunteered over a period of three months.

This pilot study was divided into three phases. The system operations and responses were gradually defined throughout the course of the study. Also, the experiment design was adjusted over time based on the observations. For instance, the tasks were adjusted.

Participants were asked to perform different composition/revision tasks. The instructions for the experiments were the same in all three phases. Prior to the experiment, participants were always briefly introduced to the voice interface. They were asked to instruct the system using speech in order to complete the composition/revision tasks. For that, they were asked to speak freely, in whatever way they preferred. The system was always manually operated by the experimenter.

3.1 Phase 1

In this phase, the research question was: how do users revise text eyes-free? To tease out strategies and behaviors, participants were provided with a text containing obvious grammatical errors (e.g. Hi Dan, how is you?).

The system operations and responses had not yet been defined in this phase, thus the system did not perform any action other than reading the text out loud.

3.1.1 Participants. Twelve participants (six male, six female, mean age 30.5) volunteered for the experiment. Two participants used a voice interface on a weekly basis, nine had used one occasionally (once or more), and one had never used one before. Six participants had never used dictation software before, and six had used it occasionally.

3.1.2 Apparatus. As the system did not perform any operation other than reading the text, the apparatus used in this phase was rather rudimentary.

The iPhone 8's⁸ note editor was used for pasting the text to be revised, and Apple's VoiceOver⁹ was used to read the text out loud. A pair of earphones was provided to participants to listen to the voice interface.

The iPhone 8's voice recorder¹⁰ was used to record the experiment.

3.1.3 Procedure. Participants were asked to listen twice to a piece of text written in casual English and containing grammatical errors. The second time, they had to revise the text eyes-free.

The system behavior is described below:

(1) The text for revision was manually pasted into the iPhone's note editor.

(2) iPhone's VoiceOver was manually triggered to read the text out loud to participants.

(3) The reading was not stopped when participants barged in.

(4) Confirmation of changes was not provided on behalf of the system.

(5) Participants could not navigate to specific portions of the text.

After completing the tasks, participants were asked to fill in a post-experiment questionnaire. The questionnaire contained close-ended questions (e.g. evaluating system ease of use on a 5-point Likert scale) and open-ended questions (e.g. providing feedback on how to improve the system).

3.1.4 Data analysis and Results. After finishing the experiment, a qualitative analysis was performed on the audio recordings from the experiment. The categories presented below emerged from the qualitative analysis of the speech data, and the findings were complemented by the responses provided by participants in the post-experiment questionnaires.

• Editing behavior:

– operations: all editing operations performed by participants were focused on error repair.

– strategies: different strategies were observed for error correction. Participants used re-dictation, which is defined in this work as the act of re-speaking the intended text providing correct words of context.

System "It was a very interested city"
Participant "Interesting city"

Participants also used commands or keywords (e.g. delete, change to). They used synonyms (e.g. delete, remove, scratch) and phrases containing commands (e.g. "Can you delete...?"). Another observed strategy was the use of high-level or more abstract instructions. Those are instructions that require a certain level of understanding or inference from the system to be performed.

⁸ https://www.apple.com/sg/iphone-8/
⁹ https://www.apple.com/sg/accessibility/iphone/vision/
¹⁰ https://support.apple.com/en-sg/HT206775


Table 1: Responses to the question "How easy/difficult was using the system?" on a 5-point Likert scale

Rating               Responses
1 - Very easy        6
2                    3
3                    1
4                    2
5 - Very difficult   0

Participant "You have a couple of mistakes on that sentence... hum, check your grammar tenses"

• Speaking style:

– human-machine interaction: participants’ utterances were mostly limited to instruction expressions (e.g. “Stop”, “Delete the street”, “On Friday, not at Friday”).

– human-human interaction: some participants were more verbose in their expressions, speaking as if the interface was a person.

Participant "After London, you can say..."

• Speech disfluencies:

– repetitions/false starts:

System “There are a cool market”

Participant "There are- there are cool markets"

• Participants' experience:

– ease of use: since the system in this initial phase was very rudimentary, it was valuable to understand how difficult/easy it was for participants to use it. Table 1 shows participants' assessment of the difficulty/easiness of using the system.

– system response: with the goal of defining crucial system responses for the next phase, participants were asked about their opinion of the current system response. Almost all participants rated not being able to interrupt the reading as negative. Most of them mentioned that they would prefer receiving feedback on changes done.

– navigation: some participants mentioned that they would improve the navigation of the text.

3.2 Phase 2

In this phase, the research question was: how do users perform eyes-free text revision operations beyond error repair? For that, a text without obvious errors was provided to participants. Two variables were introduced in this phase: a casual-style text and a formal-style text, with the purpose of understanding differences in participants' behavior when performing revision tasks.

Based on the findings from the previous phase, the basic system operations and responses had already been defined in this phase. The system in this phase operated and responded by allowing participants to barge in with corrections, allowing them to navigate through the text (e.g. repeating a sentence), and providing them with feedback on changes done. The system response was manually typed by the experimenter.

As a consequence of the above, the apparatus in this phase had to be adjusted: the iPhone's note editor did not allow for easy text navigation, nor was it convenient to type the system response in the note editor.

3.2.1 Participants. Six participants (two male, four female, mean age 28.8) volunteered for the experiment.

3.2.2 Apparatus. A MacBook Pro¹¹ was used for easy typing and text navigation. An online Text-to-Speech (TTS) reader¹² with a typing box was used for pasting the text for revision and typing the system response. By clicking on any part of the text, the TTS reader started reading it. The MacBook's speaker was used to reproduce the system response.

The iPhone's voice recorder was used to record the experiment.

3.2.3 Procedure. Participants were asked to perform two revision tasks. For task 1, they had to listen and perform eyes-free revision on a piece of text written in casual English and containing different types of errors (e.g. The capital of France, which is Lyon). For task 2, they had to listen and perform eyes-free revision on a self-written, formal-style text (e.g. a paper abstract) that was provided by participants prior to the experiment.

The system behavior is described below:

(1) The text for revision was pasted by the experimenter into the TTS typing box.

(2) The TTS reader was manually triggered to start/stop the reading when participants requested/barged in.

(3) The system response was manually typed in the TTS typing box, and the reader triggered to start/stop the reading when appropriate.

(4) Two types of system response were provided to confirm changes: either a simple message "Change done", or repeating the changes.

After completing the two tasks, participants were asked to listen to the experiment audio recording in order to discuss the strategies and behaviors they had used. A semi-structured interview followed the experiment, in which the aim was to discuss each participant's specific behaviors (example of interview question: "You mostly used commands, why do you think that is?"). Participants' preferences regarding the two system responses were collected.

3.2.4 Data analysis and Results. After the experiment was finished, a qualitative analysis of the audio collected during the experiment and the post-experiment interviews was performed. The findings are reported within the categories that emerged from phase 1 of the study.

• Editing behavior:

– operations: as expected, editing operations went beyond error repair (e.g. insertion of new pieces of text).

– strategies: the use of strategies reported in phase 1 was also observed in this phase. For example, participants used re-dictation.

System "They didn't wanting work"
Participant "They didn't want to work"

Commands were also used, with an unconstrained vocabulary.

¹¹ https://www.apple.com/sg/macbook-pro/
¹² https://ttsreader.com/


System "They swallow his prey whole, without chewing gum"
Participant "Replace chewing gum by chewing"

More abstract instructions, requiring a certain level of inference from the system, were noted as well.

"I am missing some commas there"

• Speaking style:

– specificity: participants' utterances ranged from very specific to very unspecific. In some cases, participants even thought out loud and asked questions while instructing the system, which is considered very unspecific.

Very specific: “Delete Amazon”

Very unspecific (thinking out loud): “Flowers, I don’t know what that means but it won’t be flowers. Maybe nukes, or power, or plants in this case, probably. I will correct it with plants”

Very unspecific (questioning): "What? The UK President? We are not talking about the UK President... So, can you repeat the part of the UK president?"

• Speech disfluencies:

– self-repair: "I make use of a varie- hum... of different technologies (pause)."

• Participants' experience:

– ease of use: even if the system allowed participants to barge in, participants reported feeling rushed by the system.

"I felt like if I don't stop it now, it will be too late"

– system response: participants were asked about their preferences on the system response. Most participants preferred when the system repeated the changes done.

– expectations: some participants mentioned expecting the system to respond like a human.

“If you are speaking to a normal person, that’s how the person will respond right?”

3.3 Phase 3

In this phase, the research question was: how do users compose and revise text eyes-free? Thus, a composition task was added to the experiment. This final phase served as a pilot for the main study.

The system operations and responses had been refined based on previous findings. A speech recognizer was simulated by manually typing participants' speech. The simulated system allowed for unconstrained speech. The apparatus remained the same in this phase.

3.3.1 Participants. Four participants (one male, three female, mean age 28) volunteered for this experiment.

3.3.2 Apparatus. The TTS typing box was used to manually type participants' speech and the system response. The TTS reader and the MacBook's speaker were used to reproduce the system response.

The iPhone's voice recorder was used to record the experiment.

3.3.3 Procedure. Participants were asked to perform two composition/revision tasks. For task 1, they had to perform eyes-free composition and revision of a formal-style email. The topic for the email was free.

For task 2, they were asked to perform eyes-free composition and revision of a casual-style Facebook post. The topic for the Facebook post was free.

The system behavior is described below:

(1) Participants' speech was manually typed in the TTS typing box. Speech disfluencies such as fillers (hum, eh, so, anyway...), self-repair, and repetitions were discarded by the experimenter.

(2) The TTS reader was manually triggered to start/stop reading when participants requested it or barged in.

(3) The system response was manually typed in the TTS typing box, and the reader was triggered to start/stop the reading, when appropriate.

(4) Three types of system responses were provided: to inform of error (e.g. missed words), to ask for confirmation (e.g. the location of an operation), and to confirm an action (e.g. changes done).

(5) Additionally, the experimenter replied on behalf of the system to participants' questions that could be answered with a Google search query (e.g. "Is it people is, or people are?").

After completion of the two tasks, participants were interviewed. The interview style was the same as in phase 2, with a semi-structured schedule and similar questions.

3.3.4 Data analysis and Results. The recordings from the experiment and the post-experiment interviews were qualitatively analyzed after the study, similarly to previous phases. The findings are reported below.

• Editing behavior:

– composition: participants composed the entire piece before revising it.

– revision: the use of previously reported editing strategies was observed in this phase: re-dictation, commands and more abstract instructions.

– redundancy: some participants were redundant in their editing expressions.

"Change the to a. A French movie" (commands and re-dictation)

• Speaking style:

– specificity: participants' utterances again ranged from very specific to very unspecific. For example, participants were unspecific when using deictic references. Deictic references need contextual information to be disambiguated (e.g. "Change that").

– human-human interaction: participants mostly spoke to the system as if it was human (e.g. "Can you insert...?").

– composition:

∗ speed: participants dictated very fast, making it almost impossible for the experimenter to type it down. However, some participants dictated more slowly when composing/revising a formal-style text.

∗ sentence delimiters: participants did not introduce any sentence delimiters when composing (e.g. punctuation signs). Instead, they used long utterances, linked by the conjunction and.

∗ pauses: participants' pauses did not correspond to natural sentence boundaries.


– revision: even if participants did not use sentence delimiters when composing, when navigating the text they organized it into sentences (e.g. "Repeat the first sentence").

• Speech disfluencies:

– repetitions: “About the- (pause) about the gender equality-, whether the gender is equal”

– self-repair: “I would give the movie- eh I would give the series, three out of five stars”

– fillers: so, anyway, you know, well...

• Participants' experience:

– system response: some of the system responses were disturbing to participants, especially the ones informing of error. A participant reported feeling annoyed when the system could not understand an instruction.

– expectations: participants were not sure what to expect from the system prior to the experiment, and they adapted their expectations as they went.

3.4 Conclusion

Prior to simulating a system allowing for eyes-free dictation with unconstrained speech, a pilot study was conducted.

The pilot study was divided into three phases and allowed for a preliminary exploration of users’ strategies and interaction patterns. Based on the observations from the three phases of the study, crucial system operations and responses needed for the tasks of eyes-free composition and revision were identified, and the design of the main study was defined.

4 WIZARD OF OZ STUDY

In order to understand how users perform eyes-free dictation tasks when not constrained by a fixed vocabulary and a command-based interaction, a Wizard of Oz study was conducted. Wizard of Oz is a design methodology that allows for the evaluation of a system without a working prototype. Typically, the system response is partially or totally simulated by a human, without the subjects/users being aware of it [4].

For the purpose of this study, the operations and response of a dictation system were simulated. The system could be used completely eyes-free to compose and revise text, and allowed for an unconstrained use of language. Participants in the study were not aware of the system being operated by a human.

4.1 Simulated system

Based on the results from the exploratory study, the crucial system operations and responses were defined.

Text insertion operations of more than one sentence were automated, using state-of-the-art speech recognition¹³. All the other operations were manually performed by the experimenter, according to the following rules (a toy sketch of how a few of these rules might be automated is given after the lists):

• System operations

– Execute commands with unconstrained vocabulary and grammar (e.g. change, replace, “I would like to switch...”)

¹³ https://cloud.google.com/speech-to-text/

– Discard fillers (e.g. hum, uh, OK, so, anyway...), repeated words, think-aloud words, and communication to the system (e.g. questions)

– Fix self-repair phrases (e.g. “I would rate this movie- I mean, this series”)

– Punctuate sentences based on prosody and semantics of content

– Perform specified compound operations (e.g. “Delete all OKs”)

– Perform requested operations even if containing unmatched or erroneous patterns from the text, when obvious (e.g. "In the Netflix show" instead of "In the Netflix series")
– Repeat the full modified sentence after a modification has been done

– Resume/start reading text after silence

– Understand and disambiguate deictic references with the cursor position, when obvious (e.g. "Delete that sentence")
– Undo

• System responses
– Inform of error
∗ "I did not understand your instruction."
∗ "I could not find the words you mentioned."
∗ "Where should I do it?"
∗ "Wait, I missed something. I will repeat the last sentence."
– Ask for confirmation
∗ "Are you sure to delete ... «the content to be deleted»"
– Confirm an action
∗ Repeat the sentence with changes
∗ "I am ready"
∗ "OK"
∗ "OK, I have deleted that"
∗ "I will repeat the last recorded sentence."
∗ "Working on it"
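The rules above were applied manually by the experimenter. Purely as an illustration of the kind of open-grammar command matching they imply, the toy Python sketch below maps free-form editing phrases onto canonical operations using synonym keywords. The function name and synonym lists are illustrative assumptions, not part of the study prototype, and a real system would additionally need to handle re-dictation, deictic references, and disfluencies.

```python
import re
from typing import Optional

# Illustrative synonym sets for open-grammar commands (assumed, not exhaustive).
OPERATION_SYNONYMS = {
    "delete":  ["delete", "remove", "erase", "scratch"],
    "replace": ["replace", "change", "switch", "swap"],
    "insert":  ["insert", "add", "append"],
    "read":    ["read", "repeat", "play back"],
}

def detect_operation(utterance: str) -> Optional[str]:
    """Return the canonical operation named anywhere in the utterance, if any."""
    lowered = utterance.lower()
    for operation, synonyms in OPERATION_SYNONYMS.items():
        if any(re.search(rf"\b{re.escape(s)}\b", lowered) for s in synonyms):
            return operation
    return None

# e.g. detect_operation("Could you scratch the last sentence, please?") -> "delete"
```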

4.2 Participants

Seven participants (three male, four female, mean age 23.2) volunteered for the experiment.

One participant had never used a voice interface before, and six had used one occasionally (once or more). Four participants had never dictated to another person before, and three had done it occasionally. Three participants had never dictated to a device before, and four had done it occasionally. Two participants were native English speakers. The others had English as a second language.

4.3 Apparatus

4.3.1 Hardware. A piece of cardboard separated the participants from the experimenter/system. On the participants' side, a Blue Yeti microphone¹⁴ placed in front of them was used to collect speech input. It was connected to a laptop, to feed the speech input into the recognizer. A speaker was used to reproduce the system response to participants.

On the experimenter’s side, the system was locally running in a MacBook Pro. Another speaker was used to reproduce the system response to the experimenter.

¹⁴ https://www.bluedesigns.com/products/yeti/


An iPhone camera was used to film the experiment. Figure 1 shows the experiment setup.

Figure 1: Experiment setup. Participants sat on the right side and the experimenter/system on the left side.

4.3.2 Software. A prototype was designed in-house for the purpose of this study, integrating the following components:

• Google’s speech recognition to convert participants’ voice into text.

• An editing interface that allowed for manual modification of the inputted text in real time.

• A Text-to-Speech (TTS) reader to reproduce the text and the system response.

• Predefined messages to: inform of error, ask for confirmation or confirm an action.

The recording software Camtasia¹⁵, running on the MacBook, was used to capture the audio fed from the microphone and the screen displaying the recognition output.
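The prototype's implementation details are not reported here. As a rough sketch of how the components listed above could be wired together, the snippet below transcribes a recorded utterance with Google's Cloud Speech-to-Text Python client and speaks a canned response with an offline TTS engine (pyttsx3). The file name, the response string, and the choice of pyttsx3 are assumptions for illustration, not the prototype's actual code.

```python
# Sketch only: pip install google-cloud-speech pyttsx3
from google.cloud import speech
import pyttsx3

def transcribe(path: str) -> str:
    """Send a short audio file to Google Cloud Speech-to-Text and return the transcript."""
    client = speech.SpeechClient()
    with open(path, "rb") as f:
        audio = speech.RecognitionAudio(content=f.read())
    config = speech.RecognitionConfig(language_code="en-US")
    response = client.recognize(config=config, audio=audio)
    return " ".join(result.alternatives[0].transcript for result in response.results)

def speak(text: str) -> None:
    """Read a piece of text (or a predefined response message) out loud."""
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

if __name__ == "__main__":
    transcript = transcribe("utterance.wav")  # hypothetical file name
    print(transcript)                         # shown on the wizard's editing screen
    speak("Working on it")                    # one of the predefined responses
```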

4.4 Procedure

Participants were briefly introduced to the voice interface. Participants could talk to the experimenter only during the briefing and after the experiment. During the experimental sessions, the separating piece of cardboard prevented participants from realizing that the experimenter was manually operating the system.

Participants were asked to compose and revise four pieces of text using eyes-free dictation. They were instructed to speak to the system in whatever way they preferred, and to try not to be limited by their knowledge of voice interfaces.

To revise the text, they had to instruct the system to read it out loud; they could do that at any point during the task. To make changes, they had to instruct the system in any way they preferred.

¹⁵ https://www.techsmith.com/video-editor.html

Participants were asked to make sure the content they had generated was ready to be published online, in order to complete the task.

During the experiment, participants' speech was converted into text in real time. The experimenter performed the necessary operations (e.g. repeating, editing) when requested by participants. The experimenter manually operated the system response.

In order to become familiar with the interface, participants first performed a training task, in which they had to compose and revise a short message (2-3 sentences) to a friend or family member.

After the training task was completed, participants proceeded to complete the experimental tasks. After completing the four tasks, participants were interviewed. The interview followed a semi-structured schedule: participants were asked about their particular strategies and behaviors. The entire experiment took around one hour.

4.4.1 Tasks.

• Task 1. Compose and revise two Facebook posts (100 words) choosing two of the following topics:

– The last book they had read or one they like

– The last movie/show/series they had watched or one they like

– A product they had recently purchased

– A recent piece of news or an article they had read online
– Their latest vacation or an eventful one

• Task 2. Compose and revise two formal emails (100 words) choosing two of the following topics:

– An email to a professor stating their interest in a project and asking him/her to collaborate.

– An email to the Office of Financial Services asking for a stipend or allowance to do an internship abroad.
– An email to the Office of Student Affairs requesting a change of apartment.

– An email to the insurance company making a claim (e.g. for a car accident, stolen phone...).

– An email asking for a refund on a faulty item bought online.

4.5 Data analysis

Audio and video data were collected from the experiment. After the study was finished, a qualitative analysis was conducted on the data. First, a literal transcription of the audio recordings was performed, which means that fillers (hum, eh...) and pauses were not discarded. Then, the transcript was coded for thematic analysis. Further analysis was performed on the transcript to identify sequences of composition and revision.

To understand which strategies and patterns were most used by participants, a quantitative analysis was conducted.

5 RESULTS

This research is concerned with understanding how users perform eyes-free dictation tasks when not constrained by current technical limitations. A number of subquestions are expected to be answered as well, like: how do users speak to the system and how do they expect the system to operate and respond? Or how do users switch between composition and revision tasks?


In line with these questions, the following themes, which emerged from the thematic analysis of the audio recordings from the experiment, are presented.

5.1 Specificity level

Participants' utterances do not always consist of unequivocal instructions. Even if they contain instructions, utterances can also have noise. For example, in the expression "Could you change...?", the first two words do not provide the system with any valuable information. There is noise in participants' utterances, for example, when they speak to the system as if it was human, often using polite words, or when they think out loud. The system must be able to identify and discard every part of the utterance that is not intended as an instruction.

Participants' expressions range from very specific to very unspecific, as shown in Figure 2.

• Think out loud: these expressions may contain an implicit instruction or an operation suggestion

P02 "Can you play back? There was another error"

But that is not always the case. In the expression below, the second sentence does not contain any valuable information.

P02 "Sorry, can you stop just there? Hum, I can't remember what I was supposed to say"

• Questions: participants ask questions to the system. These can be more direct.

P07 “Can you spell the last word?”

But they can also contain implicit instructions. In the expression below, it is not clear if the participant wants the system to repeat the sentence.

P02 “Did I say- Is that really addictive?”

• High-level commands: these are more abstract instructions that nevertheless contain operation requests.

P04 “Have only one whether in the sentence”

• Low-level commands, re-dictation with commands, and re-dictation: these will be explained in the next subsection (5.2) about editing strategies. They are the most specific and also the most used expressions.

5.2 Editing strategies

Editing is part of text revision, consisting of insertion, replacement and deletion operations. In this work, insertion is not counted as revision but as composition, when the inserted words make a complete sentence (e.g. P01 “Add a new sentence after the fifth one: I am available Monday through Friday and have attached my resume to the bottom of this email”)

Participants employ different strategies to request editing operations, as shown in Figure 3.

• Re-dictation: re-speaking the intended text with correct words of context (to the right or left of the target)

System "I'd like to claim a refund or gain replacement–"
P02 "Or request a replacement"

• Low-level commands: use of keywords that refer to an operation. Keywords can be contained in a sentence.

P02 "Can you add dear sir slash madam?"

Figure 2: Specificity in participants' expressions requesting operations, from the least specific to the most specific

• Re-dictation with commands: a strategy that combines the use of keywords and repetition of the intended text in the same expression.

P03 "Could you switch the last sentence to: I hope there'll be a Friends reunion (...)"

Low-level commands appeared as the most used strategy, even being the only strategy used by some participants. Some participants adapted their strategies throughout the course of the experiment. For instance, P05 started using re-dictation with commands to make sure his request was being understood. Then, after the first interactions with the system, he switched to just re-dictation.

Figure 3: How participants request editing operations

5.3 Target location strategies

A revision expression consists of the operation and the target. The operation can be edit (insert, replace, delete) or read (repeat, stop, resume). For example: "Delete cat" or "Repeat the first sentence".

Participants use a variety of ways to refer to the target of the operation, as shown in Figure 4.


• Descriptive reference: use of expressions that require a certain level of inference from the system. In the expression below, the identity relation between the matric card number and the actual number needs to be inferred.

P03 "Could you change the matric card number to eight zero one one four two two eight?"

• Pronouns: use of deictic references that need contextual information to be disambiguated.

P02 "Please change it: because season one was pretty good"

• Use of the absolute position of the target in the text

P01 "Add, to the beginning, the Pixar short"

• Relative position of the target: use of its position with respect to another word/s

P07 "Add a before really"

• Text to be modified: use of textual content

P01 "Delete growing potatoes"

The most common strategies are the use of the absolute position and the use of the text to be modified. Normally, the absolute position is used for read operations (e.g. "Repeat the first sentence"). The textual content, however, usually appears after edit operations (e.g. "Delete many").

Figure 4: How participants locate the target of the operation

5.4 Switching between composition and revision

In the absence of keywords or different modes (e.g. composition vs proofread mode), it is necessary to understand how participants switch between composition and revision operations.

Figures 5 and 6 show examples of composition-revision sequences recorded from the experiment. Each line in the figures represents a single utterance. For instance, in the fictitious example below, the participant's utterance would be coded as one line.

System “Hey”

Participant "Hey, how are you?"
System "I am good, thanks"

The two patterns observed in those sequences are described below:

• First composing, then revising: some participants compose the entire piece before revising it. In doing so, they indicate they have finished composing, for example by asking the system to repeat the text. In the example shown in Figure 5, P01 marks the end of composition by saying "Can you read this back?" Then, she starts with revision until she is satisfied with the result.

However, some participants remain silent after composition, expecting the system to take action and read the text.

• Alternating composition and revision: some participants interweave the actions of composition and revision, sometimes even in the same utterance.

In the example below, the participant's utterance starts with composition, then switches to revision, and then back again to composition. The system must be able to recognize the command "erase" in the utterance, and understand that what comes after the instruction is intended as new text.

P03 "I recently came across (...) an article about (...) Completely erase that sentence please (...) So, this article I just read is about"

As shown in Figure 6, some participants compose in chunks, then they revise them, and then compose new chunks. In doing so, sometimes the switch between composition/revision is not marked. For example, P04 starts composing again without indicating she is going to add new information. The system must be able to understand that the utterance is intended as new text to be appended.

System "You can hire a bike and ride around the island, which is hundred kilometers in circumference."
P04 "The island (...) has a dense forest."

Figure 5: Composition-revision sequence based on participants' utterances. "Others" refers to any expressions that cannot be categorized as either composition or revision.

5.5 System response

Predefined response messages were programmed into the system to inform of error, ask for confirmation of an action, and confirm an action. A description of the messages has been provided in Section 4.1.

The aim was to understand which system responses are the most/least used, to find out if some are needed more/less than others.

Figure 6: Composition-revision sequence based on participants' utterances. "Others" refers to any expressions that cannot be categorized as either composition or revision.

Figure 7 shows the number of appearances of each system response per participant. The confirmation message "Working on it" seems to be the most used. This response was provided after new text insertions. The speech recognition output was manually punctuated by the experimenter, thus a certain delay could be expected before the system could start reading. There were technical difficulties related to Google's API for speech recognition, which generated longer delays for the first three participants. That is why the occurrence of this message is more frequent. Also, the more frequent the experimenter's intervention was (e.g. with P03, as she combines composition and revision), the more frequent the occurrence of this message is.

During the interviews, some participants mentioned they liked that the system was using this confirmation message.

After that, the confirmation message "OK, I have deleted that" is the most used. This makes sense, since it is one of the system's confirmation responses to changes done – the other one is repeating the sentence with changes, in the case of insertion and replacement.

Apart from the confirmation messages, some of the messages to inform of error are used often. For instance, the message “Wait, I missed something. I will repeat the last sentence.”

It was noted that the message "Are you sure to delete...?", to ask for confirmation of an action, was never used by the system. This message had the purpose of informing users of risky operations of which they could not be aware (e.g. deleting an entire paragraph by accident), which is a limitation of not having visual feedback. Participants did not try to perform any risky operation during the experiment.

6 DISCUSSION

6.1 Use of commands

The study reported in this thesis found that, even if participants are allowed and encouraged to speak freely, they use a lot of commands and keywords. Further research would be necessary to confirm this, but it could be related to the following factors.

• Participants are used to interacting with computers using commands, not only when speaking to voice-based interfaces but also when interacting with graphical user interfaces (e.g. cut, copy, paste... are examples of commands)

• Participants are not aware of other strategies to instruct the system (e.g. re-dictation for editing operations)

P07 was not aware that she could re-speak the intended text instead of using commands. For example, she says "Change it was really to it was a really". She could have just said "It was a really".

Figure 7: Number of occurrences per participant of each pre-recorded system response

• Participants do not trust the system's capabilities to understand their issued instructions and perform their intended operations.

P01 “I was hoping to be more specific so that the system can understand me”

P06 “If I just (re-dictate), it may not know whether I want to add or to remove”

P07 “Maybe the system won’t know which word I want to change... What if it changes another word?”

The use of commands may be suitable for some operations, such as deleting a portion of the text. However, in some cases, the use of commands is slower and more error-prone. When participants try to repeat exact phrases, for instance to replace a word, they sometimes make mistakes.

6.2 Switching between composition and revision

The strategies for switching between composition and revision were shown to vary, even within subjects. Two participants often switched between composition and revision without indicating it. This could be due to the lack of visual feedback, which might make users think in a more chaotic way – users do not have a clear overview of the text.


In order to reduce the cognitive load for the user (e.g. having to remember switch phrases like "Proofread that"), the burden of distinguishing between composition and revision should be on the system. So if, for example, users start composing immediately after revision without indicating the switch between the operations, the system should be able to identify the input as new information.

6.3 Thinking out loud

Results showed that participants sometimes think out loud, or speak words they do not intend to be typed. This could be due to the users' perception of a certain lack of time to think. Compared to other entry methods such as keyboarding, this could be a limitation of eyes-free dictation.

P07 “When I am typing, I can think at the same time”

6.4 Organizing the text

As shown in the study, participants normally do not indicate sentence boundaries. The example below is a rare case of an utterance containing punctuation signs.

P03 “Dear sir or madam. Next paragraph- wait comma and next paragraph, with an indentation”

However, it was observed that participants navigate the text using sentences (e.g. "Repeat the first sentence"). In doing so, they expect the system to punctuate the text for them and also to do it "their way". For instance, when participants say "Delete the third sentence", they expect the system to delete the sentence they consider to be the third one.

6.5 Formal vs casual dictating style

With the goal of understanding if there were differences when participants dictate formal versus casual text, participants were asked to perform different eyes-free dictation tasks. No differences were observed in terms of user strategies and behavior. However, in the post-experiment interviews, six participants mentioned that they would prefer to use the system only for composing and revising short messages to friends or social media posts. Only one participant, P02, mentioned that he would prefer dictating formal emails, as formal emails normally "follow a set pattern".

6.6 Human-human interaction

It was observed that, in some cases, participants interact with the system as if it was human. That is noted in utterances such as "Can you delete...", "Please change that to...", or "Could I rephrase that?". These utterances provide evidence of the existing conversation between users and the system, where users can even be polite. The use of polite words has been reported in previous studies [8].

Since speech is the primary method of communication between humans and we use it every day, it could be easier or more natural for users to interact with the system in this way.

P02 “I would interact almost as I would with another person”

P04 “I was just trying to be specific, if it was a person, I think I would do the same”

However, as will be explained in the last part of this section, the direct consequence of users being more polite or conversational is the generation of noise in the communication. Phrases such as "Could you", "please", or "sorry" do not contain valuable information for the system and should be ignored.

6.7 Recommendations for future systems' design

There is still some way to go before computers can understand and interpret completely unconstrained human speech. The findings of the current research aim at informing the design of future dictation systems allowing for a free use of speech. This thesis contributes a list of high-level recommendations. Further research would be necessary in order to provide more detailed requirements.

Even if this research has been focused on eyes-free dictation, the design recommendations below can also be applied to traditional dictation using unconstrained speech.

• User-friendly Speech-to-Text. As a result of allowing for unconstrained speech, systems will have to deal with disfluencies (e.g. fillers, self-repair, repeated words) and noise (e.g. polite words). A word-by-word transcription of unconstrained speech might be unreadable for users or make no grammatical sense. To achieve user-friendly speech recognition, the system needs to identify and dispose of the parts of the communication that are not intended as text entry by the user (see the illustrative sketch after this list).

• Interpreting unspecific speech. When instructing the system, users' expressions may not always be specific, which can be related to the lack of visual feedback. Users have to remember what they have dictated, and think of what they are going to dictate. For that, they use different strategies. For instance, a user may interrupt the system uttering: "Change that to I won't spoil the ending", where that refers to the sentence previously dictated by the system. Disambiguating unspecific expressions is necessary to facilitate unconstrained dictation, especially in an eyes-free context.

• Free grammar. The system should allow the use of commands with an open grammar. The results of the current research show that the use of commands prevails for instructing the system. However, the results also proved that users do not limit themselves to a fixed syntax. When issuing commands, users use synonyms (e.g. delete and remove) and phrases ("Can you add...?"). The current research does not propose getting rid of commands in eyes-free dictation, but allowing for an unconstrained syntax.

• Different editing strategies. The study reported in this thesis showed that the use of low-level commands like "Delete Monday" was a common editing strategy. However, the study provided evidence of the use of other strategies for editing text, such as re-dictating the intended text. The system should allow users to use the strategies they consider most appropriate in each case.

• Easy navigation. Navigating the text eyes-free was shown to be challenging for users by the current study, and also by previous research [2]. Especially in an eyes-free context, users have difficulties navigating to a specific target because they have to remember exact phrases. Systems should make navigation easier for users. However, further research would be necessary to better understand how to improve eyes-free text navigation.

• Understanding the user's intention. Understanding what users intend as text entry and what is part of the communication with the system or thinking out loud is necessary in eyes-free dictation. Speech disfluencies may contain valuable information in that regard. The occurrence of filled pauses such as hum, eh, ah, etc. could indicate that the user is thinking out loud. A system capable of recognizing when the user is dictating and when they are simply rambling (probably as a result of not having visual feedback) would give users time to think. According to the current research, users do not feel they have time to think when using eyes-free dictation. A system transcribing everything the user says would not be helpful in that regard.

• Easy switching between composition and revision. Due to the lack of modes (e.g. compose vs. proofread), users have to decide how to switch between composition and revision. A user may compose first and then revise, clearly indicating the switch (e.g. "Can you repeat what I have just composed?"). Another user may instead alternate between composition and revision without giving notice to the system. The burden of identifying the different operations should be on the system, not on the user (a classification sketch follows this list).

• Helpful system response. The system response should help users, not disturb them. For example, feedback on successfully performed operations is useful when users do not have visual access to the text. On the other hand, a talkative system may disturb users, especially when they are thinking or trying to recall. Further research would be necessary to better define the system response.

• Inform about system status. In relation to the previous point, a useful system response is one that informs not only of system failure, but also of the system running smoothly. A silent system could confuse users, who may think something is wrong.
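To make the "User-friendly Speech-to-Text" and "Understanding user's intention" recommendations more concrete, the following minimal Python sketch post-processes a raw transcript: it strips filled pauses and immediate word repetitions, and flags utterances with a high proportion of fillers as probable thinking-aloud. The filler list, the threshold and the function names are illustrative assumptions, not components of the system studied in this thesis.

import re

# Illustrative filler list; a deployed system would learn such cues from data.
FILLED_PAUSES = {"um", "uh", "hum", "eh", "ah", "erm", "hmm"}

def clean_transcript(raw: str) -> str:
    """Drop filled pauses and immediate word repetitions from a raw transcript."""
    tokens = re.findall(r"[\w']+", raw.lower())
    cleaned = []
    for token in tokens:
        if token in FILLED_PAUSES:
            continue  # filler such as "um", "eh"
        if cleaned and token == cleaned[-1]:
            continue  # immediate repetition such as "the the"
        cleaned.append(token)
    return " ".join(cleaned)

def looks_like_thinking_aloud(raw: str, threshold: float = 0.25) -> bool:
    """Heuristic: a high proportion of filled pauses suggests the user is
    thinking out loud rather than dictating text to be entered."""
    tokens = re.findall(r"[\w']+", raw.lower())
    if not tokens:
        return False
    return sum(t in FILLED_PAUSES for t in tokens) / len(tokens) >= threshold

if __name__ == "__main__":
    print(clean_transcript("um eh I I went to the the cinema hum yesterday"))
    # -> "i went to the cinema yesterday"
    print(looks_like_thinking_aloud("hum eh let me think ah what else"))
    # -> True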
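The "Interpreting unspecific speech" and "Free grammar" recommendations could, for instance, be approached with a small intent matcher that accepts several surface forms of the same editing command and resolves an unspecific reference such as "that" to the sentence most recently read back to the user. Everything below (the command vocabulary, the reference words and the function name) is a hypothetical sketch, not the grammar used in the study.

import re

# Hypothetical open vocabulary: several surface forms map to one editing intent.
COMMAND_SYNONYMS = {
    "delete": ["delete", "remove", "erase"],
    "change": ["change", "replace"],
    "add": ["add", "insert"],
}

UNSPECIFIC_REFERENCES = {"that", "it", "this"}

def parse_instruction(utterance: str, last_sentence: str):
    """Return (intent, resolved_argument) for a free-form instruction.

    An argument starting with "that"/"it"/"this" is resolved to the
    sentence most recently read back to the user (last_sentence)."""
    text = utterance.lower().strip("?!. ")
    for intent, verbs in COMMAND_SYNONYMS.items():
        for verb in verbs:
            match = re.search(rf"\b{verb}\b(.*)", text)
            if match:
                argument = match.group(1).strip()
                first_word = argument.split()[0] if argument else ""
                if first_word in UNSPECIFIC_REFERENCES:
                    # Resolve the unspecific reference to the last dictated sentence.
                    argument = argument.replace(first_word, f'"{last_sentence}"', 1)
                return intent, argument
    return "dictate", utterance  # no command found: treat as text entry

if __name__ == "__main__":
    last = "I will not spoil the ending"
    print(parse_instruction("Can you remove Monday?", last))
    # -> ('delete', 'monday')
    print(parse_instruction("Change that to I won't spoil the ending", last))
    # -> ('change', '"I will not spoil the ending" to i won\'t spoil the ending')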
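Finally, for "Easy switching between composition and revision", the system could label each utterance itself instead of asking the user to announce the switch. The sketch below is one deliberately simple heuristic, placed in front of a parser like the one above: an utterance containing a known command verb or request phrase is treated as revision/control, anything else as composition. The cue list is an assumption for illustration only; a deployed system would need a far richer model.

# Hypothetical cue phrases; in practice these would come from user data.
REVISION_CUES = (
    "delete", "remove", "change", "replace", "add", "insert",
    "can you", "could you", "repeat", "read", "go back",
)

def classify_utterance(utterance: str) -> str:
    """Label an utterance as 'revision' (editing/control) or 'composition'
    (text to be entered), so the user never has to announce the switch."""
    text = utterance.lower().strip()
    if any(text.startswith(cue) or f" {cue} " in f" {text} " for cue in REVISION_CUES):
        return "revision"
    return "composition"

if __name__ == "__main__":
    print(classify_utterance("Can you repeat what I have just composed?"))  # revision
    print(classify_utterance("Dear John, I hope you are doing well."))      # composition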

7 CONCLUSION AND FUTURE WORK

Eyes-free dictation is an appropriate text entry method when our hands and eyes are occupied, and also for visually impaired users. However, current dictation systems are difficult to use eyes-free, as they require users to memorize and repeat specific keywords and exact phrases for editing operations. Relaxing current constraints would contribute to making eyes-free dictation possible and user-friendly. The purpose of this thesis has been to understand how users would perform eyes-free dictation when not constrained by the use of commands and a fixed syntax, in order to inform the design of future systems.

With that purpose, a Wizard of Oz study was conducted, simulating a dictation system that could be used completely eyes-free and with an unconstrained vocabulary. Participants were asked to perform different text composition and revision tasks, speaking freely to the system. To define the crucial system operations and responses, an exploratory study was conducted prior to the main study.

The Wizard of Oz study showed, amongst other findings, that users' instructions are not always specific and can contain "noisy" words. To deal with these, a certain level of understanding is required from the system. Moreover, different editing strategies were detected. Even though users were allowed and encouraged to speak freely, the use of commands and keywords (e.g. delete, change) was common, yet with an unconstrained grammar.

This work contributes a set of design guidelines to inform future systems allowing for unconstrained eyes-free dictation. However, due to time constraints, the study reported in this research had to be performed with only seven users. Future research should focus on gathering more participants in order to collect more data and determine the external validity of the current findings. Furthermore, conducting the study in a real-life scenario (e.g. during users' commute) would contribute to assessing its ecological validity, as different behaviors may occur when users are, for instance, walking.

On the other hand, collecting more data would be necessary to refine the system response, as the current system's response is based on a few cases observed during the preliminary study. Future work should also focus on further automating the system operations, in order to reduce human intervention.

8 ACKNOWLEDGEMENTS

Many thanks to Can Liu and Frank Nack for their supervision and help. Thanks as well to Shengdong Zhao for allowing me to join their lab at the National University of Singapore (NUS) during the development of this thesis, and for facilitating the resources needed to conduct the research. Also thanks to Debjyoti Ghosh and all the colleagues from the HCI lab at NUS for their support.

REFERENCES

[1] Shiri Azenkot and Nicole B. Lee. 2013. Exploring the use of speech input by blind people on mobile devices. In Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility - ASSETS '13. https://doi.org/10.1145/2513383.2513440

[2] Jinjuan Feng and Andrew Sears. [n. d.]. Using Confidence Scores to Improve Hands-Free Speech Based Navigation in Continuous Dictation Systems. Technical Report. https://doi.org/10.1145/1035575.1035576

[3] Debjyoti Ghosh, Pin Sym Foong, Shengdong Zhao, Di Chen, and Morten Fjeld. [n. d.]. EDITalk: Towards Designing Eyes-free Interactions for Mobile Word Processing. https://doi.org/10.1145/3173574.3173977

[4] Paul Green and Lisa Wei-Haas. 1985. The Rapid Development of User Interfaces: Experience with the Wizard of OZ Method. Proceedings of the Human Factors Society Annual Meeting 29, 5 (10 1985), 470–474. https://doi.org/10.1177/154193128502900515

[5] Clare-Marie Karat, Christine Halverson, Daniel Horn, and John Karat. 1999. Patterns of Entry and Correction in Large Vocabulary Continuous Speech Recognition Systems. https://dl.acm.org/citation.cfm?id=303160

[6] Tatsuya Kawahara. 2007. Intelligent Transcription System Based on Spontaneous Speech Processing. In Second International Conference on Informatics Research for Development of Knowledge Society Infrastructure (ICKS'07). IEEE, 19–26. https://doi.org/10.1109/ICKS.2007.13

[7] Martin Labsky, Tomas Macek, Jan Kleindienst, Holger Quast, and Christophe Couvreur. 2011. In-Car Dictation and Drivers Distraction: A Case Study. Technical Report. 418–425 pages. https://doi.org/10.1007/978-3-642-21616-9

[8] David R. Large, Leigh Clark, Annie Quandt, Gary Burnett, and Lee Skrypchuk. 2017. Steering the conversation: A linguistic exploration of natural language interactions with a digital assistant during simulated driving. Applied Ergonomics (2017). https://doi.org/10.1016/j.apergo.2017.04.003

[9] Christine Nakatani and Julia Hirschberg. [n. d.]. A Speech-first Model For Repair Detection And Correction. Technical Report. http://www.aclweb.org/anthology/P93-1007

[10] Jun Ogata and Masataka Goto. [n. d.]. Speech Repair: Quick Error Correction Just by Using Selection Operation for Speech Input Interfaces. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.369.7929&rep=rep1&type=pdf

[11] Sharon Oviatt, Margaret Maceachern, and Gina-Anne Levow. [n. d.]. Predicting hyperarticulate speech during human-computer error resolution. https://www.sciencedirect.com/science/article/pii/S0167639398000053

[12] Bernhard Suhm. 1997. Exploiting repair context in interactive error recovery. Technical Report. https://www.researchgate.net/publication/36452254

[13] Bernhard Suhm. 2001. Multimodal Error Correction for Speech User Interfaces. ACM Transactions on Computer-Human Interaction 8, 1 (2001), 60–98. https://pdfs.semanticscholar.org/3bc8/3e04977c6202d24a98f67be24c4356afca56.pdf

[14] Keith Vertanen and Per Ola Kristensson. 2009. Automatic selection of recognition errors by respeaking the intended text. In Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2009. https://doi.org/10.1109/ASRU.2009.5373347

[15] Yang Liu, E. Shriberg, A. Stolcke, D. Hillard, M. Ostendorf, and M. Harper. 2006. Enriching speech recognition with automatic detection of sentence boundaries and disfluencies. IEEE Transactions on Audio, Speech and Language Processing 14, 5 (9 2006), 1526–1540. https://doi.org/10.1109/TASL.2006.878255
