Pre-Producing Audio Stories with Linked Transcripts

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Nicolas Schabram

11426411

MASTER INFORMATION STUDIES

HUMAN-CENTERED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

July 24, 2017

1st Supervisor: Dr. Frank Nack
2nd Supervisor: Dr. Britta Meixner


Pre-Producing Audio Stories with Linked Transcripts

Nicolas Schabram

University of Amsterdam

Amsterdam, The Netherlands

mail@nicolasschabram.de

ABSTRACT

Several studies have exploited the potential of linking transcripts to audio material to facilitate text-based, content-oriented audio editing. Little attention, however, has been paid to how linked transcripts might also facilitate material-intensive pre-production of complex audio stories. Based on an extensive literature review, expert interviews, and a user-centered iterative design process, this paper introduces a tool that takes advantage of linked transcripts during pre-production. It facilitates tasks such as selecting usable sound bites from hours of raw material, re-organizing them using tag taxonomies and pinboard-inspired digital canvases, and writing an audio-enabled script. The resulting solution considers users’ job needs and is designed to integrate well into existing workflows. Qualitative evaluation sessions with audio storytelling professionals demonstrate the rich potential of the proposed solution.

1. INTRODUCTION & MOTIVATION

In recent years, spoken audio media have been slowly but steadily gaining popularity: Researchers and audio professionals observe growing numbers of people consuming podcasts, radio features, and audio books (Schröter, 2016; Vogt, 2016). Meanwhile, a trend in journalism towards elaborate storytelling has long since reached the audio sector (Buchholz, 2017b) with well-researched and meticulously scripted radio shows and podcasts like This American Life, Radiolab, or more recent audio blockbusters such as Serial or S-Town (Hess, 2017). The complexity level of such audio storytelling productions exceeds that of shallow 90-second radio reports or talk shows largely recorded in one go. Therefore, audio producers have adopted a plethora of different tools to get a grip on this complexity.

Naturally, most journalists use audio editing software, or a digital audio workstation (DAW), to construct their stories (Shin, Li, & Durand, 2016). With DAWs, they handle low-level tasks such as polishing the audio quality, fine-tuning cross-fades between different clips, or eliminating filler words and background noise. However, these tools enforce interaction with the audio on the most granular level possible: Speech and music are manipulated “via tedious, low-level waveform editing” (Rubin, Berthouzoz, Mysore, Li, & Agrawala, 2013, p. 113). Such a degree of detail can hardly facilitate the creative process of controlling the general flow of a story on a more abstract level. Thus, there is a series of studies (cf. Section 2.2) concerned with designing more content-oriented editing tools that rely on recent advances in speech recognition: By automatically linking textual transcripts to audio recordings, they allow users to manipulate audio by editing text, which has produced promising results in terms of usability and productivity in the past.

These tools mostly come into play for the actual production of an audio story, practically replacing DAWs. A typical workflow, however, usually involves additional pre-production tasks aimed at organizing vast amounts of recorded raw material. As Radio Diaries’ Joe Richman explains: “You get all this tape, you get 40 hours of tape or whatever, and you break it apart into little…into atoms. And then you try to find a way to fit it all back together.” (Abel, 2015, p. 116) While doing so, journalists must come up with a logical plot, facts need to be presented in a sensible order, the evocation of suspense and emotional reactions has to be managed, and so forth. Such tasks are currently performed using a variety of different tools such as word processors, script editors, audio players, and even physical note cards and sticky notes. Each of these tools comes with its own internal logic and introduces a certain level of friction when combined with other tools. As the profitability of audio storytelling still lags behind its growing popularity (Vogt, 2016), journalists would most likely benefit from a more unified solution that helps improve their efficiency and productivity when performing such pre-production tasks.

Therefore, this study investigates the potential of linking audio material and textual transcripts (as has been done in previous studies) to facilitate pre-production of audio stories. Correspondingly, my research is guided by the following research question: How can linked transcripts improve on the pre-production of audio stories in terms of usability and utility? The accompanying objective of the project is the design of a proof-of-concept interface prototype.

In the following section, prior literature on the topic is discussed, including insights about current journalistic workflows and how different tools and studies try to solve related problems (Section 2). Based on these findings, I conducted qualitative expert interviews with audio journalists to further substantiate the description of current workflows and problems and to formulate a comprehensive list of requirements (Section 3), which guided the subsequent iterative design process (Section 4). Section 5 discusses the resulting interface’s compliance with the requirements. In Section 6, limitations of the study are noted and complemented with recommendations for future work. Finally, the study’s contributions are summarized in Section 7.

2. RELATED WORK

Designing a tool that is aimed at improving journalistic workflows requires a solid understanding of these workflows as they are in practice today. Thus, the first part of this section is dedicated to pertinent literature from the field of audio journalism. In the second part, I briefly discuss existing scientific and commercial software solutions.


2.1 Audio Storytelling Workflow

Shin et al. (2016) identify three main tasks involved in creating an audio story: “writing a script, recording the speech, and editing the recorded audio” (p. 509). Based on the conducted literature review, Figure 1 presents a more fine-grained process model. Considering the variety of workflows present in the real world, the model is still likely to simplify and overly generalize the subject matter. In fact, the workflow can rather be understood as a “chaotic” process, which involves extensive back and forth and a certain level of simultaneity. However, the model is sufficiently useful as a structural element in this section and throughout the remainder of this paper.

Step 1: Plan

According to introductory literature for prospective audio journalists (Abel, 2015; Buchholz, 2017c), complex productions usually involve a planning phase: Before going out in the field, journalists develop a roadmap guiding the material collection and answering questions like: Who to talk to? Which situations to gather on tape? Which ambient noises (ambi) to record?

Step 2: Collect

Usually, the primary source for audio raw material (hereinafter referred to as tape) are original recordings collected by a reporter in interviews and during field research. Additional material can be retrieved from external sources such as public, corporate, and private archives, the internet, and movies (Buchholz, 2017a; Kirchner, 2017). The collected material is then stored to be accessible for the current and, sometimes, for future projects (Schamari, 2017).

Step 3: Transcribe

As several authors have mentioned, audio—just like video—is a time-based, strictly sequential medium, which complicates tasks such as browsing, skimming, searching, and keyword spotting (Casares et al., 2002; Whittaker & Amento, 2004). It is impractical to find specific sections within large collections only by listening and interacting with a non-semantic waveform or progress bar representation of the audio (Ranjan, Balakrishnan, & Chignell, 2006). Selecting and extracting good parts by identifying in and out points is similarly tedious, requiring “zooming and numerous repetitions of fast-forward and rewind operations” (Casares et al., 2002, p. 158). Due to these accessibility issues, audio journalists tend to suffer from overload caused by all too large tape collections (exemplarily described by This American Life's Ira Glass in Abel, 2015). A common strategy to overcome this overload is working with a textual representation of the audio. Although text is inferior in transporting emotions and subtle nuances, it avoids the accessibility challenges of audio previously described (Ranjan et al., 2006). Thus, the pre-production phase is predominantly text-based. To this end, some journalists rely on verbatim transcripts, while others write a rough sound log, sequentially summarizing what is said in a recording (Ira Glass in Abel, 2015; Kirchner, 2017).

Step 4: Select

At this stage, it is likely that a journalist has far more material than can be used in the final story (Moll, 2017b). Thus, the ‘Select’ step involves reviewing the tape and transcripts and extracting the best quotes (Buchholz, 2017c), also referred to as ‘sound bites’ or ‘actualities’ (MacAdam, 2015). There are varying strategies for how to perform this selection. For example, some journalists find it useful to accelerate playback of the audio while simultaneously skimming the log or transcript to quickly identify good parts (Moll, 2017b). Usually, the transcribed sound bites are put down in a list, along with additional metadata such as time codes, duration, ascending numerical identifiers, speaker information, recording dates, and notes (Buchholz, 2017c; Moll, 2017b; Reintjes, 2016; Schamari, 2017). Buchholz (2017a) notes that the interviewee relies on the journalist to present her statements in a way that reflects the intended meaning. Cutting tape into loose pieces, on the other hand, always bears the risk of accidentally taking her statements out of context later during the process. Thus, special diligence is recommended during the selection process.

Step 5: Organize

Interviewees do not always tell a story the way it is best told in the final product. Thus, a journalist has to find an adequate order for the selected sound bites (Buchholz, 2017c). For material-intensive productions, this is a non-trivial task, requiring numerous iterations. Just like doing a jigsaw, journalists try to fit sound bites back together in a way that makes sense and maintains tension throughout the story (Abel, 2015). Due to the variety of existing approaches, a complete review of methods exceeds the scope of this paper. However, one technique that audio producers seem to use rather regularly is organizing sound bites and ideas with physical note cards or sticky notes, as S-Town’s Brian Reed explains in the Longform Podcast (2017):

There’s a whole wall in the office that’s just filled with note cards [...]. Because the way we structured it is like ... it’s just Julie and me […] sitting there and kind of like jamming, […] just talking through different ways that it [the story] could go.

Another approach that is sometimes used for complex stories is clustering like or related sound bites together to get a more structured overview of the material (e.g., Radio Diaries’ Joe Richman in Abel, 2015).

Whatever strategy the journalist might have followed to order the sound bites, the resulting list then serves as a structural framework for the story (Buchholz, 2017c).

Figure 1. Audio storytelling workflow

(Steps: 1. Plan, 2. Collect, 3. Transcribe, 4. Select, 5. Organize, 6. Script, 7. Record, 8. Produce; grouped into a Pre-Production and a Production phase)


Step 6: Script

A script is a text document outlining the audio story in depth. It contains all the elements to be used in the final production. This includes fully or partially transcribed sound bites, where full transcription is deemed to be more suited to convey the mood and meaning but might not be feasible under time pressure (Schamari, 2017). Additionally, a script might contain voice-over translations for foreign-language sound bites (Moll, 2017b). Usually, sound bites are complemented by narration passages to be recorded by a speaker in a studio. Narration bridges logical gaps and moves the story forward (Buchholz, 2017b). In some cases, such passages are only summarized with rough talking points, encouraging speakers to talk more naturally (Linke, 2017). Sometimes, narration is accompanied by speaker instructions (Linke, 2017), such as indicating an ironic diction, how to pronounce a foreign term, or which word to emphasize in a phrase. Furthermore, reporters usually specify where to use music, ambi, and sound effects, and leave technical instructions for production (Schamari, 2017). Finally, scripts might be segmented into distinct scenes (e.g., Joe Richman in Abel, 2015).

Apart from helping the reporter to draw up the story, a script serves other purposes as well. In particular, large audio stories often are the result of collaboration. In German public radio, for example, tasks are usually distributed among a freelance reporter who researches and writes a script, an in-house editor who is responsible for the show that is airing the story, a speaker recording the narration, and a producer performing the final editing (Jarisch, 2017). These roles, however, are fluid and vary between different media outlets, not least because of recently emerging non-traditional organizational forms like podcast networks or self-publishing online journalists. Either way, scripts allow multiple contributors to work on a story together. For example, editors might give written or spoken comments, feedback, and suggestions for improvement to a reporter based on the script (Kirchner, 2017).

Another purpose of the script is to predict the duration of the final audio story before producing it, which is especially relevant for time-constrained radio broadcasts. Usually, the number of characters, lines, or pages is used as a proxy to estimate duration (Buchholz, 2017b).

Step 7: Record

In this step, the narration passages and translation voice-overs as formulated in the script are recorded. This is usually done by a professional speaker or the reporter and mostly happens in a recording studio or at home (Jarisch, 2017). Some media outlets employ professional producers who operate the equipment, while others let the speakers operate the studio by themselves (Schamari, 2017).

Step 8: Produce

Ultimately, the script serves as an instruction manual for piecing the collected and recorded elements together to produce the final audio story. The reporter herself or a producer usually uses multitrack audio editing software—commonly referred to as a digital audio workstation (DAW)—to perform all necessary editing steps. According to Moll (2017b), this involves tasks such as

• fine-cutting sound bites,
• bringing the audio elements (sound bites, music, ambi, and sound effects) into the right order,
• cleaning up speech (e.g., removing “uhms”, coughs, or unnecessary pauses),
• manipulating audio signals (e.g., altering voices or removing background noises),
• modifying volume levels, or
• layering clips on top of one another.

2.2 Existing Tools

Naturally, usage of different tools varies across different journalists and organizational forms. It is beyond the scope of this paper to provide a complete overview. Instead, this section focuses on characterizing the most commonly used applications as well as presenting pertinent research concerned with solving specific problems.

As mentioned before, DAWs are commonly used for virtually every advanced storytelling production. Popular products include Adobe Audition, Avid ProTools, GarageBand, Hindenburg, and Audacity (Rubin et al., 2013; Shin et al., 2016). Another system commonly used in public radio is DAVID DigAS (Moll, 2017a). DAWs help with a variety of different tasks. For example, they might be used as audio players during manual transcription (‘Transcribe’ in Figure 1), for cutting and extracting pieces from the raw material (‘Select’), for experimentally ordering and reordering elements within the story while simultaneously organizing the transcribed sound bites in a list (‘Organize’) or while writing the script (‘Script’), for recording at home or in a studio (‘Record’), and, finally, for producing the audio story (‘Produce’). In short, DAWs are omnipresent throughout nearly the entire pre-production and production workflow. This becomes possible due to the generic nature of such applications, which are typically built to support all kinds of low-level audio signal processing tasks (Shin et al., 2016).

Word processors like MS Word, Apple Pages, or Google Docs are another workhorse of most audio journalists. They are mainly used during pre-production to plan recordings (‘Plan’), to collect and organize notes, to transcribe raw material (‘Transcribe’), to list and organize sound bites (‘Select’ and ‘Organize’), or to write the script (‘Script’; Shin et al., 2016).

Apart from generic DAWs and word processors, there are more targeted tools for specific use cases. In particular, dedicated script editors like Adobe Story, FinalDraft, or Celtx extend word processors with features such as advanced “collaboration, automatic formatting, navigation and planning for future production” (Shin et al., 2016, p. 510). Scrivener, a similar application, provides additional functions for storing and organizing research material and for handling project management tasks (Literature and Latte, n.d.b). It can also be used in conjunction with Scapple, which lets users freely place notes on a visual canvas to structure thoughts and ideas (Literature and Latte, n.d.a).


To the best of my knowledge, however, none of these tools were specifically designed for audio storytelling.

Another set of tools helps journalists with transcribing raw material. For example, f4transkript features variable speed, automatic rewind, speaker management, automatic timestamps, shortcuts, and text modules to maximize efficiency of manual transcription (Audiotranskription.de, n.d.). In addition, there are numerous automatic and semi-automatic transcription tools. A comprehensive list of available tools can be retrieved from Bunce (2016).

A promising line of scientific research addresses the usability hurdle arising from overusing DAWs in audio storytelling projects: Several authors (Casares et al., 2002; Rubin et al., 2013; Shin et al., 2016; Whittaker & Amento, 2004) criticize that DAWs force the user to interact with the audio on the smallest level of detail, namely by manipulating the audio waveform.

As a result producers must map their high-level story editing and design goals onto a sequence of low-level editing operations—e.g. selecting, trimming, cutting and moving sections of a waveform. Manually applying each of these low-level edits is often tedious and usually very time-consuming. (Rubin et al., 2013, p. 113)

Correspondingly, Shin et al. (2016) point out that conventional workflows based on DAWs and word processors entail a strict separation of audio and text, causing friction by imposing the maintenance overhead of keeping them in sync onto the user. To solve both of these problems—tedious waveform navigation and an inefficient separation of text and audio—several studies employ the technique of time-aligning text with audio or video by automatically assigning time codes from the raw material to the corresponding words or sections in a transcript (Berthouzoz, Li, & Agrawala, 2012; Casares et al., 2002; Ranjan et al., 2006; Rubin et al., 2013; Shin et al., 2016; Whittaker & Amento, 2004). This technique enables users to target audio sections by selecting words in a transcript. Similar approaches have also since found their way into commercial products such as Trint (n.d.) or Descript (Detour, n.d.).
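To make this technique more concrete, word-level alignment can be represented as a list of words, each carrying the start and end time of its occurrence in the audio, so that a span of selected transcript words directly yields an audio region. The following minimal Python sketch is purely illustrative; the names (such as AlignedWord) are not taken from any of the cited systems:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class AlignedWord:
        text: str     # transcribed word
        start: float  # start time in the audio, in seconds
        end: float    # end time in the audio, in seconds

    def words_to_audio_range(words: List[AlignedWord], first: int, last: int) -> Tuple[float, float]:
        """Map a text selection (word indices first..last) to an audio region."""
        return (words[first].start, words[last].end)

    transcript = [
        AlignedWord("you", 12.10, 12.25), AlignedWord("get", 12.25, 12.40),
        AlignedWord("forty", 12.40, 12.80), AlignedWord("hours", 12.80, 13.15),
        AlignedWord("of", 13.15, 13.25), AlignedWord("tape", 13.25, 13.60),
    ]
    print(words_to_audio_range(transcript, 2, 5))  # -> (12.4, 13.6)

Cuts applied to the text can be propagated to the audio in the same way, by removing the corresponding time ranges from the recording.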

Rubin et al. (2013) and Berthouzoz et al. (2012) retrieve manually produced transcripts from an external crowdsourcing provider, which they find to be of higher quality; the other studies and commercial tools mentioned above employ more cost-effective automatic speech recognition software. In their comparative study, Ranjan et al. (2006) discover that perfectly accurate transcripts promote search tasks as users can skim the text instead of relying only on the audio. In contrast, automatically produced low-quality transcripts were found to be of limited value compared to not having a textual representation at all. This finding contradicts Whittaker and Amento (2004), who observe positive effects on efficiency even when relying on erroneous transcripts. Either way, it is generally recommended to allow users to manually correct transcription mistakes. This feature is implemented by Casares et al. (2002), Shin et al. (2016), Descript (Detour, n.d.), and Trint (n.d.). Additionally, several tools not only allow users to navigate audio through text but also provide editing functions: selections as well as cuts applied to the transcript are automatically propagated to the audio or video (Berthouzoz et al., 2012; Casares et al., 2002; Detour, n.d.; Rubin et al., 2013; Shin et al., 2016; Whittaker & Amento, 2004).

The discussed applications potentially simplify the process of producing an audio story by providing proxy mechanisms that map high-level, content-oriented actions to low-level editing operations. Nevertheless, none of the tools is fully suited to incorporate Figure 1’s audio pre-production steps, which are the focus of the present study: Casares et al. (2002) and Berthouzoz et al. (2012) are built for video, not audio. Whittaker and Amento (2004), Rubin et al. (2013), and Descript (Detour, n.d.)—although high-level and content-oriented—rather position themselves within the production phase, mostly dealing with audio manipulation and less with planning, managing material, organizing sound bites, or scripting the narrative arc. Shin et al. (2016) allow users to write a so-called master script “that is linked to the audio and always reflects the current state of the project, including unrecorded, recorded, improvised and edited portions of the script” (p. 509). This use case comes closer to the present research interest. Their design phase, however, was guided by lecturers and YouTubers who are less concerned with organizing large amounts of research tape but instead have more control over the recording phase and need a way to keep their script and their audio project in sync when simultaneously scripting and producing. Trint (n.d.) allows for automatically transcribing audio and manually correcting these transcripts as well as for sound bite selection. Writing a script, however, is not supported. Thus, users lose the connection between audio and transcript as soon as they switch to conventional word processors or script editors to order sound bites and add their narration (Reintjes, 2016).

3. REQUIREMENTS

From April 11th to 27th, 2017, I conducted a series of semi-structured qualitative expert interviews with experienced audio journalists via Skype …

• to gain a practical understanding of journalistic workflows by contrasting my literature review up to that point with real-world practices,
• to discover problems of current workflows thus far unaddressed in the preceding literature review, and,
• finally, to derive a set of requirements guiding the design process.

I recruited two male (R2, E2) and three female (R1, R3, E1) audio journalists, all of them from Germany. E1 and E2 work as in-house managing editors for two different public radio shows, both with a strong focus on long-form storytelling. R1, R2, and R3 work as freelance reporters for a variety of different news outlets. On the side, R1 produces an independent historical storytelling podcast. R2 lives and works in the United States, although predominantly producing for German radio. He has written a blog article about various tools that bridge the gap between audio and video, including some of those introduced in Section 2.2 (Reintjes, 2016). R3 focuses on foreign reporting.


For a more detailed overview, see Appendix Table 1. I drew a non-probabilistic convenience sample, paying special attention to including both freelance and in-house journalists. During analysis, repeating patterns suggested theoretical saturation, which is why I eventually stopped recruiting further experts.

The interviews were guided by a protocol structured around the following areas of interest:

• Which tasks cause friction within current workflows?
• Which software and physical tools do they use?
• How do they plan their recording phase?
• How do they manage their raw material?
• What is the role of transcription, and how is it used?
• How do they organize the plot of their stories?
• How are scripts embedded into current workflows?
• What is the role of collaboration?

On average, the interviews lasted 67 minutes. The recorded interviews were manually transcribed. Using the qualitative data analysis tool Atlas.ti (Scientific Software Development, n.d.) and following an inductive approach, I iteratively developed a coding scheme to detect overarching themes in the data. In this way, interleaving data collection and data analysis allowed me to gradually narrow down crucial patterns from one interview to the next. In a final iteration, all codes were reviewed and reformulated into an extensive list of requirements. Consequently, every requirement is either rooted in the data or in the literature review (or both). The identified requirements are listed in Appendix Table 2. It contains 88 requirements, 26 of which concern high-level properties of the tool and the workflow it entails. The others apply to specific steps within the proposed workflow model (see Figure 1). Throughout this section, numbers in brackets (e.g., 3:12) relate to the corresponding requirement identifiers in Appendix Table 2.

General Requirements

In accordance with the research question, the proposed tool should facilitate pre-production of audio stories (0:02). In this regard, E2 highlights that usability issues entail noticeable costs for large radio stations that employ and train hundreds or thousands of people. Consequently, the tool should be easy to learn and use (0:01). It should also not require cost-intensive special hardware or software (0:16).

As all participants report, one of the most difficult tasks of their work is maintaining an overview once a story grows longer and more complex. Thus, E2 calls for a tool that supports interacting with a story on two levels: high-level to outline the overarching story structure and low-level for fine-tuning details like wording or timing (0:03, 0:05). Moreover, R1 emphasizes the importance of a clean writing environment. Therefore, the tool should be as minimalistic as possible and not distract journalists from a content-oriented workflow (0:04). R1 and R2 also mention the problem of material overload, which should ideally be mitigated by the proposed solution (0:10).

As discussed in Section 2.1 and confirmed by all interviewed experts, producing an audio story should predominantly be text-based (0:06). In that regard, related work has revealed the potential of linking text transcripts to the audio. By incorporating this principle throughout the entire pre-production phase (0:08), the tool should eliminate the necessity to constantly switch back and forth between text and audio applications—an annoyance that some participants regularly deal with (R2, E2; 0:09).

As mentioned, the public radio landscape often requires freelance and in-house journalists to work together. The tool should therefore be suited to the needs of both groups (0:11, 0:12). As storytelling approaches generally differ from journalist to journalist and from station to station, it should also be flexible enough to accommodate different work styles (0:13) and adapt to different stations’ guidelines (0:14). The latter is especially relevant as E2 mentions the hesitation of large stations to adopt novel technology. To increase the likelihood of adoption, the tool should ensure solid security, privacy for journalists’ sources, data integrity, and fine-grained access control (E2; 0:23–0:26). However, as adoption by media outlets is still questionable, the tool should not require everyone involved in a project to use it (0:20); e.g., if a freelance reporter wants to use it but a station does not, this should still be possible (0:21).

Other features valued by some participants are location-independent access to work material (R1, E2; 0:15), archive and search capabilities (0:17), and collaboration features for projects with multiple contributors (E1, R2, E2; 0:22).

Step 1: Plan

All interviewed journalists usually work out a plan for collecting interview material. The depth of this plan largely depends on the project scope as well as on personal taste. However, R2 and E2 note that planning mostly happens inside their own heads, thus not requiring elaborate tools. In fact, the data does not indicate any severe shortcomings of current workflows in this regard. Having said that, E2 notes that drafting a script based on recorded tape sometimes reveals that the story could benefit from additional material. Thus, it might be useful to allow for placeholders within a script, indicating blanks to be filled (1:01).

Step 2: Collect

All interviewed reporters (R1–R3) confirm that they permanently archive their raw recordings or at least selected sound bites in folders on their hard drives (2:01, 2:02). Some journalists collect additional metadata about their recordings such as observations, relevance, quality, recording dates, or names (R2, R3; 2:03).

Step 3: Transcribe

Four out of five journalists have not worked with automatic transcription yet. R1 and R3 prefer manual transcription as the writing process provides them with a better overview of the raw material. Moreover, R1 thinks that the costs of existing services are too high. Only R2 is a satisfied Trint (n.d.) user. He reports that transcript quality is, in general, good enough to grasp the sense of a recording, although he has experienced issues with strong accents and poor audio quality.


Nonetheless, he thinks that an erroneous transcript is better than not having a textual representation at all. To improve crucial sections of the transcript, he makes use of Trint’s manual correction mode. While R2 is the only user of automatic transcription, R1 showed at least slight interest in using it. Thus, automatic transcription could benefit the workflow if it allows the user to correct errors (3:02, 3:03) and if it does not hinder manual transcription. For manual transcription, R1 uses Scrivener’s helper functions such as automatic rewind on pause; R3 uses a DAW for playback and a word processor for writing the transcript, but complains about having to switch back and forth between the two for navigating the waveform. Thus, the tool could benefit from facilitating manual transcription (3:01).

Step 4: Select

In one way or another, all interviewed journalists mention the necessity for selecting usable sound bites (4:01). This process involves a back and forth between audio and text, either for putting down a list of timecoded sound bites or for highlighting sections in a transcript (R1). While text is easier to navigate and skim, it is crucial for journalists to hear how an interviewee has said something (E2).

Sometimes, journalists perform a rough or fine cut of the sound bites during selection, extracting the respective audio sections into individual clips of a multitrack DAW (E2; 4:02).

All reporters (R1–R3) annotate selected sound bites with notes or labels, indicating a need for managing metadata about each sound bite (4:03). R3 flags foreign-language sound bites that require professional translation (4:04). Searching through transcripts is a common use case described by every reporter (4:05). Usually, built-in search functions from browsers or word processors are used to quickly jump to certain keywords, thus requiring sufficiently accurate transcripts.

Step 5: Organize

The interviewed journalists use a variety of different techniques for finding an order for the selected sound bites and developing a consistent story line. For example, R3 usually experiments with different successions of a story by simultaneously moving pieces around in a DAW and in a script. R2 also uses a script to gradually conceptualize a story line but has also experimented with writing down sound bites on note cards and shifting them around on a desk. Similarly, several participants mentioned the technique of putting sticky notes on a wall to sort material (E1, E2, R1, R2). R1 uses sticky notes to model a timeline of events relevant for her history reporting, complemented by a set of note cards with biographical information about important people. Thus, the tool could benefit from a clutter-free graphical canvas, allowing users to freely position sound bites and notes (5:03–5:09).

Step 6: Script

Consistent with prior literature, every interviewed journalist works with scripts (although the degree of detail varies; 6:04). In some cases, DAW production does not even begin before everyone involved has agreed on a script version (E2). R3 explains that recording narration also usually takes place only after the final script has been accepted. Both for recording and for production, the script then serves as an instruction manual (E1, E2, R2; 6:02).

As mentioned earlier, scripts are frequently used to iteratively conceptualize a story over time. Although the de-facto standard in German public radio is MS Word (E1, R2, R3), a script editor allowing users to shift pieces around more quickly and easily could be useful in that process (6:01). While only R2 has experience with linked transcripts, literature indicates that propagating such script manipulations to the underlying audio potentially facilitates and simplifies editing workflows (6:03).

Several reasons suggest that the script should be printable (6:05). First, noisy computers might not be allowed in a recording studio (E2) and some speakers generally prefer to read from a piece of paper instead of holding a tablet or laptop (R2). Second, E2 explains that professional speakers sometimes put hand-written intonation markers onto the printed script. And third, some editors prefer paper over screens for proofreading a script, according to E1 and R2. E2 notes that his team requires its reporters to provide certain metadata on the first page of the script, such as a teaser text, a headline, a kicker, or an anchor intro (6:07).

As expected, the interviewed journalists largely agree on the elements that might be present in a script: accurately or roughly scripted narration (6:08, 6:09), easily recognizable sound bites (6:10, 6:11), flexible voice-over translations (6:12, 6:13), music (6:14), ambi (6:15), rudimentary speaker instructions (6:16), and technical instructions (6:17). A difficulty mentioned by several participants is correctly estimating the duration based on a script. Character-based estimation works comparatively well, but every now and then, the final audio still turns out too long, requiring subsequent trimming (R2, R3, E1, E2). In particular, layered translation voice-overs are difficult to estimate based on the character count (R2). Therefore, handling voice-overs in a sensible way while also considering the actual duration of sound bites, varying speaking rates, music, pauses, and so forth could be beneficial to achieve a more accurate duration estimation (6:18, 6:20). However, R1 notes that duration is less of a concern for podcasts without time constraints. Thus, the tool should allow but not enforce manual fine-tuning by shipping with sensible presets (6:19).

All participants were receptive to the idea of listening to sound bites directly from within the script, without the need to switch between an audio player and a word processor (6:21). This could potentially provide a better sense of speech and audio quality and of the speaker’s character and emotions, for example when an editor reads a reporter’s script draft for the first time (6:23, 6:24). Being able to listen to a sound bite in the script might also simplify deciding whether a certain cut is technically feasible (6:22), which is especially relevant for stories rapidly switching back and forth between narration and sound bites (6:25).


The journalists confirmed that most reporting in German public radio is usually done by freelancers who need to work together with in-house editors (E2). As suggested before, the script is a crucial interface for collaboration between the two (6:26, 6:27). Currently, this mainly takes place in the form of written comments and/or text corrections within the script, often using MS Word’s ‘track changes’ mode (6:29). E2 misses the possibility within his organization to use a real-time collaborative word processor (6:28). Solutions such as Google Drive are not deemed secure enough by the IT department (although he notes that it is used by many employees anyway).

Finally, some aspects regarding the script style need to be considered. R2 explains that different stations have different guidelines for how they need freelance reporters to format their scripts. Some, like the one that E2 works for, try to achieve uniform script styles (6:34) to simplify character-based duration estimation, while others do not care that much (E1). Either way, scripts produced by the proposed tool should be flexible enough to accommodate different stations’ needs (6:32). Furthermore, duration estimation should be detached from how the script is eventually formatted (6:33).

Step 7: Record

As is current practice today according to all interviewed journalists, the tool should allow for recording narration into an external DAW (7:01). However, it might also be useful to record directly into the tool itself in order to attach narrated audio to an interactive script (7:02).

Moreover, E1 and E2 mention that it is sometimes useful to listen to sound bites in a sequential order while recording the narration so that speakers can adapt their voices to what is said before or after. The tool should assist with this practice (7:03).

Step 8: Produce

As mentioned before, reporters do not always produce their stories themselves but might work together with an in-house producer. The tool should facilitate this collaboration (8:01). A rather rare workflow is that the reporter inserts the sound bites and any other additional material like music and ambi into a DAW multitrack project, already sorted according to the script (R3). Compatibility issues of different tools often inhibit this practice (R2), sometimes forcing freelance reporters to work on-site at the commissioning radio station with computers that are equipped accordingly (E2). Alternatively, some reporters simply export individual sound bites and hand over a folder of numbered audio files, which then must be matched with the sound bites denoted in the script (E1, R2). Ideally, the proposed tool is capable of converting a script into multitrack sessions for a variety of different DAWs, as proposed by R2 (8:03).

In fact, many of the tools introduced in Section 2.2 aspire to partially or fully replace complex DAWs. However, R2—who has experience with Descript’s text editing functions—is skeptical whether these tools can really be used for constructing complex audio stories. DAWs offer functions like highly layered multitrack arrangements, background noise elimination, fades, effects, and elaborate cutting tools that are not currently matched by these solutions to their full extent. Ideally, the proposed tool allows for performing the fine-cut within the tool itself (8:04); but it should not inhibit the use of professional DAWs for production if needed (8:02).

4. DESIGN & IMPLEMENTATION

Design and development were guided by the requirements introduced in Section 3. The overall target, of course, is designing a useful tool. A precondition of utility, however, is usability, that is, how well can a user make use of the tool’s functionality (Nielsen, 1993b)? Therefore, considering usability during the design phase is crucial so that usability flaws do not stand in the way when evaluating the final tool against the list of requirements.

Prior literature is concerned with a variety of different methods to test and ensure usability, including extensive quantitative approaches with many test subjects. Nielsen’s (1989a) popular discount methods, however, take a different approach: He argues that most projects can make better use of their limited time and budget by pursuing qualitative usability tests with only a small number of participants, representing the target user group as closely as possible (Nielsen, 1993b). During such user tests, every subject finds some usability flaws, which are then aggregated into a list of recommendations for interface improvements. Logically, the number of detected flaws increases with the number of test users. However, Nielsen’s research shows that most major problems are found by the first five participants (Nielsen, 2009). As usability is only considered a precondition for this study, it is reasonable to assume that Nielsen’s discount methods are sufficient to detect the most relevant problems. Another principle advocated by Nielsen (1993b) is iterative testing, starting early during the design and development process. Instead of relying solely on expert knowledge, real users should be invited to catch usability flaws before they become an expensive-to-fix problem later during development. To the extent possible, iteration was incorporated into the design process.

The design phase started with drafting interface ideas on paper, followed by initial fragmentary digital mockups. This way, the overall structure and guiding principles of the tool were gradually carved out. In accordance with Nielsen’s recommendation to test as early as possible, these initial screens were discussed with a fellow student specializing in interface design and usability engineering. After fixing several issues, I presented the preliminary mockups to R2, who has experience with linked transcript applications. As most functionality was not visible and testable yet, I filled the gaps with explanations about where the design process was likely to be headed. This allowed for collecting first feedback and validated the overall concept of the envisioned interface, while again revealing several usability flaws. After that, screens for the entire app were designed (and redesigned) over the course of three weeks. Using the design application InVision (n.d.), I transformed the screens into a partially clickable browser prototype.


It was semi-functional in the sense that it allowed for clicking most links and buttons, accessing all screens of the tool, and testing every important feature. However, content used within the tool was added in advance, and some parts of the app required a fixed sequence of actions.

The prototype was used for extensive task-related user tests with four out of the five journalists introduced in Section 3. (Unfortunately, E2 was only available for one interview.) The interviews were preceded by two pretests with the aforementioned designer and with another fellow student who works as a (non-audio) freelance journalist. On average, the tests lasted 80 minutes. InVision’s (n.d.) live sharing feature allowed me to point to elements in the prototype with my mouse while observing every action made by the testers. This way, I guided the participants through the entire prototype, regularly asking them to perform certain tasks. As some features were not fully functional, I served as “living documentation” (Riihiaho, 2002), providing missing information on request.

Analysis of the audio and screen recordings of the sessions culminated in a list of usability flaws, most of which could be solved in a subsequent design iteration. Remaining issues are discussed in Section 5.

4.1 Design

In this section, the final interface is introduced. The semi-functional prototype is available at https://goo.gl/SQj6QB. The tool’s core functionality is fully outlined within the presented screens. However, some elements hint at certain additional features that are not visually designed yet. I will provide explanations for these “hidden” features; working out the details, however, is left for future studies.

For easier understanding, the screens depicted throughout this section draw on fully fictional example content for an audio biography about a person called John Smith.

The application is organized into five distinct modules, which can be accessed via a permanently visible menu bar on the left side: ‘Projects’, ‘Files’, ‘Bites’, ‘Boards’, and ‘Scripts’ (Figure 2). This separation is designed to largely correspond with the workflow introduced in Section 2.1.

Projects

A project is a collection of work material. Namely, it contains related audio files, sound bites, boards, and scripts. A project might be suited for a single audio story or for an entire serial podcast with multiple episodes drawing from the same body of interview material. Appendix Figure 6 shows the project administration screen, which is the first screen users see after logging in. It allows for adding, removing, and renaming projects. Deleted projects are moved to trash and can easily be recovered.

Files

After entering a project, users can upload their raw recordings into the ‘Files’ module. Each file can optionally be annotated with additional metadata, including a title, speakers, responsible reporters, labels, a recording date and time, a summary, and notes (Appendix Figure 7).

Figure 2. Transcript view; (a) sound bite list, (b) sound bite details


On the file administration screen (Appendix Figure 6), this metadata is shown in a sortable and searchable table. Play controls within the table allow for pre-listening to a file before entering it. Like on the project administration screen, files can be added, renamed, and removed. With respect to the workflow model presented in Figure 1, the file administration screen facilitates step 2 (‘Collect’).

By clicking a name in the table, the user enters the corresponding file. As shown on the screens in Figure 2, the right side is reserved for a transcript of the recording and an audio player. By default, the transcript is generated with automated speech recognition. The software also tries to differentiate speakers so that the transcript can be segmented into distinct paragraphs. In the left column of the transcript, the user can then select a speaker for each paragraph from the file’s list of speakers (cf. Appendix Figure 7) or manually enter a new name. Each paragraph is highlighted with a speaker-specific color, both within the transcript and underneath the audio player’s waveform, facilitating easier orientation and navigation. As in previous tools, each word in the transcript is linked to a certain timecode within the audio. Thus, spoken words can be dynamically highlighted while audio is playing so that the reader can easily follow the transcript while listening. Moreover, double-clicking on a word plays the audio from that point on. The transcript allows for standard text editing so that the user can manually correct errors in the transcript (without breaking the connection between text and audio).
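The dynamic word highlighting and double-click navigation described above both reduce to a lookup between playback time and word index. A rough sketch of that lookup, assuming each transcript word is stored as a (text, start, end) triple as in the alignment example in Section 2.2:

    import bisect

    def word_at_time(words, t):
        """Index of the word spoken at playback time t (seconds), or None during a pause."""
        starts = [start for _, start, _ in words]
        i = bisect.bisect_right(starts, t) - 1
        if i >= 0 and words[i][1] <= t < words[i][2]:
            return i
        return None

    def playback_start_for_word(words, i):
        """Double-clicking word i resumes playback from its start time."""
        return words[i][1]

    words = [("you", 12.10, 12.25), ("get", 12.25, 12.40), ("forty", 12.40, 12.80)]
    print(word_at_time(words, 12.5))          # -> 2 ("forty" would be highlighted)
    print(playback_start_for_word(words, 2))  # -> 12.4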

If the user prefers manual transcription, she can type directly into the transcript while listening to the recording. The software automatically tries to link the written words to the spoken audio. As is usual in other transcription tools, playback speed can be adjusted in the bottom left corner of the audio player. Pausing playback by pressing the play/pause button or by hitting a keyboard shortcut automatically rewinds the audio by three seconds, reducing the need for tedious waveform navigation. The transcription features facilitate workflow step 3 (‘Transcribe’).

While reviewing a recording, the user can select the best parts (step 4: ‘Select’), either by selecting words in the transcript or by selecting a section within the audio waveform. Text selections are automatically mapped to audio selections, and vice versa. With a pop-up button, the user can store the selection as a new sound bite for later use. All sound bites are depicted within the transcript with subtle text underlines and a grey marker on the right, next to the respective transcript section. Moreover, a list of all sound bites is displayed on the left side of the screen as indicated on screen (a) in Figure 2.

Clicking on a sound bite triggers Figure 2’s screen (b). Here, the user can manage metadata about each sound bite, including a title, tags, a quality rating, a translation for foreign-language sound bites, and notes. The duration and a list of speakers are automatically determined based on the audio and the transcript. On the right side, the user can alter the selection by dragging and dropping the black handles on the sides of the yellow text selection or by adjusting the audio selection within the waveform. Again, text and audio selection are automatically kept in sync by default. However, it is reasonable to assume that automatic selection mapping might not always be perfectly accurate.


Thus, the user can link and unlink the two by toggling the chain symbol in the bottom right corner of the waveform selection. Whenever this symbol is greyed out, changing the text selection does not affect the audio selection, and vice versa. This allows users to fine-tune the selections independently if the software makes a mistake or produces inaccuracies.
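The chain toggle can be modeled as a simple flag on the sound bite: while linked, editing the text selection recomputes the audio selection (and vice versa); while unlinked, both selections are kept independently. A hedged sketch of this behavior (not the prototype’s actual code):

    from dataclasses import dataclass

    @dataclass
    class BiteSelection:
        first_word: int      # index of the first selected transcript word
        last_word: int       # index of the last selected transcript word
        audio_start: float   # selected audio region, in seconds
        audio_end: float
        linked: bool = True  # state of the chain toggle

        def set_text_selection(self, words, first, last):
            self.first_word, self.last_word = first, last
            if self.linked:
                # propagate the text selection to the audio selection
                self.audio_start, self.audio_end = words[first][1], words[last][2]

        def set_audio_selection(self, words, start, end):
            self.audio_start, self.audio_end = start, end
            if self.linked:
                # snap the text selection to the words overlapping the audio region
                overlapping = [i for i, (_, s, e) in enumerate(words) if e > start and s < end]
                if overlapping:
                    self.first_word, self.last_word = overlapping[0], overlapping[-1]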

Bites

On the previously described sound bite detail screen in the ‘Files’ module (see left column in Figure 2), the user can attach tags to a sound bite. Within the ‘Bites’ module in Figure 3, these tags are managed. The first column shows a list of all tags. By default, this list is flat and sorted in alphabetical order. However, the user can manually rearrange tags into a hierarchical taxonomy using drag and drop. By clicking on a tag, the user can view all sound bites attached to that tag in the second column. These sound bites might come from one or several files. Selecting a sound bite shows the familiar editable sound bite form in the third column, containing its metadata as well as a preview of the respective transcript section (highlighted yellow in Figure 3). The user can listen to the sound bite with the play/pause button. Again, double-clicking a word in the transcript triggers playback from that point on.

By editing the list of tags in the metadata form (third column) or by dragging and dropping sound bites into certain tags, the user can dynamically group and organize (step 5) sound bites from various recordings to gradually develop a better sense of the subject matter and to get an overview of the available material.

Using the search bar in the first column, users can find specific sound bites based on the transcript and metadata.
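One plausible way to store the flat-by-default, drag-and-drop taxonomy is to give each tag an optional parent and let sound bites reference any number of tags; the sketch below is an assumption about the data model, not code taken from the prototype:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Tag:
        name: str
        parent: Optional[str] = None  # None = top level; set when the user nests tags via drag and drop

    @dataclass
    class Bite:
        title: str
        tags: List[str] = field(default_factory=list)

    def bites_for_tag(bites: List[Bite], tag_name: str) -> List[Bite]:
        """Clicking a tag lists every sound bite carrying it, possibly from several files."""
        return [b for b in bites if tag_name in b.tags]

    bites = [
        Bite("John on his childhood", ["biography", "childhood"]),
        Bite("Sister remembers the move", ["family"]),
        Bite("John's first radio", ["childhood"]),
    ]
    print([b.title for b in bites_for_tag(bites, "childhood")])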

Boards

Yet another, more associative way to facilitate step 5 (‘Organize’) is the ‘Boards’ module. Inspired by techniques based on physical note cards or sticky notes, a board allows users to freely place sound bites and notes on a visual canvas. The left column in Figure 4 shows the tag taxonomy and all associated sound bites in a combined list. Selecting a sound bite shows a preview in the lower left-hand corner. This preview card can then be dragged and dropped onto the board on the right side. Again, different speakers are denoted by different colors. By clicking anywhere in the board, the user can add a note card with a title and a text to the board. For example, the sound bites in Figure 4 were visually clustered according to different themes, which are briefly described in the corresponding note cards. Notes and sound bites are resizable; note colors can be changed manually as shown in the screen’s top-right corner. As with a geographic map, the user can pan and zoom the visible section of large boards with more cards than fit on a screen.

There are countless conceivable use cases for boards. As mentioned, the user could cluster sound bites by different themes. Moreover, she could linearly align sound bites and notes to experiment with different storylines. She could also organize information about relevant people with note cards and group them together into families. Lastly, she could build a timeline with note cards to visualize a sequence of events, potentially backed up with sound bites.


Because of these different use cases, the tool allows for creating multiple boards within a single project. Appendix Figure 6 shows the respective board administration screen.

Scripts

The final pre-production step (‘Script’) is performed in the ‘Scripts’ module depicted in Figure 5. In an interactive editor, users can arrange sound bites, narration, and instructions as distinct cards into a linear sequence to work out the detailed story line. They can be arranged as needed with simple drag and drop operations. For structuring a script, it can be segmented by using customizable scene headlines (‘Opening Scene’ in Figure 5).

The currently selected first card on the right-hand side in Figure 5 represents a narration block. Here, the user can select a speaker from a list of previously used names or manually enter a new one. On the right side, she can freely write the narrated text and use basic formatting options (bold, italic) on it—for example, to communicate to a speaker which words to stress.

The duration of the entire script as visible in the footer area of the screen (‘Estimated duration’ in Figure 5) is calculated by adding up the anticipated durations of all cards. In the first narration card, this duration is displayed in a text field in its lower right corner. By default, the duration of narrated text is estimated based on the character count. However, the user can adjust the duration by manually typing a different value into the text field. Alternatively, she can use the stopwatch button to time how long it takes her to read the narration, thus accounting for varying speaking rates.

To reset manual changes and re-enable automatic estimation, the tilde button (~) can be used.
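In other words, each card contributes either a manually entered (or stopped) duration or a default estimate derived from its character count, and the footer value is the sum over all cards. A minimal sketch; the reading rate used here is an assumed preset, not a value specified by the design:

    CHARS_PER_SECOND = 15.0  # assumed preset; a station could tune this to its house speaking rate

    def narration_duration(text, manual_override=None):
        """Duration of a narration card: a manual/stopwatch value wins, else a character-based estimate."""
        if manual_override is not None:
            return manual_override
        return len(text) / CHARS_PER_SECOND

    def script_duration(card_durations):
        """'Estimated duration' in the script footer is the sum over all cards."""
        return sum(card_durations)

    narration = "On a warm July evening, John Smith walked into the studio for the last time."
    print(round(narration_duration(narration), 1))       # -> 5.1 (76 characters at the assumed rate)
    print(round(narration_duration(narration, 6.8), 1))  # -> 6.8 (stopwatch override wins)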

The user is given the option to record a narration card directly into the script editor. By clicking the ‘REC’ button, the tool starts listening to microphone input. After the recording has ended, it is attached to the card and can be listened to with a play button to the left of the card. This can be used for test recordings with low-quality microphones or even for final recording in the studio.

As in the ‘Boards’ module, the left column of the screen shows a searchable list of tags and sound bites. These sound bites can be added to the script with a drag and drop operation. One such sound bite is the second card in Figure 5, highlighted in red. As before, different speakers are represented by different colors to allow for faster navigation. With the circular play button to the left of the card, the user can listen to the sound bite directly from within the script. Again, playback can be navigated via double-clicking certain words in the transcript excerpt. As sound bites are usually retrieved from the raw material early in the research process, the exact text and audio selection might not be suitable in the script as is. Therefore, users can grab the black handles at the beginning or end of the sound bite’s transcript excerpt to adjust the selection. This way, one could, for example, remove the first sentence of a sound bite from the script in case it doesn’t fit into the narrative. The duration of the red sound bite is not visible in Figure 5 because the card is not currently selected. However, selecting it also shows a text field, whose value is retrieved from the actual duration of the underlying audio section.


The fourth card in Figure 5 shows a foreign-language sound bite. The two green parts represent the original voice—in this case, French. The text in between is a translation, which has yet to be recorded. By default, the beginning of the original audio is played for a few seconds; then the translation is played as a voice-over; and, finally, a few seconds from the end of the original sound bite are played again. As different journalists have different voice-over styles, the exact length of the original sections can be adjusted with the black handles. This allows, for example, playing the entire original part first and then the translation afterwards. Likewise, the user could decide to remove the first sentence of the original voice with the very same approach. The translated text is retrieved from the sound bite’s metadata but can be edited within the card as needed. The duration of a translated sound bite is calculated by adding up the duration of the played sections of the original voice and the estimated duration of the translation based on the character count. Again, the user can also time the translation with the stopwatch or manually edit the duration field.
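Concretely, the duration of such a voice-over card is the sum of the original-language sections that remain audible and the estimated (or manually set) duration of the translation. An illustrative calculation under the same assumed reading rate as above:

    CHARS_PER_SECOND = 15.0  # same assumed preset as for narration cards

    def voiceover_duration(original_lead_in, original_tail, translation_text, manual_override=None):
        """Played original sections plus the estimated (or overridden) translation duration."""
        if manual_override is not None:
            translation = manual_override
        else:
            translation = len(translation_text) / CHARS_PER_SECOND
        return original_lead_in + original_tail + translation

    # French sound bite: 4 s of original at the start, 3 s at the end, translation read over the middle.
    print(voiceover_duration(4.0, 3.0, "He told me he had never left the town before."))  # -> 10.0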

Finally, the third card with italic text represents an instruction. Here, the user can place any missing information about ambi, music, technical details, pronunciation, and so forth. By default, the duration of an instruction card is set to 00:00. However, an instruction might cause the entire story to take longer, for example when it requests playing music for 20 seconds. In such a case, the user can manually enter the anticipated duration into a text field.

As indicated with the overlay panel to the right of the first card in Figure 5, every card can be deleted or duplicated. Duplication might, for example, be useful when splitting a sound bite in half and inserting narration in between.

While facilitating step 6 (‘Script’), the ‘Scripts’ module also facilitates step 7 (‘Record’). In particular, the script can be used as an instruction manual to read from during a recording session. In between the narrated sections, the speaker can listen to the sound bites directly from the script to get a better sense of the mood and to pick up an interviewee’s way of speaking. As mentioned, the speaker (or a producer) could potentially record narration directly into the tool.

The bottom right corner in Figure 5 displays two more buttons, whose functionality is not fully designed yet. First, the ‘Share’ button provides the user with a secret link that she can send to her colleagues. The underlying web page would show the interactive script, similar to the script editor. On this page, an in-house radio editor could read and listen to a draft sent in by a freelance reporter; commenting features would allow her to provide direct feedback.

Second, the ‘Export’ button should allow the user to convert the script into different file formats. This might include a printable PDF, an editable word processor file, an interactive offline version of the script, or a multitrack project for different DAWs with all the sound bites and recorded narration bits already aligned in the correct order as specified in the script. Such a DAW exporting feature would facilitate step 8 (‘Produce’) by providing the producer with an editable first draft of the multitrack project, which she can then use for further processing. When exporting a PDF script, the user should be able to select or create different templates to fulfill different stations’ formatting guidelines.

A project might contain more than one script, for example when producing several podcast episodes based on the same body of material or when producing different versions of the same story for different clients. Therefore, there is another administration screen listing all the scripts (see Appendix Figure 6). Furthermore, each script can be complemented with additional metadata, including a title, a time limit, a delivery date, a description, client information, and complementary content for online publishing (see Appendix Figure 8). Although not fully designed yet, it might be useful to let users add custom metadata fields, depending on the commissioning station’s information needs.
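
Returning to the DAW export mentioned above, one conceivable preparation step is to flatten the ordered cards into clips with computed start offsets, which an exporter could then write into a DAW-specific session format. The structure and names below are purely illustrative and do not reflect the actual implementation.

def build_export_timeline(cards):
    """Flatten script cards into (audio, start, duration) clips in script order."""
    timeline, playhead = [], 0.0
    for card in cards:
        if card.get("audio"):                      # sound bites and recorded narration
            timeline.append({"audio": card["audio"],
                             "start": playhead,
                             "duration": card["duration"]})
        playhead += card["duration"]               # narration and instruction cards shift later clips
    return timeline

cards = [{"type": "narration", "duration": 12.0},
         {"type": "bite", "audio": "interview_01.wav", "duration": 34.5},
         {"type": "instruction", "duration": 20.0}]   # e.g. 'play music for 20 seconds'
print(build_export_timeline(cards))                   # the sound bite starts at 12.0 seconds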

4.2 Implementation

At the time of writing, the frontend of the proposed solution is under active development on https://goo.gl/nGshte. However, full technical implementation is beyond the scope of this study. Thus, this section demonstrates practical feasibility by providing pointers and recommendations for realization.

The tool is designed as an in-browser web application. On the backend, the web framework Django (Django Software Foundation, n.d.) on Python can be used to handle server-side computations. Data can be stored in an object-relational PostgreSQL database (The PostgreSQL Global Development Group, n.d.). All search-related backend tasks can be handled by an Elasticsearch instance (Elastic, n.d.).

Many major software companies offer APIs for automatic speech recognition. Further testing is necessary to determine the service best suited for the present application. However, IBM’s (n.d.) Watson Speech to Text API appears to be a good fit as it delivers timecodes for each word in the transcript out of the box and has proven reliable in similar studies (Shin et al., 2016). Whenever the user manually edits a transcript, the altered text needs to be re-linked to the audio. This can, for example, be done with the Python/C library Aeneas (ReadBeyond, 2017). Different speakers can be recognized with a speaker diarization library such as the Python package Sidekit (Larcher, Lee, & Meignier, 2017). For further algorithmic details regarding speech recognition and alignment, I refer to the work of Shin et al. (2016).
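
As an illustration of this re-linking step, the sketch below follows the task-based usage pattern documented for Aeneas; the file paths are placeholders, and the exact configuration options should be verified against the library’s documentation.

from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Align an edited plain-text transcript to its audio file (placeholder paths).
config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config)
task.audio_file_path_absolute = u"/data/interview_01.wav"
task.text_file_path_absolute = u"/data/interview_01_edited.txt"
task.sync_map_file_path_absolute = u"/data/interview_01_syncmap.json"

ExecuteTask(task).execute()     # compute the forced alignment
task.output_sync_map_file()     # write timecodes for each transcript fragment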

The frontend is currently being built using the declarative, component-based JavaScript library React (Facebook, n.d.), where application state is managed with Redux (Abramov, n.d.). The waveform audio player can be realized with D3.js (Bostock, n.d.) and Wave.js (Saiz, n.d.).

For communication between the backend and frontend, the Django REST Framework (n.d.) can be used to serve a JSON-based API.
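
A minimal sketch of such an endpoint, assuming a hypothetical SoundBite model and app, could look as follows:

from rest_framework import routers, serializers, viewsets
from bites.models import SoundBite   # hypothetical Django app and model

class SoundBiteSerializer(serializers.ModelSerializer):
    class Meta:
        model = SoundBite
        fields = ("id", "title", "transcript", "start_time", "end_time", "rating")

class SoundBiteViewSet(viewsets.ModelViewSet):
    queryset = SoundBite.objects.all()
    serializer_class = SoundBiteSerializer

# Registered routes are exposed as a JSON API, e.g. under /api/soundbites/.
router = routers.DefaultRouter()
router.register(r"soundbites", SoundBiteViewSet)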

5. EVALUATION

From July 5th to 11th, 2017, the final iteration of the prototype was presented to the four journalists who participated in the testing phase (see Section 4). Because Nielsen and Landauer (1993) note that fixing usability issues might, in turn, introduce new problems, I first walked the participants through the tool again and explained the changes compared to the previous test. However, no notable new problems were discovered.

Overall, the tool was received extremely positively by all participants. E1 thinks that “this is the future!” R2 is “impressed” and “would like to use all this tomorrow! This is very cool. I do not see why I would not use it.” Similarly, R1 would also “like it [the tool] to be ready tomorrow so that I can use it”. She further explains the advantages of the “striking idea” to link audio and transcript throughout the pre-production phase:

That is the great thing about your tool: always knowing which document a quote comes from. […] This is very important for me, because you often forget to write that down. You never lose this connection with [this new] workflow. […] Permanent reference to the source material is great for that! (R1)

With respect to the tool’s transcription and selection features, E1 says:

I am beginning to understand what you are doing there. And I have to say: If this tool becomes available, […] it would be great! That is exactly what my reporters spend three days with. Transcribing and then extracting the sound bites is an incredible amount of work. It would be really cool if there was something to help with that. (E1)

To get further, more specific feedback on individual features and aspects, I asked the journalists to evaluate the tool based on the requirements (see Section 3). Due to the length of the list and time constraints during the interview sessions, the requirements had to be distributed among the participants in a way that resulted in at least two opinions about each requirement.

As shown in Appendix Table 2, participants were first asked to decide if the respective requirement is, indeed, important on a scale from not important (-) via moderately important (○) to important (+). As the list was aggregated based on hours of interview material, this allowed for detecting any possible invalid assumptions and interpretations made when formulating the requirements. Secondly, the journalists rated how well the requirement is fulfilled by the proposed solution on a scale from poor (-) via acceptable (○) to good (+). As the three-point scales imply, the evaluation was explicitly not aimed at quantitative feedback. Instead, journalists were encouraged to qualitatively elaborate on their decisions. As interviewer, I remained passive to the extent possible and only assisted with explaining the requirements and technical details unknown to the journalists.

A brief glance at Appendix Table 2 shows that the list is dominated by plus symbols (+), suggesting that the vast majority of requirements are both relevant and well fulfilled by the tool. In particular, the tool is generally deemed to facilitate the overall pre-production workflow (0:02), is easy to learn and use (0:01), provides a better overview of long and complex stories (0:05), mitigates the perceived raw material overload (0:10), and simplifies the selection of sound bites (4:01) as well as the process of finding an order for them (5:01).

Many deficits relate to (a lack of) collaboration features (0:22). Although these were considered during the design phase and future integration should be feasible within the provided framework, they are not fully designed yet. For example, fine-grained access control is currently missing (0:23, 6:30, 6:31), and the script lacks functions like commenting (6:31), collaborative writing in real-time (6:28), and a mode to track changes (6:29). Further, a read-only version of the interactive script is only hinted at with the ‘Share’ button in Figure 5 (6:30).

Similarly, script exporting features are described in Section 4.1 but are not fully designed yet. They are, however, a precondition for allowing freelance reporters, editors, producers, and speakers to work together seamlessly (6:27, 7:01, 8:01, 8:02, 8:03).

Security is another valid concern: As the tool is designed as a web application, the user does not have direct control over the cloud-stored data. While all consulted journalists would appreciate collaboration features and location-independent access (0:15), they criticize the involved security and privacy risks for themselves and for their sources (0:25) as well as uncertain data integrity (0:24). Therefore, it is doubtful if the tool would be secure enough for large radio stations (0:26). R1 proposes a white label solution where stations provide their employees with a customized instance of the tool on their own servers, which might mitigate some of the risks. Even more secure and, in some cases, more convenient would be a local offline version as proposed by R1 and R2.

Regarding step 5 (‘Organize’), R1 and R2 ask for a way to filter and sort sound bites according to metadata, e.g. to retrieve only those with the highest quality rating (5:02). The ‘Boards’ module was received generally well, although R3 notes that she would not need it for her work. R1, on the other hand, confirms that boards are well-suited to convert her physical wall of note cards into a digital version, indicating sufficient flexibility of the feature set (5:06). However, she is not fully satisfied with how cards and notes can be related to each other on a board (currently by proximity only). She would prefer connecting lines to denote relationships such as ‘A confirms B’, ‘X is related to Y’, or ‘C is the grandson of D’. Boards are rated to be comparably clutter-free (5:07); however, they are thought to profit from a large screen (5:08, 5:09). With respect to the ‘Bites’ module, R2 asked for a possibility to store notes within tags, alongside the sound bites (0:19).

Feedback for the script editor was generally positive. Improved duration estimation was received particularly well (6:18). The journalists also praised the tight connection of audio and text, allowing them to play audio directly from within the script to better assess audio quality, emotions, character, and feasibility of cuts (6:21–6:24). However, there are certain drawbacks regarding the ‘Scripts’ module, too: (1) R2 worries that the handling of voice-over translations might prove cumbersome when quickly alternating between narration and foreign-language sound bites.
