Interacting with a virtual conductor



Supervisors:

ir. Dennis Reidsma, prof. dr. ir. Anton Nijholt, dr. Zsófia Ruttkay

Pieter Bos

Student number: 0001996

Master's Thesis


Abstract

The task of having a computer conduct human musicians in a live performance has not yet been addressed extensively. A few attempts exist at letting a computer perform this task, but there is no interactive virtual conductor that can both conduct human musicians and interact with them.

The virtual conductor described in this report can conduct human musicians in a live performance interactively. The conductor can conduct 1-, 2-, 3- or 4-beat patterns. Tempo changes can be indicated in such a way that musicians can follow the change. Dynamics are supported by changing the amplitude of the conducting gestures: music that should be loud is conducted with bigger gestures and music that should be soft with smaller ones. These signals are all given before the actual change occurs, so that the musicians are prepared for the change in tempo or dynamics. Accents are indicated by conducting the preparation of a beat bigger.

The conductor listens to the musicians as they play in order to follow their performance. He can track the beat of the musicians with a beat-tracker and can read along with the score as the musicians play. A chord detector has also been designed and implemented, to allow a future version of the conductor to detect wrong notes.

This information is used to interact with the musicians: if the musicians start playing slower or faster when they should not be, the conductor will notice this and try to correct this. First, the conductor will follow the musicians so they do not lose track, then the conductor will lead the musicians back to the original tempo.

The conductor has been evaluated several times with groups of human musicians. The musicians could follow the tempo and dynamic changes of the conductor reasonably well. The conductor could interact successfully with the musicians, correcting their tempo if they played too fast or too slow. The musicians enjoyed playing with the virtual conductor and could see uses for it, especially if the conductor is further extended.

It can be concluded that a virtual conductor has been designed and implemented that can interact with musicians in a live music performance. This conductor is only a basic version of a conductor and can be extended in almost all aspects. So, while a basic version exists, there is still a lot left for future research on this subject. Potential applications of the current and future virtual conductor include a rehearsal conductor for when a human conductor is not available, or a conductor for studying orchestral parts at home together with a recording or MIDI version of the rest of the orchestra, including a conductor.


Summary

Conducting musicians has so far been a task reserved for humans. A few earlier attempts have been made to have a computer perform this task, but there is no interactive virtual conductor that can conduct human musicians and also interact with them.

The virtual conductor described in this thesis can do so. This conductor can beat 1, 2, 3 and 4 beats per measure. Tempo changes are indicated in such a way that the musicians can follow them. Dynamics are indicated by conducting bigger or smaller, and dynamic changes are indicated before they actually take effect, so that the musicians can react to them in time. Accents are indicated in the same way.

The virtual conductor also listens to the music made by the musicians. With a tempo detector the conductor can keep track of the tempo of the musicians, like a person tapping along with music. In addition, the conductor can read along with the score while the musicians play. A chord detector has been built that will enable future versions of the conductor to detect wrong notes.

With this information the conductor can conduct interactively. If the musicians start playing at a different tempo than the conductor is conducting, the conductor will notice this. The conductor will then adjust his tempo and follow the musicians, so that the musicians do not lose track. After this, the conductor leads the musicians back to the original tempo, in a way that the musicians can follow.

The conductor has been evaluated several times with human musicians. The musicians could follow the tempo and dynamics indications of the conductor. Even when the indications came at unexpected moments, the musicians could follow them after some practice. The musicians enjoyed making music with the conductor and saw useful applications for it, for example as a rehearsal coach for rhythmically difficult passages for small ensembles, or for playing along with a recording.

It can be concluded that a virtual conductor has been researched and implemented that can interactively conduct human musicians. This conductor is, however, only a basic conductor and can be extended in almost every respect; good candidates for extension are more interaction, for example on dynamics, or an expressive conductor. Given the complexity of the task of conducting, the level of a human conductor will not be reached very soon and much remains to be researched. Possible applications of the current and future virtual conductor include a rehearsal conductor for when a human conductor is not available, or a conductor to play along with at home together with a recording or MIDI file of the rest of the orchestra, including a conductor.


Acknowledgements

I would like to thank Daphne Wassink for giving advice about conducting throughout my work on this thesis; my brother Rik for helping me design a more suitable avatar for the virtual conductor; my supervisors for always giving useful feedback quickly; Harm Witteveen, conductor of the CHN orkest, and the musicians of the CHN orkest who participated in the demonstration at the CHN; and finally, all the people who have helped during the different evaluations:


1 Introduction

Recordings of orchestral music are said to be the interpretation of the conductor in front of the ensemble. A human conductor uses words, gestures, gaze, head movements and facial expressions to make musicians play together in the right tempo, phrasing, style and dynamics, according to his interpretation of the music. He or she also interacts with musicians: the musicians react to the gestures of the conductor, and the conductor in turn reacts to the music played by the musicians. The conductor not only leads the musicians through a performance, but should also inspire them, tutor them and interact with them to create a good music performance together. This task asks for different approaches in different situations: playing a piece of music for the first time with amateur musicians is a very different task from a performance with a professional orchestra. Different kinds of music require different styles of conducting: romantic music requires a different approach than rhythmically complex modern music. How exactly a conductor does this differs from person to person, and several styles of conducting can be identified.

Virtual humans have been performing a wide range of tasks: several virtual humans or embodied conversational agents exist that can hold a conversation, dance to music or show expressions corresponding with the expression in music. At the Human Media Interaction group several Virtual Humans are being researched, including a Virtual Dancer and a virtual fitness trainer. So far, however, no virtual humans are known that can conduct musicians interactively in a live music performance. This thesis discusses a virtual conductor that can perform this task.


2 Related Work

To our knowledge, our project is the first interactive virtual conductor. However, several other virtual conductor projects have been found that synthesize conducting movements. [47] describes a virtual conductor that learns from real conductors. This conductor can learn conducting gestures with a kernel-based hidden Markov model (KHMM). It is used as an example to show that KHMMs can be used to synthesize gestures. The input for learning these movements is a combination of movements from a real conductor and a synchronized recording of music. Loudness, pitch and beat are used to describe the music; the positions and movements of several joints of the conductor describe the movements. The model is then trained with this data and the result is a conductor who can conduct similar music, that is, similar in time and tempo. Basic movements are used and style variations are shown. This conductor does not have automatic tempo tracking: the music is analyzed semi-automatically, using the movements from the real conductor to track beats. This conductor cannot interact with musicians; it can only synthesize an animation from an annotated audio file. It is suggested that tempo changes could be allowed by blending multiple trained models, but this has not been done.

In [40] conductor movements are synthesized to demonstrate STEP, a VRML scripting language. Conducting movements are specified using a high-level scripting language; however, nothing but the movements has been implemented.

A movie file has been found of the Sony Qrio robot conducting the first movement of Beethoven's Fifth Symphony with the Tokyo Symphony Orchestra. It is not known how this robot does this.

In the 'help island' of the online world Second Life a conductor is shown with a group of virtual barbershop singers. A screenshot is shown in Figure 2.1. The conductor can perform two different conducting patterns more or less in time with one piece of music. The parts can be 'sung' by other players by clicking on the music stands. The parts will be played back synchronized. The conducting movements are for decorative purposes only; the only aspect of the performance that the conductor can change is moving to the next section in the music by clicking on the score. Although images of real sheet music are presented, the players do not have any control over the performance. Whether this small demo has been extended by Second Life players is not known, due to the large size of this online world.

2.1 Conductor Following Systems

While no interactive virtual conductors have been found, there are several systems that do exactly the opposite of conducting: following a human conductor. These systems are called conductor following systems. Such a system consists of some way to measure part of the movements of a conductor, gesture recognition to extract information from these movements and often also a virtual orchestra, of which the performance can be altered by conducting.

In [26] several of these systems are summarized, including their possibilities and limitations. The conclusion is that following a conductor is very well possible with the current state of technology, except for tracking the gaze of a conductor. These conductor following algorithms take different approaches to what they track. Many systems use some sort of sensor a conductor has to wear. This can be an electronic conducting baton, as in [27] and [23], but also a jacket measuring the conductor's movements, as in [31]. [32] describes a system that follows a baton with a camera, and [25] describes a camera-based system that requires the conductor to wear only a colored glove. This system is available for anyone to download. The gesture recognition in these systems varies as well. The Vienna virtual orchestra in [3], for example, recognizes only up and down motions as beats. As soon as the direction of the baton is reversed, a beat is registered. Bigger movements or directing towards sections makes the whole orchestra or just one section play louder. This is done to allow the system to be used by non-musicians. This is later extended in [28] with the possibility to detect real conducting gestures should an experienced conductor use their system. Other systems recognize more complex conducting gestures. In [23] and [29] neural networks are used to recognize gestures.

Figure 2.1: Existing virtual conductors: (a) Second Life conductor, (b) Sony Qrio conducting, (c) kernel-based HMM conductor

In [31] a system has been made that allows manipulation of music using several gestures and movements, allowing precise control over the music being played. An analysis of conducting gestures is given. These gestures, however, are not limited to standard conducting gestures: several other gestures have been added to manipulate the music. In [21] a modular conductor following system is described that is independent of the input method. If new input methods become available, new modules can be written to adapt the system to them.

2.2 Virtual Agents

Many examples can be found of embodied agents reacting to music. The virtual dancer, described in [38], is a system that lets a rap dancer move in time to music while interacting with a human dancer. The dancer reacts to audio input with the beat-tracker explained later in this report and uses computer vision to react to a human dancer. Other dancers like this exist, like Cindy as described in [18], or [44], which also makes use of the structure of music to plan and select its dance moves.

In [7] a system is presented that performs a traditional Chinese Lion dance in real time. The dancers can move on a rhythm, using beat detection to allow the input of drum rhythms by the user. The dancers can perform several different dances and the movements are specified using a high level language.

[30] describes Greta, an embodied conversational agent capable of showing emotions by means of facial expression. Greta's face has been linked to a system that detects emotion in music. Greta then adapts her facial expression to the music being played. Such a system could be directly used for the conductor, to show the emotion of the music being played.

2.2.1 Synthesizing Gestures

Gestures have been synthesized many times for purposes other than conducting. In the field of embodied conversational agents, gesture synthesis systems have been developed, usually to support the conversational features of lifelike agents, in particular speech. Earlier work on synthesizing conducting gestures was discussed at the start of this chapter. Gesture synthesis systems often have a high-level language to describe gestures, like MURML in [46] or STEP in [40]. Such a language might be useful for the virtual conductor.

Gestures and speech have to be coordinated, so often a planner is used for this purpose. A planner will also be needed for the conductor to determine when a beat will occur and when to gesture.

2.3 Listening to Musicians

Some form of algorithm to listen to musicians is required for the conductor, to follow the progress during conducting. Two basic types of algorithms exist for this purpose: algorithms that require a score and algorithms that do not. The algorithms that do not require a score are generally called beat-tracking or tempo-tracking algorithms. The other kind, which require a score, are called score following or score aligning algorithms. For the conductor, both types of algorithms can be of use, as long as they run in real time.

A summary will be given of some of these algorithms and their features and performance.


Figure 2.2: Division in parts of beat detectors (blocks: feature extraction, pulse induction / beat period tracking, pulse tracking / beat phase tracking; signals: audio, audio features, tempo, beats)

2.3.1 Beat Tracking Algorithms

Many beat tracking algorithms exist, but very few evaluations of these algorithms. An overview of the field is presented in [19]. This paper presents a qualitative comparison of what the authors call automatic rhythm description systems. These systems can be anything from a beat tracker, which tracks separate beats, or a tempo induction algorithm, which computes the tempo of music, to a rhythm transcription system, which transcribes rhythms from an audio file. Many algorithms are compared within one framework.

The comparison is divided into several functional units. For beat detectors these units are feature extraction, pulse induction and pulse tracking. The second unit is also often called beat period tracking, while the third is often called beat phase tracking. The units are drawn schematically in Figure 2.2. The first step, performed by all the algorithms, is the creation of feature lists from audio: the input is processed and converted into a list of features. After this, pulses are detected from these features in the pulse induction step. The pulse induction step assumes a fixed pulse period. It detects the period in which pulses occur, sometimes at different metrical levels of periodicity: a measure, a beat and the shortest occurring note value. These levels are called respectively measure, tactus and tatum. The last step, pulse tracking, does not determine the period of the pulses, but tracks the pulses themselves. It can be driven by the pulse period from the pulse induction algorithm or can be separate altogether. This division into parts is used by other authors as well [1, 18]. It will be used here to compare beat tracking algorithms. Beat tracking algorithms perform their work without prior knowledge of the piece being performed. However, they can be adapted to use such knowledge: for example, in the case of the conductor the relation between the different metrical levels is already known, which means a beat-tracker can use it instead of trying to determine it from the music.

2.3.1.1 Feature Extraction

According to [19], the following features have been used for beat detection.

Onset Time The beginnings of musical notes are widely used as features to find beats, as in [1, 11, 18]. Many algorithms have been defined to detect onsets in music.

Duration Some systems use note duration, or the time between two onsets, as a feature. [11] uses this feature.

Relative Amplitude The relative amplitude contributes to perceptual accentuation in music and as such is used as a feature.


Pitch Pitch is hardly ever used, according to [19].

Chords Two ways of using chords as a measure for beat detection are named: counting the number of simultaneous notes to identify accents, and detecting harmonic changes as evidence for a beat, as is done in [18].

Percussion Instruments If percussive instruments are present in the signal, those can be used to detect beats, as done in [18].

Frame Features Unlike the features mentioned above, some beat tracking systems, like [24, 41], use features from frames rather than discrete note onsets, note durations or chord changes. A frame is a short period of audio from which features are computed. Usually consecutive frames overlap each other. The energy of a frame, or the change in energy, can for example be used as a feature. Often the audio is split into multiple frequency bands before analysis. This is closely related to onset detection: for example, in [24] a number indicating accentuation is computed for every frame. [24] uses these features directly to calculate periodicity, while [1] uses these features for onset detection.
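As an illustration of frame features, the following minimal Python sketch computes the energy of overlapping frames and the positive change in energy between consecutive frames. It is not taken from any of the cited systems; the function names, the frame and hop sizes, and the toy input are assumptions chosen for the example, and no splitting into frequency bands is done.

```python
def frame_energies(samples, frame_size, hop_size):
    """Energy (sum of squares) of each frame; frames overlap when hop_size < frame_size."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, hop_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(x * x for x in frame))
    return energies


def energy_increase(energies):
    """Positive change in energy between consecutive frames (decreases are set to zero)."""
    return [max(0.0, b - a) for a, b in zip(energies, energies[1:])]


# Toy input (hypothetical): silence followed by a constant burst.
signal = [0.0] * 8 + [1.0] * 8
energies = frame_energies(signal, frame_size=4, hop_size=2)
accents = energy_increase(energies)
```

In this toy input the only non-zero accent values occur in the frames where the burst starts, which is the kind of accentuation signal the later pulse induction step works on.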

2.3.1.2 Pulse Induction or Beat Period Detection

For pulse induction, used methods include autocorrelation [1, 13], comb filter banks [24], inter onset interval clustering [11] and spectral product [1]. [24] notes that the performance of the different algorithms is very similar, while [1, 20] list differences, although the differences are not consistent with each other.

In the best performing algorithms, autocorrelation or a bank of comb filters is used [20]. Autocorrelation correlates the feature signal with delayed copies of itself; a peak at a given delay indicates a pulse with that period. It is computationally efficient, but does not preserve the phase of the tracked pulse, only the beat period.
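A minimal Python sketch of period detection by autocorrelation is given below. It is only an illustration under simplifying assumptions: the input is an idealized accent signal, the function names and the lag range are hypothetical, and none of the cited systems is reproduced. It also shows why the phase is lost: a shifted copy of the same pulse train yields the same period.

```python
def autocorrelation(signal, lag):
    """Correlation of the signal with a copy of itself delayed by `lag` samples."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))


def estimate_period(signal, min_lag, max_lag):
    """Lag with the highest autocorrelation, normalised by the number of overlapping samples."""
    return max(
        range(min_lag, max_lag + 1),
        key=lambda lag: autocorrelation(signal, lag) / (len(signal) - lag),
    )


# Idealized accent signals (hypothetical): one pulse every 8 frames,
# once starting at frame 0 and once starting at frame 3.
accents = [1.0 if i % 8 == 0 else 0.0 for i in range(64)]
shifted = [1.0 if i % 8 == 3 else 0.0 for i in range(64)]

period = estimate_period(accents, min_lag=2, max_lag=12)
period_shifted = estimate_period(shifted, min_lag=2, max_lag=12)
```

Both signals give the same period, so a separate step is needed to find where in the signal the beats actually fall.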

A bank of comb filters, as used by [24] and [41], on the other hand, uses many filters that each respond to periodic signals with one fixed delay. One filter is used for every tempo that is to be detected, which makes this method computationally expensive. However, the phase of the detected period can be derived easily from the filter state. In fact, it can be used in combination with autocorrelation for beat phase detection, as done in [43].
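The idea of a comb filter bank can be sketched in Python as follows. This is a simplified illustration, not the implementation of [24] or [41]: each resonator is assumed to be a plain feedback comb filter y[n] = alpha * y[n - T] + (1 - alpha) * x[n], the resonator with the highest output energy is selected, and the phase is read from its most recent outputs. The names `comb_filter`, `strongest_resonator` and the value of `alpha` are assumptions.

```python
def comb_filter(signal, delay, alpha=0.8):
    """Feedback comb filter: resonates when the input repeats every `delay` samples."""
    output = []
    for n, x in enumerate(signal):
        feedback = output[n - delay] if n >= delay else 0.0
        output.append(alpha * feedback + (1.0 - alpha) * x)
    return output


def strongest_resonator(signal, delays, alpha=0.8):
    """Return (period, phase) of the comb filter with the highest output energy."""
    best_delay, best_energy, best_output = None, -1.0, None
    for delay in delays:
        output = comb_filter(signal, delay, alpha)
        energy = sum(y * y for y in output)
        if energy > best_energy:
            best_delay, best_energy, best_output = delay, energy, output
    # The last `delay` outputs act as the filter state; the peak marks the beat phase.
    tail = best_output[-best_delay:]
    start = len(signal) - best_delay
    phase = (start + tail.index(max(tail))) % best_delay
    return best_delay, phase


# Idealized accent signal (hypothetical): one pulse every 8 frames, starting at frame 3.
accents = [1.0 if i % 8 == 3 else 0.0 for i in range(64)]
period, phase = strongest_resonator(accents, delays=range(4, 13))
```

One full filter pass per candidate delay is needed, which illustrates the computational cost, while the phase follows directly from the state of the selected filter.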

2.3.1.3 Pulse Tracking or Beat Phase Detection

Pulse tracking can be done with cross-correlation between the expected pulses and the detected pulses [1], by probabilistic modeling [24], or by deriving the phase directly from the pulse induction step [41]. Very little is known about the performance differences between these algorithms.
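The cross-correlation approach can be sketched in Python as below, assuming the beat period is already known from pulse induction. The expected pulses are modelled as an ideal pulse train with that period, and each possible phase is scored by the accent energy it collects. This is a hedged illustration with hypothetical names and toy data, not the method of [1].

```python
def estimate_phase(accents, period):
    """Phase at which an ideal pulse train with the given period best matches the accents."""
    def score(phase):
        # Cross-correlation with a pulse train: sum the accents at the expected beat times.
        return sum(accents[i] for i in range(phase, len(accents), period))
    return max(range(period), key=score)


def beat_positions(accents, period):
    """Frame indices of the tracked beats."""
    phase = estimate_phase(accents, period)
    return list(range(phase, len(accents), period))


# Toy accent signal (hypothetical): strong accents on the beat, weak off-beat accents.
accents = [1.0 if i % 8 == 3 else 0.2 if i % 8 == 5 else 0.0 for i in range(32)]
beats = beat_positions(accents, period=8)
```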

2.3.2 Performance of the Algorithms

Very few evaluations of the performance of different algorithms exist. In [33] a framework for evaluating these algorithms is proposed; no evaluation, however, is presented. An extensive quantitative comparison of 11 different tempo induction algorithms is presented in [20]. The algorithms are run on a data set of 12 hours and 36 minutes of audio. The data set consists of over 2000 short loops, 698 short ballroom dancing pieces and 465 song excerpts from 9 different genres. The song excerpts were annotated by hand by a professional musician and the ballroom dance pieces by the first author of the paper. The ground truth of the loops was known beforehand. Accuracy was measured in two ways: the number of songs for which the tempo was correctly identified within 4% accuracy, called accuracy 1, and the number of songs that were correctly identified plus the songs identified as having a tempo that is an integer multiple of the real tempo, called accuracy 2. The algorithm by Klapuri, as described in [24], was the winner, showing 85.01% on accuracy 2 and 67.29% on accuracy 1. This algorithm was also the most robust when noise was added to the audio files.
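The two accuracy measures can be made concrete with a small Python sketch. It follows the description given here rather than the evaluation code of [20]: the set of accepted integer multiples (1, 2 and 3) is an assumption, as are the function names and the example tempi, which are made up and are not results from the comparison.

```python
def within_tolerance(estimate, truth, tolerance=0.04):
    """True if the estimated tempo is within the given fraction of the true tempo."""
    return abs(estimate - truth) <= tolerance * truth


def accuracy1(estimates, truths, tolerance=0.04):
    """Percentage of songs whose tempo is identified within the tolerance."""
    hits = sum(within_tolerance(e, t, tolerance) for e, t in zip(estimates, truths))
    return 100.0 * hits / len(truths)


def accuracy2(estimates, truths, tolerance=0.04, multiples=(1, 2, 3)):
    """As accuracy 1, but integer multiples of the true tempo also count as correct."""
    hits = sum(
        any(within_tolerance(e, m * t, tolerance) for m in multiples)
        for e, t in zip(estimates, truths)
    )
    return 100.0 * hits / len(truths)


# Made-up tempi in beats per minute, for illustration only.
truths = [120, 100, 90, 60]
estimates = [121, 200, 45, 75]
acc1 = accuracy1(estimates, truths)
acc2 = accuracy2(estimates, truths)
```

In this example only the first estimate counts for accuracy 1, while the second estimate (twice the true tempo) additionally counts for accuracy 2.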

In this comparison, the framework given in [19] is used to try to compare the different parts of the algorithms, but this proved to be impossible with their set of algorithms. To make this possible, they suggest a more modular system in which multiple algorithms can be compared. A way to use multiple algorithms to track the beat is presented, showing an increase in accuracy when algorithms that perform about equally well are combined.

2.3.2.1 Description of Separate Algorithms

The winning algorithm by Klapuri [24] works by applying a bandpass filter with 36 bands to the audio signal. The audio is first split into small overlapping frames, then the bandpass filter is applied to the frames. Accentuation is detected in these bands by means of a weighted differentiation. The feature list generation of this algorithm is very similar to that of some other algorithms: when set with different parameters than Klapuri used, it is very similar to the algorithms presented in [41] and [1]. Then a bank of comb-filter resonators is used to detect periodicity in these accent bands. The periodicity in these accent bands is then combined. A discrete Fourier transform is applied to detect the period of the pulses. After this, a hidden Markov model is used to detect the tatum, tactus (beat) and measure periods from the signal. After the periods are detected, their phase is detected, again with a hidden Markov model. This is the pulse tracking part of the system.

Dixon also submitted three algorithms to the quantitative comparison of algorithms. The first two are described in [11]. He states that these two algorithms are not real time, but can be adapted to run in real time. However, from e-mail conversations with Dixon it appeared that it is not feasible to adapt his implementation and that it may be better to use a different algorithm for real time tasks. These two algorithms use an energy-based onset detector, followed by an inter-onset interval clustering algorithm. A different algorithm by the same author [13] is also compared. It uses a band filter to split the signal into 8 frequency bands, then smooths and downsamples the signal and performs autocorrelation on the bands. The peaks of the autocorrelation function of each band are combined and the best is selected as the period. This algorithm can work in real time and, while being much simpler, performs better than the other two algorithms according to [20].

The system of Alonso [1] is also presented, performing fairly well. This beat detector uses onset detection similar to the frame-based features of Klapuri, but in the frequency domain instead of the time domain and with an FIR filter to smooth the signal. The period is estimated using autocorrelation and spectral energy flux. The beat location is found using cross-correlation between the expected beat locations and the found pulses. While [20] lists this algorithm as performing better in the experiment with spectral energy flux than with autocorrelation, the author of the algorithm mentions in [1] a better performance with autocorrelation in his own evaluation.

The system of Scheirer in [41] is the predecessor of the system by Klapuri. It also works with a bandpass filter, smoothing the output and calculating pulses from it. Pulse induction and pulse tracking are done by a comb filter, which preserves the phase of the signal. The performance of the system seems to be lower than that of the others, although this is an earlier approach: it was the first to use regularly sampled frame features to detect beats instead of onset times. It also introduced comb filter banks to perform pulse induction.

A system not compared, but often cited by the different authors, is that of Goto [18]. He mentions that beat tracking is difficult because the rhythmic structure of the piece being tracked is not known and because it is difficult to find the cues in audio signals. This is solved by extracting audio cues and trying to recreate the rhythmic structure. The algorithm works on onset detection in the frequency domain, using several sub bands, chord changes and drum pattern detection. The chord change detector tries to detect chord changes without detecting the chords themselves. A frequency spectrum is sliced in strips at times where chord changes are likely. These are the moments where a beat is likely to be found, determined using provisional beat times. The system then tries to find different metrical levels: a measure, half-note and quarter-note level. The algorithm works in real time and is used to make a virtual dancer move.

An interesting new algorithm is that by Seppanen [43]. They adapted the algorithm of Klapuri to work on mobile devices by lowering the computational cost significantly. To do this, the filters used are simplified greatly, the comb filters are replaced by autocorrelation with two comb filters for beat phase tracking, and the music model used is greatly simplified, with minimal performance loss.

2.3.3 Score Following

Algorithms that follow a performance with knowledge of the score are called score following algorithms, or on-line tracking algorithms. Some of those algorithms require real time MIDI data instead of audio, like [34, 42]. These require an automated transcription system or MIDI instruments to work.

In [10] a score following system is described that works on audio recordings. The recording is split into short segments of 0.25 seconds and for every segment a chroma vector is calculated. This vector contains the spectral energy in every pitch class (C, C#, D, ... , B). Chroma vectors are made from a score file as well: by creating an audio file from a MIDI file and processing that, or by putting the notes from the MIDI file into the chroma vectors directly.

The chroma vectors of both files are normalized and compared by means of Euclidean distance. The results are stored in a similarity matrix. A path is then sought through the matrix, to realize a mapping from the recording to the score. This technique is called dynamic time warping. Because of this matrix, the algorithm does not work in real time. However, the algorithm can be adapted and the technique of chroma vectors might be useful for following the score.
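The alignment described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of [10]: chroma vectors are built directly from MIDI pitch numbers (the second option mentioned for the score), the "recording" is simulated in the same way, and the step pattern of the dynamic time warping (one step in either or both sequences) is an assumption.

```python
import math


def chroma_from_pitches(pitches):
    """Normalized 12-bin chroma vector built directly from MIDI pitch numbers."""
    vec = [0.0] * 12
    for pitch in pitches:
        vec[pitch % 12] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_path(recording, score):
    """Cheapest path through the distance matrix, as (recording index, score index) pairs."""
    n, m = len(recording), len(score)
    dist = [[euclidean(r, s) for s in score] for r in recording]
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    cost[0][0] = dist[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                cost[i - 1][j] if i > 0 else inf,
                cost[i][j - 1] if j > 0 else inf,
                cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            cost[i][j] = dist[i][j] + best
    # Backtrack from the end of both sequences to the start.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((cost[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


# Hypothetical example: the score has the chords C, F and G;
# the performance holds the first chord for two segments.
c_major, f_major, g_major = [60, 64, 67], [65, 69, 72], [67, 71, 74]
score = [chroma_from_pitches(p) for p in (c_major, f_major, g_major)]
recording = [chroma_from_pitches(p) for p in (c_major, c_major, f_major, g_major)]
path = dtw_path(recording, score)
```

Because the whole cost matrix is filled before the path is traced back from the end, the complete recording has to be available, which is why this form of the technique is not suitable for real time use.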

In [12] the dynamic time warping algorithm used in [10] is adapted for real time use, now called online time warping. The algorithm works by predicting the current location in the matrix and calculating the shortest path back. Only the part of the matrix close to the prediction is calculated, which gives the algorithm linear efficiency. The audio feature used is not very effective, but the algorithm is, meaning that it can be effective when combined with a better audio feature, for example chroma vectors.

In [37] another score following algorithm is described, which works on polyphonic audio recordings. The algorithm works on chord changes and searches through a tree of the different options to determine the tempo of the music being played. It was tested on orchestral classical music and worked accurately for at least a few minutes in most pieces before losing track of the music. The algorithm produces errors on long tones, when no chord changes occur. It is suggested that it should be possible to improve this.

There seems to be no score following algorithm that works completely without any problems, just like there is no beat detector without any problems. The algorithms do however come close and are certainly usable.

2.3.4 Expression Detection

Humans perceive emotions in music. Many systems to detect features describing the musical expression in performances have been researched. An overview of these systems can be found in [17]. In [14] a system is presented that can extract emotions from music. It extracts audio features, such as note onsets, volume and articulation, and uses previous research to map the detected features to emotions. Which features correspond with which emotion is shown in Table 2.1.


Emotion     Motion cues                   Music performance cues
Anger       Large, fast, uneven, jerky    Loud, fast, staccato, sharp timbre
Sadness     Small, slow, even, soft       Soft, slow, legato
Happiness   Large, rather fast            Loud, fast, staccato, small tempo variability

Table 2.1: Musicians' use of acoustic cues and motion cues when communicating emotion in music performance, from [14].

2.4 Analysis of Human Conductors

Only a few studies have been performed in which the behaviour of human conductors is analyzed. In [35], the meanings of different gaze, head and face movements of a conductor are analyzed, based on video recordings. The goal is to create a lexicon of the conductor's face. Part of such a lexicon was made and is included in Table 2.2.

In [15], the effect of various left hand shapes on choral singers has been researched. Tapes with a conductor with different hand-shapes were presented to singers and they were asked to rate their vocal tension. It was found that the hand-shapes used by the conductor could change the vocal tension significantly.

In [45], different ways of indicating dynamic markings to musicians have been analyzed, by letting singers sing along with a video recording of a conductor while a choir was presented through headphones. The volume of the singers was measured. It was found that verbal instructions gave significantly stronger effects than written instructions, gestural instructions and volume changes in the choir.

One of the conductor following systems, by Nakra [31], was used to perform an analysis of muscle tension in six human conductors during conducting. Several detailed observations have been made about how humans conduct. Most correspond to the directions given in conducting handbooks.


TYPE OF MEANING          | SIGNAL                           | LITERAL MEANING                   | INDIRECT MEANING
SUGGEST HOW TO PLAY      |                                  |                                   |
Who is to play           | Look at the choir                | You choir                         |
When to play             | Raised eyebrows                  | I am alerted (emotion)            | Prepare to start
                         | Look down                        | I am concentrating (mental state) | You concentrate, prepare to start
                         | Fast head nod                    | Start now                         |
                         | Look down                        | I am not alerted                  | Do not start yet
What sound to produce    |                                  |                                   |
  Melody                 | Face up                          | High tune                         |
  Rhythm                 | Staccato head movements          | Staccato                          |
  Speed                  | Fast head movements              | Svelto                            |
  Loudness               | Frown                            | I am determined (mental state)    | Play aloud
                         | Raised eyebrows                  | I am startled (emotion)           | It is too loud, play more softly
                         | Left-right head movements        | No! (not that loud)               | Play more softly
  Expression             | Inner eyebrows raised            | I am sad                          | Play a sad sound
How to produce the sound | Wide open mouth                  | Open your mouth wide              |
                         | Rounded mouth                    | Round your mouth                  |
PROVIDE FEEDBACK         |                                  |                                   |
Praise                   | Head nod                         | Ok                                | Go on like this
                         | Closed eyes                      | I'm relaxed (emotion)             | Good, go on like this
                         | Oblique head                     | I'm relaxed (emotion)             | Good, go on like this
Blame                    | Closed eyes + Frown + Open mouth | I'm disgusted (emotion)           | Not like this

Table 2.2: Lexicon of the conductor's face (from [35])


3 Research Question, Assignment

The assignment for the virtual conductor consists of researching the possibilities of a virtual embodied agent capable of conducting a group of musicians in a live performance, and of designing and implementing this agent. The description of the assignment is split into three parts: movements of the conductor, knowledge of the music, and feedback and reaction from the musicians.

For a conductor capable of conducting musicians, a basic version of all three parts is necessary. The main focus, however, is chosen to be on the feedback from and reaction to the musicians. These parts are not entirely independent: for example, to be able to lead a musical performance and give feedback to the performers, the conductor has to possess knowledge of the piece to be played.

3.1 Knowledge of the Music Being Performed

A conductor conducts based on knowledge of the piece that will be played. A conductor knows how this piece is supposed to sound, what each musician will play at which moment, what the tempo should be and where it should change, and where time changes occur. A real conductor will gesture all of this to the musicians. Normally a conductor analyzes and uses sheet music to gather this knowledge. This sheet music does not strictly define how the piece will be performed: the conductor and musicians interpret it, for example in playing style, dynamics and tempo.

The virtual conductor has to store knowledge about a piece and analyze it to be able to translate it into conducting movements. Therefore, a component has to be designed and implemented to read digital sheet music files, perhaps in combination with recorded interpretations, so that he can acquire the knowledge about the piece to be played.

The basic information from which the virtual conductor can conduct is the number of bars, the time and tempo, from which the conductor can make basic conducting movements. The sheet music can further be analyzed for markings indicating aspects of the music such as dynamics, articulations and style. Finally, the notes being played can be analyzed, to find phrasing, as well as the expression of the music. Chord changes, key and rhythm can contribute to this. To analyze this, some way of finding or storing expression in music has to be found.

The sheet music has to be stored in a known file format, preferably one that can be opened and edited by the major music notation programs.

3.2 Movements of the Conductor

From the knowledge of the music, the conductor needs to synthesize movements and gestures to show the musicians how the music should be played. This means a component is necessary to synthesize conducting gestures from knowledge of the music.

The basic movements a conductor makes will be the beat-patterns, which indicate the beats of a measure. For different time signatures different basic strokes are necessary. Added to these basic movements are style variations. For example, if a conductor wishes to indicate that the musicians should play louder, he will make bigger gestures. For legato playing, he will make more fluid gestures, and the conductor should do the opposite for staccato playing.

These gestures will have to be analyzed from a real conductor to be able to synthesize them for a virtual conductor. This analysis should establish what these basic gestures are and how they change with style variations. When synthesizing these movements, a version that handles only the basic movements can be made first. Variations can be added later.


A possible extension is the addition of gestures for the left hand of the virtual conductor. With the right hand, a conductor will indicate the beat. The left hand can be used to signal when a musician, or group of musicians, will have to start playing. It can also be used to indicate that a group of musicians has to play louder or softer, or differently, or that it is completely on the wrong track and should stop.

A normal conductor will use more than arm gestures to conduct music. By looking at one or more musicians he can signal to separate performers. For a virtual conductor capable of signaling to separate performers, the conductor has to know where the musicians are. This could for example be accomplished with a camera, or by telling the conductor in another way where the musicians are located. Facial expressions could also be used - for example to indicate expression in music, but also to indicate someone is making mistakes. In such cases, the conductor can look angry, or smile at someone if they are playing well.

3.3 Interaction between Musicians and Conductor

Making gestures with knowledge of a piece of music is, however, not enough to make a realistic virtual conductor. The conductor should be able to react to the input, either music recorded beforehand or musicians playing in real time. The conductor has to be able to react to what the musicians do, to follow their interpretation of the music, but also to correct them if they make mistakes or to stop them when the performance goes wrong altogether. After such a stop, the conductor should be able to pick up the music at a previous point and try again, perhaps conducting more clearly this time so as to make sure the musicians play correctly.

Ideally the conductor should be able to detect when the musicians start playing, preferably for each musician separately. When there is a longer rest after which musicians start playing again, the conductor could indicate that they should start again. If the conductor can follow the score and detect which notes are being played as well, he might be able to detect mistakes in the performance and give feedback on this. This however is far from a simple task.

The basic part of this can be a beat-detection and prediction algorithm. To provide feedback to the musicians, different gestures or facial expressions can be used. Extensions would be to implement a score following algorithm to better follow the score and perhaps find mistakes in the input. By detecting expression in what the musicians play and doing so in the analyzed music, the conductor could try and provide gestures and facial expressions to indicate the expression that should be played.

There will be a delay required for the processing of the music. This delay means some sort of scheduler will be necessary to plan the timing of the gestures in advance. The scheduler should not plan so far ahead that the conductor cannot react in time, but should also plan far enough ahead to compensate for the delay.
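The trade-off between reacting in time and compensating for the delay can be illustrated with a small sketch. It is not the scheduler of the virtual conductor: it assumes a fixed, known processing delay and a constant tempo, and the names `plan_beats`, `processing_delay` and `horizon_beats` are hypothetical. Beats that fall within the processing delay are treated as already committed, and only a limited number of beats beyond that are planned, so that later tempo estimates can still change the plan.

```python
def plan_beats(now, last_beat_time, tempo_bpm, processing_delay, horizon_beats):
    """Return the times (in seconds) of the next beats to plan.

    Assumption: beats earlier than now + processing_delay cannot be
    influenced by fresh audio analysis, so they are skipped here.
    Only `horizon_beats` beats are planned, to keep the plan adjustable.
    """
    beat_period = 60.0 / tempo_bpm          # seconds per beat
    earliest = now + processing_delay       # first moment that can still be planned
    planned = []
    beat_time = last_beat_time + beat_period
    while len(planned) < horizon_beats:
        if beat_time >= earliest:
            planned.append(beat_time)
        beat_time += beat_period
    return planned


# Example: 120 bpm, last beat at 9.75 s, half a second of processing delay.
print(plan_beats(now=10.0, last_beat_time=9.75, tempo_bpm=120,
                 processing_delay=0.5, horizon_beats=2))
```

A larger `horizon_beats` compensates for a longer delay, but makes the conductor slower to react to a new tempo estimate.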

3.4 Type of Music, Number of Musicians and Input

In the ideal case, a virtual conductor would be able to conduct anything from two people to a whole orchestra, with just a single (stereo) microphone as the input source. This is probably too difficult a setting for the virtual conductor. For the conductor to follow a whole orchestra he would need a quite complex beat following algorithm and it would be difficult to track what separate players do. Therefore, it might be easier to design the system to allow it to conduct a small group of musicians.

It is also possible to use MIDI instruments instead of real instruments. In this case, no transcription system is required for the conductor to follow musicians and the work can be focused on other parts of the conductor first. Later, this can be changed to process real audio signals as well. Another idea is to track individual players with separate microphones. It will be easier to keep track of what separate players do and less complicated algorithms can be used for transcription and score following - at least in case of monophonic instruments.


3.5 Focus of this Assignment

For the conductor, a basic version of all three parts is necessary. The focus of the assignment however is on the feedback between the musicians and the conductor. This means a more basic version of the gestures and the knowledge of music can be researched. However, these three parts are far from independent. For a conductor to react to a group of musicians he needs gestures to be able to do so, audio analysis to be able to listen to musicians and some knowledge about the music played to be able to determine such things as tempo, style and dynamics.


4 Human Conductors: How do they conduct?

4.1 Literature

In the literature, quite extensive descriptions of the tasks of a conductor can be found. A summary based on several of these is given here. A short description of conducting can be found in [6]; a historical overview of conducting handbooks can be found in [16].

4.1.1 Different Conducting Gestures

There are a few basic beat patterns on which conductors base their conducting. The most used are the 1-, 2-, 3- and 4-beat pattern. These beat patterns are illustrated in figure 4.1.

Many variants of beat patterns can be found in literature. Several variations are known in several cultures and styles. A very thorough description of these styles, current and throughout history, can be found in [16]. These beat patterns can roughly be divided into two sections: the preparation and the actual beat. The preparation occurs before the actual beat and also during the upbeat. The preparation is considered more important than the beat itself, because it tells the musicians when the next beat will be and in what tempo [36]. As such, it can be used to change the tempo.

4.1.2 1-Beat Pattern

The one beat pattern is used in music for fast ²₄-, ³₈- and ³₄-measures. A good example of when this pattern can be used is in a waltz. The pattern is the simplest of the patterns and therefore also quite difficult for a human to do well: there is very little room for expression in a 1-beat gesture. The one beat pattern is a simple up-down movement. The movement must be like a stick bouncing on a timpani, or a bouncing ball. This means the vertical movement of the pattern can be approximated with a parabolic function.
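As an illustration of the parabolic approximation, the vertical position of the hand could be computed as in the sketch below. The function name, the choice of height 0 on the beat and the parameter `amplitude` are assumptions made for this example and do not describe the gestures of the virtual conductor itself.

```python
def one_beat_height(t, beat_period, amplitude):
    """Height of the hand at time t (seconds) for a bouncing 1-beat pattern.

    Assumption: the hand is at height 0 on every beat and reaches
    `amplitude` halfway between two beats, following a parabola in between.
    """
    # phase is 0.0 on the beat and approaches 1.0 just before the next beat
    phase = (t % beat_period) / beat_period
    return 4.0 * amplitude * phase * (1.0 - phase)


# Sample one beat of 0.5 s (120 bpm) at five points.
for step in range(5):
    t = step * 0.125
    print(t, one_beat_height(t, beat_period=0.5, amplitude=1.0))
```

A larger `amplitude` gives a bigger gesture, which corresponds to louder dynamics.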

4.1.3 2-Beat Pattern

The 2-beat pattern is mainly used for ²₄- and ²₂-measures and fast ⁴₄-measures. The movement consists of two downward strokes, the first from left to right and the second from right to left, if performed with the right hand. The lowest point of the second stroke is generally higher than the first.

4.1.4 3-Beat Pattern

The 3-beat pattern is used for slower measures in 3, for example a ³₄-measure. It consists of three downward strokes. All beats must be fairly elastic.

4.1.5 4-Beat Pattern

The 4-beat pattern is used for measures in 4, for example a ⁴₄-measure. The 4-beat pattern consists of a stroke down, one to the left, one to the right, one slightly higher to the left again and a stroke up.


Figure 4.1: Beat patterns. (a) 1-beat pattern, (b) 2-beat pattern, (c) 3-beat pattern, (d) 4-beat pattern.


Figure 4.2: Example of legato and staccato 4-beat pattern

4.1.6 5-, 6-, 7- and Other Beat Patterns

The other beat patterns of a human conductor will not be mentioned in detail here. 5-, 6- and 7-beat patterns are used for music with a meter with 5, 6 or 7 beats. Other beat patterns also exist, like a three beat pattern where the first and second beat take two eighth notes and the third beat takes three eighth notes. Many variations on this are possible.

4.1.7 Staccato/Legato Beat Patterns

According to [39], two main variants of these beat patterns exist: a legato and a staccato pattern. A human conductor can vary anywhere in between these two patterns to indicate any articulation in between staccato and legato. The difference between these two patterns is shown in figure 4.2.

4.1.8 Left and Right Hand

A human conductor often uses his right hand to conduct a beat pattern and his left hand to communicate other messages to musicians. With his left hand, a conductor makes gestures that indicate dynamics, cues, expression and many other messages.

4.1.9 Gaze and Gesture Direction

A conductor can indicate cues in music to a complete group of musicians, to a subgroup or to just one musician. He does this mainly with gaze and gesture direction. If a conductor wants to indicate something to all of the musicians, he will usually not look at just one musician, but direct his gaze so that the entire group can see the gesture is meant for them. If however, the conductor wants a message to reach just one musician or a small group of musicians, he will conduct a gesture towards that musician or group of musicians, also looking at those musicians.

4.1.10 Dynamic Changes

To indicate dynamic changes, a conductor has two main methods. The conductor can conduct big for higher volumes and small for lower volumes. He can use left hand gestures to further emphasize this: by raising his left hand, palm up, he can signal musicians to play louder. By lowering his left hand, palm down, he can signal musicians to play softer. With gaze and gesture direction, he can indicate this to a small group or just one musician, or an entire group.


4.1.11 Expression

A conductor will have a wide range of gestures and facial expressions to communicate expression in music. First of all, a conductor can use facial expressions: if he wants music to be played happy and light, it will usually help to look happy himself. If he wants music to be played in a sad way however, looking very happy will not have a good effect on the music.

Besides facial expression, he adapts his beat patterns for different styles. He can conduct smaller with light gentle movements to make the musicians play gentle music. He can conduct bigger and more dramatic for dramatic music and everything in between. He can conduct very clearly for rhythmically complex music, and make movements that no longer resemble the basic patterns for romantic music. This is usually effective on an orchestra, as it immediately knows in what style to play.

4.1.11.1 Facial Expression for Expression in Music or for Tutoring Purposes

When a conductor looks angry, he can mean two things: either he means that the music should be played in an angry way, or he is angry at a particular musician or a group of musicians for something they do. For example, when someone plays far too loud, or plays a lot of wrong notes, a conductor might look angry at that particular person. He might look angry at a whole group to tell them the music should be played in an angry way and should contain the emotion anger. If facial expression is used, it should be clear what is meant with the facial expression.

4.1.12 Cues

A conductor can give a cue to a group of musicians or a musician to tell them they should start playing, for example after a rest. He can do this by looking at the musicians and conducting towards them, making an accent in the conducting gestures. He can also put his left hand forward towards the musicians, palm up, to indicate it is their turn. This helps the musicians begin at the right time, but also helps them play their first notes with enough conviction.

4.1.13 Different styles

Every conductor has his own conducting style, his own way of conducting musicians. The style variations consist of different gestures, a different selection of beat patterns (for example, to conduct in 2 instead of 4), different left hand gestures and different facial expressions. The interpretation of music also differs between conductors, leading to different performances. Conductors will also use words to inspire or correct musicians, and this too is different for every conductor.

4.2 Conversations with a human conductor

During the process of creating the virtual conductor, conversations have been held with a human conductor, Daphne Wassink. A summary will be presented here. During this talk, a working prototype of the conductor was shown, with less than ideal movements.

4.2.1 Movements

The basic pose of a conductor is with the arms slightly spread and slightly forward. The movements should start from that pose. The shoulders should not be used to conduct, unless they are necessary for expressive movements. The hands should never stop moving in a conducting gesture, although they can slow down. The conducting movements should be as fluid as possible. For every beat, the pattern is split into a preparation and the moment of the beat itself. The preparation is what tells the musicians when the beat is and therefore is more important than the timing of the beat itself. A conductor can conduct with only the right hand. If the left hand has nothing to do at such a moment, it can go to a resting position, with the upper arm vertical and the lower arm horizontal, resting against the body of the conductor.

If the size of the movements changes, the movements should be placed higher, closer to the face of the conductor. If the conductor wants to indicate pianissimo or even softer, the conducting movements may be indicated only with wrist or finger movements. The right hand movements should be slightly bigger than the left hand movements, but the downward movements should end at the same point for both hands.

4.2.2 Following and Leading Musicians

If musicians start to deviate from the tempo or start to play less in time, a conductor should conduct more clearly and bigger. The conductor should draw the attention of musicians, by leaning forward and conducting more towards the musicians as well. If musicians play well, the conductor can choose to conduct only with one hand, so he can conduct with two hands only when more attention from the musicians is required. Snapping fingers or tapping a baton on a stand can work to draw attention, but should be used sparingly or the musicians will grow too accustomed to this.

To correct the tempo of musicians, a conductor should first follow the musicians, then lead them back to the correct tempo. Care should be taken that enough time is taken to follow the musicians, or they will not respond to the tempo correction in time and the conductor's beats will no longer coincide with the beats of the musicians.
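The follow-then-lead strategy can be sketched as a sequence of conducted tempos, one per beat. The sketch below is only an illustration under stated assumptions: the number of beats spent following and leading (`follow_beats`, `lead_beats`) and the linear return to the target tempo are hypothetical choices, and the musicians' tempo is assumed to stay constant while it is being followed.

```python
def correction_tempos(musician_bpm, target_bpm, follow_beats, lead_beats):
    """Conducted tempo per beat: first follow the musicians, then lead back.

    Assumption: for `follow_beats` beats the conductor adopts the musicians'
    tempo; over the next `lead_beats` beats the conducted tempo moves
    linearly from the musicians' tempo to the target tempo.
    """
    tempos = [musician_bpm] * follow_beats
    for i in range(1, lead_beats + 1):
        fraction = i / lead_beats  # 0 < fraction <= 1, reaches 1 on the last beat
        tempos.append(musician_bpm + fraction * (target_bpm - musician_bpm))
    return tempos


# Musicians have slowed to 100 bpm while the piece should be at 120 bpm.
print(correction_tempos(100.0, 120.0, follow_beats=2, lead_beats=4))
```

Choosing `follow_beats` too small corresponds to the risk described above: the musicians have not yet locked on to the conductor when the correction starts.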

Just changing the conducted tempo will not work to correct musicians. The musicians should be prepared beforehand that the tempo will change. A conductor should change the preparation of a beat to the new tempo, then change the conducted tempo after that beat.

This should preferably be done on the first beat of a measure. Care should be taken to keep each separate measure as constant as possible. Other than the first beat in the measure, the tempo between two accents should be kept constant, for example between the first and third beat of a four-beat measure.

Another way of getting musicians to play faster is to conduct in the same tempo, but to conduct a beat slightly before the musicians play it. The musicians will instantly know they are playing too fast or too slow and will try to adjust. The conductor can now just follow this and the tempo is corrected.
