
Relating head-movement to sound events.

H.H.W.J. Bosman August 9, 2010

Master’s thesis, Computing Science, University of Groningen
Supervisors:

Tjeerd Andringa (University of Groningen / INCAS3) Michael Biehl (University of Groningen)

Figure 1: Device in situ


Abstract

This thesis stems from the idea that certain sound events draw your attention involuntarily, and that because of this you change your behavior, for instance to determine what caused the sound. The response (the change in behavior) can give information about the underlying processes of the perception of sounds and soundscapes, and can also point back to the sounds that preceded the response.

Most psychological research on the perception of soundscapes at a macro level uses only questionnaires and long-term measures. Research on cognitive processes studies reaction times and brain signals from EEG and fMRI devices at a micro level, but the details of the (attentional) processing of the different sensory modalities are unknown. Creating a non-invasive measurement tool that can measure changes in real-life behavior and correlate these to environmental stimuli can help this research.

Advances in motion measurement technology have given research into human gait and the classification of movement and activity a new impulse. The (change in) motion can distinguish human activities, and it might be well suited to measure a possible change in behavior due to an external stimulus. Computer-aided analysis plays an ever larger role in all this research, and with it the influence of computer science.

For this purpose I built a device to measure and record head-movements and sound. With usability in daily life as one of the goals, this resulted in a glasses frame on which an accelerometer and microphones were placed.

With this device an experiment was set up and carried out to determine whether there was any merit to the idea of relating changes in head-movement to sound events. Twenty people were asked to wear the device during a reading task. During this task several sound stimuli were played that were either congruent or incongruent with the experimental situation. After the task a questionnaire was given. A control group of four did the same setup without hearing the sounds.

While for each participant some movement events were detected during stimuli, these differed among participants. Accumulating these events per time interval did not result in clear relations to the sound stimuli. Thus, from this experiment no clear relation was seen between movement and the sound stimuli.


Contents

1 Introduction
   1.1 Project description
   1.2 Motivation
   1.3 Method
   1.4 Scope
   1.5 Objectives
   1.6 Organization
2 Theoretical background
   2.1 Psychology
      2.1.1 Orienting behavior
      2.1.2 Selective Attention
      2.1.3 Cognitive visual processes
      2.1.4 Cognitive auditory processes
   2.2 Psycho-physiological and movement sciences
      2.2.1 Gait analysis
   2.3 Computer Science
      2.3.1 Multimodal sensor fusion
3 Pilot Experiment
   3.1 Concept
   3.2 Method
   3.3 Results
4 Realization and building of a Proof of Concept
   4.1 Goal
   4.2 Devices
      4.2.1 Hardware
      4.2.2 Firmware
   4.3 Methods
      4.3.1 Software
   4.4 Results
5 Evaluation of relation of head movements to sounds
   5.1 The experiment
   5.2 Result processing
      5.2.1 Signal processing
   5.3 Individual results
   5.4 Group results
   5.5 Questionnaire
6 Conclusion and Future work
   6.1 Possible improvements
      6.1.1 Hardware
      6.1.2 Software
7 Bibliography
A Evaluation experiment
B Thesis proposal


1 Introduction

1.1 Project description

Humans live in an environment full of stimuli. These stimuli are observed during, for instance, a walk through the city, which gives plenty of things to see such as billboards, screens, buildings and people. In addition to the visual stimuli, the constant sounds of city life and traffic form a constantly changing soundscape that also requires one's attention. The choice of what to pay attention to becomes ever more difficult, and the consequences can be major.

The Auditory Cognition Group at the University of Groningen and the Cognitive Sensors Group at INCAS3 investigate the (physically) optimal use of the information conveyed by sounds through the study of natural auditory systems and their integration with the rest of cognition.

These information-conveying sounds do not occur one at a time, but are embedded in a so-called soundscape. However, a soundscape is not a uniformly defined term. In music, for instance, it might be a composition focused on the timbre of sounds, with long pads and no real structure.

In the scientific community, however, the term soundscape, as used in the rest of this thesis, seems to have reached a consensus on the following definition: an environment of sound (sonic environment) with an emphasis on the way it is perceived and understood by the individual, or by a society (Truax (1978)).

In this project, the hypothesis is that a change in head-movement can result from an event in a soundscape. This head-movement might indicate a change in attention, for instance due to an orienting response towards the source to identify it. Such an orienting response is used in infant hearing tests (Bygrave and Stevens (1990)). Changes in a soundscape that are not related to the current task of a subject can influence the subject's behavior (Andringa (2010)). Thus, if we devise an instrument to correlate head-movements and sound, further research may link this movement behavior to cognitive perception processes.

It is interesting to detect the causes of these events, to determine the properties of these events, or to study the processes that evoke the response. One approach to detect events that draw attention is to let the subject fill in a questionnaire. But a questionnaire provides only a psychological description of the soundscape and the events within it over time, as perceived by the subject. What makes a subject change their attention depends on many factors, internal and external. Creating an instrument to measure the changes resulting from sounds would provide a complementary method that might link sound events and movements as they occur.

I have tried to create an instrument to be able to detect these changes and possibly deduce a correlation. The instrument is divided into a hardware part and a software part. The hardware part takes movement and audio measurements. The software part processes the signals to find a correlation.

1.2 Motivation

This thesis is the result of the research done at INCAS3 for my master study "Intelligent Systems" in Computer Science at the University of Groningen. The Auditory Cognition Group of the University of Groningen is researching the perception of soundscapes by humans and how to artificially recreate these perceptions. To do so they are researching methods to single out individual sounds and sources from an audio stream. This involves the fields of acoustics, signal processing, artificial intelligence and psychology.

The goals of this thesis and the device are not limited to studying soundscapes and their content. Other uses might include studies of the human gait (the pattern of movement of the limbs of animals), as seen in section 2.2.1, and it also opens up many possibilities in psychological research. It can help identify relations between sound and head-movements, for instance to correlate speech with head-movement (Munhall et al. (2004)), or to detect sounds to which mentally handicapped people have a high sensitivity.

1.3 Method

This project encompasses a broad range of disciplines next to computer science. After some initial research, the concept of measuring head-movement with an accelerometer was tested in a pilot experiment, to test whether head-movement is detectable and relatable to sound. The results of the experiment were promising and thus further investigations were done based on related work.

After the pilot experiment a prototype device needed to be built that was able to measure head-movement and sound, and that was comfortable and robust enough to be used during this research and possibly future research. To evaluate the prototype device, a psychological experiment was set up with the help of psychologists. This experiment provided a controlled situation in which the responses of participants to a specific set of sounds are measured with the device, with the purpose of evaluating the device's ability to correlate head-movements with the sounds.

1.4 Scope

This thesis concentrates on the initial research into relating head-movement to sounds, to provide a basis for further research. The underlying cognitive processes involve attention processing and specifically auditory attention, which are described in section 2 to give an idea of the principles and how they might correlate with head-movement. The device developed during this project is meant to be an asset in further investigations on head-movement.

Thus the thesis provides a basic method to perform the aforementioned correlation and suggestions to improve upon it. With many factors playing a role and time being limited, not all possible features and classifiers are tested, but an informed choice is made. The Auditory Cognition Group at the University of Groningen and the Cognitive Sensors Group at INCAS3 research methods to detect individual sound events within a stream of audio. Because of this, the audio classification here is done at a very basic level, which can later be replaced by a more precise model.

1.5 Objectives

The objective of this master project was to provide a tool to analyze the relation of head-movements to sound events, with the underlying idea that a (change in) head-movement can indicate a shift in (auditory) attention. Thus it can facilitate further research into this and other fields. Other fields of interest may for instance be physiology, to provide information about the current activity the user is involved in, or the identification of characteristics in the sounds that evoke a head-movement response.

1.6 Organization

The rest of this thesis is organized as follows: in section 2 some available literature in the fields related to the project is discussed. The setup and results of the pilot experiment are then presented in section 3, followed by the realization of the actual Proof of Concept device in section 4; this section also discusses the methods used in gathering and analyzing the data. The process of evaluating the device, together with the hypothesis, is discussed in section 5, and finally the overall discussion and conclusion of the thesis are given in section 6.


2 Theoretical background

This chapter gives an overview of the background of my work and the different fields with which it overlaps. A wide range of fields is interested in soundscapes, and other fields study human behavior and movement. Relevant publications from these and other fields are referenced with an indication of their relevance. The rest of this chapter is divided into sections according to the field of study of the referenced publications.

First, section 2.1 describes research on the psychological and cognitive processes of interest, then section 2.2 discusses related work on human locomotion. Finally, section 2.3 gives an overview of common computer science techniques used to analyze and combine data.

2.1 Psychology

Psychologists aim at understanding the role of mental functions in human (or animal) behavior and at exploring the underlying processes. When considering a black box to be a metaphor for the brain, research on mental functions is aimed at how the black box interacts with the outside world, while research on processes tries to understand what is going on inside the black box. After a small overview of investigations into the effects of environmental sounds on humans, I will review some research on the ideas behind the mental processes of the auditory system.

Azrin (1958) states that previous studies were inconclusive on whether output and speed of work are influenced by noise. Repeated presentations of the same noise unrelated to the task target produced progressively diminishing disruption of the task, i.e. habituation. The repeated noise was kept the same during half-hour periods, and the noise ranged from a 60 dB hum in one period to 100 dB white noise in another. It is acoustic change, not the intensity of the noise, that causes subjects to temporarily return to their unconditioned state. When presentation and termination of the noise are related to the target or the response, however, performance is increased.

Kjellberg et al. (1996) have researched the factors influencing psychological responses to noise, focusing on predictability, controllability, aspiration level, necessity of the noise, informational content and ongoing activity. To measure these responses the participants had to fill in a questionnaire after the experiment with questions regarding these factors, and the questions were mapped to the following nine variables: surprising changes of noise, control, reduction possibilities, sources, sex, hearing status, task evaluation, workplace, and self-rated sensitivity.

They found annoyance and distraction to be two different influences of noise. Sound level, presence of a tone, type of workplace and self-reported noise sensitivity are the factors that significantly influence the noise annoyance rating. Predictability and controllability of the noise have significant influence on the distraction response but not on the annoyance, while the perceived possibility to lower the noise level in the workplace resulted in stronger reported annoyance without having an effect on the distraction.

Miedema and Oudshoorn (2001) have made a model to fit the relation of annoyance level with regard to long term level measurements of transport traffic noise. Transport is then divided into the categories roads, airports and railways. Similar to other studies (e.g. Hoeger et al. (2002)), the measures used are the Day/Night Level (DNL) and Day/Evening/Night Level (DENL) scale. They mention that different studies have a different number of response categories which results in difficulties translating scales into comparable measures.

All these studies have in common that they measure the perceived effect of sound after a period of time, and in that period of time the perception might change. Accounts of the perception differ per subject, as their observations are recalled from memory and are possibly influenced by previous experiences (memories). The sound itself is measured in only a few descriptive features, averaged over longer periods of time. Both the measure of perception and that of sound disregard singular events, which might change the whole measured perception of the sound environment.

In the following we will zoom in on more specific topics of psychology that also investigate the effects of single sound events.

2.1.1 Orienting behavior

Humans and animals alike respond to new, unfamiliar sounds. Pavlov and Anrep (1960) describe the response to changes, referred to as the orienting response (OR), as the "reflex which brings about the immediate response in man and animals to the slightest changes in the world around them, so that they immediately orientate their appropriate receptor organ in accordance with the perceptible quality of the agent bringing about the change, making full investigation of it". It may involve external changes, such as changing skeletal position, or internal changes, such as shifting the focus of attention.

Even babies show this response, as they react to new sounds in their environment. This fact is used to perform hearing tests with infants. By playing a sound near the left or right ear, doctors can watch the shift of attention indicated by subtle head, mouth and eye movements (Bygrave and Stevens (1990)). These head and eye movements are aimed at (visually) localizing the source of the sound for orientation and, as Kearsley (1973) suggests, at assessing a possible threat as a defense mechanism.

Thompson and Masterton (1978) have found that cats respond to a sound source that has not been habituated invariably with a latency of 20 to 80 ms, and that they terminate the turn very accurately. McDonald et al. (2000) provide psychophysical evidence that an auditory cue can involuntarily orient human auditory attention towards the location of that cue. Feature extraction at early processing stages is not influenced by concurrent sound cues, but preceding sound cues do increase the sensitivity. Uno and Grings (2007) suggest that these autonomic responses, which form part of the orienting response, decline in magnitude with repeated presentation, but increase with increased noise intensity.

Thus it seems that this orienting response can be involuntary and that the response can result in repositioning of sensory modalities to attend to the new stimulus. This repositioning then might be measured, and could give insights into what sounds evoke these responses and what processes might be involved in doing so. One of these processes involves the focus on only one aspect of the environment, selective attention.

2.1.2 Selective Attention

The cognitive ability to focus or concentrate on only one aspect of the environment while disregarding other information is called selective attention. The following sections give an overview of some of the research in this extensively studied field, with first some general relevant research followed by research focused on specific sensory modalities.

Yantis and Johnson (1990) and Remington et al. (1992) found that task-related visual distractors might automatically and involuntarily capture spatial attention. But is this spatial attention only visual when the distractors are visual? Driver and Spence (1998) argue that previous studies on selective attention typically address single sensory modalities, while in the real world all our senses are stimulated simultaneously. They note that many of the neural structures that have been implicated in spatial attention are known to be heavily involved in cross-modal interactions, and are thought by many investigators to sub-serve supra-modal representations of space.

In the experiments they ran, they cued a participant's left or right side to attract covert attention to that side. They then measured the time it took for the participant to judge whether a light at a top or bottom corner of a board was turned on, but that light did not have to be on the same side as the attentional cue. They found that a spatially non-predictive, task-irrelevant abrupt sound on one side leads to better elevation judgments for both visual and tactile events presented in the vicinity of the sound shortly after its onset. The case was similar with spatially non-predictive tactile events on the hand. However, they found that peripheral visual cues do not affect auditory judgments.

In other experiments Driver and Spence (1998) found that coding of auditory location changes with eye deviation in the head. In general, covert attention mapping changes with how current posture realigns sensory receptors across various modalities.

Recent attention research focuses more on the understanding of neural mechanisms by using EEG or (f)MRI techniques. An MRI scanner consists of a large cylindrical electromagnet coil cooled by liquid helium in which a patient is placed. EEG measurements are made by placing electrodes on one's scalp with a conductive gel; the electrodes can be embedded in caps or in a headband. These techniques either do not provide much mobility or are quite invasive and, as such, are not very useful for real-world applications.

Visual attention is also addressed more than auditory attention, but some research argues that similar processing is used in the auditory domain (Scholl (2001b); Johnston and Dark (1986)).

Thus, the following sections first describe some details of the cognitive visual system, followed by details of the auditory system.

2.1.3 Cognitive visual processes

Vision provides humans with important information about the environment. Therefore a broad range of research topics relates to visual processing. I will review some papers that might give insight into the hypothesis of this thesis and into why visual attention might be relevant.

Examples are research into eye-movements, the research on visually cueing attention mentioned in section 2.1.2, and how visual attention and memory work together.

Yantis and Johnson (1990) have researched mechanisms of priority for visual onsets. They propose that this mechanism first directs attention based on a limited set of high-priority elements, after which other elements are processed. This limited queue of prioritized elements may vary depending on the individual and the task at hand, and even on element properties, but in the event of onsets an average of 4 elements is prioritized.

The question arises of what elements of attention are kept to focus further processing on.

Scholl (2001a) has reviewed the state of the art of visual scene analysis. Standard models of attention select properties like spatial regions and visual features, but more recent research suggests that in some cases not only features but whole objects can be selected too. Such objects can be grouped together, consist of parts or have one or more surfaces. For instance, brain areas have been identified which respond selectively to faces or to the shape of the environment, depending on the selective attention. Evidence supports the view that auditory processing has similar grouping methods. What space-time is for a visual scene, time-frequency is for an auditory scene, and separate streams for processing the source and location of elements exist in both modalities. Occlusion, for instance, occurs visually when one object moves behind another, and aurally when a sudden burst of noise masks an ongoing frequency development.

Berry et al. (2007) have tested a camera system which records Activities of Daily Living (ADL) to improve the autobiographical memory of subjects with memory impairments. They have tested the use of their system on a 65-year-old subject with memory impairment. A baseline trial was conducted in which their subject recalled 2% of significant events after 7 days; the score was based upon the number of events, and specific details thereof, recalled. When reviewing images taken by the system multiple times, their subject could recall 70% of the events after one month of not viewing the pictures.

In other research the link with other sensory modalities is encountered and, as in the aforementioned research, the use of memory in the recognition of the signals plays an important part. A speaker's head movements improve a listener's auditory speech perception (Munhall et al. (2004)), and blind people may still use parts of their visual processing system, for instance to improve their verbal memory performance (Amedi et al. (2003)) or to aid the recognition of patterns in sound (Arno et al. (2001)).

Attention processes seem to be similar to each other (Shinn-Cunningham (2008)) and some subprocesses might even be shared between modalities. The orienting response lines up the sensory modalities towards the signal that attracted the attention.

2.1.4 Cognitive auditory processes

In addition to the visual system, the auditory system too provides information on the environment. It is believed that this system works in a similar way to the visual system, but many parts are still not understood (Shinn-Cunningham (2008)). How different signal sources are separated from one audio stream, what features are used and which sounds attract our attention are of much interest for understanding audio processing and for improving automated sound recognition.

Speech-like sounds attract attention, especially when presented to the left ear (Kopp et al. (2004); Hadlington et al. (2004)). Even when not paying attention to all the different conversations going on around you, at a cocktail party for instance, you can recognize what one person is saying: the cocktail party effect. At the same time one can hear important information, such as one's own name, in other conversations (Cherry (1953); Johnston and Dark (1986)).

Experiments also show that hearing irrelevant speech and tones during a memorization task of a list of length 7 ± 2 results in a decreased serial recall performance (Hadlington et al. (2004)).

Results of an EEG study show that a memory task for visually presented stimuli alters sensory processing in human auditory cortex, even when subjects are explicitly instructed to ignore any auditory stimuli (Valtonen et al. (2003)).

Sugita and Suzuki (2003) have shown that the human brain compensates for the fact that sound travels more slowly than light, up to a distance of 40 m from the source. The reaction time for an auditory stimulus is determined to be 140-160 ms on average for healthy people, but it is longer for diabetes patients, for instance (Parekh et al. (2004)).

Kayser et al. (2005) propose a model for an auditory saliency map, i.e. a mechanism for selecting the salient stimuli potentially relevant to behavior. They mention that similar models for the visual system were shown to replicate properties of the human overt attention, but a model for auditory attention has not been explored before. Thus, they created a salience map by extracting features like spectral and temporal modulation in parallel, normalizing these features with a sliding window in a manner consistent with psycho-acoustical masking effect, and integrating over individual features.

This model replicates basic properties of auditory scene perception as mentioned in the literature. As they state: "Both short and long tones on a noisy background are salient, as are gaps (the absence of frequencies in a broad band noise); long tones accumulate more saliency than short tones; temporally modulated tones are more salient than stationary tones; and in a sequence of two closely spaced tones, the second is less salient in agreement with the phenomenon of forward masking."

It is noted that their results implicate a similar mechanism for visual and auditory saliency detection, if not by the same multi-modal cortical areas.


Smith and Lewicki (2006) found that a complete acoustic waveform can be represented with a nonlinear model based on a population spike code. The waveform features are optimized for coding natural sounds or speech of three types: mammal vocalizations, environmental ambient sounds and environmental transients. The features extracted from these types show similarities to time-domain cochlear filter estimates, have a bandwidth dependence similar to that of auditory nerve fibers and yield greater coding efficiency than conventional signal representations.

Gist processing

A traditional view holds that perception is mostly a bottom-up process that progressively increases the detail of signal features. Attentional processes then typically direct the feature extraction to features of interest. In reality, however, some features of the human perception system are not explained by this view, like the ability to grasp the overall content of rapidly presented images. Therefore there are recent investigations into a top-down process that directs the attention of the bottom-up processes. This top-down process requires information from previous processing but also from the gist of the current scene: glimpses of early extracted information.

Harding et al. (2007) give an overview of a line of research on selective attention. They are among the first to suggest applying the ideas of gist processing (rapid detection of the 'gist' of an information stream) to the auditory system. Gist processing research has mainly been done on the visual system, but they summarize the evidence for similar processing in the auditory system. There is no consensus yet about how the information of (the gist of) a scene is represented in the brain. It might be represented with fragments and glimpses or with a smoothed, lower-resolution representation of the information. In general, the hierarchical (bottom-up and top-down) processing model does not seem to fit the results of research in this area. Initial early bottom-up processing steps may provide the information to top-down processes that direct the attention.

Like Driver and Spence (1998), Harding et al. (2007) too suggest that auditory spatial attention can be primed with visual cues. Experiments have shown that the visual system can rapidly become aware of the surroundings without being aware of detail, or that it singles out an odd detail in contrast to other items in the scene, the ’gist’ perception of a scene. After determining this gist, the part that is given attention to as a result can be scrutinized.

The difference between hearing and listening is introduced by Harding et al. (2007) as follows: hearing is a bottom-up processing stage that provides an overview and possible categorizations of the whole auditory scene, while listening is the top-down processing stage that focuses on the attended source and analyzes its detail. Speech, for instance, can be understood in difficult listening conditions, when the speech signal is heavily modulated or when part of the speech is masked. When not attended to, certain streams (characteristics of the sound signal) are undifferentiated, thus one can fail to notice the change of speaker when one is asked to memorize and repeat the spoken words.

Another overview of auditory attention research is given by Fritz et al. (2007). They also suggest auditory scene analysis to be a multi-stage process that draws on bottom-up gestalt grouping primitives, on auditory memory (prior knowledge or expectations of the auditory scene), on attention, and on other top-down control. This process has evolved (in man and animal alike) a fairly sophisticated novelty detection system which reacts to any deviations or irregularities in sequences over periods as long as 20 seconds.

Niessen et al. (2008) argue that the interpretation of sounds is dependent on context and that sounds can be disambiguated by a context. Therefore the performance of automatic sound recognition systems can be greatly improved by using contextual knowledge. Similar systems are used in speech recognition, e.g. grammar and statistics.

It seems that while we can listen to a mishmash of sounds we can only identify a few sources at a time. Which sources we identify depends on the context and task at hand. Suddenly (dis-)appearing or changing sources or sources with a special meaning, like your name, stand out.

2.2 Psycho-physiological and movement sciences

In addition to cognitive processes, another factor of relevance for the hypothesis in this thesis is the human body. The body performs actions due to signals from cognitive processes that in turn can modify the input to the cognitive processes. This allows us not only to perceive and model the world but also to influence and explore it. This section explores the methods and techniques used to analyze the patterns of human movement (gait).

There is a lot of research on how we use our bodies, not only to mimic this in robots but also as input for computer systems to determine, for instance, the user's state. Other fields use these technologies to diagnose injuries or improve performance (e.g. in sports). In the field of motion sciences, devices exist that provide full-body motion capture using accelerometers and gyroscopes (Brodie et al. (2008)) to get detailed information on the human gait. De Rossi et al. (2003) even discuss different types of biometric sensors which can be integrated into clothing fabrics. Eye-tracking devices are used to measure visual attention in many fields (Duchowski (2007)), like psychology and neuroscience. Although none of these devices are aimed at auditory attention, they can provide insights and ideas on measuring head-movement for the current project.

2.2.1 Gait analysis

Gait analysis is an active research area that analyzes patterns of human movement and devises techniques to do so. It is of special relevance to this thesis because of its analysis of movement, thus in the following I will give an overview of relevant research.

Using a Kohonen Self-Organizing Map (KSOM), Van Laerhoven (2001) defines an algorithm to map feature vectors to a type of context, to make devices context-aware. That context could be a state a device is in or a situation/location that the device and its user are in. To overcome the untangling problems of a KSOM in the initial phase, a second layer of KNN clusters is introduced for every neuron in the KSOM. As features they calculated the minimum, maximum, mean and standard deviation of accelerometer data in a window of 3 seconds (resulting in 100 samples/window) with an overlap of 90%.

To recognize Activities of Daily Living (ADL), Kim et al. (2007) used accelerometers placed at the ankle, thigh, waist and wrist. There are multiple ways to classify the accelerometer data, including fixed thresholds, pattern matching, neural networks and decision trees (DTs).

Kim et al. (2007) used a DT to classify the activities. The features they used (mean, energy, entropy and correlation in the frequency domain) are based on the research of Bao and Intille (2004). These features are calculated for a window of 256 samples representing 4 seconds of data, with an overlap of 50%. In addition to movement data, an RFID system is used to identify objects that are used or grabbed, which improves the accuracy from 82% to 97%.

Ravi et al. (2005) focus on the classifiers used to recognize ADL. Using windows of 5.12 seconds (256 samples) with 50% overlap they calculate the features mean, standard deviation, energy (the sum of squared FFT component magnitudes, normalized by the window length) and correlation. With this set of features they tested single classifiers and combined classifiers, or meta-classifiers. The meta-classifiers tested are Boosting, Bagging, Plurality Voting, Stacking with Ordinary Decision Trees (ODTs) and Stacking with Meta-Decision Trees (MDTs). The authors promote the use of the latter (meta-classifiers), Plurality Voting in particular, supported by the high classification accuracy on their data.

To conserve power, Wu et al. (2007) classify context with a Bayesian classifier using features from pulse oximeters and accelerometers, in order to switch energy-intensive physiological sensors on or off. Specifically, they monitor Electrocardiogram (ECG) signals after exercise. When the oximeter pulse rate is high, the two 3D accelerometers are switched on to determine whether the subject is resting, walking, jogging or running. The accelerometer is sampled at 100 Hz and the data is windowed in 512 samples with 80.5% overlap. The features extracted are the peak spectral value and the total spectral energy (again the sum of squared FFT component magnitudes, normalized by the window length). When resting or walking, the ECG is switched on. Due to sensor fixation, pulse oximeter errors occurred, which resulted in erroneously classified contexts, but in general this system promises energy-efficiency improvements by using the context to switch the use of extra sensors.

Fall detection can be done sufficiently well by using threshold detection. Bourke et al. (2007) determined thresholds for data from accelerometers worn at the trunk and thigh when falling. Training data was created with young adults, while the target group was elderly people. With test data of ADL from elderly people it was determined that the thresholds hold during normal activity, and thus falls could be correctly detected.

To recognize patterns in signal streams it is important to segment the different patterns.

Guenterberg et al. (2007) suggest a generic method, using the local standard deviation of a signal, to segment sensor data. Using an N-sample window the segmentation is done by thresholding the standard deviation. The threshold is variable; in periods of high activity the threshold is also higher. By correlating different sensors, the authors suggest, the results can be significantly improved.

A combination of ADL classification and fall detection was the goal of Nyan et al. (2006).

The accelerometer was positioned at the shoulder inside a garment for wearer comfort, with other electronics in the garment pocket. Accelerometer signals were acquired at 50 Hz and windowed with 350 samples, or 7 seconds, and an overlap of 40%. Features were extracted in the frequency domain with wavelet transforms (Daubechies mother wavelet, db5). A fall was detected when a threshold of 4.8 g over the sum of accelerometer data was crossed.

Some user state transitions, like lying down to sitting upright, were distinguished with a decision tree based on thresholds of signal variance and signal filtering with wavelet decomposition/composition. To determine if a subject was walking, 5 or more successive peaks above a threshold of 0.1 g had to be present in the signal. Distinguishing level walking from ascending or descending stairs was then done with classification of wavelet features. The first 25 wavelet coefficients of a wavelet transform, of which every 5 successive coefficients were averaged, resulted in 5 features. These 5 features were classified with the nearest neighbor of the training data using Euclidean distance. Other activities were also classified with only the wavelet features.

True Positives (TP) were equal to the number of true sit-stand transitions correctly detected by the system. False Negatives (FN) were equal to the number of sit-stand transitions not detected or wrongly detected by the system. True Negatives (TN) were equal to the number of other types of activities detected by the system which were not true sit-stand transitions. False Positives (FP) were equal to the number of other types of activities wrongly detected as sit-stand transitions. With this system, the sensitivity (TP/(TP+FN)) was 94.98% and the specificity (TN/(TN+FP)) was 98.83% over 1495 activities.
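
As a small worked illustration of these two measures (my own sketch in Python, with hypothetical counts rather than the counts reported in the paper):

```python
def sensitivity(tp, fn):
    # Fraction of true sit-stand transitions that the system detected.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Fraction of other activities that were correctly not flagged.
    return tn / (tn + fp)

# Hypothetical counts, chosen only to illustrate the formulas.
tp, fn, tn, fp = 170, 9, 1290, 15
print(f"sensitivity = {sensitivity(tp, fn):.2%}")  # 94.97%
print(f"specificity = {specificity(tn, fp):.2%}")  # 98.85%
```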

Ward et al. (2005) have written multiple papers on using sensors to classify hand gestures. They have used multiple 3-axis sensors and microphones on different parts of the body to recognize gestures. In Ward et al. (2005) they use only one 3-axis accelerometer together with a microphone worn on the wrist. Their previous work relied on differentiating between sensors on different body locations. Hand-gesture data was gathered from users working in a wood workshop and divided into 9 activities and a 'null' bin for everything else.

Audio data was sampled at 2 kHz and windowed per 100 ms, with a 75% overlap. Each window was then transformed to the frequency domain with an FFT. Linear Discriminant Analysis reduced the FFT dimensionality from 100 to 8, and the resulting vector was classified with the nearest class mean. Accelerometer data was gathered at 100 Hz and also windowed per 100 ms with 75% overlap. The extracted features (X-axis mean, X-axis variance, number of peaks in the x, y, z signals and mean amplitude of those peaks) were used as input for an HMM.

Using one of the methods 'Comparison of Top Rank' and 'Logistic Regression', the outputs of the classifiers were joined, resulting in up to 70% accuracy.
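
The audio pipeline just described (100-dimensional FFT magnitudes reduced to 8 dimensions with Linear Discriminant Analysis, then classified by the nearest class mean) can be sketched as follows. This is my own minimal illustration with randomly generated stand-in features, not the authors' code.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical training data: 10 gesture classes (9 activities + 'null'),
# 40 windows each; the 100 columns stand in for FFT magnitude bins.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(10), 40)
X = rng.normal(size=(400, 100)) + y[:, None] * 0.1

lda = LinearDiscriminantAnalysis(n_components=8).fit(X, y)
Z = lda.transform(X)                                   # 100-D -> 8-D
class_means = np.array([Z[y == c].mean(axis=0) for c in range(10)])

def classify(window_features):
    """Assign the class whose mean is nearest in the reduced LDA space."""
    z = lda.transform(window_features.reshape(1, -1))
    return int(np.argmin(np.linalg.norm(class_means - z, axis=1)))

print(classify(X[0]))  # classify one (hypothetical) training window
```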

Van Laerhoven et al. (2002) were inspired by nature, where massively parallel neural pathways supply vast amounts of sensed impulses to the brain. Thus the concept of many simple sensors was conceived, which the authors evaluated. Data from these sensors can be combined either by processing it locally and sending the results to a centralized system to be merged, or by processing the sensor data centrally. The authors chose to process the sensor data centrally based on ease of implementation, although they do mention that decentralized processing increases robustness. For their system the authors used 30 accelerometers positioned at different places on the human body, each sampled at 30 Hz. With this system they tested claims on the number of sensors and contexts: the number of contexts to be distinguished has a negative effect on predictions, while sensor fusion theory states that adding sensors improves accuracy and speed of recognition. With the data gathered, the authors can substantiate these claims.

Lau et al. (2008) used a 2D accelerometer and a 1D gyroscope, measuring ankle movement and shank rotation respectively at 240 Hz. With these data they have tried to classify 5 different walking conditions: level-ground walking, stair ascent, stair descent, upslope and downslope walking. Features extracted were the minimum turning point (time) from the shank, the amplitude values of the shank and foot at the pre-swing phase and the amplitude of the peak values from the accelerometers during the swing phase. After comparing SVM, ANN, RBF and Bayes classifiers, the authors chose the SVM classifier because it outperformed the others. With both shank and foot features, the classification accuracy over the 5 classes was 85%, and with shank data only it was 78%.

Kunze et al. (2005) are able to recognize the body position of an accelerometer sensor on a human wearer with about 90% accuracy. To do so they first try to identify when a user is walking, followed by determination of the placement on the body. From a signal window of 1 second with 50% overlap the following features are extracted: RMS, 75th percentile, interquartile range, squared FFT components, FFT component entropy and the power of detail signals at given levels from a discrete wavelet transformation. For each window it is determined whether it represents a walking or non-walking activity with a C4.5 (tree-based) classifier. When walking, further classification, again with C4.5, is done to determine the position. On a frame-by-frame average, 80% is correctly classified as one of four positions when the walking state is automatically determined, and 89.8% when the walking state was pre-labeled. When combining frames into segments of some tens of seconds up to 3 minutes, a majority-vote decision resulted in a 100% recognition rate.

The aforementioned papers analyze the human gait to determine or classify the user’s state, or current action. Several systems are designed and tested in specific situations only, like working in a tool shop. The systems are usually limited to a fixed number of states, like sitting, walking and running, and sometimes have an ’everything else’ state, which results in varying performance.

Accelerometer data is commonly sampled at around 50 Hz, sometimes more and sometimes even as low as once per 3 seconds. The analysis is almost always done by using a sliding window over the signal to determine features in that window. This sliding window is on average 3 to 5 seconds long, but can be as small as 100 milliseconds or as large as minutes, and on average it is 50% overlapped. There is no consensus on which classifiers to use on these features, but commonly seen classifiers are thresholding of g-forces, decision trees, neural networks, nearest-neighbor classifiers and meta-classifiers (combining the results of multiple classifiers).

The features that are most commonly used are: the signal mean and standard deviation; frequency components (from an FFT or wavelet transform) and statistics thereof (total energy, entropy, maximum, largest frequency component, etc.); and the correlation between different axes of the signal or of the frequency components.
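
To make this common recipe concrete, the following is a rough Python sketch (my own illustration, not code from any of the cited papers) that slides a 50%-overlapping window over a 3-axis accelerometer signal and computes the features listed above; the 50 Hz rate and 3-second window are typical values from the review.

```python
import numpy as np

FS = 50                  # Hz, a typical accelerometer sampling rate
WIN = 3 * FS             # 3-second window
HOP = WIN // 2           # 50% overlap

def window_features(acc):
    """acc: (n_samples, 3) array of X, Y, Z accelerations in g.
    Returns one feature vector per window: per-axis mean and standard
    deviation, per-axis FFT energy, and pairwise correlations between axes."""
    feats = []
    for start in range(0, len(acc) - WIN + 1, HOP):
        w = acc[start:start + WIN]
        mean, std = w.mean(axis=0), w.std(axis=0)
        spec = np.abs(np.fft.rfft(w, axis=0))
        energy = (spec ** 2).sum(axis=0) / WIN            # per-axis energy
        corr = np.corrcoef(w.T)[np.triu_indices(3, 1)]    # XY, XZ, YZ
        feats.append(np.concatenate([mean, std, energy, corr]))
    return np.array(feats)

# Hypothetical signal: 10 s of low-amplitude noise around 1 g on the Z axis.
acc = np.random.default_rng(1).normal([0, 0, 1], 0.02, size=(10 * FS, 3))
print(window_features(acc).shape)   # (n_windows, 12)
```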

While the classifications used do not really deal with unconstrained real-world usage, and only a few papers determine the transitions between states (where Guenterberg et al. (2007) have quite a generic approach), the features that are commonly used give a good starting point. The most common features are the average and standard deviation of the signals, frequency spectra, features based on the frequency spectra like the energy or the largest frequency component, and the correlation between signals.

2.3 Computer Science

Computer science contributes to many other research fields; it provides the tools and techniques to analyze data. In the above review of papers a lot of techniques have already been mentioned, and many more can be found in the literature, for instance Mitchell (1997). In the following section I discuss some techniques that are used to combine multiple sensors and classify their output, but first a short overview of some classification algorithms mentioned earlier.

One of the most basic classification schemes is thresholding: choosing one of two classes based on the signal value exceeding some fixed value. Decision Trees use a tree-like model of decisions, for instance thresholds, to determine outcomes/classes. Self-Organizing Maps (SOMs) produce a low-dimensional discretized representation of sensor data, which is convenient for visualization and clustering. K-Nearest Neighbors classifies a feature vector by comparing it with stored feature vectors that are close in feature space and assigning the most common class among them.

Neural Networks are modeled after biological neural networks such as those in the central nervous system: interconnected artificial neurons mimic properties of biological neurons. Bayesian networks instead model dependencies with Bayesian probability; Hidden Markov Models are among the simplest of these. Useful for temporal pattern recognition, they output probabilities for a sequence of states that match the given observations. Meta-classifiers aggregate the output of several classifiers to generate an output class; that class might be chosen with (weighted) voting or with yet another classifier, like a decision tree.
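
As a minimal illustration of the two simplest schemes mentioned here, thresholding and a voting meta-classifier, consider the following sketch (my own example; the threshold value and class names are hypothetical):

```python
from collections import Counter

def threshold_classifier(peak_acceleration_g, threshold=1.8):
    # Choose one of two classes based on the signal exceeding a fixed value.
    return "fall" if peak_acceleration_g > threshold else "normal"

def majority_vote(predictions):
    # Meta-classifier: pick the most common class among the base classifiers.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three base classifiers for one window of data.
votes = [threshold_classifier(2.1), "normal", "fall"]
print(majority_vote(votes))   # "fall"
```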

2.3.1 Multimodal sensor fusion

Classifying a single sensor signal can be done with a classifier like those mentioned above. When there are multiple sensors in a system, however, which may even measure different physical properties, combining these signals is not straightforward. With sensor technology becoming more advanced and more available every day, combining sensor signals becomes ever more important. Thus, in the following I describe some ideas for fusing sensor data.

Godfrey (1980) describes the mathematical principles of correlation methods, which can be used to measure the similarity of two signals. This is usually done to find a (shorter) signature signal in a continuous signal stream. However, these methods require a stationary mean of the signals.
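
A minimal sketch of this idea (my own illustration, not from Godfrey (1980)): cross-correlate a zero-mean signature with a zero-mean stream and take the offset of the highest correlation.

```python
import numpy as np

def find_signature(stream, signature):
    """Return the offset at which `signature` best matches `stream`,
    using cross-correlation of the zero-mean signals (the mean is removed
    because correlation methods assume a stationary mean)."""
    s = stream - stream.mean()
    t = signature - signature.mean()
    corr = np.correlate(s, t, mode="valid")
    return int(np.argmax(corr))

# Hypothetical example: a short pulse hidden in noise at sample 300.
rng = np.random.default_rng(2)
signature = np.sin(np.linspace(0, 4 * np.pi, 50))
stream = rng.normal(0, 0.3, 1000)
stream[300:350] += signature
print(find_signature(stream, signature))   # close to 300
```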


The state-of-the-art overview from Hall and Llinas (1997) summarizes techniques for data fusion used by the U.S. Department of Defense. They state that sensor data may be fused at different levels, from raw data to state or decision level. When sensors measure the same physical properties the raw data can be combined, but when those properties differ the data must be fused at a higher level. There is, however, no single optimal solution for data fusion; an architecture should balance resources like processing power and funding against accuracy and capabilities.

Network intrusion detection systems, like the one described by Valdes and Skinner (2000), use an array of different sensors to alert security officers to possible intrusions. In such a system, a large number of individual sensors report possible intrusions as alerts to a central system.

The central system then aggregates these data and outputs a probability of intrusion which, above a certain threshold, generates an alert. Alert probabilities are computed by looking at the correlation between sensor intrusion alerts and at temporal rules describing which sensor alerts will precede others.

Wun et al. (2007) provide an architectural overview of a system that semantically fuses sensor data to provide reports of that data to subscribers of the system. By using domain knowledge about the raw sensor data and its interpretation, reports of events are generated with the semantics of the events next to the raw data. Because different knowledge domains can have different terminology for the same events, a middleware layer is used that translates terminology using an ontology before publishing data to the subscribers.

Sensors of multiple modalities are also used in the search for automated recognition of human emotions. Busso et al. (2004), for instance, use speech sound and images of facial expressions and apply feature-level fusion to recognize emotions, while Paleari and Lisetti (2006) propose an approach that fuses at all levels (data, feature and decision) and that can incorporate new input modules, based on Scherer's theories of emotion, on the fly.

Fusing multiple sensors of the same kind is widely discussed in literature but it seems there is no consensus on the best approach to fuse sensor signals that measure different physical properties, as already suggested by Hall and Llinas (1997). Combining the results of different classifiers in some way is a common choice that is reasonably easy to implement.

Overall, the above subsections give an overview of relevant research. No research papers were found that describe similar devices or methods, which explains the broad range of research fields described. There is a lot of research into human movement, into psychological and psycho-physiological questions and into auditory perception, but the combination of these is rare.

The following sections describe the details of my project. It combines several ideas mentioned above to create a system to measure changes in head-movement and correlate these to sound events.


3 Pilot Experiment

This small chapter discusses the design and execution of a small pilot experiment, designed to see whether there was any merit to the hypothesis, i.e. whether head-movement can be measured and related to sounds. First the design of the experiment is explained, after which the results are discussed.

3.1 Concept

The hypothesis is that head-movements can be related to sounds from the environment. To test the concept and the feasibility of measuring head-movement, we designed a pilot experiment.

Because of the wide availability of a ready-to-use accelerometer in the form of a WiiMote, this was chosen to measure head-movement instead of, for instance, a gyroscope. Communication with a PC requires just a Bluetooth dongle and some software, which made interfacing with this sensor easy.

There were basically two questions that this experiment needed to answer:

1. Is an accelerometer sensitive enough to pick up subtle head-movements that might result from sounds?

2. Could there be any relation of head-movements to sound?

3.2 Method

The following describes the method used for the experiment.

Environmental sound was recorded with a standard PC microphone and sound card. Sound data was recorded monophonically at 32 bit, 44 kHz.

The WiiMote was connected to the PC via a Bluetooth link using the software library CWiid (http://abstrakraft.org/cwiid/). The WiiMote sends accelerometer information at a variable rate, and each packet received has a time-stamp. When a lot of acceleration change occurs, the rate at which the acceleration information is sent is high, and with few changes the rate is low. To account for this difference a fairly high recording sample rate of 4 kHz was chosen, and the data was linearly interpolated with respect to the time-stamps. The data is saved as a 32-bit, 4 kHz audio file with 3 channels.
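
The resampling step described above can be sketched as follows (a minimal Python illustration of timestamp-based linear interpolation; the actual recording software differed in its details):

```python
import numpy as np

TARGET_RATE = 4000  # Hz, the fixed recording rate chosen for the pilot

def resample_to_fixed_rate(timestamps, samples, rate=TARGET_RATE):
    """Linearly interpolate irregularly timed packets onto a fixed time grid.
    timestamps: increasing times in seconds; samples: (n, 3) accelerations."""
    grid = np.arange(timestamps[0], timestamps[-1], 1.0 / rate)
    return np.column_stack(
        [np.interp(grid, timestamps, samples[:, axis]) for axis in range(3)]
    )

# Hypothetical packets with uneven spacing, as sent by the WiiMote.
rng = np.random.default_rng(3)
t = np.cumsum(rng.uniform(0.001, 0.05, size=200))
acc = rng.normal([0.0, 0.0, -1.0], 0.05, size=(200, 3))
print(resample_to_fixed_rate(t, acc).shape)   # (duration * 4000, 3)
```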

The WiiMote was strapped to a cap for the subject to wear, and the microphone stood in front of the subject. The location was a working room with a desk and computer. Behind the desk was a window overlooking a small square, but the monitor obscured the view of any person walking by. The subject was placed on a chair in front of a computer screen and instructed to do everyday computing tasks like checking email and browsing the web. Data recording took place over a period of about an hour with some background sound.

At different moments an event took place in the square: a person walked by, a trolley was pushed along and a car passed, all familiar to the participant.

3.3 Results

The data was analyzed by visual inspection of the raw data signals. Since all signals were recorded as audio signals, an audio editor could be used to align the different channels and compare them.

It was found that head-movement up to a certain acceleration was not detected, except for head-movement that resulted in a shift of the Z (gravity) component. Because of the type of accelerometer (MEMS), gravity always shows up in the signals as a constant acceleration of about -1 g, divided over the X, Y and Z components depending on the orientation of the sensor.
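
To illustrate how this constant gravity component splits over the axes (my own sketch, not part of the pilot software), the orientation of a static sensor can be estimated from a single accelerometer sample:

```python
import numpy as np

def tilt_angles(acc):
    """Estimate pitch and roll (in degrees) from a static accelerometer sample.
    acc: (x, y, z) in g; at rest the vector has magnitude ~1 g, distributed
    over the axes according to the orientation of the sensor."""
    x, y, z = acc
    pitch = np.degrees(np.arctan2(x, np.hypot(y, z)))
    roll = np.degrees(np.arctan2(y, np.hypot(x, z)))
    return pitch, roll

print(tilt_angles((0.0, 0.0, -1.0)))    # level: (0.0, 0.0)
print(tilt_angles((0.5, 0.0, -0.87)))   # pitched forward by about 30 degrees
```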

It turned out that there was not enough evidence to suggest a relation of head-movements to most sound events, except for the participant sneezing. This was to some extent due to the lack of sensitivity of the microphone, which resulted in certain audible events not being heard on the recording even though they did get the attention of the subject. With the sneezing, the head shifts and a sound is produced, both of which were registered by the pilot system. That is head-movement which can be related to sound, albeit self-induced sound.

The aforementioned trolley cart got the attention of the participant but evoked no head-movement, probably because of the participant's background knowledge of the surroundings.

Figure 2: Early measurements from the pilot experiment. (a) Cyclic movement interrupted by suddenly looking towards the left and right. (b) Looking from left to right with increasing speed.

Thus, the answer to the sensitivity question is that accelerometers are not suitable for the most subtle of movements, but sounds that evoke head-movement will result in an orienting response with a more explicit head-movement. Second, the question of relating some head-movement to sound can be confirmed at least in the case of sounds originating from the subject. Environmental sounds, like human activity, do elicit some orienting response, but it is less measurable when the source location is already in sight.


4 Realization and building of a Proof of Concept

4.1 Goal

As mentioned in the introduction, the idea is that changes in head-movement can be related to sound events. These changes might provide information on the processes of attention. The pilot experiment already showed that head-movements could be measured with accelerometers and that, for instance, a subject sneezing could be detected. My goal is to develop a method of measuring head-movements resulting from environmental sounds, in order to assess this claim. Therefore a demonstration device needed to be built: a Proof of Concept (PoC).

To fulfill this task the PoC needed to meet some requirements. It should be able to measure and record head-movements and sound and correlate these signals, either in real time or offline, to determine sections of sound that might have induced head-movement. The device should be usable in lab settings but also in real life, to gather realistic measurements. Thus it should be robust, be comfortable for the user and have a long battery life. Comfort means that freedom of movement is not obstructed and that the user does not notice wearing the device after a few minutes. The device of the pilot experiment, for instance, was too heavy to be unnoticeable.

Robustness means that the device should withstand the stresses of everyday use and transport.

Everyday use encompasses not only lab use but also use in a general setting by a novice user. With all these requirements the device should be usable in a lab but also in a user's home situation, or for instance during a walk in the park.

With these properties in mind I set out to design and build the PoC. In the following subsections I first present the global design of the system. Next the hardware part is presented, followed by the firmware of that hardware. Finally the data processing software is described, together with the methods used. To conclude, some results are discussed; for an elaborate evaluation see section 5.

4.2 Devices

The design requirements mentioned above suggest the use of microphones and, following the pilot experiment, accelerometers for movement. Gyroscopes were considered but not used because most of the research reviewed uses accelerometers. At first I looked for ready-made solutions for recording (head-)movement and audio; for an overview see table 1. The combination of (head-)movement and audio recording is not made, but there are systems, like the XBus Kit from XSens, that are geared towards motion recording, usually for gait analysis. These systems, however, are very expensive, cumbersome to set up or seem uncomfortable to wear, thus other solutions were sought.

Another system that might be relevant is the "Hoorbril" (hearing glasses) developed at Delft University, these days commercially available as a product. The Hoorbril has microphone arrays on a glasses frame to aid people with a hearing impairment. In its commercial form, however, it is hard to add extra electronics because of the design.

Looking further for existing devices I came across glasses with embedded cameras and microphones, so-called 'spy-glasses'. These incorporate a video camera, microphones, batteries and a digital recorder, all in the frame of the glasses. Sold as gadgets, these seemed a good platform to add extra sensors to because they already record audio.

Other options for a frame located at a human head might be a hat or cap, a headband, glasses or headphones. These are generally available, but custom electronics would need to be designed in every case to fit in or on such frames. Looking at the size of the frame of the spy-glasses (see figure 6) this seems perfectly doable but out of the scope of this project.


The sensors as mentioned are to be a microphone and an accelerometer. A single microphone could be used, but I opted for a binaural setup to give some possible extra spatial information.

One can create one's own set of binaural microphones from two identical microphones, or buy an existing set. Available sets are the Core Sound Binaural Microphones, the Sound Professionals Condenser Binaural Microphones and the Roland CS-10EM. A binaural set of mics from Core Sound was available from the Auditory Cognition Group of the University of Groningen, and was thus used. These binaural mics are a matched pair of omni-directional condenser mics with a frequency response that is flat within 1 dB for frequencies from 20 Hz to 20 kHz.

Because I opted not to use an existing motion recording set, another form of motion capture was needed; as the pilot experiment confirmed, an accelerometer is sensitive enough to capture the required motion. Numerous accelerometers are available, with ranges up to 16 g and sensitivities better than 0.02 g (at a 2 g range). The company Sparkfun supplies electronics to hobbyists and professionals alike, and offers several accelerometers on breakout boards. These breakout boards give easy soldering access to the chip. Examples of such chips are the Freescale MMA7260QT and the Analog Devices ADXL345. The latter was chosen because of the small size of the board (14x21 mm), digital output and high resolution (13 bit, 4 mg/LSB).
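To illustrate what this resolution means in practice, the sketch below converts a raw 13-bit ADXL345 reading to acceleration in g and m/s². It is a minimal sketch only: the 4 mg/LSB scale factor is the figure quoted above, and the class and method names are hypothetical.

    // Minimal sketch (assumption: full-resolution mode, 4 mg per least-significant bit).
    public final class AdxlScale {
        private static final double MG_PER_LSB = 4.0;      // resolution quoted above
        private static final double G_TO_MS2   = 9.80665;  // standard gravity

        /** Convert a signed 13-bit raw sample to acceleration in g. */
        public static double rawToG(int raw) {
            return raw * MG_PER_LSB / 1000.0;
        }

        /** Convert a signed 13-bit raw sample to acceleration in m/s^2. */
        public static double rawToMs2(int raw) {
            return rawToG(raw) * G_TO_MS2;
        }

        public static void main(String[] args) {
            // A raw value of 256 corresponds to 256 * 4 mg = 1.024 g, roughly 10.0 m/s^2.
            System.out.printf("256 LSB = %.3f g = %.2f m/s^2%n", rawToG(256), rawToMs2(256));
        }
    }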

Data from the sensors has to be recorded for processing. There are numerous options for data recorders, such as a ready-made PDA, something more geared towards development like a Beagle Board, or an Arduino. The Sun Small Programmable Object Technology (SunSPOT) was chosen as data recorder because of the know-how and devices available at INCAS3. It can be programmed in Java and has numerous libraries for hardware devices.

However, it soon became clear that the audio recording abilities of the SunSPOT were not sufficient because of low throughput and the slow acquisition speed of its ADCs. An available breakout board (eDAQ) provides fast ADCs but still suffers from the low throughput, thus I opted to store only accelerometer data on an SD card with a SunSPOT.

Audio data might be recorded by the spy-glasses or by an external system. After ordering a spy-glasses set I found that the audio quality of its recordings was too poor to be of any use: with a sample rate of 8 kHz and a bit depth of 8 bits the signal-to-noise ratio was too low. Thus an external system had to be used. There are numerous mobile recorders on the market: small devices specifically designed for recording audio, including microphone pre-amplifiers, optionally with phantom power or even with embedded microphones. Examples of such devices are the Korg MR-1, the Alesis ProTrack, the M-Audio MicroTrack II and the Tascam DR-100. The MicroTrack II was available from the Auditory Cognition Group at the University of Groningen, so I opted for that device. This recorder is capable of recording in 24 bit at a maximum sample rate of 96 kHz.

Summarizing (see also table 1), we will use the spy-glasses frame as a basis, a set of binaural microphones, an accelerometer, a MicroTrack II audio recorder and a SunSPOT to record acceleration data. Software will analyze the gathered data off-line to match the head-movements to sound events.

The hardware is further described in section 4.2.1. The analysis of the gathered data is done offline in software, for which MATLAB seemed an obvious choice; the analysis process is discussed in section 4.3. Data from the different recorders is manually synchronized by comparing the signals produced by knocking against the casing of the accelerometer (which produces distinct acceleration and sound signals).
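Although the alignment was done by hand, the knock can in principle also be located programmatically as the largest-magnitude sample in each recording around the knock. The sketch below shows this idea under the assumption that each stream contains exactly one prominent knock; it is an illustration, not part of the actual analysis pipeline.

    // Sketch: estimate the time offset between the audio and acceleration recordings
    // from the knock against the accelerometer casing. Assumes one dominant peak per stream.
    public final class KnockSync {
        /** Return the time (in seconds) of the largest-magnitude sample. */
        static double peakTime(double[] signal, double sampleRate) {
            int peak = 0;
            for (int i = 1; i < signal.length; i++) {
                if (Math.abs(signal[i]) > Math.abs(signal[peak])) peak = i;
            }
            return peak / sampleRate;
        }

        /** Offset to add to accelerometer timestamps to align them with the audio. */
        public static double offsetSeconds(double[] audio, double audioRate,
                                           double[] accelMagnitude, double accelRate) {
            return peakTime(audio, audioRate) - peakTime(accelMagnitude, accelRate);
        }
    }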

4.2.1 Hardware

Creating the hardware is basically electrical engineering. However, to gather actual data it is an essential part of the project. In the following paragraphs I will give a rough overview of the hardware design and the problems I ran into. An overview of the settings used for the devices can be seen in table 2.

Product          XBus Kit from XSens                  System for full body motion capture
Product          Hoorbril from TU Delft               Designed for sound localization
Product          Spy-glasses (frame)*                 A gadget with some decent electronics recording audio and video
Accelerometer    Analog Devices ADXL345               Small size, high accuracy
Accelerometer    Freescale MMA7260QT                  Larger, high accuracy
Binaural mic     Core Sound Binaural Microphones*     Omni-directional, condenser
Binaural mic     Sound Professionals Binaural Mics    Omni-directional, condenser
Binaural mic     Roland CS-10EM                       Condenser
Data recorder    Arduino                              Easy prototyping, available libraries
Data recorder    Beagle Board                         Laptop-performance development board
Data recorder    PDA                                  General purpose computer
Data recorder    SunSPOT*                             Easy prototyping, available libraries
Audio recorder   Korg MR-1                            192 kHz @ 24-bit
Audio recorder   Alesis ProTrack                      44 kHz @ 16-bit
Audio recorder   M-Audio MicroTrack II*               96 kHz @ 24-bit
Audio recorder   Tascam DR-100                        96 kHz @ 24-bit

Table 1: Overview of available and used hardware. *) hardware used in this project

The SunSPOT is an open-source development platform from Sun Microsystems (now Oracle). It has a 180 MHz ARM processor with 16 kB of data cache and 16 kB of instruction cache, running a Java Virtual Machine port called 'Squawk'. Next to the on-chip cache there are 4 MB of flash memory and 512 kB of RAM. Peripherals include USB, wireless networking, Serial Peripheral Interface (SPI) and Inter-Integrated Circuit (I2C) controllers, and the platform allows for user-developed extension boards to provide extra functionality. With this platform and its hardware abstraction in Java it is easy to access and control attached sensors. Included with the standard package is the eDemoBoard, which includes an accelerometer, some LEDs and pushbuttons.

There are extension boards that provide high-quality ADCs capable of sample rates up to 22 kHz. However, the SunSPOT firmware can only handle communication with external boards at frequencies up to 8 kHz, due to hardware and software limitations.

Thus the choice was made to record only the accelerometer data using a SunSPOT. The amount of data that can be stored on the SunSPOT itself is limited to a few kilobytes, so external storage is needed.

This external storage can easily be accomplished using a microSD card reader/writer extension board from DOSonChip, which provides an easy interface to a FAT32 file system on an SD card. An extension board was made with a connector for the accelerometer and the microSD interface chip; both devices can be interfaced using the SPI protocol. Since the SunSPOT could not provide the required sample rate and precision for audio, the M-Audio MicroTrack described above was used.

To improve robustness (and also looks) I designed custom casings with Blender for the SunSPOT with its extension board and for the accelerometer mount on the glasses. At first the accelerometer was attached to the glasses with duct tape and the SunSPOT's circuit boards were exposed. Since the SunSPOT is an open-source project, the design files for the original casing were available. Taking these files as a basis, the top part was enlarged to fit the extension board with the accelerometer connector and the DOSonChip micro-SD interface. The bottom part was modified to provide a mechanism to hold the straps that attach the SunSPOT to one's body. The result can be seen in figure 3(a). Designing the accelerometer holder, seen in figure 3(b), to attach to the glasses was more of a challenge since no design files were available. The best option to create a holder without having to re-design the whole casing was to replace one of the lids for the electronics. In the absence of a 3D scanner, a 2D scanner was used to give an outline, and the curvatures were estimated by hand and measurements. As can be seen in figure 6 the result is pleasing to look at as well as robust.

Figure 3: 3D designs of SunSPOT and accelerometer enclosures. (a) SunSPOT enclosure; (b) Spy-glasses replacement lid.

One of the major issues I ran into was the cable length from the accelerometer to the recorder.

Due to this length the signals on one wire of the cable were influencing the signals on other wires as a result of electromagnetic coupling (crosstalk). Especially when dealing with digital signals, having the input on one data line influence the output on another data line is unacceptable. To solve this there are two options: either shorten the cable or make sure that the wires that are affected are not data lines. Since the first was not an option, the latter was chosen. Since there are three lines with high-frequency digital signals in SPI (Clock, DataIn and DataOut) and three lines with a static or lower-frequency signal (Ground, V+ and ChipSelect), the easiest solution was to interleave the signal lines with the static lines. This interleaving solved the crosstalk problem and allowed for the current cable length of about half a meter.

The recording settings of the devices can be seen in table 2. They are common in similar research as seen in section 2.2.1.

Audio recording                48 kHz    24 bit
Accelerometer recording        50 Hz     13 bit
Accelerometer range            ±2 g      13 bit
SunSPOT SPI communication      1 MHz

Table 2: Overview of used settings

4.2.2 Firmware

Developing the firmware for the SunSPOT involved implementing the protocols with which the accelerometer and SD interface talk over SPI.

The base of the accelerometer implementation came from an existing implementation for another accelerometer in the SunSPOT package. However, that accelerometer was read through analog inputs, while the ADXL345 delivers digital output, so all the analog-to-digital conversion code could be removed.
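As an illustration of what reading the ADXL345 over SPI involves, the sketch below shows the register-level steps. The register addresses come from the ADXL345 datasheet; the SpiBus helper is a hypothetical stand-in for the actual SunSPOT SPI classes, which are not shown here.

    // Minimal sketch of reading the ADXL345 over SPI.
    public final class Adxl345Reader {
        /** Hypothetical SPI abstraction standing in for the SunSPOT SPI classes. */
        public interface SpiBus { void transfer(byte[] out, byte[] in); }

        private static final int READ  = 0x80;  // read bit
        private static final int MULTI = 0x40;  // multi-byte (auto-increment) bit
        private static final int POWER_CTL   = 0x2D;
        private static final int DATA_FORMAT = 0x31;
        private static final int DATAX0      = 0x32;  // X0, X1, Y0, Y1, Z0, Z1 follow consecutively

        private final SpiBus spi;

        public Adxl345Reader(SpiBus spi) {
            this.spi = spi;
            spi.transfer(new byte[]{(byte) DATA_FORMAT, 0x08}, null); // full resolution, +/-2 g range
            spi.transfer(new byte[]{(byte) POWER_CTL,   0x08}, null); // enable measurement mode
        }

        /** Read one sample; returns {x, y, z} as signed values in LSB units. */
        public int[] readSample() {
            byte[] out = new byte[7];
            byte[] in  = new byte[7];
            out[0] = (byte) (READ | MULTI | DATAX0);   // burst-read the six data registers
            spi.transfer(out, in);
            return new int[] {
                (short) ((in[2] << 8) | (in[1] & 0xFF)),  // low byte first, sign preserved via short
                (short) ((in[4] << 8) | (in[3] & 0xFF)),
                (short) ((in[6] << 8) | (in[5] & 0xFF)),
            };
        }
    }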

The DOSonChip device provides a way to read and store data on a microSD card formatted with the FAT16 or FAT32 filesystem. It can handle a maximum of four open files, read directory contents and properties, and read and write files. The firmware needed to manage the file handles and enforce the maximum of four open handles. For file access an interface was made similar to C stdio, with fopen, fclose, fread and fwrite functions.
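A minimal sketch of what such a stdio-like wrapper could look like in Java is given below. The class and method names are illustrative rather than the actual firmware API, the commented-out calls stand in for the real DOSonChip command sequences, and fread is omitted for brevity.

    // Illustrative sketch of a C-stdio-like file interface on top of the DOSonChip commands.
    import java.io.IOException;

    public final class SdFileSystem {
        public static final int MAX_OPEN_FILES = 4;           // DOSonChip limit mentioned above
        private final boolean[] inUse = new boolean[MAX_OPEN_FILES];

        /** Open a file and return a handle (0..3), mimicking fopen. */
        public synchronized int fopen(String path, String mode) throws IOException {
            for (int h = 0; h < MAX_OPEN_FILES; h++) {
                if (!inUse[h]) {
                    inUse[h] = true;
                    // sendOpenCommand(h, path, mode);        // real DOSonChip command would go here
                    return h;
                }
            }
            throw new IOException("No free file handles (maximum of " + MAX_OPEN_FILES + ")");
        }

        /** Write len bytes from buf to the file behind handle, mimicking fwrite. */
        public synchronized int fwrite(int handle, byte[] buf, int len) throws IOException {
            checkHandle(handle);
            // return sendWriteCommand(handle, buf, len);
            return len;
        }

        /** Close the file and release its handle, mimicking fclose. */
        public synchronized void fclose(int handle) throws IOException {
            checkHandle(handle);
            // sendCloseCommand(handle);
            inUse[handle] = false;
        }

        private void checkHandle(int handle) throws IOException {
            if (handle < 0 || handle >= MAX_OPEN_FILES || !inUse[handle]) {
                throw new IOException("Invalid or closed file handle: " + handle);
            }
        }
    }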

The main program uses these classes to provide the needed functionality. To take measurements at 50 Hz, the program uses the timer class PeriodicTask from the SunSPOT Java library; its handler stores the data from the accelerometer together with a timestamp in a circular buffer. The main while loop in the other thread checks for new data in that circular buffer and writes it to a file on the SD card. Care had to be taken that the functions were thread-safe, because the SPI bus only allows one device at a time to perform a command/read sequence. With two threads, the one writing data to the SD card should not interrupt the reading of data from the accelerometer, because the latter has higher priority.
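A simplified sketch of this producer/consumer arrangement is shown below, using a plain java.util.Timer at 50 Hz and a synchronized circular buffer. It is a sketch only: the actual firmware uses the SunSPOT timer and SPI classes, and the dummy sample and the println stand in for the SPI read and the SD-card fwrite described above.

    // Sketch of the 50 Hz sampling loop with a circular buffer between the sampler
    // and the SD-card writer thread. Timer and buffer sizes are illustrative.
    import java.util.Timer;
    import java.util.TimerTask;

    public final class Recorder {
        private static final int SAMPLE_PERIOD_MS = 20;   // 50 Hz
        private static final int BUFFER_SIZE = 256;       // samples

        private final long[] timestamps = new long[BUFFER_SIZE];
        private final int[][] samples = new int[BUFFER_SIZE][3];
        private int head = 0, count = 0;                   // guarded by 'this'

        /** Producer: called every 20 ms, stores one accelerometer sample. */
        synchronized void put(long time, int[] xyz) {
            int tail = (head + count) % BUFFER_SIZE;
            timestamps[tail] = time;
            samples[tail] = xyz;
            if (count < BUFFER_SIZE) count++; else head = (head + 1) % BUFFER_SIZE; // overwrite oldest
        }

        /** Consumer: returns the oldest sample as {time, x, y, z}, or null when empty. */
        synchronized long[] take() {
            if (count == 0) return null;
            long[] record = {timestamps[head], samples[head][0], samples[head][1], samples[head][2]};
            head = (head + 1) % BUFFER_SIZE;
            count--;
            return record;
        }

        public static void main(String[] args) {
            final Recorder rec = new Recorder();
            new Timer(true).scheduleAtFixedRate(new TimerTask() {
                public void run() {
                    // In the firmware this would be the accelerometer read over SPI.
                    rec.put(System.currentTimeMillis(), new int[]{0, 0, 256});
                }
            }, 0, SAMPLE_PERIOD_MS);

            while (true) {                                  // writer loop (main thread)
                long[] record = rec.take();
                if (record != null) {
                    // In the firmware this record would be written to the SD card via fwrite().
                    System.out.println(record[0] + "," + record[1] + "," + record[2] + "," + record[3]);
                } else {
                    try { Thread.sleep(5); } catch (InterruptedException e) { return; }
                }
            }
        }
    }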

4.3 Methods

The most important part of the device is the data analysis. This part is done offline, to allow repeated runs of the algorithm on the same data set. The difficulty here was twofold: how to combine signals from different sensors which have very different characteristics, and how to classify changes in movement. First I will discuss what signal processing occurs, followed by a discussion on how these signals are then classified and finally how the signals are combined.

The raw accelerometer signals are, like audio, a 1D array of samples of the acceleration values along three axes, as seen in figure 5(a). For each axis we can measure quantities like the average, the standard deviation (or RMS), and, using a transformation to the frequency domain, we can see movement cycles, like the up-down movement during walking. The most common features mentioned in the papers from section 2 are:

Frequency Spectrum of the different signals. The detail depends on the length of the data analyzed, the window.

Average of the individual signals per axis.

Standard Deviation of the average of the signals.

Energy, the sum of the squared frequency spectrum.

Correlation between the different signals.

In almost all cases these features are determined for a certain part of the signal in time, the window. This is because most measures of the signals cannot be computed continuously, for instance the (discrete) Fourier Transform. The length of the window (|window|) is kept variable to be able to experiment with the optimal setting. There is, however, a trade-off between precision in the frequency domain and precision in the time domain. To improve the frequency-domain precision without sacrificing the time domain, windows are overlapped, from a few percent up to as much as 90%; common, however, is about 50% overlap.

The features I have chosen are based on the features commonly used in the papers in section 2.2.1, as described above (Van Laerhoven (2001); Kim et al. (2007); Ravi et al. (2005); Ward et al. (2005); Kunze et al. (2005), etc.). Some extra features were added because they are common in signal correlation analysis.

Thus, the resulting number of features is 33+3×|window|/2, where the number of frequency components depends on the length of the window (|window|). The window length is kept variable to find an optimum trade-off between the frequency precision and the time-precision.

The default window length is 1.28 seconds, resulting in an accelerometer data window length of 64 samples, but more on the effects of window length can be found in section 4.4. An example of extracted features can be seen in figure 5(b).
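To make the windowing concrete: with the default 64-sample window the formula above yields 33 + 3 × 64/2 = 129 features per window, and at 50 Hz the 50% overlap gives a hop of 32 samples (0.64 seconds). The sketch below illustrates the general mechanism of sliding overlapping windows over one axis and computing a few of the listed features (mean, standard deviation and spectral energy). It illustrates the approach only, not the exact feature set of the analysis software, and the naive DFT stands in for a proper FFT.

    // Illustrative windowed feature extraction for one accelerometer axis.
    // Window length and 50% overlap follow the defaults mentioned above.
    public final class WindowFeatures {
        /** Compute {mean, standard deviation, spectral energy} for each 50%-overlapping window. */
        public static double[][] extract(double[] signal, int windowLength) {
            int hop = windowLength / 2;                       // 50% overlap
            int nWindows = (signal.length - windowLength) / hop + 1;
            double[][] features = new double[nWindows][3];
            for (int w = 0; w < nWindows; w++) {
                int start = w * hop;
                double mean = 0, var = 0;
                for (int i = 0; i < windowLength; i++) mean += signal[start + i];
                mean /= windowLength;
                for (int i = 0; i < windowLength; i++) {
                    double d = signal[start + i] - mean;
                    var += d * d;
                }
                double std = Math.sqrt(var / windowLength);
                // Spectral energy: sum of squared DFT magnitudes (an FFT would be used in practice).
                double energy = 0;
                for (int k = 0; k < windowLength / 2; k++) {
                    double re = 0, im = 0;
                    for (int n = 0; n < windowLength; n++) {
                        double angle = -2 * Math.PI * k * n / windowLength;
                        re += signal[start + n] * Math.cos(angle);
                        im += signal[start + n] * Math.sin(angle);
                    }
                    energy += re * re + im * im;
                }
                features[w] = new double[]{mean, std, energy};
            }
            return features;
        }
    }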

The audio signal is also a one-dimensional signal, but with a higher samplerate. For this project we were not (yet) interested in exact spatial location of the sound, but location (left,
