
Writing with a Smartwatch: Character Classification based on

Motion Sensors

SUBMITTED IN PARTIAL FULFILMENT FOR THE DEGREE OF MASTER OF SCIENCE

Bastiaan Waanders

10247742

MASTER INFORMATION STUDIES

HUMAN CENTRED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

July 1, 2017

1st Supervisor: T. Mensink (UvA)

2nd Supervisor: F. Nack (UvA)


Writing with a Smartwatch: Character Classification based on

Motion Sensors

Bastiaan Waanders

University of Amsterdam
bastiaanwaanders@gmail.com

ABSTRACT

This research tests the potential of motion-based character classification to support text entry on a smartwatch. We evaluate the performance of SVM-classifiers trained on the same characters written in different sizes. We present our findings on the performance of different kernels and evaluate which selection of features is best suited to classify characters correctly. Finally, we present three heatmaps based on the changing probability distribution over time for each character as a word is written. While the accuracy of the trained models is not yet high enough for a real-world application, this initial research on a new dataset presents interesting findings about how people write in practice, and shows that a standard SVM-classifier, without any optimization of the hyperparameters, already performs better than a random guess.

KEYWORDS

Motion sensors; Smartwatch interaction; Character classification

1 INTRODUCTION

In their daily lives, people receive numerous messages via various communication channels, most of which are also present on their phones. This results in a constant buzz of messages even when we shut off communication channels on our computers. With the introduction of smartwatches we have created an even more present device, prominently visible in our eyesight, that makes us increasingly aware of the constant stream of messages to respond to. However, during our daily routines responding to each message quickly is not always possible, as it likely interrupts current activities. The smartwatch itself lacks a workable keyboard, which forces us to open up a separate device to respond to a message.

Current smartwatches function as an extension of a person's mobile phone. Paired with a phone over a Bluetooth connection, smartwatches can inform users about (incoming) information on their mobile phone, e.g. incoming emails or text messages, and also support interaction with active applications running on the phone, e.g. controlling the volume while playing music. While these features serve as a robust extension of the interaction between applications on a phone and a smartwatch, interaction with the watch itself leaves room for improvement.

The two most widely supported input modalities for smartwatches are touch and voice interaction. As most smartwatches have a touchscreen, users can interact with applications through it. However, due to the placement of the watch on a person's wrist, touch interaction always demands both hands to fulfil an action. Furthermore, touch interaction is affected by the fat finger problem [6]: due to the small screen size of a smartwatch, the screen is easily occluded by a person's finger, potentially resulting in

Figure 1: Occlusion of the screen and noise during voice input are two common problems for smartwatch interaction.

inaccurate touch interaction. The voice capabilities of smartwatches differ per Operating System (OS). The Android Wear OS offers two types of voice actions: system-provided and app-provided. The first are built into the Android Wear OS, e.g. 'Set an alarm' or 'Take a note', while app-provided voice inputs are specified by the application, e.g. 'Next song'. Voice input relies on complex speech-to-text recognition systems, and even though such systems exist [2] and have high accuracy, they are not always available, as in certain cases an internet connection is needed for the speech-to-text translation. Another drawback is that voice commands are not appropriate in every situation, and voice interaction has been shown to be less accurate in noisy environments.

As current input modalities show some difficulties during practical use, see Figure 1, this research studies the potential of using the motion sensors integrated in a smartwatch to support text entry on smartwatches. Potential advantages of this system would be that it tackles the fat finger problem, only one hand would be required for interaction, the performance of the application is not affected by noise from its surroundings, and lastly it could be a more appropriate method of entering text than speaking loudly to a smartwatch. To test the feasibility of such an application, the following research questions are addressed in this paper:

• Does the size of a character affect the accuracy regarding correct classification of characters?

• Based on other research, which selection of features works well to correctly classify the characters in the collected dataset?

• How well will the trained models work when used in practice to classify the individual characters when a word is written?

The remaining sections of this paper are structured as follows: Section 2 discusses similar research done in the area of character recognition in general, and specifically related research on motion data. Section 3 describes how and why the data is collected. Section 4 discusses why a machine learning approach is selected for the experiments and how the experiments are set up. Section 5 presents and discusses the results of the experiments. Finally, Sections 6 and 7 present our conclusions, discussion & future work.

2 RELATED WORK

2.1 Optical Character Recognition

Text and character recognition has been studied extensively [5], with a main focus on character recognition from images, books, and a variety of documents. Most of these applications use Optical Character Recognition (OCR) techniques, which convert written or typed text into a digital version so that it can be easily searched, edited, and shared with others. OCR techniques are well suited for the classification of image data. However, when we tried to translate the recorded x- and y-axis values of the smartwatch sensor signal into an image, the result was not an unambiguous image of the recorded character. Therefore, we consider OCR techniques less suitable for this research.

2.2 Research using custom hardware

A number of research papers have been published that use motion data to classify characters or gestures. Amma et al. [1] presented a custom-made glove with an accelerometer and gyroscope placed on the back of the hand. Based on their collected data they tried to predict words written by a person. They used a two-step approach: first, an SVM-classifier identified time periods in the data stream during which a person was actually writing text; in the second stage, Hidden Markov Models (HMMs) are used to generate a text representation from the motion data. The individual characters are modelled by HMMs and concatenated into word models.

Xu et al. [13] used a shimmer device2 for the collection of motion data. They used three shimmer devices, placed on the arm, wrist, and finger of a participant in their study. They successfully used the recorded data of the shimmer devices to classify 37 gestures with an accuracy of 98% and the 26 characters of the alphabet with 95% accuracy. Using several features they tested three different classification methods: Naive Bayes, Logistic Regression,

2http://www.shimmersensing.com/

and Decision Tree. During the experiment for finger writing, participants had to rest their hand on an armchair and write characters in the air with their finger.

Vikram et al. [12] developed an application for handwriting and gesture recognition based on the motion of hands in the air. They made use of a Leap Motion controller3 to collect data and, using the Dynamic Time Warping (DTW) algorithm [8], classified single characters as well as full words.

As custom hardware is not commercially available and hence not suitable for the development of applications for daily use, we looked into the current state of research in which commercially available smartwatches are used for similar applications.

2.3 Research using commercial smartwatches

Arduser et al. [3] used a smartwatch to collect motion data while participants wrote characters on a whiteboard. Their classification system implemented three stages:

(1) Feature vectors were created from the sensor recordings. They created three types of feature vectors:

(a) One based on the gyroscope and linear acceleration data relative to the whiteboard.

(b) One with the linear acceleration and gyroscope data as sensor coordinates.

(c) And one with the linear acceleration and gyroscope data provided by Android within the sensor coordinates.

(2) The feature vectors are compared using the DTW algorithm.

(3) A ranking was created based on the similarity of the input data and the list of all the training samples of the characters.

In addition to this they also recorded the audio when people wrote the words on the board; this was done to classify more accurately when a person was writing on the board. This data improved the recognition of when someone was writing; it could also improve segmentation to set the window size for the DTW algorithm.
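The DTW comparison in stage (2) can be illustrated with a minimal one-dimensional dynamic-programming sketch; real implementations compare multi-axis feature sequences and typically add windowing constraints, so treat this as an illustration only.

```python
import numpy as np

def dtw_distance(s, t):
    """Classic O(len(s) * len(t)) dynamic-programming DTW distance
    between two 1-D sequences, using absolute difference as local cost."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of the three possible alignment paths
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A ranking as in stage (3) then follows by computing `dtw_distance` between the input and every training sample and sorting by distance.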

Two other recent studies, although not specifically focused on character recognition, are the ViBand project by Laput et al. [7] and the Float project by Sun et al. [10]. Laput et al. made a custom smartwatch kernel to boost the sampling rate of the smartwatch's accelerometer to 4000 Hz, whereas the standard frequency of smartwatches is between 100 and 200 Hz. This makes the system capable of capturing so-called bio-acoustic signals. They tested their system for two different categories: gesture recognition for a total of 17 gestures, and object recognition for a total of 29 objects. They used an SMO implementation [9] of an SVM-classifier for the classification of the gestures and objects. For each data stream the following features were computed: mean, standard deviation, sum, max, min, centroid, number of peaks, the power spectrum, 1st derivative, and band ratios.

The second study, conducted by Sun et al., supports one-handed, touch-free target selection based on the motion from the accelerometer and the orientation from the gyroscope. To detect the tap and current tilt level they combined the motion sensor signals with the photoplethysmogram signal of the heart rate sensor, and used DTW and k-Nearest Neighbor for classification.

3https://www.leapmotion.com/


Figure 2: The different sized boxes. The top 3 boxes are used for writing the characters; the lower bar is used for the words.

3 DATA COLLECTION

To our knowledge, a dataset containing motion data collected while people write characters on a flat surface, seated and wearing a smartwatch, does not exist. For this reason a prototype application was built and an experiment conducted to collect the data needed for the classification of the characters. A total of 40 individuals participated in the experiment, resulting in a total of 1800 samples for the characters. The dataset also includes 120 written words. The only restriction to participate in the experiment was that a participant writes with their right hand in daily life.

3.1 Prototype Application

To collect the motion data a prototype application was programmed for the Android Wear platform. The smartwatch is a Motorola 360 Sport4,5 containing a 6-axis Inertial Measurement Unit (IMU) made by Invensense6. During the experiment people write 5 different characters, namely an 'A', 'O', 'B', 'T', and 'R'. This is repeated three times for three different sized boxes. The boxes are 5 by 5, 7.5 by 7.5, and 10 by 10 centimeters. This results in 45 samples per participant.

To maximise the number of samples per character we selected only 5 characters. The specific characters were selected arbitrarily; however, we selected 2 vowels and 3 consonants to be able to form 3 words. As knowledge about the minimal size required to correctly classify a character is absent, we selected three different sizes. We based these on the observation of almost no movement of the wrist (5 centimeters) and generous movement of the wrist when writing a character (10 centimeters); 7.5 centimeters was included as a middle size between almost

4https://www.motorola.com/us/products/moto-360-sport
5Android OS version 6.0.1

6https://www.invensense.com/products/motion-tracking/6-axis/mpu-6050/

no movement and generous movement. Data for the following three words is collected: 'BAT', 'BOAT', and 'BAR'. Each word is written once; words are written in a box with a height of 7.5 centimeters, stretched over the full length of an A4-sized paper. During the experiment participants sat down behind a desk or table and the paper with the boxes was placed in front of them. Participants were asked to 'fingerpaint' the character with their index finger inside each different sized box. Participants wore a smartwatch on their right wrist. The prototype application has the following structure:

(1) The application starts with a welcome screen, presenting an introduction text and a button to start the experiment, see Figure 3.1. Before starting the experiment we informed participants about the possibility that the collected data might be published on the internet, anonymously. Furthermore, participants were told about the nature of the study and how they were expected to proceed.

(2) Before a character or word is presented to a participant, a countdown of three seconds is displayed. This was included to give participants some extra time to prepare for writing the next character or word.

(3) Figure 3.2 shows the character to a participant. The words are presented to the participants in a similar font size.

(4) After finishing writing a character or word, Figure 3.3 is presented to the participant. Here participants have the option to start with the next character, or redo the previous one. The last option is included in case participants were unable to finish their character, or a sudden distraction prevented them from writing the character sufficiently.

(5) After completion of the experiment, a text message was displayed thanking the participant for taking part in the study, see Figure 3.4.

During the experiment we collected the data of the following sensors in the smartwatch: accelerometer, linear accelerometer, and gyroscope. Participants had 5 seconds to write a character, and for a word the window was set to 15 seconds. Participants were requested to hold their hand and arm as still as possible in the end position after completing a character or word. We made this request to make it easier to filter out non-relevant data from the signal during the remaining recording time after participants had already completed writing. The watch has no way to connect directly to a computer; therefore the data collected by the prototype application is transferred over Bluetooth to a phone that is connected to a computer.

3.2 Dataset

The data for each character is saved in a separate JSON7 file. The JSON object has the following attributes:

• A unique ID (SessionID)

• A timestamp (SessionTimestamp)

• The character, as in the ground-truth character (Character)
• The respective size (Size)

• The sensor type (SensorName). This is an object with an ordered list where each entry has the following structure:

7http://www.json.org/


Figure 3: Screenshots of the different screens in the prototype application. Images are from the application running in an emulator.

– A timestamp, in milliseconds after the first recording (Timestamp)
– X-value (X)
– Y-value (Y)
– Z-value (Z)
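To make the schema concrete, a record could look roughly as follows; the field values and the exact key for the sensor list ("Accelerometer") are invented for illustration, as no raw file is shown in the paper.

```python
import json

# Hypothetical example record following the attributes listed above.
# All values and the sensor key name ("Accelerometer") are assumptions.
raw = """
{
  "SessionID": "session-001",
  "SessionTimestamp": 1498896000000,
  "Character": "A",
  "Size": 10,
  "Accelerometer": [
    {"Timestamp": 0,  "X": 0.012, "Y": -0.034, "Z": 9.81},
    {"Timestamp": 20, "X": 0.018, "Y": -0.029, "Z": 9.79}
  ]
}
"""

record = json.loads(raw)

# Rebuild the per-axis time series used later for feature extraction.
xs = [sample["X"] for sample in record["Accelerometer"]]
ys = [sample["Y"] for sample in record["Accelerometer"]]
```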

4 METHODS

The following section discusses the choice of classification method, which features are selected, and which experiments are performed to answer the research questions stated in Section 1, plus two more experiments for general purposes.

4.1 Model selection

There are several methods for character classification, most notably a machine learning approach or Dynamic Time Warping. As the DTW algorithm is computationally expensive [1], and with the idea that this application should run on smartwatches, which have limited computing power, a machine learning approach was deemed more suitable.

Based on the literature studied, the best performing machine learning algorithm is the SMO model used by Laput et al. [7]. The SMO algorithm [9] solves the quadratic programming problem that arises during the training phase of SVMs.

4.2 Feature extraction

The collected data of a character can be described as a time series of the acceleration movement defined as a = (a_1, ..., a_N), with vector a_n = [ax_n, ay_n, az_n] for each a_n in a. Vector a_n gives the values of each axis at time n. The size of a is not defined, since the total number of samples is unknown a priori. To classify each character, a feature vector is constructed for each axis in a. Features are a mathematical representation of some meaningful attribute that characterises the character in an unambiguous way.

Based on several studies [1, 7, 13], a number of features are selected to be computed for each a. Table 1 lists all the features, including their justification. The features listed in Table 1 are computed for the x- and y-axis, resulting in a feature vector of length 20.

The features are computed from two different domains: the first five features listed in Table 1 are computed over the raw data, the remaining five over the signal passed through a Fast Fourier transformation. Not the full window of the raw data is used: the data is trimmed from the moment the relative movement of both the x- and y-axis is less than 0.05g compared to 0.1 seconds earlier. The trim function starts after the first two seconds of movement, to reduce the chance of trimming the signal too early. The features are only computed for the x- and y-axis, as we are looking for the linear movement, so including the z-axis signal was deemed not relevant.
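As a sketch of how such a length-20 vector could be computed: the sampling rate and the definition of a "significant" peak are not fully specified above, so both are assumptions here.

```python
import numpy as np

def extract_features(axis_signal, fs=100.0):
    """Compute the 10 features of Table 1 for one axis.

    axis_signal: 1-D array of accelerometer readings for a single axis.
    fs: assumed sampling rate in Hz (not stated exactly in the paper).
    """
    a = np.asarray(axis_signal, dtype=float)

    # -- raw time-domain features --
    time_feats = [a.mean(), a.max(), a.min(), a.std(), a.sum()]

    # -- frequency-domain features over the magnitude spectrum --
    spectrum = np.abs(np.fft.rfft(a))
    freqs = np.fft.rfftfreq(len(a), d=1.0 / fs)

    energy = np.sum(spectrum ** 2) / len(a)                 # signal strength, normalised by size
    centroid = np.sum(freqs * spectrum) / np.sum(spectrum)  # spectral centre of mass

    # "significant" peaks: bins larger than both neighbours and above the mean
    peaks = np.sum((spectrum[1:-1] > spectrum[:-2]) &
                   (spectrum[1:-1] > spectrum[2:]) &
                   (spectrum[1:-1] > spectrum.mean()))

    freq_feats = [energy, centroid, float(peaks), spectrum.max(), spectrum.min()]
    return time_feats + freq_feats

def feature_vector(x_axis, y_axis):
    """Concatenate the per-axis features into the length-20 vector used here."""
    return extract_features(x_axis) + extract_features(y_axis)
```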

4.3 Experiments

The experiments are programmed in Python, and for the classification models the scikit-learn8 and Weka9,10 libraries are used. A total of 5 experiments are done to answer the earlier stated research questions. Based on some initial tests, using the collected data from the accelerometer resulted in the highest accuracy scores.

Experiment 1: To discover which features work best for the classification of a character, a linear SVM model is trained and tested. A linear SVM model can be used to determine the most significant features of a dataset. A multi-class linear SVM-classifier works on a one-versus-the-rest principle, so the hyperplane is placed between all the points of character class 'A' and the rest of the character classes. This results in 5 coefficient vectors, containing the weights per feature. The further the value of a weight lies from zero, the greater its importance for classification. The features for each character are ranked according to the absolute values of their weights. After this we create a list of features by selecting the top n features for n = 1 to 20; so for n = 1 we only select the highest ranked feature in each ranked list per character, for n = 2 we select the first and second highest ranked features, and so on. After selecting a subset of the features we run a 10-fold cross-validation and collect the accuracy score per model. The accuracy score of a model is used to evaluate the performance of the selected feature subset. In this experiment the following settings are used:

• Dataset: 100% of the dataset.
• Box size: 10.

8http://scikit-learn.org/stable/
9http://www.cs.waikato.ac.nz/ml/weka/

10https://pypi.python.org/pypi/python-weka-wrapper


Feature              Notation                          Justification

Raw time domain data
Mean                 µ                                 Statistical aggregation of the signal
Max                  max(a)
Min                  min(a)
Standard Deviation   σ = √µ₂
Sum                  Σ_n a_n

Frequency domain
Energy               ⟨a, a⟩ = ∫ |a(n)|² dn             Represents the strength of the signal,
                                                       normalised by the size of the signal
Centroid             Σ_n f(n)·x(n) / Σ_n x(n)          Used as an indication of the centre
                                                       of mass of the signal
Number of peaks      number of peaks                   The number of significant peaks
                                                       of the signal
Max                  max f(a)                          Statistical aggregation of the
                                                       frequency spectrum
Min                  min f(a)

Table 1: A list of all the features computed for the x- and y-axis.

• Features: for the first test all features are used; in the following experiments only the selected subset of features based on their ranking is included.

• Kernel: Linear.

• Hyper parameter: C = 1 (default value)
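The ranking-and-subset procedure described above can be sketched with scikit-learn's `LinearSVC`; whether the original experiment used this exact class or Weka's SMO is not stated, so treat this as an approximation.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def rank_features(X, y):
    """One-vs-rest linear SVM; rank features per class by |weight|."""
    clf = LinearSVC(C=1.0, dual=False).fit(X, y)
    # clf.coef_ has shape (n_classes, n_features); sorting each row by
    # descending absolute weight gives a ranked feature list per character.
    return np.argsort(-np.abs(clf.coef_), axis=1)

def top_n_subset(ranking, n):
    """Union of the n highest-ranked features over all character classes."""
    return sorted(set(ranking[:, :n].ravel().tolist()))

def subset_scores(X, y, ranking, max_n=20):
    """Mean 10-fold CV accuracy for each top-n feature subset."""
    scores = {}
    for n in range(1, max_n + 1):
        cols = top_n_subset(ranking, n)
        scores[n] = cross_val_score(LinearSVC(C=1.0, dual=False),
                                    X[:, cols], y, cv=10).mean()
    return scores
```

Plotting `subset_scores` against n yields a curve like Figure 4.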

Experiment 2: For the SVM-classifier we test four different kernels available in the Weka library: a linear kernel, a normalized polynomial kernel, an RBF kernel, and the Pearson VII function-based (Puk) kernel [11]. The accuracy scores are based on the mean accuracy across a 10-fold cross-validation. For this experiment the following settings are used:

• Dataset: 100% of the dataset.
• Box size: 10.

• Features: selection of features following out of experiment 1.
• Kernel: Linear, normalized polynomial, RBF, Puk.

• Hyper parameter: C = 1 (default value), γ = 1 / number of features
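A rough scikit-learn equivalent of this comparison is sketched below; the normalized polynomial and Puk kernels are Weka-specific, so only the overlapping kernels are shown, with a plain polynomial kernel standing in for the normalized one.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def compare_kernels(X, y):
    """Mean 10-fold CV accuracy per kernel with C = 1 and
    gamma = 1 / number_of_features, matching the settings above."""
    X = np.asarray(X)
    gamma = 1.0 / X.shape[1]
    results = {}
    for kernel in ("linear", "poly", "rbf"):  # Puk is only available in Weka
        clf = SVC(kernel=kernel, C=1.0, gamma=gamma)
        results[kernel] = cross_val_score(clf, X, y, cv=10).mean()
    return results
```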

Experiment 3: Data is collected for three different box sizes. To determine whether there are differences between the sizes, each size is evaluated on its accuracy. A test is also performed where all sizes are included in one big dataset and tested for the overall accuracy.

Figure 4: Accuracy scores per model per total number of selected features of the ranked feature list per character.

• Dataset: 100% of the dataset. To evaluate against their own size we used 10-fold cross-validation; for the test against other sizes we randomly selected 20% per size as test set.
• Box size: 5, 7.5, 10 centimeter, and all sizes together.
• Features: selection of features following out of experiment 1.
• Kernel: best performing kernel based on experiment 2.
• Hyper parameter: C = 1 (default value)

Experiment 4: The previous experiments all consider one general model based on 80% of the participants; in this experiment, however, we move 2/3 of each participant's data from the test set into the training dataset. A justification for including this experiment is the notion that in a real application it is not unthinkable that the generalised model will be fitted to the specific motion of a person, based on explicit or implicit training. For this experiment the best performing model of experiment 3 is used.

• Dataset: 80% training data, 12 2/3% extra training data, 7 1/3% test data.

• Box size: 5, 7.5, 10 centimeter.

• Features: selection of features following out of experiment 1.
• Kernel: best performing kernel based on experiment 2.
• Hyper parameter: C = 1 (default value)

Experiment 5: To test the efficiency of the models for character classification in written words, a heatmap is created based on the probabilities for each character. A sliding window method is used, with a window of 1 second and steps of 0.25 seconds. The window size of 1 second is based on our observations of participants writing the characters during data collection.

• Dataset: 100% training data for size 7.5.
• Box size: 7.5.

• Features: selection of features following out of experiment 1.
• Kernel: Linear

• Hyper parameter: C = 1 (default value)
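The sliding-window procedure can be sketched as follows; `featurize` stands for the length-20 feature extraction described in Section 4.2, and the 100 Hz sampling rate is an assumption not stated in the paper.

```python
import numpy as np

def word_heatmap(x_axis, y_axis, clf, featurize, fs=100, win=1.0, step=0.25):
    """Slide a 1-second window in 0.25-second steps over a word recording
    and collect the classifier's probability for each character per window.

    clf must expose predict_proba (e.g. sklearn's SVC(probability=True));
    featurize(x_window, y_window) must return the feature vector the
    classifier was trained on.
    """
    n_win, n_step = int(win * fs), int(step * fs)
    rows = []
    for start in range(0, len(x_axis) - n_win + 1, n_step):
        fv = featurize(x_axis[start:start + n_win],
                       y_axis[start:start + n_win])
        rows.append(clf.predict_proba([fv])[0])
    # windows x characters matrix; plotting it gives a heatmap like Figures 5-7
    return np.array(rows)
```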

5 RESULTS

Experiment 1: The results of experiment 1 are presented in Figure 4. From the results we can observe that the SVM-classifier has the highest accuracy when n = 6 is chosen. This implies that selecting the top 6 ranked features per character results in the list of features that works best for this dataset. The absolute value scores


Selected Kernel        Accuracy in %
Linear                 37.9
NormalizedPolyKernel   38.6
Puk                    35.0
RBF                    27.2

Table 2: Classification accuracy for each kernel, based on 10-fold cross-validation

Tested on size:      5      7.5    10     All
Trained on size
5                    34.4   29.7   33.7   32.6
7.5                  28.3   33.3   31.7   31.1
10                   32.8   34.7   38.6   35.4
All                  35.0   33.9   38.1   35.6

Table 3: Accuracy scores in % of correctly classified characters among different sizes, tested against a 20% test set per size.

per character are presented in Figures 9, 10, 11, 12, and 13, which are included in the appendix. The top 6 features per character result in the following list of features per axis:

• X-axis: Max, Min, Number of Peaks, and Standard Deviation
• Y-axis: Centroid, Min, Max, and Number of Peaks

The above listed features are used in the remaining experiments.

Experiment 2: The results of the second experiment are presented in Table 2. From these results we can observe that the best performing SVM-classifier is the one combined with a normalized polynomial kernel, scoring a 38.6% accuracy.

For the remaining experiment a normalized polynomial kernel is used.

Experiment 3: The results of experiment 3 are presented in Table 3; all accuracy scores are in %. Comparing the results for each size, we can conclude that a model performs best when it is trained on data of the same size as the selected test size. The only exception is the model trained on all collected samples of each size, which performs less well than the model trained only on the 10 centimeter samples. The model trained on the samples of size 10 resulted in the highest accuracy, and the difference between the best performing models per size is within 5 percentage points. However, based on these results it is hard to claim that a model would work well across interchangeably different sizes.

Experiment 4: In Table 4 we present the results of experiment 4. The results show some interesting findings. As we add more data to our training set, specifically data samples of the participants that are also in the test set, we expected the models to fit the test set better. For the 5 and 7.5 centimeter models this is indeed the result. However, for 10 centimeters the classification accuracy actually declines compared to the accuracy score in experiment 3. When a model is trained on more data we expect it to perform better, so the decline in accuracy of correctly classified characters for the 10 centimeter model was not expected.

Selected Size Accuracy score

5cm 40.0%

7.5cm 32.5%

10cm 38.1%

Table 4: Accuracy scores per size, when part of the test data is included as extra training data.

Character   A    B    O    R    T
A           4    7    6    0    7
B           6    9    7    0    2
O           1    2   21    1    0
R           2    4   11    0    7
T           0   11    5    1    5

Table 5: Confusion matrix for each character for the SVM-classifier used in experiment 5.

As this experiment was not cross-validated we are not sure if this is an anomaly or a regular effect for the 10 centimeter model.

Experiment 5: We generated a total of 3 heatmaps for the three different words, see Figures 5, 6, and 7, best viewed in colour. The words are randomly selected from the dataset. The heatmaps show the probability assigned to each character by the SVM-classifier. Due to the absence of exact annotation of when a participant started writing a character, we are unable to compare the heatmap of a word exactly with its ground truth.

However, one would expect the probability assigned to each character to change over time depending on the word. For example, if we consider the word 'BOAT', it should result in a heatmap where during the first 1 to 1.5 seconds the probability for the character 'B' is higher than for the others; during the following 1 to 1.5 seconds we expect the probability for the 'O' to be highest, and so on for the remaining characters. In Figures 5, 6, and 7 we can observe that for the word 'BAR' there seems to be a fitting distribution as described: in the first 1.5 seconds the highest probability is assigned to the character 'B', in the following half second the highest probability is assigned to the character 'A', with the remaining part assigned to the character 'O'. Since the SVM-classifier only scored a 32.5% accuracy, a correct estimation of the probability distribution for characters is not to be expected. To give some extra insight into the results of the heatmaps, we computed a confusion matrix for the used classifier. The confusion matrix is presented in Table 5, and although the table does not give any direct evidence in support of the distribution of the probability scores over time, it does show that the character 'R' is never classified correctly, and that the 'A', 'B', 'R', and 'T' are all misclassified as an 'O', respectively 6, 7, 11, and 5 times for 24 test samples. These two observations can explain the low probability scores for the character 'R' in the heatmap for the word 'BAR', and the prominent presence of the probability scores for the character 'O' in all three words.


Figure 5: Heatmap for the word BAR.

Figure 6: Heatmap for the word BOAT.

Figure 7: Heatmap for the word BAT.

6 CONCLUSION

This research presents the results of 5 experiments performed to determine several aspects of the general goal of character classification based on motion sensor data, recorded while participants write characters wearing a smartwatch. Based on the results of the experiments we can give the following answers to our research questions:

• Does the size of a character affect the accuracy regarding correct classification of characters?

Although the size of a written character does affect the performance of classification, the difference in performance is just 5.3 percentage points between the best and worst performing model after a 10-fold cross-validation. Interestingly, when we tested against test sets of other sizes, the accuracy was in certain cases similar, or dropped just 15%. So yes, the bigger the written characters, the better the performance; however, as the difference between the sizes is small, we cannot conclude that there is a minimally needed size for written characters to be classified correctly.

• Based on other research, which selection of features works well to correctly classify the characters in the collected dataset?

The following selection of features resulted in the best performance:

– X-axis: Max, Min, Number of Peaks, and Standard Deviation
– Y-axis: Centroid, Min, Max, and Number of Peaks

However, as this was tested only on the 10 centimeter part of the dataset, it might not yield the same selection of features for the other sizes in the complete dataset.

• How well will the trained models work when used in practice to classify the individual characters when a word is written?

Due to the absence of a ground truth of when people are writing which character, we could not directly compare the generated heatmaps with anything. However, based on observations of the heatmaps we do see a change in the probability distribution for the characters. Therefore we expect that a sliding window approach would work well in practice. Our current approach did not result in a model that performs well enough to be implemented in a real-world application. However, as our research goal was not to create a model with the highest possible accuracy per se, further research can study and test this. Areas for improvement could be to evaluate a broader range of features and to test different algorithms for classification (see Section 7.2). Our research focussed on supporting text entry on a smartwatch based on the motion measured by the accelerometer inside the smartwatch. Other areas of interest for motion-based interaction as an input modality could be gesture-based interaction for smart devices in a home or work environment, and fitness and health tracking; we also see potential for motion-based interaction measured with smartwatches or similar devices in the areas of Augmented, Mixed, and Virtual Reality.

7 DISCUSSION & FUTURE WORK

In this study participants were requested to write characters similar to the way they were displayed on the screen. Even so, we observed a distinct difference in how people write; this is discussed in more depth in Section 7.1. Writing style is unique to each person, but it makes the task of correctly classifying characters harder. In practise, explicit learning methods, based on data collected during a training session, and implicit learning methods, based on data collected during normal use, could be used to learn movements unique to a single person. As the results of experiment 4 show, including samples of participants from the test set does improve classification accuracy.

Figure 8: Example of the character 'A' as a sequence of points.

During the experiment a total of 1800 samples were collected. In general, however, the accuracy of the trained models could improve if more data were collected and used to train the models. This could be done by repeating the experiment, or by building an application, published in the Google Play store11, that trains the models on crowdsourced data.
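Retraining on a larger dataset would follow the same pipeline used in this work: an SVM classifier evaluated with 10-fold cross-validation. A minimal scikit-learn sketch, using synthetic stand-in data since the real feature matrix is not reproduced here:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data: 60 eight-dimensional feature
# vectors per character class (classes and dimensions are illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(60, 8)) for i in range(5)])
y = np.repeat(list("ABTOR"), 60)

# Default SVM without hyperparameter tuning, as in this work.
clf = SVC(kernel="rbf")
scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
print(round(scores.mean(), 3))
```

With more (crowdsourced) samples, only `X` and `y` grow; the training and evaluation code stays the same.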

7.1 Observations during the experiment

The underlying idea behind this research is that whenever a person 'writes' a character, a smartwatch placed on that person's wrist registers motion signals that are distinct to that character, so each character can be individually classified. However, during the collection of the data some observations were made that likely affected the potential for highly accurate classification of characters. First, during the experiment participants were not specifically instructed how to hold their arm, wrist, and hand, but implicitly took a natural position. We observed that this position is not the same for each person: some hold their arm parallel to the paper on which the blocks are drawn, while others hold their arm at a 20° to 40° angle. This has likely resulted in differences along the x- and y-axis, since in certain cases movement was at an angle.

Second, we can consider the motion of writing a character to be a sequence of points to be completed in a certain order, see Figure 8. During the experiment we noticed that the order in which the points are completed, i.e. the way a character is written, differed among participants: where some completed the sequence as 1, 2, 3, 4, 5, others completed it as 2, 1, 2, 3, 4, 5. We did not capture exactly how a participant wrote a character, so we cannot support the following argument with examples from the collected dataset. However, we do think this affects the overall accuracy of our models, since it results in completely different signals for the same character. Furthermore, regarding the

11https://play.google.com/store/apps

way people wrote characters, some participants wrote characters differently when writing them as separate characters, as opposed to how they wrote them when writing the words. Third, participants were given a countdown before the next character was displayed on the smartwatch screen. Unfortunately, their finger was not always in the correct starting position, so their first hand movement was to move to the starting position. This movement is also registered and, due to its infrequent occurrence, hard to filter out of the dataset.

An approach to these problems would be to define a standardised vocabulary for the alphabet based on a single-stroke or uni-stroke [4] movement for each character, with a clear start and end point. This way the sequence in which a character is written is the same for each person, which could improve the correct classification of characters.

7.2 Different Machine Learning models

We are unaware of research done on similar datasets, which makes it hard to compare our results to others for an indication of how well our models actually perform. In general we see room for improvement of our methods: besides collecting more data, we could implement feature standardisation and do hyperparameter tuning via a grid or random search. For now we can conclude that our method results in a model that performs better than randomly guessing the correct class.
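Feature standardisation and a grid search over the SVM hyperparameters can be combined in a single scikit-learn pipeline. The sketch below uses synthetic stand-in data, and the parameter grid is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; in practice X, y would be the character features and labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=5, random_state=0)

# Standardise features, then search over C and gamma with cross-validation.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`RandomizedSearchCV` would replace `GridSearchCV` for the random-search variant mentioned above.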

As an SVM classifier is inherently a two-class classifier, it might not be the most optimal approach for this multi-class dataset. Another option to potentially improve classification accuracy is to test other classification methods: feature-based methods such as k-Nearest Neighbour, Bayesian approaches, a Random Forest classifier, and neural-network-based algorithms, including deep learning models. Testing a truly multi-class approach could be a good next step. Furthermore, non-feature-based methods could be explored, such as convolutional networks where the raw signal, rather than computed features, is the direct training input.
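The alternative feature-based classifiers named above can be compared under the same cross-validation protocol. A sketch with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; in practice X, y would be the character features and labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=6,
                           n_classes=5, random_state=0)

# Evaluate each candidate classifier with the same 10-fold protocol.
results = {}
for clf in (KNeighborsClassifier(), GaussianNB(),
            RandomForestClassifier(random_state=0)):
    results[type(clf).__name__] = cross_val_score(clf, X, y, cv=10).mean()

for name, acc in results.items():
    print(name, round(acc, 3))
```

A comparison like this on the real dataset would indicate whether a multi-class-native method outperforms the one-vs-one SVM used here.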

7.3 Limitations for practical use

One of the most pressing limitations for text entry based on motion data is the absence of a model that supports classification of all 26 characters in the alphabet; for whole responses, support for '.', ',', ' ', and an enter key is also needed. This results in a model with at least 30 different classes. Although it is hard to speculate about the performance of such a model, based on our current results an SVM model might not be ideal and could result in low accuracy for this task. However, as discussed earlier, different approaches might result in better models with higher accuracy. Another point of discussion is that, by social convention, people are used to wearing a (smart)watch on their left wrist, whereas during the experiment participants wore the smartwatch on their right wrist. Although no participants complained about this during the experiment, they might prefer to wear their watch on the wrist of their non-writing hand, which would make practical use of this interaction modality more difficult.

Another limitation during interaction is that a person has to stand still when writing characters: as the classification is based on motion sensors that measure acceleration, trying to write a character while walking, moving in a vehicle, or cycling will result in noisy and likely unusable data for classification.

Another consideration for practical use is the length of words and messages. The proposed method for character input might be faster when the written message is short; however, there is a threshold at which the time it takes to write a number of characters exceeds the time it takes to grab your phone, open the application, and type a response. To optimise use, word suggestions, similar to those in most mobile phone keyboards, could be implemented for this kind of application as well, potentially reducing the total time needed to write a word.
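Such word suggestions could start out as simple prefix completion over a vocabulary. A minimal illustrative sketch; the vocabulary and ranking-by-length heuristic are assumptions, and a real implementation would rank by word frequency:

```python
# Illustrative vocabulary of short reply words (an assumption, not a
# resource from this work).
VOCAB = ["yes", "yesterday", "no", "not", "now", "ok", "okay", "later"]

def suggest(prefix, k=3):
    """Return up to k vocabulary words starting with the written prefix,
    shortest first (a crude proxy for frequency ranking)."""
    matches = [w for w in VOCAB if w.startswith(prefix)]
    return sorted(matches, key=len)[:k]

print(suggest("no"))
```

After each classified character, the running prefix would be re-queried, so the user can pick a completion instead of writing the whole word.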

ACKNOWLEDGEMENTS

I would like to thank my supervisor Thomas Mensink for his advice, guidance, and support during the project.

DATASET

The collected data is available at https://github.com/bwaanders/Character.

REFERENCES

[1] Christoph Amma, Marcus Georgi, and Tanja Schultz. 2014. Airwriting: a wearable handwriting recognition system. Personal and ubiquitous computing 18, 1 (2014), 191–203.

[2] MA Anusuya and Shriniwas K Katti. 2010. Speech recognition by machine, a review. arXiv preprint arXiv:1001.2267 (2010).

[3] Luca Ardüser, Pascal Bissig, Philipp Brandes, and Roger Wattenhofer. 2016. Recognizing text using motion data from a smartwatch. In Pervasive Computing and Communication Workshops (PerCom Workshops), 2016 IEEE International Conference on. IEEE, 1–6.

[4] David Goldberg and Cate Richardson. 1993. Touch-typing with a stylus. In Proceedings of the INTERACT'93 and CHI'93 conference on Human factors in computing systems. ACM, 80–87.

[5] VK Govindan and AP Shivaprasad. 1990. Character recognition, A review. Pattern recognition 23, 7 (1990), 671–683.

[6] Paul Huber. 2015. Inaccurate input on touch devices relating to the fingertip. (2015).

[7] Gierad Laput, Robert Xiao, and Chris Harrison. 2016. ViBand: High-Fidelity Bio-Acoustic Sensing Using Commodity Smartwatch Accelerometers. In Proceedings of the 29th Annual Symposium on User Interface Software and Technology. ACM, 321–333.

[8] Meinard Müller. 2007. Dynamic Time Warping. In Information retrieval for music and motion. Vol. 2. Springer, Chapter 4.

[9] John Platt. 1998. Sequential minimal optimization: A fast algorithm for training support vector machines. (1998).

[10] Ke Sun, Yuntao Wang, Chun Yu, Yukang Yan, Hongyi Wen, and Yuanchun Shi. 2017. Float: One-Handed and Touch-Free Target Selection on Smartwatches. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, 692–704.

[11] Bülent Üstün, Willem J Melssen, and Lutgarde MC Buydens. 2006. Facilitating the application of support vector regression by using a universal Pearson VII function based kernel. Chemometrics and Intelligent Laboratory Systems 81, 1 (2006), 29–40.

[12] Sharad Vikram, Lei Li, and Stuart Russell. 2013. Handwriting and Gestures in the Air, Recognizing on the Fly. In Proceedings of the CHI, Vol. 13. 1179–1184.

[13] Chao Xu, Parth H Pathak, and Prasant Mohapatra. 2015. Finger-writing with smartwatch: A case for finger and hand gesture recognition using smartwatch. In Proceedings of the 16th International Workshop on Mobile Computing Systems and Applications. ACM, 9–14.


A FIGURES

Figure 9: Absolute scores per feature for character A

Figure 10: Absolute scores per feature for character B


Figure 11: Absolute scores per feature for character T

Figure 12: Absolute scores per feature for character O


Figure 13: Absolute scores per feature for character R
