Posture Similarity Estimation using a Convolutional Neural Network


Academic year: 2021

Posture Similarity Estimation using a Convolutional Neural Network

Meike H. Kombrink
11306998
Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor

dhr. prof. dr. ing. Zeno J.M.H. Geradts

Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 28th, 2019


Contents

Abstract
1 Introduction
  1.1 Image processing
    1.1.1 Gait analysis
      1.1.1.1 Silhouette-based approach
      1.1.1.2 Model-based approach
    1.1.2 Face recognition
    1.1.3 Clothing analysis
    1.1.4 Posture analysis
  1.2 Research questions
    1.2.1 How can a prediction for posture similarity be analysed?
    1.2.2 What does a usable data set for the problem at hand look like?
    1.2.3 How well can humans predict posture similarities?
    1.2.4 How well can a neural network predict posture similarities?
2 Method
  2.1 Prediction analysis
    2.1.1 Correctness
    2.1.2 Similar/dissimilar ratio
    2.1.3 Deviation from the norm
  2.2 Usable data set
  2.3 Human predictions
  2.4 Neural network predictions
    2.4.1 Input
    2.4.2 Architecture
      2.4.2.1 Convolution layers
      2.4.2.2 Fully connected layers
    2.4.3 Output
3 Results
  3.1 Prediction analysis
  3.2 Usable data set
  3.3 Human predictions
  3.4 Neural network predictions
4 Conclusion and Discussion
  4.1 Prediction analysis
  4.2 Usable data set
  4.3 Human predictions
  4.4 Neural network predictions
  4.5 Accurate predictions


Abstract

Within digital forensics, posture has often been used implicitly in gait analysis, face recognition and clothing analysis, for these analyses are partly determined by posture. However, this usage never led to an explicit posture analysis. This research aims to implement an analysis that focuses solely on posture. The implementation uses a convolutional neural network with 3 convolution layers and 12 fully connected layers. The proposed implementation achieved a correct posture analysis in 86.9% of the presented cases. These results indicate that posture can be used to identify and re-identify individuals. Nevertheless, more research, specifically on posture, is required.

1 | Introduction

Forensic science has existed as a science since the 1880s [1], though not in the form it is known in today. Digital forensics, however, has not been around nearly as long as forensics as a whole; within forensic science it is a relatively new area of interest. For this reason there have been many different definitions of the field over the years. One of the first definitions of digital forensics was given by M. Pollitt in 1995:

“[Digital] forensics is the application of science and engineering to the legal problem of digital evidence. It is a synthesis of science and law. At one extreme is the pure science of ones and zeros. At this level, the laws of physics and mathematics rule. At the other extreme, is the courtroom.” [27]

Though this definition mentions the court of law, it does not state that all outcomes must be legally acceptable. McKemmish added this to the definition, stating that digital forensics is:

“[Digital forensics] is the process of identifying, preserving, analyzing and presenting digital evidence in a manner that is legally acceptable.” [21]

The above definition was not yet agreed upon by most workers in the field. In 2001, at the Digital Forensic Research Workshop (DFRWS), Palmer developed a definition that all academic researchers in attendance accepted [31]:

“The use of scientifically derived and proven methods for the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for the purpose of facilitating or furthering the reconstruction of events found to be criminal, or helping to anticipate unauthorised actions shown to be disruptive to planned operations.” [25]

Yet even this accepted definition has been criticised for missing a key feature of the field: its interdisciplinary nature. The lack of unanimity with regard to a definition


within the field has led to many more definitions and explorations of what the field entails. Yet all these definitions share roughly the same core: all state that the ultimate goal of digital forensics is to provide correct and legitimate digital evidence in a court of law. This means that the field is more than just examining computer equipment or analysing data, and that digital forensics is not a discipline focused solely on technical issues, as some new researchers with a background in computer science have believed. Digital forensics is a field that combines and embraces both legal issues and computer techniques. Another point of agreement is that the field is highly interdisciplinary, resulting in researchers from many different fields contributing to the progress made. [31]

Digital forensics has not emerged from academic research, as the traditional forensic disciplines (for example toxicology and ballistics) have; it emerged from a practical need to solve real-life problems [4]. This origin has led to few scientifically well-developed and sound methods. This does not mean the field has not developed methods, but that few of them have been scientifically proven to be conclusive. Such proven conclusiveness is important within the field, since all results should be usable in a court of law: without academic research supporting a method, it is less likely to hold its value in front of a judge. Traditionally this has led to two different approaches to digital forensics problems: approaching the problem as a computer science problem, and approaching it as an investigative problem. In both cases the aim is mostly identical: trying to locate discrete pieces of information that are probative. In the computer science approach, characteristics of the data are used to decide which objects, data and/or metadata to use and which not to use. The investigative approach reviews the content of the available evidence in order to interpret the data in the light of known facts and elements of the crime, to determine probative information or information of value [28]. The idea of combining both approaches is not new, yet difficult to bring into practice.

Within the field of digital forensics, many researchers and practitioners have been active for several years and therefore have a relatively good overview of the field. For new researchers, on the other hand, it is often difficult to get a holistic overview and to understand which tasks are involved in digital forensics, which makes it less likely for them to join the field. Also unclear to new researchers are the scope and depth of the discipline, as well as its risks and opportunities [31]. Even from the struggle to agree on a definition, it is clear that the field is challenging to capture. All these aspects slow future progress, for they withhold new researchers from joining the discipline.

The rapid progress within digital forensics has led to many positive developments. Primarily, the field is now brought in immediately, while a person's digital traces can still be used optimally; in the past digital forensics was only used as one of the last resources. Another improvement is that guidelines for several steps present in any digital forensics study have been formalised, resulting in guidelines and rules for research to be usable within the field [4].


For this research, the area of digital forensics that focuses on image processing will be explored. This research area uses images (photos, security camera footage, videos etc.) to identify and re-identify suspects. Previous research within this area of digital forensics will be discussed in the next sections.

1.1 Image processing

Within image processing, a few branches of the field focus on identification and re-identification of subjects, and several of these are relatively well developed: gait analysis (see 1.1.1), face recognition (see 1.1.2) and clothing analysis (see 1.1.3). However, posture analysis (see 1.1.4), which is the topic of this research, has had far less attention from the field. In the following sections the well-developed methods will be discussed, after which their link to posture analysis will be explored.

1.1.1 Gait analysis

Gait analysis is an identification method within digital forensics. It is one of the most researched areas, for gait is always present in CCTV of a crime scene. Gait is often confused with walking. In order to make the distinction between these clear, a definition of gait will be given next:

“Gait is the pattern of movement utilised during locomotion, key elements being its dynamic and repetitive nature.” [7]

When identifying the difference between gait and walking, it is important to notice the word 'locomotion' used in the definition. Locomotion here can be any method of moving, which may include walking but could also include running, for example. Therefore, interchanging walking and gait is incorrect, for gait is a much broader term than walking alone [7]. Gait was first found to be an individual feature in a study meant for medical purposes: Murray et al. tried to identify a normal gait pattern, in order to then find deviations for pathologically abnormal patients. The study used 60 patients with normal gait, and found that gait differed per individual [22].

However promising this sounds, there are still many difficulties with automated gait analysis. First and foremost, when using real-life footage, videos of gait will be cluttered, unstructured and set in a dynamic environment. Especially when used within forensics to convict criminals, the footage will be of a crime scene. Because of this, research will most likely be done using the CCTV present at the crime scene, which is mostly uncalibrated. Tracking movements and objects within uncalibrated CCTV is at this point in time very difficult, since this tracking depends on several factors related to the environment (illumination changes, shadows and occlusions) while also depending on the nature of human beings (appearance variations and articulation) [9]. Regardless of these difficulties, several methods have been developed to conduct this comparison automatically.


The first research aiming at automated gait analysis within forensic science was conducted in 2011 [9]. It resulted in a method that could use CCTV to distinguish between identical and different gait. The method was already background-invariant, which cancels out many of the above-mentioned difficulties.

Since this very first paper, many have followed, most of which take one of two approaches: the model-based approach or the silhouette-based approach. Both approaches will be discussed separately. [24]

1.1.1.1 Silhouette-based approach

The silhouette-based approach is based on the analysis of silhouettes. The general idea is that a walking subject is abstracted from the background, after which a set of measurements describing the motion and shape of the silhouette is derived. These measurements are used to describe the gait of the subject. The silhouette-based approach leans more towards classic computer vision views and methods [24].
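As an illustration of the abstraction step described above, a minimal background-subtraction sketch is given below. The plain frame differencing and the arbitrary threshold value are assumptions of this sketch, not part of any of the published methods discussed here.

```python
import numpy as np

def extract_silhouette(frame, background, threshold=30):
    """Abstract a subject from a static background by absolute frame
    differencing followed by thresholding.

    frame, background: 2-D uint8 grayscale arrays of equal shape.
    Returns a boolean mask where True marks silhouette pixels.
    """
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold

# Toy example: a 5x5 dark background with a bright 2x2 "subject".
background = np.zeros((5, 5), dtype=np.uint8)
frame = background.copy()
frame[1:3, 1:3] = 200
mask = extract_silhouette(frame, background)
print(mask.sum())  # number of silhouette pixels
```

In practice this naive differencing is exactly what the environmental factors mentioned above (illumination changes, shadows, occlusions) break, which is why the published methods are considerably more involved.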

Several methods have been developed to conduct a silhouette-based gait analysis, as described by Nixon et al. [24]. A couple of these will be discussed to give an overview of the possibilities for silhouette-based approaches:

1. Hidden Markov Models: this method considers two different image features: the width of the outer contour of a silhouette and the entire silhouette. These features combined describe the gait of the subject. [17, 32]

2. Self-similarity: this method applies principal component analysis (PCA) to self-similarity plots to obtain a self-similarity metric. This metric is then analysed using k-nearest neighbours, which is used to compare different or identical gaits. [6]

3. Silhouette similarity: this method first defines the bounding box of a silhouette, which is done semi-automatically. Next, this box is used to match a silhouette, after which a gait period is estimated. This gait period provides insight into the difference or similarity of gait. [29]

4. Relational statistics: this method uses the relation between a random group of features to differentiate between different gaits. [35]

5. Key frame analysis: using normalised correlation, the key frames are compared to training frames. The correlation scores are then used together with a nearest-neighbour algorithm to classify subjects based on their difference or similarity in gait. [10]

6. Area masks: this method derives a measure for the change in area of a selected region of a silhouette. This change in the selected area can be used to analyse differences and similarities in gait. [11]


7. Point distribution models: this method uses the movement of points on the 2D shape for gait classification. It is one of the few methods that worked on footage of both walking and running subjects, and could even distinguish between the two. [33]

1.1.1.2 Model-based approach

The model-based approach generally tries to find features and use their motion information for gait analysis. Within model-based approaches one main distinction can be made: structural approaches and modelling approaches. The main difference is that modelling model-based approaches use the (relative) motion of the angles between the limbs, whilst structural model-based approaches use static parameters, like the length of the legs. For a graphical overview of the difference between these two, see figure 1.

Figure 1: A schematic view of the difference between the two approaches within the model-based approach. The left shows the parameters used for structural model-based approaches, whilst the right shows the parameters used for modelling model-based approaches. [24]

One of the main advantages of model-based approaches to gait analysis is the invariance that comes with them: the effects of changes of clothing and changes in viewpoint cause fewer problems for the model, for it uses motions that are not altered by either of these. However, this invariance comes with a greater computational complexity. [24]

Several methods have been developed that can be considered model-based approaches. These were described by Nixon et al. [24]. A few of them will be mentioned to give an overview of the possibilities:

1. Stride parameters: this method is a structural model-based approach for gait analysis. It uses stride and cadence; this combination was found to be usable for identification based on gait. [5]

2. Human parameters: this method is a structural model-based approach. It uses gait to derive parameters describing the subject's body. Differences in this computed body would entail different people. [8]


3. Articulated model: this method is a modelling model-based approach. An estimate of human shape is derived by shifting and accumulating the edge images. This computed human shape can then be used to differentiate between individuals. [36]

4. Linked feature trajectories: this method is a combination of the structural and modelling model-based approaches. It uses both angles and static measures to analyse gait. [38]

Apart from these two areas within gait analysis, a lot of current research aims to create a view-invariant automatic gait analysis, meaning the analysis can be conducted independent of the angle at which the subject is filmed. At this moment it is possible that gait cannot be used properly solely because of the angle difference between footage. Creating a view-invariant gait analysis would thus be very useful.

Several researchers have succeeded in making gait analysis view-invariant; however, one of the most profound and applicable studies was that of Goffredo et al. [12]. This particular study not only created a view-invariant gait analysis, but also made the system self-calibrating. This is what makes the algorithm particularly useful within forensics.

1.1.2 Face recognition

Gait is, however, not the only personal trait that can be gathered from video footage. A person's face is also highly personal and is, of the biometric modalities, the one most natural to use for identification. In fact, humans identify people by their facial features many times a day. Another advantage of face recognition is that it is a biometric that can be captured from a distance, unlike for example fingerprints. [16]

Face recognition is potentially the most developed and researched area discussed in this paper. In recent years face recognition has made rapid advancements due to fast developments in image capture devices, the availability of a huge number of face images on the internet, and an increased demand for higher security [16]. In this area the commercial world has massively stimulated the advances made, since face recognition can also be used by brands like Apple, which has added it to its newest devices. In a recent study, NIST evaluated 127 algorithms from 45 developers, of which only one was a university; all other 44 were commercial parties [13].

The very first research on automatic face recognition was conducted as early as 1973, by Takeo Kanade, as part of his Ph.D. thesis [18]. The next research on automatic face recognition only followed in 1987-1990; this later work used low-dimensional face representations in combination with PCA [19, 30]. After these early studies, the work that finally made the area move ahead was that of Turk and Pentland [34], who suggested the Eigenface method. After this major advancement, three other milestones have been achieved according to Jain and Li [16]:

1. The Fisherface method, which applied Linear Discriminant Analysis (LDA) after a PCA step to achieve higher accuracy

2. The use of local filters, which provided more effective facial features

3. The AdaBoost learning-based classifier, which was capable of real-time face detection

To date, face recognition implementations have opted for one of two options: face verification/authentication or face identification/recognition. The first is the option where a one-to-one match is made: these systems compare a query face image against an enrolment face image, and the outcome tells the user whether the faces are the same or not. Such systems are nowadays used at airports for self check-in using passports. The second option, face identification, makes a one-to-many match: a query face image is compared against multiple faces in the enrolment database. The goal is to see if the query face belongs to one of the people in the database, thereby validating an identity. This is used, for example, by the police to see if a subject is in their database of people [16].

Another major distinction between types of face recognition concerns user cooperation. The cooperative way requires the user to show his/her face in a proper way, making the face recognition easier to perform. The uncooperative way is the one where the subject is unaware of being identified, which happens mostly in surveillance videos. In some cases there is an in-between situation, where the subject is willing to cooperate but circumstances make the correct position impossible. [16]

The general face recognition system consists of four stages, as explained by Jain and Li [16]. These are shown schematically in figure 2 and further explained afterwards.

Figure 2: An overview of the steps generally present in a face recognition application. [16]

1. Face localisation: in this step the face is localised and separated from the background. It usually uses face landmarking, which is the process of finding landmarks (eyes, nose, facial outlines, etc.).

2. Normalisation: this step enables the system to be invariant to pose and illumination. Geometric normalisation uses face cropping to get the face into a standard frame; photometric normalisation is used to make the face recognition handle different illuminations.

3. Feature extraction: in this step the features that could possibly help to distinguish different faces are extracted. These could range from eye shape


to face width and nose size to the length of the forehead.

4. Matching: in this step the previously extracted face features are compared to the enrolled photo or photos. This results in the final recognition if there is a match, or in no recognition if there is no match.
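The four stages above can be sketched as a skeleton pipeline. The function names, the placeholder bodies and the cosine-similarity matcher below are illustrative assumptions, not the components of any particular system; a real system replaces each placeholder with a dedicated model.

```python
import numpy as np

def localise(image):
    # Placeholder: a real system detects and crops the face using
    # landmarks here; this sketch returns the image unchanged.
    return image

def normalise(face):
    # Photometric normalisation sketch: zero mean, unit variance.
    face = face.astype(np.float64)
    return (face - face.mean()) / (face.std() + 1e-8)

def extract_features(face):
    # Placeholder feature vector: the flattened normalised pixels.
    return face.ravel()

def match(query_features, enrolled_features, threshold=0.9):
    # Cosine similarity against a single enrolled template
    # (one-to-one verification rather than identification).
    sim = (query_features @ enrolled_features) / (
        np.linalg.norm(query_features) * np.linalg.norm(enrolled_features) + 1e-8)
    return sim >= threshold

def recognise(query_image, enrolled_image):
    q = extract_features(normalise(localise(query_image)))
    e = extract_features(normalise(localise(enrolled_image)))
    return match(q, e)

# An image compared with itself should always match.
img = np.arange(16, dtype=np.uint8).reshape(4, 4)
print(recognise(img, img))
```

Extending `match` to loop over a database of enrolled templates turns this verification sketch into the one-to-many identification setting described above.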

Face recognition has made massive progress since the time of the Eigenface method. When external conditions can be controlled (lighting, pose, facial wear, etc.), automated face recognition outperforms humans. This is due both to the quality of the face recognition and to database size: humans can remember, and thus recognise, only a limited number of faces, whilst computers can store many more [16].

In 2018 NIST performed an evaluation study of 127 algorithms, which showed that massive gains in accuracy had been reached in the last few years [13]. They also found that when portrait quality was good, the best tested algorithm was able to find a matching entry, if it existed, with an error rate as low as 0.2%; the remaining errors were explainable by ageing or injury. This shows that face recognition has come a long way, though with further developments in deep machine learning it could still be optimised.

1.1.3 Clothing analysis

Clothing analysis is one of the least well-developed research areas, in large part due to the restricted time frame in which clothing analysis is helpful. Clothes can be easily changed and discarded, making them very changeable and unsound for identification. However, in some cases, especially when the suspect did not have time to change, clothing has been used for identification. In other cases, specifics of clothing were written down and later searched for among the possessions of the suspect [2].

Another major reason automated clothing analysis has not been developed as much as other areas is that it is a job humans are relatively good at. Most human experts can accurately gather clothing details from CCTV footage. Computer vision could, however, be useful by enriching clothing identification with objective and discriminative information [15].

While clothing analysis may not be fully automated yet, clothing specifics are used often. Several cases have used clothing descriptions to ask the public whether they recognised the person depicted. One example is the case of the Boston attacks of 2013, where the public was provided with detailed descriptions of the clothing worn by the attackers [3]. This usage of clothing within investigations shows that automated clothing analysis could be useful in the field [15]. In 2016, Jaha and Nixon proposed a method to automatically annotate clothing within CCTV footage [15]. This method produced both labels and descriptions of the clothing worn, all generated automatically.

Another study used deep learning to identify clothing on individuals. This paper reported a success rate as high as 70% on CCTV footage [3]. The model did particularly well when the task was limited to clothing bearing logos, where the accuracy reached 74.9%.


It can be concluded that, despite the importance of manual clothing analysis in the field of forensics, clothing has had little attention within digital forensics, even though the usability of an automated clothing analysis would be very high.

1.1.4 Posture analysis

This paper is concerned with the identification of posture. Here, posture is defined as 'a person's bodily specifics, e.g. shoulder width, length and hip width'. This does not include details like eye colour, skin tone or clothing.

In many ways posture analysis is already used within the research conducted on gait, because gait is affected by posture, as explained in 1.1.1: when posture changes, gait changes as well. Body size in particular is known to affect the movements of all joints used for gait [14]. However, gait analysis does not focus on the examination of posture, even though posture is a major component of gait. No data is gathered to solely inform the user of the posture of the subject, which is the goal of this research.

Within clothing analysis it is also known that posture plays a major role, since the way clothing looks on a body greatly, or even solely, depends on the posture of the wearer. However, methods for clothing analysis do not inform the user about the posture of the subject.

While face recognition does not depend on someone's posture, certain facial features are related to it, such as the width and height of the face. Yet it seems futile to use face recognition separately to gather posture facts about only the face, when the posture of the entire body is desired.

To the best of this author’s knowledge there has been no previous research on automated posture analysis. This research will focus on this new area within digital forensics. It aims to acquire a similarity score for postures depicted in two images. This score can be obtained in many ways, one of which is a neural network. The approach taken here uses a convolutional neural network (for an explanation, see 2.4).

1.2 Research questions

Combining all the information given above, the following research question was decided upon.

“To what extent is a neural network able to accurately give similarity scores for postures depicted in two images?”

It is hypothesised that the posture similarity scores will deviate considerably from the actual similarity. This is expected because capturing postures within an image is difficult: it can be influenced by clothing, the way one stands and the angle used to depict someone. Nevertheless, it is thought that posture will be analysable if these conditions are similar enough. It is not likely that the results will be good enough to use in a court of law, for this research has a limited time frame and is, to the best of the author’s knowledge, the first of its kind. Even so


the research is believed to show that automatic posture analysis can be useful within the field of digital forensics.

This question will be answered using four sub-questions:

1.2.1 How can a prediction for posture similarity be analysed?

It is hypothesised that a prediction can be analysed using the deviation from the actual similarity, which gives insight into the distance between the correct answer and the prediction. Another possibility is the similar-dissimilar ratio, which gives insight into the bias with which a prediction is made. The last analysis method that will be used is the correctness score, which gives insight into the accuracy of the predictions.

1.2.2 What does a usable data set for the problem at hand look like?

It is hypothesised that a usable data set will have to be limited, for this research has not been done before and it is necessary to ensure that only posture is taken into account. It is therefore hypothesised to be necessary to incorporate only data that shows no background, colours or clothing. These restrictions are believed to result in a data set that is limited both in variance and in number.

1.2.3 How well can humans predict posture similarities?

It is hypothesised that, due to the limited data set that is available, humans will have great difficulty identifying similarities in posture. The reason is that humans identify posture similarities on a daily basis, but in an enriched environment. Examples of such enrichments are shades and colours, which will not be used in this research. Humans will therefore perform worse than they do in real life, while still being able to spot larger differences among postures.

1.2.4 How well can a neural network predict posture similarities?

It is hypothesised that a neural network will outperform humans, for it is not set back by the limits of the data set. Apart from this it is also hypothesised that the neural network will not perform well enough to hold up in the court of law, but that it will be able to make a reasonable prediction nonetheless.


2 | Method

In this section the method used for all sub questions will be discussed thoroughly.

2.1 Prediction analysis

“How can a prediction for posture similarity be analysed?”

The analysis is done using three independent methods, each based on the assumption that the correct similarity is known. This assumption can be made because it is known whether an image pair consists of the same person twice or of two different people. For computational purposes, the values 1 and 0 were used for similar and different, respectively.

2.1.1 Correctness

This score will be computed for any prediction made, and stands for the percentage of predictions that are predicted correctly.

This means that every similarity score will be rounded to either 1 or 0, rounding up any similarity over 0.5 and rounding down any similarity under 0.5. The rounded prediction is then compared to the known correct answer. The number of correctly predicted similarities is divided by the total number of predictions to get the final correctness percentage. If 1555 out of 2000 predictions were correct, the correctness would be 1555/2000 = 77.75%.
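The correctness score described above can be sketched as follows. The text leaves open how a score of exactly 0.5 is rounded, so rounding it up is an assumption of this sketch.

```python
def correctness(predictions, truths):
    """Percentage of predictions that match the known answer (1 or 0)
    after rounding each similarity score at 0.5."""
    rounded = [1 if p >= 0.5 else 0 for p in predictions]  # 0.5 rounds up (assumption)
    correct = sum(r == t for r, t in zip(rounded, truths))
    return 100.0 * correct / len(truths)

# The example from the text: 1555 correct predictions out of 2000 give 77.75%.
```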

2.1.2 Similar/dissimilar ratio

Another analysis conducted on the predictions is the similar-dissimilar ratio. This ratio shows whether the prediction was made with a bias toward one of the two answers. All test and training scenarios have an equal distribution of similar and dissimilar image pairs, so an unbiased prediction would have a ratio close to 0.5, since two answers are possible. The similar-dissimilar ratio therefore gives insight into the bias with which the similarities were scored.

The ratio is calculated as the percentage of predictions that were above 0.5. This ratio can then be displayed as a single number representing the bias, or reported in a graph when multiple ratios are calculated.
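A sketch of the ratio computation, under the assumption that "above 0.5" is a strict inequality:

```python
def similar_dissimilar_ratio(predictions):
    """Fraction of similarity scores judged 'similar' (strictly above 0.5).
    With balanced image pairs, an unbiased predictor stays near 0.5."""
    return sum(1 for p in predictions if p > 0.5) / len(predictions)
```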

2.1.3 Deviation from the norm

Another evaluation tool that will be used is the deviation from the norm, that is, how far off the algorithm is with its prediction. It is known that the


true similarity is either 0 or 1, so a prediction of 0.8 on a similar posture (with similarity 1) has a deviation of 0.2. The measure used in this paper is the mean squared error loss (MSE loss).

M SEloss = n X i=1

(ytrue− ypredicted)2

The choice for this function is mainly due to the square in the function, which makes larger mistakes weigh heavier than smaller ones. For the problem at hand, mostly due to the ultimate goal of reaching a high correctness, smaller deviations are not a major problem; it is the predictions that were off by a large margin that are problematic for the classifier, and these should thus weigh heavier. For this reason MSE loss is an appropriate choice for the analysis of predictions for this problem.
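The MSE loss as used here can be sketched as follows (a plain-Python illustration, not the thesis implementation; the mean over n pairs is taken as in the formula above):

```python
def mse_loss(y_true, y_pred):
    """Mean of the squared deviations between true similarity (0 or 1)
    and predicted similarity; larger errors weigh in quadratically."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# A prediction of 0.8 on a similar pair deviates 0.2, contributing 0.04;
# averaged with the 0.01 contribution of the second pair this is ~0.025.
print(mse_loss([1, 0], [0.8, 0.1]))  # → approximately 0.025
```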

2.2

Usable data set

“What does a usable data set for the problem at hand look like?” A usable data set for the problem has several requirements:

1. The posture needs to be clearly depicted, for otherwise learning based on posture is impossible.

2. Illumination should be identical in all pictures, to eliminate the possibility of learning based on illumination

3. No backgrounds should differ between images, to eliminate the possibility of learning based on the background

4. No clothing difference should be seen between images, to eliminate the possibility of learning based on clothing as well as not letting clothing alter the way the posture looks in the image.

The data set found to be most suitable based on these requirements is the CASIA-B data set [37]. The database consists of pictures that contain only silhouettes, as can be seen in figure 3. In total there are 124 individuals in the database, who were all filmed walking from 11 different angles under three carrying conditions: nothing, a coat, or a bag.

Figure 3: These pictures are examples of images in the data set.

All pictures in this data set follow the requirements previously given. Postures are clearly depicted, and no external factors play any role, for the images are black and white with the silhouette as the only white visible. Another benefit of this data set is that, apart from the optional bag and coat, no clothing changes were made between recordings.

Due to the restricted time frame of this research, it was decided that all data used had to be filmed from the same angle. For this reason only the frontal-angle images (labeled 000 in the data set) were used; 000 was chosen because it is the most generic way to stand in front of a camera. It was also decided not to use the conditions where a bag or a coat was added. Both limitations were set due to time constraints, as well as the still unknown feasibility of identification through posture. Keeping the data simplistic would suffice for assessing the feasibility of automated posture analysis.

Not all pictures within the data set were complete. In some cases people had holes or indents in their bodies; in other images parts of limbs, or whole limbs, seemed to be missing. There were also cases where someone held their arms up, which could cause bias in the classification. The last reason for eliminating images from the data set was a bulge of whiteness that could not possibly belong to the body of the person. All images that were not suitable for this application were taken out. Examples of pictures taken out of the data set can be seen in figure 4.

Figure 4: These pictures are examples of images in the data set that were not suitable for the application. The reasons are (from left to right) half the person’s body is missing, the person has their arms up instead of next to the body and the person has a gap in their head as well as between the legs.

If a subject had too few images left after cleaning, it would lead to an unbalanced data set. Therefore, all subjects with fewer than 100 images after cleaning had to be taken out of the final data set.
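The cleaning threshold described above could be sketched as follows; the dictionary structure mapping subjects to their remaining images, and all names, are hypothetical:

```python
def filter_subjects(images_per_subject, minimum=100):
    """Keep only subjects that still have at least `minimum` usable
    images after cleaning, so the final data set stays balanced."""
    return {
        subject: images
        for subject, images in images_per_subject.items()
        if len(images) >= minimum
    }

cleaned = {"001": list(range(120)), "002": list(range(40))}
print(sorted(filter_subjects(cleaned)))  # → ['001']
```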

2.3 Human predictions

“How well can humans predict posture similarities?”

For gathering a score on human posture analysis, a survey was distributed. The survey consists of 20 picture pairs, each pair containing either two pictures of the same subject or two pictures of different subjects. Each participant was asked to identify whether the pair shown was identical or different. For examples of image pairs, see figure 5.


Figure 5: Both these image pairs were part of the survey for human participants. In the image pair on the left the subject is the same person; on the right the subjects are different people.

Apart from this, a few personal questions were asked: gender, age and highest education. These were asked because no accurate representation of the population could be constructed if only a single subgroup were represented in the survey.

After collecting all responses, the human score will be analysed both as a collective and per individual. As a collective, the analysis will be done as explained in 2.1. Per individual, the same three analysis methods will be followed, but the results will be plotted in graphs to give an overview of the differences that exist between individuals.

2.4 Neural network predictions

“How well can a neural network predict posture similarities?”

For achieving a score from a neural network, first a neural network was constructed. A convolutional neural network (CNN) was chosen, for CNNs achieve some degree of shift and deformation invariance [20]. In figure 6 an overview of the architecture can be seen. It should be kept in mind that the neural network of this paper consists of 3 convolution layers, whilst 2 are shown, and 12 hidden fully connected layers. Both the required input and the architecture of the network will be discussed more thoroughly; the output will be touched upon shortly as well.

Figure 6: The architecture used for the neural network.



2.4.1 Input

The input required for the network is a 3D tensor with dimensions Nx600x240, where N is the number of image pairs used. Every image was resized to 300x240 pixels; combining two images into one array thus results in an array of size 600x240.
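Assuming PyTorch (which the thesis uses for the network), stacking two resized images into one 600x240 pair could be sketched like this; the function name is hypothetical:

```python
import torch

def make_pair_tensor(img_a, img_b):
    """Stack two 300x240 silhouette images vertically into one
    600x240 array, as required by the network's input layer."""
    assert img_a.shape == (300, 240) and img_b.shape == (300, 240)
    return torch.cat([img_a, img_b], dim=0)

a, b = torch.zeros(300, 240), torch.ones(300, 240)
pair = make_pair_tensor(a, b)
print(pair.shape)  # → torch.Size([600, 240])
```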

The resulting data set (see 3.2) was of size 11,166. For the training experiments, 8000 image pairs were used; for testing purposes another 2000 image pairs were selected. All selection was done completely at random, except for the choice between same-person and different-person pairs.

Though it could be argued that images could be used more than once in the situation illustrated above, it should be noted that for this research duplicated single images are of no importance. Within digital forensics, matching against a database is performed often, so it is not a problem if an image is represented in the training data more than once. It would of course be a problem if the same image pair appeared both during the training epochs and in the test situation, yet due to the random selection of image pairs the chances of this are slim, though the existence of the same image pair in both input sets cannot be ruled out.

For training purposes the resulting final input is an 8000 x 600 x 240 tensor; a separate 8000 x 1 tensor was constructed which consisted of the y values to train towards. During training the batch size was set to 50 and 1000 epochs were necessary to gain the best score.
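A minimal training loop consistent with this description might look as follows. The optimiser choice (Adam) and learning rate are assumptions, as the thesis does not state them here; only the minibatching of 50 pairs and the MSE loss follow the text:

```python
import torch
from torch import nn

def train(model, x, y, epochs=1000, batch_size=50, lr=1e-3):
    """Minibatch training: slices of `batch_size` pairs, MSE loss,
    repeated for the given number of epochs."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimiser
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for start in range(0, len(x), batch_size):
            xb = x[start:start + batch_size]
            yb = y[start:start + batch_size]
            optimiser.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimiser.step()
    return model
```

With the thesis data this would be called with the 8000 x 600 x 240 input tensor and the matching 8000 x 1 label tensor.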

To test the resulting neural network another, smaller, tensor was created. This tensor consisted of 2000 image pairs (2000 x 600 x 240), along with a tensor with the actual outcomes (2000 x 1) to compare the predictions with. The images used were again chosen randomly from the data set.

2.4.2 Architecture

The neural network itself consists of an input layer where a picture is provided as a 2D array, followed by 3 convolution layers, after which 12 fully connected layers are set up. The output layer consists of a single neuron. Both the convolutional layers and the fully connected layers will be further explained next.

2.4.2.1 Convolution layers

For this network, three separate convolution layers are used, which are connected to each other using one fully connected layer. Three convolution layers were chosen because this results in a receptive field that captures the features (shoulder width, hip width, length, etc.) well. More convolution layers would have captured more than the features this research is investigating, due to a receptive field that is larger than a feature, yet fewer convolution layers would lead to a small receptive field that does not fully capture the features. How a convolution layer works can best be explained graphically, see figure 7.

Figure 7: The convolution between I and K, where K is multiplied over several subsections of I, resulting in the final matrix I ∗ K.

Convolution as shown in figure 7 was used because it can average out a small part of an image, combining all averages into one smaller image. This results in an image where not every pixel itself has to be used, but rather one where an importance value is assigned to a combination of pixels, resulting in a better representation of the picture as a whole. The convolution layers were applied using the PyTorch package for Python [26], which includes pre-programmed functions for convolutions. This package takes care of the basic functionality; the kernel size, however, had to be determined by the user. It was found that a kernel size of 3 is optimal for this convolutional neural network (CNN); the kernel size was identical in all 3 convolution layers. After each convolution, max pooling was used to downsample the outcome.
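The sliding-window operation of figure 7 can be reproduced with PyTorch's `conv2d`; the input and kernel values below are chosen purely for illustration (strictly speaking `conv2d` computes a cross-correlation, which for this illustration amounts to the same sliding-window operation):

```python
import torch
import torch.nn.functional as F

# A 4x4 input I and a 2x2 kernel K (values chosen for illustration).
I = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
K = torch.ones(1, 1, 2, 2)

# Sliding K over every 2x2 subsection of I yields a 3x3 output,
# each entry summarising (here: summing) one local neighbourhood.
out = F.conv2d(I, K)
print(out.shape)  # → torch.Size([1, 1, 3, 3])
```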

Figure 8: This graphic example of max pooling shows the result of a 2x2 max pool over a matrix.

Max pooling is a simplifying algorithm that is used on images, see figure 8. Max pooling takes every submatrix of a predetermined size and keeps only the maximum value found in that submatrix. For the CNN constructed for this research, the max pooling size was 2, identical along all three convolution layers. These submatrices do not overlap, as can also be seen in figure 8. The figure also shows that the max pooling step creates a smaller version of the image, in this case twice as small, where the maximum value of each submatrix is the value used in the max-pooled matrix. Max pooling is used because the abstracted form it results in prevents neural networks from over-fitting; another advantage is that it reduces the computational power needed for the remaining calculations [23]. The convolution layers of the CNN all have the same order: they start by performing a convolution over every column of the matrix presented, which is then fully connected to the next convolution layer. Finally the max pooling step is performed, which leads to the second convolution layer.

Figure sources: figure 7 from https://github.com/PetarV-/TikZ/tree/master/2D%20Convolution; figure 8 from https://computersciencewiki.org/index.php/Max-pooling_/_Pooling

2.4.2.2 Fully connected layers

Figure 9: A graphic representation of fully connected neurons.

The constructed neural network does not only consist of the 3 convolutional layers previously discussed (see 2.4.2.1), but also of 12 fully connected layers. These last 12 layers follow immediately after the last convolution layer.

A fully connected layer in a neural network means that every neuron present in a layer is connected to every single neuron in the previous layer and in the next layer, as can also be seen in figure 9. This is the most computationally expensive type of connection between layers.

The neural network consists of 12 such layers that each hold 1000 neurons, except for the last layer, which consists of only 1 neuron, since the output should be a single number.

2.4.3 Output

The output layer of the neural network consists of 1 single neuron which is fully connected to all 1000 neurons in the previous layer. The output consists of only 1 neuron because the output of the neural network is a single number, which represents the similarity of the input. This output is a number anywhere between 0 and 1 and can be regarded as a similarity percentage: an output of 0.53 can be interpreted as 53% similarity in posture between the two images that were the input.


3 | Results

In this section the results of all sub questions will be discussed thoroughly. Each sub question will be discussed individually.

3.1 Prediction analysis

“How can a prediction for posture similarity be analysed?”

The analysis of the given posture similarities can be done using 3 different analysis methods:

1. Correctness score
2. MSE loss
3. Similar-dissimilar ratio

Correctness score gives an overview of how well the CNN can predict postures overall, whereas the MSE loss gives an indication of how far off the prediction is from the known similarity. The similar-dissimilar ratio gives insight into the bias with which the prediction was made.

3.2 Usable data set

“What does a usable data set for the problem at hand look like?”

Due to time constraints it was not possible to clean the entire database, even in its restricted form explained in section 2.2. Therefore, the database was used up to and including person 062. Since not all images within this limited data set conformed to the restrictions specified, not all images from the 62 people were usable.

Some of the subjects had to be completely removed from the data set, either because all of their images were unfit for the CNN and thus no images were left after cleaning, or because fewer than 100 images were left after eliminating all unusable images. The subjects that were completely eliminated are: 002, 003, 005, 012, 016, 020, 022, 024, 028, 033, 036, 048, 051, 052, 054, 055, 058 and 061.

Of the subjects left in the data set, some images had to be eliminated as well. An overview of the number of images used per subject can be seen in figure 10. The final data set consists of 11,166 images, coming from 44 different subjects. As can be seen in figure 10, the difference in the number of pictures per subject is immense: some subjects barely touch the 100 mark, whilst others reach almost 500 images. Though this can cause bias in the neural network, the decision was made to work with this data set, because eliminating more subjects would create an unvaried data set, whereas including more subjects would lead to serious under-representation.

Figure 10: In this figure the number of images left after cleaning is depicted. For every subject this number is represented as a bar.

3.3 Human predictions

“How well can humans predict posture similarities?”

The survey that was distributed (for an explanation, see 2.3) was answered by a total of 120 different people. For details on the demographic information of these participants, figure 11 can be consulted; this figure shows the division into age groups and education levels.

Figure 11: The rightmost diagram shows the division of ages among the participants; the leftmost diagram shows the division in education levels of the participants. The education level asked for was the highest education level reached, independent of whether this education was finished.


Figure 11 shows that by far the most people were between their 21st and 30th birthday. All other age groups were represented similarly, except for the >70 group, to which one participant belongs. The education levels in the diagram show that more than 50% of the participants are doing either an HBO or a WO bachelor; of the remaining groups, most participants followed a WO master. The female-male ratio of this survey was 67.5%-32.5% respectively. The results of the 20 posture questions will be discussed next.

On the 20 posture questions, all participants together scored an average correctness of 57.9%. Figure 12 shows an overview of the correctness scores for the different questions.

Figure 12: This graph shows the percentage of people answering correctly on the y-axis, whilst the question number is displayed on the x-axis.

Figure 12 shows the correctness score for each question as a bar; the running average over all questions is computed and shown as a coral coloured line. The correctness can also be analysed per participant. This correctness was plotted again, and can be seen in figure 13.

Figure 13 shows the number of people reaching a certain correctness. The maximum correctness reachable would be 20, since participants would then have answered all questions correctly. It can be noted that not a single participant scored above 16 or under 7, alongside the noticeable fact that there is a clear difference in correctness score amongst participants. The average correctness over all questions is depicted as the coral coloured line.

Another analysis that can be conducted is one that gathers information on the similar-dissimilar ratio (for an explanation on this see 2.1). In figure 14 the similar-dissimilar ratio can be seen per person.


Figure 13: This graph shows the number of times each person answered correctly. In total 20 questions were presented to the participants.

Figure 14: This graph shows the percentage of people that answered ’similar’, on the y-axis, to each of the questions, displayed on the x-axis

Figure 14 shows the number of times participants answered 'similar', where the x-axis shows the number of 'similar's filled in, whilst the y-axis shows how many participants had that score. It can be gathered that some of the participants have a bias towards scoring postures as similar, whereas others have a bias towards scoring them dissimilar. Some participants answered all 20 questions identically, whilst others interchanged their answers a lot. The average similar-dissimilar ratio over all participants was 0.47, which is depicted as a coral coloured line in the figure.

3.4 Neural network predictions

“How well can a neural network predict posture similarities?”

The constructed CNN (for details see 2.4) has been presented with the same image pairs that were used for the online survey, as well as 2000 randomly selected image pairs. The results for both these sets will be presented next. On the image pairs that were presented in the survey, the CNN's correctness score is 18 out of 20, which is 90%; the similar-dissimilar ratio was 0.5191 and the MSE loss on these 20 image pairs was 0.0879. For an explanation of these analyses see 2.1.

The data set with 2000 randomly selected image pairs resulted in a correct classification of 1738 of the 2000 pairs, which gives 86.9% correctness. The distribution of these 2000 predictions can be seen in figure 15. This model had an overall MSE loss of 0.1112 on all 2000 image pairs, whilst scoring a similar-dissimilar ratio of 0.5137.

Figure 15: This graph shows the prediction of the CNN per image pair. A green dot is one that is correctly estimated, whilst a red dot represents an incorrect estimation.


In figure 15 it can be seen that the CNN is far more likely to predict a value close to 0 or close to 1 than a value in between. Another observation is that the mistakes made by the CNN are quite evenly distributed among the predictions. The false accept rate of the constructed network is 0.0505, whilst the false reject rate is 0.0825, which shows that there is no large difference between the false accepts and the false rejects.
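The two error rates above could be computed as sketched below; how the thesis normalises them (here: by the total number of predictions rather than per class) is an assumption, and all names are hypothetical:

```python
def far_frr(predictions, labels):
    """False accept rate: dissimilar pairs (label 0) scored as similar.
    False reject rate: similar pairs (label 1) scored as dissimilar.
    Both are taken relative to the total number of predictions."""
    n = len(predictions)
    far = sum(1 for p, y in zip(predictions, labels) if p > 0.5 and y == 0) / n
    frr = sum(1 for p, y in zip(predictions, labels) if p <= 0.5 and y == 1) / n
    return far, frr

# One false accept (0.6 on a dissimilar pair) and one false reject
# (0.3 on a similar pair) out of four predictions.
print(far_frr([0.9, 0.6, 0.2, 0.3], [1, 0, 0, 1]))  # → (0.25, 0.25)
```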


4 | Conclusion and Discussion

In this section the conclusion and discussion for all sub questions will be covered thoroughly. At the end of the section the overall conclusion will be given, as well as the discussion of the main question.

4.1 Prediction analysis

“How can a prediction for posture similarity be analysed?”

The analysis of predicted posture similarity can be done through 3 different measures. This research used the MSE loss, the correctness score and the similar-dissimilar ratio as measures to analyse the predictions. These methods were capable of giving an indication of how well the constructed model can predict posture similarity; during training they were also able to show whether the network was improving or getting worse.

The three measures above were analysis methods fit for the scope of this research. Nevertheless, they are all relatively high-level methods of analysis. This research has not gone beyond these high-level methods, although lower-level ones could give more insight into the model. They could potentially reveal when an image pair is incorrectly classified, and inform the user what common features misclassified image pairs share. Therefore future research should focus on using low-level analysis methods for the analysis of the predictions.

4.2 Usable data set

“What does a usable data set for the problem at hand look like?”

The data set that was used for implementing the model was a subset of the CASIA-B data set [37]. This subset was sufficiently large and varied for the scope of this research.

Future research is recommended to include more subjects in its data set, in order to get more variance in the data set. Future research could also expand this research by investigating ways to make the presented model work in real-life settings. This would include clothing details and personal attributes in the images used. This extra information could potentially enrich the model and increase its accuracy, yet it could also make the model learn based on information that is not related to posture. This possibility is one that should be further investigated.

Future research should also address the subjectivity of the data used for this research, which is due to the manual cleaning and interpretation of the data set. Future research should aim to decrease this subjectivity as much as possible, so that the CNN learns from less subjective data.


4.3 Human predictions

“How well can humans predict posture similarities?”

Taken together, the participants have shown themselves to be incapable of scoring posture similarities. However, fluctuations in correctness per person mean that some of the participants were, by themselves, quite good at it. It can also be concluded that, though humans as a whole do not have a concerning bias (the bias was 0.47 amongst all participants), the deviations in bias between individuals are massive.

Future research could investigate whether the best performing people in a survey are capable of learning to predict posture similarity accurately enough for use in a court of law. Hints that this is possible are given by the rising cumulative average; however, it should be noted that this rise can also be due to the difficulty of the image pairs presented.

4.4 Neural network predictions

“How well can a neural network predict posture similarities?”

This research concludes that neural networks can score posture similarities relatively well. With a correctness of 86.9% the scores on posture similarity are relatively reliable, though they cannot hold their value in a court of law yet, as is desirable within digital forensics.

Future research could focus on different CNN architectures to score posture similarities. It can be argued that a more complicated model is needed when fewer limitations are set on the data set. Lifting several limitations of the data set (as mentioned in 4.2) would require the CNN to be implemented in a completely different way, for many parameters would change with this. The possibilities seem massive, so models capable of handling more real-life situations should be attempted by lifting some of the restrictions.

4.5 Accurate predictions

“To what extent is a neural network accurately able to give similarity scores for postures depicted in two images?”

A CNN has proven to be capable of scoring posture similarities accurately. With a correctness of 86.9% it is clear that the potential of using neural networks for scoring posture similarity is decent. The CNN constructed in this research also achieved a far better correctness than untrained human participants could, with an equal error rate of 0.133, which shows that automatic posture similarity prediction would enhance the current possibilities with regard to posture similarity.

It should however be noted that further research should be conducted into the usability of this identification and re-identification method in practice. This research should then focus on a method that can be applied throughout the field and is sound enough to be used in a court of law.

The research presented in this paper has shown that there is a possibility for posture to be used in digital forensics, but it was conducted on a small scale with heavily abstracted data. It should therefore be researched whether posture similarity can also be scored with high correctness when less, or ideally no, abstraction is applied.


[1] M. Bartanen and R. Littlefield. Forensics in America: A history. Rowman & Littlefield, 2013.

[2] B. Batagelj and F. Solina. “Biometry from surveillance cameras-forensics in practice”. In: Proceedings of the 20th Computer Vision Winter Work-shop (Feb. 2015).

[3] M. Bedeli, Z. Geradts, and E. van Eijk. “Clothing identification via deep learning: forensic applications”. In: Forensic Sciences Research 3 (Oct. 2018), pp. 1–11. doi: 10.1080/20961790.2018.1526251.

[4] N. Beebe. “Digital Forensic Research: The Good, the Bad and the Unad-dressed”. In: Advances in Digital Forensics V. Ed. by G. Peterson and S. Shenoi. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 17–36. isbn: 978-3-642-04155-6. doi: 10 . 1007 / 978 - 3 - 642 - 04155 - 6 _ 2. url: https://doi.org/10.1007/978-3-642-04155-6_2.

[5] C. BenAbdelkader, R. Cutler, and L. Davis. “Stride and cadence as a biometric in automatic person identification and verification”. In: Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition. May 2002, pp. 372–377. doi: 10.1109/AFGR.2002.1004182. [6] C. BenAbdelkader et al. “EigenGait: Motion-Based Recognition of People Using Image Self-Similarity”. In: Audio- and Video-Based Biometric Person Authentication. Ed. by J. Bigun and F. Smeraldi. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001, pp. 284–294. isbn: 978-3-540-45344-4. doi: 10.1007/3-540-45344-X_42.

[7] I. Birch et al. “Terminology and forensic gait analysis”. In: Science & Justice 55.4 (July 2015), pp. 279–284. issn: 1355-0306. doi: 10.1016/j.scijus.2015.03.002.

[8] A. F. Bobick and A. Y. Johnson. “Gait recognition using static, activity-specific parameters”. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001. Vol. 1. Dec. 2001, pp. I–I. doi: 10.1109/CVPR.2001.990506.

[9] I. Bouchrika et al. “On Using Gait in Forensic Biometrics”. In: Journal of Forensic Sciences 56.4 (July 2011), pp. 882–889. issn: 1556-4029. doi: 10.1111/j.1556-4029.2011.01793.x.

[10] R.T. Collins, R. Gross, and J. Shi. “Silhouette-based human identification from body shape and gait”. In: Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition. May 2002, pp. 366– 371. doi: 10.1109/AFGR.2002.1004181.

[11] J. P. Foster, M. S. Nixon, and A. Prügel-Bennett. “Automatic gait recognition using area-based metrics”. In: Pattern Recognition Letters 24.14 (2003), pp. 2489–2497. issn: 0167-8655. doi: 10.1016/S0167-8655(03)00094-1.

[12] M. Goffredo et al. “Self-Calibrating View-Invariant Gait Biometrics”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[13] P. J. Grother, M. L. Ngan, and K. K. Hanaoka. Ongoing Face Recognition Vendor Test (FRVT) Part 2: Identification. Tech. rep. Nov. 2018. doi: 10.6028/NIST.IR.8238.

[14] M. Hora et al. “Body size and lower limb posture during walking in humans”. In: PloS one 12.2 (2017), e0172112. doi: 10.1371/journal.pone.0172112.

[15] E. S. Jaha and M. S. Nixon. “From Clothing to Identity: Manual and Automatic Soft Biometrics”. In: IEEE Transactions on Information Forensics and Security 11.10 (Oct. 2016), pp. 2377–2390. issn: 1556-6013. doi: 10.1109/TIFS.2016.2584001.

[16] A. K. Jain and S. Z. Li. Handbook of face recognition. Springer, 2011. isbn: 978-0-85729-931-4. doi: 10.1007/978-0-85729-932-1.

[17] A. Kale et al. “Identification of humans using gait”. In: IEEE Transactions on Image Processing 13.9 (Sept. 2004), pp. 1163–1173. issn: 1057-7149. doi: 10.1109/TIP.2004.832865.

[18] T. Kanade. Picture Processing System by Computer Complex and Recognition of Human Faces. Nov. 1973.

[19] R. Kumar, A. Banerjee, and B. C. Vemuri. “Volterrafaces: Discriminant analysis using Volterra kernels”. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. June 2009, pp. 150–155. doi: 10.1109/ CVPR.2009.5206837.

[20] S. Lawrence et al. “Face recognition: a convolutional neural-network approach”. In: IEEE Transactions on Neural Networks 8.1 (Jan. 1997), pp. 98–113. issn: 1045-9227. doi: 10.1109/72.554195.

[21] R. McKemmish. What is forensic computing? Australian Institute of Criminology Canberra, June 1999. isbn: 0642241023. url: http://www.aic.gov.au/publications/tandi/tandi118.html.

[22] M. P. Murray, A. B. Drought, and R. C. Kory. “Walking patterns of normal men”. In: JBJS 46.2 (Mar. 1964), pp. 335–360.

[23] J. Nagi et al. “Max-pooling convolutional neural networks for vision-based hand gesture recognition”. In: 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA). Nov. 2011, pp. 342– 347. doi: 10.1109/ICSIPA.2011.6144164.

[24] M. S. Nixon, T. Tan, and R. Chellappa. Human Identification Based on Gait. eng. International Series on Biometrics ; 4. Boston, MA: Springer US, 2006. isbn: 1-282-82339-6.

[25] G. Palmer. “A road map for digital forensic research”. In: First Digital Forensic Research Workshop (Jan. 2001), pp. 27–30.

[26] Adam Paszke et al. “Automatic differentiation in PyTorch”. In: NIPS 2017 Workshop Autodiff (Oct. 2017).

[27] M Pollitt. “Computer forensics: An approach to evidence in cyberspace”. In: Proceedings of the Eighteenth National Information Systems Security Conference. 1995, pp. 487–491.

[28] M. Pollitt. “Digital Forensics as a Surreal Narrative”. In: Advances in Digital Forensics V. Ed. by G. Peterson and S. Shenoi. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 3–15. isbn: 978-3-642-04155-6. doi: 10.1007/978-3-642-04155-6_1. url: https://doi.org/10.1007/978-3-642-04155-6_1.

[29] S. Sarkar et al. “The HumanID Gait Challenge Problem: Data Sets, Performance, and Analysis”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 27.2 (Feb. 2005), pp. 162–177. issn: 0162-8828. doi: 10.1109/TPAMI.2005.39.

[30] L. Sirovich and M. Kirby. “Low-dimensional procedure for the characterization of human faces”. In: J. Opt. Soc. Am. A 4.3 (Mar. 1987), pp. 519–524. doi: 10.1364/JOSAA.4.000519. url: http://josaa.osa.org/abstract.cfm?URI=josaa-4-3-519.

[31] J. Slay et al. “Towards a Formalization of Digital Forensics”. In: Advances in Digital Forensics V. Ed. by G. Peterson and S. Shenoi. Berlin, Heidel-berg: Springer Berlin Heidelberg, 2009, pp. 37–47. doi: 10.1007/978-3-642-04155-6_3. url: https://doi.org/10.1007/978-3-642-04155-6_3.

[32] A. Sundaresan, A. RoyChowdhury, and R. Chellappa. “A hidden Markov model based framework for recognition of humans from gait sequences”. In: Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429). Vol. 2. Sept. 2003, pp. II–93. doi: 10.1109/ICIP.2003. 1246624.

[33] E. Tassone, G. West, and S. Venkatesh. “Temporal PDMs for gait classification”. In: Object recognition supported by user interaction for service robots. Vol. 2. Aug. 2002, 1065–1068 vol. 2. doi: 10.1109/ICPR.2002.1048489.

[34] M. Turk and A. Pentland. “Eigenfaces for Recognition”. In: Journal of Cognitive Neuroscience 3.1 (1991), pp. 71–86. doi: 10.1162/jocn.1991.3.1.71.

[35] I. R. Vega and S. Sarkar. “Statistical motion model based on the change of feature relationships: human gait-based recognition”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 25.10 (Oct. 2003), pp. 1323–1328. issn: 0162-8828. doi: 10.1109/TPAMI.2003.1233906.

[36] D. K. Wagg and M. S. Nixon. “On Automated Model-Based Extraction and Analysis of Gait”. In: 6th International Conference on Automatic Face and Gesture Recognition. Ed. by D. Azada. Event dates: 17–19 May 2004. 2004, pp. 11–16. url: https://eprints.soton.ac.uk/259374/.

[37] S. Yu, D. Tan, and T. Tan. “A Framework for Evaluating the Effect of View Angle, Clothing and Carrying Condition on Gait Recognition”. In: 18th International Conference on Pattern Recognition (ICPR’06). Vol. 4. Aug. 2006, pp. 441–444. doi: 10.1109/ICPR.2006.67.

[38] R. Zhang, C. Vogler, and D. Metaxas. “Human Gait Recognition”. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop. June 2004, pp. 18–18. doi: 10.1109/CVPR.2004.361.
