
Face recognition using Hidden Markov Models

by

Johan Stephen Simeon Ballot

Thesis presented at the University of Stellenbosch in partial fulfilment of the requirements for the degree of Master of Science in Electronic Engineering with Computer Science

Department of Electrical & Electronic Engineering
University of Stellenbosch
Private Bag X1, 7602 Matieland, South Africa

Study leaders: Prof. J.A. du Preez and Prof. B.M. Herbst

April 2005

Copyright © 2005 University of Stellenbosch. All rights reserved.

Declaration

I, the undersigned, hereby declare that the work contained in this thesis is my own original work and that I have not previously in its entirety or in part submitted it at any university for a degree.

Signature: ..........................  (J.S.S. Ballot)

Date: ..........................

Abstract

Face recognition using Hidden Markov Models

J.S.S. Ballot
Department of Electrical & Electronic Engineering
University of Stellenbosch
Private Bag X1, 7602 Matieland, South Africa

Thesis: MScEng (E&E + CS), April 2005

This thesis relates to the design, implementation and evaluation of statistical face recognition techniques. In particular, the use of Hidden Markov Models in various forms is investigated as a recognition tool and critically evaluated. Current face recognition techniques are very dependent on issues like background noise, lighting and the position of key features (i.e. the eyes, lips etc.). Using an embedded Hidden Markov Model along with spectral-domain feature extraction techniques, it is shown that these dependencies may be lessened while high recognition rates are maintained.

Uittreksel

Gesigsherkenning met behulp van Verskuilde Markov Modelle

J.S.S. Ballot
Departement Elektriese & Elektroniese Ingenieurswese
Universiteit van Stellenbosch
Privaatsak X1, 7602 Matieland, Suid-Afrika

Tesis: MScIng (E&E + RW), April 2005

Hierdie tesis handel oor die ontwerp, implementering en bespreking van statistiese gesigsherkenningstegnieke. Spesifiek die gebruik van Verskuilde Markov Modelle in verskeie vorme is as herkenningstegniek ondersoek en krities geëvalueer. Huidige gesigsherkenningstegnieke word meestal beperk deur faktore soos agtergrond, beligting en posisie van sleutel-kenmerke (soos byvoorbeeld oë, lippe ens.). Deur spesifiek 'n geïntegreerde Verskuilde Markov Model te gebruik in samewerking met frekwensiegebied-kenmerkdata, word getoon dat genoemde beperkings verminder word terwyl hoë herkenningsvermoë behou word.

Acknowledgements

I would like to express my sincere gratitude to the following people and organisations who have contributed to making this work possible:

- Professors du Preez and Herbst, for being enthusiastic study leaders and staying excited about this thesis even when at times I was not.
- The National Research Foundation, who funded most of this work through the grant holder linked program.
- My mother and father, who for 6 years provided the best bursary any student could hope for. Not to mention the emotional support and unconditional love!
- My friend Pieter Rautenbach, for being a wall of ideas and for always giving me an honest opinion or two about my project. Also for helping on the segmentation code which provided the much needed artistic flavour in the sea of analytical despair.
- The lab coffee machine, for obvious reasons.

Contents

Declaration  ii
Abstract  iii
Uittreksel  iv
Acknowledgements  v
Contents  vi
List of Figures  ix
List of Tables  xi
Nomenclature  xii

1 Introduction  1
  1.1 Motivation  1
  1.2 Background  3
  1.3 Literature synopsis  5
  1.4 Objectives  7
  1.5 Contributions  7
  1.6 Overview  8

2 Literature study  11
  2.1 First efforts in face recognition  11
  2.2 Hidden Markov Models enter the face recognition race  11
  2.3 Extending the extensible  12
  2.4 The latest HMM flavours used in face recognition  14
  2.5 Summary  16

3 Face databases and their peculiarities  17
  3.1 Introduction  17
  3.2 Possible issues  19
  3.3 Summary  23

4 Feature extraction methods  24
  4.1 To feature or not to feature, that is the question  24
  4.2 Pixel intensities  25
  4.3 An introduction to the Discrete Cosine Transform  27
  4.4 The Discrete Cosine Transform  28
  4.5 Giving DCT features an extra boost of robustness  32
  4.6 Comparison of methods and summary  34

5 Constructing the Hidden Markov Models  36
  5.1 A brief introduction to HMMs  36
  5.2 HMM background  36
  5.3 Model Configurations  39

6 Implementation  43
  6.1 Practical aspects  43
  6.2 The HMM configurations  46

7 Experimental investigation  53
  7.1 Experiments on the ORL database  53
  7.2 Experiments on the XM2VTS database  57
  7.3 Summary of classification results  59
  7.4 Face segmentation  62
  7.5 Summary  64

8 Conclusions and recommendations  67
  8.1 Objectives  67
  8.2 Conclusions  68
  8.3 Possible improvements and Recommendations  70

A The ORL database  74
  A.1 The complete ORL database  74

B Solution to the evaluation problem  75
  B.1 The forward-backward procedure  75
  B.2 The Viterbi algorithm  76

C Face image segmentations  77
  C.1 Examples of segmentations of face images in the XM2VTS database  77

Bibliography  82

List of Figures

1.1 Information flow in recognising human faces  1
2.1 A one dimensional HMM for face recognition  12
2.2 A one dimensional HMM with end-of-line states  13
2.3 An embedded HMM for face recognition  14
3.1 Examples of pictures from the ORL database  19
3.2 Examples of pictures from the University of Surrey XM2VTS database  20
3.3 Histogram of pixel intensities of bottom left image of figure 3.2  21
3.4 Example of differences between images of the same class in the XM2VTS database  22
3.5 Histogram of pixel intensities of top left image of figure 3.1  23
4.1 Enlarged grey scale picture of matrix A  26
4.2 Histogram of matrix A containing grey scale values  27
4.3 Example face from the University of Surrey, XM2VTS database (2002)  29
4.4 Ordering of DCT coefficients for N=M=4  30
4.5 Reconstructions of figure 4.3 using DCT coefficients  31
5.1 Standard left-to-right, non-ergodic HMM  37
5.2 Vertical top-to-bottom HMM modelling a face  39
5.3 Embedded HMM modelling a face  41
6.1 Passing of features from the feature domain to an HMM configuration  47
6.2 HMM configuration I topology  48
6.3 Average of the DCT means of the ORL database  49
6.4 Average of the DCT means of the XM2VTS database  50
6.5 Average of the DCT-mod2 means of the ORL database  50
6.6 Average of the DCT-mod2 means of the XM2VTS database  51
6.7 HMM configuration II topology  52
7.1 Wrong classifications on the ORL database using pixel values as features  55
7.2 Examples of wrongly classified face images from the XM2VTS database  62
7.3 Segmentation of an ORL face using DCT-mod2 features  63
7.4 Mapping segmentation from the DCT-mod2 domain to the pixel domain  64
7.5 Segmentation of a XM2VTS face using DCT-mod2 features  65
8.1 Ultimate face classification system  73
A.1 The Olivetti Research Laboratory, ORL database (1994)  74
C.1 Segmentation of a XM2VTS face image using DCT-mod2 features  78
C.2 Segmentation of a XM2VTS face image using DCT-mod2 features  79
C.3 Segmentation of a XM2VTS face image using DCT-mod2 features  80
C.4 Segmentation of a mystery face image using DCT-mod2 features  81

List of Tables

3.1 Database comparison  18
4.1 Comparisons of feature extraction methods  34
4.2 Classification accuracy on small scale  35
6.1 Comparable partitioning of databases  44
7.1 Summary of classification results, configuration I  54
7.2 Summary of classification results, configuration II  54
7.3 Best classification results from literature  56
7.4 Our best classification results on the ORL database  57
7.5 Summary of classification results, configuration I  58
7.6 Summary of classification results, configuration II  58
7.7 Best classification results of Zhang et al. (2004)  59
7.8 Our results using configuration II and DCT-mod2 feature extraction  59

Nomenclature

Constants:
π = 3,141 592 653 589 793 238 462 643 383 279 5

Abbreviations:
HMM   Hidden Markov Model
HMMs  Hidden Markov Models
AI    Artificial Intelligence
GMM   Gaussian Mixture Model
PCA   Principal Component Analysis
LDA   Linear Discriminant Analysis
EM    Expectation Maximisation
PDF   Probability density function
DCT   Discrete Cosine Transform
IDCT  Inverse Discrete Cosine Transform
JPEG  Joint Photographic Experts Group
AC    Alternating Current
DC    Direct Current

General variables:
x  A vector x
N  Dimension
N(µ, Σ)  Gaussian pdf with mean (vector) µ and covariance (matrix) Σ

Variables referring to HMMs:
Λ = {a, f}  An HMM with transition probabilities a and probability density functions f
f_i(x | S_t, Λ)  Probability density function of state i, quantifying the similarity of a feature vector x to the state S_t = i given the model
X_1^T  Observation sequence from t = 1 to t = T

Chapter 1
Introduction

1.1 Motivation

In a world where security has become a very high priority and where there is no tolerance for human error in this regard, computers and especially software have developed to such an extent that they are able to distinguish one human from another. Whether this is via fingerprint, voice or other physical characteristics, the uniqueness of each and every human is exploited to build robust computerised recognition systems which should in theory be more reliable and more cost effective than employing a person to do the same work. This thesis focuses on face recognition, especially using trained statistical models to distinguish between a variety of individuals. The possibilities for applications are endless. Especially in an era of global paranoia in terms of personal safety, the high technology security field would be the most exploitable for its application.

Data → Sensor → Recogniser → Result
Figure 1.1: Information flow in recognising human faces.

Recognising one human face from another is a process which happens

sub-consciously in a human being. The flow of information in a typical recognition process is shown in figure 1.1. In a human system the process can be summarised as follows:

- Information is passed from the sensors to the recogniser.
- In the recogniser a database of hundreds of thousands of faces is scanned in an instant and matched against the data obtained from the sensors.
- The result is a recognition success or failure.

This system is highly effective in humans. One of the problems in copying this process for computerised applications is that we do not know how the human brain (in computer terms "wetware") does the recognising. What features are extracted from the test data? How is the massive internal database scanned in a fraction of a second? These are all questions which remain largely unanswered even with currently available technology. To implement such a system, an artificial visual recogniser tries to simulate this process so natural to humans each and every day. In such an artificial system, referring to figure 1.1, the sensor is a camera of sorts, the recogniser some software implemented on some hardware, and finally specialised software makes the decision as to whether a subject is recognised or not. The problem is that in the "wetware" human system, a face which is already in the database will almost certainly be recognised, but in an artificial system this is not the case. An artificial system must be trained to recognise certain known features and it must also be designed to be robust in terms of eliminating background noise. In this respect the filtering capacity of the human brain is still an unrivalled technology. To summarise, the four basic problems in an artificial recogniser system are:

- Choosing robust features to interpret

- Choosing a model for the recogniser
- Running a classification experiment using the chosen model
- Interpreting the results

The construction of a computerised recogniser can be seen as a special case of creating some form of artificial intelligence (AI). A computer system is set up to perform a task usually reserved for humans, and therefore this exercise in modelling is also, to a certain extent, an investigation into the understanding of the human brain. Hopefully we can furthermore show that the AI can generate both consistent and satisfactory results.

1.2 Background

To recognise humans, three basic paths and one hybrid path could be followed, namely:

- Chemical
- Audio
- Visual
- Hybrid

It could be argued that the most effective recogniser is the chemical model. It is, however, probably the most impractical, since humans tend to be reluctant to part with a sample of their DNA! Advances in speech recognition technology have shown such recognition systems to have substantial use. But a person about to be recognised must still be able or willing to speak. Security based applications could furthermore require a certain catch-phrase or language to be spoken. A visual recogniser is a subtle recogniser; it can take a photo, process it and recognise a subject. All of these steps can be done in an instant and, if necessary, undercover. This is one of

the reasons why dependable face recognition technology is a very attractive proposition for security based applications. A hybrid recogniser combines one or more of the above techniques to improve recognition rates. Roughly stated, in choosing a recognition system a trade-off between ease of implementation and practicality exists.

The usual problems that a face recognition system needs to solve are (Muller (2002)):

- Known/unknown
- Classification
- Face verification
- Full identification

In the first problem the system needs to identify whether a specific face belongs to some group of known faces. This is typically encountered in access control or security applications. Secondly, classification is when a decision has to be made about the identity of a given face by assigning its identity to a group of known faces. This means that if there are a couple of faces of persons X, Y and Z in the known group, would the given face most likely be person X, Y or Z? With face verification the given face is claimed to be of identity X. The system needs to verify whether this is correct. Typically this is also used for security type applications. This can be viewed as a special case of the first problem. Full identification is used to determine whether a face is known and then to classify it. This is a combination of the first and second problems. This thesis investigates the classification problem.

The first step in designing a face recognition system is choosing the model for the recogniser. Hidden Markov Models (HMMs) have proved to be quite a flexible statistical modelling tool for this purpose. In this thesis HMMs are investigated as a solution to the second of the listed four basic problems in artificial recognition systems. A brief overview of Hidden Markov Model theory is
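The classification task described above reduces to scoring a face against each known identity's model and picking the best match. A minimal sketch, where the identities and scoring functions are illustrative stand-ins for trained per-person models:

```python
def classify(observations, models):
    """Closed-set classification: score the observation sequence under
    every known person's model and return the identity whose model
    gives the highest log-likelihood."""
    scores = {name: loglik(observations) for name, loglik in models.items()}
    return max(scores, key=scores.get)

# Toy scoring functions standing in for trained per-person HMMs.
models = {
    "X": lambda obs: -12.0,
    "Y": lambda obs: -7.5,   # best (least negative) log-likelihood
    "Z": lambda obs: -30.1,
}
print(classify([0.1, 0.2], models))  # -> Y
```

In the known/unknown and verification problems the same scores would instead be compared against a threshold.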

given later on, as well as why HMMs could form the basis of quite a robust recognition mechanism.

To summarise the scope covered in this thesis, the following problems are addressed:

- Sensible preprocessing of face images in a given database
- Construction of a suitable HMM model to recognise the faces in the database
- Classification experiments
- Interpreting the results of a classification experiment
- Comparing the results to published results
- Segmentation of facial images

The relevant concepts of this study are therefore the peculiarities of the available databases, the modelling using HMMs and finally the achieved results and their interpretation.

1.3 Literature synopsis

Several approaches to face recognition without HMMs may be found in the literature. These approaches are summarised and discussed in depth in Muller (2002). This thesis focuses on work done on recognising faces using HMMs. The most notable first efforts were made by Samaria & Young (1994). These first HMMs used in face recognition had a straightforward topology, as can be seen in figure 2.1. These HMMs typically had five states, each state modelling a specific area of a face image. Each state of such an HMM contains a single multivariate Gaussian distribution as density function, and pixel intensity values are used as feature vectors. A given image matrix of pixel intensity values is scanned in overlapping blocks from the top of the image to the bottom to train the HMM.
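The overlapping top-to-bottom scan just described can be sketched as follows. The block height and overlap are illustrative values, not the settings used in the cited work:

```python
import numpy as np

def image_to_observations(image, block_height=10, overlap=7):
    """Scan an image top to bottom in overlapping horizontal strips;
    each strip is flattened into one feature vector, giving the
    observation sequence the HMM is trained on."""
    step = block_height - overlap              # rows advanced per strip
    observations = []
    for top in range(0, image.shape[0] - block_height + 1, step):
        strip = image[top:top + block_height, :]
        observations.append(strip.ravel().astype(float))
    return np.array(observations)

# An ORL-sized 112x92 image yields a sequence of 35 strip vectors.
img = np.random.randint(0, 256, size=(112, 92))
obs = image_to_observations(img)
print(obs.shape)  # (35, 920): 10 rows x 92 columns per strip
```

Each row of `obs` is one observation; the overlap makes successive observations change gradually, which suits the left-to-right state structure.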

Satisfactory results were achieved, but the flexibility of the HMM model allowed for further improvements. The seminal work in the field of HMM based face recognition is surely Samaria (1994). Here a left-to-right HMM is used to obtain segmentation information (or "meaningful regions") of a given face. This segmentation information could then be used to identify a face. The HMM has a pseudo two dimensional lattice of states, each describing a distribution of feature vectors belonging to a certain area of the face, as shown in figure 2.2. Each HMM has an end-of-line state with two possible transitions, either to the first state of its row or to the next row of states. The relevant database used in Samaria (1994) is the Olivetti Research Laboratory, ORL database (1994). This database consists of faces of 40 individuals, with 10 different images of each individual. The main feature of this database is that a picture of an individual contains mainly facial information and very little background. Background (noise) often transforms a seemingly great recognition system into quite an average one.

Simultaneous efforts by Nefian & Hayes (1999) and Eickeler et al. (1999a) introduced an embedded HMM model which consisted of embedded states inside super states, as shown in figure 2.3. This allowed for better transitions between states, since the embedded HMMs proved to be "tighter" probability density functions than normal Gaussian distributions. Both furthermore showed that pixel intensity values do not form the most robust of features and that using two dimensional DCT coefficients as features delivered better results. More recent developments on extending HMMs to be even more robust recognising tools are discussed in Chapter 2. This study will aim to reconstruct most of these HMM based face recognition experiments, to verify their results and hopefully add some improvements.
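The left-to-right topologies above constrain each state either to repeat or to hand over to its successor. A sketch of such a transition matrix; the stay probability is an illustrative initial value that would be re-estimated during training:

```python
import numpy as np

def left_to_right_transitions(n_states=5, stay=0.6):
    """Transition matrix of a left-to-right (non-ergodic) HMM: each
    state may only stay (a_ii) or advance one state (a_i,i+1), and
    the final state absorbs."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay
        A[i, i + 1] = 1.0 - stay
    A[-1, -1] = 1.0
    return A

print(left_to_right_transitions())
```

With five states this mirrors the forehead/eyes/nose/mouth/chin layout described for figure 2.1.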

1.4 Objectives

In any study of recognition the main goal or objective is achieving some or other high rate of recognition, in other words classifying accurately. The other, lesser objectives all relate to this main one in being the stepping stones towards the ultimate result: perfect classification. The main goals of this study in face recognition can be summarised as follows:

- Investigating the use of HMMs as a face recognition tool
- Implementing a number of HMM topologies that could be used as face classifiers
- Evaluating the chosen HMM topologies as face classifiers against available face databases
- Comparing the results of the HMM classifier against published systems

It can be seen that all the objectives revolve around Hidden Markov Models and applying them to the rather uncustomary field of face recognition. Modelling with HMMs tends to be quite a flexible process, and therefore a number of models can be constructed and tested as tools in order to accomplish the aforementioned main objective.

1.5 Contributions

The available literature on HMMs used as a face recognition tool covers the main issues regarding this solution to the face recognition problem. One aspect, though, receives little attention: the detail of the density functions inside the HMM states. We believe that this thesis deals with these details and in fact describes the process of selecting useful density function parameters based on the available databases, thereby generating very good results.

Another contribution deals with the question of what features to use, in other words, what preprocessing of images is necessary to obtain the best possible results. Furthermore, by using the segmentation of data provided by HMMs, we can extract faces from the background and locate facial features, something very useful in computer vision based applications. Again summarising these contributions:

- Choosing density function parameters for HMM states (embedded or not embedded) and their peculiarities
- Training the HMMs with suitable features, i.e. feature extraction and noise elimination
- Designing the HMM topologies in accordance with the physical characteristics of the available databases
- Segmentation of a face into "meaningful regions"

1.6 Overview

The focus of this thesis is the modelling of a face classification system using Hidden Markov Models. We start off with an overview of the available literature on face recognition using HMMs in Chapter 2 on page 11. This chapter emphasises the fact that there is not much available in published literature on Hidden Markov Models used in face recognition applications. Two basic HMM topologies, namely an embedded HMM and a single top-to-bottom HMM, are mentioned in the literature. We implement both these models to test their value as face image classifiers.

The focus of attention then moves on to the available databases used in the classification experiments in Chapter 3 on page 17. Both the databases we consider in this thesis have some interesting characteristics. Looking at typical image histograms (figures 3.3 and 3.5) it may be seen that background noise and other factors should clearly be taken into consideration

for at least the University of Surrey, XM2VTS database (2002). We cut out the bulk of the background in all of the images of the XM2VTS database to stop it from confusing the classifier. The other database we use, namely the Olivetti Research Laboratory, ORL database (1994), is used as is, since the images in this database are already in a "friendly" format, with very little variation between images and little background noise to confuse the classifier.

This leads into Chapter 4 on page 24. As suggested by results obtained by previous systems in the available literature, we use features other than pixel intensity values. This is done mainly to improve classification accuracy. Three feature extraction methods are implemented, focusing on the Discrete Cosine Transform (DCT) and why DCT coefficients form more robust features for face recognition than pixel intensity values. Furthermore, the feature extraction technique known as DCT-mod2 is also discussed, and how it could improve the robustness of the classifier. Our classification experiments using the DCT-mod2 coefficients give excellent results.

With all the theory of the preprocessing in place, Chapter 5 on page 36 then covers the theoretical modelling of the HMMs used in the face classification experiments. We decided to implement two HMM configurations: a normal top-to-bottom HMM modelling down the rows of an image, and an embedded HMM with a vertical HMM containing horizontal HMMs as the probability density functions within its states. These two topologies were chosen as they are the most widely used in the available literature. This also provides a good comparison of what the extra complexity of an embedded HMM "buys" in terms of classification accuracy.

With all the necessary modelling, motivation and theory in place, Chapter 6 on page 43 explains all the practical aspects concerning the implementation of the face classification system. Here we show the detail of how the HMMs we use as classifiers are constructed. Furthermore we show the specifics of training and scoring our classifier on pixel intensity values, DCT coefficients and DCT-mod2 coefficients.

Finally, the experiments conducted and results obtained are discussed in Chapter 7 on page 53; the list of classification results on both databases starts on page 54. Excellent results are achieved on both the databases we used in the experiments. The embedded HMM using DCT-mod2 features obtains the best classification results. It scores perfect classification (100%) on the ORL database, and on the complex XM2VTS database a classification score as high as 93.31% is recorded. These results are furthermore shown to compare well against published systems. We also show results of segmentations done on face images, as provided by the Viterbi algorithm. These segmentations show which areas of the face the embedded HMM models.

The final section is Chapter 8 on page 67. There the conclusions of this thesis are encapsulated and recommendations are made for further improvements in possible future work. By using techniques such as LDA (Linear Discriminant Analysis) or KDA (Kernel Discriminant Analysis) we believe that the models we discuss in this thesis can be improved to be very robust recognisers.
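The Viterbi algorithm mentioned above finds the most likely state path through an HMM; applied to the strip sequence of a face image, that path is exactly the segmentation into facial regions. A generic log-domain textbook sketch, not the thesis's implementation:

```python
import numpy as np

def viterbi(log_a, log_b):
    """Most likely state sequence given a log transition matrix
    log_a (S x S) and per-observation log emission likelihoods
    log_b (T x S). The path is forced to start in state 0, as in
    a left-to-right model."""
    T, S = log_b.shape
    delta = np.full((T, S), -np.inf)   # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)  # argmax back-pointers
    delta[0, 0] = log_b[0, 0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_a   # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_b[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Left-to-right 3-state model; emissions favour state 0, then 1, then 2.
log_a = np.log([[0.6, 0.4, 1e-9],
                [1e-9, 0.6, 0.4],
                [1e-9, 1e-9, 1.0]])
log_b = np.log([[0.9, 0.05, 0.05], [0.9, 0.05, 0.05],
                [0.05, 0.9, 0.05], [0.05, 0.9, 0.05],
                [0.05, 0.05, 0.9], [0.05, 0.05, 0.9]])
print(viterbi(log_a, log_b))  # -> [0, 0, 1, 1, 2, 2]
```

Each run of equal states in the returned path marks the image strips assigned to one facial region, which is what the segmentation figures visualise.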

Chapter 2
Literature study

2.1 First efforts in face recognition

It can be argued that the pioneering work in the field of face recognition was done by Kirby & Sirovich (1990). The technique they proposed, commonly known as eigenfaces, is based on Principal Component Analysis (PCA) and has been extended and optimised by various institutions and people to make it one of the most widely used current face recognition techniques. This technique and other early methods (like elastic graph matching and linear discriminant analysis (LDA)) are discussed in Muller (2002) and Sanderson (2003). These first methods all used facial geometry and symmetry to classify faces.

2.2 Hidden Markov Models enter the face recognition race

The first efforts to use HMMs as a face recognition tool were made by Samaria & Young (1994). They introduced the HMM as quite a robust mechanism to deal with face recognition. The HMM used was a single left-to-right HMM, as seen in figure 2.1, with each state modelling a specific facial region.

Figure 2.1: A one dimensional HMM for face recognition (five left-to-right states with self-loops a11...a55 and forward transitions a12...a45, modelling forehead, eyes, nose, mouth and chin).

Each state of this HMM contains a single multivariate Gaussian distribution as probability density function (pdf). This HMM is trained on a database of pictures, all of them read from top to bottom, with each row of pixel intensity values used as a feature vector. This approach achieved better classification rates than a PCA based approach on the tested database. Another bonus of introducing HMMs is that they segment the face into meaningful regions, which can also be used for other applications like facial gesture recognition. Follow-up work by the same author, Samaria (1994), extended the classic one dimensional left-to-right HMM to a pseudo two dimensional (pseudo-2D) one. This HMM had a pseudo two dimensional lattice of states, each describing a distribution of feature vectors belonging to a certain area of the face, as shown in figure 2.2. Each HMM had an end-of-line state with two possible transitions, either to the beginning state of its row or to the next row of states. In each state a multivariate Gaussian distribution was used to model the distribution of feature vectors relevant to that state. This approach was tested on the Olivetti Research Laboratory, ORL database (1994) and again it outperformed previous face recognition techniques at that time.

2.3 Extending the extensible

Simultaneous efforts by Nefian & Hayes (1999) and Eickeler et al. (1999a) introduced an embedded HMM which consisted of embedded states inside super states, as shown in figure 2.3. Again, each of the top-to-bottom states models a specific facial region. This extended HMM model allows for better transitions between states, since the embedded HMMs prove to be "tighter" probability density functions than normal Gaussian distributions. Both

authors furthermore showed that pixel intensity values do not form the most robust of features and that using selected two dimensional discrete cosine transform (DCT) coefficients as features delivered better results. Perfect classification (100%) was obtained on the Olivetti Research Laboratory, ORL database (1994) using this technique, and overall recognition speed increased because using only selected DCT features significantly compresses the data.

Figure 2.2: A one dimensional HMM with end-of-line states.

The main problem with all the above mentioned techniques was that they were tested on a database which consisted of pictures with very little background (see figure 3.1). The modelling can therefore be done very accurately and the HMMs can be fine tuned to deliver remarkable results. In a practical system this step would only be possible if faces could be

identified from pictures and then preprocessed to form a background-free image for the HMMs to classify.

Figure 2.3: An embedded HMM for face recognition (super states from top to bottom: forehead, eyes, nose, mouth, chin).

2.4 The latest HMM flavours used in face recognition

Hidden Markov Models have traditionally been used to model time dependent data. For this use they have been fine tuned, and thorough research has already been done on the subject, especially concerning which features to use (for example cepstral features in automatic speech recognition systems).

In image processing, HMMs are quite a new addition to the fold of well established techniques, and extracting robust features is therefore still one of the major areas for future development. Several novel feature extraction techniques are discussed in Sanderson (2003). One of these is the DCT-mod2 approach, which we included as a feature extraction method in this thesis. DCT-mod2 feature extraction can be seen as a form of delta-coefficient extraction. This method shows considerable potential, especially in keeping the recogniser robust when illumination changes occur. As far as we could establish, DCT-mod2 features have not previously been used in HMM based classifiers. Consensus, it seems, has been reached that DCT based feature extraction methods are probably the most effective.

Other advanced efforts were made by Müller et al. (2002), who proposed a triple embedded HMM based model to recognise facial expressions. It is also worth mentioning the HMM recogniser used by Bicego et al. (2003), where the authors propose wavelet coding as a feature extraction method. Using wavelets as features achieves the same perfect classification score on the ORL database. In Othman & Aboulnasr (2003) the authors proposed an HMM with an extended two dimensional structure, in which all states allow both vertical and horizontal transitions. Again DCT coefficients were used as features, which underlines the trend to move away from pixel intensities when choosing feature vectors. They also achieved remarkable results, but again on the ORL database. Another improved HMM-based recogniser was proposed by Eickeler et al. (1999b) using JPEG format features. What makes this technique useful is that it can recognise faces directly from JPEG compressed data; it is therefore an improvement speed-wise on previous efforts.
This method also underlines the fact that DCT-based features suppress sensitivity to changes in light intensity.

2.5 Summary

The literature provides a summary of previous HMM-based classifiers. The HMM topology most widely used seems to be the basic top-to-bottom HMM modelling down the rows of an image, as first proposed by Samaria & Young (1994). The extension of this model to an embedded HMM (by Nefian & Hayes (1999) and Eickeler et al. (1999a)) shows a lot of promise as a possibly robust classifier. Furthermore, spectral domain feature extraction techniques are widely used in published systems to improve the robustness of a classifier. In this thesis we reconstruct and improve the top-to-bottom and embedded HMMs. In training these models we also use specific spectral domain features (DCT coefficients, as proposed in the literature) to improve the classification accuracy of the HMMs.

Chapter 3

Face databases and their peculiarities

3.1 Introduction

The results of any classification experiment should always be seen in the context of the database of face images on which the classifier involved has been trained and tested. Such a database can be characterised by the following properties:

• Format of the pictures (i.e. file type, size, grey scale/colour)
• Number of persons in the database
• Number of images per person
• Variations in lighting conditions between images
• Variations in individuals’ features between images
• Amount of background in a picture

These properties all play some part in either confusing or helping the classifier to classify the faces in the database. In our experiments we use the University of Surrey, XM2VTS database (2002) and the Olivetti Research
Laboratory, ORL database (1994). These databases differ in all the properties mentioned above, so we list the differences in table 3.1. In order to fully understand table 3.1, see figure 3.1 for samples of the ORL database and figure 3.2 for samples of the XM2VTS database.

                           ORL database       XM2VTS database
    Format                 Grey scale .pgm    RGB .tiff
    Image size             112x92             576x720
    Persons in database    40                 295
    Images per person      10                 8
    Total images           400                2360
    Light variation        Slight             Slight
    Percentage background  <10% of image      >40% of image
    Background uniformity  Uniform black      Non-uniform blue

Table 3.1: Database comparison.

For the purposes of this thesis the XM2VTS database images were resized to 288x360, which corresponds to a scaling of 1/2 on the rows and columns. These images were also converted from RGB¹ to grey scale. Finally a window of 236x144 pixels was cut out, trying to capture as much of the face as possible. The ORL database pictures were already in a “friendly” format, since the pictures were all cropped around the faces they represented, reducing confusion caused by background. The differences between these two databases provided a good test of the robustness of our methods.

1. Colour pictures are represented by three pictures, each corresponding to the red, green or blue (the three primary colours) values of the pixels.
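As a concrete sketch of the preprocessing just described, a minimal numpy illustration is given below. It assumes 2x2 block averaging for the halving, an equal-weight channel average for the grey conversion, and a centred face window; none of these choices are pinned down in the text, so they are assumptions.

```python
import numpy as np

def preprocess_xm2vts(rgb, top=26, left=108):
    """Sketch of the XM2VTS preprocessing: 576x720 RGB -> 236x144 grey face.

    The window offsets default to a centred cut-out; in the thesis the faces
    were extracted manually, so these values are illustrative only.
    """
    grey = rgb.mean(axis=2)                                  # RGB -> grey (assumed average)
    half = grey.reshape(288, 2, 360, 2).mean(axis=(1, 3))    # scale rows/cols by 1/2
    return half[top:top + 236, left:left + 144]              # cut out the 236x144 window
```

For comparison, the ORL images (112x92) need no such step and are used as-is.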

Figure 3.1: Examples of pictures from the ORL database.

3.2 Possible issues

3.2.1 The XM2VTS database

In order for our classifier to perform well on both databases, we need to investigate any issues that could be encountered when testing it on them. The XM2VTS database is an extensive frontal face database containing images of 295 individuals (8 images each). This database was mainly constructed with face verification in mind and established a testing protocol to ensure that different institutions compare equivalent results. This protocol is known as the Lausanne protocol. For a
comprehensive discussion on the particulars of the XM2VTS database see Messer et al. (1999).

Figure 3.2: Examples of pictures from the University of Surrey XM2VTS database.

The main issue that arises when using this database is classifying faces against the large amount of background in the images. The database was acquired over a period of five months, with acquisition sessions spaced at one month intervals. The fact that the sessions were spaced a month apart means that background detail also differs between images. We focus on figure 3.2, and specifically on the sample face at the bottom left of this image. It can be seen that the background takes up a high percentage of the pixels of the picture. When referring to the histogram of the sample face image (see figure 3.3), this problem becomes even more evident. Most of the pixel values lie in and around the value of 50. Because HMMs are powerful modelling tools, they tend to model the non-uniform background rather than the facial data, purely because the background takes up so much of the data.

Figure 3.3: Histogram of pixel intensities of the bottom left image of figure 3.2.

Three possible solutions exist to overcome this problem. The first solution is to adapt our model by carefully choosing the probability density functions (pdfs) and the features to extract. This probably represents the most scientifically correct solution. A second possible solution is to crop all the pictures so that they consist mainly of facial data. Automatic procedures to do this exist, but we manually extracted the faces for our final experiments. The third possible solution is to normalise or transform the images in some way and then use the feature extraction methods described in chapter 4.

Another feature of the XM2VTS database is the way in which lighting and background, as well as personal features (glasses, hair etc.), vary between images belonging to the same class. One of the more extreme cases is presented in figure 3.4.²

2. The colour images have been presented as they better highlight the subtle differences between images.
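Histograms like those of figures 3.3 and 3.5 can be computed directly from the pixel intensities; a minimal numpy sketch (plotting omitted):

```python
import numpy as np

def intensity_histogram(image):
    # one bin per grey level 0..255; counts[v] = number of pixels with value v
    counts, _ = np.histogram(image, bins=256, range=(0, 256))
    return counts
```

An image dominated by a dark, near-uniform background piles most of its mass into a few neighbouring bins, which is exactly the peak around the value 50 visible in figure 3.3.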

Figure 3.4: Example of differences between images of the same class in the XM2VTS database.

3.2.2 The ORL database

This database consists of images of 40 people, with 10 images per person. An image of the complete database (400 faces) is given in appendix A. The persons captured in this database are aged between 18 and 81. There are 4 female and 36 male subjects, with each image containing a different facial expression. For most of the images light conditions differ, but all of the images are set against a uniform black background. All of the images are cropped to consist of mostly facial data with very little background. The varying conditions of light and expression, but limited background, make this database ideal for controlled face classification experiments. Take for instance the top left sample face in figure 3.1 and that image’s histogram as shown in figure 3.5. When comparing this histogram with the one presented in the previous section, it may be seen that it should be easier to model on this database, because the pixel values are more evenly spread, without extremities at specific pixel values.

Figure 3.5: Histogram of pixel intensities (number of pixels per grey scale value) of the top left image of figure 3.1.

3.3 Summary

The characteristics of both databases have been established. A controlled effort can therefore be made to extract robust features for the classification experiments. The next chapter deals with feature extraction and with developing a way to overcome the difficulties, especially those presented by the complex images in the XM2VTS database.

Chapter 4

Feature extraction methods

4.1 To feature or not to feature, that is the question

Our model operates on features extracted from the images. Given the database issues described in the previous chapter, these features should be chosen in such a way as to ensure a separation between individuals. The extraction of features concerns the passing on of object data, in a specific format and size, to some model, mainly for the purpose of recognising the object. Referring to figure 1.1, feature extraction is the step in between the sensor and the recogniser. Humans are restricted to features based on the five senses: using the eyes, for instance, the frequencies (colour) and the intensity of light are the only features from which objects can be identified. In training an artificial recogniser, the features seen by the recognition models can be manipulated. Specifically in this thesis, the main question concerning feature extraction is: which numerical values are needed to effectively train the HMM based classifier? The identification of these values is the basis of the feature extraction problem. The following features were investigated and used to train the HMMs in the face classification experiments:
• Pixel intensity values
• Discrete Cosine Transform (DCT) coefficients
• DCT-mod2 coefficients

Pixel intensity values are the raw data representing an image; in a grey format they typically vary in value from 0 to 255. DCT coefficients are obtained by applying the two dimensional DCT to blocks of a given image. The DCT-mod2 coefficients are extended DCT based features, as proposed by Sanderson & Paliwal (2002). As far as could be ascertained, DCT-mod2 based feature extraction in HMM based face classification has not been investigated before. The following sections deal with the in-depth explanation of the method behind each of these feature extraction techniques and their advantages or disadvantages when used in the classification of face images.

4.2 Pixel intensities

Pixel intensity values are numerical values of light intensity on a specific scale and are used to store pictures digitally. For instance, say that a grey-scale digital photo is taken of the face of a human at a resolution of 720x576. This makes it possible to store a matrix (with 720 columns and 576 rows) of light intensity values on a computer. The grey scale implies that the intensity values are integers representing shades of grey, ranging from 0 (black) to 255 (white). The following example illustrates grey scale pixel values. Assume we have a matrix of pixel intensity values (matrix A) representing an image (figure 4.1):

    A = \begin{bmatrix}
          2 & 255 &   2 & 255 \\
         10 & 200 & 200 & 100 \\
         50 & 100 &  50 &   2 \\
          2 &  50 & 150 & 200
        \end{bmatrix}

Figure 4.1: Enlarged grey scale picture of matrix A.

Storing pictures in this raw format wastes space, so one of the many available compression routines is normally used instead. These pixel values do however represent features that can be used to train the HMM topologies discussed in this thesis, and satisfactory face classification results are obtained. The problem, however, is that many features have to be kept, so the training and scoring of models becomes computationally expensive. If we wanted to classify the image represented by matrix A using an HMM based classifier, the image could be scanned from top to bottom with each row forming a single feature vector. The complete observation sequence is therefore the four rows of this matrix. A histogram (figure 4.2) of the pixel intensities can be drawn. As was shown in the previous chapter, typical histograms of facial images in the available databases (see figures 3.3 and 3.5) show how face data and background, which can be regarded as noise, are embedded in the features (pixel intensity values). This is one of the reasons why pixel intensity values are not the best features to use. For robust face classification we want features to be decorrelated in some way, so that we can model a face image as distinctly as possible. To summarise the advantages of pixel intensity values as features: they are easy to obtain and they have the same dimensions as the image data.
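The row scan described above amounts to nothing more than treating the image matrix as a sequence of row vectors; a minimal sketch using matrix A:

```python
import numpy as np

A = np.array([[ 2, 255,   2, 255],
              [10, 200, 200, 100],
              [50, 100,  50,   2],
              [ 2,  50, 150, 200]])

# scan top to bottom: each row is one feature vector, giving an
# observation sequence of T = 4 vectors of dimension 4
observations = [row for row in A]
```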

Figure 4.2: Histogram of matrix A containing grey scale values.

The disadvantages of pixel values are: they tend to be sensitive to image noise, as well as to image rotations or shifts and changes in illumination. They furthermore induce large dimensions on observation vectors, which causes any complex algorithm to take an unacceptably long time to complete.

4.3 An introduction to the Discrete Cosine Transform

Compressing data is essential in both biological and signal processing applications. Even in human vision, the light signals received by the approximately 130 million photo-receptors (see Steven W. Smith (1999) for more details) at retinal level in the eye are sent to the brain for compression and processing. By the time these signals arrive at the higher centres of the brain, they convey magnitude (contrast), phase and frequency, which are all principal attributes of Fourier analysis. Especially in the image processing community, the two dimensional DCT has been used as a data compression tool. The two dimensional DCT forms the basis of the JPEG (Joint Photographic Experts Group) image compression standard. It is important to note that we will henceforth refer to the two dimensional version of the DCT simply as “the DCT”. The original DCT is mainly used in one dimensional applications (i.e. not image processing).

4.4 The Discrete Cosine Transform

4.4.1 Motivation and the case of the missing sine coefficients

In general, to obtain the frequency representation of a two dimensional signal the Fourier Transform is used, specifically the FFT (Fast Fourier Transform) algorithm. The Fourier theorem specifies that any signal can be represented as a weighted sum of even and odd sinusoidal terms. The DCT is a transform very much like the Fourier Transform, but with the DCT a signal is represented only by the even sinusoidal terms (hence the name cosine transform). Representing image information in terms of the DCT rather than the FFT has the important advantage that DCT coefficients are always real valued. The DCT also delivers better energy compaction, and the coefficients are nearly uncorrelated (Eickeler et al. (1999a)). Having nearly uncorrelated coefficients makes the DCT very attractive in terms of image processing: it means, for instance, that in the application of face recognition DCT features will be less sensitive to changes in image illumination. In general, the two dimensional DCT of an MxN matrix F is defined as follows:

    C(u,v) = \alpha(u)\,\beta(v) \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} F(i,j)\,
             \cos\frac{(2i+1)u\pi}{2M}\,\cos\frac{(2j+1)v\pi}{2N}        (4.4.1)

where

    \alpha(0) = \sqrt{\tfrac{1}{M}}, \quad \alpha(u) = \sqrt{\tfrac{2}{M}}, \quad
    \beta(0) = \sqrt{\tfrac{1}{N}}, \quad \beta(v) = \sqrt{\tfrac{2}{N}}

and

    0 \le u \le M-1, \quad 0 \le v \le N-1.

From equation 4.4.1 a DCT coefficient matrix can be constructed. These coefficients represent the energy contribution by different frequencies. The first coefficient (C(0,0)) represents the “DC” component, or the average value of the MxN block. The rest of the coefficients represent the different “AC” components, as contributed by each of the frequencies present. For the subsequent discussion refer to the sample image of a person (figure 4.3, scaled to two thirds the size) taken from the University of Surrey, XM2VTS database (2002).

Figure 4.3: Example face from the University of Surrey, XM2VTS database (2002).

The main advantage of the DCT is that it compresses data. This compression property of the DCT allows a block of
pixels to be represented by just a few DCT coefficients, and it is therefore possible to work with fewer features while still retaining the information present in the much larger number of pixel values. In order to extract the coefficients which contain the most data about the transformed block, the DCT coefficient matrix needs to be scanned in a zig-zag pattern, as shown in figure 4.4. This is because the contributing frequencies are arranged from low to high, as indicated by the zig-zag pattern of increasing K.

    v\u    0      1      2      3
    0      K=0    K=1    K=5    K=6
    1      K=2    K=4    K=7    K=12
    2      K=3    K=8    K=11   K=13
    3      K=9    K=10   K=14   K=15

Figure 4.4: Ordering of DCT coefficients for N=M=4.

To show these compression properties, the first 10x10 (compression of approximately 4000 times), 50x50 (approximately 160 times), 100x100 (approximately 40 times) and 200x200 (approximately 10 times) coefficients were extracted from figure 4.3 and run through the inverse transform (IDCT) to obtain approximated images. See figure 4.5 for the approximations of the face image.¹ It can be seen that the DCT provides suitable data compression, and for this reason alone it should be considered when constructing features used in face recognition.

1. This example shows the compression capabilities of the DCT and should not be confused with the JPEG compression standard, in which the DCT is used, but not in this manner.
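Equation 4.4.1 and the zig-zag scan of figure 4.4 translate directly into code; a straightforward, unoptimised sketch:

```python
import numpy as np

def dct2(F):
    """Two dimensional DCT of an MxN block, per equation 4.4.1."""
    M, N = F.shape
    i, j = np.arange(M), np.arange(N)
    C = np.zeros((M, N))
    for u in range(M):
        for v in range(N):
            alpha = np.sqrt(1.0 / M) if u == 0 else np.sqrt(2.0 / M)
            beta = np.sqrt(1.0 / N) if v == 0 else np.sqrt(2.0 / N)
            basis = np.outer(np.cos((2 * i + 1) * u * np.pi / (2 * M)),
                             np.cos((2 * j + 1) * v * np.pi / (2 * N)))
            C[u, v] = alpha * beta * np.sum(F * basis)
    return C

def zigzag_order(M, N):
    """Coefficient positions (row, column) in order of increasing K, as in figure 4.4."""
    order = []
    for d in range(M + N - 1):                  # anti-diagonals: low to high frequency
        diag = [(i, d - i) for i in range(M) if 0 <= d - i < N]
        if d % 2 == 0:
            diag.reverse()                      # even diagonals are traversed upward
        order.extend(diag)
    return order
```

For a constant block only the "DC" coefficient C(0,0) is non-zero, and the first L entries of zigzag_order(8, 8) select the low frequency coefficients that are kept as features.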

Figure 4.5: Reconstructions of figure 4.3 using DCT coefficients.

4.4.2 Feature extraction using the DCT

In this thesis the selection of suitable DCT coefficients from pictures in the available databases (see figures 3.1 and 3.2) was evaluated as a feature extraction method. For this method of feature extraction a sliding window of 8x8 pixels was scanned over a picture with the standard overlap of 50% in both the horizontal and vertical directions. For each window of 8x8 pixels, a DCT coefficient matrix of the same size was obtained. This means that for an image of Y rows and X columns there are

    N_D = \left(2\frac{Y}{N} - 1\right) \times \left(2\frac{X}{N} - 1\right)        (4.4.2)

8x8 DCT coefficient blocks (with N = 8 being the size of the window). These DCT coefficient blocks are then reduced by keeping their first 15 coefficients (as suggested by the experiments of Sanderson (2003)), following the zig-zag pattern described earlier. Thus every 64 values are reduced to L = 15 values, and a single observation used to represent the data of block (b, a) is now the vector:

    \vec{x} = [c_0^{(b,a)} \; c_1^{(b,a)} \; c_2^{(b,a)} \; \cdots \; c_{L-1}^{(b,a)}]^T        (4.4.3)
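Equation 4.4.2 is simply the number of positions an 8x8 window can take when it advances in steps of 4 pixels (50% overlap); a quick sketch that recovers the block counts quoted in the text:

```python
def num_dct_blocks(Y, X, N=8):
    """Number of NxN windows over a YxX image with 50% overlap (equation 4.4.2)."""
    step = N // 2                      # 50% overlap in both directions
    rows = (Y - N) // step + 1         # equals 2*Y/N - 1 when N divides Y
    cols = (X - N) // step + 1         # equals 2*X/N - 1 when N divides X
    return rows * cols
```

For the two image sizes used in this thesis this gives 594 blocks (112x92) and 2030 blocks (236x144).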

A complete observation sequence is obtained consisting of N_D of these vectors. Specifically, for the two databases used, the images were of size 112x92 and 236x144.² This means we have observation sequences of sizes N_D = 594 blocks and N_D = 2030 blocks respectively.

2. It is important to note that when speaking of the size of an image the customary format is (number of rows) x (number of columns), but the resolution of an image is written the other way around.

4.5 Giving DCT features an extra boost of robustness

In Sanderson & Paliwal (2002) a novel way of adding more robustness to the DCT is introduced. This method of feature extraction is based on polynomial coefficients, also known as deltas. In speech recognition applications an analogue of this method has proved very successful in eliminating background noise and channel mismatch. Images, however, inherently consist of two dimensional signals, and we therefore have to redefine these coefficients. As proposed in Sanderson & Paliwal (2002), we will name this new method of feature extraction DCT-mod2. For images we now define the n-th horizontal delta coefficient for a block located at (b, a) as a modified first order orthogonal polynomial coefficient (Sanderson & Paliwal (2002)):

    \Delta^h c_n^{(b,a)} = \frac{\sum_{k=-K}^{K} k\, h_k\, c_n^{(b,a+k)}}{\sum_{k=-K}^{K} h_k\, k^2}        (4.5.1)

Similarly, the n-th vertical delta coefficient is defined as:

    \Delta^v c_n^{(b,a)} = \frac{\sum_{k=-K}^{K} k\, h_k\, c_n^{(b+k,a)}}{\sum_{k=-K}^{K} h_k\, k^2}        (4.5.2)

where h is a 2K+1 dimensional symmetric window vector and c_n is the n-th DCT coefficient of a block located at (b, a). For our purposes we let K = 1 and h = [1 1 1]^T, a rectangular window. To illustrate the advantage of
using these modified delta features, assume we have three consecutive blocks X, Y and Z, as explained in Sanderson & Paliwal (2002). Let us assume that each block contains an information component and a noise component, say X = X_I + X_N, Y = Y_I + Y_N and Z = Z_I + Z_N, and that each block is corrupted by the same noise, so X_N = Y_N = Z_N. This is a reasonable assumption to make if the blocks are small and close to each other, or if the blocks are “neighbours” as a result of the overlapping used in the sampling process. The deltas for block Y can now be computed using equations 4.5.1 and 4.5.2:

    \Delta^h Y = \frac{1}{2}(-X + Z)
               = \frac{1}{2}(-X_I - X_N + Z_I + Z_N)
               = \frac{1}{2}(Z_I - X_I)        (4.5.3)

and

    \Delta^v Y = \frac{1}{2}(-X + Z)
               = \frac{1}{2}(-X_I - X_N + Z_I + Z_N)
               = \frac{1}{2}(Z_I - X_I)        (4.5.4)

and the noise component is removed. We now modify our DCT feature vector by replacing the first three coefficients by their horizontal and vertical deltas, and form a feature vector representing a given block at (b, a) as the new vector:

    x = [\Delta^h c_0 \;\; \Delta^v c_0 \;\; \Delta^h c_1 \;\; \Delta^v c_1 \;\; \Delta^h c_2 \;\; \Delta^v c_2 \;\; c_3 \;\; c_4 \;\; \cdots \;\; c_{L-1}]^T        (4.5.5)

where the (b, a) indication was left out to maintain clarity and L = 15. The first three coefficients represent the most information held in the block, and to limit the size of the features it is therefore these that are replaced with their delta coefficients. A block of coefficients taken on the edges of the picture will not have a neighbouring block on the one side, so when using the DCT-mod2
approach we end up with

    N_{D2} = \left(2\frac{Y}{N} - 3\right) \times \left(2\frac{X}{N} - 3\right)        (4.5.6)

blocks. This gives observation sequences of sizes N_{D2} = 500 blocks and N_{D2} = 1848 blocks respectively.

4.6 Comparison of methods and summary

To summarise: in general, any method of feature extraction has certain characteristics which need to be taken into account when constructing an artificial recogniser. The three feature extraction methods discussed are characterised in table 4.1.

                    Pixel intensities   DCT            DCT-mod2
    Preprocessing   None                N_D 2-D DCTs   N_D 2-D DCTs and
                                                       N_{D2} linear operations
    Dimensionality  Large               Small          Small
    Robustness      None                Very           Most

Table 4.1: Comparison of the feature extraction methods.

When training HMMs to recognise faces, it is desirable to speed up the process without sacrificing accuracy. By using the two DCT-based feature extraction methods, we improved the speed of our system (because of the lower dimensionality of the observation sequences). Furthermore, our system becomes robust to changes in illumination, something that is inherent in any picture. To briefly illustrate the value of the above feature extraction methods, a small classification experiment was run on the first 8 individuals (using 4 images of each) in the University of Surrey, XM2VTS database (2002), using each of the feature extraction methods. The leave-one-out method of training/scoring was used, with HMM configuration II (see chapter 6 for details of this configuration; it is a configured embedded HMM). The results we obtained from this mini experiment are summarised in table 4.2.

                   Recognition accuracy   Wrong classifications
    Pixel values   84.38%                 5 faces
    DCT            90.63%                 3 faces
    DCT-mod2       100.0%                 0 faces

Table 4.2: Classification accuracy on a small scale.

The full results achieved on both evaluated databases, using all three feature extraction methods, are listed and discussed in chapter 7. We see from this discussion of feature extraction techniques that we need to give our HMM classifier as much information as possible about an image while wasting as little space as possible. The next chapter deals with the foundation of this thesis on face recognition, namely the construction of the specialised HMMs used in the classification experiments.
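Before moving on to the models, the noise-cancellation property that motivates DCT-mod2 (equations 4.5.3 and 4.5.4) can be checked numerically; a small sketch with K = 1 and h = [1 1 1]^T, assuming, as in section 4.5, that neighbouring blocks share the same additive noise (the numeric values are illustrative only):

```python
def delta(c_prev, c_next):
    # equations 4.5.1/4.5.2 with K = 1 and h = [1 1 1]:
    # numerator (-1)*c_prev + (+1)*c_next, denominator 1*(-1)**2 + 1*(1)**2 = 2
    return (c_next - c_prev) / 2.0

# information components of blocks X and Z, each corrupted by the same noise
X_I, Z_I, noise = 10.0, 15.0, 3.7
d = delta(X_I + noise, Z_I + noise)   # the shared noise cancels exactly
```

Here d equals (Z_I - X_I)/2 regardless of the noise value, which is precisely why the deltas are insensitive to a constant illumination offset.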

Chapter 5

Constructing the Hidden Markov Models

5.1 A brief introduction to HMMs

Hidden Markov Model theory forms the basis of the industry standard in speech recognition based applications. HMMs tend to be robust recognisers with extreme flexibility in terms of parameters. These characteristics led us to believe that HMMs might be suited to image recognition and, as this thesis shows, this is in fact the case. An in depth discussion on HMMs is deferred to the many excellent references on the topic, one being Rabiner & Juang (1986). The purpose of this chapter is to introduce our application specific HMMs and to show how an expansion of conventional one-dimensional HMM theory suits our inherently two-dimensional application.

5.2 HMM background

We now introduce the notation and mathematical descriptions (regarding HMMs) necessary for the subsequent discussions on our face recognition model.

5.2.1 Topology and notation

Over the years of research in pattern recognition, quite a number of HMM topologies and configurations have seen the light, as mentioned in du Preez (1997). The standard topology we are concerned with is the non-ergodic, left-to-right Hidden Markov Model of figure 5.1. This specific model was chosen because the human face can naturally be divided into segments common to every human (eyes, nose, mouth, chin etc.), and these features always occur in the same order.

Figure 5.1: Standard left-to-right, non-ergodic HMM (five emitting states with self-loop probabilities a_ii and forward transition probabilities a_i,i+1).

A Hidden Markov Model Λ is defined as a set of N emitting states as well as an initial and an end-of-line state (these states are so-called null-states), so we end up with N + 2 states. The expression S_t = i will indicate the occurrence of state i at time t. The time indices run from t = 1 to t = T, where T is the length of the observation sequence X = [x_1 x_2 x_3 ··· x_T] to be matched to the HMM. The states are coupled by transitions, with a_ij denoting the state transition probability (the subscripts indicating the two states involved) and a_ii the self-loop probability. The first null-state has a transition probability of 1 and no self-loop probability. The last null-state has no emitting probabilities; it is the termination state. Each emitting state has an associated probability density function (pdf), written f_i(x|S_t, Λ). This pdf quantifies the similarity of a feature vector x_t from the observation sequence to the state S_t = i. It is important to note that no time step is needed to enter the first null-state; the process already occupies that state. Using the common shorthand notation, a single left-to-right HMM can now be described as Λ = {a, f}. Introducing the null-states effectively cancels the need for
defining an initial value, often denoted by π in most of the literature on HMM theory.

In order to train an HMM we need to quantify a few probabilities. The match between an observation sequence X_1^T and the model Λ can be expressed in terms of the likelihood f(X_1^T | Λ). The calculation of this likelihood is often known as the evaluation problem. A possible solution to this problem is to enumerate all possible sequences of states S_0^{T+1}, determine the value of f(X_1^T, S_0^{T+1} | Λ) for each, and then determine the marginal pdf by summing over all of them. A more efficient approach is the forward-backward procedure, described in appendix B. We approximate this by the well known Viterbi algorithm, since it is faster. The state sequence which delivers the highest “score” is the solution to what is known as the decoding problem, yielding the most likely state sequence. In training the HMM we need to optimise the parameters of the model based on the observation sequence. This can be quantified as finding the highest value of f(Λ | X_1^T, S_0^{T+1}).¹ We used what is known as Viterbi re-estimation to solve what is often called the learning problem. This method uses the state sequence (segmentation) obtained by the Viterbi algorithm to re-estimate the parameters of the HMM. This can easily be accomplished by simply updating all the parameters (pdfs and transition probabilities) within the segments specified by the Viterbi algorithm’s segmentation. This algorithm is an example of an Expectation-Maximisation algorithm, as we change our pdfs’ parameters to obtain the maximum probability score (expectation). The procedures described all involve matching an observation sequence to the model. This is quantified as a probability f(X_1^T | Λ), showing that any HMM can be seen as a special kind of pdf.

1. A reader familiar with basic statistics will note that this is the reverse of the evaluation problem, and therefore simple Bayesian identities can solve it.
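The decoding step just described can be made concrete with a compact log-domain Viterbi for a left-to-right model; a minimal sketch with a toy two-state model and one dimensional Gaussian state pdfs (the numbers are illustrative, not from the thesis):

```python
import math

def viterbi(obs, log_trans, state_logpdf):
    """Most likely state sequence for a left-to-right HMM.

    log_trans[i][j] holds the log transition probabilities between the
    emitting states; the initial null-state enters state 0 with probability 1.
    """
    N, T = len(log_trans), len(obs)
    NEG = float("-inf")
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = state_logpdf(0, obs[0])
    for t in range(1, T):
        for j in range(N):
            best = max(range(N), key=lambda i: score[t - 1][i] + log_trans[i][j])
            score[t][j] = (score[t - 1][best] + log_trans[best][j]
                           + state_logpdf(j, obs[t]))
            back[t][j] = best
    path = [max(range(N), key=lambda j: score[T - 1][j])]   # best final state
    for t in range(T - 1, 0, -1):                           # backtrack
        path.append(back[t][path[-1]])
    return path[::-1]

def gauss_logpdf(x, mu, var=1.0):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

# toy model: state 0 emits around 0, state 1 around 5;
# self-loop and forward transition probabilities both 0.5
log_half = math.log(0.5)
log_trans = [[log_half, log_half],
             [float("-inf"), log_half]]
means = [0.0, 5.0]
path = viterbi([0.1, -0.2, 5.1], log_trans,
               lambda i, x: gauss_logpdf(x, means[i]))  # -> [0, 0, 1]
```

Viterbi re-estimation then simply recomputes each state's parameters from the observations that the decoded path assigns to it.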

(53) Chapter 5. Constructing the Hidden Markov Models. 5.3 5.3.1. 39. Model Configurations First configuration — 1D HMM. For the face classification task we used two basic configurations of Hidden Markov Models. In the first case the face was modelled with a vertical HMM running along the rows of the image as seen in figure 5.2. With each state of the HMM representing a distinct facial region (i.e. the eyes, mouth, chin etc.) the characteristic features of any person can be modelled. Inside. Figure 5.2: Vertical top-to-bottom HMM modelling a face. each state S we use a Gaussian mixture model (GMM) as the probability density function fi (x|St , Λ) within the state. A Gaussian mixture model.

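The Gaussian mixture density used as state pdf (equations 5.3.1 and 5.3.2 below) can be evaluated directly. The sketch below assumes diagonal covariance matrices for brevity; the thesis uses full covariance matrices, and the function names are illustrative only.

```python
import math

def gaussian_pdf(x, mu, var):
    """D-dimensional Gaussian density with diagonal covariance (cf. eq. 5.3.2)."""
    D = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return math.exp(-0.5 * (D * math.log(2 * math.pi) + log_det + quad))

def gmm_pdf(x, weights, means, variances):
    """Weighted sum of K Gaussians (cf. eq. 5.3.1); weights must sum to 1."""
    return sum(w * gaussian_pdf(x, mu, var)
               for w, mu, var in zip(weights, means, variances))
```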
A Gaussian mixture model can be expressed as a weighted sum of K Gaussian distributions:

    L(x) = \sum_{k=1}^{K} p(k) N_k(x)    (5.3.1)

where N_k(x) is a D-dimensional Gaussian distribution with mean µ and covariance matrix Σ:

    N_k(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]    (5.3.2)

and p(k) is a mixture weight constrained by:

    0 \le p(k) \le 1  and  \sum_{k=1}^{K} p(k) = 1

The mixture weights can be seen as probabilities, since they represent the importance of each separate Gaussian pdf in the GMM. The dimension D of the Gaussians depends on the feature extraction method we use (see chapter 4 for the feature extraction methods).

Now that we have the density functions, we can finalise our HMM by initialising it. This is done by uniformly segmenting the face under consideration along its rows and obtaining the mean vector and covariance matrix of each of these segments; these are the initial values of the parameters of our GMMs. Furthermore we set all the transition probabilities of the HMM equal to a_ij = 0.5, keeping in mind that for each state of the HMM these probabilities sum to 1. We now have a complete model of our face, represented by Λ = {a, f}. In order to train this model we use the procedure described in the previous section: matching an observation sequence to the model and then optimising the model's parameters.

5.3.2 Second configuration — Embedded HMM

We illustrated that calculating the match between an observation sequence and a model can be characterised as a probability f(X_1^T | Λ). This means that an HMM itself could be seen as a specialised pdf.

Embedding HMMs to serve as the pdfs of the states of our vertical HMM could indeed enhance the modelling capabilities of our system. For an embedded HMM the conventional top-to-bottom HMM has a horizontal HMM as the pdf of each of its states (instead of a GMM), as shown in figure 5.3. This means that for each vertical state we have f_i(x | S_t, Λ) → λ_i = {a_i^e, f_i^e}, where the superscript indicates that we are referring specifically to the horizontal HMM λ_i, with i indicating the vertical state under consideration.

Figure 5.3: Embedded HMM modelling a face.

Each of the horizontal HMMs also needs probability density functions f_i^e(x | λ_i) for its states. These pdfs were chosen to be Gaussian mixture models (as described by equation 5.3.1), due to their flexibility. To initialise the whole HMM structure, an image under consideration is segmented uniformly along its rows; each of these segments is then again uniformly segmented across its columns, so that we end up with uniform blocks of data (for a 5x5 embedded HMM, 25 uniform blocks). The mean vector and covariance matrix of each of these blocks are then found, so that every GMM obtains values for its parameters.

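The uniform block segmentation used to initialise the embedded HMM can be sketched as follows. This is an illustrative helper, not the thesis code; it assumes simple integer splitting, with any remainder absorbed by the last segment in each direction.

```python
def uniform_blocks(image, n_rows, n_cols):
    """Split an image (a list of equal-length pixel rows) into
    n_rows x n_cols near-equal rectangular blocks, one per embedded
    HMM state, for initialising the state GMMs."""
    H, W = len(image), len(image[0])

    def bounds(total, parts):
        # Segment boundaries; the last segment takes any remainder.
        step = total // parts
        return [(p * step, (p + 1) * step if p < parts - 1 else total)
                for p in range(parts)]

    blocks = []
    for r0, r1 in bounds(H, n_rows):
        for c0, c1 in bounds(W, n_cols):
            blocks.append([row[c0:c1] for row in image[r0:r1]])
    return blocks
```

For a 5x5 embedded HMM this yields the 25 uniform blocks mentioned above.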
Just to clarify, the estimated mean of column m of an M×N block (matrix) x_d of data is:

    \hat{\mu}_d(m) = E[x_d(m)] = \frac{1}{N} \sum_{i=1}^{N} x_d(i, m)    (5.3.3)

Thus the total mean µ_d of the block can be expressed as a vector of M means (m = 1, ..., M). The covariance matrix is estimated by:

    \hat{\Sigma}_d = \frac{1}{N} \sum_{i=1}^{N} (x_d(i) - \hat{\mu}_d)(x_d(i) - \hat{\mu}_d)^T    (5.3.4)

where x_d(i) denotes the i-th row of the block.

To summarise the whole embedded HMM concept: we have a vertical HMM containing a number of horizontal HMMs as probability density functions within its states. The initial values of this vertical HMM are the combined initial values of the horizontal HMMs, along with uniform transition probabilities. Each horizontal HMM has a GMM as probability density function within each of its states, uniform transition probabilities, and initial values obtained from the uniform blocks of data used to initialise the GMMs. The horizontal HMMs are trained as described in a previous section, which entails the calculation of a likelihood. Using the likelihoods calculated for all of the horizontal HMMs, we can finally obtain the same type of likelihood value for our vertical HMM. This results in a trained final model.

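The block statistics of equations 5.3.3 and 5.3.4 amount to a sample mean vector and sample covariance matrix over the rows of a block. A small sketch (the function name is illustrative):

```python
def block_stats(block):
    """Mean vector and covariance matrix of a data block
    (cf. equations 5.3.3 and 5.3.4).
    block: N rows, each a length-M observation vector."""
    N, M = len(block), len(block[0])
    # Per-column mean, giving a vector of M means.
    mu = [sum(row[m] for row in block) / N for m in range(M)]
    # Outer-product average of the centred rows.
    cov = [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in block) / N
            for j in range(M)] for i in range(M)]
    return mu, cov
```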
Chapter 6. Implementation

6.1 Practical aspects

Now that we have built all the necessary foundations explaining the gist of our system, we can proceed to discuss the implementation issues and the practicalities of building a robust classifier.

6.1.1 Classifying faces — database partitioning

In order to classify faces we need to train our HMM-based classifier on sample images (training data) of each person in the database. Other "unseen" images, the test data, are then scored against our trained models and, as in most real-life scenarios, the highest score wins! The immediate question which arises involves the partitioning of the databases into training and testing data. In order to compare our classification results with published results, the same partitioning must be used. The partitioning of a database is often done only once (by the first publisher) and then, in order to compare results, such a partitioning tends to propagate through all further publications in the field. To explain the partitioning problem we summarise common partitions of our two databases in Table 6.1. To clarify the percentage values: each reflects the ratio (amount of training data / total amount of data) × 100%.

Table 6.1: Comparable partitioning of databases

    Database   Partitioning   Reference
    XM2VTS     75%            Zhang et al. (2004)
    ORL        50%            Samaria (1994)

Although the XM2VTS database has 8 images per person, we only used 4 faces in our experiments (each time using 3 faces to train on and one to test on). We chose these faces in accordance with what appear to be the four faces used in Zhang et al. (2004). This could prove to be quite limiting, as efficient modelling using HMMs is known to be very dependent on the amount of training data. In the XM2VTS database one is also dealing with 295 individuals, so face classification does become more difficult. The only drawback is that, as far as could be established, Zhang et al. (2004) is the only known publication with classification results on this database. All other publications concerning this database tackled the verification problem (because of the well-defined protocol described in Messer et al. (1999)), and a large number of verification results are obtainable.

Following the above discussion, we set up the XM2VTS database experiments with 3 faces to train on and one to test on, so as to compare classification results. (This partitioning does have its merits: the images were shot one month apart, so differences in appearance, such as different hair or glasses, are present.)

The experiments on the ORL database give a good indication of how our system compares to other systems, since a large number of published results are available. This database, however, contains only 40 individuals (figure A.1), which means results can only be seen as a rough approximation of how a commercial system would perform. We used the ORL database as-is and did classification experiments using the historical 50% partitioning, to compare our system with previous results, as well as a full leave-one-out experiment. To clarify the 50%: the first five faces of each person were used for training and the last five for testing. Perfect classification rates (100%) are obtainable on the ORL database, as we show in the next chapter.

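The historical 50% ORL partitioning described above amounts to a trivial split per person. A sketch with a hypothetical helper (the database layout as a dict of face lists is an assumption for illustration):

```python
def partition_db(faces_per_person, n_train=5):
    """Split each person's face list into training and test sets:
    the first n_train images train, the rest test (the historical
    50% ORL partitioning when each person has 10 images)."""
    train, test = {}, {}
    for person, faces in faces_per_person.items():
        train[person] = faces[:n_train]
        test[person] = faces[n_train:]
    return train, test
```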
6.1.2 Classifying faces — image background

One of the most frustrating problems encountered in constructing our classifier was the large amount of background present in the XM2VTS database. As shown previously, the background can represent more than 40% of an image. A classification experiment was conducted using the embedded HMM approach with DCT-mod2 coefficients, using the first face of four as test data and the other three faces as training data on all 295 individuals in the database, without reducing the amount of background. This carelessness was reflected in the results, as we achieved a correct classification rate of only 58%. Since we have 8 faces available but use only four, the effect of the background can easily be verified by running the same experiment, again using 4 faces, but with 2 of them replaced by 2 of those that had been left out. (The 8 faces per person in the XM2VTS database were shot across 4 sessions; here we use the 2 faces from the first 2 sessions.) In doing this, the classification rate increases to 80%! This means that the error rate is effectively halved. It is necessary to take out the background — it is confusing our classifier! This problem at least illustrates that, because of their modelling power and dynamic aspects, HMMs are so flexible that they tend to model the background if it represents the bulk of the available data.

As a solution to the problem posed by too much background, we cut out all the relevant faces of the XM2VTS database (the first face from each of the 4 capturing sessions), giving a total of 295 × 4 faces to classify. The faces were cut out manually from downsized images, so as to fit as much of the face as possible into a 236x144 window. This corresponds to 56x33 blocks of DCT-mod2 coefficients and 58x35 blocks of DCT coefficients. Again, the dimension of the DCT features is 15 and that of the DCT-mod2 features is 18. The DCT coefficients are obtained with a sampling overlap of 50%. These were the combinations of dimensions used in our final experiments.

In the ORL database no cropping of faces was needed, as the data is already presented in a "friendly" format, as shown in chapter 3.

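The block-grid dimensions quoted above follow from the window size if one assumes 8x8 sampling blocks (as would be detailed in chapter 4, not in this excerpt) with 50% overlap, and that the DCT-mod2 deltas discard one border block on each side; both assumptions are ours, made to check the arithmetic:

```python
def n_blocks(length, block=8, overlap=0.5):
    """Number of sampling positions along one image dimension for a
    given block size and fractional overlap between adjacent blocks."""
    step = int(block * (1 - overlap))  # 50% overlap of 8 -> step of 4
    return (length - block) // step + 1

# A 236x144 face window then gives a 58x35 grid of DCT blocks,
# and 56x33 DCT-mod2 blocks once the border blocks are dropped.
```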
6.1.3 Classifying faces — training and scoring HMMs

The process of classification on a database of faces can be summarised as follows:

- First, a database is partitioned into a training and a testing part, with the training data used to train an HMM for each person in the database. This means that for each person in the database we have an HMM trained on that person's training data.

- Each test face is then scored against all these models, and it is classified to the model with the highest similarity measure. This scoring procedure was done with the reversed Viterbi algorithm, each score representing the similarity between the test face image and a trained model.

6.2 The HMM configurations

For our experiments we use two basic configurations of HMMs, using all three of the feature extraction techniques described in chapter 4. We also evaluate these six possible setups on both the available databases. For all the following discussions on the dimensions of the observations, and hence of the Gaussian mixture models, see Figure 6.1.

6.2.1 HMM configuration I

For configuration I we use a simple one-dimensional left-to-right HMM modelling down the rows of each image. It has Gaussian mixture models within its states as probability density functions, each one modelling horizontal data. We specify seven states, each state modelling a specific facial region or background, namely: top background and hair, forehead, eyes,
