
A Comparison of Gaussian Mixture Variants with Application to Automatic Phoneme Recognition

Rinus Brand

Thesis presented in partial fulfilment of the requirements for the degree Master of Science in Electronic Engineering at the University of Stellenbosch.

Supervisor: Prof J.A. du Preez

December 2007


Declaration

I, the undersigned, hereby declare that the work contained in this thesis is my own original work and that I have not previously in its entirety or in part submitted it at any university for a degree.

Signature:
Date:

Copyright © 2007 Stellenbosch University. All rights reserved.

Abstract

The diagonal covariance Gaussian Probability Density Function (PDF) has been a very popular choice as the base PDF for Automatic Speech Recognition (ASR) systems. The only choices thus far have been between the spherical, diagonal and full covariance Gaussian PDFs. These classic methods have been in use for some time, but no single document could be found that contains a comparative study of them in the context of Pattern Recognition (PR). There is also a gap between the complexity and speed of the diagonal and full covariance Gaussian implementations: the two differ drastically in accuracy, speed and model size. One or more models are needed to cover the area between these two classic methods.

The objectives of this thesis are to evaluate three new PDF types that fit into the area between the diagonal and full covariance Gaussian implementations in order to broaden the choices for ASR, to document a comparative study of the three classic methods and the newly implemented methods (from previous work), and to construct a test system that evaluates these methods on phoneme recognition.

The three classic density functions are examined and issues regarding the theory, implementation and usefulness of each are discussed. A visual example of each is given to show the impact of the assumptions made by each (if any). The three newly implemented PDFs are the sparse, Probabilistic Principal Component Analysis (PPCA) and Factor Analysis (FA) covariance Gaussian PDFs. Their theory, implementation and practical usefulness are shown and discussed, again with visual examples to show the differences in modelling methodology. The construction of a test system using two speech corpora is described, including issues involving signal processing, PR and evaluation of the results. The NTIMIT and AST speech corpora were used to initialise and train the test system, and the use of this system to evaluate the PDFs discussed in this work is explained.

The test results for the three new methods confirmed that they indeed fill the gap between the diagonal and full covariance Gaussians. In our tests the newly implemented methods produced a relative improvement in error rate of 0.3–4% over a similarly implemented diagonal covariance Gaussian, but took 35–78% longer to evaluate. Relative to the full covariance Gaussian, their error rates were 18–22% worse, but their evaluation times were 61–70% faster. When all the methods were scaled to approximately the same accuracy, they were all 29–143% slower than the diagonal covariance Gaussian (excluding the spherical covariance method).

Opsomming

Die diagonale kovariansie Gaussiese Waarskynlikheid-Digtheid-Funksie (WDF) is 'n baie populêre keuse as basis vir outomatiese spraak-herkenning sisteme. Tot dusver was die enigste keuses tussen die sferiese, diagonale en vol kovariansie Gaussiese WDFs. Alhoewel hierdie klassieke metodes al vir 'n geruime tyd in gebruik is, kon daar geen dokument gevind word wat hierdie metodes teenoor mekaar opweeg vir die gebruik in patroonherkenning nie. Daar bestaan ook 'n gaping in terme van kompleksiteit en spoed tussen die diagonale en vol kovariansie modelle. Die verskil in akkuraatheid, spoed en model-grootte tussen hierdie twee metodes is relatief groot. Daar bestaan 'n noodsaaklikheid vir een of meer modelle wat die spasie tussen hierdie twee klassieke metodes kan vul.

Die hoofdoele van hierdie tesis is die evaluasie van drie nuwe WDF-tipes wat die area tussen die diagonale en vol kovariansie Gaussiese implementasies vul om sodoende die keuses van WDF vir outomatiese spraakherkenning groter te maak, om 'n vergelyking te tref tussen al die nuut geïmplementeerde (vanaf vorige werk) en klassieke metodes en dit te dokumenteer, en om 'n toetsstelsel te implementeer wat hierdie metodes op foneemherkenning evalueer.

Die drie klassieke digtheidfunksies word elk ondersoek in terme van hul teorie, implementasie en bruikbaarheid. 'n Visuele voorbeeld van elke metode word voorsien om die impak van die aannames wat deur elk gemaak word (indien enige) voor te stel. Die drie nuwe voorgestelde WDFs is die yl kovariansie, die waarskynlikheids-gebaseerde hoofkomponent-analise kovariansie en faktor-analise kovariansie Gaussiese WDFs. Die teorie, implementasie en praktiese bruikbaarheid van hierdie metodes word gewys en bespreek. Weereens word visuele voorbeelde gebruik om die verskille in die modellering-metodieke uit te wys. Die opstel van 'n toetsstelsel wat twee spraak-databasisse gebruik, word geïllustreer, insluitende aspekte rakende seinprosessering, patroonherkenning en evaluasie van die resultate. Die NTIMIT- en AST-spraak-databasisse was gebruik vir inisialisering en afrigting van hierdie toetsstelsel. Die gebruik van hierdie stelsel om die verskeie WDFs te evalueer word verduidelik.

Toetsing van die drie nuwe metodes benadruk die feit dat hulle inderdaad die gaping tussen die diagonale en vol kovariansie metodes vul. In die verskeie toetse wat uitgevoer was, het die nuut geïmplementeerde WDFs 'n relatiewe verbetering op die fouttempo van 'n soortgelyke diagonale kovariansie Gaussiese WDF getoon van omtrent 0.3–4%. Dit het egter 35–78% langer geneem om te evalueer. Wanneer ons die metodes vergelyk teenoor die vol kovariansie Gaussiese WDF, kry ons 'n relatiewe verswakking in die fouttempo van 18–22%. Evaluasie was 61–70% vinniger. Met al die metodes geskaleer tot min of meer dieselfde akkuraatheid, was al die bogenoemde metodes 29–143% stadiger as die diagonale kovariansie Gaussiese WDF (uitsluitend die sferiese kovariansie Gaussiese WDF).

Acknowledgements

I would like to thank the following people. Without them, this work would certainly not have been possible:

• Prof. J. A. du Preez, for his advice, insights and ideas, for being genuinely interested in my work and for always keeping me motivated,
• my parents, who not only supported me financially, but also provided moral support,
• the National Research Fund (NRF), for providing me with financial aid,
• Gert-Jan van Rooyen and Jaco de Witt, for this thesis template and general help with various LaTeX issues,
• Herman Engelbrecht, for sharing his insights on many topics related (and sometimes not related) to this thesis,
• in alphabetical order: Eugene, George, Gert-Jan, Herman, Jaco and Willie, for enduring my rants, listening to and commenting on my ideas, laughing at my jokes, and the many coffees shared during the last two years,
• Hansie, Lourens, Gid and Giep, for being such great friends, for all the laughs, the coffees, braais and general support,
• Nita, who always had time for a chat, especially non-work related, and for being an inspirational friend,
• and lastly, Michelle, for all the love, support and patience over the last eight years, for helping me keep my sanity, for always being interested in my work and for always believing in me.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Objectives
  1.3 Concepts Relating to Statistical Modelling of Speech Data
  1.4 Prior Work on PDFs
  1.5 Overview of This Work
    1.5.1 The Classic Gaussian PDF variants
    1.5.2 The New Gaussian PDF variants
    1.5.3 Implementation of Test System
  1.6 Contributions

2 The Classic Distribution Models
  2.1 Introduction
  2.2 The Full Covariance Gaussian PDF
    2.2.1 Theory
    2.2.2 Implementation
    2.2.3 Strengths and Weaknesses
  2.3 The Diagonal Covariance Gaussian PDF
    2.3.1 Theory
    2.3.2 Implementation
    2.3.3 Strengths and Weaknesses
  2.4 The Spherical Covariance Gaussian PDF
    2.4.1 Theory
    2.4.2 Implementation
    2.4.3 Strengths and Weaknesses
  2.5 Summary

3 A Newer Generation of More Flexible Gaussian Models
  3.1 Introduction
  3.2 A General Linear Transform for Dimension Reduction
    3.2.1 PCA
    3.2.2 LDA
  3.3 The Sparse Covariance Gaussian PDF
    3.3.1 Theory
    3.3.2 Implementation
    3.3.3 Strengths and Weaknesses
  3.4 The PPCA Covariance Gaussian PDF
    3.4.1 Theory
    3.4.2 Implementation
    3.4.3 Strengths and Weaknesses
  3.5 The FA Covariance Gaussian PDF
    3.5.1 Theory
    3.5.2 Implementation
    3.5.3 Strengths and Weaknesses
  3.6 Summary

4 Test System Implementation
  4.1 Introduction
  4.2 Test System: Initial Model Training
    4.2.1 The NTIMIT Speech Corpus
    4.2.2 Signal Processing
    4.2.3 Pattern Recognition
  4.3 Test System: Final Model Training
    4.3.1 The AST Speech Corpus
    4.3.2 Signal Processing
    4.3.3 Pattern Recognition
    4.3.4 Testing
  4.4 Baseline Systems
  4.5 Summary

5 Experimental Results
  5.1 Introduction
  5.2 Comparative Testing
  5.3 Evaluation Speed Tests
    5.3.1 First Baseline: Twenty-four Mixture Components with a Full Training Set
    5.3.2 Second Baseline: Twenty-four Mixture Components with a Reduced Training Set
    5.3.3 Third Baseline: Three Mixture Components with a Full Training Set
  5.4 Summary

6 Conclusion
  6.1 Concluding Perspective
  6.2 Topics for Further Study

Bibliography

A Speech Corpora
  A.1 The NTIMIT Speech Corpus
  A.2 The African Speech Technology Speech Corpus

B Code Implementations

List of Figures

1.1 A data set with two classes before and after LDA transformation.
1.2 A three-state left-to-right HMM.
1.3 A typical speech signal's first two cepstral coefficients.
1.4 A GMM representation of figure 1.3.
2.1 A test dataset for illustrative purposes (2 dimensional view).
2.2 A test dataset for illustrative purposes (3 dimensional view).
2.3 Single dimension Gaussian with varying mean values.
2.4 Single dimension Gaussian with varying standard deviation values.
2.5 Top view of a full covariance GMM fit to test set (12 mixture components).
2.6 Side view of a full covariance GMM fit to test set (12 mixture components).
2.7 Top view of a diagonal covariance GMM fit to test set (12 mixture components).
2.8 Side view of a diagonal covariance GMM fit to test set (12 mixture components).
2.9 Top view of a diagonal covariance GMM fit to test set (16 mixture components).
2.10 Side view of a diagonal covariance GMM fit to test set (16 mixture components).
2.11 Top view of a spherical covariance GMM fit to test set (16 mixture components).
2.12 Side view of a spherical covariance GMM fit to test set (16 mixture components).
2.13 Top view of a spherical covariance GMM fit to test set (32 mixture components).
2.14 Side view of a spherical covariance GMM fit to test set (32 mixture components).
3.1 Top view of a sparse covariance GMM fit to test set (12 mixture components).
3.2 Side view of a sparse covariance GMM fit to test set (12 mixture components).
3.3 Top view of a sparse covariance GMM fit to test set (16 mixture components).
3.4 Side view of a sparse covariance GMM fit to test set (16 mixture components).
3.5 Top view of a PPCA covariance GMM fit to test set (12 mixture components).
3.6 Side view of a PPCA covariance GMM fit to test set (12 mixture components).
3.7 Top view of a PPCA covariance GMM fit to test set (16 mixture components).
3.8 Side view of a PPCA covariance GMM fit to test set (16 mixture components).
3.9 Top and side views of an example dataset.
3.10 Full Gaussian representation of figure 3.9 with principal components and factors.
3.11 FA covariance Gaussian representation of figure 3.9a with principal component and original factor.
3.12 Top view of a FA covariance GMM fit to test set (12 mixture components).
3.13 Side view of a FA covariance GMM fit to test set (12 mixture components).
3.14 Top view of a FA covariance GMM fit to test set (16 mixture components).
3.15 Side view of a FA covariance GMM fit to test set (16 mixture components).
4.1 A three-state left-to-right HMM.
4.2 An HMM consisting of a set of phonemes occurring in the utterance.
4.3 Example of a second order HMM.
4.4 A first order reduced HMM from the second order HMM in figure 4.3.
4.5 A single state HMM.
4.6 A three-state parallel HMM.
4.7 A combination of the single state and parallel HMMs.

List of Tables

1.1 Resultant gains/losses in error rate and speed over the diagonal covariance Gaussian PDF for each PDF type with equal mixture components.
1.2 Resultant gains/losses in speed over the diagonal covariance Gaussian PDF for each PDF type with scaled mixture components for similar accuracy (ranged between various testing systems).
4.1 Error rates for the baseline systems. These systems all use the diagonal Gaussian covariance.
5.1 Comparison of error rates and evaluation speed for differing PDF types with the same target number of mixture components.
5.2 Comparison of different PDFs to the first baseline.
5.3 Comparison of different PDFs to the second baseline.
5.4 Comparison of different PDFs to the third baseline.
5.5 Summary of all evaluation speed tests.
5.6 Summary of number of operations needed for the tests in table 5.5.

Nomenclature

Acronyms

ASR       Automatic Speech Recognition
AST       African Speech Technology
CMS       Cepstral Mean Subtraction
DCT       Discrete Cosine Transform
EM        Expectation Maximisation
FA        Factor Analysis
GMM       Gaussian Mixture Model
HLDA      Heteroscedastic Linear Discriminant Analysis
HMM       Hidden Markov Model
KLT       Karhunen-Loève Transform
LDA       Linear Discriminant Analysis
MFCC      Mel-scale Frequency Cepstral Coefficient
ML        Maximum Likelihood
ORED      Order Reducing
PCA       Principal Component Analysis
PDF       Probability Density Function
PPCA      Probabilistic Principal Component Analysis
PR        Pattern Recognition
T-BAGMM   Tree-Based Adaptive Gaussian Mixture Model

Variables

σ         Standard Deviation
σ²        Variance
ε         Noise Element
ρ         Correlation Coefficient
µ         Mean Vector
µ̂         Estimated Mean Vector
Σ         Covariance Matrix
Σ̂         Estimated Covariance Matrix
λ         Eigenvalue
Λ         Eigenvalue Matrix
Ψ         Noise Covariance Matrix
Ψ̂         Estimated Noise Covariance Matrix
diag(·)   Diagonal Operator
p(·)      Probability
B         Whitening Matrix
C         Number of Clusters/Classes
D         Dimension of set
E(·)      Expected Value
I         Identity Matrix
M         Reduced Number of Dimensions
N         Number of samples in set
P(·)      Prior
P̂(·)      Estimated Prior
R         Rotation Matrix
U         Eigenvector Matrix
W         Transform Matrix
Ŵ         Estimated Transform Matrix

Chapter 1
Introduction

1.1 Motivation

Today Automatic Speech Recognition (ASR) systems are slowly gaining popularity. Some computer operating systems (such as Microsoft Windows XP) incorporate them, and they are also becoming popular in luxury cars. Systems like these are clearly beneficial as extra input devices: when driving, it is much safer simply to 'ask' the car to perform a task such as calling a number on the phone or selecting a radio station than to do it by hand and possibly lose control of the vehicle.

The technology driving these systems has not changed drastically in the last few years. While full system implementations have been well documented [16, 23, 25], the finer details regarding the statistical models used to calculate probabilities are not well documented or compared. The motivation behind this thesis is not only to shed some light on the different statistical models (more specifically, all the Gaussian-based models), but also to run comparative tests between them and discuss the strong and weak points of each one. This includes the classic approaches as well as a few newly introduced ones, whose implementation is also discussed.

When building models of speech utterances (phonemes in the case of this work), the statistical model usually consists of a mixture of smaller models. This ensures better coverage of the statistical properties of the segment of data we want to model. For the past few years, the prevalent base model for these mixture components has been the diagonal covariance Gaussian Probability Density Function (PDF) [23].

Traditionally, three variants of the multivariate Gaussian PDF are used: the full covariance, diagonal covariance and spherical covariance Gaussian PDFs. These variants represent a range of precision, from a broad generalisation (the spherical covariance) to no generalisation (the full covariance) of the Gaussian shape. Practical results have shown (and we will show this again later in this thesis) that the spherical covariance matrix is a gross generalisation and is not practically usable in present-day ASR systems. Full covariance Gaussians are very detailed, but very slow to train and to calculate likelihoods with. The only reasonable option is the diagonal covariance Gaussian, as it is relatively fast to execute in both training and calculating likelihoods, and it retains enough information to be useful. This means that currently the diagonal covariance Gaussian is the only realistic option for use in an ASR system. We are limited by the fact that to improve the mixture model we can only raise the number of mixture components, which can introduce problems such as overtraining. If we want a more detailed model than the diagonal covariance, we are forced to use the full covariance model, which represents a huge difference in performance.

The motivation for this work is to find more intermediate options for the implementation of PDFs in an ASR system. These options should fill the void that exists between the diagonal and full covariance PDFs. Another useful addition is the description, implementation and benchmarking of such methods against their traditional counterparts.

1.2 Research Objectives

Although documentation regarding the methods discussed in this thesis exists in isolation, there is no single document encompassing all these methods together and comparing them to each other in terms of implementation. In this document we have the following objectives:

• Introducing three PDF classes from previous work [1, 8, 10, 14, 28] to be used in mixture models. These are the sparse, the Probabilistic Principal Component Analysis (PPCA) and the Factor Analysis (FA) covariance Gaussian models.
• Presenting the mathematical description of these models.
• Showing examples of the practical usage of each model, including strengths and weaknesses.
• Providing comparative results for these and the traditional models on a working ASR system.

1.3 Concepts Relating to Statistical Modelling of Speech Data

When modelling speech data, the raw sound format is first processed into a form that is easy to recognise. We want the speech data in sets of equal-length vectors to make modelling practical. Preprocessing techniques are applied to the sound data to emphasise the higher frequencies and to give all the speech segments equal average power [21]. Next we extract the feature vectors from the data as Mel-scale Frequency Cepstral Coefficients (MFCCs) using the Mel-scale filter bank and Discrete Cosine Transform (DCT) [3, 5, 21].

Next we do post-processing on the feature vectors (now consisting of cepstral coefficients). First we do Cepstral Mean Subtraction (CMS) to get rid of most of the recording channel effects. We then normalise each dimension on a set of feature vectors to have unity variance; this is done to improve the calculation accuracy when using floating point numbers on a computer. Next we take each feature vector and append the preceding four and following four feature vectors to it, which incorporates the changes of the features over time. To decrease the size (or dimension) of the resulting feature vectors, we calculate a Linear Discriminant Analysis (LDA) transform. We take the parts of the speech we want to model under a single PDF and cluster them together into separate classes. The LDA not only reduces the dimensionality of the features, but also rotates them to ensure maximum separability between the classes we chose. The dimensions that have the highest separability between the classes are retained. An illustration of LDA is seen in figure 1.1.

Figure 1.1: A data set with two classes before and after LDA transformation.

In figure 1.1 we see that before the LDA transform is applied, the two features contain similar information regarding class separability. After LDA transformation, transformed feature 1 contains the maximum class separability information. Transformed feature 2 has no separability information (the two classes overlap each other fully) and can, for the purpose of class separation, be removed. This reduces the dimension from two to one while still retaining maximum class separation.

After these steps, we have the data in a form suitable for generating descriptive models. We model each phoneme by using a Hidden Markov Model (HMM). A three-state left-to-right HMM is shown in figure 1.2. Besides the start and stop states, this HMM contains three states that can be traversed from left to right. This creates a model that has three distinct time sections. Each state represents a certain time section of the sound: state one models the beginning of the phoneme, state two the middle and state three the end.

Figure 1.2: A three-state left-to-right HMM.

Each state of the HMM contains a Gaussian Mixture Model (GMM) to describe the features expected at this part of the phoneme. This mixture model consists of a sum of weighted Gaussian PDFs. When examining speech data (figure 1.3) we see that the data is distributed in a non-Gaussian way. It is therefore not practical to use one Gaussian PDF to calculate the statistics of the data. When using a GMM, however, the data from figure 1.3 can be modelled by an accurate PDF (shown in figure 1.4).

Figure 1.3: A typical speech signal's first two cepstral coefficients.

Figure 1.4: A GMM representation of figure 1.3.

Using the above concepts we can model each phoneme with an HMM that describes certain time sections of the speech with GMMs.

1.4 Prior Work on PDFs

As mentioned before, the full, diagonal and spherical covariance Gaussian PDFs have been used for some time. Of these three, the diagonal covariance variant is the most used. Some work has been done to make up for the large gap between the full and diagonal covariance variants.

One of the ways to include more information is the use of Heteroscedastic Linear Discriminant Analysis (HLDA) [12, 18]. This method would replace normal LDA in the post-processing of the speech signal. HLDA also considers the difference in variance for each class and applies this information for better class separation. After HLDA, models are estimated from the data in the same fashion as before. With HLDA, as with LDA, we do not improve the models; we improve the data sent to the models.

The sparse covariance Gaussian PDF was developed for use in the Pattern Recognition system (PatrecII) of the Digital Signal Processing (DSP) group of Stellenbosch University (SUN). It was conceived and implemented by Prof. Johan du Preez [8, 10]. This method was developed because there was a need to build better statistical models for pattern recognition than was possible with the diagonal covariance Gaussian PDF. It also has the benefit of being able to scale to model more or less information as needed. This method has not been documented formally and will be discussed in depth in chapter 3.

The Probabilistic Principal Component Analysis (PPCA) covariance matrix is derived from the plain Principal Component Analysis (PCA) transform [1, 28]. With the PCA transform, one can extract the axes describing the most information. The PCA is commonly used for dimensionality reduction (see section 3.2). While the PCA works well on a global scale, it is difficult to use it on individual mixture components of a data set, as there is no correspondence between the PCA and a probability density structure. With the PPCA method, the PCA structure is reformulated in a maximum likelihood framework. The result of this and the previous work is a mathematical model for the PPCA method that can be used in a Gaussian mixture model (GMM) setup. The PPCA covariance matrix can also be scaled to directly store a variable amount of information: the number of principal components explicitly modelled determines how much of the information is modelled.

Factor Analysis (FA) uses the same maximum likelihood framework as the PPCA method, but with a different assumption [1]. The PPCA method assumes the discarded information can be modelled isotropically (i.e. via a spherical covariance matrix). With the FA covariance matrix the discarded data is modelled via a diagonal covariance model. Using the same structure as the PPCA method means it is equally usable in GMMs. One of the disadvantages of the FA computation is that, unlike the PPCA method, there is no closed-form solution and the parameters need to be estimated via an Expectation Maximisation (EM) algorithm implementation. An algorithm for simultaneously training the EM parameters of the GMM and the FA covariance mixture components is proposed in the literature [14]. The FA covariance matrix can be scaled in a similar manner to the PPCA covariance matrix; in this case the number of factor loadings to compute determines the complexity of the model.

It is clear that there have been a few attempts to formulate new processing methods and PDFs to make up for the gap between the diagonal and full covariance matrices. One way is to improve the information in the data by means of HLDA. Another is to retain more information from the covariance matrix and to discard information that is less important. This is what the sparse, PPCA and FA covariance matrices try to achieve. Another thing to note is the absence of any work that directly compares all these methods; even a comparison between the classic methods alone is scarce. One of the aims of this thesis is to provide such a comparison.

1.5 Overview of This Work

This section contains a summary of the work done in this thesis. It contains discussions of the methods implemented and the issues surrounding them; motivation and detail are left to the chapters describing these methods and their implementation. Chapter 2 describes the theory, implementation and properties of the three traditional Gaussian PDFs, namely the spherical, diagonal and full covariance variants. Chapter 3 does the same for the three new PDFs: the sparse, PPCA and FA covariance Gaussians. Chapter 4 describes the test system used for evaluating these methods and chapter 5 the results of these evaluations. The following sections outline the work in these chapters.

1.5.1 The Classic Gaussian PDF variants

The three classic PDFs used in GMMs are the full, diagonal and spherical covariance Gaussian PDFs. As mentioned before, the diagonal is the method used most often, as it is fast and gives reasonably good models of the data. All three of these methods are very easy to implement. The full covariance Gaussian requires estimating the covariance from the data, a well-established statistical procedure.
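As a minimal illustration of this estimate and of the diagonal and spherical simplifications described in the next paragraph, the following MATLAB/Octave sketch estimates all three from a data matrix and evaluates a full covariance log-likelihood. The example data and variable names are my own choices; this is not the PatrecII implementation used in the thesis.

    % Sketch: estimating the three classic covariance types from data.
    % X is an N-by-D matrix of feature vectors (one row per observation).
    X = randn(500, 3) * [2 0.5 0; 0 1 0; 0 0 0.2];    % example data (assumed)

    mu     = mean(X, 1);             % mean vector estimate
    sigmaF = cov(X);                 % full covariance: variances and all correlations
    sigmaD = diag(diag(sigmaF));     % diagonal covariance: discard off-diagonal terms
    sigmaS = mean(diag(sigmaF)) * eye(size(X, 2));    % spherical: average variance only

    % Log-likelihood of one test vector under the full covariance Gaussian.
    x  = X(1, :);
    d  = (x - mu)';
    D  = numel(mu);
    ll = -0.5 * (d' / sigmaF) * d - 0.5 * D * log(2*pi) - 0.5 * log(det(sigmaF));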

Determining the diagonal covariance Gaussian is just as easy: one discards all the off-diagonal components of the full covariance matrix. One can also take advantage of the fact that only the diagonal of the matrix is wanted and adjust the equations to estimate only the diagonal. Estimating the spherical covariance matrix is achieved by taking the mean of the elements on the diagonal of the covariance matrix; again the mathematics can be simplified by estimating only this value.

These three methods are all used in a GMM by means of the Expectation Maximisation (EM) algorithm [20]. The first (Expectation) step requires calculating the likelihood of each feature for each mixture component. The calculation of the likelihood depends on the type of PDF used for the component; calculating the likelihood for the simpler methods, like the diagonal and spherical covariance Gaussians, is faster than for the full covariance Gaussian. The second (Maximisation) step involves using the information of step one to regroup the feature data into the mixture components and re-estimating the component statistics. Estimation of the statistics again depends on the type of mixture component used. These two steps are repeated until the estimation converges.

The models estimated by these methods also differ. The full covariance Gaussian is the best model, as it models not only the covariance of each dimension, but also all of the correlations between dimensions. It needs a lot of training data to estimate a reasonable model, although fewer mixture components are needed than with the other PDFs. It is very slow to estimate and also slow to calculate the likelihood with.

The diagonal covariance Gaussian only models the variance of each dimension; correlations between dimensions are ignored. It needs less training data than the full covariance Gaussian, but more mixture components are needed to model data efficiently. It is fast to estimate and also fast to calculate the likelihood with. The gain in calculation time overshadows the loss in time caused by using more mixture components. This makes it more practical to use than the full covariance Gaussian.

The spherical covariance Gaussian models the average covariance of all the dimensions and ignores individual covariances and correlations. It needs very little training data, but many mixture components are needed to use it in a mixture model. It is very fast to estimate and to calculate the likelihood with. This model is really only practical when data is very scarce; the overhead of the EM algorithm is pronounced because so many mixture components are needed. This method is therefore only practical in very special circumstances.

When using these methods, the dimensionality of the feature vectors has usually been reduced via the LDA before modelling takes place (discussed later in section 3.2). After the LDA transform, the average covariance matrix of all the predetermined data clusters is the identity matrix. Each of these clusters will be individually modelled by a GMM. With the average cluster covariance matrix being the identity matrix, the data in each cluster has been scaled in such a way that its covariance matrix is as close to identity as possible. This means that correlations between dimensions are at a minimum, which makes the assumption of the diagonal covariance matrix more feasible. With the full covariance Gaussian, a lot of computation goes into the correlations between the dimensions; when these values are minimised, it makes more sense to use the diagonal covariance Gaussian. The LDA transform also favours the spherical covariance Gaussian, because the variances of the dimensions are closer to each other than before the LDA. Realistically, these clusters are not similarly shaped in speech data. Even if the average covariance matrix is the identity matrix, the individual cluster covariance matrices could still be far from identity. The LDA does, however, scale these matrices closer to the ideal than before the transform.

1.5.2 The New Gaussian PDF variants

The three new Gaussian PDFs in this thesis are the sparse, PPCA and FA covariance Gaussian PDFs. The aim of this work is to examine the implementation and performance of each of these methods in relation to the classic methods mentioned above. All three methods have a trickier implementation than the classic methods. They are all more complex than the diagonal covariance Gaussian, but they all fill the space between the diagonal and full Gaussians. Another benefit is the scalability of these methods: they can be made progressively more complex, up to the same complexity as the full Gaussian. Estimating all of them does, however, require the estimation of a full Gaussian covariance matrix first.

With the sparse covariance Gaussian we calculate the correlation coefficient matrix from this covariance matrix. The minimum correlation coefficient value is used to discard any correlations below this threshold. The correlation coefficients are then evaluated from the largest to the smallest; each value represents two dimensions that are correlated. All correlated dimensions are stored in blocks, and the length of each block is limited by the maximum block length parameter. When the blocks have been calculated, only the covariance values corresponding to these blocks are stored, and when calculating the likelihood only these blocks and the diagonal are considered.

When using the sparse covariance Gaussian, the model has full covariance within the correlation blocks, but diagonal covariance between them. If the dimensions were rearranged so that the correlated dimensions were adjacent, the resulting covariance matrix would only have blocks of non-zero values around the main diagonal. This allows for more precision between dimensions that are heavily correlated. The sparse covariance Gaussian needs more training data than the diagonal, but the amount of data needed can be controlled by the size of the blocks. When using this model in a GMM, the number of mixture components needed for a good result therefore depends on the maximum size of these correlation blocks. Estimating the models and calculating the likelihood do take longer than for the diagonal alone, but the modelling accuracy is also improved.
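As a rough illustration of the block-building idea, the MATLAB/Octave sketch below applies a simplified version with a maximum block length of two (the setting actually used for the experiments in chapter 4): correlations are ranked, pairs above the threshold are kept as 2x2 blocks, and everything else is reduced to the diagonal. The greedy pairing and variable names are my own simplification; the full PatrecII algorithm is described in chapter 3.

    % Simplified sketch of a sparse covariance structure (max block length 2).
    % sigmaF: full covariance estimate; minRho: minimum correlation coefficient.
    function sigmaSp = sparse_cov_sketch(sigmaF, minRho)
        D       = size(sigmaF, 1);
        s       = sqrt(diag(sigmaF));
        rho     = sigmaF ./ (s * s');           % correlation coefficient matrix
        sigmaSp = diag(diag(sigmaF));           % start from the diagonal model
        used    = false(D, 1);
        % Rank the off-diagonal correlations from strongest to weakest.
        [i, j]     = find(triu(true(D), 1));
        [~, order] = sort(abs(rho(sub2ind([D D], i, j))), 'descend');
        for k = order'
            a = i(k); b = j(k);
            if abs(rho(a, b)) >= minRho && ~used(a) && ~used(b)
                % Keep this correlated pair as a 2x2 block (full covariance within it).
                sigmaSp(a, b) = sigmaF(a, b);
                sigmaSp(b, a) = sigmaF(b, a);
                used(a) = true; used(b) = true;
            end
        end
    end

With a larger maximum block length, correlated dimensions are chained into bigger groups; that bookkeeping is what makes the real algorithm less trivial than this sketch, as noted in section 1.6.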

The estimation of the PPCA covariance Gaussian requires finding the principal components of the covariance matrix. The number of components to consider is determined by a preset parameter and also by the relative value of the eigenvalue corresponding to each principal component. The discarded components are generalised in a spherical covariance Gaussian. The information contained in the principal components and the spherical covariance element combine to give the PPCA covariance Gaussian representation, and when the likelihood is calculated only these values are used.

The PPCA covariance Gaussian models the directions of the principal components fully; in these directions all the information is retained. The remaining directions, orthogonal to the principal directions, are generalised to a single value (in the same way that the spherical covariance Gaussian generalises the whole covariance matrix). This method also needs more training data than the diagonal covariance method, and the required amount of data increases with every principal component. In a GMM, the number of mixture components for a good representation thus depends on the number of retained principal axes. This method also takes longer to evaluate or estimate than the diagonal does, because it uses more information.

The estimation of the FA covariance Gaussian relies on finding the factor loadings of the covariance matrix via factor analysis. This method has no closed-form solution and requires a few iterations to converge. The number of factor loadings to calculate is predetermined, and the rest of the covariance information is generalised into a diagonal covariance matrix. The information in the diagonal element and the retained factors combine to give the FA covariance Gaussian representation, and when the likelihood is determined only these parameters are considered.

The FA covariance Gaussian explicitly models the information in the factor loadings. As with the PPCA method, some directions are modelled completely. The remaining information is modelled by a diagonal matrix, just as the diagonal covariance Gaussian does. This means the directions orthogonal to the ones corresponding to the factor loadings are modelled, but the correlations between them are not. This method needs more training data than the diagonal covariance Gaussian, but the amount of data needed is determined by the number of factor loadings. In a GMM, the number of mixture components needed for a good representation therefore relies on the number of factor loadings per component. This method takes longer to estimate than the diagonal and PPCA covariance Gaussians. It also takes longer to calculate the likelihood than the diagonal does, because more information is modelled.

These three methods can be used in a GMM in a similar fashion to the classic methods. The EM algorithm is implemented in the same manner: in the expectation step, the likelihoods are calculated according to each method, and in the maximisation step, the covariance matrix of each mixture component is re-estimated depending on which PDF is used. With the FA covariance Gaussian one would have to do an iterative re-estimation for each mixture component, for each step of the EM algorithm, which would be very time consuming. To remedy this, only one iteration of the re-estimation calculation is done for each component, using the previous values of the components as the initial conditions. This proved to converge simultaneously with the EM algorithm for calculating the GMM.

1.5.3 Implementation of Test System

To accurately evaluate the above PDFs, it is necessary to implement them in a real-world ASR system. For the purpose of this thesis, two speech corpora were used. The first is the NTIMIT speech corpus (see Appendix A for more detail on the speech data sets). This corpus has time-aligned transcriptions for each utterance, which makes it possible to identify the segments of each utterance that contain the phoneme data. These segments are identified and used to create phoneme HMMs (similar to the one in figure 1.2) that model each phoneme. These HMMs are then used to evaluate a test data set (not used in training). The phonemes in this test set are evaluated and automatically transcribed. These transcriptions are then compared to the test set's (assumed) correct transcriptions. An accuracy score is calculated based on correct placings and wrong phoneme insertions. These scores are used to evaluate each PDF type used as the base for the GMMs in the HMM states.

The second data set is the AST speech corpus. This corpus does have transcriptions for each utterance, but these transcriptions are not time-aligned. A different technique, called embedded training, is used to train the phoneme HMMs for this system. This technique needs an initial set of phoneme HMMs; we use the set from the previous NTIMIT corpus to initialise this training system.

The HMM phonemes are trained with each of the six PDF types investigated in this thesis. The diagonal covariance Gaussian is used for the baseline systems, as it is the standard base PDF used in ASR. For the sparse covariance Gaussian we choose to train with a maximum block size of two and a minimum correlation coefficient value of 0.4. This results in a model very close to the diagonal covariance Gaussian, so the time difference in evaluation is due only to a few extra parameters. For the PPCA covariance Gaussian we choose to train with one principal component and we limit the eigenvalue of the evaluated components to 0.083 (see Chapter 3). Similarly, we train an FA covariance Gaussian system with one factor loading. These are again one step up in complexity from the diagonal for each model.

In doing these tests it is found that the three new PDFs do indeed model more information than the diagonal covariance Gaussian. They all give better results, with a slower evaluation time, than the diagonal covariance Gaussian. The models also have more free parameters, confirming that they store more information than the diagonal covariance Gaussian. Although the three newly implemented PDFs were more accurate, slower and larger than the diagonal covariance Gaussian, they were less accurate, faster and smaller than a full covariance Gaussian. This confirms that these methods do indeed fill the gap between the two.

We next trained a system with each method to match the baseline system accuracy by altering the number of mixture components. These systems are evaluated and the evaluation times are compared. The result of this test showed that the three newly implemented PDFs have slower evaluation times than the diagonal covariance Gaussian at the same accuracy, which further encourages the use of the diagonal covariance above any other method. Systems retrained on a data-scarce version of the AST speech corpus showed that, while there was some improvement in the number of mixture components needed in some cases, the diagonal covariance system was still the fastest for the given accuracy.

1.6 Contributions

1. Although the classic distribution models, namely the full, diagonal and spherical covariance Gaussian PDFs, are well known, not one text could be found that directly compares these methods to each other and comments on their strengths and weaknesses:

(a) Implementation of these three methods is very simple. No complicated mathematics is needed, as simple assumptions on the full covariance Gaussian lead to the diagonal and spherical covariance Gaussians.

(b) Visual examples give the reader a more intuitive idea of how these methods model the data. Full covariance Gaussians model each dimension and all correlations. The diagonal covariance Gaussian only models the variance of each dimension, and the spherical covariance Gaussian models the average covariance of all the dimensions.

(c) The comparison of these three methods confirms the popularity of the diagonal covariance Gaussian. Although the full covariance Gaussian is found to be the most accurate, it is slower and much larger than the other methods. The spherical covariance Gaussian is very inaccurate and not practically usable. The diagonal covariance Gaussian is fast and accurate enough to model the data effectively.

2. Three additional methods are considered for use in ASR systems. The sparse, PPCA and FA covariance Gaussian PDFs are specifically created to fill the void between the diagonal and full covariance Gaussians. These are not only implemented, but also compared to each other and to the classic PDFs:

(a) There are some issues regarding the implementation of these three methods. These issues are discussed and the full implementations are described; Matlab code for these implementations is supplied in Appendix B. The algorithm for calculating the sparse covariance matrix is not trivial and requires a few passes through the covariance matrix elements. The PPCA covariance matrix elements only require an eigenvalue decomposition and some minor calculations. The FA covariance matrix requires factor analysis, which is an iterative process.

(b) Some visual examples give the reader a sense of what extra information each of these three methods regards as important to model. The sparse covariance Gaussian acts in some dimensions like a full covariance Gaussian and in others like a diagonal covariance Gaussian; which correlations to model is determined by the input parameters. With the PPCA covariance Gaussian the principal directions are modelled fully, while the remaining data is generalised by a spherical covariance matrix. The FA covariance Gaussian also fully models the information in the factor loadings, but generalises the remaining information with a diagonal covariance matrix.

(c) When comparing these models to the classic methods, we find that all three improve in accuracy on the diagonal covariance Gaussian, while still being faster and smaller than the full covariance Gaussian. These methods do fit in the gap between the full and diagonal covariance Gaussians. All three have differing accuracy and speed depending on their input parameters, because each one models extra information according to its own measure. It is difficult to compare these methods directly, as their scale factors are not equal.

3. Evaluation of these methods in a real-world ASR system is needed to fully examine the differences. A test system is implemented and a few tests are run:

(a) Implementation of a phoneme recogniser is discussed broadly. Two datasets are used: one with transcriptions that are time-aligned (the exact time of each word occurrence is known) and one with transcriptions where the exact time occurrence of each word is not known. Issues regarding signal processing, pattern recognition and testing are discussed. With time-aligned transcriptions the phonemes are easy to identify and extract from the speech data. With non-time-aligned transcriptions a process called embedded training is used to train the phoneme models, since it is impossible to establish which feature vectors belong to which phoneme (or phoneme section).

(b) The results of the comparison between the PDFs are shown and discussed. The error rate gains/losses are shown in table 1.1. Again it is seen that the diagonal covariance Gaussian gives the best balance between accuracy, speed and size when comparing the classic models. In our test system, it has a relative 18% higher error rate than the full Gaussian, but is 355% faster. The spherical covariance Gaussian has a 14% higher error rate than the diagonal covariance Gaussian and is 30.6% faster. It was found, though, that even when doubling the mixture components, it still does not come close to the diagonal covariance Gaussian in accuracy and becomes progressively slower.

(c) When comparing the newly implemented models it is found that the three models improve the accuracy over the diagonal covariance Gaussian by 0.3–4% (see table 1.1). Evaluation took between 34–77% longer. It is clear that a gain in accuracy is achievable with penalties in speed. The sparse covariance and PPCA Gaussians have added functionality, because an extra parameter is used to evaluate whether correlations are strong enough to model, making them faster and smaller on data that is not heavily correlated. The FA covariance Gaussian explicitly models data, regardless of the strength of correlations, and makes up for this by modelling the discarded dimensions more accurately, with a small speed penalty over the PPCA method, ensuring higher accuracies.

                           Gain(+)/Loss(-)
    PDF (Gaussian) Type    Error Rate    Speed
    Spherical              +13.7%        +30.6%
    Full                   -18.4%        -355.8%
    Sparse                 -1.4%         -34.8%
    PPCA                   -0.3%         -69.6%
    FA                     -4%           -77.9%

Table 1.1: Resultant gains/losses in error rate and speed over the diagonal covariance Gaussian PDF for each PDF type with equal mixture components.

(d) The results of further comparisons are shown. The mixtures for each method are scaled to get close to the accuracy of the diagonal covariance Gaussian, and the speed of each method is then compared. Additional tests are run with limited mixture components and limited training data. The general results are shown in table 1.2 (note that the spherical covariance Gaussian could only yield comparative results with lower mixture components). The results in table 1.2 show that none of the methods manage to beat the diagonal covariance Gaussian. This implies that the diagonal covariance Gaussian is still the best method to use, and that taking more mixture components is more efficient than trying to model the correlations between dimensions.

                           Gain(+)/Loss(-)
    PDF (Gaussian) Type    Speed
    Spherical              -23.5%
    Full                   -(142.9–200.5)%
    Sparse                 -(28.9–53.3)%
    PPCA                   -(68.7–82.2)%
    FA                     -(53.7–61.9)%

Table 1.2: Resultant gains/losses in speed over the diagonal covariance Gaussian PDF for each PDF type with scaled mixture components for similar accuracy (ranged between various testing systems).

Chapter 2
The Classic Distribution Models

2.1 Introduction

In this chapter we focus on the classic PDFs used to model speech data. These models are mostly used in GMMs to approximate complex data. In the case of phoneme recognisers, the GMMs model each of the states of an HMM. The models discussed in this chapter are:

1. the full covariance Gaussian distribution model,
2. the diagonal covariance Gaussian distribution model, and
3. the spherical covariance Gaussian distribution model.

For each one we begin by describing the theory behind the model, including the mathematical formulae. We also discuss, where relevant, the reason for the development of the model. Implementation and the issues thereof are also discussed, followed by a discussion of the strengths and weaknesses of each model.

For illustrative purposes we select a common test dataset to illustrate the behaviour of each model. This set can be seen in figures 2.1 and 2.2. It is a 3-dimensional spiral with the following properties:

• In the X-Y plane it has the shape of an annulus that is evenly distributed between radii of 3 and 4.
• In the Z direction it has a constantly increasing slope with a Gaussian noise element added.

We use the EM algorithm [20] to train GMMs with each PDF type and examine how they fit this test set. We see how many mixture components are needed for a good representation and compare the differences between each PDF.
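The exact generator for this spiral is not given in the text; the MATLAB/Octave sketch below is only an assumed reconstruction of a set with the two properties listed above. The number of samples, the number of turns and the noise level are arbitrary choices of mine.

    % Sketch of a spiral test set with the properties listed above (assumed parameters).
    N     = 2000;                        % number of samples (arbitrary)
    theta = linspace(0, 4*pi, N)';       % two turns of the spiral (assumed)
    r     = 3 + rand(N, 1);              % uniform radius between 3 and 4 (annulus)
    x     = r .* cos(theta);
    y     = r .* sin(theta);
    z     = theta / (4*pi) * 4 + 0.2 * randn(N, 1);   % linear slope plus Gaussian noise
    data  = [x y z];

    plot3(x, y, z, '.');                 % compare with figures 2.1 and 2.2
    grid on; xlabel('x-axis'); ylabel('y-axis'); zlabel('z-axis');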

Figure 2.1: A test dataset for illustrative purposes (2 dimensional view).

Figure 2.2: A test dataset for illustrative purposes (3 dimensional view).
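The EM training referred to in section 2.1 (and outlined in section 1.5.1) can be sketched as follows for the full covariance case. This is a minimal MATLAB/Octave illustration with a crude random initialisation, a fixed number of iterations and a small regularisation term; it is not the PatrecII implementation used for the experiments, and the function and variable names are my own.

    % Minimal EM sketch for a K-component full covariance GMM (illustrative only).
    % data: N-by-D matrix, K: number of mixture components, iters: EM iterations.
    % Uses implicit expansion (MATLAB R2016b+ or Octave).
    function [w, mu, sigma] = gmm_em_sketch(data, K, iters)
        [N, D] = size(data);
        idx    = randperm(N, K);
        mu     = data(idx, :);                    % crude initialisation (assumed)
        sigma  = repmat(cov(data), [1 1 K]);
        w      = ones(1, K) / K;                  % mixture weights (priors)
        for it = 1:iters
            % Expectation: log-likelihood of every sample under every component.
            logp = zeros(N, K);
            for k = 1:K
                d          = data - mu(k, :);
                S          = sigma(:, :, k);
                logp(:, k) = log(w(k)) ...
                           - 0.5 * sum((d / S) .* d, 2) ...
                           - 0.5 * (D * log(2*pi) + log(det(S)));
            end
            % Responsibilities via log-sum-exp for numerical safety.
            m    = max(logp, [], 2);
            resp = exp(logp - m);
            resp = resp ./ sum(resp, 2);
            % Maximisation: re-estimate weights, means and full covariances.
            Nk = sum(resp, 1);
            w  = Nk / N;
            for k = 1:K
                mu(k, :)       = (resp(:, k)' * data) / Nk(k);
                d              = data - mu(k, :);
                sigma(:, :, k) = (d' * (d .* resp(:, k))) / Nk(k) + 1e-6 * eye(D);
            end
        end
    end

Called as, for example, gmm_em_sketch(data, 12, 50), this mirrors the 12-component fits shown later in this chapter; only the per-component covariance estimate changes for the diagonal and spherical variants.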

2.2 The Full Covariance Gaussian PDF

2.2.1 Theory

The full covariance Gaussian PDF is the base for all the other models discussed in this thesis. It is the model that holds the most information. It gives all the statistics over a certain data collection under the assumption that the data is normally distributed.

Let us look at the one-dimensional case first. The Gaussian PDF is mathematically formulated as

    p(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \, e^{-(x - a_x)^2 / 2\sigma_x^2}    (2.1)

where σx > 0, -∞ < ax < ∞ and p(x) is the density of x. In figure 2.3 we can see Gaussian representations for ax equal to -1, 0 and 1. It illustrates that ax denotes the mean value of the PDF, or in simpler terms, it gives us the value of x around which the distribution is centred. It can also be said that ax is the expected value of the distribution. This is evident as it has the highest likelihood.

Figure 2.3: Single dimension Gaussian with varying mean values.

In figure 2.4 we illustrate Gaussian representations with varying values of σx. This changes not only the 'spread' of the function, but also the size of the peak; the latter is because a true PDF always has an area of one. We call σx the standard deviation of the function and σx² is defined as the variance. The maximum value of the Gaussian PDF is (2πσx²)^(-1/2). The function has 0.607 times its maximum value at the points x = ax ± σx.

The Gaussian PDF is popular because it models the density function of many real-life events that are deemed to be random. The Gaussian is also a simple statistical model, as all the moments can be calculated as functions of the mean and variance, and these are the only values needed to compute the Gaussian approximation of an event [11].
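Equation (2.1) can be evaluated directly to reproduce the behaviour shown in figures 2.3 and 2.4; the following MATLAB/Octave snippet does so for the parameter values mentioned above. The grid and plotting details are my own choices.

    % Evaluate the univariate Gaussian of equation (2.1) on a grid.
    gauss1d = @(x, a, s) exp(-(x - a).^2 ./ (2*s.^2)) ./ sqrt(2*pi*s.^2);

    x = linspace(-5, 5, 501);
    figure; hold on;                     % varying mean, as in figure 2.3
    plot(x, gauss1d(x, 0, 1));
    plot(x, gauss1d(x, 1, 1));
    plot(x, gauss1d(x, -1, 1));

    figure; hold on;                     % varying standard deviation, as in figure 2.4
    plot(x, gauss1d(x, 0, 1));
    plot(x, gauss1d(x, 0, 0.5));
    plot(x, gauss1d(x, 0, 2));

    % Sanity checks from the text: peak value and the value at x = ax + sigma.
    peak  = gauss1d(0, 0, 1);            % equals (2*pi)^(-1/2)
    ratio = gauss1d(1, 0, 1) / peak;     % approximately 0.607, i.e. exp(-1/2)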

Figure 2.4: Single dimension Gaussian with varying standard deviation values.

In general we are usually working with more than one dimension in a PR problem, and thus with the multivariate Gaussian PDF. The multivariate Gaussian PDF takes a multi dimensional vector as input. It has a mean vector µ, instead of a single mean value, containing a mean value for each dimension. The variance is replaced by Σ, the covariance matrix. The covariance matrix is a square matrix with as many rows and columns as there are dimensions; each row and each column represents a certain dimension. The elements of Σ are a function of the variance and correlation between the dimensions belonging to the specific row and column. On the diagonal we have the variances of each dimension. This gives us the following expression for the multivariate Gaussian PDF:

    p(x|\mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)                (2.2)

where p(x|µ, Σ) reads "the density of x given µ and Σ", and D is the number of dimensions [27]. The mean vector µ and the covariance matrix Σ have the form:

    \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \vdots \\ \mu_{D-1} \\ \mu_D \end{bmatrix},
    \quad
    \Sigma = \begin{bmatrix}
        \sigma_1^2  & C_{12}      & C_{13}      & \cdots & C_{1(D-1)}     & C_{1D} \\
        C_{21}      & \sigma_2^2  & C_{23}      & \cdots & C_{2(D-1)}     & C_{2D} \\
        C_{31}      & C_{32}      & \sigma_3^2  & \cdots & C_{3(D-1)}     & C_{3D} \\
        \vdots      & \vdots      & \vdots      & \ddots & \vdots         & \vdots \\
        C_{(D-1)1}  & C_{(D-1)2}  & C_{(D-1)3}  & \cdots & \sigma_{D-1}^2 & C_{(D-1)D} \\
        C_{D1}      & C_{D2}      & C_{D3}      & \cdots & C_{D(D-1)}     & \sigma_D^2
    \end{bmatrix}

where Cxy = ρxy σx σy and ρxy is the correlation coefficient between dimensions x and y.
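A direct transcription of equation 2.2 could look like the sketch below. The function name is our own; in practice one would work with the log form of equation 2.5 to avoid numerical underflow.

    import numpy as np

    def mvn_density(x, mu, Sigma):
        """Evaluate the multivariate Gaussian density of equation 2.2 at one vector x."""
        D = mu.shape[0]
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^-1 (x - mu)
        norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm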

Now, if we have a set of points that we assume are Gaussian, and we have a sufficient number of points, we can approximate the first and second moments (mean and covariance) using the law of large numbers [22]. The equations for the above statistics are the following:

    \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n                (2.3)

    \hat{\Sigma} = \frac{1}{N - 1} \sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^T                (2.4)

where N is the number of samples in the data. The data used for x above is known as the training data: µ̂ is the mean of the training data and Σ̂ is estimated from the mean-subtracted training data.

Now we can take a set of test vectors y and substitute the above µ̂ and Σ̂ into equation 2.2. We can then calculate the likelihood of y belonging to the same set as the training data x. This is called scoring. Generally when we want to score a test set, we use log likelihoods instead of probabilities. The log likelihood is the log of the likelihood. This simplifies the mathematics by turning multiplication into addition and division into subtraction. When log base e is used, we can also discard the exponent operation [29]. The resultant equation for the log likelihood is:

    \log_e(p(x|\mu, \Sigma)) = -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) - \frac{D}{2} \log_e(2\pi) - \frac{1}{2} \log_e(|\Sigma|)                (2.5)

where x represents the test vector.

2.2.2 Implementation

Implementing the full covariance Gaussian PDF is as simple as implementing equations 2.3–2.5. For training we only need one iteration, as equations 2.3 and 2.4 have a closed-form solution. Thus estimating the mean and covariance matrix is sufficient for a full model of the Gaussian. If we have a series of vectors and want to know the chance that these vectors were indeed generated by the estimated model, we use equation 2.5 to get the likelihood. The likelihood of a series of vectors all originating from one model is their individual likelihoods multiplied with each other; in the log case we simply add the individual log likelihoods. If we want to translate the final answer back to a probability we take the exponent. The result will probably be a very small number, which is another reason why we work with log likelihoods: it prevents numerical underflow.

The answer for the above may not mean much numerically. To interpret it, results need to be compared. Usually one has many models and wants to know which one is more likely to have generated the given set of points. Comparing results, the one with the highest numerical value is the one most likely to have generated the data (hence the term likelihood).
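The following sketch implements equations 2.3–2.5 with numpy. The function names are our own, and the comparison at the end merely illustrates how two candidate models would be ranked by total log likelihood.

    import numpy as np

    def estimate_full_gaussian(X):
        """Closed-form estimates of equations 2.3 and 2.4; X has shape (N, D)."""
        mu = X.mean(axis=0)
        Sigma = np.cov(X, rowvar=False)              # unbiased estimate, divides by N - 1
        return mu, Sigma

    def total_log_likelihood(Y, mu, Sigma):
        """Sum of the log likelihoods of equation 2.5 over the test vectors Y, shape (M, D)."""
        D = mu.shape[0]
        diff = Y - mu
        Sigma_inv = np.linalg.inv(Sigma)
        _, logdet = np.linalg.slogdet(Sigma)
        quad = np.einsum('nd,de,ne->n', diff, Sigma_inv, diff)
        return (-0.5 * (quad + D * np.log(2.0 * np.pi) + logdet)).sum()

    # Scoring: the model with the higher total log likelihood is the more likely generator.
    # score_a = total_log_likelihood(Y, mu_a, Sigma_a)
    # score_b = total_log_likelihood(Y, mu_b, Sigma_b)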

Next we want to implement a GMM with the full covariance Gaussian as its base. The GMM is trained with the EM algorithm. The GMM is a set of Gaussian PDFs added together to form one PDF. Each component is weighted by a prior probability, and all the prior probabilities add up to one (to keep the volume of the density equal to one). A GMM representation of figure 1.3 is shown in figure 1.4 on page 5. We can see that a decidedly non-Gaussian distribution is effectively modelled by the GMM.

The EM algorithm uses Maximum Likelihood (ML) estimation to find the parameters of the GMM. The first decision is how many mixture components to use. The components are then initialised (by binary split and/or K-means for best results [27]). Iterative training then takes place. The current parameters are used to calculate the probabilities for each component (E-step). Then we take the results of the E-step and recalculate the parameters (M-step) using ML estimation [20]. The two steps are repeated until the parameters of the GMM converge. In this case, the E-step entails the following:

    P(k|x_n) = \frac{p(x_n|k) P(k)}{p(x_n)}                (2.6)

where p(x_n|k) is the likelihood of x_n being generated by mixture component k, P(k) is the prior probability of component k, and

    p(x_n) = \sum_{k=1}^{K} p(x_n|k) P(k)                (2.7)

where equation 2.7 is the likelihood of x_n being generated by the GMM. In the M-step we calculate the following:

    \hat{P}(k) = \frac{1}{N} \sum_{n=1}^{N} P(k|x_n)                (2.8)

    \hat{\mu}_k = \frac{\sum_{n=1}^{N} P(k|x_n) x_n}{\sum_{n=1}^{N} P(k|x_n)}                (2.9)

    \hat{\Sigma}_k = \frac{\sum_{n=1}^{N} P(k|x_n) (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T}{\sum_{n=1}^{N} P(k|x_n)}                (2.10)

We repeat equations 2.6 to 2.10 until the sum of equation 2.7 over all values of x_n converges.
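As an illustration, the EM re-estimation of equations 2.6–2.10 for a full covariance GMM could be sketched as follows. This is a minimal sketch under our own naming and initialisation assumptions; it omits the binary-split/K-means initialisation and iterates a fixed number of times rather than testing the convergence of equation 2.7.

    import numpy as np

    def log_gaussian(X, mu, Sigma):
        """Per-vector log likelihood of equation 2.5 for one mixture component."""
        D = mu.shape[0]
        diff = X - mu
        _, logdet = np.linalg.slogdet(Sigma)
        quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
        return -0.5 * (quad + D * np.log(2.0 * np.pi) + logdet)

    def em_full_gmm(X, means, covs, priors, n_iter=50):
        """EM re-estimation (equations 2.6-2.10) for a full covariance GMM.
        X: (N, D) data, means: (K, D), covs: (K, D, D), priors: (K,)."""
        N = X.shape[0]
        K = priors.shape[0]
        for _ in range(n_iter):
            # E-step: responsibilities P(k|x_n) of equation 2.6, computed in the log domain
            log_joint = np.stack([np.log(priors[k]) + log_gaussian(X, means[k], covs[k])
                                  for k in range(K)], axis=1)
            log_mix = np.logaddexp.reduce(log_joint, axis=1)     # log of equation 2.7
            resp = np.exp(log_joint - log_mix[:, None])
            # M-step: equations 2.8-2.10
            Nk = resp.sum(axis=0)
            priors = Nk / N
            means = (resp.T @ X) / Nk[:, None]
            covs = np.stack([((resp[:, k, None] * (X - means[k])).T @ (X - means[k])) / Nk[k]
                             for k in range(K)], axis=0)
        return means, covs, priors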

2.2.3 Strengths and Weaknesses

First we train a GMM to fit the data in figure 2.1. We find with repeated experiments that twelve mixture components give a good fit, as can be seen in figures 2.5 and 2.6. Note that the ellipses plotted in these figures are only an indication of the variance shape of the mixture components and do not indicate their full spread. The prior probabilities are also not included in these figures; we only illustrate the position (mean) and shape (variance) of each mixture component.

Figure 2.5: Top view of a full covariance GMM fit to test set (12 mixture components).

Figure 2.6: Side view of a full covariance GMM fit to test set (12 mixture components).

With twelve mixture components the full covariance GMM is a good approximation of the dataset. If we now want to score a test set against this PDF, each feature needs to be scored against twelve full covariance Gaussians.

In this lies the weakness of the full covariance Gaussian: it is mathematically very expensive. During training of the GMM, each evaluation requires as many multiplications as there are terms in the full matrix. When training is completed, the final estimated covariance matrix is inverted and stored along with its determinant (see equation 2.5).

Another weakness is the fact that a full matrix needs to be inverted. This means the matrix needs to be of full rank [19], so at least as many linearly independent features are needed as there are dimensions. For high dimensional systems (as speech models typically are) one thus needs more training data than for the other methods. To get a good estimate, full covariance Gaussians require even more data per mixture component. This implies that full covariance Gaussians usually do not fare well in data-scarce systems.

The strength of the full covariance Gaussian is its precision. By using a full matrix for each mixture component, maximum information is stored for each component. This means it uses fewer components to model a dataset than the other methods in this thesis; however, one must beware of the danger of overtraining. If one adds more components, the model complexity increases substantially. This can be seen in figure 2.5, where there is a small Gaussian at the top right: a mixture component that modelled a part of the dataset that is only slightly more dense than its surroundings. This shows that the full covariance GMM is not easily scalable via mixture component adjustment. Another strength of this method is the simplicity of implementation. Although the algorithms used are expensive, they are relatively easy to implement. If time and storage space are no issue, this method is suitable.

2.3 The Diagonal Covariance Gaussian PDF

2.3.1 Theory

The diagonal covariance Gaussian PDF is a simplification of the full covariance Gaussian PDF. It makes the simplifying assumption that there are no correlations between dimensions, in other words it assumes that all the dimensions are statistically independent. We take equation 2.2 and redefine Σ to be the following:

    \Sigma = \begin{bmatrix}
        \sigma_1^2 & 0          & 0          & \cdots & 0              & 0 \\
        0          & \sigma_2^2 & 0          & \cdots & 0              & 0 \\
        0          & 0          & \sigma_3^2 & \cdots & 0              & 0 \\
        \vdots     & \vdots     & \vdots     & \ddots & \vdots         & \vdots \\
        0          & 0          & 0          & \cdots & \sigma_{D-1}^2 & 0 \\
        0          & 0          & 0          & \cdots & 0              & \sigma_D^2
    \end{bmatrix}

The above matrix has a lot of entries that are zero, so it makes more sense to save the diagonal values of Σ in a vector rather than in a matrix. Estimating the statistics from the training data only changes the way the covariance is estimated: the mean estimation (equation 2.3) stays the same, and equation 2.4 would be replaced by:

    \hat{\sigma}_i^2 = \frac{1}{N - 1} \sum_{n=1}^{N} (x_{in} - \hat{\mu}_i)^2                (2.11)

with σ̂i² being the estimate of the ith diagonal element, x_in the ith element of x_n, and µ̂i the ith element of µ̂. With equations 2.3 and 2.11 the log likelihood becomes:

    \log_e(p(x|\mu, \sigma^2)) = -\frac{1}{2} \sum_{i=1}^{D} \frac{(x_i - \mu_i)^2}{\sigma_i^2} - \frac{D}{2} \log_e(2\pi) - \frac{1}{2} \sum_{i=1}^{D} \log_e(\sigma_i^2)                (2.12)

2.3.2 Implementation

As with the full covariance Gaussian, the implementation of the diagonal model is as easy as implementing equations 2.3, 2.11 and 2.12. This method also has a closed-form solution: as long as the training data does not change, only one iteration of these equations is needed for estimation. As before, we score the diagonal covariance Gaussian with the likelihood in equation 2.12, and the likelihood is interpreted in the same way as with the full covariance Gaussian. It is intuitive that the diagonal model is simpler than the full model; more on this in section 2.3.3. The same issues regarding interpreting the log likelihood discussed in section 2.2.2 hold for the diagonal covariance Gaussian: log likelihoods of the vectors in a set can be summed for the total likelihood, and models are scored and compared in exactly the same way.

Implementing the GMM with the diagonal covariance Gaussian as basis is very similar to the full covariance GMM. The EM algorithm is still used, and in the E-step all that changes is the way that p(x_n|k) is calculated. In the M-step all remains the same except the calculation of the covariance matrix, which now becomes:

    \hat{\sigma}_{ik}^2 = \frac{\sum_{n=1}^{N} P(k|x_n) (x_{in} - \hat{\mu}_{ik})^2}{\sum_{n=1}^{N} P(k|x_n)}, \qquad i = 1, \ldots, D                (2.13)

Again we repeat the EM iterations until convergence.
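The corresponding diagonal covariance computations could be sketched as follows; the helper names are ours, and only the component variance update of equation 2.13 is shown for the GMM case.

    import numpy as np

    def estimate_diag_gaussian(X):
        """Equation 2.3 for the mean and equation 2.11 for the per-dimension variances."""
        mu = X.mean(axis=0)
        var = X.var(axis=0, ddof=1)                  # divides by N - 1, one value per dimension
        return mu, var

    def total_log_likelihood_diag(Y, mu, var):
        """Sum of the diagonal log likelihoods of equation 2.12 over the test vectors Y."""
        D = mu.shape[0]
        quad = ((Y - mu) ** 2 / var).sum(axis=1)
        return (-0.5 * (quad + D * np.log(2.0 * np.pi) + np.log(var).sum())).sum()

    def diag_variance_update(X, resp_k, mu_k):
        """Equation 2.13: responsibility-weighted variances of mixture component k."""
        return (resp_k[:, None] * (X - mu_k) ** 2).sum(axis=0) / resp_k.sum()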

2.3.3 Strengths and Weaknesses

We train a GMM with diagonal covariance Gaussians to fit the dataset in figure 2.1. First we take twelve mixture components, the same number that gave a good model with full covariance Gaussians. The results can be seen in figures 2.7 and 2.8.

Figure 2.7: Top view of a diagonal covariance GMM fit to test set (12 mixture components).

Figure 2.8: Side view of a diagonal covariance GMM fit to test set (12 mixture components).

We can see that the diagonal covariance GMM does not create as good a model for the data as the full covariance GMM did. This is because none of the correlations between dimensions are modelled by the diagonal covariance Gaussians. One can see this in the fact that the ellipses in figures 2.7 and 2.8 only model information in the directions of the axes (in other words, in the directions of the data dimensions).

This means that the correlations between dimensions that were modelled by the full covariance Gaussians, resulting in ellipses that could be oriented in any direction (seen in figures 2.5 and 2.6), have to be modelled by additional diagonal covariance mixture components. This implies that more mixture components will give a better result. After some experimenting, a good approximation is found with sixteen components. This result is shown in figures 2.9 and 2.10. Now we need to score against sixteen diagonal covariance Gaussians for each feature.

Figure 2.9: Top view of a diagonal covariance GMM fit to test set (16 mixture components).

It is faster to calculate the likelihood for a diagonal covariance Gaussian than for a full covariance Gaussian. We do not have to invert any matrices, and for each mixture component the information that is not on the diagonal is not needed for any calculations. Considering that in this three dimensional case we removed more than half of the stored parameters for each matrix and only added a few mixture components, the result is an overall increase in calculation speed. This shows why diagonal covariance Gaussians are the dominant PDF used in GMMs in ASR systems today [23].

Because there is no matrix inversion, this method can model systems with less data better than full covariance Gaussians can. More mixture components over less data means a better realisation of the data and better modelling of individual clusters in the data. It also means that one can get a better approximation with diagonal covariance Gaussians than with full covariance Gaussians when data is scarce. One can also scale the diagonal covariance GMM easily, because adding or removing one mixture component does not amount to a large change in model complexity. Implementing this method is also easily done, as shown in section 2.3.2; it requires only a few equation changes relative to the full covariance Gaussian model. When data is scarce and/or speed is an issue, this method is preferable.

Figure 2.10: Side view of a diagonal covariance GMM fit to test set (16 mixture components).

2.4 The Spherical Covariance Gaussian PDF

2.4.1 Theory

The spherical covariance Gaussian PDF is a further simplification of the diagonal covariance Gaussian PDF. It makes the same assumption that there is no correlation between dimensions, and adds the assumption that all dimensions have the same variance. The covariance matrix in equation 2.2 becomes:

    \Sigma = \begin{bmatrix}
        \sigma^2 & 0        & 0        & \cdots & 0        & 0 \\
        0        & \sigma^2 & 0        & \cdots & 0        & 0 \\
        0        & 0        & \sigma^2 & \cdots & 0        & 0 \\
        \vdots   & \vdots   & \vdots   & \ddots & \vdots   & \vdots \\
        0        & 0        & 0        & \cdots & \sigma^2 & 0 \\
        0        & 0        & 0        & \cdots & 0        & \sigma^2
    \end{bmatrix}

The values on the diagonal are all equal, namely the mean of the diagonal values of the original covariance matrix it is derived from. It is intuitive that only the value of σ² needs to be stored. This again only influences how the covariance is estimated; the mean estimation stays the same.
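A minimal sketch of the corresponding spherical estimate and score, under the same naming assumptions as the earlier sketches:

    import numpy as np

    def estimate_spherical_gaussian(X):
        """Spherical Gaussian estimate: a single variance, the mean of the per-dimension
        variances (the average of the diagonal of the covariance matrix)."""
        mu = X.mean(axis=0)
        sigma2 = X.var(axis=0, ddof=1).mean()   # one scalar shared by all dimensions
        return mu, sigma2

    def total_log_likelihood_spherical(Y, mu, sigma2):
        """Total log likelihood with Sigma = sigma2 * I substituted into equation 2.5."""
        D = mu.shape[0]
        quad = ((Y - mu) ** 2).sum(axis=1) / sigma2
        return (-0.5 * (quad + D * np.log(2.0 * np.pi) + D * np.log(sigma2))).sum()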
