Process Monitoring with Restricted Boltzmann Machines

by

John Matali Moody

Thesis presented in partial fulfillment of the requirements for the Degree

of

MASTER OF SCIENCE IN ENGINEERING

(EXTRACTIVE METALLURGICAL ENGINEERING)

in the Faculty of Engineering

at Stellenbosch University

Supervisor

Prof C. Aldrich

Co-Supervisor

Dr C. Dorfling

April 2014


DECLARATION

By submitting this thesis electronically, I declare that the entirety of the work contained therein is my own, original work, that I am the sole author thereof (save to the extent explicitly otherwise stated), that reproduction and publication thereof by Stellenbosch University will not infringe any third party rights and that I have not previously in its entirety or in part submitted it for obtaining any qualification.

John Moody

19/02/2014


Copyright © 2014 Stellenbosch University All rights reserved


ABSTRACT

Process monitoring and fault diagnosis are used to detect abnormal events in processes. The early detection of such events or faults is crucial to continuous process improvement. Although principal component analysis and partial least squares are widely used for process monitoring and fault diagnosis in the metallurgical industries, these models are linear in principle; nonlinear approaches should provide more compact and informative models. The use of autoassociative neural networks, or autoencoders, provides a principled approach for process monitoring. However, until very recently, these multiple-layer neural networks have been difficult to train and have therefore not been used to any significant extent in process monitoring.

With newly proposed algorithms based on the pre-training of the layers of the neural networks, it is now possible to train neural networks with very complex structures, i.e. deep neural networks. These neural networks can be used as autoencoders to extract features from high-dimensional data. In this study, the application of deep autoencoders in the form of restricted Boltzmann machines (RBMs) to the extraction of features from process data is considered. To date, these networks have mostly been used for data visualization and have not yet been applied in the context of fault diagnosis or process monitoring. The objective of this investigation is therefore to assess the feasibility of using restricted Boltzmann machines in various fault detection schemes. The use of RBMs in process monitoring schemes is discussed, together with the application of these models in automated control frameworks.

Keywords: RBM, autoencoders, dimensionality reduction, process monitoring


OPSOMMING

Process monitoring and fault diagnosis are used to detect abnormal events in processes. The early detection of such events or faults is essential for the continuous improvement of processes. Although principal component analysis and partial least squares are widely used for process monitoring and fault diagnosis in the metallurgical industries, these models are linear in principle; nonlinear approaches should provide more compact and informative models. The use of auto-associative neural networks, or autoencoders, offers a more principled approach to achieve this. Until recently, however, these multiple-layer neural networks were difficult to train and have therefore not been used to any significant extent in process monitoring.

Newly proposed algorithms, based on pre-training the layers of the neural networks, now make it possible to train neural networks with very complicated structures, i.e. deep neural networks. These neural networks can be used as autoencoders to extract features from high-dimensional data. In this study, the application of deep autoencoders in the form of restricted Boltzmann machines to the extraction of features from process data is considered. To date, these networks have mostly been used for data visualization and they have not yet been applied in the context of fault diagnosis or process monitoring. The aim of this investigation is therefore to assess the feasibility of using restricted Boltzmann machines in various fault detection schemes. The use of restricted Boltzmann machine features in process monitoring schemes is discussed, together with the application of these models in automated control frameworks.

Keywords: restricted Boltzmann machines, autoencoders, dimensionality reduction, process monitoring


ACKNOWLEDGEMENTS

I hereby express my gratitude to the following for making this work a success:

To my supervisors: Professor Aldrich, for your valuable technical guidance, encouragement and patience; and Dr Dorfling, for taking over the project and ensuring its completion.

For all the assistance throughout my studies, a special thank you to Phillip for all your input.

For financial assistance, a big thank you to the Rio Tinto Rössing Uranium employee bursary scheme.

For emotional support, I am very grateful to my wife Kahundu, my friends and family.

For all I am, to my Lord and Saviour Jesus Christ for seeing me through.

Finally, I dedicate this work to my two children, Ray and Nankole.


TABLE OF CONTENTS

Declaration
Abstract
Opsomming
Acknowledgements
Table of Contents

Chapter 1 Introduction
1.1 Process Monitoring and Fault Diagnosis
1.2 Restricted Boltzmann Machines
1.3 Problem Statement
1.4 Research Objectives
1.5 Thesis Layout

Chapter 2 Multivariate statistical process control – literature review
2.1 Basics of MSPC (Multivariate Statistical Process Control)
2.1.1 Univariate Statistical Process Control
2.1.2 Multivariate Statistical Process Control
2.2 Process Monitoring and Fault Diagnosis
2.2.1 Feature extraction process fault diagnosis
2.2.2 Fault detection characteristics
2.2.3 Principal Component Analysis
2.3 Developments in Nonlinear Feature Extraction Fault Detection
2.3.1 Neural Networks
2.3.2 Nonlinear PCA with autoassociative neural networks
2.3.3 Kernel PCA
2.3.4 Random forests
2.3.5 Biplots

Chapter 3 Restricted Boltzmann Machines
3.1 Boltzmann Machines
3.1.1 The Restricted Boltzmann Machine
3.1.2 Training the RBM
3.2 Stacked Restricted Boltzmann Machines
3.2.1 Dimensionality reduction using autoencoders
3.2.2 Stacked Autoencoder with RBM pre-training
3.2.3 Stacked RBM network architecture
3.3 Review of Applications of Restricted Boltzmann Machines
3.3.1 Reconstruction of images
3.3.2 Using Autoencoders for Mammogram Compression
3.3.3 Face Recognition
3.3.4 Classification & filtering

Chapter 4 Process Monitoring with RBM Methodology
4.1 Feature Extraction Overview
4.1.1 Feature Extraction fault diagnosis
4.2 Design Issues with RBM Fault Diagnosis
4.2.1 Reduced feature space dimension
4.2.2 Mapping and demapping functions
4.2.3 Feature characterization
4.2.4 Contribution calculations
4.3 Process Monitoring Methodology
4.3.1 Principal Components Analysis Fault Diagnosis
4.3.2 Restricted Boltzmann Machines Fault Diagnosis
4.4 Experimental Procedures Implemented
4.4.1 PCA
4.4.2 RBM

Chapter 5 Case Studies
5.1 PGM Data
5.1.1 Feature Extraction
5.1.2 Selecting number of features
5.1.3 Fault detection
5.1.4 Variable Contributions
5.2 Copper Flotation Data Set (datacop)
5.2.1 Feature Extraction
5.2.2 Selecting number of features
5.2.3 Fault detection
5.2.4 Variable Contributions
5.3 The Tennessee Eastman Process
5.3.1 Selecting the number of features
5.3.2 Missing alarm rates
5.3.3 Variable Contributions
5.3.4 Discussion of the Tennessee Eastman Process

Chapter 6 Conclusions and Recommendations
6.1 Conclusions on objectives
6.2 General conclusion
6.3 Recommendations

References
Appendix A: Nomenclature
Appendix B: List of Figures
Appendix C: List of Tables
Appendix D: Data Characteristics

CHAPTER 1 INTRODUCTION

1.1 Process Monitoring and Fault Diagnosis

Process monitoring and fault diagnosis are used to detect faults or abnormal events in processes. The early detection of these events or faults is crucial to continuous process improvement. Traditional methods have been based on mechanistic or causal process models. However, such models are not always available or may be expensive to construct, and alternative approaches based on multivariate statistical process control have therefore been proposed. These models are based on empirical correlations built from normal plant operating data when common cause variation is present.

A fault can be defined as an unpermitted deviation of at least one characteristic property of a variable from its normal, acceptable behaviour (Isermann, 1997). A fault is therefore a state that may lead to a malfunction or failure of a system, which in turn results in process inefficiencies. Fault diagnosis has increasingly become an area of great importance in process control and automation. It provides a framework in which data are monitored and submitted to fault detection schemes. The fault detection scheme records alarms whenever faults are detected. These faults are then identified and classified according to their nature, and their sources are traced.

The fault detection and diagnosis techniques that are used normally depend on process models. The process data from the plant historians are input to fault detection algorithms, and comparisons are then made with the corresponding plant outputs. A difference in these comparisons is an indication that a fault has occurred and should be investigated. Once the type of fault is known, it can be classified and corrective measures can be put in place to remedy the fault.


As useful as they are, linear methods (such as principal component analysis) do have significant limitations. Although a large range of different nonlinear approaches have been considered to date, none of these approaches solves all problems all the time. A major limitation of current linear feature extraction benchmarks is their linear nature: it has been found that using a linear method to extract features from nonlinear data can be inadequate (Dong & McAvoy, 1996).

The diversity found in process data structures motivates the exploration of other feature extraction methods. In light of this, many statistical inference techniques, such as neural networks (Dong & McAvoy, 1996; Zhu & Li, 2006), kernel methods (Lee et al., 2004; Cho et al., 2005), random forests (Auret & Aldrich, 2010b) and many others, have been investigated for feature extractive fault diagnosis.

1.2 Restricted Boltzmann Machines

Even though principal component analysis and partial least squares have generally been used in process monitoring and fault diagnosis, these models are linear in principle; nonlinear approaches are therefore likely to provide more accurate, compact and informative models. The use of autoassociative neural networks, or autoencoders, provides a better approach to achieve this. However, until very recently, these multiple-layer neural networks have been difficult to train and have therefore not been used to any significant extent in process monitoring.

With newly proposed algorithms that are based on the pre-training of the layers of the networks, it is now possible to train neural networks with complex structures, which are referred to as deep neural networks. These neural networks can be used as autoencoders to extract features from high-dimensional data.

Restricted Boltzmann machines have been used in many applications as generative models for different types of data, including images (Hinton et al., 2006). Furthermore, restricted Boltzmann machines are of particular interest because they are used as the building blocks of deep belief networks, which can have many layers and are hence efficient at representing complicated distributions (Bengio, 2009). Learning one hidden layer at a time is, in effect, a very good way to train deep networks that have many hidden layers and millions of weights. Even though the learning is completely unsupervised, the highest-level features that the network learns are usually much more useful for classification tasks than the raw data vectors.

These deep networks can then be fine-tuned with the backpropagation algorithm to perform better at classification or even dimensionality reduction problems (Hinton & Salakhutdinov, 2006). Because these RBMs can be stacked in deep learning schemes and are generative models, their use as a nonlinear approach in process monitoring and fault diagnosis is investigated here. In view of the above, the usefulness of the features extracted with these networks is key to using RBMs for fault diagnosis.

In this study, the application of deep autoencoders in the form of restricted Boltzmann machines (RBMs) to the extraction of features from process data is considered. These networks have mostly been used for data visualization to date and have not yet been applied in the context of fault diagnosis or process monitoring. The objective of this investigation is therefore to assess the feasibility of using restricted Boltzmann machines in various fault detection schemes. The use of RBM-extracted features in process monitoring schemes is discussed, together with the application of these models in automated control frameworks.

An autoencoder with RBM pre-training will be used to extract features from data, and these features will be used as a basis for process fault diagnosis in several case studies.


1.3 Problem Statement

Many chemical and metallurgical processes are characterized by highly nonlinear and complex dynamics, with long time constants and significant delays. A great deal of research has been done on nonlinear process monitoring techniques over the past two decades. This research is driven by the fact that most processes are nonlinear, while the methods used to model, monitor and control them are predominantly linear. Although the linear methods being used can model and monitor processes with some degree of accuracy, there are instances where they fail to capture the nonlinearity inherent in the process. Various nonlinear methods have been developed over the years, some of which are already being used in different applications in the chemical and mineral processing industry. No single technique possesses all the desirable features to accurately model and monitor all processes; hence, there is a need to find more, and even better, monitoring techniques.

1.4 Research Objectives

The overall objective of this study is to assess the feasibility of using restricted Boltzmann machines in various fault detection schemes. This objective will be covered by the following tasks:

• A literature review of feature extraction fault diagnosis and the applications of restricted Boltzmann machines.

• Numerical work in which features are extracted from process data with restricted Boltzmann machines (RBMs) and used as the basis for process fault diagnosis in several case studies.

• Comparison and evaluation of the results with other nonlinear approaches.


1.5 Thesis Layout

The rest of this thesis is organised as follows: Chapter 2 reviews the literature on multivariate statistical process control and gives an overview of fault diagnosis. Chapter 3 deals with the theoretical framework of the restricted Boltzmann machine and how these machines are used as the basis for multilayer autoencoders; the applications of restricted Boltzmann machines for feature extraction are also discussed in this chapter. Chapter 4 describes the methodology used in the study. The application of the RBM methodology in several case studies is dealt with in Chapter 5. Chapter 6 presents the conclusions and recommendations from the study.


CHAPTER 2 MULTIVARIATE STATISTICAL PROCESS CONTROL – LITERATURE REVIEW

This chapter briefly reviews the basics of MSPC (multivariate statistical process control) and gives an overview of current nonlinear methods used for process monitoring. The overview is not meant to be exhaustive, but to give an outline of what has been studied in the industry in recent years.

2.1 Basics of MSPC (Multivariate Statistical Process Control)

Statistical process control can be a powerful tool for characterizing a chemical process in both normal and abnormal conditions. Once the process is characterized, statistical process control can be used to monitor and give early warning of existing, or developing, abnormal conditions.

2.1.1 Univariate Statistical Process Control

For most metallurgical and chemical processes, there is a process control system in which large amounts of data are collected and stored on historian data servers. These data are used to detect and correct problems and process inefficiencies; in univariate monitoring, only a single variable is considered at a time. Even though the data are available and can be queried from the databases at any time, it is usually difficult for anyone to use these data to determine whether the process is being controlled according to the set control parameters.

Statistical process control charts, such as the Shewhart chart, the cumulative sum chart and the exponentially weighted moving average chart (Venkatasubramanian et al., 2003a), are well established and are used in many plants to determine how well the process is performing. An example of a univariate statistical process control chart is shown in Figure 2.1, where Var represents the variable being monitored, LCL is the lower control limit and UCL is the upper control limit.

The confidence limits, i.e. the lower and upper control limits, are usually calculated and then used as a basis for detecting process deviations. When a confidence limit is exceeded, it shows that a fault has occurred and that the process is no longer operating according to the set conditions. These control limits are usually calculated from normal operating condition (NOC) data, which are collected when the plant is operating at the desired, optimum conditions.

Figure 2.1: Univariate statistical control chart
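As a minimal illustration of how such limits might be computed (not part of the original study; the data and the conventional three-sigma rule below are assumptions), consider:

```python
import numpy as np

def shewhart_limits(noc, k=3.0):
    """Control limits estimated from NOC data using the k-sigma rule."""
    mu, sigma = np.mean(noc), np.std(noc, ddof=1)
    return mu - k * sigma, mu + k * sigma   # (LCL, UCL)

# Flag samples of a monitored variable that fall outside the limits
rng = np.random.default_rng(0)
noc = rng.normal(50.0, 2.0, size=500)       # normal operating condition data
new = rng.normal(50.0, 2.0, size=100)
new[60:] += 8.0                             # simulated process shift (a fault)
lcl, ucl = shewhart_limits(noc)
alarms = np.where((new < lcl) | (new > ucl))[0]
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}, first alarm at sample {alarms[0]}")
```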

In this process diagnosis and monitoring scheme, only one variable is measured and hence tested. This scheme does not perform well for processes in which there are high correlations among the observed process variables. One of its disadvantages is that, for a single process, many variables are available that could be monitored and even controlled (Stefatos & Ben Hamza, 2007). The monitoring scheme treats the variables independently and, as a result, it only extracts the deviations in each variable independently of all the others. In the process, it ignores the correlation structure between these variables, and process deviations may consequently not be detected at all by the scheme. The use of multivariate statistical process control methods can provide a better alternative.

The need for multivariate data analysis arises when monitoring process performance becomes critical because the number of measured process variables increases. This is briefly discussed in the next section (2.1.2).

2.1.2 Multivariate Statistical Process Control

Multivariate statistical process control (MSPC) is an advanced statistical method that attempts to identify the critical variables and patterns in process data. It also shows the relationships between the process variables and how they affect each other. This is important and applicable when dealing with complex metallurgical and chemical processes.

As with the Shewhart charts, process data are identified that define the desired normal operating conditions (NOC data). An analysis is then performed that does not isolate individual variables, so as to ensure that the correlations between the variables are also captured. The major benefit of MSPC compared with univariate monitoring is that the correlation between the original variables is included in the analysis, which decreases the chance of missing an out-of-control situation owing to the correlation inherent in the data (Thissen et al., 2001).

There are many advantages of multivariate compared with univariate statistical process control, some of which have already been outlined in the foregoing discussion. Multivariate analysis can simplify the work of process operators, in that it can show all the process variables, including relationships that cannot be detected using univariate statistics. As a result, there is no need to construct process control charts for each variable. Such an analysis is able to reveal the correlation between process parameters and how they are related to faults detected in the analysis. MSPC therefore assists in understanding the interaction between variables, which makes it possible to create models that can predict effects on the process before changes are actually implemented.

2.2 Process Monitoring and Fault Diagnosis

2.2.1 Feature extraction process fault diagnosis

In modern chemical and metallurgical plants, process data provide the basis for monitoring product quality, process control and improvement. With the advances in instrumentation and data management technology, large volumes of process data are collected and stored on plant servers. There is a great deal of correlated information in the process variables being measured and stored. The information in these stored data should therefore be extracted in such a way that the essential information is retrieved.

To ensure that the data collected and stored in process and chemical plants are utilized for process control and optimization, it is crucial that the significant features in these data are extracted and analysed. This approach of extracting features from high-dimensional data enables plant engineers and metallurgists to understand the process better. Principal component analysis is commonly used for this purpose, as are other techniques such as partial least squares, Sammon maps and multidimensional scaling (Zhang, 2009).


Process fault diagnosis can be viewed as a series of mappings of measured process variables. The first mapping is the transformation from the process measurement space, i.e. normal operating data, to the feature space (this is not necessarily always the case, but it is in this particular instance). Secondly, a learning algorithm or method is used to map this feature space onto a decision space. The mapping from the feature space to the decision space is made in such a way that it meets some objective function. There are two categories of methods for developing the feature space from the measurement space, namely feature selection and feature extraction. In feature selection, one simply selects a few important measurements from the original measurement space (Venkatasubramanian et al., 2003c).

Feature extraction, in contrast, is the transformation of high-dimensional data into a useful representation of reduced dimensionality, using many different techniques depending on the application. The technique may be linear, as in principal component analysis, but many nonlinear techniques also exist that can be utilised. Dimensionality reduction can be illustrated as follows:

Non-linear dimensionality reduction

Assume that X is a data set represented as an $n \times D$ matrix consisting of n data vectors of dimensionality D, with intrinsic dimensionality d < D. The intrinsic dimensionality of data is defined as the minimum number of parameters needed to account for the observed properties of the process data (Van der Maaten et al., 2009). During dimensionality reduction, the reduced dimensionality d contains the features that are extracted and used in process monitoring. This feature space must retain the geometry of the original data as far as possible, and hence contains the significant features that represent the original data.


2.2.2 Fault detection characteristics

To select a desired feature extraction method for fault detection and diagnosis, different approaches are compared. In this comparison, certain characteristics or standards are used to show how the methods perform. These characteristics are not all meant to be satisfied by a single fault detection method; rather, they give an indication of how different approaches compare. Some of the desirable characteristics looked for in fault detection and diagnostic algorithms are (Venkatasubramanian et al., 2003c):

o Quick detection and diagnosis:

An algorithm should be quick to detect and diagnose faults in a process control system. The time taken to detect these faults normally depends on the process being analysed, as the retention times of processes differ. Nevertheless, it is important that the quick detection of faults does not generate many false alarms, as these become a nuisance in the system.

o Adaptability:

Processes have a tendency to change and evolve as a result of changes in external inputs, production quantities and the quality of consumables. The diagnostic system should be able to adapt to these changes, and it has to be designed such that changes in operating parameters can be captured and updated.

o Explanation facility:

Besides the ability to identify the source of a fault, a diagnostic system should also provide explanations of the origin of the identified fault. If the source of the fault is known, ways of taking corrective action and design improvements can then be investigated.


o Modelling requirements:

The amount of time and resources spent on modelling has to be kept minimal for fast and easy deployment of the fault detection scheme. A system that uses many resources for modelling may not be ideal, as more time and resources will be spent on the system than on improving the process.

In the next section, principal component analysis is discussed, as it is the benchmark in multivariate statistical process control.

2.2.3 Principal Component Analysis

Principal component analysis (PCA) is a linear multivariate statistical method, generally used for data compression and information extraction, that projects high-dimensional data onto a space with significantly lower dimensionality. Specifically, PCA transforms a set of highly correlated variables into a smaller set of new, uncorrelated variables called principal components (PCs). PCA takes advantage of the redundant information that exists in highly correlated variables to reduce the dimensionality. Mathematically, PCA relies on an eigenvector decomposition of the covariance or correlation matrix of the process variables.

Principal components are orthogonal to each other and are linear combinations of the original variables. They are traditionally ordered in decreasing order of eigenvalue and are oriented in the directions of maximum variance. In most cases, only the first few principal components, which explain most of the variation in the data, are retained in the analysis. To handle variables with different amplitudes and frequencies, all the process measurements are usually mean-centred and scaled before the PCA analysis is done (Rosen & Lennox, 2001). This is standard practice in process monitoring and fault diagnosis.


Principal Component Analysis

• For a data set X (n observations by m variables), create a covariance matrix E:

$\mathbf{E} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}$ (Eqn. 1)

• Calculate the eigenvectors V and eigenvalues Λ of the covariance matrix E using eigenvalue decomposition:

$\mathbf{E} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T$ (Eqn. 2)

• Determine the reduced dimensionality a that captures significant variance.

• Define the loading matrix (principal components) P as the first a eigenvectors of V.

• Calculate the principal component scores:

$\mathbf{T} = \mathbf{X}\mathbf{P}$ (Eqn. 3)

The columns of the matrix P are known as loadings, while the elements of the matrix T are called scores. The scores are the values of the original process variables mapped into the reduced-dimensional space. In the context of feature extraction, the score vectors obtained by projecting the process variables onto the principal components can be considered the extracted features. The number of principal components to use in calculating the features can be determined by investigating the cumulative variance accounted for as additional principal components are included (Zumoffen & Basualdo, 2008).

The scores can be transformed back into the original space as follows:

$\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^T$ (Eqn. 4)

The residual matrix R is then evaluated as

$\mathbf{R} = \mathbf{X} - \hat{\mathbf{X}}$ (Eqn. 5)

Finally, the original input data can be recovered as

$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{R}$ (Eqn. 6)
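For illustration, a minimal NumPy sketch of Eqns 1-6 might look as follows (the 90% variance threshold and the synthetic data are assumptions for the example, not choices made in this study):

```python
import numpy as np

def pca_fit(X, variance_to_keep=0.90):
    """PCA via eigendecomposition of the covariance matrix (Eqns 1-3)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # mean-centre and scale
    n = Xs.shape[0]
    E = (Xs.T @ Xs) / (n - 1)                          # covariance matrix, Eqn 1
    eigvals, V = np.linalg.eigh(E)                     # eigendecomposition, Eqn 2
    order = np.argsort(eigvals)[::-1]                  # decreasing eigenvalue order
    eigvals, V = eigvals[order], V[:, order]
    cumvar = np.cumsum(eigvals) / np.sum(eigvals)
    a = int(np.searchsorted(cumvar, variance_to_keep)) + 1  # retained PCs
    P = V[:, :a]                                       # loadings
    T = Xs @ P                                         # scores, Eqn 3
    return Xs, P, T, eigvals, a

# Reconstruction and residuals (Eqns 4-6) on synthetic correlated data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
Xs, P, T, eigvals, a = pca_fit(X)
X_hat = T @ P.T          # Eqn 4
R = Xs - X_hat           # Eqn 5; Xs = T P^T + R recovers the scaled data (Eqn 6)
```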


Process Monitoring with PCA

After a PCA model based on historical data has been constructed, multivariate control charts based on Hotelling's T² and the squared prediction error (SPE), or Q, can be plotted. The process monitoring scheme is then reduced to only two variables, T² and Q, which characterize two orthogonal subsets of the original data space. Hotelling's T² represents the major variation in the data and Q represents the random noise in the original data (Garcia-Alvarez et al., 2009). Hence, T² explains the variation within the score space using all the retained PCs. Hotelling's T² value is calculated as

$T^2 = \mathbf{x}^T \mathbf{P} \boldsymbol{\Lambda}_a^{-1} \mathbf{P}^T \mathbf{x}$ (Eqn. 7)

where $\boldsymbol{\Lambda}_a$ is the square matrix formed by the first a rows and columns of Λ. The process is considered to be normal if

$T_\alpha^2 \leq \frac{a(n^2 - 1)}{n(n - a)} F_\alpha(a, n - a)$ (Eqn. 8)

where $F_\alpha(a, n - a)$ is the Fisher-Snedecor distribution with $(a, n - a)$ degrees of freedom and α the level of significance.

The Q statistic, or squared prediction error (SPE), measures the variability that breaks the normal process correlation in the data. Mathematically, Q is obtained as the sum of the squared errors in the residual space, defined for the jth sample as

$Q_j = (\mathbf{x}_j - \hat{\mathbf{x}}_j)(\mathbf{x}_j - \hat{\mathbf{x}}_j)^T$ (Eqn. 9)

The Q statistic is thus a measure of the amount of variation in each sample that is not captured by the retained PCA model.


The detection threshold for the squared prediction error can be calculated as

$Q_\alpha = \theta_1 \left[ \frac{c_\alpha \sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0}$ (Eqn. 10)

where $\theta_i = \sum_{j=a+1}^{n} \lambda_j^i$, $h_0 = 1 - \frac{2\theta_1 \theta_3}{3\theta_2^2}$, $c_\alpha$ is the value of the normal distribution at significance level α, and $\lambda_j$ is the jth eigenvalue of E (Alcala & Joe Qin, 2011).

The values of these two statistics are also calculated for new data. If, at a specific point, T² or Q for the new data set falls outside the calculated control limits, the process is said to be out of control at that point; this may mean that a fault has occurred there. When a fault has been detected using either the T² or the Q statistic, it is crucial to identify its cause. This can be done using contribution plots of the original data. In a PCA model, two types of contribution plots are used to identify the fault, since two types of control chart are used, i.e. one for the residuals and one for Hotelling's T² (Teppola et al., 1998).

The residual plots show the Q residual values plotted against the samples, which shows the time at which the fault occurs. The contribution plots are computed to determine what type of fault has been detected. They are calculated by computing the means of the columns of the residual matrix R based on the faulty data set (Ralston et al., 2004). The contribution plots are then used to determine which variables are associated with the detected faults.

To determine whether an individual variable's contribution to the T² value is significant or not, one can calculate control limits for the contribution plots. It is also possible to compare the size of a variable's contribution under faulty conditions with the size of the same variable's contribution under the desired normal operating conditions. The variables with the largest contributions to the T² value therefore normally indicate the source of the fault (Johnson & Wichern, 2007).
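Continuing the illustrative PCA sketch above, the detection limits and monitoring statistics of Eqns 7-10 could be computed along these lines (again a sketch, not the implementation used in this study; `scipy.stats` supplies the F and normal quantiles, and the significance level is a placeholder):

```python
import numpy as np
from scipy import stats

def t2_limit(a, n, alpha=0.01):
    """T^2 control limit from the F-distribution (Eqn 8)."""
    return a * (n**2 - 1) / (n * (n - a)) * stats.f.ppf(1 - alpha, a, n - a)

def q_limit(eigvals, a, alpha=0.01):
    """SPE/Q control limit (Eqn 10); eigvals sorted in decreasing order."""
    th1, th2, th3 = (np.sum(eigvals[a:] ** i) for i in (1, 2, 3))
    h0 = 1 - 2 * th1 * th3 / (3 * th2**2)
    c = stats.norm.ppf(1 - alpha)
    term = c * np.sqrt(2 * th2 * h0**2) / th1 + 1 + th2 * h0 * (h0 - 1) / th1**2
    return th1 * term ** (1 / h0)

def t2_q(Xs_new, P, eigvals, a):
    """T^2 (Eqn 7) and Q (Eqn 9) for each row of scaled new data."""
    T = Xs_new @ P
    t2 = np.sum(T**2 / eigvals[:a], axis=1)
    R = Xs_new - T @ P.T                  # residuals
    q = np.sum(R**2, axis=1)              # squared prediction error per sample
    return t2, q, R

# Flag samples whose statistics exceed the limits; for a flagged sample j,
# the squared residuals R[j]**2 serve as simple variable contributions to Q,
# pointing to the variables associated with the fault.
t2, q, R = t2_q(Xs, P, eigvals, a)
alarms = np.where((t2 > t2_limit(a, len(Xs))) | (q > q_limit(eigvals, a)))[0]
```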


Principal component analysis has been applied in many different areas, such as science, biology and engineering, but despite all these applications it has its difficulties as well. Limitations of the PCA methodology include its lack of exploitation of autocorrelation (Venkatasubramanian et al., 2003b) and its linear nature. To address these and other drawbacks of PCA, several extensions of PCA have been developed, some of which are discussed in the next section.

2.3 Developments in Nonlinear Feature Extraction Fault Detection

To capture the nonlinear nature of measured process data for fault diagnosis, many feature extraction strategies have been investigated and studied over the years. The overview of these nonlinear feature extraction methods given here is not meant to be exhaustive, but only to highlight the different approaches to nonlinear feature extraction available in the literature. The body of literature is relatively large, and hence only a brief review is given in this section.

2.3.1 Neural Networks

A neural network is an architecture made up of large numbers of units called neurons. An example of a neuron is shown in Figure 2.2. The neuron shown consists of n inputs, $x_1, x_2, x_3, \ldots, x_n$. These inputs come from a variety of sources: they may originate from other units in the network structure, or may even come from external sources (Pollard et al., 1992). The output y of the unit in this network is given as

$y = \frac{1}{1 + e^{-A}}$ (Eqn. 11)

where

$A = \text{the element activation} = \sum_{i=1}^{n} w_i x_i$ (Eqn. 12)

and $w_i$ is a weight factor.
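As a minimal illustration of Eqns 11-12 (not taken from the thesis; the input and weight values are arbitrary):

```python
import numpy as np

def neuron_output(x, w):
    """Output of a single sigmoid unit (Eqns 11-12)."""
    A = np.dot(w, x)                    # element activation, Eqn 12
    return 1.0 / (1.0 + np.exp(-A))     # logistic output, Eqn 11

print(neuron_output(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.3])))
```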

Figure 2.2: Artificial neuron

These units or neurons are arranged in layers, as shown in Figure 2.3. The network shown there has three layers. The first layer consists of the neurons that receive the inputs to the network; these inputs come from external sources. This is the layer that interacts with the outside environment. The neurons in the first layer then act as inputs to the second layer. In the same manner, the neurons in the third layer get their inputs from the second layer and, in the case of a single hidden layer, the third layer is the output of the entire network to the outside environment.

Since the second layer has no direct connections with the environment, as it only interacts with the first (input) and the third (output) layers, it is called the hidden layer. The number of hidden layers can be more than one, as network structures change and depending on the intended use of the network.

Figure 2.3: An example of a neural network with a single hidden layer


These neural networks are required to construct a mapping from a vector X to a vector Y. The sizes of the input layer X and output layer Y are fixed by the number of neurons they contain. The size of the hidden layer depends on the user's requirements and the purpose for which the network is being used. Since the network maps the input values to the output, the error between the predicted and the observed values should be as small as possible. During the training phase, the network is presented with examples of the type of mapping that is required. These training examples are referred to as training vectors, and they are pairs consisting of the input and the output (Pollard et al., 1992).

There has been interest in the literature in the application of neural networks to the fault diagnosis problem. The neural networks that have been studied can be classified according to:

(i) the architecture of the network, such as sigmoidal, and

(ii) the learning algorithm, in the form of either supervised or unsupervised learning (Venkatasubramanian et al., 2003b).

The standard approach to applying neural networks in fault diagnosis is to classify the process data according to the operation of the process. The classification method uses the individual measurement patterns in the process data and has no information about the direction of changes in the process measurements. The classification of these individual measurement patterns is a very straightforward fault diagnosis method and can be done when sufficient process measurements are available. This classification method is an offline fault diagnosis scheme, in which process data are collected and the faults properly defined. A classifier is then designed and exposed to some test data. After this, the classifier is used on the process (Sorsa & Koivo, 1993).
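As an illustrative sketch of such an offline classification scheme (not from the thesis; it uses scikit-learn's MLPClassifier on synthetic stand-in data, and the network size is a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for labelled process data: rows are measurement
# patterns, labels are fault classes (0 = normal, 1-2 = fault types)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = rng.integers(0, 3, size=600)

# Offline scheme: design the classifier, expose it to test data,
# then deploy it on the process
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```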


2.3.2 Nonlinear PCA with autoassociative neural networks

Nonlinear principal component analysis (NLPCA) is a nonlinear generalization of the standard principal component analysis discussed earlier. It is used on multivariate data and generalizes the principal components from straight lines to curves. The subspace in the original data space described by all the nonlinear components is consequently also curved (Scholz et al., 2007). NLPCA is used to identify and remove correlations found within the problem variables, thereby assisting with the fault diagnosis problem. The main difference between NLPCA and PCA is that both linear and nonlinear correlations are uncovered within the data.

Figure 2.4: Autoassociative neural network architecture (from Scholz et al., 2007)

Nonlinear PCA can be implemented using a neural network (Figure 2.4). NLPCA operates by training a feedforward neural network to perform the identity mapping, in which the network inputs are reproduced at the output layer (Kramer, 1991). In the middle of the network, however, there is a layer that works as a bottleneck, where dimensionality reduction of the data is applied. This bottleneck layer ensures that the network develops a compact representation of the input data, and the features in the data are extracted in this layer.


2.3.3 Kernel PCA

When dealing with neural networks for feature extraction, difficulties are encountered. Some of these arise because one has to predetermine the number of features to be extracted (Lee et al., 2004). An alternative to network-based feature extraction that addresses some of these difficulties is kernel PCA (KPCA). KPCA transforms the original data into a higher-dimensional feature space, in which linear PCA can be applied, after which only the significant components are retained.

The calculation of the kernel principal components is an eigenvalue problem. The number of components retained is determined based on the variance decomposition (Cho et al., 2005). Figure 2.5 illustrates the way kernel principal component analysis is performed.

Figure 2.5: Steps of KPCA projection and reconstruction (Lee et al., 2004; Auret, 2010)

A setback of KPCA is that no explicit demapping function is available to reconstruct the nonlinear principal components to the original input data. Other limitations of the KPCA approach include the computational expense of calculating the required dot products for large sample data sets, as well as the lack of interpretability of the nonlinear components in the original input space (Cho et al., 2005).
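An illustrative sketch of the KPCA procedure is given below; the RBF kernel and its width parameter are assumptions, since the choice of kernel is application dependent and not fixed by the thesis:

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=0.1):
    """Kernel PCA sketch: eigendecomposition of the centred kernel matrix.

    An RBF kernel with width gamma is assumed here for illustration.
    """
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n  # centre in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:n_components]      # leading components
    alphas = eigvecs[:, idx] / np.sqrt(eigvals[idx])    # normalised coefficients
    return Kc @ alphas                                  # scores of the training data
```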

2.3.4 Random forests

A development in statistical learning is the emergence of ensembles of learning machines. "An ensemble is described as a combination of a collection of classifiers in order to enhance the performance of the overall classifier" (Valentini & Masulli, 2002).

It has been shown (Valentini & Masulli, 2002) that such ensembles of classifiers normally perform better than the individual classifiers, even in cases where the base classifiers are considered weak. By constructing an ensemble of classifiers, a more thorough exploration of hypotheses can be accomplished (Valentini & Masulli, 2002).

Random forests are nonlinear regression models that consist of ensembles of regression trees, in which each tree depends on a random vector sampled independently from the process data (Auret & Aldrich, 2010a). The random forest model is an example of an ensemble method, with the base classifiers consisting of unpruned decision tree classifiers (Breiman, 2001). A decision tree is a recursive subspace-partitioning classifier that works to reduce the class impurity of successive subsets (Breiman et al., 1993).

The prevalent use of the random forest algorithm can be attributed to its high accuracy and fast computation (Breiman & Cutler, 2003). Tree ensembles such as random forests further provide added functionality, in that one can interpret variable importance (Breiman & Cutler, 2003) and perform partial dependence analysis (Friedman, 2001). Random forest feature extraction has been applied to unsupervised fault diagnosis for process data and compared with linear and nonlinear methods; the random forest results were comparable to those of the existing techniques (Auret & Aldrich, 2010a; Auret, 2010).

2.3.5 Biplots

Gabriel (1971) introduced the concept of the biplot, which is defined as a graphical display consisting of a vector for each row and a vector for each column of a matrix of rank two. The biplot is a multivariate equivalent of the (univariate) scatter plot. An element of the matrix is represented by the inner product of the vectors corresponding to its row and its column (Gardner et al., 2005). Aldrich et al. (2004) and Gardner et al. (2005) proposed a related statistical process monitoring approach that emphasizes the visualization of process correlations and variations in the process variables by using the biplot. In addition, this biplot approach provides for the automatic detection and visualization of process disturbances by the use of bagplots (Rousseeuw et al., 1999).


CHAPTER 3 RESTRICTED BOLTZMANN MACHINES

This chapter gives the theoretical framework of restricted Boltzmann machines, the network architecture and the training algorithm. The autoencoder, in which the network is pre-trained with restricted Boltzmann machines, is also discussed. A review of the use of RBMs for feature extraction is given in this chapter as well.

3.1 Boltzmann Machines

The Boltzmann machine is a collection of symmetrically connected, neuron-like, stochastic binary units (Figure 3.1). Each unit in the network chooses to be on or off by considering the total input it receives from all the other units.

For any training set of state vectors, the weights and biases in a Boltzmann machine can be adjusted to assign high probability to the vectors in the training data. The units in a Boltzmann machine can be partitioned into two subsets, namely visible and hidden units. The visible units are those units of the network whose states can be observed, while the hidden units are those with unobserved states. The visible neurons provide an interface between the network and the environment in which the network operates (Haykin, 1999).

In this study, the focus is on a special type of Boltzmann machine in which there are no connections within layers (no visible-visible or hidden-hidden connections). This special type is called the restricted Boltzmann machine.


Figure 3.1: The Boltzmann Machine

3.1.1 The Restricted Boltzmann Machine

A restricted Boltzmann machine (RBM) (Sejnowski, 1986) is a two-layer neural network that contains a layer of visible, binary stochastic units connected to a layer of hidden, binary stochastic units, without connections within each layer, i.e. no visible-visible and no hidden-hidden connections, as shown in Figure 3.2. The connections are symmetric, meaning that they have the same weight in both directions (Hinton, 2010).

Figure 3.2: Restricted Boltzmann Machine


A configuration (v, h) of the visible and hidden units has the following energy:

$E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}$ (Eqn. 13)

where $v_i$, $h_j$ are the binary states of visible unit i and hidden unit j, $a_i$, $b_j$ are their respective biases and $w_{ij}$ is the weight between them.

The network then assigns the following probability to every possible pair of visible and hidden vectors:

$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$ (Eqn. 14)

where the partition function Z is given by summing over all possible pairs of visible and hidden vectors:

$Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ (Eqn. 15)

The probability that the network assigns to a visible vector v is given by summing over all possible hidden vectors:

$p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ (Eqn. 16)
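To make these quantities concrete, the sketch below evaluates Eqns 13, 15 and 16 for a toy RBM small enough that the partition function can be enumerated exactly (an illustration only; enumeration is infeasible at realistic sizes, and the parameter values are arbitrary):

```python
import numpy as np
from itertools import product

def energy(v, h, a, b, W):
    """Energy of a joint configuration (v, h) of an RBM -- Eqn 13."""
    return -(a @ v + b @ h + v @ W @ h)

# A toy RBM (3 visible, 2 hidden units) with random parameters
rng = np.random.default_rng(0)
nv, nh = 3, 2
a, b = rng.normal(size=nv), rng.normal(size=nh)
W = rng.normal(scale=0.1, size=(nv, nh))

# Partition function Z by summing over all (v, h) pairs -- Eqn 15
Z = sum(np.exp(-energy(np.array(v), np.array(h), a, b, W))
        for v in product([0, 1], repeat=nv)
        for h in product([0, 1], repeat=nh))

# Marginal probability of one visible vector -- Eqn 16
v0 = np.array([1, 0, 1])
p_v0 = sum(np.exp(-energy(v0, np.array(h), a, b, W))
           for h in product([0, 1], repeat=nh)) / Z
print(f"Z = {Z:.3f}, p(v = {v0}) = {p_v0:.4f}")
```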

3.1.2 Training the RBM

The derivative of the log probability of a training vector (Eqn. 16) with respect to a weight simplifies to

$\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}$ (Eqn. 17)

where $\langle \cdot \rangle_{data}$ is the expected value under the data distribution and $\langle \cdot \rangle_{model}$ is the expected value under the model distribution obtained by Boltzmann sampling.


Since there are no direct connections between the hidden units in a restricted Boltzmann machine, it is easy to obtain an unbiased random sample of $\langle v_i h_j \rangle_{data}$. For a randomly selected training vector v, the binary state $h_j$ of each hidden unit j is set to 1 with probability

$p(h_j = 1 \mid \mathbf{v}) = \sigma\!\left(b_j + \sum_i v_i w_{ij}\right)$ (Eqn. 18)

where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the logistic sigmoid function; $v_i h_j$ is then an unbiased sample (Hinton et al., 2012).

Similarly, since there are no direct connections between the visible units in an RBM, it is also very easy to get an unbiased random sample of the state of a visible unit, given the hidden vector:

$p(v_i = 1 \mid \mathbf{h}) = \sigma\!\left(a_i + \sum_j h_j w_{ij}\right)$ (Eqn. 19)

Getting an unbiased sample of $\langle v_i h_j \rangle_{model}$ is much more difficult and requires adjustments to the training procedure. Sampling from $\langle v_i h_j \rangle_{model}$ requires multiple iterations that alternate between updating all the hidden units and updating all the visible units, with both updates done in parallel. However, learning still works well if $\langle v_i h_j \rangle_{model}$ is replaced by the corresponding $\langle v_i h_j \rangle_{recon}$. A much faster learning procedure, proposed by Hinton (2002), obtains $\langle v_i h_j \rangle_{recon}$ as follows:

a. Starting with a training data vector on the visible units, set the states of the visible units to the training vector, then update all of the hidden units in parallel.

b. Update all of the visible units in parallel to get a "reconstruction".

c. Update all the hidden units again.


From Figure 3.3 it can be seen that if an input vector $\mathbf{x}_i$ is used as a training vector, the hidden units are all updated in parallel. The visible units are then updated again to get a reconstruction, shown by the vector $\mathbf{x}_i'$.


Figure 3.3: Training the Restricted Boltzmann Machine

After all the updates, the change in weight is derived as

$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right)$ (Eqn. 20)

where ε is the learning rate.

This learning procedure approximates gradient descent on a quantity known as the contrastive divergence (CD).
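A compact NumPy sketch of one CD-1 update (Eqns 18-20) on a batch of binary data is shown below. It is an illustration under simplifying assumptions (probabilities rather than samples are used for some statistics, as is common practice), not the exact implementation used in this study:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, a, b, eps=0.05, rng=None):
    """One CD-1 update (Eqns 18-20) for a batch of binary rows V.

    W (nv x nh), a (visible biases) and b (hidden biases) are updated
    in place; the mean squared reconstruction error is returned.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden probabilities given the data (Eqn 18),
    # sampled to binary states to drive the reconstruction
    ph = sigmoid(b + V @ W)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: reconstruct the visible units (Eqn 19) and
    # recompute the hidden probabilities from the reconstruction
    pv = sigmoid(a + h @ W.T)
    ph_recon = sigmoid(b + pv @ W)
    # Contrastive divergence update (Eqn 20)
    W += eps * (V.T @ ph - pv.T @ ph_recon) / len(V)
    a += eps * np.mean(V - pv, axis=0)
    b += eps * np.mean(ph - ph_recon, axis=0)
    return np.mean((V - pv) ** 2)

# Hypothetical usage on random binary data
rng = np.random.default_rng(0)
V = (rng.random((100, 6)) < 0.5).astype(float)
W, a, b = 0.01 * rng.normal(size=(6, 4)), np.zeros(6), np.zeros(4)
for epoch in range(10):
    err = cd1_update(V, W, a, b, rng=rng)
```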

3.2 Stacked Restricted Boltzmann Machines

3.2.1 Dimensionality reduction using auto encoders

A multilayer autoencoder is a feedforward neural network that has more than one hidden layer in the network structure. This network attempts to reconstruct the input data at the output layer of the network (Hinton et al., 1997). The targets at the output layer are normally the same as at the input layer; therefore, the sizes of the input and output layers are the same. Since the hidden layer is smaller than the input data in size, the dimensionality of the original input data is reduced to a smaller dimensional space at this hidden layer (Vishnubhotla et al., 2010). The hidden layer gives a smaller-dimensional representation of the data that preserves as much of the structure of the original data as possible. This ensures that the low-dimensional, nonlinear structure of the data is revealed (Hinton & Salakhutdinov, 2006).

Real-world data, such as speech signals, process data and digital photographs, usually have a high dimensionality. To handle data of this type and nature effectively, their dimensionality needs to be reduced to a level much lower than that of the original data. After this transformation, the reduced representation should have a dimensionality that corresponds to the intrinsic dimensionality of the data. The intrinsic dimensionality of data is the minimum number of parameters required to account for the observed properties of the data (Van der Maaten et al., 2009). PCA is widely used for reducing the dimensionality of process data but, as discussed earlier, its linear nature is a drawback. A neural network with at least one hidden layer can give a nonlinear mapping from input to output layer. However, normal neural networks are usually unable to reduce the dimensionality of training data to the same extent as PCA (Tan & Eswaran, 2008).

High-dimensional data can be converted to a low-dimensional space by training a multilayer network with a small central layer to reconstruct high-dimensional input vectors. Hinton and Salakhutdinov (2006) describe a way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work better at reducing the dimensionality of the training data. This is a nonlinear generalisation of principal component analysis. It uses a multilayer encoder network to transform the high-dimensional training data into a low-dimensional space, and then a similar decoder network to recover the data from the reduced space; see Figure 3.4.

3.2.2 Stacked Autoencoder with RBM pre-training

In training this network, one first starts with a standard one-hidden-layer autoencoder. The weights are trained with the restricted Boltzmann machine. The outputs from this first RBM are used as the inputs for the next encoder. The same training process is repeated, in which the hidden layer is trained and its outputs used as the input for the next network in the stack. This training process is repeated for as many layers as needed, thereby creating a stack of autoencoders.

After the pre-training of multiple layers, the model is unfolded (Figure 3.4) to produce the encoder and decoder networks, which use the same network weights learned during the training. The fine-tuning stage of the network then replaces the stochastic activities by deterministic, real-valued probabilities and uses backpropagation through the whole autoencoder to fine-tune the weights. Such a multilayer autoencoder, a feedforward neural network with more than one hidden layer, uses RBM pre-training for each of the hidden layers.
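Reusing the `sigmoid` and `cd1_update` helpers from the sketch in Section 3.1.2, greedy layer-wise stacking might be sketched as follows (layer sizes and epoch counts are placeholders; the backpropagation fine-tuning stage is not shown):

```python
import numpy as np

def pretrain_stack(X, layer_sizes, epochs=10):
    """Greedy layer-wise pre-training sketch: train an RBM on the data,
    feed its hidden probabilities to the next RBM, and so on."""
    rng = np.random.default_rng(0)
    weights, data = [], X
    for nh in layer_sizes:
        nv = data.shape[1]
        W, a, b = 0.01 * rng.normal(size=(nv, nh)), np.zeros(nv), np.zeros(nh)
        for _ in range(epochs):
            cd1_update(data, W, a, b, rng=rng)
        weights.append((W, a, b))
        data = sigmoid(b + data @ W)   # activations become the next layer's data
    return weights   # later unrolled into encoder/decoder and fine-tuned

# e.g. a 6-4-2 encoder for the toy data above:
# stack = pretrain_stack(V, layer_sizes=[4, 2])
```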


Figure 3.4: Autoencoder with RBM pre-training (Hinton & Salakhutdinov, 2006)

3.2.3 Stacked RBM network architecture

After each RBM has been trained, a new layer is added whose input is the output of the trained RBM. This new layer is trained as a separate RBM using the normal training process. In the greedy training procedure, one layer is added on top of the network at each stage, and only that top layer is trained as an RBM (Hinton, 2007); see Figure 3.5.

Figure 3.5: Stages of the learning of layers of RBMs (Hinton, 2007)


Using the layer-by-layer learning algorithm described above, a stack of RBMs is first learned. After the learning is complete, the stochastic activities of the binary units in each layer are replaced by deterministic, real-valued probabilities, and the autoencoder is then used to initialize a multilayer, nonlinear mapping, as shown in Figure 3.6. This learning is treated as a pre-training stage that captures much of the higher-order structure in the input data. Figure 3.6 shows the greedy training of a stack of RBMs, in which samples from the lower-level RBM are used as the data for training the next RBM. The corresponding deep belief network formed after the learning is shown in Figure 3.7.

Figure 3.6: Learning a stack of RBMs


Figure 3.7: A deep multilayer network

These networks have been applied successfully to, among others, classification problems (Bengio et al., 2007), regression analysis (Salakhutdinov & Hinton, 2008), dimensionality reduction (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2007), modelling textures (Osindero & Hinton, 2008), information retrieval (Krizhevsky & Hinton, 2011), robotics (Hadsell et al., 2008) and natural language processing (Collobert & Weston, 2008). With a few exceptions (Sutskever & Hinton, 2007; Hinton & Brown, 2000), the literature on RBMs is confined to modelling static data. In the next section, a review of some of the applications of Restricted Boltzmann machines is given.

3.3 Review of Applications of Restricted Boltzmann Machines

This section briefly reviews applications of Restricted Boltzmann Machines. As already highlighted, the use of RBMs in feature extraction is important, as this will determine their usefulness in process monitoring.


3.3.1 Reconstruction of images

The reconstruction of face and digital images using autoencoders is discussed in this section. The training of the autoencoder, using Restricted Boltzmann Machines as building blocks, is discussed in section 3.1.1. When dealing with image reconstruction, the first step is to train the autoencoder, which has an input layer, a hidden layer and an output layer.

The sizes of the hidden layers are set as desired for the experiment. The training images are fed into the autoencoder network, which reduces the dimensionality of the training data in the middle hidden layer. During this dimensionality reduction, the training data are represented in a smaller code space, from which the images are then reconstructed. The output of the hidden layer in this network is then used as the input for training the next autoencoder network, and the process is repeated for subsequent networks. The output layer always reconstructs the image presented as input, throughout the training and testing phases.

The experiments were conducted on the ORL (Olivetti Research Laboratory) face data set, which contains 400 images. The training images were rescaled to a size of 37 × 30 using nearest-neighbour interpolation, and the pixel values were normalised to the range 0 to 1. The data set was then divided into two subsets of 200 images each: one containing the first five images and the other the last five images of each person.
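The rescaling and normalisation step can be sketched as follows with Pillow and NumPy; the file path is a hypothetical placeholder for one ORL image.

```python
import numpy as np
from PIL import Image

# Hypothetical path to one ORL face image (originals are 112 x 92 pixels)
img = Image.open("orl_faces/s1/1.pgm").convert("L")

# Rescale to 37 x 30 with nearest-neighbour interpolation
img = img.resize((30, 37), resample=Image.NEAREST)

# Normalise pixel values to the range [0, 1] and flatten to a vector
x = np.asarray(img, dtype=np.float64) / 255.0
x = x.reshape(-1)   # 1110-dimensional input vector (37 * 30)
```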

The network was trained such that the deepest hidden layer had 30 neurons. The deepest hidden layer uses a linear activation function, whereas all the other layers use sigmoid activation functions (Tan & Eswaran, 2010). All the layers in the network were fully connected after the training was completed.


A standard one-hidden-layer stacked autoencoder network is initialised with small random weights and biases ranging from 0 to 0.1. For the architecture in the experiments conducted, the weights and biases were pre-trained using RBMs for 50 epochs, with a total of 230 epochs used for the training. The MSE for the testing phase after 230 epochs was 6.8, which was better than that of an autoencoder without RBM pre-training, which had a reconstruction error of 9.1. In these experiments, autoencoders were thus used successfully to reconstruct images, and those with RBM pre-training outperformed those without, as reflected in the reconstruction errors.

The same approach was considered using the MNIST data set of handwritten digits (Tan & Eswaran, 2010). The training and testing sets were divided in line with most of the other benchmarking experiments carried out by other researchers, in order to facilitate comparisons. Using similar network architectures, the MSE for the autoencoder with RBM pre-training was 1.21, compared to 1.685 for the one without.

This experiment showed that autoencoders with RBM pre-training can be used successfully for image reconstruction, outperforming networks without RBM pre-training. Since the MNIST database is a large data set (with 6000 training images) compared to the ORL data set (with 400 images), the trained autoencoder generalises better, as good convergence is achieved at the end of the training phase. The reconstruction errors for both data sets are shown in Table 1.

Table 1: MSE for the MNIST and ORL data sets for the whole image

Model                                   ORL     MNIST
Autoencoder                             9.1     1.685
Autoencoder with RBM pre-training       6.8     1.210


3.3.2 Using Autoencoders for Mammogram Compression

The application of autoencoders to medical image compression was considered by Tan & Eswaran (2009). Their paper presents the results obtained for medical image compression using autoencoder neural networks. These experiments show that autoencoders can be trained effectively by using image patches instead of the entire image, and still yield results that are comparable to other approaches (Tan & Eswaran, 2009).

The performance of the autoencoder is assessed in terms of the mean squared error (MSE) and the structural similarity (SSIM) index. MSE is one measure of distortion used for images: “The MSE averages the squared intensity differences of compressed and original image pixels” (Cosman et al., 1994). The SSIM index varies between 0 and 1, with 0 being the worst value, representing non-identical images, and 1 representing identical images.
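As an illustration of these two metrics, the snippet below computes the MSE directly with NumPy and the SSIM index with scikit-image; the two arrays are stand-ins for an original and a reconstructed image, both scaled to [0, 1].

```python
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
original = rng.uniform(size=(128, 128))  # stand-in image in [0, 1]
reconstructed = np.clip(original + rng.normal(scale=0.05, size=(128, 128)), 0, 1)

# MSE: average of the squared pixel intensity differences
mse = np.mean((original - reconstructed) ** 2)

# SSIM: 1.0 for identical images, lower for dissimilar ones
ssim_index = structural_similarity(original, reconstructed, data_range=1.0)

print(f"MSE = {mse:.5f}, SSIM = {ssim_index:.3f}")
```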

Experiments were conducted on images from the Digital Database for Screening Mammography (DDSM), a mammogram data set for breast cancer diagnosis. Three categories of mammograms were selected, consisting of 100 patients with normal breasts, 80 patients with breast cancer and 70 patients with benign lesions. The results for the MSE are shown in Table 2. The performance also depends on the size of the hidden layers, since smaller hidden layers decrease the performance, as the reconstruction errors are higher. The autoencoder with RBM pre-training achieved an SSIM index of 0.98, compared to an SSIM index of 0.89 for the one without pre-training.

Table 2: MSE for different network architectures

Network architecture                    MSE
Autoencoder                             0.1206
Autoencoder with RBM pre-training       0.00974


3.3.3 Face Recognition

The face recognition problem is addressed using an autoencoder with RBM pre-training. Recognition with the autoencoder can be implemented in a number of steps (Tan & Eswaran, 2010). As with many of these applications, the first step involves training the autoencoder. After the autoencoder has been trained on the images, feature codes are obtained from the test images.

In the experiments conducted, the feature codes from the deepest hidden layer were extracted for classification. These experiments were based on two data sets, namely the MNIST and ORL face data sets. Table 3 shows the recognition rates that were obtained. From the results, it is evident that the autoencoder with RBM pre-training yielded good results, with recognition rates of 86% on the ORL data set and 93.1% on the MNIST database.
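A minimal sketch of this step, assuming the fine-tuned Keras model from section 3.2.2 is available: the encoder half is split off to produce the deepest-layer feature codes, which are then classified. The k-nearest-neighbour classifier and the X_train/y_train arrays are illustrative assumptions, as the original work does not specify them here.

```python
from tensorflow import keras
from sklearn.neighbors import KNeighborsClassifier

# Split off the encoder half of the fine-tuned autoencoder ('model', with
# len(rbms) encoder layers) so its output is the deepest-layer feature code
encoder = keras.Sequential(model.layers[:len(rbms)])

# Feature codes for the (hypothetical) training and test images
codes_train = encoder.predict(X_train)
codes_test = encoder.predict(X_test)

# Classify the codes; kNN is an illustrative choice of classifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(codes_train, y_train)
print("Recognition rate:", clf.score(codes_test, y_test))
```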

Table 3: Recognition rates (%) of different network architectures

Models                                  ORL     MNIST
Autoencoder                             80.5    92.6
Autoencoder with RBM pre-training       86.0    93.1

3.3.4 Classification & filtering

Collaborative filtering is the process of filtering for patterns (or information) using techniques that involve collaboration among viewpoints, data sources, and so on. It involves very large data sets, which can include, but are not limited to, sensing and monitoring data, financial data, and movie ratings. A widely used approach to collaborative filtering is to assign a low dimensional feature vector to each user and a low dimensional feature vector to each movie, so that the rating that each user assigns to each movie is modelled by the scalar product of the two feature vectors.
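The sketch below illustrates this modelling idea with NumPy: randomly initialised user and movie feature vectors are combined by a scalar product to predict a rating. The dimensions and values are purely illustrative; learning the vectors from observed ratings (for instance by gradient descent or an RBM-based model) is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 100, 50, 10   # k = feature vector dimensionality

# One low dimensional feature vector per user and per movie
user_features = rng.normal(scale=0.1, size=(n_users, k))
movie_features = rng.normal(scale=0.1, size=(n_movies, k))

# Predicted rating of movie j by user i is the scalar product of their vectors
i, j = 3, 7
predicted = user_features[i] @ movie_features[j]

# All predicted ratings at once form the product of the two factor matrices
ratings = user_features @ movie_features.T   # shape (n_users, n_movies)
```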
