Dependability for high-tech systems: an industry-as-laboratory approach*

Ed Brinksma
Embedded Systems Institute, Eindhoven
University of Twente, Enschede
The Netherlands
ed.brinksma@esi.nl

Jozef Hooman
Embedded Systems Institute, Eindhoven
Radboud University Nijmegen
The Netherlands
jozef.hooman@esi.nl

*This work has been carried out as part of the Trader project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Dutch Ministry of Economic Affairs under the Bsik program.

Abstract

The dependability of high-volume embedded systems, such as consumer electronic devices, is threatened by a combination of quickly increasing complexity, decreasing time-to-market, and strong cost constraints. This poses challenging research questions that are investigated in the Trader project, following the industry-as-lab approach. We present the main vision of this project, which is based on a model-based control paradigm, and the current status of the project results.

1. Introduction

High-tech systems by definition are constructed using cutting edge technology, and consequently embedded system technology plays a major and often even decisive role in such systems, whether they are mobile phones, HDTVs, medical equipment, cars, airplanes, etc. The integration of embedded hardware and software into larger systems has lifted the issue of dependability of the embedded components to the level of the embedding high-tech system. At that level the integral system dependability is not only affected by the dependability of its individual components, but is mainly an emergent quality of the interactions between these components and the system environment. Controlling the complexity of these interactions is one of the major challenges of high-tech system design, and the presence of dependability problems in almost all application domains is a well-documented fact of life.

In this paper we report on a model-based approach to system dependability. Although model-based design has already been advocated for a considerable time by (mainly) the academic community as a way forward to address complex embedded systems engineering tasks, the industrial uptake is lagging behind. To ensure the practical relevance of the research, it is being carried out following an industry-as-laboratory approach [5]. This means that concrete cases are studied in their industrial context to promote the applicability and scalability of solution strategies under the relevant practical constraints.

The concrete case that we discuss is based on the collaborative research project Trader of the Embedded Systems Institute (ESI) with NXP and other academic and industrial partners on system dependability for high-volume systems. High-volume systems are characterized by the fact that, because of their production in large quantities, the cost per item should be (very) low. This seriously restricts the possibility to address dependability by classical means, such as over-dimensioning of critical components. One of the main ideas in the project is to use concepts from model-based control to achieve dependability.

The rest of the paper is organized as follows: Sect. 2 contains an outline of the project, Sect. 3 outlines the model-based philosophy of the project, and Sect. 4 reports on the current status of the results. We draw our conclusions and list some future work in Sect. 5.

2. Outline of the Trader Project

In the Trader project, academic and industrial partners collaborate to optimize the dependability of high-volume products, such as consumer electronic devices. The project partners involved are: NXP Semiconductors, NXP Research, ESI, TASS, IMEC (Belgium), Twente University, the Technical University of Delft, the University of Leiden, and the Design Technology Institute (DTI) at the Eindhoven University of Technology. The project started September 2004, with a duration of five years, and includes seven PhD students and two postdocs. The so-called Carrying Industrial Partner (CIP) of this project is NXP Semiconductors, providing the project with a focus on multimedia products. NXP has provided the problem statement and proposes relevant case studies, which in the case of Trader are taken from the TV domain. The problem statement is based on the observation that the combination of increasing complexity of consumer electronic products and decreasing time-to-market will make it extremely difficult to produce totally reliable devices that meet the dependability expectations of customers.

A current high-end TV is already a very complex device which can receive analog and digital input from many possible sources, using many different coding standards. It can be connected to various types of recording devices and includes many features such as picture-in-picture, teletext, sleep timer, child lock, TV ratings, emergency alerts, TV guide, and advanced image processing. Moreover, there is a growing demand for features that are shared with other domains, such as photo browsing, MP3 playing, USB, games, databases, and networking. As a consequence, the amount of software in TVs has seen an exponential increase from 1 KB in 1980 to more than 20 MB in current high-end TVs.

Also the hardware complexity is increasing rapidly, for instance for the support of real-time decoding and processing of high-definition images for large screens and multiple tuners. To meet the hard real-time requirements a TV is designed as a system-on-chip with multiple processors, various types of memory, and dedicated hardware accelerators.

At the same time, there is a strong pressure to decrease time-to-market. To be able to realize products with many new features quickly, components developed by others have to be incorporated. This includes so-called third-party components, e.g., for audio and video standards. Moreover, there is a clear trend towards the use of downloadable components to increase product flexibility and to allow new business opportunities (selling new features, games, etc.).

Given the large number of possible user settings and types of input, exhaustive testing is impossible. Also, the product must be able to tolerate certain faults in the input. Customers expect, for instance, that products can cope with deviations from coding standards or bad image quality.

Although companies invest a lot of attention and effort to avoid faults in released products, it is expected that without additional measures both internal and external faults are serious threats to product dependability. The cost of non-quality, however, is high, because it leads to many returned products, damages brand image, and reduces market share.

The main goal of the Trader project is to improve the user-perceived dependability of high-volume products. The aim is to develop techniques that can compensate for and mask faults in released products, such that they satisfy user expectations. The main challenge is to realize this without increasing development time and, given the domain of high-volume products, with minimal additional hardware costs and without degrading performance. Hence, classical fault-tolerance techniques that rely heavily on additional redundancy and resources (e.g., duplication or even triplication of hardware and software) are not suitable in this domain.

In our presentation the terminology of [1] is adopted. A failure of a system with respect to an external specification is an event that occurs when a state change leads to a run that no longer satisfies the external specification. An error is the part of the system state that may lead to a failure, for instance, a wrong memory value or a wrong message in a queue. A fault is the adjudged or hypothesized cause of an error which is not part of the system state. Examples of faults are programming mistakes (e.g., divide by zero) or unexpected input.

3. Model-Based Approach

Looking at a number of failures of consumer electronic devices, it is often the case that a user can immediately observe that something is wrong, whereas the system itself is completely unaware of the problem. Systems are often realized in a way that corresponds to the open-loop approach in control theory: for a certain input, the required actions are executed, but it is never checked whether these actions have the desired effect on the system and whether the system is still in a healthy state.

The main approach of the Trader project is to "close the loop" and to add a kind of feedback control to products. By monitoring the system and comparing system observations with a model of the desired behaviour at run-time, the system gets a form of run-time awareness which makes it possible to detect that its customer-perceived behaviour is (or is likely to become) erroneous. In addition, the aim is to provide the system with a strategy to correct itself.

The main ingredients of such a run-time awareness and correction approach are depicted in Fig. 1.

Figure 1. Adding awareness at run-time (diagram labels: input, output, system state)

We discuss the main parts, giving examples from the TV domain:

* Observation: observe relevant inputs, outputs and internal system states. For instance, for a TV we may want to observe key presses from the remote control, internal modes of components (dual/single screen, menu, mute/unmute, etc.), load of processors and busses, buffers, function calls to audio/video output, sound level, etc.

* Error detection: detect errors, based on observations of the system and a model of the desired system behaviour.

* Diagnosis: in case of an error, find the most likely cause of the error.

* Recovery: correct erroneous behaviour, based on the diagnosis results and information about the expected impact on the user.

An important part of the approach depicted in Fig. 1 is the use of models at run-time. Note that for complex systems it will be infeasible to include a complete model of desired system behaviour, but the approach allows the use of partial models, concentrating on what is most relevant for the user. Moreover, we can apply this approach hierarchically and incrementally to parts of the system, e.g., to third-party components. Typically, there will be several awareness monitors in a complex system, for different components, different aspects, and different kinds of faults.
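As an illustration only, the four ingredients (observation, error detection, diagnosis, recovery) can be sketched in a few lines of Python. The `MuteModel` partial model, the `FaultyTV` stand-in with its injected fault, and the fixed diagnosis string are hypothetical and are not part of Trader; they only show how a partial model of user-visible behaviour might close the loop around a system.

```python
class MuteModel:
    """Hypothetical partial model: the 'mute' key toggles the sound."""
    def __init__(self):
        self.muted = False

    def step(self, key):
        if key == "mute":
            self.muted = not self.muted
        return 0 if self.muted else 1      # expected sound: 0 = off, 1 = on


class FaultyTV:
    """Hypothetical system under observation with an injected fault."""
    def __init__(self):
        self.muted = False
        self.mute_presses = 0

    def step(self, key):
        if key == "mute":
            self.mute_presses += 1
            if self.mute_presses != 2:     # fault: the second mute press is lost
                self.muted = not self.muted
        return 0 if self.muted else 1      # observed sound

    def reset_audio(self, muted):
        self.muted = muted                 # recovery action: force the audio mode


def awareness_loop(system, model, keys):
    """Run system and model side by side; return the list of corrected errors."""
    log = []
    for key in keys:
        observed = system.step(key)        # observation
        expected = model.step(key)
        if observed != expected:           # error detection
            cause = "audio mode out of sync"   # diagnosis (placeholder)
            system.reset_audio(model.muted)    # recovery
            log.append((key, cause))
    return log


tv, model = FaultyTV(), MuteModel()
log = awareness_loop(tv, model, ["mute", "mute", "vol+", "mute"])
print(log)
```

The model covers only the mute feature, which mirrors the idea of a partial model that concentrates on what the user notices; everything else about the TV is left unmodelled.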

The analogy between self-controlling software and control theory has already been observed in [10]. Garlan et al. [9] have developed an adaptation framework where system monitoring might invoke architectural changes. Using performance monitoring, this framework has been applied to the self-repair of web-based client-server systems. Related work that also takes cost limitations into account can be found in the research on fault-tolerance of large-scale embedded systems [13]. The authors apply the autonomic computing paradigm to systems with many processors to obtain a healing network, also using a kind of controller-plant feedback loop. Related work on adding a control loop to an existing system is described in the middleware approach of [14], where components are coupled via a publish-subscribe mechanism.

4. Current Status of Trader

In this section, we give a brief description of the research activities and the current status of the Trader project. First we discuss the research on the ingredients of the global awareness vision depicted in Fig. 1: observation (Sect. 4.1), modeling system behaviour (Sect. 4.2), error detection (Sect. 4.3), diagnosis (Sect. 4.4), and recovery (Sect. 4.5). Finally, we mention research on user perception and on reliability improvements during development in Sect. 4.6 and Sect. 4.7, respectively.

4.1. Observation

To observe relevant aspects of the system, both hardware and software techniques are investigated. Hardware-related work in Trader currently aims at exploiting mechanisms already available in hardware, such as the on-chip debug and trace infrastructure, to monitor values for range checking, call stacks (functions, parameters, and result values), and memory arbiters. The observation of software behaviour is mainly done by code instrumentation using aspect-oriented techniques, partly based on results from the ESI project Ideals [6, 7]. A specialized aspect-oriented framework called AspectKoala [19] has been developed on top of the component model Koala, which is used at NXP.
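Trader instruments component-based C software; as a loose analogy only, the following Python sketch uses a decorator to show the kind of information such instrumentation can capture without touching the body of the observed function: function name, parameters, result value, and a range check. The `observed` decorator, the `trace` log and the `set_volume` function are hypothetical names and do not reflect the AspectKoala API.

```python
import functools

trace = []  # observation log: (function name, arguments, result, in range?)


def observed(lo=None, hi=None):
    """Log every call of the wrapped function and range-check its result."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            in_range = (lo is None or result >= lo) and (hi is None or result <= hi)
            trace.append((func.__name__, args, result, in_range))
            return result
        return wrapper
    return decorate


@observed(lo=0, hi=100)
def set_volume(current, delta):
    # Deliberately unclamped, so the range check has something to catch.
    return current + delta


set_volume(50, 10)
set_volume(95, 10)
for entry in trace:
    print(entry)
```

The observed function is unchanged apart from the annotation, which is the property that makes this style of instrumentation attractive for third-party and legacy code.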

4.2. Modeling Desired System Behaviour

An important part of the model-based approach described in Sect. 3 is the use of a model of desired system behaviour at run-time. Experience in Trader and other ESI projects indicates that such models are usually not available in industry and that it is often difficult to obtain them. In industrial practice, system requirements are usually distributed over many documents and databases. Hence, part of the ESI research in Trader explicitly addresses the construction of a high-level system model.

Since the TV domain is the main source of case studies in Trader, we have developed a high-level model of a TV from the viewpoint of the user. It captures the relation between user input, via the remote control, and output, via images on the screen and sound. A few first experiments indicated that the use of state machines leads to suitable models for the control behaviour of the TV. However, they also revealed that it is very easy to make modeling errors, for instance, because there are many interactions between features. Examples are relations between dual screen, teletext and various types of on-screen displays that remove or suppress each other.
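To make the notion of feature interaction concrete, here is a minimal Python sketch of a user-view state machine with two features. The interaction rules (switching on teletext removes dual screen and vice versa) and the key names are assumptions chosen for illustration, not the rules of the actual Trader model. The `violations` helper exhaustively tries short key sequences, a very small stand-in for the kind of checking that helps to find modeling errors.

```python
from itertools import product


class TVModel:
    """Toy user-view model: which features are visible after each key press."""
    def __init__(self):
        self.dual_screen = False
        self.teletext = False

    def press(self, key):
        if key == "dual":
            self.dual_screen = not self.dual_screen
            if self.dual_screen:
                self.teletext = False      # assumed rule: dual screen removes teletext
        elif key == "text":
            self.teletext = not self.teletext
            if self.teletext:
                self.dual_screen = False   # assumed rule: teletext removes dual screen
        return self.visible()

    def visible(self):
        flags = (("dual", self.dual_screen), ("teletext", self.teletext))
        return {name for name, on in flags if on}


def violations(max_len=4):
    """Return all key sequences up to max_len that end with both features shown."""
    bad = []
    for n in range(1, max_len + 1):
        for keys in product(("dual", "text"), repeat=n):
            tv = TVModel()
            for key in keys:
                tv.press(key)
            if tv.dual_screen and tv.teletext:
                bad.append(keys)
    return bad


print(violations())
```

Deleting one of the two assumed rules makes `violations()` return counterexamples, which shows how easily a single forgotten interaction slips into such a model.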

To allow quick feedback on the user-perceived behaviour and to increase the confidence in the fidelity of the model, Matlab/Simulink [11] is used to obtain executable models. Stateflow is exploited for the control part, whereas the streaming part of a TV is modeled by means of the Image and Video Processing toolbox of Simulink. External events can be generated by clicking on a picture of a remote control. Output is visualized by means of Matlab's video player and a scope for the volume level. This visualization of the user view on input and output of the model turned out to be very useful to detect modeling errors and undesired feature interactions. In addition, we investigate the possibilities of formal model-checking and test scripts to improve model quality.

4.3. Error Detection

Various techniques for error detection are investigated, such as hardware-based deadlock detection and range checking. An approach which checks the consistency of internal modes of components turned out to be successful in detecting teletext problems due to a loss of synchronization between components [17].

To enable quick experimentation with model-based error detection, we have developed a framework which allows the use of models at run-time. The framework has been implemented on top of Linux, to comply with the trend towards open-source software and the use of Linux in TVs. In the framework, one can include a particular System Under Observation (SUO) and a specification model of the desired system behaviour. The design of the awareness framework is shown in Fig. 2. The SUO and the awareness monitor are separate processes and Unix domain sockets are used for inter-process communication. The SUO has to be adapted slightly, to send messages with relevant input and output events (which may also include internal states) to Input and Output Observers. An executable specification model of the SUO in Stateflow can be included by using the code generation possibilities of Stateflow. The generated C code can be included easily, allowing quick experiments with different models. It is executed using the Model Executor component, based on event notifications from the Input Observer. Information about relevant input and output events is stored in the Configuration component. The Comparator component compares relevant model output with system output which is obtained from the Output Observer. The Controller initiates and controls all components, except for the Configuration component which is controlled by the Model Executor.

Figure 2. Design of the awareness framework (diagram labels: IInputEvent, SUO, process boundary, IControl, IErrorNotify, Comparator)

Experiments with earlier versions of the framework indicated that the Comparator should not be too eager to report errors; small delays in system-internal communication might easily lead to differences during a short time interval. Hence, in the current framework the user of the framework can specify, for each observable value: (1) a threshold for the allowed maximal deviation between specification model and system, and (2) a maximum for the number of consecutive deviations that are allowed before an error will be reported.

Another relevant parameter is the frequency with which time-based comparison takes place. This can be combined with event-based comparison by specifying in the specification model when comparison should take place and when not (e.g., when the system is in an unstable state between certain modes). Observe that we have to make a trade-off between taking more time to avoid false errors and reporting errors fast to allow quick repair.

Related to our approach is the method presented in [16] to wrap COTS components and monitor them using specifications expressed as UML state diagrams. Other related work consists of assertion-based approaches such as run-time verification [4]. For instance, monitor-oriented programming [3] supports run-time monitoring by integrating specifications in the program via logical annotations. In our approach, we aim at minimal adaptation of the software of the system, to be able to deal with third-party software and legacy code. Moreover, we also monitor real-time properties, which are not addressed by the techniques cited above. Closely related in this respect is the MaC-RT system [15], which also detects timeliness violations. The main difference with our approach is the use of a timed version of Linear Temporal Logic to express requirements specifications, whereas we use executable timed state machines to promote industrial acceptance and validation.
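The two tolerance settings of the Comparator described in this section, a deviation threshold and a maximum number of consecutive deviations per observable value, can be sketched as follows. This is a minimal Python illustration under our own assumptions: the class and field names are hypothetical, values are taken to be numeric, and the actual framework is built around generated C code rather than Python.

```python
from dataclasses import dataclass


@dataclass
class Tolerance:
    threshold: float       # allowed maximal deviation between model and system
    max_consecutive: int   # consecutive deviations allowed before reporting


class Comparator:
    def __init__(self, tolerances):
        self.tolerances = tolerances
        self.streak = {name: 0 for name in tolerances}

    def compare(self, name, model_value, system_value):
        """Return True if an error should be reported for this observable."""
        tol = self.tolerances[name]
        if abs(model_value - system_value) > tol.threshold:
            self.streak[name] += 1     # one more deviation in a row
        else:
            self.streak[name] = 0      # values agree again: forget the streak
        return self.streak[name] > tol.max_consecutive


comparator = Comparator({"volume": Tolerance(threshold=2.0, max_consecutive=2)})
samples = [(10, 10), (10, 14), (10, 15), (10, 10), (10, 14), (10, 15), (10, 16)]
reports = [comparator.compare("volume", m, s) for m, s in samples]
print(reports)
```

In this run the first burst of two deviations is tolerated because the values agree again at the fourth sample; only the third consecutive deviation at the end is reported. Raising `max_consecutive` reduces false errors at the price of later detection, which is the trade-off noted above.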

4.4. Diagnosis

The diagnosis techniques developed within Trader are based on so-called program spectra [20]. The approach has already been applied to a few examples in the TV domain. As an illustration, we describe one of the first experiments on TV software in which a teletext error has been injected. First the C code is instrumented to record which blocks are executed. In the example there were 60 000 blocks. Next, for each sequence of key presses, a so-called scenario, it is recorded for each block whether or not it has been executed between two key presses. This leads to a vector, a so-called spectrum, for each block. In our example it turns out that during a scenario of 27 key presses 13 796 blocks were executed. Moreover, based on some error detection mechanism, it is recorded for each key press whether it leads to an error or not. In the example, this leads to an error vector of length 27. Next, the similarity between the error vector and the spectra is computed. Finally, the blocks are ranked according to their similarity. In the particular experiment with the teletext error, the block which contains the fault appeared in first place in the ranking. Also in other case studies the application results of this technique are encouraging.
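The ranking step can be sketched in Python as follows. The text above does not say which similarity measure is used, so the Ochiai coefficient in this sketch is an assumption (other coefficients could be substituted), and the block names and toy spectra are invented for illustration.

```python
import math


def ochiai(spectrum, errors):
    """Ochiai similarity between one block's spectrum and the error vector."""
    pairs = list(zip(spectrum, errors))
    a11 = sum(1 for s, e in pairs if s and e)        # executed, error seen
    a10 = sum(1 for s, e in pairs if s and not e)    # executed, no error
    a01 = sum(1 for s, e in pairs if not s and e)    # not executed, error seen
    denom = math.sqrt((a11 + a01) * (a11 + a10))
    return a11 / denom if denom else 0.0


def rank_blocks(spectra, errors):
    """Order block names from most to least similar to the error vector."""
    scores = {block: ochiai(vec, errors) for block, vec in spectra.items()}
    return sorted(scores, key=lambda block: scores[block], reverse=True)


# Toy scenario of 5 key presses; entry i of a spectrum is 1 if the block
# was executed after key press i, and errors[i] is 1 if an error was seen.
errors = [0, 1, 0, 1, 0]
spectra = {
    "init":            [1, 1, 1, 1, 1],
    "teletext_render": [0, 1, 0, 1, 0],
    "volume":          [1, 0, 0, 0, 1],
    "menu":            [0, 1, 1, 0, 0],
}
ranking = rank_blocks(spectra, errors)
print(ranking)
```

Blocks that run on every key press, such as `init` here, score lower than a block whose executions coincide exactly with the erroneous key presses, which is why the ranking can point at the faulty block even when thousands of blocks were executed.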

4.5. Recovery

Part of the recovery research concentrates on load balancing. Project partner IMEC has demonstrated the possibility to migrate an image processing task from one processor to another, which leads to improved image quality in case of overload situations (e.g., due to intensive error correction on a bad input signal). NXP Research investigates the possibility to make memory arbitration more flexible such that it can be adapted at run-time to deal with problems concerning memory access. At Twente University, a framework for partial recovery has been developed which allows independent recovery of parts of the system, the so-called recoverable units. The framework includes a communication manager, which controls the communication between recoverable units, and a recovery manager, which executes the recovery actions such as killing and restarting units. To realize these concepts, a reusable fault tolerance library has been implemented. A few first experiments in the multimedia domain show that after some refactoring of the system, independent recovery of parts of the system is possible without large overhead.
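The division of labour between the communication manager and the recovery manager can be illustrated with a small Python sketch. All class names and behaviour here are assumptions made for illustration: in particular, holding messages for a unit while it is down and delivering them after the restart is our guess at one reasonable policy, not a description of the Twente fault tolerance library.

```python
from collections import deque


class RecoverableUnit:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.handled = []

    def handle(self, msg):
        if msg == "crash":                 # simulated fault inside the unit
            self.alive = False
            raise RuntimeError(f"{self.name} failed")
        self.handled.append(msg)


class CommunicationManager:
    """Routes messages between units and buffers them for units that are down."""
    def __init__(self):
        self.units = {}
        self.pending = {}

    def register(self, unit):
        self.units[unit.name] = unit
        self.pending.setdefault(unit.name, deque())

    def send(self, target, msg):
        unit = self.units[target]
        if not unit.alive:
            self.pending[target].append(msg)   # hold until the unit is back
            return False
        unit.handle(msg)
        return True

    def flush(self, target):
        queue = self.pending[target]
        while queue:
            self.units[target].handle(queue.popleft())


class RecoveryManager:
    def __init__(self, comm):
        self.comm = comm

    def recover(self, name):
        fresh = RecoverableUnit(name)      # kill and restart, modelled as replacement
        self.comm.units[name] = fresh
        self.comm.flush(name)
        return fresh


comm = CommunicationManager()
audio, teletext = RecoverableUnit("audio"), RecoverableUnit("teletext")
comm.register(audio)
comm.register(teletext)
comm.send("audio", "vol+")
try:
    comm.send("teletext", "crash")
except RuntimeError:
    pass
delivered = comm.send("teletext", "page 100")   # unit is down: message is held
restarted = RecoveryManager(comm).recover("teletext")
print(delivered, restarted.handled, audio.handled)
```

The `audio` unit keeps running and keeps its state while `teletext` is restarted, which is the point of recovering parts of the system independently.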

4.6. User Perception

The user perception of reliability is addressed by project partner DTI. The aim is to capture user-perceived failure severity, to get an indication of the level of user irritation caused by a product failure. By means of controlled experiments with TV users, the impact of characteristics such as product usage, user group, and function importance is investigated. During experiments, it turned out that failure attribution also has a significant impact. For instance, users, when asked, rank both image quality and a motorized swivel, which can be used to turn the TV, as important. Under observation, however, users often turn out to be very tolerant concerning bad image quality (which is attributed to external sources), but get irritated if the swivel does not work correctly.

4.7. Improvements During Development

Part of the Trader research is also related to dependability improvements during development. This includes the use of code analysis to prioritize the warnings of a software inspection tool such as QA-C [2] and reliability analysis at the architectural level [18]. The stress testing approach of TASS artificially takes away shared resources, such as CPU or bus bandwidth, to simulate the occurrence of errors or the addition of an extra resource user. The study of the effect of such overload situations on the system behaviour and its fault-tolerance mechanisms has proven to be very useful in the TV domain. A so-called CPU eater, which consumes CPU cycles at the application level in software, is already included in the current development software and can be activated by system testers.
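The idea of a CPU eater can be sketched in a few lines of Python. This is only an illustration of the principle of burning a configurable share of CPU time at application level; the function name, its parameters and the duty-cycle scheme are assumptions and are unrelated to the TASS implementation.

```python
import time


def eat_cpu(fraction, period=0.01, duration=0.1):
    """Burn roughly `fraction` of one CPU core for `duration` seconds.

    Each period is split into a busy-wait slice and a sleep slice.
    Returns (busy_seconds, cycles) so a tester can check what was consumed.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    busy_target = fraction * period
    deadline = time.perf_counter() + duration
    busy_seconds, cycles = 0.0, 0
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        while time.perf_counter() - start < busy_target:
            pass                               # spin: consume CPU cycles
        busy_seconds += time.perf_counter() - start
        time.sleep(period - busy_target)       # give the rest of the period back
        cycles += 1
    return busy_seconds, cycles


busy, cycles = eat_cpu(fraction=0.5, period=0.01, duration=0.1)
print(f"burned {busy:.3f} s of CPU time in {cycles} cycles")
```

Running the system under test next to such a load, and raising `fraction` step by step, is one simple way to see at which point its behaviour starts to degrade.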

5. Conclusion and Future Work

Although the Trader project still has some time to go, it is already clear that its particular model-based approach to system dependability is very promising. The use of models as system components to give the system a certain capacity to monitor and correct its behaviour implements ideas from feedback control at the level of integrated systems. It constitutes a paradigm switch from the best-effort, open-loop approach that is traditional in software-related design, to the closed-loop control-based approach. The latter is much more suitable for the reality of high-tech systems in which errors are unavoidable emergent features of the system complexity.

The concept of model-based system-level control is also quite flexible, in the sense that one can vary between light-weight models with limited corrective capacities, and more elaborate models with stronger feedback mechanisms. In the high-volume context, the constraint to minimize overhead is a limiting factor. Certainly, much more research will be needed to obtain a more complete understanding of the potential and limitations of this approach in the a priori vast range of different application domains.

The choice for an industry-as-laboratory format for the Trader project has helped a lot in focussing on techniques and approaches that have a high potential for being absorbed by industry. Already now, some of the intermediate results have found their way into industry. We firmly believe in the potential of this research format to achieve a productive combination between real research and innovation.

Future activities in the Trader project will address further development of the awareness framework. Our Linux-based awareness framework has been validated by means of model-to-model experiments. That is, we have compared a [...] SUO. Currently, the framework is used for awareness experiments with the open source media player MPlayer [12], investigating both correctness and performance issues. Next our approach will be applied in the TV domain at NXP, following the industry-as-lab paradigm. An important topic of research concerns the optimal integration of various techniques for observation, error detection, diagnosis, and recovery.

In parallel, the model-based run-time awareness concept is also exploited in the domain of printers/copiers at the company Oce in the context of the ESI project Octopus [8], which started recently.

References

[1] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11-33, 2004.

[2] C. Boogerd and L. Moonen. Prioritizing software inspection results using static profiling. In SCAM '06: Proc. Workshop on Source Code Analysis and Manipulation, pages 149-160. IEEE Computer Society, 2006.

[3] F. Chen, M. D'Amorim, and G. Rosu. A formal monitoring-based framework for software development and analysis. In Proceedings ICFEM 2004, volume 3308 of LNCS, pages 357-372. Springer-Verlag, 2004.

[4] S. Colin and L. Mariani. Run-time verification. In Proceedings Model-Based Testing of Reactive Systems, volume 3472 of LNCS, pages 525-555. Springer-Verlag, 2005.

[5] C. Potts. Software-engineering research revisited. IEEE Software, 19(9):19-28, 1993.

[6] P. Durr, G. Guilesir, L. Bergmans, M. Aksit, and R. van Engelen. Applying AOP in an industrial context: An experience paper. In Proc. Workshop on Best Practices in Applying Aspect-Oriented Software Development. Aspect-Oriented Software Association, 2006.

[7] Embedded Systems Institute. The Ideals project, 2007. http://www.esi.nl/ideals/.

[8] Embedded Systems Institute. The Octopus project, 2007. http://www.esi.nl/octopus/.

[9] D. Garlan, S. Cheng, and B. Schmerl. Increasing system dependability through architecture-based self-repair. In Architecting Dependable Systems, volume 2677 of LNCS, pages

[10] M. M. Kokar, K. Baclawski, and Y. A. Eracar. Control theory-based foundations of self-controlling software. IEEE Intelligent Software, pages 37-45, 1999.

[11] The Mathworks. Matlab/Simulink, 2007. http://www.mathworks.com/.

[12] MPlayer. Open source media player, 2007. http://www.mplayerhq.hu/.

[13] S. Neema, T. Bapty, S. Shetty, and S. Nordstrom. Autonomic fault mitigation in embedded systems. Engineering Applications of Artificial Intelligence, 17:711-725, 2004.

[14] J. Parekh, G. Kaiser, P. Gross, and G. Valetto. Retrofitting autonomic capabilities onto legacy systems. Cluster Computing, 9(2):141-159, 2006.

[15] U. Sammapun, I. Lee, and O. Sokolsky. Checking correctness at runtime using real-time Java. In Proc. 3rd Workshop on Java Technologies for Real-time and Embedded Systems (JTRES'05), 2005.

[16] M. E. Shin and F. Paniagua. Self-management of COTS component-based systems using wrappers. In Computer Software and Applications Conference (COMPSAC 2006), pages 33-36. IEEE Computer Society, 2006.

[17] H. Sozer, C. Hofmann, B. Tekinerdogan, and M. Aksit. Detecting mode inconsistencies in component-based embedded software. In DSN Workshop on Architecting Dependable Systems, 2007.

[18] H. Sozer, B. Tekinerdogan, and M. Aksit. Extending failure modes and effects analysis approach for reliability analysis at the software architecture design level. In Architecting Dependable Systems IV, volume 4615 of LNCS, pages

[19] P. van de Laar and R. Golsteijn. User-controlled reflection on join points. Journal of Software, 2(3):1-8, 2007.

[20] P. Zoeteweij, R. Abreu, R. Golsteijn, and A. van Gemund. Diagnosis of embedded software using program spectra. In Proc. 14th Conference and Workshop on the Engineering of