Dependability for high-tech systems: an industry-as-laboratory approach*
Ed Brinksma
Embedded Systems Institute, Eindhoven
University of Twente, Enschede
The Netherlands
ed.brinksma@esi.nl
Jozef Hooman
Embedded Systems Institute, Eindhoven
Radboud University Nijmegen
The Netherlands
jozef.hooman@esi.nl
*This work has been carried out as part of the Trader project under the responsibility of the Embedded Systems Institute. This project is partially supported by the Dutch Ministry of Economic Affairs under the Bsik program.
Abstract
The dependability of high-volume embedded systems, such as consumer electronic devices, is threatened by a combination of quickly increasing complexity, decreasing time-to-market, and strong cost constraints. This poses challenging research questions that are investigated in the Trader project, following the industry-as-lab approach. We present the main vision of this project, which is based on a model-based control paradigm, and the current status of the project results.
1. Introduction
High-tech systems by definition are constructed using cutting-edge technology, and consequently embedded systems technology plays a major and often even decisive role in such systems, whether they are mobile phones, HDTVs, medical equipment, cars, airplanes, etc. The integration of embedded hardware and software into larger systems has lifted the issue of dependability of the embedded components to the level of the embedding high-tech system. At that level the integral system dependability is not only affected by the dependability of its individual components, but is mainly an emergent quality of the interactions between these components and the system environment. Controlling the complexity of these interactions is one of the major challenges of high-tech system design, and the presence of dependability problems in almost all application domains is a well-documented fact of life.
In this paper we report on a model-based approach to system dependability. Although model-based design has already been advocated for a considerable time by (mainly) the academic community as a way forward to address complex embedded systems engineering tasks, the industrial uptake is lagging behind. To ensure the practical relevance of the research, it is being carried out following an industry-as-laboratory approach [5]. This means that concrete cases are studied in their industrial context to promote the applicability and scalability of solution strategies under the relevant practical constraints.
The concrete case that we discuss is based on the collaborative research project Trader of the Embedded Systems Institute (ESI) with NXP and other academic and industrial partners on system dependability for high-volume systems. High-volume systems are characterized by the fact that, because of their production in large quantities, the cost per item should be (very) low. This seriously restricts the possibility to address dependability by classical means, such as over-dimensioning of critical components. One of the main ideas in the project is to use concepts from model-based control to achieve dependability.
The rest of the paper is organized as follows: Sect. 2 contains an outline of the project, Sect. 3 outlines the model-based philosophy of the project, and Sect. 4 reports on the current status of the results. We draw our conclusions and list some future work in Sect. 5.
2. Outline of the Trader Project
In the Trader project, academic and industrial partners collaborate to optimize the dependability of high-volume products, such as consumer electronic devices. The project partners involved are: NXP Semiconductors, NXP Research, ESI, TASS, IMEC (Belgium), Twente University, the Technical University of Delft, the University of Leiden, and the Design Technology Institute (DTI) at the Eindhoven University of Technology. The project started in September 2004, with a duration of five years, and includes seven PhD students and two postdocs. The so-called Carrying Industrial Partner (CIP) of this project is NXP Semiconductors, providing the project with a focus on multimedia products.
NXP has provided the problem statement and proposes relevant case studies, which in the case of Trader are taken from the TV domain. The problem statement is based on the observation that the combination of increasing complexity of consumer electronic products and decreasing time-to-market will make it extremely difficult to produce totally reliable devices that meet the dependability expectations of customers.
A current high-end TV is already a very complex device which can receive analog and digital input from many possible sources, using many different coding standards. It can be connected to various types of recording devices and includes many features such as picture-in-picture, teletext, sleep timer, child lock, TV ratings, emergency alerts, TV guide, and advanced image processing. Moreover, there is a growing demand for features that are shared with other domains, such as photo browsing, MP3 playing, USB, games, databases, and networking. As a consequence, the amount of software in TVs has seen an exponential increase from 1 KB in 1980 to more than 20 MB in current high-end TVs.
The hardware complexity is also increasing rapidly, for instance to support real-time decoding and processing of high-definition images for large screens and multiple tuners. To meet the hard real-time requirements, a TV is designed as a system-on-chip with multiple processors, various types of memory, and dedicated hardware accelerators.
At the same time, there is strong pressure to decrease time-to-market. To be able to realize products with many new features quickly, components developed by others have to be incorporated. This includes so-called third-party components, e.g., for audio and video standards. Moreover, there is a clear trend towards the use of downloadable components to increase product flexibility and to allow new business opportunities (selling new features, games, etc.).
Given the large number of possible user settings and types of input, exhaustive testing is impossible. Also, the product must be able to tolerate certain faults in the input. Customers expect, for instance, that products can cope with deviations from coding standards or bad image quality.
Although companies invest a lot of attention and effort to avoid faults in released products, it is expected that without additional measures both internal and external faults are serious threats to product dependability. The cost of non-quality, however, is high, because it leads to many returned products, it damages brand image, and reduces market share.
The main goal of the Trader project is to improve the user-perceived dependability of high-volume products. The aim is to develop techniques that can compensate for and mask faults in released products, such that they satisfy user expectations. The main challenge is to realize this without increasing development time and, given the domain of high-volume products, with minimal additional hardware costs and without degrading performance. Hence, classical fault-tolerance techniques that rely heavily on additional redundancy and resources (e.g., duplication or even triplication of hardware and software) are not suitable in this domain.
In our presentation the terminology of [1] is adopted. A failure of a system with respect to an external specification is an event that occurs when a state change leads to a run that no longer satisfies the external specification. An error is the part of the system state that may lead to a failure, for instance a wrong memory value or a wrong message in a queue. A fault is the adjudged or hypothesized cause of an error which is not part of the system state. Examples of faults are programming mistakes (e.g., divide by zero) or unexpected input.
3. Model-Based Approach
Looking at a number of failures of consumer electronic devices, it is often the case that a user can immediately observe that something is wrong, whereas the system itself is completely unaware of the problem. Systems are often realized in a way that corresponds to the open-loop approach in control theory: for a certain input, the required actions are executed, but it is never checked whether these actions have the desired effect on the system and whether the system is still in a healthy state.
The main approach of the Trader project is to "close the loop" and to add a kind of feedback control to products. By monitoring the system and comparing system observations with a model of the desired behaviour at run-time, the system gets a form of run-time awareness which makes it possible to detect that its customer-perceived behaviour is (or is likely to become) erroneous. In addition, the aim is to provide the system with a strategy to correct itself.
The main ingredients of such a run-time awareness and correction approach are depicted in Fig. 1.
Figure 1. Adding awareness at run-time
We discuss the main parts, giving examples from the TV domain:
* Observation: observe relevant inputs, outputs and internal system states. For instance, for a TV we may want to observe key presses from the remote control, internal modes of components (dual/single screen, menu, mute/unmute, etc.), load of processors and busses, buffers, function calls to audio/video output, sound level, etc.
* Error detection: detect errors, based on observations of the system and a model of the desired system behaviour.
* Diagnosis: in case of an error, find the most likely cause of the error.
* Recovery: correct erroneous behaviour, based on the diagnosis results and information about the expected impact on the user.
An important part of the approach depicted in Fig. 1 is the use of models at run-time. Note that for complex systems it will be infeasible to include a complete model of desired system behaviour, but the approach allows the use of partial models, concentrating on what is most relevant for the user. Moreover, we can apply this approach hierarchically and incrementally to parts of the system, e.g., to third-party components. Typically, there will be several awareness monitors in a complex system, for different components, different aspects, and different kinds of faults.
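To make the four ingredients concrete, the following minimal Python sketch wires observation, error detection, diagnosis and recovery around a partial model. It is an illustration under stated assumptions, not code from the Trader project: the names `VolumeModel`, `AwarenessMonitor` and `FaultyTV`, the key names, and the naive diagnosis rule are all hypothetical. The model covers only the mute/volume behaviour of a TV.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VolumeModel:
    """Hypothetical partial model: only the mute/volume behaviour of a TV."""
    muted: bool = False
    volume: int = 10

    def step(self, key: str) -> None:
        """Update the expected state for one remote-control key press."""
        if key == "mute":
            self.muted = not self.muted
        elif key == "vol+":
            self.volume = min(self.volume + 1, 100)
        elif key == "vol-":
            self.volume = max(self.volume - 1, 0)

    def expected_sound_level(self) -> int:
        return 0 if self.muted else self.volume


@dataclass
class AwarenessMonitor:
    model: VolumeModel
    recover: Callable[[str, int], None]
    log: List[str] = field(default_factory=list)

    def on_key(self, key: str, observed_level: int) -> Optional[str]:
        # Observation: the key press and the observed sound level are inputs.
        self.model.step(key)
        expected = self.model.expected_sound_level()
        # Error detection: compare the observation with the model.
        if observed_level == expected:
            return None
        # Diagnosis: a deliberately naive guess at the most likely cause.
        if 0 in (expected, observed_level):
            cause = "mute state out of sync"
        else:
            cause = "volume out of sync"
        self.log.append(cause)
        # Recovery: ask the system to restore the expected output.
        self.recover(cause, expected)
        return cause


class FaultyTV:
    """Toy system under observation that drops 'mute' keys while busy."""

    def __init__(self) -> None:
        self.muted = False
        self.volume = 10
        self.busy = False

    def press(self, key: str) -> int:
        if key == "mute" and not self.busy:
            self.muted = not self.muted
        elif key == "vol+":
            self.volume = min(self.volume + 1, 100)
        elif key == "vol-":
            self.volume = max(self.volume - 1, 0)
        return self.sound_level()

    def sound_level(self) -> int:
        return 0 if self.muted else self.volume

    def force_level(self, cause: str, expected: int) -> None:
        """Recovery action: bring the audible output back in line."""
        self.muted = expected == 0
        if expected > 0:
            self.volume = expected


tv = FaultyTV()
monitor = AwarenessMonitor(VolumeModel(), tv.force_level)
first = monitor.on_key("vol+", tv.press("vol+"))   # system and model agree
tv.busy = True                                     # inject the fault
second = monitor.on_key("mute", tv.press("mute"))  # mute is silently dropped
```

In this toy run the first key press is consistent with the model, while the dropped mute key is detected, attributed to the mute state, and repaired by forcing the expected output.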
The analogy between self-controlling software and control theory has already been observed in [10]. Garlan et al. [9] have developed an adaptation framework where system monitoring might invoke architectural changes. Using performance monitoring, this framework has been applied to the self-repair of web-based client-server systems. Related work that also takes cost limitations into account can be found in the research on fault-tolerance of large-scale embedded systems [13]. The authors apply the autonomic computing paradigm to systems with many processors to obtain a healing network, also using a kind of controller-plant feedback loop. Related work on adding a control loop to an existing system is described in the middleware approach of [14], where components are coupled via a publish-subscribe mechanism.
4. Current Status of Trader
In this section, we give a brief description of the research activities and the current status of the Trader project. First we discuss the research on the ingredients of the global awareness vision depicted in Fig. 1: observation (Sect. 4.1), modeling system behaviour (Sect. 4.2), error detection (Sect. 4.3), diagnosis (Sect. 4.4), and recovery (Sect. 4.5). Finally, we mention research on user perception (Sect. 4.6) and on reliability improvements during development (Sect. 4.7).
4.1. Observation
To observe relevant aspects of the system, both hardware and software techniques are investigated. Hardware-related work in Trader currently aims at exploiting mechanisms already available in hardware, such as the on-chip debug and trace infrastructure, to monitor values for range checking, call stacks (functions, parameters, and result values), and memory arbiters. The observation of software behaviour is mainly done by code instrumentation using aspect-oriented techniques, partly based on results from the ESI project Ideals [6,7]. A specialized aspect-oriented framework called AspectKoala [19] has been developed on top of the component model Koala, which is used at NXP.
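The idea of observing function calls, parameters and results without touching the function bodies can be sketched with a Python decorator. This is only an analogy to aspect-oriented instrumentation: it does not show the AspectKoala or Koala interfaces, and `observed`, `TRACE` and `set_volume` are hypothetical names introduced for the example.

```python
import functools
from typing import Any, Callable, List, Tuple

# Hypothetical in-memory trace; a real observer would forward these
# events to an awareness monitor instead of storing them in a list.
TRACE: List[Tuple[str, tuple, Any]] = []


def observed(func: Callable) -> Callable:
    """Record name, positional arguments and result of every call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        TRACE.append((func.__name__, args, result))
        return result
    return wrapper


@observed
def set_volume(level: int) -> int:
    """Toy stand-in for a function in the audio output path."""
    return max(0, min(level, 100))


set_volume(120)
set_volume(30)
```

The observation logic lives entirely in the wrapper, which mirrors the goal of keeping adaptations of the observed software small.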
4.2. Modeling Desired System Behaviour
An important part of the model-based approach described in Sect. 3 is the use of a model of desired system behaviour at run-time. Experience in Trader and other ESI projects indicates that such models are usually not available in industry and that it is often difficult to obtain them. In industrial practice, system requirements are usually distributed over many documents and databases. Hence, part of the ESI research in Trader explicitly addresses the construction of a high-level system model.
Since the TV domain is the main source of case studies in Trader, we have developed a high-level model of a TV from the viewpoint of the user. It captures the relation between user input, via the remote control, and output, via images on the screen and sound. A few first experiments indicated that the use of state machines leads to suitable models for the control behaviour of the TV. But they also revealed that it was very easy to make modeling errors, for instance because there are many interactions between features. Examples are relations between dual screen, teletext and various types of on-screen displays that remove or suppress each other.
To allow quick feedback on the user-perceived behaviour and to increase the confidence in the fidelity of the model, Matlab/Simulink [11] is used to obtain executable models. Stateflow is exploited for the control part, whereas the streaming part of a TV is modeled by means of the Image and Video Processing toolbox of Simulink. External events can be generated by clicking on a picture of a remote control. Output is visualized by means of Matlab's video player and a scope for the volume level. This visualization of the user view on input and output of the model turned out to be very useful to detect modeling errors and undesired feature interactions. In addition, we investigate the possibilities of formal model-checking and test scripts to improve model quality.
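A small executable state machine shows how quickly feature interactions appear in a user-view model. The sketch below is plain Python rather than Stateflow, and its interaction rules (teletext and dual screen remove each other; other keys are ignored while the menu is shown) are invented for illustration and are not taken from the Trader TV model. The brute-force `find_violation` check stands in, very loosely, for the test scripts and model checking mentioned above.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Sequence, Tuple

KEYS = ("menu", "teletext", "dual")


@dataclass
class TvModel:
    """User-view model of what is visible on screen after each key press."""
    dual_screen: bool = False
    teletext: bool = False
    menu: bool = False

    def press(self, key: str) -> None:
        if key == "menu":
            self.menu = not self.menu
        elif self.menu:
            return  # assumed rule: the menu suppresses all other keys
        elif key == "teletext":
            self.teletext = not self.teletext
            if self.teletext:
                self.dual_screen = False  # assumed rule: teletext removes dual screen
        elif key == "dual":
            self.dual_screen = not self.dual_screen
            if self.dual_screen:
                self.teletext = False  # assumed rule: dual screen removes teletext

    def visible(self) -> str:
        if self.menu:
            return "menu"
        if self.teletext:
            return "teletext"
        if self.dual_screen:
            return "dual screen"
        return "single screen"


def run(keys: Sequence[str]) -> List[str]:
    """Return the visible screen content after each key press."""
    tv = TvModel()
    shown = []
    for key in keys:
        tv.press(key)
        shown.append(tv.visible())
    return shown


def find_violation(max_len: int) -> Optional[Tuple[str, ...]]:
    """Search all key sequences up to max_len for teletext and dual screen both on."""
    for length in range(1, max_len + 1):
        for keys in product(KEYS, repeat=length):
            tv = TvModel()
            for key in keys:
                tv.press(key)
            if tv.teletext and tv.dual_screen:
                return keys
    return None


trace = run(["dual", "teletext", "teletext", "menu", "dual", "menu"])
```

In this toy model, switching teletext on and off again leaves the user in single screen rather than back in dual screen, which is the sort of interaction that is easy to overlook when writing the model.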
4.3. Error Detection
Various techniques for error detection are investigated, such as hardware-based deadlock detection and range checking. An approach which checks the consistency of internal modes of components turned out to be successful in detecting teletext problems due to a loss of synchronization between components [17].
To enable quick experimentation with model-based error detection, we have developed a framework which allows the use of models at run-time. The framework has been implemented on top of Linux, to comply with the trend towards open-source software and the use of Linux in TVs. In the framework, one can include a particular System Under Observation (SUO) and a specification model of the desired system behaviour. The design of the awareness framework is shown in Fig. 2. The SUO and the awareness monitor are separate processes, and Unix domain sockets are used for inter-process communication. The SUO has to be adapted slightly, to send messages with relevant input and output events (which may also include internal states) to the Input and Output Observers. An executable specification model of the SUO in Stateflow can be included by using the code generation possibilities of Stateflow. The generated C code can be included easily, allowing quick experiments with different models. It is executed using the Model Executor component, based on event notifications from the Input Observer. Information about relevant input and output events is stored in the Configuration component. The Comparator component compares relevant model output with system output, which is obtained from the Output Observer. The Controller initiates and controls all components, except for the Configuration component, which is controlled by the Model Executor.
Figure 2. Design of the awareness framework
Experiments with earlier versions of the framework indicated that the Comparator should not be too eager to report errors; small delays in system-internal communication might easily lead to differences during a short time interval. Hence, in the current framework the user of the framework can specify, for each observable value: (1) a threshold for the allowed maximal deviation between specification model and system, and (2) a maximum for the number of consecutive deviations that are allowed before an error will be reported.
Another relevant parameter is the frequency with which time-based comparison takes place. This can be combined with event-based comparison by specifying in the specification model when comparison should take place and when not (e.g., when the system is in an unstable state between certain modes). Observe that we have to make a trade-off between taking more time to avoid false errors and reporting errors fast to allow quick repair.
Related to our approach is a method to wrap COTS components and monitor them using specifications expressed as UML state diagrams, presented in [16]. Other related work consists of assertion-based approaches such as run-time verification [4]. For instance, monitor-oriented programming [3] supports run-time monitoring by integrating specifications in the program via logical annotations. In our approach, we aim at minimal adaptation of the software of the system, to be able to deal with third-party software and legacy code. Moreover, we also monitor real-time properties, which are not addressed by the techniques cited above. Closely related in this respect is the MaC-RT system [15], which also detects timeliness violations. The main difference with our approach is the use of a timed version of Linear Temporal Logic to express requirements specifications, whereas we use executable timed state machines to promote industrial acceptance and validation.
4.4. Diagnosis
The diagnosis techniques developed within Trader are based on so-called program spectra [20]. The approach has already been applied to a few examples in the TV domain. As an illustration, we describe one of the first experiments on TV software in which a teletext error has been injected. First the C code is instrumented to record which blocks are executed. In the example there were 60 000 blocks. Next, for each sequence of key presses, a so-called scenario, it is recorded for each block whether it has been executed or not between two key presses. This leads to a vector, a so-called spectrum, for each block. In our example it turns out that during a scenario of 27 key presses 13 796 blocks were executed. Moreover, based on some error detection mechanism, it is recorded for each key press whether it leads to an error or not. In the example, this leads to an error vector of length 27. Next, the similarity between the error vector and the spectra is computed. Finally, the blocks are ranked according to their similarity. In the particular experiment with the teletext error, the block which contains the fault appeared in first place in the ranking. In other case studies, too, the results of applying this technique are encouraging.
4.5. Recovery
Part of the recovery research concentrates on load balancing. Project partner IMEC has demonstrated the possibility to migrate an image processing task from one processor to another, which leads to improved image quality in case of overload situations (e.g., due to intensive error correction on a bad input signal). NXP Research investigates the possibility to make memory arbitration more flexible such that it can be adapted at run-time to deal with problems concerning memory access. At Twente University, a framework for partial recovery has been developed which allows independent recovery of parts of the system, the so-called recoverable units. The framework includes a communication manager, which controls the communication between recoverable units, and a recovery manager, which executes the recovery actions such as killing and restarting units. To realize these concepts, a reusable fault tolerance library has been implemented. A few first experiments in the multimedia domain show that after some refactoring of the system, independent recovery of parts of the system is possible without large overhead.
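The division of labour between recoverable units, a communication manager and a recovery manager can be illustrated with a toy Python sketch. It does not reproduce the fault tolerance library mentioned above. The class interfaces are hypothetical, and queueing messages for a unit that is down is an assumption made for the example.

```python
from typing import Dict, List


class RecoverableUnit:
    """Toy unit whose volatile state is lost when it is killed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.starts = 1
        self.received: List[str] = []

    def kill(self) -> None:
        self.alive = False
        self.received = []

    def start(self) -> None:
        self.alive = True
        self.starts += 1


class CommunicationManager:
    """Routes messages to units and holds them back while a unit is down."""

    def __init__(self) -> None:
        self.units: Dict[str, RecoverableUnit] = {}
        self.pending: Dict[str, List[str]] = {}

    def register(self, unit: RecoverableUnit) -> None:
        self.units[unit.name] = unit
        self.pending[unit.name] = []

    def send(self, target: str, message: str) -> bool:
        unit = self.units[target]
        if not unit.alive:
            self.pending[target].append(message)  # assumed policy: queue, do not drop
            return False
        unit.received.append(message)
        return True

    def flush(self, target: str) -> None:
        queued, self.pending[target] = self.pending[target], []
        for message in queued:
            self.send(target, message)


class RecoveryManager:
    """Executes recovery actions on a single unit: kill, restart, redeliver."""

    def __init__(self, comm: CommunicationManager) -> None:
        self.comm = comm

    def recover(self, name: str) -> None:
        unit = self.comm.units[name]
        unit.kill()
        unit.start()
        self.comm.flush(name)


comm = CommunicationManager()
audio, teletext = RecoverableUnit("audio"), RecoverableUnit("teletext")
comm.register(audio)
comm.register(teletext)

comm.send("audio", "set volume 12")
teletext.alive = False                             # simulate a crash of one unit
delivered = comm.send("teletext", "show page 100")
RecoveryManager(comm).recover("teletext")
```

Only the teletext unit is restarted; the audio unit keeps its state, which is the point of recovering units independently.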
4.6. User Perception
The user perception of reliability is addressed by project partner DTI. The aim is to capture user-perceived failure severity, to get an indication of the level of user irritation caused by a product failure. By means of controlled experiments with TV users, the impact of characteristics such as product usage, user group, and function importance is investigated. During experiments, it turned out that failure attribution also has a significant impact. For instance, users, when asked, rank both image quality and a motorized swivel, which can be used to turn the TV, as important. Under observation, however, users often turn out to be very tolerant concerning bad image quality (which is attributed to external sources), but get irritated if the swivel does not work correctly.
4.7. Improvements During Development
Part of the Trader research is also related to dependability improvements during development. This includes the use of code analysis to prioritize the warnings of a software inspection tool such as QA-C [2] and reliability analysis at the architectural level [18]. The stress testing approach of TASS artificially takes away shared resources, such as CPU or bus bandwidth, to simulate the occurrence of errors or the addition of an extra resource user. The study of the effect of such overload situations on the system behaviour and its fault-tolerance mechanisms has proven to be very useful in the TV domain. A so-called CPU eater, which consumes CPU cycles at the application level in software, is already included in the current development software and can be activated by system testers.
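The principle of a CPU eater can be sketched as a duty-cycled busy loop. The Python function below is an illustration only and says nothing about how the TASS tool is built. The name `eat_cpu` and its parameters are hypothetical, and the load it actually produces depends on the scheduler and on how many cores are available.

```python
import time


def eat_cpu(fraction: float, duration: float, period: float = 0.01) -> float:
    """Busy-wait for about `fraction` of every `period` during `duration` seconds.

    Returns the total time spent in the busy loop.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    busy_total = 0.0
    end = time.perf_counter() + duration
    while time.perf_counter() < end:
        start = time.perf_counter()
        busy_until = start + fraction * period
        while time.perf_counter() < busy_until:
            pass  # burn CPU cycles that the application would otherwise get
        busy_total += time.perf_counter() - start
        time.sleep((1.0 - fraction) * period)  # give the CPU back for the rest
    return busy_total


busy_seconds = eat_cpu(fraction=0.5, duration=0.05)
```

Running such a loop alongside the system under test takes away a share of the CPU, so a tester can watch how the system and its fault-tolerance mechanisms behave under overload.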
5. Conclusion and Future Work
Although the Trader project still has some time to go, it is already clear that its particular model-based approach to system dependability is very promising. The use of models as system components, to give the system a certain capacity to monitor and correct its behaviour, implements ideas from feedback control at the level of integrated systems. It constitutes a paradigm shift from the best-effort, open-loop approach that is traditional in software-related design to the closed-loop, control-based approach. The latter is much more suitable for the reality of high-tech systems, in which errors are unavoidable emergent features of the system complexity.
The concept of model-based system-level control is also quite flexible, in the sense that one can vary between lightweight models with limited corrective capacities and more elaborate models with stronger feedback mechanisms. In the high-volume context, the constraint to minimize overhead is a limiting factor. Certainly, much more research will be needed to obtain a more complete understanding of the potential and limitations of this approach in the a priori vast range of different application domains.
The choice for an industry-as-laboratory format for the Trader project has helped a lot in focussing on techniques and approaches that have a high potential for being absorbed by industry. Already now, some of the intermediate results have found their way into industry. We firmly believe in the potential of this research format to achieve a productive combination of real research and innovation.
Future activities in the Trader project will address further development of the awareness framework. Our Linux-based awareness framework has been validated by means of model-to-model experiments. That is, we have compared a [...] SUO. Currently, the framework is used for awareness experiments with the open source media player MPlayer [12], investigating both correctness and performance issues. Next, our approach will be applied in the TV domain at NXP, following the industry-as-lab paradigm. An important topic of research concerns the optimal integration of various techniques for observation, error detection, diagnosis, and recovery.
In parallel, the model-based run-time awareness concept is also exploited in the domain of printer/copiers at the company Océ in the context of the ESI project Octopus [8], which started recently.
References
[1] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1):11-33, 2004.
[2] C. Boogerd and L. Moonen. Prioritizing software inspection results using static profiling. In SCAM '06: Proc. Workshop on Source Code Analysis and Manipulation, pages 149-160. IEEE Computer Society, 2006.
[3] F. Chen, M. D'Amorim, and G. Rosu. A formal monitoring-based framework for software development and analysis. In Proceedings ICFEM 2004, volume 3308 of LNCS, pages 357-372. Springer-Verlag, 2004.
[4] S. Colin and L. Mariani. Run-time verification. In Proceedings Model-Based Testing of Reactive Systems, volume 3472 of LNCS, pages 525-555. Springer-Verlag, 2005.
[5] C. Potts. Software-engineering research revisited. IEEE Software, 19(9):19-28, 1993.
[6] P. Durr, G. Guilesir, L. Bergmans, M. Aksit, and R. van Engelen. Applying AOP in an industrial context: An experience paper. In Proc. Workshop on Best Practices in Applying Aspect-Oriented Software Development. Aspect-Oriented Software Association, 2006.
[7] Embedded Systems Institute. The Ideals project, 2007. http://www.esi.nl/ideals/.
[8] Embedded Systems Institute. The Octopus project, 2007. http://www.esi.nl/octopus/.
[9] D. Garlan, S. Cheng, and B. Schmerl. Increasing system dependability through architecture-based self-repair. In Architecting Dependable Systems, volume 2677 of LNCS, pages 61-89. Springer-Verlag, 2003.
[10] M. M. Kokar, K. Baclawski, and Y. A. Eracar. Control theory-based foundations of self-controlling software. IEEE Intelligent Software, pages 37-45, 1999.
[11] The Mathworks. Matlab/Simulink, 2007. http://www.mathworks.com/.
[12] MPlayer. Open source media player, 2007. http://www.mplayerhq.hu/.
[13] S. Neema, T. Bapty, S. Shetty, and S. Nordstrom. Autonomic fault mitigation in embedded systems. Engineering Applications of Artificial Intelligence, 17:711-725, 2004.
[14] J. Parekh, G. Kaiser, P. Gross, and G. Valetto. Retrofitting autonomic capabilities onto legacy systems. Cluster Computing, 9(2):141-159, 2006.
[15] U. Sammapun, I. Lee, and O. Sokolsky. Checking correctness at runtime using real-time Java. In Proc. 3rd Workshop on Java Technologies for Real-time and Embedded Systems (JTRES'05), 2005.
[16] M. E. Shin and F. Paniagua. Self-management of COTS component-based systems using wrappers. In Computer Software and Applications Conference (COMPSAC 2006), pages 33-36. IEEE Computer Society, 2006.
[17] H. Sozer, C. Hofmann, B. Tekinerdogan, and M. Aksit. Detecting mode inconsistencies in component-based embedded software. In DSN Workshop on Architecting Dependable Systems, 2007.
[18] H. Sozer, B. Tekinerdogan, and M. Aksit. Extending failure modes and effects analysis approach for reliability analysis at the software architecture design level. In Architecting Dependable Systems IV, volume 4615 of LNCS, pages 409-433. Springer-Verlag, 2007.
[19] P. van de Laar and R. Golsteijn. User-controlled reflection on join points. Journal of Software, 2(3):1-8, 2007.
[20] P. Zoeteweij, R. Abreu, R. Golsteijn, and A. van Gemund. Diagnosis of embedded software using program spectra. In Proc. 14th Conference and Workshop on the Engineering of