Empirical research methods for technology validation: Scaling up to practice



The Journal of Systems and Software

journal homepage: www.elsevier.com/locate/jss


Roel Wieringa ∗

Department of Electrical Engineering, Mathematics, and Computer Science, University of Twente, The Netherlands¹

article info

Article history:
Received 30 November 2012
Received in revised form 11 November 2013
Accepted 16 November 2013
Available online xxx

Keywords:
Empirical research methodology
Technology validation
Scaling up to practice

abstract

Before technology is transferred to the market, it must be validated empirically by simulating future practical use of the technology. Technology prototypes are first investigated in simplified contexts, and these simulations are scaled up to conditions of practice step by step as more becomes known about the technology. This paper discusses empirical research methods for scaling up new requirements engineering (RE) technology.

When scaling up to practice, researchers want to generalize from validation studies to future practice. An analysis of scaling up technology in drug research reveals two ways to generalize, namely inductive generalization using statistical inference from samples, and analogic generalization using similarity between cases. Both are supported by abductive inference using mechanistic explanations of phenomena observed in the simulations. Illustrations of these inferences both in drug research and empirical RE research are given. Next, four kinds of methods for empirical RE technology validation are given, namely expert opinion, single-case mechanism experiments, technical action research and statistical difference-making experiments. A series of examples from empirical RE will illustrate the use of these methods, and the role of inductive generalization, analogic generalization, and abductive inference in them. Finally, the four kinds of empirical validation methods are compared with lists of validation methods known from empirical software engineering. The lists are combined to give an overview of some of the methods, instruments and data analysis techniques that may be used in empirical RE.

1. Introduction

Empirical assessment of technology comes in two flavors, which in this paper will be called technology validation and technology evaluation, respectively. Technology validation is defined here as the assessment of a simulation of the technology in a simulation of its intended context of use, in order to predict what would happen if the technology were actually used by stakeholders in this intended context. We take the term “simulation” in a very wide sense as the representation of the functioning of one system or process by means of the functioning of another.² For example, a new requirements prioritization technique may be tested by experimenting with it in a classroom. This is a validation if the classroom experiment represents some aspects of what would happen if the technique were used in practice.

Validation always involves scaling up to practice, which means that successive tests take place under increasingly realistic conditions. For example, the inventor of a requirements prioritization technique may use this technique in a real-world project. This validation would resemble real-world use of the technique more than a classroom experiment, except that it is still the inventor herself who uses the technique.

∗ Tel.: +31 53 489 4189. E-mail address: r.j.wieringa@utwente.nl
¹ http://www.cs.utwente.nl/~roelw.
² http://www.merriam-webster.com.

A technology has been transferred to practice if it has been packaged, marketed, distributed, sold or otherwise made available to users, and is now being used independently from the context in which it was invented or tested. After transfer to practice other people than its inventors are using it, and they are using it to achieve their own goals, without help or other kind of intervention from its inventors, and after investment of their own time and/or money to learn to use the technology.

Technology validation is to be contrasted with technology evaluation, defined here as the empirical assessment of a technology as and when used in practice. For example, an RE researcher may study how a prioritization technique is used in real-world projects by means of observational case studies. Where a validation study aims to make predictions, based on simulations, about how a technology would perform if transferred to practice, an evaluation study assesses what has happened in the actual use of the technology after it has been transferred to practice. This follows terminology commonly used in the social sciences, where an evaluation study is an empirical assessment of some social intervention that has been performed, such as a recently implemented teaching method in schools, to investigate its impact in practice (Babbie, 2007).

0164-1212/$ – see front matter © 2013 Elsevier Inc. All rights reserved.

Fig. 1. The new drug development and review process of the U.S. Food and Drug Administration.

Technology validation is a process of scaling up to practice in all engineering sciences. For example, the inventors of the jet engine validated their designs by building increasingly realistic prototypes and testing them in increasingly realistic environments (Constant, 1980). In this paper I will summarize and analyze the ways in which we can scale requirements engineering (RE) technology up to practice.

The RE technology being validated could be techniques, methods, notations, algorithms, etc. used for various requirements engineering tasks such as requirements elicitation, goal analysis, requirements specification, requirements prioritization, traceability management, requirements maintenance, etc. Requirements in this paper are defined as desired properties of a system. Goals, by contrast, are states of the world desired by stakeholders, and for which the stakeholders have committed a budget (time or money) to achieve them. All requirements are goals because they are desired by stakeholders, and stakeholders have committed a budget to achieve them. But not all stakeholder goals are system requirements. Stakeholders have many goals not stated in terms of desired system properties at all.

Davis and Hickey (2004) proposed using the methodology of New Drug Development for scaling up RE technology. I will pursue this idea further in Section 2 and focus in particular on the inferences used in New Drug Development when generalizing from the object of validation research to instances of real-world use of the technology, and show that these inferences can be used in RE research too. In Section 3, I present four methods for empirical technology validation, and show how the generalization methods identified in Section 2 can be used in them. This is illustrated by a series of examples from empirical requirements engineering. Finally, in Section 4, I review the empirical software validation methods identified by Zelkowitz and Wallace (1997, 1998) and by Glass et al. (2001) and show how they fit into the framework presented in this paper, and add a list of examples of techniques for measurement and data analysis used in empirical software engineering. Section 5 ends the paper with a brief summary and outlook.

2. Scaling up

2.1. Scaling up in drug research

Davis and Hickey (2004) were the first to apply the New Drug Development and Review Process of the U.S. Food and Drug Administration to RE technology transfer. I summarize the process in Fig. 1. The following description is based on information provided by the FDA³ and the explanations given by Davis and Hickey (2004), Cowan (2002) and Molzon and Pharm (2005). My analysis goes beyond that of Davis and Hickey by analyzing the three kinds of inference used in this process. I will indicate the analogy of each stage of the New Drug Development process with a stage in scaling up RE technology.

2.1.1. Pre-clinical research

Pre-clinical research is the exploration and validation of drugs before testing them on people. It consists of a synthesis and purification task, and of testing the drug on animals.

In synthesis and purification, a chemical is identified in the laboratory as a potentially effective drug, based on earlier experience reported in the literature, biochemical knowledge, and knowledge of the human body. A theory about why this could be an effective drug to improve a medical condition is postulated. This is the initial version of a design theory that will be tested and elaborated in the following stages of the New Drug Development process. It corresponds in RE research to the initial idea for a new RE technique and the initial justification that this idea might work to solve some RE problem. The rest of the process aims to validate and elaborate this design theory in increasingly realistic contexts.

Animal tests are done to show that the drug would be safe to use in people and to investigate in more detail the biochemical mechanisms that produce the drug’s effects. If there are no negative effects in the investigated contexts (i.e. in animals), and if the mechanisms found in these contexts are expected to be similar to those in human bodies, then this is evidence that the drug is probably safe for humans, and a request is submitted to institutional review boards for permission to test the drug in humans. Usually two different animal species are taken, because a drug usually has different effects in different species (Cowan, 2002). Short-term testing in animals can take up to three years but on average takes 18 months.

The animals are used for testing as natural models of the intended real-world context of the drug, namely the human body. The analog in RE research would be testing a new RE technique on students in a laboratory, to study the effects of the technique and the mechanisms by which these effects are produced. Although the goal of this research would not be to establish evidence for safety of the technique, the purpose would still be to assess whether the benefit of using the technique in practice would outweigh the cost and risk of doing so.

Long-term animal research investigates the long-term effects of using a drug, and may continue into the post-availability stage. This is an example of validation research that continues after transfer of the new technology into practice. Long-term animal research has the logic of validation research, as it simulates the effects of a drug by using a model, and is used to predict what would happen to human bodies. This can be done in RE research too. For example, the effect of using the UML on the comprehension of programs can be investigated in the laboratory, using students as subjects, long after the UML has been transferred to practice. Insights from such a study could be used to predict the effect of UML on comprehension of programs in practice.

2.1.2. Clinical research

In clinical research, the drug is tested on people. It consists of three phases.

• In phase 1, random samples of 20–80 healthy subjects are used to test the drug. The goal is to investigate side effects and the so-called mechanism of action of the drug, which is the biochemical interaction by which a drug produces pharmacological effects. If possible, the effectiveness (positive effects) of the drug is investigated too. This phase may last several months and ends when researchers are sufficiently certain that the drug is safe to use in patients.

• In phase 2, several hundred patients are used to investigate the effect of the drug in controlled studies by means of randomized controlled trials. The goal is to investigate the effectiveness in patients with a specific disease or condition, and to investigate short-term side-effects and identify possible risks. Phase 2 may last several months to two years and ends when knowledge about effectiveness, side-effects and risks is deemed sufficiently well established to do large-scale tests in phase 3.

• In phase 3, controlled and uncontrolled trials with several hundred to several thousand patients are done to gather additional evidence about effectiveness and safety. The goal is to find out if the effectiveness and safety claims can be generalized to the population of all possible patients. Phase 3 may last one to four years and ends when researchers think the claims about effectiveness and safety of the drug can be generalized to the population.

³ http://www.fda.gov/Drugs/DevelopmentApprovalProcess/SmallBusinessAssistance/ucm053131.htm.

As we will see below, when validating RE technology, similar research goals exist, and they are pursued just as in drug research by experimental research in the lab and in the field.

2.1.3. Post-availability studies

After a drug has been approved and is made available to a market, assessment continues in so-called phase 4 studies, for example by surveys or observational case studies of patients using the drug. In our terminology, these are evaluation studies. Post-availability studies are done in RE research too, for example when researchers investigate the effect of using the UML on coding errors in real-world projects.

2.2. Design theories: artifact in context produces effects by mechanisms

I will now analyze the logic of drug validation in more detail, with a view to drawing conclusions about the logic of RE technology validation. In other words, I treat the drug development process as a model of the RE technology transfer process, that we can investigate to learn something about the RE technology validation process, just as we can use animals as models of people to learn something about how people respond to drugs. To abstract from whether we talk about drugs or RE technology, I will call the technology to be scaled up an artifact.

In what follows I present a number of observations of the process described in Section 2.1. The first observation is that the validation tasks in new drug development are divided into three stages: conceptual validation, modeling, and field tests. In conceptual validation (corresponding to synthesis and purification), an artifact is tested by observing its behavior in a very artificial context such as a test tube. Most of the validation is done on paper and consists of computations, worked examples, mathematical proofs, informal arguments tested out with colleagues, etc. In the modeling stage (corresponding to animal testing and phase-1 clinical research), the artifact is tested out on a model. In drug development, these are animals first and healthy people next. There are important ethical constraints in both kinds of tests and the NDA process recognizes the need for an ethical review board at least when transitioning to tests with people. In the field testing stage (corresponding to phase-2 and phase-3 clinical research), real-world cases are used to test the artifact on. These real-world cases are treated as models of arbitrary population elements.

My second observation of drug validation is that what is validated is not the artifact but the artifact in a context, e.g. a drug in a body. Validation is the attempt to test the following prediction (Wieringa, 2009):

[Artifact × Context] will produce Effects.

Effects may change if the context changes, and so the artifact must be investigated in different contexts until it is clear in what range of contexts what range of effects is produced. For example, a new technique for eliciting requirements may be tested on its effects in small projects, large projects, embedded systems projects, information systems projects etc. and be found to be effective in some but not all of these contexts.

Third, when validating an artifact in context, researchers should not only aim at identifying the regular production of an effect in certain contexts, they should also aim to explain this effect in terms of underlying mechanisms. In drug research these are called the mechanisms of action. This term indicates the interactions by which a drug produces a pharmacological effect, including the binding of the drug to molecular targets, its effect on these targets, and the effect on biochemical pathways in the body. For example, caffeine has several mechanisms of action, two of which are that it antagonizes a biochemical compound (adenosine) that inhibits neurotransmitters, and that it increases the activity of neurotransmitters such as dopamine (Nehlig et al., 1992). These mechanisms explain why caffeine has a psychostimulant effect.

The concept of a mechanism of action is similar to that of a principle of operation used in engineering methodology (Vincenti, 1990), which is the top-level theory of the mechanism by which an artifact produces an effect in a context. For example, the principle of operation of an airplane is that by the shape of its wings, air above the wing flows faster relative to the wing than air below it, which according to Bernoulli’s principle produces upward lift. But where the principle of operation is the highest-level view of how an artifact produces some effect in a context, a mechanism of action is the actual realization of this principle in the interactions between components of a low-level architecture of a realized artifact in a real context. The principle of operation of an airplane explains why airplanes fly. The mechanism of action of a particular type of airplane consists of the detailed, low-level interactions among aircraft components and the surrounding air that actually produce the capability of this type of airplane to fly. The mechanisms of action exploit the Bernoulli principle, but do a lot more.

In RE too, mechanisms of action can be identified that explain observed effects. For example, Damian and Chisan (2006) describe the introduction of RE techniques in an organization and identify cross-disciplinary group meetings, and their interaction with other parts of the software engineering organization, as a mechanism that causes fewer defects, less rework, and improved effort estimates.

Validation research thus aims at making predictions of the form

[Artifact × Context] will produce Effects by Mechanisms.

We will call this a design theory (Wieringa et al., 2011).

When researchers design and validate an artifact, they start from an initial idea about the principle of operation of the artifact, which is a solution idea, but not yet an implemented and working solution. This states the top-level principle of operation. When scaling up an artifact to conditions of practice, this initial theory is tested and elaborated, until finally a street-tested architecture with mechanisms of action is delivered, that implements the top-level principle of operation.

The theory of an implemented artifact may be incomplete about the mechanisms that produce the effects, and in the extreme case be totally silent about them. For example, engineers may have found what the detailed structure and texture of a wing surface is that is most conducive to fuel efficiency, without understanding the precise mechanisms by which this happens. If mechanisms are not understood, a slightly different design or an existing design in a new, previously un-encountered context may fail for unknown reasons. For this reason, in the health sciences, evidence of regularity is not good enough to claim regular production of an effect: Knowledge of the underlying mechanisms is needed too (Russo and Williamson, 2007). In engineering, in the absence of knowledge of underlying mechanisms, safety risks are managed by testing design changes and sensitivity to context only in small steps (Petroski, 1994).

• Design theory: [Artifact × Context] produces Effects by Mechanisms
• Value theory: [Effects × Stakeholders] produces Valuation.

Fig. 2. The structure of design theories and of value theories.

A fourth observation of the drug development process is that there is a second theory, that is stakeholder-related (Wieringa et al., 2011):

[Effects × Stakeholders] will produce Valuation.

I will call this a value theory. The theory states that various kinds of stakeholders who experience the effects will attach a positive, negative, or indifferent value to them.

The goal of drug research is not only to identify effects and mechanisms of an artifact in context, but also to identify the value of these effects for stakeholders. Stakeholders like some effects and dislike others. Effects that are liked are called “benefits” and effects that are disliked are often called “side-effects”. Context properties that tend to produce effects that are disliked are called “contra-indications” in drug research.

Finally, like all scientific theories, design theories as well as value theories are fallible. The researcher is not totally certain about them and must state the extent of his or her uncertainty. The uncertainties with which effects, benefits, costs and side-effects can be predicted are called “risks”.

All of these concepts are relevant for RE research too. For example, the use of mobile RE technology to elicit requirements has the benefit that user requirements may be more concrete, detailed and complete than is possible by other elicitation techniques. But that is not certain, and this is a risk (of a benefit not materializing). It may also have the side effect that the user expects that each and every need she enters will be satisfied in the near future. This too is a risk (of a negative outcome). Also, mobile RE technology may result in huge amounts of textual and multimedia data that must be analyzed manually, which is a cost. In short, when validating RE technology, the RE researcher is not only interested in the effects of an artifact in context and the mechanisms that produce them, but also in the benefit, cost and risk of using this technology in some contexts (Fig. 2). Both kinds of theory are important, but in the rest of this paper, I will focus on the development and validation of design theories.
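The two theory schemas can be made concrete as simple record types. The following Python sketch is only an illustration and not part of the method itself: the class names, the field names, the stakeholder label "requirements engineer" and the wording of the effects are assumptions introduced here, loosely based on the mobile RE example above.

```python
from dataclasses import dataclass, field


@dataclass
class DesignTheory:
    """[Artifact x Context] produces Effects by Mechanisms."""
    artifact: str
    context: str
    effects: list[str]
    # Mechanisms may be partly or wholly unknown, hence the empty default.
    mechanisms: list[str] = field(default_factory=list)


@dataclass
class ValueTheory:
    """[Effects x Stakeholders] produces Valuation."""
    # Maps (effect, stakeholder) to "positive", "negative" or "indifferent".
    valuation: dict[tuple[str, str], str]

    def effects_valued(self, stakeholder: str, value: str) -> list[str]:
        """Return the effects to which this stakeholder attaches this value."""
        return [effect for (effect, who), v in self.valuation.items()
                if who == stakeholder and v == value]


# Hypothetical instance; the valuations are illustrative, not measured.
design_theory = DesignTheory(
    artifact="mobile RE tool",
    context="end users recording their own needs",
    effects=["more concrete requirements",
             "user expects every need to be met",
             "large volume of data to analyze manually"],
)
value_theory = ValueTheory(valuation={
    ("more concrete requirements", "requirements engineer"): "positive",
    ("user expects every need to be met", "requirements engineer"): "negative",
    ("large volume of data to analyze manually", "requirements engineer"): "negative",
})

benefits = value_theory.effects_valued("requirements engineer", "positive")
disliked = value_theory.effects_valued("requirements engineer", "negative")
print(benefits, disliked)
```

Keeping the valuation separate from the design theory mirrors the separation in Fig. 2: the same effects can be valued differently by different stakeholders without changing the design theory.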

2.3. Inferences that support design theories

Looking once more at the drug development and review process in Fig. 1, we see that two kinds of inferences are used to generalize from experiments to the population of potential patients: inductive generalization and analogic generalization. Inductive generalization is the statistical inference from a sample of test subjects to the population of subjects. Analogic generalization is the inference from models (such as animals, and healthy volunteers) to patients. This is represented in Fig. 3, where inductive generalization is the horizontal dimension and analogic generalization is the vertical dimension.

Fig. 3. Two kinds of inferences when scaling up to practice: inductive generalization from samples to population (horizontal) and generalization by analogy from experimental cases to real-world cases (vertical).

We discuss these generalizations in the next two paragraphs. Next, we discuss a third kind of inference, called abductive inference, that can be used to support analogic as well as inductive inference. Finally, we combine these inferences in two kinds of reasoning:

• In case-based reasoning, analogic generalization about cases is supported by abductive inference (vertical dimension of Fig. 3);
• In sample-based reasoning, inductive generalization about samples is supported by abductive inference (horizontal dimension of Fig. 3).

2.3.1. Inductive generalization

Inductive generalization is the generalization from samples to populations using statistical inference, such as statistical hypothesis testing or statistical parameter estimation (Hacking, 2001). Sample sizes in drug research start at about 30 elements, and increase to hundreds or even thousands of elements. The larger the sample, the larger the power of the experiment to discern small effects.
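The relation between sample size and power can be illustrated with a small calculation. The sketch below assumes a two-sided two-sample z-test with known, equal standard deviations and equal group sizes; the function name, the effect size of 0.2 and the group sizes are illustrative assumptions, not values from the studies discussed here.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the standardized mean difference, i.e. the difference
    in means divided by the common (assumed known) standard deviation.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power to detect a small standardized effect at increasing sample sizes.
for n in (30, 100, 1000):
    print(n, round(power_two_sample_z(0.2, n), 2))
```

Under these assumptions the power for a fixed small effect rises steadily with the number of subjects per group, which is the point made in the paragraph above.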

I call this kind of inference inductive. The term “induction” is given different meanings by different people, but here I follow Douven (2011) in calling an inference inductive if it is based purely on statistical data. In the context of this paper, this means that inductive inference is statistical inference, in which sample data are used to estimate a statistical population parameter or to test a statistical hypothesis about a population parameter. Inductive inference is the horizontal dimension in Fig. 3.

2.3.2. Generalization by analogy

The vertical dimension of Fig. 3 is generalization by analogy from the object of study (OoS) to the real-world population units to which the researcher wishes to generalize. The object of study has the structure

(model of the artifact) × (model of the context),

and is the entity studied by the researcher. See also Fig. 4. The model of the artifact is often an artifact prototype, and the model of the context can be a simulated context in the laboratory. In field research, the model of the context is a real-world context that stands as model for other real-world contexts. The treatment and measurement elements of Fig. 4 will be discussed later.

Generalization by analogy reasons about cases. For example, if in one agile development project performed for a small company, we have observed that the company lacked the resources to put a customer on-site, we may infer that in similar cases, a similar thing may happen. Each generalization by analogy reasons from one or more similar source cases to one or more similar target cases.


Fig. 4. The structure of validation research.

This contrasts with inductive generalization, which reasons from a sample of cases to the population of all cases. For example, if we have observed that in a random sample of 100 agile projects performed for small companies, 90% of the companies do not put a customer on-site, we may estimate from this a confidence interval for the proportion of the population in which no customer is put on-site.
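As a worked illustration of this estimate, the following sketch computes a 95% confidence interval for the population proportion from the figures in the example (90 of 100 sampled projects). The choice of the Wilson score interval and the function name are assumptions made here; the text does not prescribe a particular interval method.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 90 of 100 sampled projects had no customer on-site.
low, high = wilson_interval(90, 100)
print(f"{low:.3f} to {high:.3f}")
```

The interval is only meaningful if the 100 projects really are a random sample of the population of interest, which is exactly the assumption that inductive generalization rests on.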

To add more illustrations we discuss the role of analogic generalization in the New Drug Development process. In synthesis and purification, researchers build a prototype of the drug, interacting with some biochemical processes, that has sufficient similarity with processes in the human body to be able to draw some preliminary conclusion about the effect of the drug in the human body. In RE research, this would be similar to hand-testing a new technique, or formally proving some properties to show what the technique can do in an idealized context.

Next, drug researchers experiment with the drug in animals, used as natural models of the human body. As pointed out earlier, this would be analogous to the use of new RE techniques in student projects, which are then used as natural models of real-world projects with practitioners. There is detailed research in some fields of drug research to assess which animal species are valid models of humans with respect to which research questions, and for which research questions they are not (Willner, 1991). We find the same kind of “similarity research” in engineering research too. For example, to be able to reason from observations of a model in a wind tunnel to the behavior of airfoils in real flight, there must be a theory of similarity between wind-tunnel models and real-world flight (Vincenti, 1990). To my knowledge there is little research in this area in RE, but there has been some similarity research in software engineering that studies with respect to which research questions student behavior in student projects is similar or dissimilar to the behavior of professional software engineers in software projects (Höst et al., 2000; Runeson, 2003; Sjoberg et al., 2003; Svahnberg et al., 2008).

In the clinical phase, drug researchers start with healthy people, and then continue with ever larger samples of patients. In RE research this would correspond to using new technology first in pilot projects in companies with a mature RE process, continuing with ever larger samples of pilot projects in companies with low levels of RE maturity.

Generalization by analogy also includes reasoning by extreme cases, in which one case is known to be similar to other cases in some relevant aspects, but extremely different in another aspect. For example, from the observation that an RE technique is easy to understand and use by Master’s students in software engineering, one might conclude that it will also be easy to understand and use by experienced software engineers. Master’s students and software professionals are similar in some respects, but they are dissimilar in the extent of experience in software engineering that they have. Students are an extreme case w.r.t. extent of experience. The theory of similarity used to support this analogic inference is that an increase in experience of otherwise similar subjects preserves understandability and usability of a technique.

In general, generalization by analogy must be supported by some theory of similarity between the OoS and all population elements, that explains why a phenomenon observed in a model could lead to conclusions about population units. What theory is needed depends on the question asked. Students may be good models of practitioners when validating effort estimation techniques but not when validating multi-stakeholder prioritization techniques (cf. the experiment by Höst et al., 2000).

2.3.3. Abductive inference

There is a third kind of inference used in drug validation research, called abductive, that can be used to support both inductive and analogic generalization. Abductive inference is reasoning from observed phenomena to what is considered the best explanation of the phenomena (Douven, 2011). There are many kinds of abduction, and here I define one kind, that I call mechanistic abduction, in which observed phenomena are explained in terms of component-based mechanisms that produced them. I define a component-based mechanism, in turn, as a repeatable process in which system components interact to respond to a stimulus. This concept of mechanism is known in object-oriented software engineering, where a UML collaboration diagram represents a mechanism consisting of software objects that pass each other messages when responding to a stimulus (Cook and Daniels, 1994). But component-based mechanisms can occur in any kind of system, as we saw when we discussed the concept of mechanism of action of a drug. Component-based mechanisms are used to explain biological phenomena in terms of the interactions between cells and chemical substances, or the interactions between the organs of an organism (Bechtel and Richardson, 2010; Bechtel and Abrahamsen, 2005). In the social sciences, component-based mechanisms are used to explain social phenomena in terms of interactions between people, organizations, institutions and other social systems and their components (Bunge, 2004; Hedström and Ylikoski, 2010).

In RE too, component-based mechanisms can explain the effects of an RE technology in terms of interactions between components of social, technical, physical, and digital systems. I already mentioned the mechanism identified by Damian and Chisan (2006), by which cross-disciplinary group meetings, and their interaction with other parts of the software engineering organization, resulted in fewer defects, less rework, and improved effort estimates. As another example, Seyff et al. (2010) identified two mechanisms that reduce the use of audio recording in mobile RE: Participants felt uncomfortable if they voice-recorded their needs in a public place, and public places often contained too much background noise to do the recording.

To sum up, abductive inference is the identification of component-based mechanisms that explain effects. They complete the prediction

[Artifact × Context] will produce Effects.

with an explanation of the mechanisms by which the effects are produced. As indicated earlier, researchers will not always be able to explain all mechanisms of interaction between an artifact and a concrete, practical context. To the extent that fewer mechanisms are known, there is less confidence that statistical regularities in behavior are stable under changes in context.

2.3.4. Case-based and sample-based reasoning

Adding abductive inferences to analogic and inductive generalization, respectively, we get case-based reasoning (CBR) and sample-based reasoning (SBR) (Fig. 5).

Case-based reasoning is the vertical dimension of our diagram of scaling up (Fig. 3). It consists of two steps, namely abduction and generalization by analogy. In the first step, a single case is analyzed to identify an architecture of the case in terms of its components and their interactions, so that this architecture may provide an explanation of observed effects in terms of component-based mechanisms. For example, if in a case of agile development performed by an independent developer for a small company, no client representative is on-site, then this can be explained by the limited resources of the small company. This is an abductive inference.

[Figure: Observations, Explanations and Generalizations, linked by labeled inference steps. CBR: (1) Abduction, (2) Analogy. SBR: (1) Induction, (2) Abduction, (3) Analogy.]

Fig. 5. Reasoning from observations to explanations and generalizations in case-based reasoning (CBR) and in sample-based reasoning (SBR).

Next, in CBR we can generalize by analogy by postulating that in cases with the same or a similar architecture (independent developer doing agile development for a small company), the same effects will be produced by the same or similar mechanisms. The theory of similarity that supports the generalization here is that small companies have limited resources and will prefer to trust the developer rather than spend their scarce resources on agile requirements engineering. In CBR, the theory of similarity that supports the analogic generalization is stated in terms of a component-based mechanism.

Sample-based reasoning, the horizontal dimension in our diagram of scaling up (Fig. 3), is more complex. It consists of three steps (Fig. 5). From observations of a statistically meaningful sample of the population (e.g. a random sample of at least 30 elements), the researcher infers statistically that the population has some characteristics. For example, from an experiment with a sample of student projects, the researcher may be able to infer statistically that there is a correlation between the use of some requirements notation and the quality of requirements specifications in the population of student projects. This inference is fallible, because the sample may coincidentally show a pattern that is absent from almost all other samples from this population.
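The first, statistical step can be sketched with a permutation test, which asks how often a difference at least as large as the observed one would arise if the notation made no difference. Everything specific in this sketch is an assumption for illustration: the quality scores are invented, the function name is hypothetical, and a permutation test is only one of several tests a researcher might use here.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the group labels at random.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical specification quality scores (0-10) for student projects.
with_notation = [7, 8, 6, 9, 8, 7, 9, 8]
without_notation = [5, 6, 4, 6, 5, 7, 5, 6]
p_value = permutation_p_value(with_notation, without_notation)
print(p_value)
```

A small p-value supports only the statistical claim about the population of student projects; the causal and mechanistic explanations, and the generalization to other populations, are the separate steps described next.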

Second, the researcher may then list all possible causes of such a correlation, and be able to argue that the best explanation is that the notation caused the difference in quality. This is an inference to the best explanation, i.e. an abduction. If the researcher can explain the postulated impact of notation on quality by some intermediate cognitive mechanism, that is postulated by a previously established theory, then this is a second, mechanistic abduction, that increases the support for the first one.

Third, the researcher may want to generalize the claim about the population further to similar populations, by analogy. For example, from a statistical generalization about student projects, the researcher may want to generalize further to the population of all real-world projects with junior software engineers, and justify this generalization by the similarity of the architecture and mechanisms of the student projects to the architecture and mechanisms of real-world projects with junior software engineers. This analogic generalization too may be supported by a mechanistic explanation, if the mechanism that explains the phenomenon in student projects can also explain that phenomenon in professional software engineering projects.

Double support for causal claims, in statistical evidence provided by statistical difference-making experiments, and in independently verified mechanisms that can explain the causal relationships inferred from the statistical experiments, seems to be common practice in the health sciences (Russo and Williamson, 2007). Thus, sample-based reasoning and case-based reasoning have a useful supplementary relationship. After providing support for an inductive generalization about the effect of an artifact, the researcher may do some case studies, or some single-case mechanism experiments as described later, in an attempt to find and understand the mechanisms that produce this effect. Or, the other way around, after finding that a mechanism has produced an effect in a few cases, the researcher may do a statistical difference-making experiment to support the claim that this effect can be generalized statistically to the population. Thus, the two generalization dimensions in the diagram of scaling up (Fig. 3) must be traveled together.

2.4. Validity of inferences

All three kinds of inferences discussed are fallible, meaning that their conclusions could be false even if their premises are true. The researcher must therefore spell out the reasons that support the conclusion, and also summarize the reasons why the conclusions could be false after all. This is called a discussion of validity of the conclusions. Since “validity” suggests “justifiable” or even “truth”, this term is misleading. A less misleading term would have been “plausibility” or “support”. However, I will stick to the accepted terminology.

In Table 1 we can see that the three kinds of inferences correspond to three well-known kinds of validity. Conclusion validity is the support for the conclusion of a statistical inference. Threats to conclusion validity include low power, small sample, non-random sample, non-random allocation, violation of assumptions of statistical algorithms, etc. Note that even if conclusion validity is sufficiently well argued, it is still possible that the experiment is one of the 5% of experiments that show a statistically significant difference by chance, i.e. without there being a mechanism that produces the difference.
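The remark about chance findings can be illustrated with a small simulation. The following is only a sketch under stated assumptions: both groups are drawn from the same normal population (so no mechanism produces a difference), the test is a two-sided z-test at the 5% level, and the function name, sample size, and seed are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist, mean, variance


def false_positive_rate(trials=2000, n=50, alpha=0.05, seed=1):
    """Share of experiments without any real effect that still look significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    hits = 0
    for _ in range(trials):
        # Both groups come from the same population: any difference is chance.
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        standard_error = (variance(a) / n + variance(b) / n) ** 0.5
        z = (mean(a) - mean(b)) / standard_error
        if abs(z) > z_crit:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"share of 'significant' experiments without any effect: {rate:.3f}")
```

Under these assumptions the reported share should lie in the neighborhood of 5%, which is why statistical significance alone does not establish that a mechanism exists.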

Internal validity is the support for an explanation of a phenomenon by causal mechanisms that produced the phenomenon. A major threat to internal validity is that outcomes of an experiment may not only be explained by a mechanism that leads from treatment to outcome, but by other mechanisms too. For example, if the OoS contains people, then history, maturation, and attrition may influence the outcome, in addition to the influence of instruments, tests, the experimenter, semantic ambiguities, etc. in the experiment (Shadish et al., 2002, page 54 ff.). For the reader of a research report to assess the support for the abductive inference that the observed outcome is produced by some mechanisms, alternative explanations must be listed explicitly.

External validity appears in two flavors in Fig. 5: in case-based reasoning and in sample-based reasoning. In case-based reasoning, external validity is the validity of the analogic inference from a single-case explanation to all similar cases. For example, a mechanism observed in [(artifact prototype) × (simulated context)] in the laboratory is generalized to all [artifact × context] cases in the real world. In sample-based reasoning, external validity is the validity of the analogy of one population to another population. For example, a conclusion about the population of student projects is generalized to a conclusion about the population of real-world projects. In both flavors, external validity is the validity of the inference from the studied OoS to all similar cases in the real world. As observed by Gigerenzer (1984), determining external validity is an empirical question. If conclusions from an experiment in context A are generalized, fallibly, to context B, then one can test this generalization by repeating the experiment in context B. This is in fact what is done when scaling up from the lab to the real world.

Threats to external validity are sensitivity of the effects of an artifact to the context in which it is used, dissimilarity of the treatment used in the lab to treatments used in practice, interference of other mechanisms with the mechanism of interest, absence of a theory of similarity that could justify the generalization, etc. Shadish et al. (2002, pages 86 ff.) and Wohlin et al. (2012, page 110) provide detailed discussions.

Table 1
The three kinds of inference and some validity considerations.

Inductive inference: Estimation of a population parameter, or decision about a statistical hypothesis about the population, based on observations of a sample.
Conclusion validity: Are the assumptions of the statistical algorithms satisfied? Random sample? Homogeneous sample? Random allocation? Statistical power and effect size? Reliable measures? Logical errors in reasoning from sample statistics to population hypotheses? Etc.

Abductive inference: Explaining a phenomenon by identifying the causal mechanisms that produced it.
Internal validity: Are there alternative explanations? Is there a common cause that could explain the phenomena? Can the context of the experiment, the behavior of the experimenter, or phenomena in the sample of subjects explain the outcome of the experiment? Etc.

Analogic inference: Concluding that a target will have the same properties as a source (the experiment) because of some similarity between them.
External validity: Is there a theory of similarity? Does the theory of similarity justify the conclusions? Are the mechanisms in the target the same as those in the source? Are there other mechanisms that could interfere with the mechanism of interest? Is the effect context-sensitive? Etc.

3. Methods for validation research

We will discuss the empirical validation methods using the structure of Fig. 4. We have used this structure earlier to make a checklist for empirical research reports (Wieringa et al., 2012). The researcher uses an object of study (OoS) to represent elements of the population, where in our case the population elements have the structure [artifact × context]. Therefore, the OoS has this structure too, consisting of a model of the artifact and a model of the context. The OoS is a model of an arbitrary population element in the sense that it is similar to population elements, and can be studied by the researcher to learn something about population elements (Apostel, 1961). An example would be an OoS that consists of a prototype of a software product, interacting with a simulation of a problem context; or an RE technique (the artifact) interacting with a student project (the context).

In statistical research, the researcher studies a sample of OoS’s of statistically meaningful size. In case research, the researcher studies a small sample or even a single OoS.

In experimental research, the researcher applies a treatment to an OoS and then measures what happens. In observational research, the researcher measures what happens, but does not apply a treatment. Measurement as well as treatment usually require instruments.

In the diagram of Fig. 4, all interactions are bidirectional: One cannot treat an OoS without the OoS exerting some influence on the treatment instrument, and one cannot measure an OoS without exerting some influence on the OoS.

The concept of treatment needs some explanation. So far we have taken a component-based view of the world, in which the world is modeled as a hierarchy of systems that contain subsystems, that contain sub-subsystems, etc. Thus, the population consists of artifacts interacting with a context, and research has a structure of components as shown in Fig. 4. In this view, a treatment is the insertion of a component in a context. For example, a doctor treats a patient (the context) by giving them a drug (the artifact), and a consultant helps a software engineering organization (the context) by inserting an improved RE technique (the artifact).

Note that the artifact not only consists of a product, for example a drug or an RE tool, but also of a process, for example the protocol for taking the drug or the procedure by which to use the tool. The experimental treatment then consists of making this product available and giving instruction in the process.

We can describe the same experimental situation also in the more traditional view, in which a treatment is setting the level of an independent variable. This is a more abstract, variable-based view of experiments, which will be convenient to use in Section 3.4 on statistical experiments. Until then, it is more illuminating if we use the component-based view of Fig. 4. Table 2 lists four groups of validation research methods that we will discuss in the following sections.

3.1. Expert opinion

In the conceptual stage of validation, before the artifact is tested on models or in the field, the researcher can elicit the opinion of experts about the possible usability and usefulness of the artifact. This is observational empirical research, because the researcher does not intervene in an object of study. The researcher elicits opinions. It is also not a statistical survey with the aim to estimate the distribution of opinions in the entire population of experts. Rather, it is an attempt to get early information about expected usability and usefulness of the artifact in real-world contexts. Thus, statistically meaningful sample sizes are not needed; useful opinions are needed.

Table 2
Validation research methods.

Methods and examples

Researching expert opinion
• Eliciting expert opinion using interviews,
• Questionnaires, or
• Focus groups

Single-case mechanism experiments
• Testing an artifact prototype on a simple example in the lab
• Testing an artifact prototype on a realistic example in the lab
• Testing an artifact prototype on a realistic example in the field

Technical action research
• Using an artifact prototype to help a client
• Teaching the use of an artifact prototype to a client by which they can solve some of their problems

Statistical difference-making experiments
• Comparing the effect of prototypes of two or more artifacts on a sample of simulated contexts in the lab
• Comparing the effect of prototypes of two or more artifacts on a sample of contexts in the field

In terms of Fig. 4, the population is not the set of all possible experts but the set of all possible [artifact × context] elements. So the object of study is not the expert either. Rather, the expert is an instrument to measure an imaginary OoS, namely a mental image that the expert has formed of real-world [artifact × context] elements. This is an unreliable instrument, but one that nevertheless can give useful information.

Positive or uncritical opinions of experts are not very useful, because experts may be motivated by the desire to finish the interview quickly, or to be nice to the researcher. Negative or critical opinions on the other hand are very useful, especially if the expert can indicate which element of the artifact design would not be usable or useful in which context, and why.

Example 1. Al-Emran et al. (2010) present an optimization method for product release planning. Input to the optimization method is a set of product release plans, each consisting of a sequence of features to be implemented in subsequent releases of a product. The optimization method then finds the plan that is most robust, in terms of time-to-market, resource assignment, and task schedule, with respect to differences in task workload and developer productivity. That is, it selects the release strategy that is least influenced by differences in workload and productivity.

The researchers tested the optimization method among others by submitting it to experts, asking their opinion about it. The researchers sent a questionnaire about the method to 25 product development experts and received 13 responses. Many responses were uninformative in that the respondents thought the method was usable and useful. Some respondents, though, complained that using this optimization decreased their understanding of the release planning process, and others complained that they required more justification of the result before they would adopt the recommendation. These remarks point at potential improvement needs of the method.

Collecting expert opinion combines the two dimensions of scaling up. Experts imagine a sample of cases (informal sample-based reasoning) and imagine what mechanisms would occur in each of those cases (informal case-based reasoning). Because of the informality of their reasoning, their opinions must be treated with caution, but nevertheless they must be treated seriously.

3.2. Single-case mechanism experiments

I use the term single-case mechanism experiment to indicate experiments in which the researcher investigates one OoS in order to test the effect of some mechanism that the researcher believes to be present in the OoS. Software engineers do this when they test a software prototype by feeding it input scenarios that represent possible scenarios in the intended context of use. Aeronautical engineers do it when they test an airfoil in a wind tunnel.

I give some examples from RE research before I discuss the logic of single-case mechanism experiments.

It is not my purpose here to judge the quality of the analogic generalizations or of explanations given by authors, but merely to illustrate what the role of analogy and of mechanistic explanations in validation experiments is.

Example 2. Gacitua et al. (2011) propose a new algorithm for the identification of single- and multi-word abstractions in requirements documents, and describe an experiment in which they compare the performance of this algorithm with human judgment. The algorithm compares the frequency of a term in a document with its frequency in a reference document of the language used in the requirements document, such as a corpus of standard English. Terms that are rare in the reference document, but occur frequently in the requirements document, are likely to indicate important concepts in the domain.
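The frequency comparison described above can be sketched in a few lines. This is not the algorithm of Gacitua et al.; it is a minimal illustration that assumes single-word terms only, a plain ratio of relative frequencies as the score, and add-one smoothing for terms missing from the reference text. The function names and example texts are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_terms(document, reference, top=5):
    """Rank document terms by how much more frequent they are than in the reference."""
    doc_counts = Counter(tokenize(document))
    ref_counts = Counter(tokenize(reference))
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())
    vocabulary = len(set(doc_counts) | set(ref_counts))
    scores = {}
    for term, count in doc_counts.items():
        doc_rel = count / doc_total
        # Add-one smoothing (an assumption) avoids division by zero for
        # terms that never occur in the reference text.
        ref_rel = (ref_counts[term] + 1) / (ref_total + vocabulary)
        scores[term] = doc_rel / ref_rel
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]


document = "the train schedule shows the train delay and the platform for the train"
reference = "the cat sat on the mat and the dog saw the cat for the first time"
print(rank_terms(document, reference))
```

In this toy input, "train" is frequent in the document and absent from the reference, so it ranks first, while a common word such as "the" scores low because it is frequent in both texts.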

To test this algorithm, the authors selected a book on a technical domain, and used its body as if it were a requirements document to analyze. The index of the book was used as a reference list of domain concepts. Thus, the artifact to be tested is the algorithm, and the book and its index are the context; both make up the OoS. The treatment in this experiment is the request to identify abstractions in a book. The measurement is the measurement of recall and precision with respect to the index terms.
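For readers unfamiliar with the two measures, the following sketch shows how recall and precision against a reference list can be computed. The term lists are invented for illustration and are not data from the experiment; the function name is hypothetical.

```python
def precision_recall(extracted, reference):
    """Precision and recall of extracted terms against a reference list."""
    extracted, reference = set(extracted), set(reference)
    hits = extracted & reference
    # Precision: which share of the extracted terms is in the reference list?
    precision = len(hits) / len(extracted) if extracted else 0.0
    # Recall: which share of the reference list was found?
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


extracted_terms = ["train", "delay", "cat", "dog"]
index_terms = ["train", "delay", "platform", "schedule", "ticket"]
print(precision_recall(extracted_terms, index_terms))  # (0.5, 0.4)
```

Two of the four extracted terms appear in the index (precision 0.5), and two of the five index terms were found (recall 0.4).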

Concerning external validity, the authors argue that the book’s domain is similar to the domain of RE documents, that the size of the document is similar to documents in RE projects, and that the concept abstraction scenario is similar to that of a requirements engineer who has to familiarize herself with a new domain. Also, they argue that the hierarchical structure of the index is representative of the structure of multi-word terms in the intended population.

Internal validity is the question whether the mechanisms built into the artifact explain the observed effects. In this experiment, the frequency-based mechanism yielded low recall and precision. The authors’ explanation is that the identification of abstractions by people does not take place by a frequency-based method, and that frequency in general is not a sufficiently powerful mechanism to identify abstractions.

Example 3. Seyff et al. (2010) tested a tool for mobile requirements engineering in the field. They gave mobile phones running the tool to nine subjects, who used it for a few days to gather requirements for a system that supports daily commuting, and requirements for a system that supports shopping activities. The requirements were stated in text or audio. After the experiment, subjects were debriefed, and researchers transcribed recorded needs into system requirements.

In this example, the OoS consists of an artifact prototype, interacting with a realistic context. The context consists of the mobile phone on which the prototype runs, the users using the prototype, and the environment in which the users move. The treatment consists of the instructions to the users to use the tool for two RE purposes. The treatment instrument is the instruction session in which the users were instructed. The measurements consist of the data (text or audio) entered by the user as well as the answers of the users to researchers’ questions in the debriefing session.

The similarity between these OoS’s and the envisaged population of future real-world mobile RE processes is threatened by potential differences between future tools and the one used in this experiment, and possibly also by differences in elicitation methods. Remember that the artifact in this example consists of a mobile RE tool plus the process for using it.

The researchers made a mechanism (a mobile RE tool) available to users to test if it produced the expected effects (recorded contextual end-user needs). The mechanism had the expected effect in all nine investigated cases. A threat to the validity of this observation is that the subjects may have wanted to be nice to the researchers, which would be a factor co-producing the expected effect. Other users, without a friendly disposition to the researchers, may have failed to produce the effects when interacting with the tool.

These experiments allow analogic inferences from the OoS to the population, and can be placed along the vertical dimension of our diagram of scaling up (Fig. 3). As we saw, analogic generalizations must be supported by a theory of similarity that explains why an observation on the source of the analogy can lead to a conclusion about the target of the analogy. In the component-based view of the world that we take, the similarity between source and target of an analogy must be architectural, and the theory of similarity must indicate some component-based mechanism that produced a response in the experiment, and can produce a similar response in the real-world cases of the population. Hence the name “mechanism-based experiments”. These experiments do not use statistical inference to support claims about the population, but they use a theory about mechanisms to support claims about the population.

We distinguish mechanisms in the artifact from mechanisms in the context.

• The artifact is by design a collection of component-based mechanisms that responds to input. Part of software testing consists of validating whether these mechanisms, if implemented correctly, indeed have the desired effects. This may lead to surprises, in the sense that unexpected phenomena may turn up (e.g. bugs) that are the results of unexpected mechanisms in an implementation. In general, in algorithm validation, there may be unexpected mechanisms in the program because our ability to program may exceed our ability to understand what we programmed.

This is less likely to happen when testing a method. Step-by-step methods such as the Rational Unified Process build up their results in a simple manner, by instructions of the form “bring about result X”, which, if performed correctly, lead to the creation of result X. If these methods are performed by capable software engineers in an ideal context, we usually do not run up against unexpected mechanisms in the method itself, because methods are relatively simple step-by-step procedures.

• Once a method has been shown to be usable by the researcher and his or her students, the important research question is whether it still works under conditions of practice, i.e. in the real world. In real-world contexts, there may be components or mechanisms that impact the production of the desired result of a method step in unexpected ways. An example of an unexpected mechanism in mobile RE is the tendency of users to be very brief in the textual specification of their needs when they are in a physically confined space, and to enter explanations by audio later. This may make needs analysis more time consuming, which in turn may reduce the timeliness and cost-effectiveness of the requirements specification. The researchers may use this information to change the method.

3.3. Technical action research

Technical action research (TAR) is a case-based mechanism experiment too, but I list it separately because it is also something else: It is a real-world consultancy project (Wieringa and Morali, 2012). In a TAR project the researcher uses an artifact in a real-world project to help a client, or gives the artifact to others to use it in a real-world project (Engelsman and Wieringa, 2012), and uses this experience to learn about the robustness of the intended effects and the mechanisms that bring them about, in uncontrolled conditions of practice.

Example 4. Morali and Wieringa (2010) describe a method to assess confidentiality risks when outsourcing the management of IT systems. They then describe how Morali used this method to actually assess the confidentiality risks in the outsourcing relationship between a large manufacturing company and a large outsourcing service provider.

In this example, the artifact is a new risk assessment method, and the context consists of Morali applying this method to a risk assessment problem in a large company. Morali played a dual role as researcher giving an instruction on how to use an artifact to an OoS in which she herself was the user of the artifact. The measurements taken consisted of all intermediate working documents of the project, plus the diary of Morali in her role as user of the method.

The similarity of this TAR project to the population of all such risk assessment projects is that a confidentiality risk in an IT management outsourcing situation is assessed. There is also a dissimilarity, which is that in most projects in this population, Morali will not be the one doing the risk assessment. This is a threat to external validity that must be dealt with by repeating TAR projects like this with other researchers.

Internal validity is the question whether the method indeed delivered its expected results, and whether any mechanisms in the context influenced this. The method did deliver its expected results, but only repeated TAR projects can show whether or not this is due to the method only or also to the user of the method (Morali), the quality of the documentation available in the company, etc.

TAR is a special kind of mechanism-based experiment, and it takes the researcher closer to the real world along the vertical dimension of the process of scaling up (Fig. 3). The inferences in TAR are of the same kind as those in other case-based mechanism experiments. Events in the case are explained in terms of mechanisms, and any generalization to the population is supported by a theory that says that these mechanisms can occur in population elements too. However, generalizations from TAR projects have an additional threat to validity, because the researcher may have contributed positively to the observed events in a way that cannot be replicated.

TAR is useful as a final validation stage before transferring a technology to practice, because it is closer to real-world practice than other case-based mechanism experiments. A single TAR project is not enough to justify the claim that an artifact is applicable in the entire target population of possible projects. But it does justify the claim that the artifact is usable and useful in some real-world projects, and it can provide useful information to the researcher for further improving the artifact.

3.4. Statistical difference-making experiments

A statistical experiment is an experiment with a sample of OoS’s to infer a statistical property of the population. For example, it may estimate the population mean of a variable, with a confidence interval, from observations of the sample mean. Or it may test a statistical hypothesis about the population mean by observations from a sample.

In contrast to case-based mechanism experiments, the sample size is relevant, because as indicated in Fig. 5, inference is sample-based, not case-based. Statistical experiments support inductive inferences from samples, which is the horizontal dimension of scaling up (Fig. 3). They do not require a theory of mechanisms to generalize inductively to a population, but as we have seen in Section 2.3, providing such a theory does give additional support to an inductive generalization, because it would decrease the likelihood that the inductive generalization is based on a coincidental pattern in the data.

To describe statistical experiments we need to switch from the component-based view that we have taken up till now, in which the world consists of components and interactions, to a variable-based view of the world, in which the world consists of variables and relationships. Any description of the world in terms of components and interactions can be replaced by a more abstract description in terms of variables and relationships. In this variable-based view, a treatment consists of setting the value of an independent variable, and effects are measured by measuring the values of dependent variables. If there are two treatments, often one is called the “treatment” and the other the “control”, dividing the sample into a treatment group and a control group.


Inductive inference from sample to population can take place in a variety of ways, depending on how OoS’s were selected (sampling) and how treatments were allocated to OoS’s. In a randomized controlled trial (RCT), the sample is random and the allocation of treatments to sample elements is random too (Shadish et al., 2002; Sedgwick, 2011). This makes it possible to use the central limit theorem to support the inductive inference from sample to population. There are two ways to do this, by hypothesis testing and by confidence intervals (Hacking, 2001; Wonnacott and Wonnacott, 1990). In statistical hypothesis testing the experimenter may observe a difference in the sample that would be very unlikely (probability less than 5%) to occur if a difference did not exist in the population. The researcher will then infer that, plausibly, the difference exists in the population. In the estimation of confidence intervals, the experimenter may estimate a population difference by the sample difference using a 95% confidence interval around the sample mean. This estimation may be right or wrong, but if she follows this estimation rule always, she will be wrong in the long run in only about 5% of the inferences (Hacking, 2001).
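Both routes can be illustrated with a small computation. The sketch below assumes a normal approximation for the difference of two sample means, which the central limit theorem motivates for large samples; for samples as small as the invented ones used here, a t distribution would normally be preferred. The function names and the data are hypothetical.

```python
from statistics import NormalDist, mean, variance


def difference_ci(treatment, control, confidence=0.95):
    """Approximate confidence interval for the difference of two population means."""
    diff = mean(treatment) - mean(control)
    standard_error = (
        variance(treatment) / len(treatment) + variance(control) / len(control)
    ) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return diff - z * standard_error, diff + z * standard_error


def significant(treatment, control, alpha=0.05):
    """Two-sided test: reject 'no difference' if the interval excludes zero."""
    low, high = difference_ci(treatment, control, confidence=1 - alpha)
    return not (low <= 0.0 <= high)


# Invented task scores for two groups of projects.
treatment = [12.0, 14.0, 15.0, 13.0, 16.0, 14.0, 15.0, 13.0]
control = [10.0, 11.0, 12.0, 10.0, 11.0, 12.0, 11.0, 10.0]
print(difference_ci(treatment, control))
print(significant(treatment, control))
```

The two functions are deliberately tied together: under this approximation, a difference is significant at the 5% level exactly when the 95% confidence interval does not contain zero.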

Having inductively (and fallibly) inferred that there is a statistical correlation between independent and dependent variable in the population, the experimenter tries to abduce a causal explanation of this difference, and does this by trying to exclude any possible cause other than the difference between treatments. Random sampling and random allocation can only introduce chance fluctuations that disappear on the average in the long run. But after allocation, treatments must be applied, and outcomes measured, and so the experimenter must also check whether application or measurement, or any other event during the experiment, could have contributed to the measured difference. This is all part of the discussion of internal validity summarized in Table 1.

If all these alternative causes are excluded, the difference between treatments is the only remaining possible cause of the observed difference in dependent variables. This is an abductive inference. Note that random sampling and allocation are used both in the inductive inference step, where they facilitate application of the central limit theorem, and in the abductive inference step, where they facilitate the exclusion of causes other than the treatment.
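Random sampling and random allocation are two separate steps, which the following sketch makes explicit. It assumes that the population can be enumerated as a list and that the two groups are of (nearly) equal size; the function name, the identifiers, and the seed are hypothetical.

```python
import random


def randomize(population, sample_size, seed=None):
    """Draw a random sample and randomly split it into treatment and control."""
    rng = random.Random(seed)
    # Random sampling: every population element is equally likely to be chosen.
    sample = rng.sample(population, sample_size)
    # Random allocation: chance alone decides who receives the treatment.
    rng.shuffle(sample)
    half = sample_size // 2
    return sample[:half], sample[half:]


projects = list(range(100))  # hypothetical project identifiers
treatment_group, control_group = randomize(projects, 20, seed=7)
print(treatment_group)
print(control_group)
```

Because neither step depends on any property of the elements, a systematic difference between the groups cannot be attributed to the way they were formed.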

Random sampling is difficult to achieve in practice, so we find many quasi-experiments in software engineering and elsewhere (Kampenes et al., 2009; Shadish et al., 2002; Sjoberg et al., 2005), in which sampling is not random or allocation of treatments to elements is not random. For example, subjects may self-select into treatment or control groups, or the researcher may allocate treatments to elements according to a property of the elements. Quasi-experiments cannot use the mathematical techniques based on the central limit theorem for their statistical inference, but there are other reasoning techniques that can be used for statistical inference in quasi-experiments (Shadish et al., 2002).

RCTs and quasi-experiments both take a so-called difference-making view on causality, which is why I call them here difference-making experiments. In this view, variable X has a causal influence on variable Y if X makes a difference to Y. That is, if X had a different value, with all other things being equal, then the value of Y would be different as well (Holland, 1986; Woodward, 2003).

For example, suppose that in an RCT, a sample of projects using programming method A performed better on the average than a sample of projects solving the same problems using method B, and suppose that this difference is statistically significant, i.e. it is unlikely to be observed in a sample if it did not exist in the population. So it is unlikely that the difference is the result of chance alone, and the researcher is justified in looking for a cause. There may be many causes for the difference, including the resources available to the projects, the competence of project personnel, and the difference between methods A and B. If the researcher can rule out all causes other than the difference between A and B, then the statistical difference supports the claim that the difference between methods A and B is the cause of the difference in project performance.

Example 5. Prechelt et al. (2002) describe an experiment to compare the difference between maintenance tasks done on programs where design patterns were described in comment lines, and maintenance tasks done on programs where design patterns were not commented. The programs were identical except for the presence of so-called Pattern Comment Lines (PCLs).

The artifact is here the presence of PCLs and the context is the program, maintenance task, and maintainer. The subjects self-selected into the samples, which makes it hard to know which hypothetical population they are a random sample of, but we will assume that it consists of computer science students performing maintenance tasks on programs of similar size and complexity as those used in the experiment. The treatment is the instruction to perform maintenance tasks. Treatments were allocated randomly to subjects. The measured variables were task completion time and correctness of result.

The researchers found a slight improvement of task time and result correctness when PCLs were present, which was statistically significant. This means that there is a low probability that this observation would be made in the sample if this improvement were absent from the population. This supports the inductive generalization that an improvement exists in the population of all programs of the same size and complexity being maintained by students. The authors discuss possible causes for this improvement other than the presence of PCLs, and conclude that there is no evidence that there are other causes (Prechelt et al., 2002, page 599, Threats to internal validity).

They additionally identify a cognitive mechanism that could be responsible for this causal relationship (abductive inference, Fig. 5). This mechanism is postulated by a theory, formulated by several researchers earlier, that program comprehension works by the formation and validation of hypotheses, of which the efficiency is greatly enhanced by beacons, which are hints about familiar kinds of structures (Prechelt et al., 2002, page 596). PCLs are such beacons. This could explain the causal influence of PCLs on maintainability. It increases the support for the claim that the observed improvement is systematic rather than a coincidental event.

The authors are reluctant to generalize, by analogy, from the population of experimental maintenance situations similar to this experiment, to the population of real maintenance tasks (Prechelt et al., 2002, page 599). However, they do reason that, if PCLs had an improvement effect for relatively small well-commented programs, they might have an even better effect on large ill-commented programs (Prechelt et al., 2002, page 604). This is reasoning by analogy.

In an interesting aside, the authors observe that experiments comparing different syntactic forms to express the same meaning all have the methodological problem that the two forms rarely have the exact same meaning. The authors give general advice about a methodologically sound setup of such experiments. This is a case-based reasoning by analogy (Fig. 5), in which their experiment is an example for other, similar experiments.

Statistical difference-making experiments support reasoning along the horizontal dimension of our diagram of scaling up (Fig. 3). We see in this example first an inductive inference and then two abductions. First, the statistical correlation between two variables is inductively inferred to exist in a population, based on observations in the sample. This is inductive inference. Second, it is argued that this statistical correlation between independent and dependent variable is a causal relationship from independent to dependent variable, by ruling out all possible causes other than the difference in treatments (abduction 1). Third, this causal relationship was explained by a cognitive mechanism postulated by a previously established theory (abduction 2). This increases confidence in the causal conclusions that the authors drew from the experiment.

4. Related work

The reasoning schema “[Artifact × Context] → Effect by Mechanisms” has been proposed in slightly different forms in social science (Pawson and Tilley, 1997) and in management science (Van Aken, 2004). It has some similarity with the satisfaction argument as proposed by Jackson in software engineering (Jackson, 2000). Wieringa (2003) calls it the systems engineering argument, because it shows how a component must interact with other components to produce desired behavior of a composite system. It is simpler than the structure for design theories proposed by Gregor and Jones (2007). More discussion is provided elsewhere (Wieringa et al., 2011).

Douven (2011) gives a convenient introduction to abductive reasoning, also called “reasoning to the best explanation”. Mechanistic abduction is similar to theoretical model abduction as discussed by Schurz (2008).

The concept of mechanism has been proposed by philosophers who analyzed the structure of explanation in the physical, biological, and social sciences (Glennan, 1996; Machamer et al., 2000). It has been adopted as an explanatory construct in biology (Bechtel and Richardson, 2010; Bechtel and Abrahamsen, 2005) and in the social sciences (Bunge, 2004; Hedström and Ylikoski, 2010; Elster, 1989). All of these authors have slightly differing concepts of mechanism. Illari and Williamson present a survey and unification (McKay Illari and Williamson, 2012), which is very similar to the concept that I have used here.

There is a huge literature on causality and I cannot even begin to cite the relevant literature here. There are two views, one that causality is difference-making, the other that a causal relationship is a mechanism, and within each view there are several points of view. For example, the Bayesian theory of Pearl is an example of a difference-making view, described in a book (Pearl, 2009) and summarized in a paper (Pearl, 2009b). Holland (1986) is an older, exceptionally clear exposition of the difference-making view, staying within the framework of frequency-based statistics. Williamson (2011) surveys some mechanistic theories, and compares them with difference-making views.

Generalization by analogy as discussed here is one form of analytic induction, propagated by Yin as the way to generalize from cases (Yin, 2003), but actually originating from the sociologist Znaniecki (1968). The clearest and most accessible description of analytic induction is given by Robinson (1951): The researcher (1) roughly defines a class of phenomena and (2) formulates a hypothesis about a mechanism that is postulated to occur in these phenomena. This is our theory of similarity, and we have only considered the case where the theory describes a structure of interacting components that implement a mechanism. Next, a single case that satisfies the definition is investigated. If observations falsify the hypothesis, then either the definition is refined to exclude the case at hand, or the hypothesis is reformulated to match the observations. After investigating a number of cases, the definition and hypothesis may reach a stable state. The researcher then generalizes by claiming that all similar cases contain similar mechanisms, which will produce similar effects. This kind of case-based reasoning moves us upward in the diagram of generalization (Fig. 3). Znaniecki lists a few historical examples from biology, physics and sociology where this kind of reasoning was followed (Znaniecki, 1968, pages 236–237).

Zelkowitz and Wallace (1998) presented a survey of empirical validation methods in software engineering that I compare in Table 3 with the list in Table 2. In the terminology of this paper, their list contains validation methods but some other kinds of methods too.

Table 3
Validation methods identified by Zelkowitz and Wallace (1998) and by Glass et al. (2001).

Validation research methods

• Expert opinion
  Zelkowitz and Wallace (1998): (none)
  Glass et al. (2001): (none)
• Single-case mechanism experiment
  Zelkowitz and Wallace (1998): Simulation; Dynamic analysis
  Glass et al. (2001): Field experiment; Laboratory experiment – Software; Simulation
• Technical action research
  Zelkowitz and Wallace (1998): Case study
  Glass et al. (2001): Action research
• Statistical difference-making experiment
  Zelkowitz and Wallace (1998): Replicated experiment; Synthetic environment experiment
  Glass et al. (2001): Field experiment; Laboratory experiment – human subjects

Other research methods

• Observational case study
  Zelkowitz and Wallace (1998): Case study; Field study
  Glass et al. (2001): Case study; Field study
• Meta-research method
  Zelkowitz and Wallace (1998): Literature search
  Glass et al. (2001): Literature review/analysis

Measurement methods

• Methods to collect data
  Zelkowitz and Wallace (1998): Project monitoring; Legacy data; Lessons learned
  Glass et al. (2001): Ethnography

Inference techniques

• Techniques to infer information from data
  Zelkowitz and Wallace (1998): Static analysis
  Glass et al. (2001): Data analysis; Grounded theory; Hermeneutics; Protocol analysis

Their assertion method has been omitted because, as they also point out, it is not a research method. It is an experimental use of a new technology by the developer in the laboratory. Simulation is executing a product in a simulated environment. This is a single-case mechanism experiment because the product is an implemented mechanism to be tested. Dynamic analysis is the execution of a product under controlled conditions, similar to simulation but not aimed at simulating real-world environments. It is a single-case mechanism experiment too.

A case study could be the use of a new technology in an industrial project (Zelkowitz and Wallace, 1998, page 26), in which case we classify it as a technical action research project, or an observational study of a project (Zelkowitz and Wallace, 1998, page 25), in which case we classify it as an observational case study. Case studies therefore appear twice in Table 2.

Replicated experiments and synthetic environment experiments are statistical comparisons of groups of projects, where a task is performed differently in different groups. These are statistical difference-making experiments, performed in the field or in the lab.
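As an illustration of what a statistical comparison of two groups can look like, the following minimal sketch applies a two-sided permutation test to the difference in group means. The choice of a permutation test, the function name `permutation_test`, and the data are assumptions made for this illustration; the surveys discussed here do not prescribe a particular test, and the numbers are invented.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means between two groups.

    Returns the observed difference and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the pooled outcomes to the two groups at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (extreme + 1) / (n_permutations + 1)


# Invented outcome scores for projects that performed a task with and without a new technique.
with_technique = [12, 15, 14, 16, 13, 17]
without_technique = [9, 11, 10, 12, 8, 10]

observed, p_value = permutation_test(with_technique, without_technique)
print(f"difference in means: {observed}, estimated p-value: {p_value:.4f}")
```

A small p-value only indicates that the difference between the groups is unlikely under random assignment of outcomes; explaining why the difference arises still requires the mechanistic, abductive reasoning discussed in this paper.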

The difference between observational case studies and field studies defined by Zelkowitz and Wallace (1998, page 26) is that a case study is intrusive where a field study is not. They are both classified as observational case studies in the terminology of this paper, because in both, the research method is observational and influence of the researcher on the object of study is to be minimized. Observational studies are, in the terminology of this paper, suitable as methods for evaluation studies of implemented technology, but not as research methods for validating new technology not yet transferred to the market.

Literature search is part of any research but may be expanded into a full-blown research method, also called a systematic literature review (Kitchenham, 2004).

Project monitoring is the collection, by the researcher, of data produced during a project and legacy data is the collection of documents such as source code, specifications, and test plans after the project is finished. For a full-blown research method, we need a design of the way the researcher will interact with the object of study, including measurement methods and any experimental intervention, and inference design. In the terminology of this paper, project monitoring and legacy data, as described by Zelkowitz and Wallace, are measurement methods.

Lessons learned is the collection and analysis of lessons learned documents from projects. In Table 2 this is classified as a measurement method too but because it also contains analysis, we could also classify it as a form of observational field study.

In static analysis the completed product is investigated, for example to analyze its complexity. It is similar to the study of legacy data but it is here classified as an inference technique because it refers to a collection of analysis methods.

Glass et al. (2001) list the empirical research methods shown in the third column of Table 3. Non-empirical methods such as conceptual analysis and mathematical proof have been omitted, and design activities, viz. concept implementation and instrument development, have been omitted too. Since it is not clear from the description by Glass et al. whether experiments are of the single-case mechanism kind or of the statistical difference-making kind, they are classified as both. I discuss the new entries in this column.

Ethnography is the detailed collection and description of daily events in a social group, without analysis, which is classified here as a measurement method. Grounded theory is the analysis of textual data produced by people, to extract the theories held by these people (Strauss and Corbin, 1998). I consider this to be a descriptive analysis method. Hermeneutics is the phenomenon that to interpret human behavior, you have to understand their cultural and conceptual framework, but the only way to understand their cultural and conceptual framework is to interpret their behavior. This leads to an inference strategy in which the researcher iterates over updating his or her conceptual framework and interpreting human behavior in that framework. Protocol analysis is the analysis of thinking-aloud protocols, useful for cognitive psychology. It is a data analysis method.

Table 3 shows that these lists of software engineering research methods are mutually consistent, can be integrated in my framework for validation research methods, and extend it with other methods. The overview is not complete, happily, as new methods and instruments for research keep being developed.

5. Summary and conclusion

Empirical validation of technology before it is transferred to practice requires investigating the effects of the interaction of the artifact with its context, and explaining these effects by means of the underlying mechanisms that produce these effects. Scaling up to practice thus produces a design theory of the form “[Artifact × Context] produces Effects by Mechanisms”.

Producing support for such a theory involves two kinds of inferences, along the vertical and horizontal dimensions of the process of scaling up. Analogic inferences in the vertical dimension reason from case-based mechanism experiments to real-world instances of [Artifact × Context], and statistical inferences along the horizontal dimension reason from observed sample behavior to the population of all possible instances of [Artifact × Context]. Both inferences are supported by abductive inferences that postulate mechanism-based explanations of cause-effect influences. Mechanism-based explanations refer to the components of [Artifact × Context] and their interactions.

We discussed the following research methods to validate artifacts:

• Expert opinion, in which experts reason informally about samples (horizontally) and mechanisms (vertically), which provides an initial sanity check of an artifact design;

• Single-case mechanism experiments, in which the researcher reasons vertically about mechanisms and their effects in increasingly realistic artifacts in increasingly realistic contexts;

• Technical action research, in which the researcher reasons vertically about mechanisms and their effects when an artifact is applied in a real-world project to help a client;

• Statistical difference-making experiments, in which the researcher reasons horizontally from effects observed in samples to effects inferred in populations.

These methods can be used with measurement instruments and data analysis methods known from software engineering and elsewhere.

This paper has given some examples of use of these research methods, but this is just one step on the way to scaling up these methods to empirical RE research. Increasing use of these methods will teach us more about the usability and usefulness of these research methods in empirical validation of RE technology.

Acknowledgements

This paper benefitted from comments by Vincenzo Gervasi and Walter Tichy. I would like to thank the anonymous reviewers of this paper for their constructive critique.

References

Al-Emran, A., Pfahl, D., Ruhe, G., 2010. Decision support for product release planning based on robustness analysis. In: Proceedings of the 18th IEEE International Requirements Engineering Conference (RE 2010), IEEE Computer Society, Sydney, Australia, pp. 157–166.

Apostel, L., 1961. Towards a formal study of models in the non-formal sciences. In: Freudenthal, H. (Ed.), The Concept and Role of the Model in the Mathematical and the Natural and Social Sciences. Reidel, Dordrecht, The Netherlands, pp. 1–37.

Babbie, E., 2007. The Practice of Social Research, 11th ed. Thomson Wadsworth, Belmont, USA.

Bechtel, W., Abrahamsen, A., 2005. Explanation: a mechanistic alternative. Studies in the History and Philosophy of Biological and Biomedical Sciences 36, 421–441.

Bechtel, W., Richardson, R., 2010. Discovering Complexity: Decomposition and Localization as Strategies in Scientific Research. MIT Press, Cambridge, Massachusetts (Reissue of the 1993 edition with a new introduction).

Bunge, M., 2004. How does it work? The search for explanatory mechanisms. Philosophy of the Social Sciences 34, 182–210.

Constant, E., 1980. The Origins of the Turbojet Revolution. Johns Hopkins, Baltimore.

Cook, S., Daniels, J., 1994. Designing Object Systems: Object-Oriented Modelling with Syntropy. Prentice-Hall, Upper Saddle River, New Jersey.

Cowan, C., 2002. The process of evaluating and regulating a new drug: phases of a drug study. AANA Journal 70, 385–390.

Damian, D., Chisan, J., 2006. An empirical study of the complex relationships between requirements engineering processes and other processes that lead to payoffs in productivity, quality and risk management. IEEE Transactions on Software Engineering 32, 433–453.

Davis, A., Hickey, A., 2004. A new paradigm for planning and evaluating requirements engineering research. In: 2nd International Workshop on Comparative Evaluation in Requirements Engineering, pp. 7–16.

Douven, I., 2011. In: Zalta, A. (Ed.), The Stanford Encyclopedia of Philosophy (Spring 2011 Edition). http://plato.stanford.edu/archives/spr2011/entries/abduction/

Elster, J., 1989. Nuts and Bolts for the Social Sciences. Cambridge University Press, Cambridge, UK.

Engelsman, W., Wieringa, R.J., 2012. Goal-oriented requirements engineering and enterprise architecture: two case studies and some lessons learned. In: Requirements Engineering: Foundation for Software Quality (REFSQ 2012), Essen, Germany, pp. 306–320 (volume 7195 of Lecture notes in computer science, Springer).