IADIS INTERNATIONAL CONFERENCE
INTERFACES AND HUMAN
COMPUTER INTERACTION
2010
and
IADIS INTERNATIONAL CONFERENCE
GAME AND ENTERTAINMENT
TECHNOLOGIES 2010
part of the
IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND
INFORMATION SYSTEMS 2010
SECTION I
PROCEEDINGS OF THE
IADIS INTERNATIONAL CONFERENCE
INTERFACES AND HUMAN
COMPUTER INTERACTION 2010
SECTION II
PROCEEDINGS OF THE
IADIS INTERNATIONAL CONFERENCE
GAME AND ENTERTAINMENT
TECHNOLOGIES 2010
part of the
IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND
INFORMATION SYSTEMS 2010
Freiburg, Germany
JULY 26 - 30, 2010
Organised by
IADIS
International Association for Development of the Information Society
Copyright 2010
IADIS Press
All rights reserved
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other way, and storage in data banks.
Permission for use must always be obtained from IADIS Press. Please contact secretariat@iadis.org
Volume Editor:
Katherine Blashki
Computer Science and Information Systems Series Editors:
Piet Kommers, Pedro Isaías, Dirk Ifenthaler and Nian-Shing Chen
Associate Editors: Luís Rodrigues and Patrícia Barbosa
COST OF A USABILITY EVALUATION: BOOTSTRAP DISCOVERY BEHAVIOUR MODEL
Simone Borsci*, Stefano Federici** and Alessandro Londei*
*ECoNA - Interuniversity Centre for Research on Cognitive Processing in Natural and Artificial Systems, University of Rome ‘La Sapienza’, IT;
**Department of Human and Education Sciences, University of Perugia, IT;
ABSTRACT
The international debate on the costs of usability evaluation has mainly focused on the return on investment (ROI) model of Nielsen and Landauer (1993). In this study, the properties and limits of the ROI model are discussed in order to lay the foundation for an alternative model that considers a larger number of variables when estimating the number of participants needed for a usability evaluation. Using bootstrap statistical inference (Efron, 1979), we propose a model, named Bootstrap Discovery Behaviour (BDB), that takes into account: a) the properties of the interface, as its properties at the zero condition of evaluation; and b) the probability that the discovery behaviour of the population is represented by all the possible discovery behaviours of a sample. Two experimental groups, one of users and one of experts, evaluated a website. Applying the BDB model to the problems identified by the two groups, we found that 13 experts and 20 users are needed to uncover 80% of the usability problems, instead of the 6 experts and 7 users required according to the discovery-likelihood estimation of the ROI model. The power of the BDB model rests on its greater predictive validity for a participant's future discovery behaviour of usability problems, compared with the ROI model.
KEYWORDS
Asymptotic test; Bootstrap; Bootstrap Discovery Behaviour model; Effectiveness; Return on investment; Usability evaluation.
1. INTRODUCTION: THE λ VALUE OVERESTIMATION
Nielsen and Landauer (1993) show that, generally, the number of evaluators (experts or users) required by usability evaluation techniques ranges from three to five; adding more than five participants does not provide an advantage in estimating the discovery rate of new problems in terms of costs, benefits, efficiency, and effectiveness (Turner et al., 2006). This model, known as return on investment (ROI), is an asymptotic test that estimates the number of evaluators needed by applying the following formula:

Found(i) = N[1 − (1 − λ)^i]   (1)

In (1), N is the total number of problems in the interface, λ is the probability of finding the average usability problem when running a single average-subject test (i.e. the individual detection rate), and i is the number of users. As several international studies have shown (Nielsen 2000; Lewis 1994), a sample size of five participants is sufficient to find approximately 80% of the usability problems in a system when the individual detection rate (λ) is at least .30. By using this mathematical model, practitioners can find the range of evaluators required for a usability test and, therefore, calculate the increase in problems found by adding users to the evaluation. For instance, if λ equals .30 for a five-user evaluation, then by applying formula (1) practitioners can estimate whether those 5 users are enough for an efficient assessment or, otherwise, how many more users are needed to increase the percentage of usability problems found. As Nielsen and Landauer (1993) underline in discussing their model, the discoverability rate (λ) of any given usability test depends on at least seven main factors: (i) the properties of the system and its interface; (ii) the stage of the usability lifecycle; (iii) the type and quality of the methodology used to conduct the test; (iv) the specific tasks selected; (v) the match between the test and the context of real-world usage; (vi) the representativeness of the test participants; (vii) the skill of the evaluator. These factors have an effect on the evaluation of the interaction between system and user that, in our opinion, the ROI model is not able to estimate.

ISBN: 978-972-8939-18-2 © 2010 IADIS

Indeed, the ROI model assumes that:

(i) All the evaluators have the same probability of finding all problems. As Caulton (2001) states, the ROI model is based on the idea that all types of subjects have the same probability of encountering all usability problems, without considering the evaluators' different skills. At the same time, the ROI model does not take into consideration the effect of the evaluation methodologies used, the representativeness of the participant sample, or the match between the test and the context of real-world usage. In particular, as Woolrych and Cockton (2001) claim, the ROI model fails to integrate individual differences in problem discoverability; in this sense, the participants' probability of encountering all the problems remains a relevant issue that needs clarification.

(ii) The λ value estimation does not take into account the differences between the systems evaluated. This means that the model does not consider the effect on the evaluation results of the properties of the system, the stage of the interface lifecycle, or the methodologies selected for the evaluation. In fact, the ROI model starts from a "one evaluator" condition and not from a zero condition: the characteristics of the system are considered only through the number of problems found by the first evaluator. As Nielsen (2000) pointed out, the first evaluator (user or expert) generally finds about 30% of the problems, because these are generally the most evident ones. The next evaluators then usually find a smaller percentage of new problems, simply because the most evident ones have already been detected by the first. Therefore, the number of evident problems is determined only in an empirical and variable way, since it depends on the evaluators' skills, which, as we already stated, is also a factor not considered by the model.

The international debate on the estimation of the λ value also shows that the ROI model suffers from an overestimation of λ (Woolrych and Cockton 2001). In order to solve this overestimation problem, Lewis (2001) elaborated a new formula, the Good-Turing adjustment, able to provide a better calibrated estimation of λ. Nevertheless, this adjustment does not solve the other problems that Nielsen and Landauer's model generates. Taking these shortcomings of the ROI model into account, practitioners must consider that the estimation of λ may vary over a wide range of values and, as a consequence, that the model cannot guarantee the reliability of the evaluation results obtained from the first five participants.
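The ROI arithmetic described above is easy to reproduce. The sketch below (in Python, which the paper does not use; the function names are ours) evaluates formula (1) as a fraction of problems found and inverts it to obtain the smallest sample reaching a target discovery rate:

```python
import math

def found_fraction(lam, i):
    """Fraction of problems found by i evaluators under formula (1):
    Found(i)/N = 1 - (1 - lam)**i, with individual detection rate lam."""
    return 1.0 - (1.0 - lam) ** i

def evaluators_needed(lam, target=0.80):
    """Smallest i such that found_fraction(lam, i) >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - lam))

# With lam = .30, five participants uncover about 83% of the problems,
# matching the classic "five users are enough" figure.
print(round(found_fraction(0.30, 5), 3))   # 0.832
print(evaluators_needed(0.30, 0.80))       # 5
```

Raising the target shows how quickly the required sample grows: reaching 90% of the problems at the same λ already requires seven evaluators.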
This analysis allows us to propose an alternative to the ROI model, based on probabilistic discovery behaviour in the evaluation. As its first feature, our alternative model accounts for the probabilistic individual differences in problem identification. The second feature is that it considers the evaluated interface as an object per se: interfaces are treated as different not because of the number of problems found by the first evaluator (evaluation condition), but as objects in themselves (zero condition), by estimating the probabilistic number of evident problems that any evaluator can detect when testing the interface. The third feature is that, in order to calculate the number of evaluators needed for the evaluation, the model considers the representativeness of the sample (with respect to the population of all the possible evaluation behaviours of the participants). Our model is based on the statistical inference method known as bootstrapping (Efron, 1979).
2. THE BOOTSTRAP DISCOVERY BEHAVIOUR (BDB)
The present bootstrapping approach starts from the assumption that the discovery of new problems should be the main goal of both users' and experts' evaluations, as expressed by Nielsen and Landauer (1993) in formula (1). Given a generic problem x, the probability that a subject will find x is p(x). If two subjects (experts or users) navigate the same interface, the probability that at least one of them identifies the problem x is:

P = p(x1 ∨ x2)   (2)

where ∨ is the logical OR operator. According to De Morgan's law (Goodstein 1963), (2) is equivalent to:

P = p[¬(¬x1 ∧ ¬x2)]   (3)

which expresses the probability of how false it is that neither subject finds anything (¬ is the logical negation operator). Since p(¬x) = 1 − p(x), and since one subject's probability of finding a problem is independent of the other's, we can state that:

p[¬(¬x1 ∧ ¬x2)] = 1 − [1 − p(x1)] · [1 − p(x2)]   (4)

Assuming, as a hypothesis, that all subjects have the same probability p(x) of finding the problem x, (4) can also be expressed as:

p(x1 ∨ x2) = 1 − [1 − p(x)]²   (5)

Of course, we can extend this case to a generic number of evaluators L:

p(x1 ∨ x2 ∨ … ∨ xL) = 1 − [1 − p(x)]^L   (6)

Equation (6) expresses the probability that, in a sample composed of L evaluators, at least one of them identifies the problem x. If N is the total number of problems in an interface and the probability p(x) of finding any problem is the same for every evaluator, then the mean number of problems found by L evaluators is:

F = N[1 − (1 − p(x))^L]   (7)

In order to estimate p(x) in (7), we adopted the bootstrap model, avoiding an estimation based merely on adding up detected problems. Such an estimation could, in fact, be invalidated by the small size of the analysed samples or by differences in the subjects' probabilities of detecting problems. Our idea is that the bootstrap model grants a more reliable estimation of the probability of identifying a problem x.

2.1 Experiment and Participants
In order to test the Bootstrap Discovery Behaviour model: (i) Two experimental groups were asked to evaluate a target interface: 20 experts by means of the cognitive walkthrough (CW) technique, and 20 users by means of the thinking aloud (TA) technique. (ii) Using the "Fit" function of the Matlab software (http://www.mathworks.com), we applied a bootstrap with 5000 samplings; each subsample was obtained by drawing evaluators (experts or users) in random order with repetition. (iii) The results of the bootstrap samples allowed us to estimate three parameters identifying the best fit of the data within a 95% confidence interval: i) the probable number of problems found (p), obtained as the normalized mean of the problems found by each subsample of subjects; ii) the maximum number of problems that all possible samples could identify (a), known as the maximum limit; and iii) the value of the known term q.
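The resampling step can be sketched as follows. This is a minimal Python illustration of ours, not the authors' Matlab code, and the input format (a list of per-evaluator problem sets) is our assumption:

```python
import random

def bootstrap_discovery_curve(evaluators, n_boot=5000, seed=1):
    """Mean cumulative problem-discovery curve over bootstrap resamples.

    evaluators: list of sets, one per evaluator, each containing the IDs
    of the problems that evaluator found (hypothetical input format).
    Returns a list whose entry k (0-based) is the mean fraction of all
    known problems discovered by the first k+1 resampled evaluators.
    """
    rng = random.Random(seed)
    n_total = len(set().union(*evaluators))  # total distinct problems
    L = len(evaluators)
    sums = [0.0] * L
    for _ in range(n_boot):
        # Resample evaluators with repetition, in random order.
        sample = [rng.choice(evaluators) for _ in range(L)]
        found = set()
        for k, probs in enumerate(sample):
            found |= probs  # accumulate distinct problems found so far
            sums[k] += len(found) / n_total
    return [s / n_boot for s in sums]

# Toy data: four evaluators with overlapping findings.
curve = bootstrap_discovery_curve([{1, 2}, {2, 3}, {1, 4}, {2, 5}])
print([round(v, 2) for v in curve])  # a non-decreasing curve below 1.0
```

A curve of this kind is the empirical object that the three-parameter model of Section 3 is then fitted to.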
Experts group: 20 experts (10 male, 10 female; mean age = 24.2) with different levels of expertise: 10 experts had more than three years of experience and 10 had less than one year. All the experts evaluated the target website with the cognitive walkthrough technique. Users group: 20 students of the University of Rome "La Sapienza" (10 male, 10 female; mean age = 23.3) were involved in the thinking aloud (TA) analysis of the target website.
2.1.1 Evaluation Techniques
Cognitive walkthrough: it starts with a task analysis that makes it possible: a) to specify the sequence of steps a user should take in order to accomplish a task; and b) to observe the system's responses to those actions. Once the task analysis is over, the expert simulates the actions of the potential user and identifies the problems the user is likely to encounter. As Rieman, Franzke, and Redmiles (1995) claim, this technique is based on three elements: "a general description of who the users will be and what relevant knowledge they possess, a specific description of one or more representative tasks to be performed with the system, and a list of the correct actions required to complete each of these tasks with the interface being evaluated".
The experts perform the cognitive walkthrough by asking themselves a set of questions for each subtask: (i) The user sets a goal to be accomplished with the system (for example, check spelling of this document). (ii) The user searches the interface for currently available actions (menu items, buttons, command-line inputs, etc.). (iii) The user selects the action that seems likely to make progress toward the goal. (iv) The user performs the selected action and evaluates the system’s feedback for evidence that progress is being made toward the current goal.
Thinking aloud: also known as verbal protocol analysis, it has been widely applied in the study of consumer choice and judgment processes. In describing this user-based evaluation process, Hannu and Pallab (2000) state: "The premise of this procedure is that the way subjects search for information, evaluate alternatives, and choose the best option can be registered through their verbalization and later be analysed to discover their decision processes and patterns. Protocol data can provide useful information about cue stimuli, product associations, and the terminology used by consumers." The TA can be performed under two main experimental procedures: the first, and the most popular, is the concurrent verbal protocol, in which data are collected during the decision task; the second is the retrospective verbal protocol, in which data are collected after the decision task is over. Our experimental work used the concurrent TA because it is one of the most widely applied verbal-report techniques in HCI studies. Indeed, in the concurrent TA users express their problems, strategies, stress, and impressions without the influence of a "rethinking" process, as happens in the retrospective analysis (Federici, Borsci, and Stamerra 2009). Each test was performed at the cognitive psychology laboratory of the University of Rome "La Sapienza".
2.1.2 Apparatus and Target Website
Each participant used an Intel Pentium 4 computer with 4 GB of RAM, a GeForce 8800 video card, and a Creative Sound Blaster X-Fi audio card. The monitor was a 19'' SyncMaster 900p, and the speakers were two Creative GigaWorks T20 Series II. Each test was video-recorded with a 3-megapixel Sony camera, and each user's on-screen activity was recorded with the CamStudio screen recorder. A 28'' Sony screen was used by the expert for monitoring the users' interaction. Each user browsed with Internet Explorer 8. The website www.serviziocivile.it was chosen as the target. It was selected among the websites of the Italian Public Administration considered accessible by the CNIPA evaluation (http://www.pubbliaccesso.gov.it/logo/elenco.php). The expert- and user-based analyses were carried out on four scenarios, created and approved by three external evaluators with more than three years of experience in the field; these evaluators did not participate in the experimental sessions.
2.1.3 Procedure and Measures
Experts group: in a meeting with all the experts, the evaluation coordinator presented the procedure, goals, and scenarios provided by three external experts with more than five years of experience in accessibility and usability evaluation. All the experts were then invited to evaluate the system through a CW and to provide an independent evaluation.
Users group: after 20 minutes of free navigation as training, users started the TA evaluation following the four scenarios. The evaluation coordinator reported all the problems identified in the TA sessions, then checked and integrated the report using the recordings of verbalizations and mouse actions captured by CamStudio.
We compared the number of evaluators needed to uncover 80% of the problems, applying both the ROI model and our BDB model. The analysis was carried out with SPSS 16 and Matlab.
3. ALTERNATIVE MODEL OF ESTIMATION
In order to identify the number of experts and users needed to detect more than 80% of the problems, we must obtain the best fit with our results. Our model must provide an estimation of the parameters able to represent the properties of the interface and the representativeness of the sample. The bootstrap analysis was used to obtain the following parameters. (i) Parameter a, covering all the possible discovery behaviours of participants: at each of the 5000 bootstrap steps, a subsample composed of a random order of the collected data (i.e. the identified problems) was selected. The maximum value of collected problems represents our maximum limit, indicated below in (8) as the variable a (which in Nielsen and Landauer's formula is always equal to 1). This value indicates the representativeness of our sample. (ii) Parameter p, a rule for selecting the representative data: as representative data for the subsamples we used the normalized mean of the number of problems found by each subsample (indicated below in (8) as p). This value represents the probable number of problems findable by the whole population represented by our sample. The model expressed below in (8) is the one that best fits the values obtained from all the possible subsamples of our expert and user samples:

F = Nt[a − (1 − p)^(L+q)]   (8)

In (8), Nt represents the total number of problems in the interface, and the variable q expresses the hypothetical condition L = 0 (an analysis without evaluators). In other words, the parameter q represents the possibility that a certain number of problems have already been identified (or are evident) and were not fixed by the designer:

F(L=0) = Nt[a − (1 − p)^q]   (9)

The value q thus represents the properties of the interface from the evaluation perspective: it is the "zero condition" of the interface properties.
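Once a, p, and q have been fitted, the sample size needed to reach a target discovery rate can be obtained by inverting (8). The sketch below is our Python illustration, not the authors' Matlab fitting code; reproducing the paper's exact figures would also require the authors' fitting and normalization choices:

```python
import math

def bdb_fraction(L, a, p, q):
    """Fraction of problems found by L evaluators under equation (8):
    F/Nt = a - (1 - p)**(L + q)."""
    return a - (1.0 - p) ** (L + q)

def bdb_evaluators_needed(a, p, q, target=0.80):
    """Smallest integer L with bdb_fraction(L, a, p, q) >= target.

    The asymptote of (8) is a, so if a <= target the goal is
    unreachable and None is returned.
    """
    if a <= target:
        return None
    # Invert (8): (1 - p)**(L + q) <= a - target
    L = math.log(a - target) / math.log(1.0 - p) - q
    return max(0, math.ceil(L))

# With a = 1 and q = 0, equation (8) reduces to the ROI formula (1),
# so the classic lambda = .30 case yields five evaluators again.
print(bdb_evaluators_needed(a=1.0, p=0.30, q=0.0))  # 5
```

Note the role of a: unlike the ROI model, which always tends toward 100% coverage, the BDB fit caps the reachable fraction of problems, so some targets are simply unattainable for a given sample.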
4. RESULTS AND CONCLUSION
Experts identified 46 problems, with a λ value equal to .26. Following the ROI model, the number of experts needed to find 80% of the problems equals 6. The values of the parameters needed for calculating our model (a = 0.9623, p = 0.1414, q = 0.8356) were obtained by running the model in Matlab. The BDB results show that 13 experts are needed in order to identify more than 80% of the problems.
Users identified 39 problems, with a λ value equal to .22. Following the ROI model, the number of users needed to find 80% of the problems equals 7. The data of the users group were processed in the same way as the experts'. Applying the results to equation (8), the BDB analysis shows that 20 users are needed in order to identify more than 80% of the problems (a = 0.8691, p = 0.1235, q = 2.3910). The results given by the BDB differ considerably from those obtained with the ROI model.
For the ROI model, the estimation of the costs of the usability evaluation required a sample of 6 experts and one of 7 users, while under the BDB a sample of 13 experts and one of 20 users is needed in order to uncover more than 80% of the problems. Consequently, following the BDB model there is an increase in the costs of the usability evaluation with respect to the figures provided by the ROI model. However, this increase in costs enlarges the coverage of the evaluation. The BDB approach takes into account the behaviour of the whole population (parameter a), the representativeness of the sample data (i.e. the problems found, expressed by the parameter p), and the specific properties of the interface (parameter q). The BDB, while respecting the assumptions of the ROI model, opens new perspectives on discovery likelihood and on the costs of usability evaluation. Indeed, considering both the properties of the interface and the representativeness of the data grants the practitioner a representative evaluation of the interface. A practitioner can apply the BDB model after the first 5 experts and users in order to estimate the parameters a, p, and q, and hence the number of evaluators needed for an evaluation that reflects the specific properties of the interface and the representativeness of the sample. In this sense, a practitioner can take into account both the BDB model and the ROI model in an evaluation. Our perspective offers a new model for usability evaluation that guarantees the representativeness of the data and overcomes the limits of the ROI model. The power of the BDB model rests on its greater predictive validity for a participant's future discovery behaviour of usability problems, compared with the ROI model.
Indeed, the behaviour predicted by the BDB model is based on a wider discovery variability among the potential participants (all users have different abilities to find problems), whereas in the Nielsen and Landauer model the predicted behaviour is based on identical discovery conditions (all users have the same probability of finding all problems).
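The gap between the two estimates can be made concrete by plugging the experts' reported figures into both formulas. This is a small Python illustration of ours, not the paper's code:

```python
# Experts' figures from the paper: lambda = .26 for the ROI model and the
# fitted BDB parameters a = 0.9623, p = 0.1414, q = 0.8356.
lam, a, p, q = 0.26, 0.9623, 0.1414, 0.8356

def roi_fraction(i, lam):
    """Formula (1): fraction of problems found by i evaluators."""
    return 1.0 - (1.0 - lam) ** i

def bdb_fraction(L, a, p, q):
    """Equation (8): fraction of problems found by L evaluators."""
    return a - (1.0 - p) ** (L + q)

# At the ROI-recommended sample of 6 experts, the two models disagree:
print(round(roi_fraction(6, lam), 3))      # ~0.836: ROI says 6 experts suffice
print(round(bdb_fraction(6, a, p, q), 3))  # ~0.61: BDB predicts far lower coverage
```

This is why the BDB estimate calls for a larger sample: under the fitted parameters, six experts fall well short of the 80% threshold that the ROI model claims they reach.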
REFERENCES
Caulton, D.A., 2001. Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, Vol. 20, pp. 1-7.
Efron, B., 1979. Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, Vol. 7, pp. 1-26.
Federici, S. et al., 2009. Web usability evaluation with screen reader users: Implementation of the Partial Concurrent Thinking Aloud technique. Cognitive Processing, DOI 10.1007/s10339-009-0347-y.
Goodstein, R.L., 1963. Boolean Algebra. Oxford: Pergamon.
Hannu, K. and Pallab, P., 2000. A comparison of concurrent and retrospective verbal protocol analysis. American Journal of Psychology, Vol. 113, pp. 387-404.
Lewis, J.R., 2001. Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, Vol. 13, pp. 445-479.
Lewis, J.R., 1994. Sample sizes for usability studies: Additional considerations. Human Factors, Vol. 36, pp. 368-378.
Nielsen, J., 2000. Why you only need to test with 5 users. Jakob Nielsen's Alertbox.
Nielsen, J. and Landauer, T.K., 1993. A Mathematical Model of the Finding of Usability Problems. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, Amsterdam, The Netherlands, pp. 206-213. DOI 10.1145/169059.169166.
Polson, P.G. et al., 1992. Cognitive walkthroughs: a method for theory-based evaluation of user interfaces. International Journal of Man-Machine Studies, Vol. 36, pp. 741-773.
Rieman, J. et al., 1995. Usability Evaluation with the Cognitive Walkthrough. Conference on Human Factors in Computing Systems, ACM, New York, pp. 387-388.
Turner, C.W. et al., 2006. Determining Usability Test Sample Size. In Karwowski W (Ed), International Encyclopedia of Ergonomics and Human Factors 2 ed, CRC Press, Boca Raton, Vol. 3, pp. 3084-3088.
Woolrych, A. and Cockton, G., 2001. Why and when five test users aren't enough. In: Vanderdonckt J., Blandford A., Derycke A. (eds), Proceedings of IHM-HCI 2001 Conference. Cépaduès Éditions, Toulouse, France, pp. 105-108.