Using Design Competitions in Crowdsourcing UI/UX

Design: An Experimental Grounded Theory Study

Micky Chen

University of Amsterdam Graduate School of Informatics

mickychen@yahoo.com

ABSTRACT

Crowdsourcing has gained great popularity over the past decade. Using the power of the crowd might be a useful source for next-generation software design. However, little is known about whether the quality of one's design can improve by using parts of other designs. To address this gap, I conducted a grounded theory study to analyze the role of crowdsourcing in creating user interface designs. An experimental study was run in the form of a contest to analyze how design competitions might be used in crowdsourcing user interface design. In the two-round contest, participants created a UI, which was subsequently distributed to the other participants, so that each had access to all first-round designs. The second round required a revised UI design, and participants were encouraged to borrow ideas from one another's designs where they saw a fit. The results show that, on average, participants' designs improved and everyone used at least one idea from another person. This indicates that recombination and crowdsourcing lead to higher-quality user interfaces. However, due to the limitations of the experimental setup, more research is needed into how to set up UX design contests so as to maximize the benefits of crowdsourcing.

General Terms

Design, Theory

Keywords

Design, Crowdsourcing, Grounded Theory, User Interface, User Experience, Experiment, Recombination

1. INTRODUCTION

With the rise of the Internet and the shift toward Web 2.0, online collaboration began to emerge, overturning the traditional content creation model. As businesses and individuals started to realize the power and wisdom of the crowd, crowdsourcing gained great popularity. Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, especially from the online community, rather than from traditional employees or suppliers.1 The race is on to build general crowdsourcing platforms that can be used to quickly build crowdsourcing applications in many domains [3].

Four primary types of crowdsourcing can be distinguished: crowd

given that their opinions are diverse, independent, and properly aggregated [8]. Extrapolating this observation to the software design context, we hypothesize that a design is stronger when (re)combination takes place (i.e., designers look at other designs and incorporate ideas into their own design) and multiple people have input in what the UI looks like, as opposed to when a single person creates the whole UI by themselves. These two approaches (multiple people versus a single individual) differ fundamentally because the first uses the power of the crowd. By applying both approaches and comparing the results, the effectiveness of crowdsourcing, iteration, and recombination can be tested.

The IT outsourcing market has reached saturation in terms of the benefits that can be realized through outsourcing [12]. Using the power of the crowd therefore seems a promising source for next-generation software design, allowing resources to be used optimally and improving time to value and productivity. Many platforms for crowd creation exist (e.g. TopCoder2, 99designs3, Innocentive4),

however, little is known about how participating designers go about the process of creating a design. Furthermore, current platforms typically run contests in which participants receive feedback on their own design, after which they revise it. These platforms generally do not allow designs to be made public during the process (99designs is the exception: although most of its contests are blind contests, meaning that only the contest holder can see all entries, it also offers contests whose entries are visible to other designers, though these are extremely rare) and, thus, designers do not look at one another’s designs and use them as input for their own. Little is known about whether the quality of one’s design can improve by using parts of other designs. To address this gap, we conducted a grounded theory5 study to

analyze the role of crowdsourcing in creating user interface designs. The central research question is:

In what ways might design competitions be effectively used in crowdsourcing user interface design and what factors affect the quality of a design in such contests?


To answer this research question, a two-round design competition was held. In the first round, participants designed a UI, which they revised in the second round. The difference between these rounds is that, after the first round, all designs were distributed among the participants and they were encouraged to use each other’s ideas. The central research question is broken down into the following four segments:

• Design quality. What factors affect the quality of a design? What predicts a strong design?

• Crowdsourcing in contest form. Do people incorporate each other’s ideas or do they stick to their own design?

• Adopting others’ ideas. What factors influence whether people copy each other’s ideas? What makes it difficult or easy to use ideas?

• Recombination. What is the effect of recombination? Does one’s own design improve by looking at others? Does a design become stronger through combination?

This study may lead to new and improved tools for developers that suggest new ways in which design competitions can be used to produce software designs. The first part of the thesis describes the motivation for the study and discusses related work. This is followed by the methods and results of the study, after which the research questions are answered in the discussion section. The thesis concludes with limitations and future recommendations.

2. RELATED WORK

This section gives an overview of relevant literature on crowd creation, discussing the concept and its applications. I review work closely related to crowdsourcing design.

2.1 Task-based Platforms

Platforms where tasks are crowdsourced include Threadless6, Mechanical Turk7, 99designs8, StackOverflow9 and TopCoder10. On Threadless, consumers design T-shirts, which are then voted on by the crowd. The most popular designs are printed and sold. Websites such as Mechanical Turk are also used to crowdsource tasks, usually for research purposes. Users can perform small tasks for which they receive a monetary reward. According to its website, it is a web service that provides an on-demand, scalable, human workforce to complete jobs that humans can do better than computers, such as recognizing objects in photographs.11

99designs is a platform where designs are crowdsourced using a design contest model. The website functions as an online marketplace for graphic design services, with 230,000 contests and 240,000 registered designers. Swiftly, part of 99designs, is geared more towards micro tasks: it serves people who already have a design but need slight alterations or edits. Users enter a description of the work they are looking for, upload the relevant asset, and Swiftly matches the user with a designer who can get the work done the same day.12 The website DesignCrowd uses the same concept as 99designs, which is visualized in Figure 1.

Figure 1: Graphical representation of the crowdsourcing design process used by 99designs and DesignCrowd

InnoCentive crowdsources innovation by having its solvers propose ideas and solutions for business, social, scientific, and/or technical challenges; the winning solver receives a cash prize. StackOverflow13 is a Q&A website where programmers can post and answer questions (social learning). Using the crowd, answers are ranked based on the number of votes received. TopCoder is an online community of digital creators who develop and refine technology for customers through competitions and who encourage creative software development. It is one of the early adopters of Open Innovation14, where innovators and creators from around the world can select and solve the types of problems and challenges to which they would like to contribute [1].

Crowdsourcing contests are relatively disadvantaged compared to more conventional means of procurement because the effort of losing contestants is wasted [2]. This is where my study comes into play: generally weak designs might contain strong parts, which can be incorporated into strong designs to make them even stronger. This way, designs are optimally utilized and high-quality ideas are not wasted. To get better solutions from participants in such contests, [17] found that participants need to be motivated in the co-creation process. They found that intrinsic motivation (contest autonomy, variety, analyzability) mattered more than extrinsic motivation. Yet, to truly encourage participants, a balanced view of these two types of motivation is needed. Finally, they suggest that crowdsourcing contest tasks should be autonomous, explicitly specified, less complex, and require a variety of skills.

2.2 Crowd Creativity

To analyze whether creativity increases through combination, an experiment on crowd creativity was conducted in the following way: participants (split up into three groups) had to design sketches of chairs for children using a sequential combination system, in which one crowd creates a first generation of designs, which is used by the second crowd [15]. The third crowd subsequently builds on the designs of the second crowd. Designs were evaluated by the crowd based on originality and practicality, and subsequently classified as either creative or not creative; a design was labeled creative if it was both practical and original. Results showed that crowd creativity indeed increases through combination: the number of creative designs in the third generation was greater than that in the first generation.

Another study crowdsourced design critique to test the viability of receiving critiques online [14]. After converting the 4-step critique process (describe the intent of the design, evaluate the strengths, evaluate the weaknesses, summarize the statements) into sub-tasks, workers on a crowdsourcing platform were asked to view a design and to improve the initial description by either editing or adding text. Improvements were then voted on by other workers. Results showed that this approach allows designers to acquire quality critiques quickly and accurately.

Based on these studies of design and crowd creativity, a number of tools have been created that aid collaborative designing. An early example of multi-user design support is Commune, a shared drawing surface prototype intended particularly to support distributed design groups [10]. TEAM STORM is an interaction model that allows a team of designers to work efficiently with multiple ideas in parallel [5]. More recent work includes IdeaVis [4], a digital pen- and paper-based system for augmenting the traditional paper-based workflows of sketching sessions, aimed at supporting co-located sketching. SkWiki is a multimedia sketching system for collaborative creativity in digital multimedia projects [16]. The model supports collaborative editing such as branching, cloning, and merging paths from multiple contributors, and is particularly useful in multi-user collaborative settings where brainstorming and generating multiple design alternatives are key. The authors validated skWiki by conducting a user study in which participants were asked to design children’s toys. Our study builds on this work, focusing specifically on how designs can be combined, by conducting an experiment in the form of a two-round contest: in the first round participants create a design themselves, and in the second round peer designs are shared among participants and recombination is encouraged.

3. METHODS

3.1 Study Design

The experiment was conducted in the form of an online contest. Two contests were run in parallel: one for User Interface/User Experience (UI/UX), the other for Architecture and Design (AD). The setup of the experiment and the entire process were identical, as the two contests were part of one project conducted by two researchers. The results of the UI/UX and AD contests are reported in two separate theses; this thesis focuses on the UI/UX contest.

Contestants were given the design challenge to create a user interface for a traffic signal simulator in two rounds. They were divided into two groups: experimental and control. See Appendix 1 for the full design prompt. After the first round of the design challenge, all designs were shared with the participants in the same group (either control or experimental). The experimental group was additionally asked to rank their peers’ first round designs, whereas the control group was only asked to review them (see Section 3.4.1). In the second round, participants revised their design and were encouraged to borrow ideas from their peers.

Finally, subjects were interviewed to gain a deeper understanding into their thought process during the contest. After the interviews, all designs were graded independently by an expert panel. The best design in each round was awarded with one thousand dollars.

3.2 Participants

3.2.1 Recruitment

Participants were recruited via a number of channels. First, advertisements were posted on Reddit, Craigslist, and Facebook. This proved unsuccessful, after which personal networks were used: professors from several universities (Carnegie Mellon University, UC Irvine, UC Berkeley, University of Washington, University of Southern California) were approached by email and in turn contacted their students. Thirty people applied for the UCI User Experience Software Design Contest. Out of the 30 applicants, 20 people were chosen to participate in the contest. Selection criteria included years of experience, age, and the number of programming languages and design tools applicants were familiar with. The minimum requirement for signing up was at least two years of industry experience.

3.2.2 Demographics

In total, eight contestants dropped out of the contest. Four people dropped out of the first round without submitting any design. Sixteen contestants submitted a design for the first round, of which four failed to meet the deadline for the second round submission.

The final sample consists of 12 participants: 7 females and 5 males. Participants’ ages range from 23 to 32 (mean 26.58, median 26.5, mode 24). Participants had between 2 and 5 years of industry experience, with an average of 3.17 years. Ten out of twelve participants either completed a Master’s degree or are pursuing one, and fifty percent graduated from or currently attend Carnegie Mellon University. The majority of the participants studied Human Computer Interaction, followed by Computer Science/Software Engineering. Participants have professional working experience in technology consulting, web development, design, and software development.

Participants are familiar with the design tools Adobe Creative Suite, Photoshop, Illustrator, Sketch, Balsamiq, Eclipse, iMovie, Android Studio and Omnigraffle. They also have knowledge of a wide variety of programming languages, ranging from C, C# and C++ to .NET, Java, JavaScript, Python and CSS. See Appendix 2 for a detailed overview of all participants’ demographics.

3.3 Design Prompt

Participants were asked to design the user interface of a traffic signal simulator for students to use during a civil engineering course. They were given the following broad requirements when designing the system: (1) Students must be able to create a visual map of an area, laying out roads in a pattern of their choosing; (2) Students must be able to describe the behavior of the traffic lights at each of the intersections; (3) Based on the map created, and the


Designs were graded based on three criteria: elegance/simplicity, clarity (easy to understand), and completeness (fulfills all requirements, all functionalities included).

3.4 Procedure

3.4.1 Process Description

The experiment was conducted in the form of an online contest. Participants were given the task of designing a user interface for a traffic signal simulator. The contest consisted of two design rounds. Contestants were divided into two groups: experimental and control. See Appendix 3 for the breakdown of the two groups and Figure 2 for a visual representation of the contest.

It took roughly a month to recruit participants. A website was created (http://softwaredesignuci.wordpress.com), together with two Google forms where people could sign up for the contest. Note that this website served both the UI/UX and the Architecture and Design contest. See Appendix 4 for the recruitment message and Appendix 5 for the signup form.

The experiment took seven weeks in total. This includes sending out emails to selected participants, receiving consent forms, and notifying participants about the outcome of the contest. The actual contest took place over the course of three weeks.

In the first week of the experiment, participants signed a consent form agreeing that their data would be used for research purposes. Contestants had three days to send back the signed form, after which the first round of the design challenge started. See Appendix 6 for the consent form.

Participants had one week to create an initial design for the traffic signal simulator. They were strictly required not to put their name and/or contact details on any of the pages. Designs were submitted via DropItToMe, an online service that automatically sends the files to a Dropbox folder. After all first round designs were handed in, designs were renamed to fruits (e.g. "John Smith" would be labeled "Apple"). The designs for each condition were put in a Dropbox folder and shared with all participants in that group (so people in the experimental group could only see designs from other people in the experimental group, and correspondingly for the control group).

After submitting their first round designs, the control group was asked only to carefully review the others’ designs, whereas the experimental group was asked to rank the designs according to their own judgment (excluding their own design). This was done to analyze whether having to rank the designs in the first round influences how designers approach the challenge in the second round; for example, they might spend more time looking at the other designs because they had to rank them and therefore borrow more ideas. The experimental group was given three days to perform the ranking. The control group did not have to submit anything, as they only needed to look at their peers’ designs.

After all rankings were received, the second round of the design challenge started. In the email with instructions, contestants were "strongly encouraged to closely examine the first round designs of the other participants, draw inspiration from them, borrow ideas from them to integrate into your design, copy parts, or even bring together ideas from multiple other designs to further your own design." They were again given a week for this.

The day after all second round designs were submitted, participants were asked to rank the second round designs. Both the experimental and the control group were asked to rank these new designs according to their own judgment. The reason for having contestants rank the designs was to analyze whether they were able to identify strong designs, and whether their ranking was in accordance with that of the expert panel. They were given three days.

In the final week of the experiment, subjects were interviewed for 30-45 minutes (semi-structured). All interviews took place via Skype. Contestants were asked about their own design as well as those of others, the recombination process (looking at peer designs and incorporating ideas into one’s own design), the design challenge, and the contest as a whole. These interviews served to understand the decisions participants made throughout the design process, including if and how one was inspired by others’ designs, and how this influenced the quality of their own design. See Appendix 7 for the interview protocol. The contest ended after all interviews were conducted.

After the interviews, all designs (both first and second round) were graded independently by four experts based on three criteria: elegance/simplicity, clarity, and completeness. The highest-scoring design for each round was declared the winner.

Two prizes could be won: one for the best design in the first round, and one for the best design in the second round. The best design in each round was awarded one thousand dollars. All participants received one hundred dollars in compensation for their participation in the experiment. Participants who dropped out after the first round received a pro-rated reward of forty-five dollars. See Appendix 8 for a timeline of the contest, Appendix 9 for a flowchart of the email traffic, and Appendix 10 for all emails.

3.4.2 Data Collected

At the conclusion of the experiment, a large amount of data had been collected from the participants. Data included a total of 28 designs (16 first-round designs and 12 second-round designs, due to four dropouts), 19 peer rankings (7 for the first round, 12 for the second round), and 12 interviews.

3.4.3 Data Analysis

Data were analyzed in a number of ways. An expert panel graded the designs, and card sorting (organizing topics from content into groups and labeling them15) was used to analyze the interview data. Furthermore, a number of statistical analyses were performed to check for correlations between different variables.

3.4.3.1 Expert Panel

All designs (both first and second round) were graded independently by an expert panel of four, all affiliated with the University of California, Irvine and working in the Informatics department (the professor and chair of the Informatics department, one postdoctoral research associate, and two PhD candidates). The grading was double-blind. To eliminate any form of bias, graders did not see the designs beforehand. Furthermore, names were removed from the designs and replaced with fruits. Finally, graders did not know whether a submission was a first or second round design, or whether the participant was in the experimental or control group.

Designs were graded based on three criteria: elegance/simplicity, clarity, and completeness. A 1-7 Likert scale was used, with 1 being least elegant/simple, clear, or complete, and 7 being most elegant/simple, clear, or complete. Each participant thus received 24 grades in total (3 criteria times 4 raters times 2 designs), i.e., 12 grades per design. Finally, the average score per participant per round was calculated. See Appendix 11 for the expert panel grading and the final rankings for each round.
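As a minimal illustration of this aggregation (the scores and the fruit label below are invented placeholders, not the panel's actual data), the following Python sketch averages the 3 criteria x 4 raters grades into one score for a single design:

# Sketch of the grade aggregation described above (invented scores).
# Each design receives 3 criteria x 4 raters = 12 grades on a 1-7 scale;
# here the score per design is taken as the mean of those 12 grades
# (one possible aggregation).
from statistics import mean

grades = {  # grades[criterion] = [rater1, rater2, rater3, rater4]
    "elegance/simplicity": [5, 6, 4, 5],
    "clarity": [6, 5, 5, 6],
    "completeness": [4, 5, 5, 4],
}

all_grades = [g for rater_grades in grades.values() for g in rater_grades]
print(f"Apple, round 1: {mean(all_grades):.2f}")  # average over the 12 grades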

3.4.3.2 Interviews

For the UI/UX contest, twelve Skype interviews were conducted. All interviews were recorded and subsequently transcribed. Afterwards, transcriptions were read and important/relevant quotes were highlighted. See Appendix 12 for the interview transcriptions.

3.4.3.3 Card Sorting

Each highlighted quote was summarized on a card, together with the matching quote and the fruit name of the participant. Afterwards, all cards were put on four walls and grouped opportunistically by both researchers. Then, all cards were labeled and subcategorized. Examples of categories include “Changes” (subcategories “simulation control”, “traffic light sequence”, “road creation”, etc.) and “Reasons for not incorporating” (subcategories “time constraints”, “no fit”, “both time and no fit”, “no flexibility”, and “software related”). Many cards from different interviews matched or were highly similar. See Figure 3 for photos of the cards on the walls.

Figure 3: Card sorting  

3.4.3.4 Statistics on Quantitative Measures

Statistical analyses were performed using SPSS. Pearson’s correlation analysis was used to investigate the relationship between each pair of quantitative variables, a paired samples t-test was used to look for within-participant differences between rounds, and an independent samples t-test was used to look for an effect between the two independent groups. The list of variables used in the analyses can be found in Appendix 13.
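For readers unfamiliar with these tests, the sketch below shows how one of the reported correlation analyses could be reproduced outside SPSS using SciPy; the arrays are placeholder values, not the study's data:

# Illustrative Pearson correlation between two quantitative variables
# (placeholder values, not the study's measurements).
import numpy as np
from scipy import stats

page_count = np.array([2, 4, 5, 6, 8, 9, 12, 18])            # e.g. pages per design
expert_grade = np.array([8.0, 11.5, 12.0, 13.5, 15.0, 14.5, 16.0, 17.0])

r, p = stats.pearsonr(page_count, expert_grade)
print(f"r = {r:.3f}, p = {p:.3f}")  # reported in the text in the form r = .589; p = .044

The paired and independent samples t-tests can be run analogously with scipy.stats.ttest_rel and scipy.stats.ttest_ind; a sketch of those two tests is given in Section 4.3.1.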

4. RESULTS

4.1 Designs

This section elaborates on the designs of both the first and second round, the design evaluations, and the design changes.

4.1.1 Design Process

This subsection discusses the amount of time designers spent on both rounds, their perceived task difficulty, and their comments on why they did or did not find the design challenge difficult.

4.1.1.1 Time Spent Designing

The control group spent 6.29 hours on average designing their first design, whereas the experimental one averaged 9.2 hours. The control group spent an average of 5.18 hours on their second design, while the experimental group spent 5.7 hours on average. See Appendix 14 for the full table with statistics.

4.1.1.2 Perceived Task Difficulty

On average, contestants rated the difficulty of the task 4.79 on a scale from 1 to 7. Ratings ranged from 3 (Grapes) to 7 (Cherry), the mode being 5. They rated the difficulty of the ranking task 3.83, with the lowest being 2 (Grapes and Kiwi) and the highest being 5 (Raspberry, Cherry, and Honeydew). See Appendix 15 for the full table with the statistics.

4.1.1.3 Difficulties in the Design Task

A number of difficulties were mentioned by participants when asked about the task difficulty. These can be grouped into two categories: (1) design prompt-related difficulties, and (2) non-design prompt-related difficulties, such as time pressure and the use of design tools.

The main design prompt-related difficulties concerned domain knowledge/background, taking users into account, and complexity. Honeydew mentioned unfamiliarity with the domain as an issue: “I had trouble trying to think of similar things this could be based off of. It was something new in terms of designs I had thought about.” Strawberry, Honeydew, Cherry, and Blueberry all indicated that unfamiliarity with the domain and not being completely sure about the expectations were a problem. Blueberry: “I don’t know anything about this topic. I had to research a lot about the traffic rules.” Strawberry: “Just trying to understand the domain before I started designing, was challenging.” Cherry: “I had to search for the rules.”

Orange felt like the task was harder because of the level of detail, but not extremely difficult because “you are designing it for students or people who are in the tech-savvy community.” Five participants talked about the complexity of the task: some found the design challenge not too difficult because a single-user scenario makes it easier than a multi-user scenario (Grapefruit), whereas others found that there are many real world variables and different scenarios that had to be taken into account (Grapes, Raspberry, Lychee).

Most participants cited complexity when explaining why they did or did not find the task difficult. Kiwi: “it was okay, we just had to go by what we were given (requirements).” Grapefruit also did not find the task too hard: “it wasn’t difficult because it was more a consumer-facing product. You don’t have to worry about scaling it.” Grapes thought that there were a lot of additional factors they had to think about, and believed the task would have been easier without these. Raspberry’s argumentation supports this:

“When I was designing this, I was trying to imagine using this in a real life context. There are so many real world variables and it is challenging to incorporate all of these in the software.”

Non-design prompt-related difficulties include time pressure and the use of design tools. Time pressure was pointed out as a difficulty by Mango, Orange, Grapefruit, Pear, Strawberry, and Raspberry. Most people believed they could have done better had they had more time.

Orange and Honeydew mentioned that technical issues got in the way of their design (design tool shortcomings). Honeydew mentioned that their program was laggy, which made the task harder, whereas Orange felt restricted by the tool Balsamiq (“There is no way to add certain colors in Balsamiq. That is kind of a shortcoming of the tool I used.”).

4.1.2 Structure and Characteristics of Designs

4.1.2.1 Length

In the first round, the number of pages per design ranged from 1 (Orange) to 18 (Honeydew), with an average of 6.5. See Appendix 16 for an overview of the number of pages per design; the table is sorted by designers’ ranking in the respective rounds. In the second round, the average was 7.92 pages, the shortest design being 3 pages (Pear) and the longest being 20 pages (Honeydew). Designs increased by 1.58 pages on average.

4.1.2.2 Typical Sections

Every UI design contains the following features: create a visual map (lay out roads), adjust traffic density, create light sequences, and simulate traffic flows. All designs contain multiple intersections, and there is a traffic light at every intersection.

4.1.2.3 Visual Styles

Aspects of a UI’s visual style include elegance, level of detail, the extent to which the interface explains how it works, and level of clutter. The first and second round winners, Grapefruit and Cherry, both scored a 6 on elegance/simplicity for the second round (as graded by the expert panel). Grapefruit’s design is not overly cluttered: it is approximately 70% graphic and 30% verbal. On each screen, important icons are marked with a balloon and a one-sentence explanation. The simple interface makes it easy to understand: not too many items on one page, well readable, not too much text. The second round winner (Cherry) contains 8 pages, every page containing at least one UI screen. Everything is clearly readable and accompanied by a solid explanation. The design looks clean.

The design ranked last in both rounds (Pear) contains only 2 pages in the first round, hardly any visuals besides one small road mockup, and a whole lot of buttons. The first design is entirely black and white, whereas the second has color. Pear scored a 2 for elegance/simplicity in both rounds, and a 1.5 and 1.8 on clarity for the first and second round, respectively. This suggests that strong designs are clearer than weak designs, though no statistical evidence can be provided for this due to its highly subjective nature.

Designs differed greatly in terms of visual style. Some designs were overly cluttered, whereas others kept it simple by having one screen per page and a few sentences of explanation. Most designs put the interactions in a certain order, e.g. the first screen shows how to build a road map, the second how to add street lanes, the third how to adjust light behavior. Some designs had a popup window for creating/editing a light sequence, whereas others had this feature in a side toolbar. Designs also differed in the amount of text used: some had a whole explanation, others simply used a few words. Grapefruit mainly used grey, black and orange, which made the design look very clean. Most designs used a wide variety of colors.


4.1.3.1 Relationship Between Strong Designs and Length of Designs

The average number of pages was 6.5 in the first round and 7.92 in the second. Two of the top 3 first round designs had fewer pages than average, while the bottom design had 2 pages in the first round and 3 in the second. This may suggest that both the strongest and the weakest designs tended to be shorter than average.

Statistically, there are no significant correlations between page count and absolute rankings for either round (first round: r = .543; p = .068; second round: r = -.527; p = .079). However, a positive correlation can be found between page count and the grades given by the expert panel for the first round (r = .589; p = .044); the second round correlation is not significant (r = .461; p = .132).

4.1.3.2 Relationship Between Strong Designs and Designers’ Expertise

Participants’ ages ranged between 23 and 32, and most participants were students. Participants had between 2 and 5 years of experience in software development. The winner of the first round and first runner-up in the second round (Grapefruit) had 4 years of professional experience and was familiar with 6 programming languages and 5 design tools. The second round winner (Cherry) had the most years of professional experience (5) and placed third in the first round. This suggests that the best designers were among the most experienced in terms of professional experience and knowledge of languages and tools.

However, when running a correlation analysis in SPSS between years of industry experience and absolute rankings for both first and second round, no significant results could be found (first round: r = -.423; p = .171, second round: r = -.296; p = .351). Running a correlation analysis between years of industry experience and expert panel grades for the first and second round also yields insignificant results (first round: r = .323; p = .306, second round: r = .352; p = .262).

4.1.3.3 Relationship Between Strong Designs and Time Spent Designing

The top two designs took less than average time to create. The average number of hours spent on the first design was 7.5; the winner and first runner-up spent 5 and 5.5 hours, respectively. Grapes (2nd place) spent an hour thinking about requirements, assumptions, and key questions, an hour and a half on sketches, and three hours building the sketches out in Photoshop. The bottom-ranked designer (Pear) spent 6 hours on their design, noting that they "jotted it down on paper, then made the word file".

The average number of hours spent on the second design was 5.4. The winner (Cherry) and first runner-up (Grapefruit) spent 12-15 and 3 hours creating their designs, respectively; Cherry’s 12 to 15 hours included research.

First runner-up Grapefruit did not spend as much time on their design, arguing that it was more difficult to make changes in the second round due to the high fidelity of the designs. They indicated that they did not want to change too much and start over, and therefore preferred tweaking their design over coming up with an entirely new one. The bottom-ranked designer (Pear) spent 2-3 hours on their design. Pear found their own design to contain unique features that others did not cover, and argued that, although other designs contained a few good features, their own design had some really good ones.

This indicates that strong designs do not necessarily take more time to create. By running a correlation analysis in SPSS, no significant relation can be found between the time spent creating the designs (both first and second round) and their grades received from the expert panel (first round: r = .107; p = .740, second round: r = .417; p = .178).

4.1.3.4 Relationship Between Strong Designs and Degree of Influence by Other Designs

Second round winner Cherry made 6 changes that were influenced by other designs. Runner-up Grapefruit was influenced by Raspberry and Honeydew and made two changes inspired by others (visualization of light timings on the map, and a popup window for light sequence and timing), the reason being that they did not want to go back to an initial rough sketch since they already had a ‘final’ design after the first round. Grapefruit indicated that 90% of their changes were influenced by other designs. The bottom-ranked designer, Pear, only took one idea from other participants (start and help buttons) and, by their own estimate, was 25% influenced by other designs, meaning that 75% of their changes came from themselves. Pear: “I had a few ideas in the first round but was short in time.” This suggests that good designs were more influenced by other designs.

However, a correlational analysis run in SPSS between absolute ranking for the first and second round and the number of copied ideas yielded insignificant results (first round: r = -.178; p = .58, second round: r = -.427; p = .166).

See Appendix 17 for an overview of to what degree each design was influenced by others.

4.1.3.5 Relationship Between Strong Designs and Perceived Task Difficulty

Participants were asked how difficult they believed the task to be on a scale from 1 to 7, with 1 meaning very easy and 7 meaning highly difficult. As mentioned earlier, the task was considered difficult for design prompt-related and non-design prompt-related reasons. The first round winner rated the difficulty of the task as 3.5, reasoning as follows:

“It was more complicated than a simple design challenge because there were a lot of rules involved. But it wasn’t difficult because it was more a consumer-facing product. You didn’t have to consider the number of users; you only had to consider one user. So you don’t have to worry about scaling it.”


undergrad where they programmed a traffic monitoring controller. On average, participants rated the task a 4.79, indicating moderate difficulty.

Given that the first round winner found the task moderately difficult and the second round winner found it extremely difficult, we can infer that designers of strong designs did not find the task easy. However, the bottom-ranked designer also found the task more difficult than the average participant. Statistically, no significant correlation can be inferred between perceived task difficulty and the absolute final rankings for either round (first round: r = .121; p = .707; second round: r = .037; p = .909).

4.2 Rankings

4.2.1 Accuracy of Peer Rankings

See Appendix 18 for an overview of the individual peer rankings. The matrix in Figure 4 shows the difference in ranking between participants and the expert panel. The rankings are a combination of all expert rankings and all participant rankings. A negative number means that participants ranked the design lower than experts, whereas a positive number indicates that participants ranked the design higher than experts. The number indicates the number of ranks (e.g. in round 1, the design that was ranked third by the experts, was placed four ranks lower by participants). Note that R1 and R2 stand for round 1 and 2, respectively, and UE1 indicates the control group whereas UE2 represents the experimental group. See Appendix 19 for a detailed participant-expert grading comparison.

There is a stronger positive correlation between rankings provided by the expert panel and rankings given by participants for the designs from the second round than the first round, but both correlations are non-significant (first round: r = .193; p = .756, second round: r = .505; p = .094).

There is a strong significant negative correlation between rankings provided by participants and the grades from the expert panel after the second round (r = -.599; p = .039), which is non-significant for the first round (r = -.146; p = .815). These statistical results show that participants were better able to distinguish good designs from bad in the second round.

Participants \ Experts   #1   #2   #3   #4   #5   #6   #7   #8
R1-UE2                    0    0   -4    0    0    0   -1   +5
R2-UE1                    0   -3    0   -2   +3   +2    0
R2-UE2                   -1   +1   -2   +1   +1

Figure 4: Ranking matrix participants and experts
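To make the construction of this matrix concrete, the sketch below computes rank differences for a hypothetical set of five anonymized designs; the fruit labels and the rankings are invented, not taken from the study:

# Sketch of how one row of the Figure 4 rank-difference matrix could be
# computed (hypothetical expert and aggregated peer rankings).
expert_rank = {"Apple": 1, "Banana": 2, "Fig": 3, "Date": 4, "Plum": 5}
peer_rank   = {"Apple": 1, "Banana": 2, "Fig": 5, "Date": 4, "Plum": 3}

# Difference per design, ordered by expert rank (#1, #2, ...).
# expert_rank - peer_rank: a negative value means peers ranked the design
# lower (worse) than the experts did, a positive value higher (better).
for design in sorted(expert_rank, key=expert_rank.get):
    diff = expert_rank[design] - peer_rank[design]
    print(f"#{expert_rank[design]} ({design}): {diff:+d}")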

4.2.2 Inspirational Designs and Their Rankings

Mango and Strawberry were most often mentioned as inspirational designs (three and two times, respectively). However, Mango placed 6th in the first round and 9th in the second round. Strawberry ranked 10th in the first round and 7th in the second. Thus, the highest ranked designs were not necessarily the most inspirational ones. See Appendix 20 for an overview of peers’ inspirational designs.

4.2.3 Time Spent on Ranking

The experimental group spent an average of 1.05 hours ranking the first round designs. Grapefruit and Raspberry both took 30 minutes, while Blueberry spent an hour and a half. In the second round, contestants took an average of 1.19 hours, the lowest being 30 minutes (Grapefruit) and the highest 2.25 hours (Orange). On average, participants spent more time ranking in the second round. See Appendix 21 for an overview of the number of hours each designer spent on ranking.

4.2.4 Designers’ Ranking Criteria

The criteria used by participants when evaluating and ranking others’ designs can be broken down into the following eight (not necessarily distinct; some criteria overlap slightly): completeness, clarity, elegance/simplicity, usability, visual design, effort, user-centeredness, and level of detail. Table 1 provides an overview of which participant used which criteria.

Table 1: Ranking criteria per participant. Columns (criteria): completeness, clarity, elegance/simplicity, usability, visual design, effort, user-centered, level of detail. Rows (participants): Grapefruit, Cherry, Strawberry, Mango, Pear, Orange, Lychee, Grapes, Kiwi, Blueberry, Raspberry, Honeydew.


Six participants used completeness as a criterion. They mainly looked at whether the interactions were designed, whether the requirements were fulfilled, how much the design addressed the problem statement, and how many aspects of the system were covered.

Many participants used clarity as a criterion. Honeydew looked at end-to-end flows and whether they could understand the design. Grapes looked at how easy it was to understand the interaction and whether they understood what the designer was trying to communicate. Lychee, Raspberry, Blueberry, and Strawberry simply mentioned clarity as a factor.

Lychee mentioned elegance as one of their criteria. Honeydew looked at whether the design avoided clutter, and Mango, Grapefruit, Kiwi, and Strawberry focused most on simplicity. Usability was mentioned by Grapes (“Some aspects of others’ designs ranked lower in terms of usability”), Raspberry (“Is the design efficient?”), Blueberry (who focused on whether the main interactions were designed and effective), and Mango (who looked at whether it was easy to know where to press a button and what steps to take next). According to Kiwi, usability and controls are most important. Pear spent more time on designs that were more user-friendly.

Six participants mentioned visual design as a factor. Visual appeal and fidelity played a role in evaluating designs. Kiwi: “Some designs just looked more attractive than others.”

Cherry looked if the designer thought about the real use of the system: was the user taken into account? Kiwi: “I kind of used what I would use the most if I were the user.” Level of detail was also important to Cherry: “The second criteria was if the person listed out the wireframes and how to use them. Did they try to create a storyboard or were there only wireframes?”

4.2.4.1 Ranking Strategy

Participants went about evaluating and ranking their peers’ designs in a number of ways, the main ones being pairwise comparison and one-by-one review. Raspberry first looked for the best and worst designs, then did a side-by-side comparison to determine the rank of the remaining designs. Cherry compared designs in pairs, as did Pear, whereas Grapes found it more helpful to look at them one at a time rather than comparing them against each other. Kiwi compared all designs to their own. Blueberry looked at each design to get a first impression, then looked at how each design handled certain interactions; designs were then given a grade on a 1 to 10 scale. Strawberry looked at how much the designs addressed the problem statement, how easy the software was to understand, and how much freedom it gave in terms of how objective the design was.

There were notable differences in how the experimental group ranked designs in the first round versus the second round. Some participants spent the same amount of time on ranking in both rounds, whereas others found the second round ranking to go much faster. Blueberry: “I have already seen their designs.”

4.2.4.2 Perceived Ranking Difficulty

On average, participants found ranking not too difficult (3.83 on a 1-7 scale). The lowest score given is 2 (mentioned three times) and the highest 5 (mentioned three times). This indicates that participants found it easier to rank designs than to create designs (task difficulty scored a 4.79 on average). See Appendix 22 for a table of each participant’s perceived difficulty of ranking.

4.2.4.3 Difficulties in Ranking

Participants encountered several difficulties when ranking others’ designs. The main difficulties include variety, quality, quantity, and expectations/standards. According to Raspberry, ranking was difficult due to the lack of consistency in the designs: “It’s hard to put designs side by side because there are so many different variables in terms of design style and clarity. There was no baseline.” Both Pear and Mango mentioned that some designs were similar, which forced them to look into the designs in more detail. Grapes found the ranking process easy because there were only a few. Strawberry also found it simple because they already had certain expectations since they had been working on the design themselves. Cherry had difficulties figuring out what criteria to focus on: either the visual design or the idea behind it.

4.2.5 Accuracy of Self Rankings: Relationship Between Strong Designs and Self Rankings

9 out of 12 participants (everyone except Strawberry, Mango, and Pear) believed their design ranked in the top 3 for the first round. The average self-assigned first round ranking is 2.5. For the second round, this is 2.33, with 11 out of 12 participants (91.7%, everyone except Mango) indicating that their design would be in the top 3. See Appendices 23, 24, 25, and 26 for an overview of each participant’s ranking according to themselves, their ranking according to peers, their ranking according to experts, and their absolute ranking according to experts.

The first round winner (Grapefruit) ranked themselves third, arguing that they did not spend much time on the design (“I know I didn’t spend a lot of time on it, so I feel like the people who did probably deserve a higher ranking.”) Most participants (9 out of 12) ranked themselves fairly high and believed their design to be in the top 3. Raspberry mentioned the following:

“I think I spent a lot of time on it. And I felt like some of the other ones had more features but I made the interactions easier to use. The main reason is that my design is easy to use.”

Lychee found their design to be complete, but felt that the interface was "not that great". The first runner-up (Grapes) ranked themselves either first or second, arguing that they were more detailed about the interactions and provided clearer explanations. The second runner-up (third rank, Cherry) ranked themselves first; however, they mentioned that their only real threat was Grapefruit, who ended up first: “I can only compete with Grapefruit.” As Cherry argued:


covered the “understanding part and not the jazzy part of the system.” On average, participants expected their design to be ranked 2.5, indicating that most designers were confident about their own work.

The second round winner (Cherry) ranked themselves first. Most participants (9 out of 12) ranked themselves fairly high and believed their design to be in the top 3. Orange mentioned completeness as the reason why their design should be placed first or second (“I tried to have all the functionalities in there, not even missing one”). The first runner-up (Grapefruit) ranked themselves third, again for not spending much time on the design. The bottom-ranked designer (Pear) believed they ranked in the top 3 for adding colors and buttons to the system. On average, participants expected their design to be ranked 2.33, indicating that most designers were even more confident about their revised design than about their first round design.

Although most participants believed that their designs would end up high in the rankings, no significant correlations can be found between participants’ rankings of their own design and their final ranking in either round (first round: r = .175; p = .586; second round: r = .039; p = .904). This indicates that participants’ own judgment is a poor indicator of their final ranking. See Appendix 27 for the correlation tables of all analyses performed.

4.2.6 Relationship Between Strong Designs and Time Spent Ranking

Participants were asked how much time they spent on ranking the designs. Grapefruit indicated that they spent 30 minutes ranking the first round and another 30 minutes for the second round, the reason being that they looked at both rounds independently: “I tried not to think about their previous design and looking at their second design, I just tried to take it as a whole. I used the same criteria that I used in the first round.” Furthermore, Grapefruit did not spend much time on ranking because they found it to be fairly easy (2 on a 1-7 scale). They mentioned that “not all of them were that great”, and that it was relatively easy to separate some from the others and, amongst those, make a judgment call about which one was easier to understand.

Cherry spent 2 hours on ranking, both for the first and the second round. In the second round, Cherry looked at the changes made in each design. They found ranking to be moderately difficult (5), “because for the second round, you don’t know what to pay more attention to: the visual design, the idea itself, the whole system, or if they focused on sequence simulation only.” Cherry added one criterion in the second round: whether the person changed anything and how much they changed.

On average, participants in the experimental group spent 1.05 hours ranking the designs in the first round, and the control and experimental groups together spent an average of 1.19 hours ranking the second round designs. Grapes said that ranking designs was easy because there were so few. Strawberry found ranking easier because they already had certain expectations of the designs since they had worked on the challenge themselves. According to Lychee, designs that were more graphic took less time. Pear pointed out that it was not too easy because they had to choose between two or three designs and properly compare them. The bottom-ranked designers (Pear, Kiwi, and Raspberry) spent 45, 60, and 45 minutes ranking the second round, respectively, all below the average time spent on ranking.

No significant correlation can be found between time spent on ranking designs after the first round and their final ranking for the second round (r = -.498; p = .393). This indicates that spending more time on ranking in the first round did not necessarily lead to better designs in the second round. Furthermore, there are no significant correlations between final rankings and time spent on ranking designs for both first and second round, which means that designers of good designs did not necessarily spend more time on ranking designs.

4.2.7 Reactions to Seeing Other Designs

In general, participants felt that seeing a good design is encouraging: it provides an opportunity to improve. Grapes, Pear, Orange, and Lychee felt encouraged when seeing strong designs, for different reasons. Grapes found it interesting to see how different people represented the UI and said that some designs encouraged them to make sure that theirs looked “polished and ready to go.” According to Pear, seeing a good design is encouraging: “It was good, I mean if they can try then I can also try.” Both Orange and Lychee felt encouraged by seeing how much effort some designers put in.

Participants also found it encouraging to see designs that they believed were of lower quality than theirs; it made them feel more confident about their own design. Blueberry felt that their design stood out because their interactions were superior to others’. Reviewing peers’ designs was considered encouraging by Kiwi and Cherry, as it made them realize that they had to include certain features. Moreover, finding similarities between one’s own design and that of a peer was encouraging: comparison helped with their design (Kiwi).

Discouraging factors include bad designs and bad feedback. Cherry and Honeydew both expressed their disappointment about some designers not putting in much effort in the contest or not making any significant changes. Mango pointed out that they were not sure whether their preference for a certain design was ‘correct’: “Let’s say I like Blueberry’s design. Am I right to like that? Is it subjective that I like this design? I was not sure if liking a design was good enough reason to go with it.”

4.3 Design Changes

4.3.1 Design Improvement

The amount of improvement on a 1-7 Likert scale averaged 1.77 points. Out of 12 participants, 9 received a higher score for their revised design, 1 did not change, and 2 received a lower score. A paired sample t-test run in SPSS to analyze if designs received higher scores in the second round confirmed the assumption that on average, people improved (at the 5% significance level with p=.034).

However, running these analyses separately for the control and experimental groups yields inconclusive results due to the small sample sizes (7 and 5) and the correspondingly low degrees of freedom. Both groups have a p-value larger than .05 (.08 and .29 for the control and experimental group, respectively), meaning that no inferences can be drawn from these analyses. Furthermore, an independent samples t-test was run to analyze whether having to rank the first round designs (UE2) had an effect on improvement. See Appendix 28 for an overview of each participant’s improvement.
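As an illustration of the two t-tests used in this subsection, the sketch below uses invented scores (not the study's data); the 7 vs 5 split only mirrors the group sizes mentioned above:

# Illustrative paired and independent samples t-tests (invented scores).
import numpy as np
from scipy import stats

# Paired test: each participant's expert score in round 1 vs round 2.
round1 = np.array([16.0, 15.3, 14.5, 13.0, 12.5, 12.0, 11.5, 11.0, 10.5, 10.0, 9.0, 8.5])
round2 = np.array([16.0, 14.0, 18.5, 15.0, 13.0, 14.5, 12.0, 13.5, 11.0, 12.5, 10.0, 12.0])
t_paired, p_paired = stats.ttest_rel(round2, round1)
print(f"paired: t = {t_paired:.2f}, p = {p_paired:.3f}")

# Independent test: improvement of the control group (first 7 participants)
# vs the experimental group (last 5 participants).
improvement = round2 - round1
control, experimental = improvement[:7], improvement[7:]
t_ind, p_ind = stats.ttest_ind(control, experimental)
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.3f}")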

4.3.2 Design Improvement Predictors

4.3.2.1 Relationship between Improvement and Absolute First Round Ranking

Kiwi made the biggest improvement: their design grade increased by 7.5 points. They went from 11th place in the first round to 6th in the second. Grapefruit went from 1st to 2nd place with zero score improvement (16 points in both rounds). First round second runner-up Cherry, however, improved by 4 points (from 14.5 to 18.5) and moved up one place, making Cherry the winner of the second round. The first round runner-up, Grapes, received a lower score for their revised design (15.25 and 14), moving down three ranks and placing fifth in the second round.

High-ranked designs (top 2) did not change as much as the rest. On average, participants improved by 1.77 points. The first round top 2 designs (Grapefruit and Grapes) improved by 0 and -1.25 points, respectively, while the first round bottom five designs (Orange, Lychee, Strawberry, Kiwi, and Pear) improved by 3.15 points on average. This suggests that, on average, lower-ranked designs may have improved more than strong designs.

However, by looking at the statistics we cannot infer a significant correlation between improvement and absolute final ranking of the first round (r = .529; p = .077). This indicates that designs that scored lower in the first round did not necessarily improve the most in their revised design.

4.3.2.2 Relationship between Improvement and Time Spent on Revised Design

On average, participants spent 5.4 hours on the design revision and improved by 1.77 points. The biggest improvement came from Strawberry, who spent 15 hours on their revised design and improved by 5 points. The second biggest improvement came from Cherry, who spent 13.5 hours on their revised design and improved by 4 points. The two contestants who received a lower score (Grapes and Raspberry) spent 2 and 4 hours on their revisions, while Grapefruit, whose score remained constant, spent 3 hours on their second round design. This indicates that people who improved more spent more time. Statistically, there is a significant positive correlation between the hours spent creating the revised design in the second round and improvement (r = .606; p = .037).

4.3.2.3 Relationship between Improvement and Number of Pages Added in Second Round

The number of pages per design in the first round ranged between 1 (Orange) and 18 (Honeydew), with an average of 6.33. In the second round, the average was 7.92 pages, the shortest design being 3 pages (Pear) and the longest 20 pages (Honeydew). Designs increased by 1.58 pages on average. See Appendix 29 for an overview of the page increase per design.

There is no significant correlation between the increase in the number of pages from the first to the second round and the improvement in the grade awarded by the experts (r = .026; p = .935).

4.3.3 The Borrowing Process

All participants claimed to have carefully reviewed each design in their batch, and everyone borrowed at least one idea from another design. Designs did not change drastically: most participants incorporated a number of ideas from their peers without completely redesigning their own user interface, looking for design features that fit well with their own UI.

Grapes described their borrowing process as follows:

“I looked through all of them individually again and then went through them one by one. I made a list of things that I liked from each of them. But I kept my own design in mind. If the things I liked would also fit in my design, I chose to incorporate it.”

Strawberry looked at each design and analyzed its pros, cons, and general approach, specifically looking for elements that could be incorporated into their own design. Furthermore, they considered the fidelity of the other designs and what level of fidelity they wished to maintain in terms of presentation.

After seeing the other designs, Cherry realized that they had not described their design in much detail. They mentioned that most designers explained how to create a map, whereas Cherry merely stated that users can create a map without designing the process. This encouraged Cherry to add a ‘how to create a map’ feature to their revised design. Honeydew went through every design and read the instructions once again. They saw that other designs included features theirs lacked, such as naming streets or saving and re-using light sequences, which inspired them to add these as well.

The following sections elaborate on the ideas that were borrowed. To determine who borrowed what from whom, two reviewers independently examined all designs. First, each participant's first and second round designs were compared and all changes were marked. Second, each second round design was compared against all other first round designs in the same batch to determine the sources of the changes. This produced a matrix per reviewer showing which ideas were borrowed by whom and from whom; the two reviewers' matrices were then combined into one. During the interviews, participants were asked which ideas they had copied from whom, and the ideas in our matrix were either confirmed or rejected by them. With this information, a final matrix was created showing which ideas were copied from whom, as confirmed by the participants. See Appendix 30 for the reviewers' initial matrices and the final matrix.
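A minimal sketch of this matrix-building step, with borrowing records represented as (borrower, source, idea) triples; the entries shown are illustrative examples drawn from cases discussed below, not the complete study data.

    # Minimal sketch of combining the two reviewers' borrowing matrices and
    # filtering them by participant confirmation; the records are illustrative.

    # Each reviewer independently lists suspected borrowings.
    reviewer_a = {("Cherry", "Raspberry", "toggle buttons"),
                  ("Honeydew", "Grapefruit", "save/re-use light sequences")}
    reviewer_b = {("Cherry", "Raspberry", "toggle buttons"),
                  ("Blueberry", "Grapefruit", "save/re-use light sequences")}

    # Combined candidate matrix: the union of both reviews.
    candidates = reviewer_a | reviewer_b

    # Borrowings the participants confirmed during the interviews (hypothetical).
    confirmed = {("Cherry", "Raspberry", "toggle buttons"),
                 ("Blueberry", "Grapefruit", "save/re-use light sequences"),
                 ("Honeydew", "Grapefruit", "save/re-use light sequences")}

    # Final matrix: candidate borrowings that were confirmed by participants.
    final_matrix = candidates & confirmed
    for borrower, source, idea in sorted(final_matrix):
        print(f"{borrower} borrowed '{idea}' from {source}")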


4.3.3.1 Most Popular Idea Borrowed and Most Popular Designer

Most ideas were adopted only once, one idea was adopted twice, and none was adopted three times or more. The idea that was incorporated twice is saving/re-using light sequences, introduced by Grapefruit and copied by Blueberry and Honeydew.

Figure 5 shows Grapefruit’s first round light sequence popup window, of which the ‘save’ functionality was copied by Blueberry; Blueberry’s first and second round light sequences are depicted in Figure 6 and Figure 7.

Figure 5: Light sequence by Grapefruit (round 1): users can create a new sequence or choose an existing one.

Figure 6: Light sequence by Blueberry (round 1)

Figure 7: Light sequence by Blueberry (round 2): users can choose available schemes that are either provided by the system or created by the user earlier.

The most popular designers are Mango in the control group and

Grapefruit, however, had the most ideas adopted from their design: seven changes influenced by Grapefruit were made by three people, as opposed to five changes influenced by Mango. This means that the design from which the most features were borrowed is not necessarily the design that the most people borrowed from.

4.3.3.2 Types of Small Changes Made

The criterion used to determine whether a change is small or big is the size of its impact on the UI and the corresponding user experience. For example, if only the visual style is changed (e.g., a square button is made round) and the interaction is not (significantly) affected, the change is classified as small. A small change is either one where nothing new is added to the UI (e.g., a change in color) or one with only a minor impact.

Examples of small changes include changing check boxes to toggle buttons (Cherry copied this from Raspberry) and replacing text with icons for the traffic lights (the words “yellow”, “red”, and “green” were replaced by icons in those colors). Cherry (experimental group) made this change, influenced by Grapefruit, Blueberry, and Honeydew (see Figure 8 and Figure 9 for a visual representation of the change), whereas Orange (control group) was inspired by Mango, Grapes, and Kiwi.

Cherry also copied Grapefruit’s functionality where an intersection is automatically created when two roads intersect. Honeydew added Grapefruit’s warning sign on the map for conflicts. Another small change is adding “start” and “help” buttons, which was done by Pear. Influences came from Strawberry, Mango, and Kiwi.

Figure 8: Cherry - words in round 1

Figure 9: Cherry - icons in round 2

4.3.3.3 Types of Big Changes Made

Big changes include adding features that enrich the UI, such as choosing different kinds of roads (e.g. curved roads) instead of just one default road.

Cherry added pre-designed templates and the option to create new templates after seeing this in Blueberry’s design. Grapefruit’s light sequence design inspired Honeydew to add the options to use (duplicate) an existing light sequence and to name and save light sequences, which can be seen in Figures 10, 11, and 12.

Figure 10: Grapefruit’s feature to use existing light sequences, round 1

Figure 11: Honeydew, round 1

Figure 12: Honeydew’s feature to use existing light sequences, round 2

Raspberry added the ability to trace a path and create a road with one’s mouse, which was inspired by Cranberry.

Honeydew’s popup window for lights scheme, sequence, and time was copied by both Grapefruit and Blueberry. Orange added ‘help’ and ‘instructions’ after seeing these in Strawberry’s, Kiwi’s, and Mango’s designs, and Strawberry incorporated Grapes’ notifications panel. Error handling, as seen in both Grapes’ and Kiwi’s designs, was adapted by Strawberry. Kiwi copied Lychee’s function to randomize traffic density. Mango used Grapes’ vector
