
Feedback on Code Quality in Introductory Programming Courses


Martijn Stegeman

Feedback on Code Quality in Introductory Programming Courses

University of Amsterdam


This thesis was written in the context of graduating from the Master of Science program in Software Engineering at the University of Amsterdam. Versions of the thesis for other audiences, and follow-up work, will be presented on www.stgm.nl. A big thank you goes to Erik Barendsen, Sjaak Smetsers, Jurgen Vinju and Paul Klint for guiding the planning and execution of this research project. I would also like to thank Hans for pushing hard to reach this goal, and my friends, family and especially Janneke for bearing with the process and thus being catalysts to the success of the project.


Contents

Preliminaries
Introduction
Feedback and rubrics
Research questions
Method
Results: analysis of professional literature
Results: instructor interviews
Results: rubric construction
Results: evaluation
Conclusions
Discussion
Bibliography


Preliminaries

Abstract

This thesis presents research into developing a feedback instrument for code quality in introductory programming courses. To do this, we analyzed professional standards of code quality embedded in three popular software engineering books and found 401 suggestions that we categorized into twenty topics. We recorded three instructors who performed a think-aloud judgment of three student-submitted programs, and we interviewed them on the twenty topics from the books. We recorded a focus group discussion between the same instructors on a new set of programs. Using this data we constructed a rubric for giving feedback on code quality. The statements from the instructor interviews allowed us to inductively generate a set of criteria relevant to their practice of giving feedback. We then used both instructor statements and book suggestions to create descriptions of four levels of achievement associated with each criterion. One criterion was split in two during this process, resulting in a total of 9 criteria of code quality. After performing a limited reliability test and diary study, we concluded that our subjects could not reliably use the rubric to assess student submissions.

Terminology

As much as possible, I have opted for terminology that may seem dated to some, but that is more generic and less focused on specific technologies and programming paradigms. I use the word routine to mean anything that you would call a procedure, function or method. The word module is used to cover both the concept of a source file and that of a class.


Introduction

Most introductory courses in computer science are programming-focused: students learn about computing concepts for the first time by learning a programming language and actually building software. Most instructors of introductory programming courses have adopted some form of code quality as a learning goal, for example choosing an appropriate type of loop, or having a consistent coding style (ACM/IEEE-CS Joint Task Force on Computing Curricula, 2013). Such goals can lay the groundwork for more advanced treatment of code quality in software engineering-related courses. Besides being a direct learning goal, instructors tend to link a lack of code quality to problems with understanding the underlying computing concepts.

To monitor progress towards such learning goals, instructors often grade these aspects as a part of the programming assignments in their courses, and several instructors have previously published their grading schemes, defining criteria for judging student programs. Some of these criteria represent an aspect of code quality, while other criteria are more specific to the instructor and the assignment: “conformance to specifications” is an example.

We will now discuss some grading schemes that instructors have published over the years, in order to understand in what ways they assess student programs and how they came to their criteria.

Previously published grading schemes

Instructors from Jacksonville University documented how they arrived at a number of criteria through an informal consensus-based process (Hamm et al., 1983). In their scheme, execution, design and documentation are covered, but style is notably absent. They reported that having consensus about the criteria helped reduce the time needed to grade programs. Later, an instructor from St. Cloud State University recognized that instructors had previously not thoroughly documented the reasoning behind their grading schemes (Howatt, 1994). This is why he explicitly provided the rationale for each of his criteria. Notably, several were split up to make expectations clear to students. Design was separated from style, as students neglected to make an up-front design. Specification conformance was separated from execution, as students stopped implementing the remaining requirements as soon as they got part of the program working correctly.

In the past decade, instructors have created and published more detailed grading schemes, although these are still provided without justification for the contents. Becker’s example (Figure 1) is described in great detail: there are many criteria, and for each of those, three levels of accomplishment have been defined. There is particular attention to specific design choices, such as the use of magic numbers and global variables. Smith and Cordova (2005) provided a similar scheme, again without justification (Figure 2).

Presentation. Information, files, submission, demo.
Documentation. Overall impression, identifier names, indentation, external documentation, logical blocks, class diagrams, flowchart.
Design. Specifications, style, efficiency, subdivisions, flow, constants, magic numbers, globals, initialization, clean-up, structure.
Input/output/interface. Intro, finish, output, input.
Testing/error detection. Choice of test data, annotation, endpoints, detection, correction, limitations, debugging aids.

Figure 1: Summary of the rubric from Becker (2003, appendix)

Program correctness. Correct output, output quality, specifications, testing, efficiency.
Programming style. Coding style, program design, decomposition, modularity, parameter usage.
Program documentation. Neatness, clarity, general documentation, module-level documentation.
Design documentation. Neatness/clarity, completeness, agreement with code.

Figure 2: Summary of the primary trait scheme from Smith and Cordova (2005)

Several grading schemes have been published in the context of creating automated assessment tools. These tools mostly use quantitative measures of code. Jackson and Usher (1997) created a system that runs many automated tests and provides the instructor with a report. Example criteria here are correctness and complexity. As with other automated solutions, the metrics used are quite abstract, and optimizing for specific metrics could ruin the design of the program. That is why in this system, instructors always provide qualitative feedback that is based on the metrics. Another solution is to reduce the assignments to small exercises and base the feedback on a set of commonly occurring errors (Truong et al., 2004). Although this can provide students with essential understanding of many of the errors they make, this limits the feedback that can be provided.

Code quality

As described above, instructors vary in their expectations of code quality. This mirrors the practice of software engineering, where there is also no single, agreed-upon definition of this concept, even though several models of software quality have been proposed over the years. Just like the grading schemes, some of these models contain qualitative goals, such as readability or maintainability, and other models contain quantitative metrics, such as the amount of code or bugs per module (Kitchenham and Pfleeger, 1996). Here, we will define code quality as the part of software quality that can be determined by looking at the source code, i.e. without any form of testing or checking. Note that this precludes concepts like efficiency, portability, and conformance to a specification.

Research goal

We observe that instructors differ in their approaches to assessing code quality, but provide no rationale for their choice of criteria. Although such a grading scheme is often personal to the instructors and focuses on criteria that they see as important learning goals, it is notable that no single grading scheme above covers all of the criteria that are used by these instructors. We would therefore like to create a rubric that formally covers our previous definition of code quality. Our research goal is thus to construct a rubric to give appropriate feedback on code quality of programming assignments in introductory programming courses. In the next chapter, we will study how feedback works for learning and how rubrics can be used as a feedback instrument. We will then translate our goal into three research questions to be answered.


Feedback and rubrics

In order to develop our research questions, we will explore how feedback can be a part of the learning process, how rubrics have been developed as a grading tool, and how rubrics can be changed to focus on giving feedback.

Feedback for learning

Feedback given to students in a course can be used to promote learning. Ramaprasad (1983) defined feedback in management theory as “information about the gap between the actual level and the reference level of a system parameter which is used to alter the gap in some way.” Based on this definition, Sadler (1989) proposed a model of feedback for learning. He emphasized that presenting a summary grade as feedback does not satisfy learning needs. In order to learn, a student needs to know three things:

• what good performance on a task is;

• how their own performance relates to good performance;

• and what to do to close the “gap” between those.

This means that during learning, a student needs to build a conceptual model of quality that is similar to that of the teacher; needs to acquire some skill of assessing quality; and needs to find ways to close the gap. Learner, peer learner and teacher can all influence this process; when combining these actors with the three learning needs, we can discern five strategies of formative assessment (Wiliam, 2011). These strategies show how class participants can together use feedback to improve learning (Figure 3).

1. Clarifying and sharing learning intentions and criteria for success;
2. Engineering effective classroom discussions, questions, and learning tasks;
3. Providing feedback that moves learners forward;
4. Activating students as instructional resources for one another; and
5. Activating students as the owners of their own learning.

Figure 3: Classroom strategies when using feedback for learning, from Wiliam (2011).

Rubrics as a grading tool

A rubric is a tool that helps the assessment of student work by defining a set of criteria, a number of levels of accomplishment, and optionally verbal descriptors that explain the various levels (Sadler, 1985). Such a rubric can be used to calculate grades, to provide feedback, or both (Jonsson and Svingby, 2007).

Historically, rubrics have been used to guide the process of the holistic assessment of student-written English texts. The idea is that readers of a text can gain a quick impression of its quality and immediately rank or score it (Cooper, 1977). The first such tools were developed in the context of American state-wide and national testing of writing skill, requiring the assessment of large numbers of students. At the time of the introduction of rubrics, the assessors had been struggling with cost, but also reliability:

Spending no more than two minutes on each paper, raters, guided by some of the holistic scoring guides I will describe here, can achieve a scoring reliability as high as .90 for individual writers. The scores provide a reliable rank-ordering of writers, an ordering which can then be used to make decisions about placement, special instruction, graduation, or grading. For researchers and for state and national assessors, the possibilities in holistic evaluation are a reminder that they need not settle for frequency counts of word or sentence elements or for machine-scorable objective tests. (Cooper, 1977)

Paul Diederich is credited as the first creator of a rubric. In a book chapter on its development, Diederich describes the College Entrance Examination that historically had contained only essay writing questions. From 1926 on, the exam had gradually moved to having only objective tests in it. Diederich describes the continuing demand to reintroduce an essay component, although readers of essays more often than not disagreed in their assessment. These conflicting requirements led to a desire to understand what the different qualities in writing are that graders respond to. By having 300 papers sorted by each of 60 readers on the basis of general merit, the researchers were able to perform a factor analysis, and elicit specific criteria for writing (Diederich, 1965). One of the key challenges is creating useful rubrics without the costly data gathering that is associated with this process. However, the need for establishing criteria was clear:

Here, evidently, were some of the reasons why expert College Board readers had so long failed to agree. Like the distinguished readers assembled for this study, they were responding to different qualities in the papers, or they differed in the weights they attached to these qualities. One possible conclusion might be that papers in important tests of writing ability should be rated by five different readers, each of whom was especially sensitive to one of these factors. Since this was hardly feasible, it was comforting to find no solid evidence that any reader was entirely blind to any of these qualities. (p. 84)

So, if grading is to be “objective” or at least consistent between readers, the most important thing to do is to agree to a set of common criteria and their respective weights; this results in what is called an analytic scale (Cooper, 1977). As an example, Diederich’s rubric (pictured in Table 1) states eight different criteria and the possible points to award. The rubric allows the grader to calculate the final score, where the weight of the first two criteria is doubled.


                 Low      Middle      High
General merit
  Ideas          2    4     6     8    10
  Organization   2    4     6     8    10
  Wording        1    2     3     4     5
  Flavor         1    2     3     4     5   ____
Mechanics
  Usage          1    2     3     4     5
  Punctuation    1    2     3     4     5
  Spelling       1    2     3     4     5
  Handwriting    1    2     3     4     5   ____
Total                                       ____

Table 1: Diederich’s rubric from Cooper (1977).

In this case, the rubric is complemented by verbal descriptors of low, middle and high performance on each criterion. These help graders to get to know the scale and the intended use (Cooper, 1977). Students do not have access to these descriptors. An example for the spelling criterion:

High. Since this rating scale is most often used for test papers written in class, when there is insufficient time to use the dictionary, spelling standards should be more lenient than for papers written at home. The high paper usually has not more than five misspellings, and these occur in words that are hard to spell. The spelling is consistent: words are not spelled correctly in one sentence and misspelled in another, unless the misspelling appears to be a slip of the pen. If a poor paper has no misspellings, it gets a 5 in spelling.

Middle. There are several spelling errors in hard words and a few violations of basic spelling rules, but no more than one finds in the average paper.

Low. There are so many spelling errors that they interfere with comprehension.

Rubrics as a learning tool

While rubrics were developed to perform a summative assessment by quickly ranking or scoring, thereby providing no feedback to students other than their grade or rank, they can be augmented to function as a tool for formative assessment, where they are used to provide feedback to students who can use it to improve their performance. Andrade (2005) even argues for the concept of an instructional rubric. It is defined as a rubric that is primarily designed as a teaching tool instead of a scoring tool:

A rubric that is cocreated with students; handed out; used to facilitate peer assessment, self-assessment, and teacher feedback; and only then used to assign grades is an instructional rubric. It is not just about evaluation anymore; it is about teaching. Teaching with rubrics is where it gets good. (Andrade, 2005)


Many recent instructional rubrics look like the fragment in Table 2. These rubrics prominently feature the verbal descriptors, which become the focal point of the assessment process.

Criterion: spelling

Unacceptable: There are so many spelling errors that they interfere with comprehension.

Meets Requirements: There are several spelling errors in hard words and a few violations of basic spelling rules, but no more than one finds in the average paper.

Exemplary: Not more than five misspellings, and these occur in words that are hard to spell. The spelling is consistent, unless the misspelling appears to be a slip of the pen.

Table 2: One criterion in an instructional rubric.

In the case of Diederich’s work, the rubric and descriptions were only meant to be used by graders. The key idea of the instructional rubric is that both teacher and student can use the rubric to understand what is important for good performance on the task at hand (Jonsson and Svingby, 2007). This implies adding self-assessment or peer assessment to the list of educational activities where the rubric is used. Such use clearly resonates with the formative assessment strategies from Wiliam (2011) that we previously described, although we should acknowledge that introducing rubrics may not directly cause the desired learning effect, as there are many mediating and moderating factors. Still, many positive results have been reported following the introduction of rubrics in the classroom (Panadero and Jonsson, 2013).

Rubrics can be either general purpose or task-specific. General purpose rubrics can be used across a range of assignments or even courses, while task-specific rubrics are fully aligned to the particular requirements of an assignment. A general purpose rubric can be used as a long-term learning tool, assuming that a better understanding will develop as the rubric is used repeatedly by the students (Timmerman et al., 2011).

Deliberately choosing the number of achievement levels in a rubric is especially relevant to instructional rubrics. Sadler (1989) argues that many levels have to be defined. This should help student motivation by allowing them to see the results of their progress, compared to simply getting a pass/fail result. A more practical matter is that the number of levels should preferably be even, as graders tend to bias towards a middle level when the rubric is used by multiple graders (Walvoord and Anderson, 2011).

Rubrics can also be augmented or replaced by exemplars; these are partial products that represent a particular level of accomplishment. These are less explicit than descriptions of criteria, and students have to deconstruct them to understand what is expected. Interestingly, rubrics without exemplars improve learning more than when using solely exemplars or both (Lipnevich et al., 2014).


Rubric evaluation

The reliability of rubrics is often tested by checking the consistency between graders (Moskal and Leydens, 2000). Some researchers emphasize that while reliability is important, validity should also be considered, in particular the suitability of the rubric for the group of students that is being scored (Reddy and Andrade, 2010).


Research questions

We seek to give appropriate feedback on code quality for programming assignments in introductory programming courses. In the previous chapter, we have seen that rubrics are appropriate for communicating complex expectations and that they can be used to help students form their own conceptual model of quality, when used appropriately in the classroom. Thus, we will develop a rubric that can be used to assess code quality in introductory programming assignments. Quantitative criteria such as software metrics merely represent aspects of this model numerically, so the criteria in the rubric should be qualitative.

As it would be infeasible to gather a representative sample of instructors and perform a statistical factor analysis such as the one that Diederich (1965) used, we will establish a model of code quality based on an inventory of professional criteria and standards. Our first question is thus:

1. What criteria and standards of code quality are important to professional software engineers?

Such a professional standard does not necessarily match the learning goals for introductory programming courses, even when we constrain the standard to relevant topics. Therefore, we will establish what kind of quality standards instructors use and in what way they give feedback to help students learn:

2. What criteria and standards of code quality are used by instructors for programming assignments, and what kind of feedback do they give to students regarding this?

Finally, we will combine the insights we gain from answering the above questions to construct a rubric for code quality. It should be useful for giving appropriate feedback, i.e. span a spectrum of code quality that is relevant to introductory courses:

3. How can we translate the results from 1 and 2 into a reliable rubric that is useful for giving appropriate feedback to students on code quality in introductory programming courses?


Method

Our students need to build their own working model of code quality, which we can accommodate by giving feedback. We will now describe the methods used to answer our research questions.

Analysis of professional literature

• Pragmatic programmer (*)
• Code complete (*)
• Design patterns
• C programming language
• Refactoring
• JavaScript: the good parts
• Clean code (*)
• Mythical man-month
• SICP
• Effective Java
• Introduction to algorithms
• Programming pearls
• Coders at work
• Javascript: the def guide
• Learning python

Figure 4: Top 15 programming books, as fetched on February 24, 2014 from https://www.goodreads.com/shelf/show/programming.

To build a baseline of criteria and standards for code quality, we chose to sample professional standards and constrain those to topics that are relevant to introductory programming courses. Specifically, we studied popular books from the field that are concerned with code quality. Our selection of books came from the Goodreads website, which publishes a list of popular books that have been associated with a “programming” keyword by its users. The website is not programming-centric, which helped us get a selection that is representative for a diverse group of people, not specifically beginners or experts. To keep the number of books small, we filtered out any book that is focused on teaching a specific programming language, as well as any book that is mainly focused on the process of programming. This left us with books that are concerned with code. From this selection, we removed books mainly concerned with higher-level constructs such as design patterns or advanced algorithms. The selected titles are starred (*) in the list in Figure 4.

• fundamentals (variables, assignment, etc.)
• logical operators
• selection statement (if/else)
• definite loops (for)
• indefinite loops (while)
• arrays
• function/method parameters
• function/method return values
• recursion
• object-oriented basics (class definition, method calls)

Figure 5: Topics in introductory programming courses, from Tew and Guzdial (2010).

To generate a preliminary model of code quality from these books, we performed a qualitative analysis of the contents. We selected relevant suggestions on code quality by using two previously published models. One comes from Tew and Guzdial (2010), who documented which topics are common to introductory programming courses by studying a cross-section of textbooks. This provided us with a greatest common denominator to work with (Figure 5), allowing us to select only suggestions that are relevant to introductory courses. Additionally, Boehm et al. (1976) created a model that defines software quality in purely qualitative characteristics. One of these is understandability, itself comprised of five sub-characteristics. These five are precisely the properties in the model that can be determined from the source code (Figure 6), allowing us to select suggestions pertaining to code quality as we previously defined it.

The analysis consisted of three steps. First, we labeled all suggestions using the understandability characteristics that we just described.[1] As these characteristics are on an abstract level that is not commonly used by instructors to give feedback, we then inductively generated criteria within each characteristic by grouping common topics together. Finally, we categorized the new criteria into a new set of code quality themes in line with the previously published grading schemes we cited earlier. We separated some of the concepts that are independent of the code structure: comments, formatting, layout, and naming. These are common to previously published qualitative grading schemes. This leaves us with all suggestions that concern the use of the programming language itself to express the intent of the programmer. We separated the suggestions on higher-level structure (decomposition, modularization) from the suggestions on expressiveness of small structures.

[1] The full definitions are included in an appendix.

• consistency
• self-descriptiveness
• structuredness
• conciseness
• legibility

Figure 6: Understandability characteristics for software, from Boehm et al. (1976).

Instructor interviews

To find what quality criteria and standards are used by instructors of introductory programming courses, and what kind of feedback they give regarding these, we interviewed three instructors. Our interview sample was based on convenience: we selected three instructors that were immediately available (one of which was a supervisor to this thesis). They all teach at different schools in the Netherlands, and all teach different kinds of introductory courses:

• Instructor 1 teaches an introductory programming course for science students in general, which is given in Java.

• Instructor 2 teaches a course for computer science freshmen, which is also given in Java.

• Instructor 3 teaches a course for applied computer science freshmen, which is given in C.

We held two rounds of interviews: the first consisted of sessions with each of the individual instructors, and the second was a focus group with all instructors together. The semi-structured interview protocol that we used for both rounds is detailed in Figure 7.

Introductions
• Notice the audio recording of the interview.
• We will be talking about code quality of programming assignments.
• We will not be talking about correctness of the implementation.

Think-aloud judgment
• Let me first present you with three solutions to the same programming assignment.
• The assignment is from the third week/last week of a first programming course.
• What do you think of these solutions: what is important to you, what could have been better, and what strikes you particularly?

Existing criteria
• I will now briefly name some criteria from programming books.
• For each, I would like to know if you use that criterion to judge your students, and if you can name examples.

Figure 7: Protocol for interview rounds 1 and 2.

The first round, consisting of separate interviews, aimed to elicit criteria that instructors individually use. It was based on a programming assignment in C from the third week of an introductory course. We started with the recording of a think-aloud protocol while the participant gave feedback about three different solutions to a programming assignment. We chose a think-aloud judgement because we intended to record the criteria that instructors actually use, instead of recording what they claim to use. After the think-aloud judgment, the instructors were presented with each of the criteria we found in the professional programming books. We did this to make sure our generated view of code quality would become as complete as possible and not limited by the assignment used.

The second round, being a focus group, aimed to generate a broader understanding of what levels of achievement can be distinguished for the criteria. We also wanted to discuss more advanced criteria by basing this round on the final assignment in a Java course that treats object-orientation about as far as described by Tew and Guzdial (2010).[2] Here too, we started with a think-aloud judgement and followed up with a discussion of the criteria from the professional programming books.

[2] Student code from rounds one and two is reproduced in the appendix.

We processed the interviews by inductively coding all statements about observable code quality that were expressed. To perform the coding, we printed all statements from each interview on index cards, grouped the remarks that use more or less the same wording to describe quality properties, and then summarized each group by assigning a description such as “comments should not repeat the code” or “choosing the right variable scope”. The groups were then clustered as criteria, and the criteria were taken together to form themes. As these criteria would be the main source for generating the rubric, we kept in mind that we should form enough criteria to facilitate giving detailed feedback, but not let the amount grow so much that assessment would take unreasonably long.

Rubric construction

As described previously, we can construct a rubric by defining a set of criteria, formulating a number of levels of accomplishment, and writing verbal descriptors that explain the various levels.

We initially based our selection of criteria on the topics that we generated in the analysis of the instructor interviews. That way, our system of criteria could become instruction-centric and not profession-centric. To be able to use the book suggestions for the contents of the rubric, we labeled the suggestions using the derived criteria.

For the number of levels, we have previously seen that it should preferably be even, but provide more than pass or fail information, so we constructed the rubric using four levels. To create a simple progression, we chose the following definitions:

1. problematic features are present
2. core quality goals not yet achieved
3. core quality goals achieved
4. achievement beyond core quality goals

To create an instructional rubric that can facilitate learning, we also wrote verbal descriptors that can be used to understand differences between levels of achievement. Although the criteria were defined by only using statements made by the instructors, we did not want to constrain the contents of the rubric to the view of quality currently held by the three instructors we happened to choose. Therefore, we used both their statements and the suggestions from the books to compose verbal descriptors for each of the achievement levels of the rubric. For each criterion, we then analyzed the statements and suggestions and placed them on appropriate levels of accomplishment. We identified explanations of how to apply constructs to achieve higher quality code as core goals, examples of what it means to not achieve such goals were marked as problematic features, and examples of going beyond the core goals were marked as maximum achievements. We allowed for splitting criteria where the descriptions indicated incompatible learning goals.

Evaluation

Any rubric should go through a phase of testing and refinement (Walvoord and Anderson, 2011). We bootstrapped this iterative process by performing a single evaluation, where we defined two goals: checking the usefulness for giving code quality feedback, and checking the inter-rater reliability.

To check reliability, two experienced teaching assistants used the rubric to grade five assignments each. The assistants have both worked in multiple introductory programming courses, where feedback is given using a small rubric that has four briefly specified criteria.[3] We calculated inter-scorer reliability using Cohen’s kappa to identify weakly described criteria.

[3] These criteria are: scope, correctness, design and style.
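For reference, Cohen’s kappa relates the observed agreement between the two raters to the agreement expected by chance; this is the standard definition of the statistic and is not specific to this thesis:

\kappa = \frac{p_o - p_e}{1 - p_e}

where p_o is the proportion of submissions on which the two raters assigned the same level, and p_e is the proportion of agreement expected by chance given each rater’s own distribution over the levels.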

We consider the rubric useful as far as it accommodates any and all feedback that graders want to provide to students. During the evaluation, the assistants kept a diary of all observations they made about the rubric and the scoring process. We asked them to focus on ambiguities, questions and criteria they could not assess. We used the diary to explain problems in reliability, and to evaluate the usefulness of the rubric.


Results: analysis of professional literature

Here we describe the results of our analysis of professional literature. We used three popular programming books to generate a broad view of code quality that should be relevant to introductory programming courses.

Data reduction

conciseness 44
consistency 38
legibility 249
self-descriptiveness 13
structuredness 57

Table 3: Frequency count of suggestions per characteristic.

Selecting suggestions from the books by using introductory topics (Tew and Guzdial, 2010) and characteristics of understandability (Boehm et al., 1976) resulted in 401 individual items. The result of labeling all suggestions using the understandability characteristics is presented in Table 3.

Inductively grouping the suggestions within each characteristic resulted in the topics described in Table 4, and categorizing the topics under a new set of themes is described in Table 5.

The 21 suggestions from The Pragmatic Programmer (Thomas and Hunt, 1999) that we selected were all connected to either the “Don’t repeat yourself” principle, good documentation, or structure. The 167 suggestions from Clean Code (Martin, 2008) and 213 from Code Complete (McConnell, 2004) both span most of the criteria that emerged from the inductive grouping; the only exception is that Martin (2008) has no suggestions specifically about the appropriate use of idiom.

Below, we describe the contents of the books based on the themes from Table 5.

conciseness 44
  dead code 6
  duplication 4
  fragmentation 2
  minimalist comments 32
consistency 38
  appropriate idiom 28
  formatting consistency 3
  naming consistency 7
legibility 249
  affinity 3
  clarification 30
  clear control flow 27
  comment content 6
  expressive formatting 18
  formatting 21
  module size 1
  naming 108
  order 11
  routine size 3
  type signature 21
self-descriptiveness 13
  comment content 13
structuredness 57
  abstraction 19
  focus 38

Table 4: Frequency count of suggestions for each of the content-derived topics within the characteristics of understandability.

About comments

Suggestions on comments can be separated into two themes: arguments for having as little commenting as possible, and suggestions about appropriate content. A theme that is present in all books is to only comment if strictly needed:

Why am I so down on comments? Because they lie. Not always, and not intentionally, but too often. The older a comment is, and the farther away it is from the code it describes, the more likely it is to be just plain wrong. The reason is simple. Programmers can’t realistically maintain them. (Martin, 2008)


The books use different perspectives for minimizing comments. Code should preferably be self-documenting by use of good names and a clear structure (Martin, 2008; McConnell, 2004). Comments can be redundant as they repeat information that can be derived from the code (Martin, 2008; McConnell, 2004; Thomas and Hunt, 1999). There is also information that is often listed in header comments but can usually be found in, for example, a source control system: authorship, revision history, the name of the current file, etc. (Martin, 2008; Thomas and Hunt, 1999). The repeated argument for minimizing comments is that comments can get obsolete quickly (Martin, 2008; McConnell, 2004):

The older a comment is, and the farther away it is from the code it describes, the more likely it is to be just plain wrong. The reason is simple. Programmers can’t realistically maintain them. (Martin, 2008)

comments 51
  comment content 19
  minimalist comments 32
expressiveness 85
  appropriate idiom 28
  clarification 30
  clear control flow 27
formatting 42
  expressive formatting 18
  formatting 21
  formatting consistency 3
layout 20
  affinity 3
  dead code 6
  order 11
naming 115
  naming 108
  naming consistency 7
structure 88
  abstraction 19
  duplication 4
  focus 38
  module size 1
  routine size 3
  type signature 21
  fragmentation 2

Table 5: Frequency count of suggestions for each of the content-derived topics in the new classification.

However, there are still categories of information that are complementary to the code, so there are many suggestions for including appropriate information in comments. McConnell summarizes:

The three kinds of comments that are acceptable for completed code are information that can’t be expressed in code, intent comments, and summary comments. (McConnell, 2004)

In contrast to this, Martin (2008) suggests that even summary comments should hardly be needed, as routines and classes should be of minimal length. However, McConnell (2004) recognizes the value of such summaries, allowing readers to quickly scan the code. Also, header comments should describe at least how to use routines and classes (McConnell, 2004; Thomas and Hunt, 1999).

All books describe small decisions and problems that can be highlighted by using comments, such as exceptions in control flow or significant data type declarations.

Finally, Martin (2008) wants comments to be precise and spelled correctly.
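As an illustration of these suggestions, consider the following constructed Java fragment (it is not taken from the books or from the thesis materials, and the course-specific rule in it is invented): the first routine is littered with comments that merely repeat the code, while the second is self-documenting and carries a single intent comment.

import java.util.List;

public class CommentExample {
    // Redundant: each comment merely restates the statement next to it.
    static double averageVerbose(List<Double> grades) {
        double sum = 0;                     // set sum to zero
        for (double grade : grades) {
            sum += grade;                   // add the grade to sum
        }
        return sum / grades.size();         // divide sum by the count
    }

    // Self-documenting code; the only comment records intent that the code
    // itself cannot express.
    static double averageOfResits(List<Double> grades) {
        // The first entry is the original attempt and is excluded by policy.
        List<Double> resits = grades.subList(1, grades.size());
        double sum = 0;
        for (double resit : resits) {
            sum += resit;
        }
        return sum / resits.size();
    }

    public static void main(String[] args) {
        System.out.println(averageVerbose(List.of(6.0, 8.0)));       // 7.0
        System.out.println(averageOfResits(List.of(4.0, 6.0, 8.0))); // 7.0
    }
}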

About visual formatting

The first major theme in formatting is consistency. This theme is present in Martin (2008) and McConnell (2004):

You should take care that your code is nicely formatted. You should choose a set of simple rules that govern the format of your code, and then you should consistently apply those rules. If you are working on a team, then the team should agree to a single set of formatting rules and all members should comply. (Martin, 2008)

The second theme is the need for the formatting to mimic the structure of the program:

The Fundamental Theorem of Formatting says that good visual layout shows the logical structure of a program. (McConnell, 2004)


Other suggestions in Martin (2008) and McConnell (2004) link specific types of formatting to that goal: related statements should be grouped and separated by a blank line; indentation should consistently follow the scope of the statements; multi-line statements should be split at a point that makes clear that they are unfinished; white space and parentheses should be used to emphasize the expected evaluation order of expressions; and brackets should be used to emphasize flow control where it is not self-evident.

Nearly all code is read left to right and top to bottom. Each line represents an expression or a clause, and each group of lines represents a complete thought. Those thoughts should be separated from each other with blank lines. (Martin, 2008)

Finally, line length should normally not be more than 80–100 characters, but long lines should be the exception, and not the rule (Martin, 2008; McConnell, 2004).
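A small sketch of what these suggestions amount to in practice (the program itself is made up for illustration): related statements are grouped with blank lines, indentation follows scope, parentheses make the intended evaluation order explicit, and brackets mark the control flow.

public class FormattingExample {
    public static void main(String[] args) {
        // One group of related statements: reading the dimensions.
        int width = Integer.parseInt(args[0]);
        int height = Integer.parseInt(args[1]);

        // A second group: deriving values; parentheses show the intended
        // evaluation order even where operator precedence already implies it.
        int area = width * height;
        boolean isLandscape = (width > height) && (area > 0);

        // Brackets emphasize the control flow, even around single statements.
        if (isLandscape) {
            System.out.println("landscape, area " + area);
        } else {
            System.out.println("portrait or square, area " + area);
        }
    }
}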

About the layout of code in files

For the layout of code in files, we found three themes: the idea of putting related parts close together, the order in which to put parts, and the presence of old code.

Putting related parts together in the code is called “affinity” by Martin (2008). It requires, for example, that related routines are placed close together in a source file. McConnell (2004) adds that each class should be in a separate file.

Ordering of code in a file can be optimized for readability (Martin, 2008), for example by putting the most used routines at the top and related routines directly below. An alternative is to order routines alphabetically (McConnell, 2004). Consistency in a project also plays a role here; for example by always putting variable declarations at the top (Martin, 2008).

If one function calls another, they should be vertically close, and the caller should be above the callee, if at all possible. This gives the program a natural flow. If the convention is followed reliably, readers will be able to trust that function definitions will follow shortly after their use. [. . . ] This makes it easy to find the called functions and greatly enhances the readability of the whole module. (Martin, 2008)

Old code, such as commented-out code or routines that are never called, should always be removed according to Martin (2008). McConnell (2004) adds that variables can also remain unused.
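Sketched in Java (a constructed example, not taken from the books), the ordering advice places the caller above the callee so that the file reads top to bottom, and leaves no old code behind:

public class LayoutExample {
    // The caller comes first ...
    public static void main(String[] args) {
        printGreeting("world");
    }

    // ... and the callee follows shortly after its first use.
    private static void printGreeting(String name) {
        System.out.println("Hello, " + name + "!");
    }

    // Old code (commented-out statements, never-called routines, unused
    // variables) would simply be deleted rather than kept in the file.
}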

About the names of things

The books emphasize three themes for naming: expressiveness, readability and consistency. Martin (2008) summarizes the first goal as follows:

Choosing names that reveal intent can make it much easier to understand and change code.


All books support this idea and provide ways to achieve expressive names. Names should cover the complete abstraction (Martin, 2008; McConnell, 2004), be concise (Martin, 2008) and distinctive (Martin, 2008; McConnell, 2004). For distinctiveness, many examples are given: the use of very generic words such as flag for names is pointed out as potentially problematic (McConnell, 2004). The books also suggest what to avoid here: intentional misspellings (McConnell, 2004; Martin, 2008) or multiple similar names with number postfixes (McConnell, 2004). In contrast to expressiveness, names can become meaningless:

Variable names, of course, should be well chosen and meaningful. foo, for instance, is meaningless, as is doit or manager or stuff. Hungarian notation (where you encode the variable’s type information in the name itself) is utterly inappropriate in object-oriented systems. Remember that you (and others after you) will be reading the code many hundreds of times, but only writing it a few times. Take the time to spell out connectionPool instead of cp. (Thomas and Hunt, 1999)

Some names can be plainly misleading, for example when abusing generic pre- and postfixes:

The word “list” means something specific to programmers. If the container holding the accounts is not actually a List, it may lead to false conclusions. (Martin, 2008)

There are also suggestions for the readability of names: for example, having clear word boundaries in names by using underscores or camel casing, using positive boolean names, and using easy to pronounce names (McConnell, 2004; Martin, 2008).

Consistency in names is mostly related to vocabulary (Martin, 2008; McConnell, 2004). The books suggest that modules have a noun name or noun phrase name in line with platform conventions. There are also names that are common in programming, for example done, error, found and success (McConnell, 2004).
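A constructed Java example of the contrast the books draw (the names and the task are invented for illustration):

public class NamingExample {
    // Meaningless names: the reader has to reconstruct the intent.
    static int f(int[] a, int x) {
        int c = 0;
        for (int v : a) {
            if (v > x) {
                c++;
            }
        }
        return c;
    }

    // Intention-revealing names: they cover the abstraction, use clear word
    // boundaries (camel case), and avoid generic words such as "flag".
    static int countGradesAbove(int[] grades, int threshold) {
        int gradesAbove = 0;
        for (int grade : grades) {
            if (grade > threshold) {
                gradesAbove++;
            }
        }
        return gradesAbove;
    }

    public static void main(String[] args) {
        int[] grades = {4, 7, 9};
        System.out.println(f(grades, 5));                // 2
        System.out.println(countGradesAbove(grades, 5)); // 2
    }
}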

About the structure

For routines, we found that size should mostly be limited (McConnell, 2004; Martin, 2008). This is reflected in many suggestions that propose to constrain the focus of each routine. Martin (2008) advises to create extremely small routines that are just two, three or four lines long, while McConnell (2004) speaks of individual routines that could be allowed to grow to a size of 100–200 lines. However, this is deemed an exception. In general, functions should do one thing, as emphasized by McConnell (2004):

Functional cohesion is the strongest and best kind of cohesion, occurring when a routine performs one and only one operation.

If the routines still have multiple tasks, McConnell (2004) argues to separate these into parts as much as possible. This includes variables (McConnell, 2004; Martin, 2008).


As a general rule, the variables you initialize before the loop are the variables you’ll manipulate in the housekeeping part of the loop. (McConnell, 2004)

Martin (2008) and Thomas and Hunt (1999) argue to limit the amount of variables that are shared between routines. This can be contrasted to having not too many parameters, which is also argued by Martin (2008). McConnell (2004), while acknowledging the problems programmers have with parameters, focuses on putting parameters in a natural or consistent order and giving them intention-revealing names.

Finally, removing repetition in routines is an explicit goal in all books. Just as with comments, Thomas and Hunt (1999) argues against such duplication:

Avoid similar functions: often you’ll come across a set of functions that all look similar — maybe they share common code at the start and end, but each has a different central algorithm. Duplicate code is a symptom of structural problems. (Thomas and Hunt, 1999)

For modules, we find that they should have well-defined subjects (McConnell, 2004; Martin, 2008). Martin (2008) cites the “single responsibility principle” to support this. McConnell (2004) calls it “presenting a consistent level of abstraction.” All books argue for defining modules such that communication between them is limited.

At the highest level of design we find a trade-off between keeping modules small and preventing fragmentation of the system as a whole. Martin (2008) argues that fragmentation is of lower priority, however. The same trade-off happens between preventing fragmentation of individual modules, thus having many routines, and the preference to keep routines small.
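A minimal sketch of functional cohesion and duplication removal (the example program is invented, not taken from the books or the student submissions): a routine that would otherwise both reverse a list and print it is split into two routines that each perform one operation, and callers compose them instead of duplicating the loop.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CohesionExample {
    // Each routine performs one and only one operation ...
    static List<Integer> reversed(List<Integer> numbers) {
        List<Integer> result = new ArrayList<>(numbers);
        Collections.reverse(result);
        return result;
    }

    static void printAll(List<Integer> numbers) {
        for (int number : numbers) {
            System.out.println(number);
        }
    }

    // ... and callers compose them, so the reversing loop is written once.
    public static void main(String[] args) {
        printAll(reversed(List.of(3, 1, 2)));
    }
}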

About the expressiveness of the code

Concerning expressiveness, we found three larger themes: having a simple control flow, using appropriate idiom, and having simple expressions.

Control flow should be kept simple by avoiding nested structures (McConnell, 2004; Martin, 2008) and keeping structures like loops short (McConnell, 2004; Martin, 2008). Furthermore, McConnell (2004) advises to feature the nominal path through the code most prominently, for example by always keeping the expected case of a selection statement in the if clause and not in the else clause. Also, do not use too many return and break statements to jump out of the normal flow (McConnell, 2004). However, McConnell (2004) says that sometimes, it is actually the right thing to do:

In certain routines, once you know the answer, you want to return it to the calling routine immediately. If the routine is defined in such a way that it doesn’t require any further cleanup once it detects an error, not returning immediately means that you have to write more code. (McConnell, 2004)


Finally, both McConnell (2004) and Martin (2008) suggest not to do more than one thing per line, especially if it is considered a side-effect.

For the use of language idiom, only McConnell (2004) gives many examples of choosing the right structure, and of using structures in a misleading way. For example, when choosing a loop, prefer a for loop when it’s appropriate, such as when looping over a known range or a certain number of times. Otherwise, use a while loop. Also use a while loop any time you need to jump out of the middle of the loop (McConnell, 2004). Custom use of control structures can be very misleading to other programmers:

It’s bad form to use the value of the loop index after the loop. The terminal value of the loop index varies from language to language and implementation to implementation. The value is different when the loop terminates normally and when it terminates abnormally. Even if you happen to know what the final value is without stopping to think about it, the next person to read the code will probably have to think about it. It’s better form and more self-documenting if you assign the final value to a variable at the appropriate point inside the loop. (McConnell, 2004)
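Sketching this advice in Java (the data and variable names are made up): a for loop is used for the known range, the found value is stored in a variable inside the loop rather than read from the loop index afterwards, and a while loop is used when the number of iterations is not known in advance.

public class LoopIdiomExample {
    public static void main(String[] args) {
        int[] temperatures = {18, 21, 25, 19};

        // Known range: a for loop; the result is captured inside the loop
        // instead of relying on the terminal value of the loop index.
        int firstWarmDay = -1;
        for (int day = 0; day < temperatures.length; day++) {
            if (temperatures[day] > 20) {
                firstWarmDay = day;
                break;
            }
        }
        System.out.println("first warm day: " + firstWarmDay);

        // Unknown number of iterations: a while loop.
        int value = 25;
        int halvings = 0;
        while (value > 1) {
            value = value / 2;
            halvings++;
        }
        System.out.println("halvings: " + halvings);
    }
}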

For expressions, we found three themes: keeping them simple, using the right data types, and naming all constants. In general, for keeping expressions simple, the books suggest performing logical simplifications, using intermediary variables, and making implicit or explicit comparisons (Martin, 2008; McConnell, 2004).

Instead of merely testing a boolean expression, you can assign the expression to a variable that makes the implication of the test unmistakable. (McConnell, 2004)

Only McConnell (2004) provides suggestions for choosing the right data type, and focuses on often-overlooked enums and structures. All three books comment on the use of unnamed constants, or “magic numbers.”

This is probably one of the oldest rules in software development. I remember reading it in the late sixties in introductory COBOL, FORTRAN, and PL/1 manuals. In general it is a bad idea to have raw numbers in your code. You should hide them behind well-named constants. (Martin, 2008)
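A small constructed example of both suggestions (the grading rule and the constants are invented): the raw numbers are hidden behind named constants, and each test is assigned to a boolean variable that makes its implication explicit.

public class ExpressionExample {
    // Named constants instead of "magic numbers" scattered through the code.
    static final double PASSING_GRADE = 5.5;
    static final int MAX_ATTEMPTS = 3;

    static String verdict(double grade, int attempts) {
        // Intermediary booleans make the implication of each test unmistakable.
        boolean hasPassed = grade >= PASSING_GRADE;
        boolean mayRetry = !hasPassed && attempts < MAX_ATTEMPTS;

        if (hasPassed) {
            return "passed";
        } else if (mayRetry) {
            return "retry allowed";
        } else {
            return "failed";
        }
    }

    public static void main(String[] args) {
        System.out.println(verdict(6.0, 1)); // passed
        System.out.println(verdict(4.0, 1)); // retry allowed
        System.out.println(verdict(4.0, 3)); // failed
    }
}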


Results: instructor interviews

Below, we describe the results of two rounds of interviews with instructors. We combined the remarks they made during the think-aloud judgment of student submissions with the remarks that were prompted by describing criteria derived from the professional literature. This led to a first set of criteria that span code quality in introductory courses.

Deriving the criteria

We selected 178 statements about observable code quality from the instructor interviews. These cover the think-aloud judgements as well as the remarks prompted by describing the topics from professional literature. The inductive analysis of the statements formed twelve criteria and a further grouping generated four themes (Table 6). Each of the criteria that emerged was supported by a variety of statements from interviews.

Documentation: names, comments
Presentation: layout, formatting
Algorithms: flow, expressions
Structure: decomposition, modularization

Table 6: Criteria derived from the instructor interviews.

The number of coded statements is provided in Table 7. Of the individual criteria, relatively few statements were on layout: this aspect was mentioned in only 5 statements by instructors 2 and 3. Names and modularization were represented by 11 statements each, while all other criteria were represented in 21 or more statements.

                Instructor 1   Instructor 2   Instructor 3   Total
Documentation        13             21             14          48
Presentation          4             16              9          29
Algorithms           26             14             19          59
Structure            20             12             10          42
Total                63             63             52         178

Table 7: Number of statements from instructors that were coded.

In terms of differences between the first and second round of interviews, we find that especially instructor 1 made many more statements on decomposition in round two. There were also a few more statements on modularization. Conversely, there were fewer statements on flow, formatting and comments.

Looking at the differences between prompted and spontaneous statements, we find that modularization is only represented when prompted. Naming generated more statements when prompted than spontaneously, while all other criteria were represented more often in spontaneous statements than in prompted statements.

We will now discuss the statements that the instructors provided for each of the eight criteria.

Names

The instructors gave feedback on the appropriate names of routines, modules or variables. Instructor 2 said: “The name of a class should precisely indicate what that thing has or does.” There were some negative examples: one instructor found that a longer name did not actually describe what the routine did, and two instructors saw problems with short unexplained variable names.

One could say “it’s a small program and it does not really matter that you violate the rules,” but even then I think it would be wise to teach them to pick good names. So now that I think about it, I would probably deduct a point for that.[4] (Instructor 1)

[4] All quotes in this chapter have been translated from Dutch by the author.

Instructors 1 and 2 also commented on names that they found hard to read:

I would prefer that variables, if they contain multiple words, use camel casing, so I can easily read them. (Instructor 1)

Comments

All instructors stated that comments should add meaning to the code: instructors 1 and 2 made this explicit, and instructors 2 and 3 gave feedback on redundant comments that repeat what is in the code.

The comments should be at a higher level of abstraction. You should describe the intent behind what you are doing, not provide a textual version of what follows. (Instructor 1)

Two instructors expressed problems with having too many comments: instructors 1 and 3 stated that comments in code should often not be necessary, provided that the code is simple. On the other hand, instructors 1 and 2 noted that routines and modules should normally have header comments. Instructors 2 and 3 also gave feedback on some occasions where comments were missing.

Layout

Instructors 2 and 3 both remarked on the presence of old code and the ordering of code in a file. Instructor 2 said that old code would not necessarily result in points deducted, but that it would be commented on. Instructor 2 stated that students having old code happens very often. Instructor 3 noted that one solution had instance variables at the bottom of a module, and instructor 2 concluded that this ordering should be consistently applied.

Formatting

All instructors mentioned some form of formatting. The methods that were discussed: grouping with blank lines (2 and 3), extra brackets (1), indentation (1, 2 and 3), controlling line length (2) and spacing (2 and 3). Two goals for formatting were stated by the instructors: making code structure explicit (all), and emphasizing similarities and differences (2):

I think that is very nice, very symmetrical. I like symmetric code; if you do similar things they should be similarly formatted. (Instructor 2)

All instructors noted that formatting choices should be consistently applied. Instructor 3 mentioned two cases where he said formatting was misleading.

Flow

The instructors all listed reasons why they thought the flow control could be too complex: deep nesting (1 and 3), choice of control structures (all), performing more than one task per line (1 and 3), having exceptions in the flow (1 and 2), having large blocks of code in a conditional (all), and library use (all). Instructor 3 gave feedback on the misleading use of idiom. Instructor 1 said about the choice of control structure:

I think it’s really neat that he uses enhanced for loops.

Expressions

All instructors commented on the redundancy of some expressions for loops and conditionals, where the expression partially retests a previous condition. Instructors 1 and 2 also noted various tests that were completely duplicate.

This is not DRY, as you have the same test: one in the if statement, and the other in the while. (Instructor 2)

All instructors also commented on using hardcoded literals in expressions (magic numbers). They all said that this should be avoided and replaced by named constants.
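The kind of redundancy the instructors describe can be illustrated with a constructed fragment (not taken from the student submissions): the if repeats exactly the test that the while already performs, so it can simply be dropped.

import java.util.Iterator;
import java.util.List;

public class DuplicatedTestExample {
    public static void main(String[] args) {
        // Redundant: the if restates the condition of the while.
        Iterator<String> lines = List.of("alpha", "beta").iterator();
        if (lines.hasNext()) {
            while (lines.hasNext()) {
                System.out.println(lines.next());
            }
        }

        // The loop condition already guards the body, so no if is needed.
        Iterator<String> moreLines = List.of("gamma", "delta").iterator();
        while (moreLines.hasNext()) {
            System.out.println(moreLines.next());
        }
    }
}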


Decomposition

The instructors all said that routines should be limited in length and should perform a limited number of tasks.

Inverting lists and printing them, we don’t do that. That is when we need to have a separate function for inverting and a separate function for printing. (Instructor 3)

Instructors 1 and 2 also commented that if a routine performs multiple tasks, they should be clearly separated within the routine. The same instructors gave feedback on having loop counters as class-wide variables.

That list of instance variables contains i and index. That is a big problem. It really violates every rule in the book. (Instructor 2)
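The problem the instructor points at can be illustrated with an invented fragment in which a loop counter is first declared class-wide and then moved into the smallest scope that needs it:

    public class Inventory {

        private final int[] stock = {3, 0, 7, 1};

        // Problematic: the loop counter is an instance variable,
        // although only one routine needs it.
        private int i;

        int countEmptyWithSharedCounter() {
            int empty = 0;
            for (i = 0; i < stock.length; i++) {
                if (stock[i] == 0) {
                    empty++;
                }
            }
            return empty;
        }

        // Preferred: the counter is local to the loop that uses it.
        int countEmpty() {
            int empty = 0;
            for (int i = 0; i < stock.length; i++) {
                if (stock[i] == 0) {
                    empty++;
                }
            }
            return empty;
        }
    }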

Modularization

All comments about modularization were made during the discussion of suggestions from the professional literature, and not during the think-aloud judgment. Instructors 1 and 2 emphasized separation of concerns, where instructor 1 stated that “you rarely have too many classes,” and instructor 2 said that classes should not be longer than a page of code. They also put a limit on the separation; both stated that classes should not be artificially separated.


Results: rubric construction

In this chapter, we describe the results of combining the statements made by instructors and the suggestions in the professional literature to create the levels and descriptors of our rubric. The resulting rubric is included at the end of this chapter.

Revisiting the criteria

We based our set of eight criteria for the rubric on the analysis of the instructor interviews. To start creating levels and writing descriptors, we first labeled all book suggestions using these criteria. Doing so led us to split the comments criterion into two, as the book suggestions reinforced the idea from the instructors that comment headers and comments in the code have differing goals. As one instructor said:

If I’m writing a program for solving a quadratic equation, then I should put that at the top.

(Note: this quote was translated from Dutch by the author.)

This mandatory aspect of explaining what larger parts of the code do is also present in the books. This is in contrast to comments in the code, where an important theme shared by the instructors and the books is that such comments should only be included when strictly needed. This is why we separated headers from comments as a criterion, resulting in a total of 9 criteria, as summarized in Table 8.

Documentation   Presentation   Algorithms    Structure

names           layout         flow          decomposition
headers         formatting     expressions   modularization
comments

Table 8: Criteria derived from the instructor interviews.

How to read this chapter

Below, we list the sources we used for the verbal descriptors in the rubric. As we have previously described, the rubric we created contains four levels of achievement for each criterion. In the descriptions, g signifies a core goal of the criterion. Such a goal was usually supported by at least the instructor statements, and landed in levels 2 and 3 of the rubric. p signifies problematic symptoms in code that are related to the criterion; sometimes this is a violation of simple heuristics (e.g., old code still present), sometimes it indicates non-effort (e.g., meaningless names). These became level 1 of the rubric. s signifies a stretch goal that involves experimentation and more advanced trade-offs. The stretch goals together formed level 4 of the rubric.

Names

Instructors and books emphasize that names should describe the intent g of the underlying code. Both contrast this to meaningless p names; for example, one- or two-letter variable names that have no apparent connection to the intended meaning. The books add misleading p names. Furthermore, the books list ways in which names can be more meaningful in the context of a program: complete, distinctive (supported by one instructor) and concise.

The books mention that names can be more or less readable. One instructor stated a way to achieve this: having clearly separated parts by way of camel casing. Examples in the books concern other easy-to-fix problems, such as needless abbreviation or using hard-to-distinguish characters. We therefore added unreadable p names as a problem to the rubric.

In the books, there are many examples of having a consistent vocabulary s . As this requires experimentation and some experience to get right, we marked this as a stretch goal.

Headers

Although the books in general argue that comments should hardly ever be needed, this is partly based on having a decomposition containing only very small routines; as this is a separate goal, we take the position that header comments should usually be present, although the stretch goal still requires only essential information.

Instructors state that routines and modules should normally have header comments. These usually summarize g the goal of each part of the program and explain parameters, i.e. how to use g the part. One book and the instructors also noted the goal of correct spelling g . Instructors and books both emphasize that redundant p comments are a problem. The books add that descriptions can be obsolete p , and instructors noted that headers can be completely missing p .

The books list several types of comments that are redundant because the information is available outside of the code. Providing only essential s information can thus be seen as a stretch goal. To allow students to progress from missing headers to only essential information, we added incomplete and wordy descriptions as intermediary stages.


Comments

The books explain that good comments in the code provide elaboration of decisions g and potential problems p . As with header comments, these can be redundant, obsolete p , missing p or use mixed languages p .

Instructors and books both state that comments in the code should usually not be necessary, provided that the code is simple. We defined commenting only where strictly needed s as a stretch goal. Between missing comments and only commenting where strictly needed, we added wordy and concise descriptions as intermediary stages.

Layout

The books suggest that optimizing layout for readability is a goal g , and specify two aspects of this: grouping code and ordering it. The latter is also supported by two instructors. Together, we call these “arrangement” in the rubric. Doing this consistently s between files is emphasized by one instructor and by the books.

Two instructors offered that old code was regularly present, but they did not say this was a big problem. Because the books argue that having old code p is easily avoided, we list this as a problem.

Formatting

In the books as well as in the interviews we found that many syntactic features can be used to make code easier to read. Indentation, blank lines, spacing and brackets, used to consistently highlight g the intended structure of the code, are named as the most important factors. We defined it as a problem when these are missing p or plainly misleading p , as indicated by the instructors.

A more complex goal is highlighting similarities and differences s between lines of code by formatting them in a consistent fashion. One instructor in particular valued this use, and the books give several examples, so we include this as a stretch goal.
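What such symmetric formatting may look like is sketched in the invented fragment below; aligning the similar assignments is of course only one of several ways to emphasize similarity:

    public class Viewport {

        // Similar assignments formatted similarly: the aligned names and
        // values make the parallel structure visible at a glance.
        int left   = 0;
        int right  = 640;
        int top    = 0;
        int bottom = 480;
    }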

One instructor commented that line length p prevented the code from fitting on the screen in one case. As the books support this by suggesting that line length be limited, and it is easily controlled, we add it as a problematic feature.

Flow

Instructors and books define simplifying g and limiting exceptions g as goals for the control flow. Choosing appropriate g control structures and libraries is named by all instructors and in one book.

Instructors have problems with deep nesting p and performing more than one task per line p . They and the books also mention


Prominently featuring the expected or nominal s path was named by all instructors and is treated as an important topic in one book. Because this requires refactoring, we add this as a stretch goal.

Expressions

The books define a goal of having simple g expressions. Because it is prominently featured, we put it into the rubric. We also adopt choosing appropriate g data types as a goal from the books. The instructors in particular named duplication p of (partial) expressions and the use of unnamed constants p as problems. Having only essential s expressions, thus not covering a subset of another expression, can be seen as a complex goal, as it requires a good grasp of data types and edge cases.
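The notion of essential expressions can be made concrete with an invented fragment: in the first routine, the second condition retests what is already known in that branch, while the second routine keeps only the essential test:

    public class Shipping {

        // Redundant: "weight <= 20" is already known to hold in this branch.
        static double costRedundant(double weight) {
            if (weight > 20) {
                return 10.0;
            } else if (weight <= 20 && weight > 5) {
                return 5.0;
            } else {
                return 2.5;
            }
        }

        // Essential: each condition contains only the test that adds information.
        static double cost(double weight) {
            if (weight > 20) {
                return 10.0;
            } else if (weight > 5) {
                return 5.0;
            } else {
                return 2.5;
            }
        }
    }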

Decomposition

Instructors and books would like to see a decomposition that results in most routines having a limited set of tasks g and in duplication g being eliminated. Within routines, instructors and books would like different tasks to be separated into parts g . Because the instructors generally placed a strict limit on tasks, we considered it a problem to put most code in one or a few routines p .

A related goal named by both instructors and books is having a limited number of shared variables g , i.e. reducing scope. Reusing variables for multiple purposes is a related violation p that was reported by instructors and books.

The associated stretch goal is to have routines perform very limited s sets of tasks while at the same time limiting the number of shared variables and parameters.

Modularization

Based on the statements from one instructor, and supported by the books, we included having a clearly defined subject g as a goal for modularization. Elaborating on this, having a limited number of routines was requested by one instructor and all books, and having a limited number of variables only by the books. The books in general go beyond this by requiring a more advanced separation of concerns that minimizes communication s between modules. We included this as a stretch goal. On the other hand, instructors noted that students sometimes perform artificial separation, which can be seen as a problem p ; this is also supported by the books, which argue for limiting the number of modules.
