

Code quality: generating good feedback on function decomposition

Stijn de Vries

August 11, 2017

Supervisor(s): Martijn Stegeman, Jelle van Assema

Informatica
Universiteit van Amsterdam


Abstract

[…] feedback to all students. For this reason it is useful to create tools that give automated feedback. In this project, a tool to automatically generate feedback was created. This tool focuses on function decomposition in the programming assignments of beginning programmers. The tool was evaluated by comparing its detections and output against the feedback given by teaching assistants (TAs). The tool found around half the problems that were detected by the TAs and has a low rate of false positives. The feedback produced by the tool also matched the general line of the feedback given by the TAs and would therefore be useful to students.


Contents

1 Introduction . . . 7
1.1 Code quality . . . 7
1.2 Aim of this project . . . 7
2 Theoretical background . . . 9
2.1 Good and bad decomposition . . . 9
2.2 Automatically detecting bad decomposition . . . 9
2.3 Feedback . . . 10
2.4 Other tools that give feedback on code quality . . . 10
3 Detection of functions that might be split . . . 11
3.1 Detecting functions that perform more than one task . . . 11
3.1.1 Loop analysis . . . 11
3.1.2 Scope analysis . . . 12
3.1.3 Output analysis . . . 13
3.2 Detecting functions that are too long . . . 13
4 Feedback . . . 15
4.1 Loop feedback . . . 15
4.2 Scope feedback . . . 15
4.3 Output feedback . . . 16
4.4 Length feedback . . . 16
4.5 Compact output . . . 16
5 Evaluation . . . 17
5.1 Method . . . 17
5.2 Results . . . 18
5.3 Analysis . . . 19
5.3.1 True positives . . . 20
5.3.2 False negatives . . . 20
5.3.3 False positives . . . 21
6 Conclusions . . . 23
7 Discussion and Future work . . . 25
7.1 Discussion . . . 25


1 Introduction

When teaching programming it is important to give feedback. This allows students to improve their programming. As class size increases, it gets harder to give enough good feedback to all students. For this reason it is useful to create tools that can give automated feedback on easy-to-detect problems. This reduces the workload on teaching assistants (TAs), allowing them to spend more time giving feedback on more complex issues in students' code.

1.1 Code quality

There are multiple aspects to programming, and when teaching programming, many of them are taught. One of these aspects is code quality. Code quality is a property of code that concerns only the layout and structure of the code: it is the aspect that determines how easy the code is to read and understand. Stegeman et al. created a model for evaluating code quality [9]. This model was later used to generate a rubric. A rubric is a tool that can be used to evaluate work done by students; to help with this, a rubric defines a set of criteria and, for each criterion, a number of levels of accomplishment. The rubric divides code quality into 10 different criteria. In this thesis I will be focusing on one of these aspects: function decomposition. This aspect deals with how a program is separated into functions.

1.2 Aim of this project

In this project, a tool is created to detect potential improvements in programs concerning function decomposition. This tool will then give feedback on how to improve the code by separating parts of it into new functions. This tool can then be used in introductory programming courses to give feedback to students. The feedback may help students improve the quality of their code and, with the help of TAs, better understand what makes good code. However, this tool is not intended to fully replace TAs.


2 Theoretical background

2.1 Good and bad decomposition

When it comes to what is good and bad function decomposition, there are many opinions. There are a few things sources agree on:

• A function should only perform one task [9, 5, 4, 2].
• A function should not be too long [9, 4, 2].
• A function should be used if code is used in multiple locations [9, 5, 4, 3].
• A function should not be too complex [5].

This tool will focus on functions that perform more than one task, and on functions that are too long. We chose to do this because in introductory programming courses complex functions are often allowed. We will also not be looking at code duplication, since this is a different section of code quality in the rubric.

2.2 Automatically detecting bad decomposition

Most literature on detecting bad function decomposition focuses on finding dependencies in code: structural dependencies in the form of control statements, and data dependencies in the form of variable declaration and usage. There are multiple approaches for doing this:

• Using program slicing. Program slicing is a method to create sets of statements from a function. These sets include all statements that contribute to the value of a variable x at position p; this variable is called the slicing criterion. Program slicing was first proposed by Weiser [12]. These program slices can then be used to isolate the computation of the slicing criterion into its own function [1, 11].

• Using a graph containing the data and structural dependencies of the function, and removing the longest edges until two disconnected sub-graphs can be found; these two sub-graphs are then possibly separate functions [7].

• Using continuous blocks of statements that could be refactored into a function by an IDE, and scoring these based on the similarity of data dependencies between the extracted block and the function that is "left behind" [8].

All these approaches analyse the use of variables, sometimes combined with the use of control statements. They aim to detect separate tasks based on the data they use and the control statements they are executed in.


2.3 Feedback

For feedback to be considered useful, it has to be used by the student for improvement [6]. Feedback that helps students improve should include three pieces of information [6]: what the goal to reach is, where the student is now in relation to the goal, and what the student can do to get there. To help students improve, the tool will focus on giving two of these pieces of information: what the goal to reach is, and what they can do to get there.

2.4 Other tools that give feedback on code quality

Automatically giving feedback on programs falls into two large categories: evaluating the correctness of the program, and evaluating code quality. Evaluating the correctness of the program is done in the following ways:

• Evaluating the functionality of the program created by the student [10]. This is often done by comparing the output of the student program with a predetermined output or the output of a model solution with the same inputs.

• Detecting logic errors [10] by looking at the input and output of smaller sections of the program to locate the area where the error was made.

There are existing tools for giving automated feedback on code quality: some check basic style rules, like pycodestyle², while others deal with problems like function decomposition.

Tools that give automated feedback on function decomposition usually give their feedback in the form of refactoring opportunities. These opportunities are changes to the program that preserve the program's behaviour and that, as measured by the parameters defined by the tool, would improve the decomposition of the program. The programmer is then required to decide whether this change would improve the program, and then tell the tool to apply the change. Examples of such tools are the long-method refactoring of the Eclipse plugin JDeodorant³, which uses the algorithm based on block-based slicing from [11], and the tool proposed in [8]. However, these tools are not designed to be used to teach students; they are designed to help programmers improve the decomposition of large code bases, where bad decomposition can be hard to find because of the size of the code base.

These tools are not suitable for teaching students because they do not provide good feedback. They only provide the student with a way to change the program that the tool thinks will improve its quality; it is then up to the programmer to determine whether this is true. This requires that the programmer already knows what goal they are trying to reach.

² pycodestyle: https://pypi.python.org/pypi/pycodestyle
³ JDeodorant: https://github.com/tsantalis/JDeodorant


3 Detection of functions that might be split

3.1 Detecting functions that perform more than one task

The goal of the tool is to detect problems with function decomposition in student code and then give feedback to the student that helps them to improve the quality of their code. To do this, the detection method of the tool needs to satisfy the following criteria:

• The detection method should have a low rate of false positives to prevent students from being given wrong feedback and learning the wrong thing.

• The detection method should result in detections that can be solved by beginning programmers, and should therefore not include complex refactorings.

• The detection method should provide enough information about the reason for the detection to give useful feedback.

To do this effectively the proposed algorithms will only look for blocks of code that meet the following criteria:

• The code block is continuous; this means that the statements to be separated are not broken up by other, unrelated statements.

• The code block can be separated without the duplication of control structures; this means that either all statements in a control structure are included, or none.

I propose three different approaches that could satisfy these criteria; these analyses will then be evaluated.

The tool is written in Python and bases its analysis on a full syntax tree (FST) provided by redbaron. Each analysis is done for each function in the program. The analysis does have access to the entire FST and source file for the creation of more complex criteria.

The tool is also set up to allow for easy expansion with extra analysis methods, by defining each analysis in its own self-contained class, and it allows for simple dependencies between analyses in the form of output priorities.

3.1.1 Loop analysis

For every loop that is not nested in another control statement, the set of variables used in that loop is created. If the loop is a for-loop, the iterator variable is removed from this set; this is done because the iterator of a for-loop in Python can't be in the same scope as the iterator of another for-loop, excluding nested loops. Then, for every loop, this set of variables is compared to the variable sets of all other loops. If the size of the intersection of two of these sets is less than some configurable threshold, the loop is marked as separable into its own function, and feedback will be given to potentially separate this loop into its own function.

 1 def main():
 2     shortest_string_len = float('inf')
 3     shortest_string = None
 4     for string in strings:
 5         if len(string) < shortest_string_len:
 6             shortest_string_len = len(string)
 7             shortest_string = string
 8
 9     print("The shortest string is:", shortest_string)
10     strings.remove(shortest_string)
11     strings_containing_shortest_string = []
12     for search_space in strings:
13         if shortest_string in search_space:
14             strings_containing_shortest_string.append(search_space)
15
16     return strings_containing_shortest_string

Listing 1: Program in need of decomposition

If this analysis is used on the code in listing 1, the program detects the following:

• The function main has two loops.
• The first loop has a variable set: {shortest_string_len, strings, shortest_string}.
• The second loop has a variable set: {strings_without_shortest, shortest_string, strings_containing_shortest_string}.
• Their shared variable set is {shortest_string}.
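A minimal sketch of this loop analysis, written against Python's standard ast module rather than the redbaron FST the tool is actually built on. The function names are mine, and this version also counts builtin names such as len in the variable sets, so its output differs slightly from the sets above; the threshold is only illustrative:

```python
import ast

def loop_variable_sets(source):
    """Collect, per function, the set of names used in each top-level loop."""
    results = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        loop_sets = []
        for stmt in func.body:  # only loops not nested in other statements
            if isinstance(stmt, (ast.For, ast.While)):
                names = {n.id for n in ast.walk(stmt)
                         if isinstance(n, ast.Name)}
                # drop the for-loop iterator, as described above
                if isinstance(stmt, ast.For) and isinstance(stmt.target, ast.Name):
                    names.discard(stmt.target.id)
                loop_sets.append(names)
        results[func.name] = loop_sets
    return results

def separable_loops(loop_sets, threshold):
    """Flag loop pairs whose shared-variable set is below the threshold."""
    return [(i, j, loop_sets[i] & loop_sets[j])
            for i in range(len(loop_sets))
            for j in range(i + 1, len(loop_sets))
            if len(loop_sets[i] & loop_sets[j]) < threshold]
```

Running `loop_variable_sets` on a function shaped like listing 1 yields two sets, and `separable_loops` flags the pair whenever their intersection stays under the configured threshold.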

3.1.2 Scope analysis

To detect groupings of variable use that possibly signify separate tasks, the scope analysis determines the scope of each variable in a function. This is a tuple containing the first line where the variable occurs and the last line where it occurs. These scopes are then used to find lines in the function where it might be split without separating too many variable scopes. For a line to be considered for splitting, it needs to fit the following criteria:

• The line is not in the first three or last three lines of the function; this is done to prevent the creation of functions with fewer than three lines. This threshold is based on the findings in [8].

• The line needs to be outside of any control structures; this is done to prevent the duplication of control structures into the new function.

For each of these lines three values are calculated:

• The number of scopes that end above the line.
• The number of scopes that start below the line.
• The number of scopes that intersect with the line.


[…] the splitting of functions on lines that have only a function below or above them.

• Both the number of scopes above the line and below the line need to outnumber the number of scopes broken. This ensures that the resulting two functions contain more variables than they have parameters.

• The number of scopes that are broken needs to be less than 5. This is to keep the number of parameters that the new function would have at a manageable level, as recommended in [4, 5].

If this algorithm is run on the code in listing 1, the program detects the following:

• Splitting opportunity at line 8, intersected scopes: {shortest_string, strings}
• Splitting opportunity at line 9, intersected scopes: {shortest_string, strings}
• Splitting opportunity at line 10, intersected scopes: {shortest_string, strings}

When multiple potential splitting opportunities are found for a function, only one is reported to the user: the opportunity that splits the fewest scopes. When there are multiple opportunities with the same number of broken scopes, the program first checks whether any of these lines is empty; if so, these lines get priority, because an empty line simplifies separating the functions. If there are still multiple options, the first one is chosen.
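The scope analysis can be sketched as follows. This is a deliberately simplified version under several assumptions: it uses the standard ast module instead of redbaron, it skips the empty-line priority and the requirement that the split line lies outside control structures, and the exact scoring may differ from the tool's:

```python
import ast

def variable_scopes(func):
    """Map each name to (first_line, last_line) of its occurrences."""
    scopes = {}
    for node in ast.walk(func):
        if isinstance(node, ast.Name):
            first, last = scopes.get(node.id, (node.lineno, node.lineno))
            scopes[node.id] = (min(first, node.lineno), max(last, node.lineno))
    return scopes

def split_candidates(source, max_broken=4):
    """Candidate split lines, cheapest (fewest broken scopes) first."""
    func = ast.parse(source).body[0]
    scopes = variable_scopes(func)
    start, end = func.body[0].lineno, func.body[-1].lineno
    candidates = []
    # skip the first and last three lines so neither half ends up tiny
    for line in range(start + 3, end - 2):
        above = sum(1 for f, l in scopes.values() if l < line)
        below = sum(1 for f, l in scopes.values() if f >= line)
        broken = sum(1 for f, l in scopes.values() if f < line <= l)
        # both halves must hold more scopes than the split breaks,
        # and breaking at most max_broken scopes keeps parameters manageable
        if above > broken and below > broken and broken <= max_broken:
            candidates.append((line, broken))
    return sorted(candidates, key=lambda c: c[1])
```

On a function with two independent computations, the cheapest candidate is the line between them, where no variable scope is intersected.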

3.1.3 Output analysis

First, two sets are computed: one contains the variables that are returned from the function to the rest of the program, the other contains the variables that are printed from the function to the user. If the intersection between these two sets is not empty, feedback is given. This is done because a function that returns the same variable that it prints is mixing two tasks: computing information, and outputting this information to the user. In this case it is better to move the print outside the function, after the function call, or, if the print is complicated, to separate the print into its own function.

If the code that was improved using the loop analysis (listing 2) is run through the output analysis, it detects the following:

• Returned variable set: {shortest_string}.
• Printed variable set: {shortest_string}.
• Intersection: {shortest_string}.
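A minimal sketch of the output analysis, again on the standard ast module rather than redbaron; the function name is mine, and only calls to the builtin print are recognised as output:

```python
import ast

def print_return_overlap(source):
    """Per function, the names that are both printed and returned."""
    overlaps = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        printed, returned = set(), set()
        for node in ast.walk(func):
            # collect names appearing as arguments to print(...)
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id == "print"):
                printed |= {n.id for a in node.args for n in ast.walk(a)
                            if isinstance(n, ast.Name)}
            # collect names appearing in return statements
            elif isinstance(node, ast.Return) and node.value is not None:
                returned |= {n.id for n in ast.walk(node.value)
                             if isinstance(n, ast.Name)}
        overlaps[func.name] = printed & returned
    return overlaps
```

A non-empty overlap for a function is exactly the condition under which this analysis would give feedback.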

After applying this feedback the program could look like listing 3:

3.2 Detecting functions that are too long

For functions in which none of the other detection methods found a problem, the tool will look at the length of the function and give feedback based on it. For the detection of functions that are too long, the number of physical lines of code is counted. To do this, the tool first removes all comments and docstrings from the redbaron FST. The resulting FST of the function is then converted back into source code; because redbaron keeps all formatting information, the resulting function has the same layout as the input function, minus the comments and docstrings. For this code fragment, all non-whitespace lines are counted. This number is then compared to a configurable threshold; if it exceeds this threshold, the function is marked as too long.
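An approximation of this length check with the standard ast module. Note the difference from the real tool: redbaron preserves the original layout, while ast.unparse (Python 3.9+) normalises formatting, so the counts here are only approximate; the function name and threshold handling are mine:

```python
import ast
import copy

def function_lengths(source, max_length=30):
    """Per function: (line count without comments/docstrings, too_long?)."""
    report = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        stripped = copy.deepcopy(func)  # don't mutate the original tree
        body = stripped.body
        if (body and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            body = body[1:]  # drop the docstring
        stripped.body = body or [ast.Pass()]
        # comments never reach the AST, so unparsing drops them as well
        lines = [l for l in ast.unparse(stripped).splitlines() if l.strip()]
        report[func.name] = (len(lines), len(lines) > max_length)
    return report
```

The count includes the def line itself; whether the tool counts that line is not stated in the text, so treat the exact numbers as illustrative.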


 1 def get_shortest_string(strings):
 2     shortest_string_len = float('inf')
 3     shortest_string = None
 4     for string in strings:
 5         if len(string) < shortest_string_len:
 6             shortest_string_len = len(string)
 7             shortest_string = string
 8     print("The shortest string is:", shortest_string)
 9     return shortest_string
10
11 def find_containing_strings(strings, search_string):
12     strings_containing_string = []
13     strings.remove(search_string)
14     for string in strings:
15         if search_string in string:
16             strings_containing_string.append(string)
17
18     return strings_containing_string
19
20 shortest_string = get_shortest_string(strings)
21 containing_strings = find_containing_strings(strings, shortest_string)

Listing 2: Program with loop feedback used

 1 def get_shortest_string(strings):
 2     shortest_string_len = float('inf')
 3     shortest_string = None
 4     for string in strings:
 5         if len(string) < shortest_string_len:
 6             shortest_string_len = len(string)
 7             shortest_string = string
 8     return shortest_string
 9
10 def find_containing_strings(strings, search_string):
11     strings_containing_string = []
12     strings.remove(search_string)
13     for string in strings:
14         if search_string in string:
15             strings_containing_string.append(string)
16
17     return strings_containing_string
18
19 shortest_string = get_shortest_string(strings)
20 print("The shortest string is:", shortest_string)
21 containing_strings = find_containing_strings(strings, shortest_string)

Listing 3: Program with output feedback used


4 Feedback

For each detection method described in chapter 3, we can generate feedback. This feedback is not intended to tell the student exactly which parts of the code should be put into their own function, because the detection methods cannot decide with enough certainty whether a function should be split. When the tool is run on the code in listing 1, two of the four detection methods generate feedback.

4.1 Loop feedback

The feedback is generated from the information gathered by the algorithm discussed in the loop analysis. This data also includes the variables that were shared between the two loops. This information is not shown in the feedback, to keep it clear and easy to read for a beginning programmer: separating a loop into its own function is easier to understand than looking at each individual variable that the two loops share. The feedback is the following:

Feedback on function: <main>

This function has multiple loops that look like they could be split into separate functions. These loops are at lines 4, 13.

The goal pointed to in this feedback is that the function may need to be split up, and it tells the student how to get there by pointing out the loops that could be separated. If this feedback is used to split the program into functions, the result could look like the code in listing 2.

4.2 Scope feedback

The feedback is generated from the splitting options found using the algorithm discussed in the scope analysis. These options were at lines 8, 9 and 10; all three were identical, differing only in their line number. This results in the following feedback:

This function could be split at line 8. The two resulting functions would only have [shortest_string, strings] as shared variables.

The goal pointed to in this feedback is that the function may need to be split up, and it tells the student how to get there by pointing at the line where the function could be split. Applying this feedback would result in listing 2. When the improved code of listing 2 is run through the tool again, another detection method will go off.


4.3 Output feedback

The underlying feedback generator has more data than it displays to the user. This information includes the line numbers of the print and return statements that caused the detection. This information is not shown, in favour of giving the student feedback on what is often best practice.

Feedback on function: <get shortest string>

This function prints the same variable that it also returns. It is often better to separate the printing of a value and the calculating of that value into separate functions.

The goal pointed to in the feedback is to separate all output into separate functions, and the student is shown how to get there by pointing at a function that does both output and something else.

4.4 Length feedback

The feedback tells the student that the function has a certain length and that this is considered too long. It then explains why it is a good idea to split long functions. This feedback is the least specific and gives the least information; because of this, it is only given when none of the other methods detected a problem with the function.

This function is x lines long after the removal of comments, docstrings, and whitespace. Long functions are often less readable and more likely to contain errors than short functions.

This feedback only contains one of the three pieces of information: the goal. It tells the student that the function should be shorter.

4.5 Compact output

The normal output mode outputs feedback for every function in a file, which can produce a large amount of output. To limit this output, and to make students more likely to read the feedback instead of skipping to the exact line numbers given, an option can be enabled for compact output. This mode gives the same feedback, but groups it by file instead of by function, and prefixes each feedback string with a list of the functions the problem was detected in.
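A sketch of what such a compact mode might look like. The (file, function, message) tuple format is an assumption for illustration, not the tool's actual data structure:

```python
from collections import defaultdict

def compact_report(detections):
    """Group feedback by file, prefixing each message with its functions.

    `detections` is assumed to be an iterable of
    (file, function, message) tuples.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for path, func, message in detections:
        grouped[path][message].append(func)
    lines = []
    for path, messages in grouped.items():
        lines.append(f"{path}:")
        for message, funcs in messages.items():
            # one line per distinct message, listing all affected functions
            lines.append(f"  [{', '.join(funcs)}] {message}")
    return "\n".join(lines)
```

Identical messages from different functions in the same file collapse into a single line, which is the point of the compact mode.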


5 Evaluation

5.1 Method

The detections and feedback given by the tool were evaluated by comparing them to feedback given by teaching assistants (TAs). Five files from a first-year programming course were selected, using the following criteria:

• The file contains at least two detections by the tool.

• The file is no longer than 250 lines, to reduce the time required for the TAs to analyse them.

Three TAs were then asked to evaluate these five files, point out instances of bad function decomposition, and explain why this was bad function decomposition. They received the following instructions:

You will be given five files; these files are part of a first-year programming course for physics students. Could you give feedback on these files, specifically on the topic of function decomposition? While giving feedback, do not look at other problems with the code, and describe your thoughts.

During this evaluation I recorded their feedback. Later, I analysed these recordings by writing down each instance where the TAs gave feedback on the topic of function decomposition. For easier comparison with the tool, the feedback was assigned categories based on the reason the feedback was given. Each category represents a reason the TAs gave for separating the function; if a reason did not yet have a category, a new one was created for it. This resulted in the following categories:

1. Task: this part of the function is its own task.

2. Duplication: this part of the function is repeated several times.

3. Complex: this part of the function should be separated to reduce the complexity of the function.

4. Output: this part of the function displays output to the user and should be separated.

5. Length: this function should be split because it is too long.

Multiple categories can be assigned to a single instance of feedback. For example, a piece of code can be a separate task and be duplicated, so it would get the categories Task and Duplication. One of the instances of feedback given by a TA was discarded, because after analysis of the code it was determined that the feedback was based on an incorrect interpretation of the code.

Then, for every analysis, the reason given for separating was matched to one or more of the categories. This resulted in the following categories for the analyses:

• Task: loop analysis, scope analysis, output analysis
• Output: output analysis
• Length: length analysis

For the comparison of the TA feedback with the feedback of the tool, the thresholds used were the following:

• Loop analysis shared variables: 2. This was chosen as the lowest number that produced results: lowering the value further reduces the results significantly, and increasing it does not produce more results.

• Maximum function length: 30. This value was chosen to produce feedback for functions that were hard to comprehend because of their length. Although most sources agree that functions should not be too long, they do not give a good number to limit this to, and almost all agree that in some circumstances longer functions are required.

5.2 Results

The categorised feedback instances created using the feedback given by the TAs were compared to the feedback generated by the tool. The result of this comparison can be seen in tables 5.1 to 5.5.

Line number   Category           #TAs   Tool detection
51:53         Task/Duplication   3      -
66:69         Task/Duplication   3      -
88:97         Task               2      -
99:106        Task               2      -
109:111       Task/Duplication   3      -
130:137       Task               2      -
138:145       Task               2      -
150:155       Output             2      -
-             Length             -      X
-             Output             -      X

Table 5.1: Detections in Data.py

Line number   Category           #TAs   Tool detection
12:30         Task               3      X
40:41         Task               3      X
46:49         Task/Duplication   2      X
55            Task               2      X
58:60         Complex            2      -
61:63         Complex            2      -
69:72         Task/Duplication   2      X
76:83         Task               -      X
90:93         Task/Duplication   2      X
96:98         Task               1      X


Line number   Category              #TAs   Tool detection
69:83         Duplication/Complex   1      -
97:100        Task                  -      X
106           Task/Output           1      X
111:127       Task                  2      X
129:133       Task/Output           1      X
137:151       Task                  2      X
151:154       Output                2      -
160:174       Task                  3      X
175:183       Output                3      -

Table 5.3: Detections in doos2.py

Line number   Category   #TAs   Tool detection
11:23         Task       3      X
22:23         Task       -      X
26:40         Task       3      X
42:49         Task       3      X
52:69         Output     3      -
71            Output     1      X

Table 5.4: Detections in error5.py

Line number   Category              #TAs   Tool detection
57:73         Duplication/Complex   1      -
87:88         Task                  -      X
90:106        Task                  2      X
109:114       Output                2      -
115           Task                  3      X
116:137       Task                  3      X
140:146       Output                2      -
166:167       Task                  1      -
172:193       Duplication/Complex   1      -
209:210       Task                  -      X
209:232       Task                  1      X
237:243       Output                3      -

Table 5.5: Detections in week6.py

5.3 Analysis

When comparing the feedback of the tool with the feedback of the TAs, four different results may appear:

• False positive: the tool gave feedback on a piece of code that the TAs did not give feedback on, or gave feedback from a different category than the TAs did.

• False negative: The tool gave no feedback on a piece of code that the TAs did give feedback on.

• True positive: the tool gave feedback on a piece of code that the TAs also gave feedback on, and the category of the tool's feedback matches the category of the TAs' feedback.


• True negative: the remaining code that has no problems related to function decomposition. True negatives were not measured, because there is no good way to measure them.

          Tp   Fp   Fn
loops     17    5    0
scope      1    0    0
output     1    1    4
length     1    1    0
other      0    0   12

Table 5.6: Results grouped by detection method

The results in tables 5.1 to 5.5 have 20 true positives, 22 false negatives and 7 false positives.

5.3.1 True positives

In table 5.6 it can be seen that the loop analysis was by far the most successful approach for detecting bad function decomposition. The other methods did not find many problems with the code, and when they did, these were not always problems that the TAs would have pointed out.

5.3.2 False negatives

Most of the 22 false negatives can be divided into four categories:

• Linear tasks: tasks that consist of a few statements of the code that are close together.
• Other forms of output: parts of the function that perform output using other means than the print function.
• Duplication: parts of the code that, according to the TAs, should be separated because they are used in multiple locations in the code.
• Mixed tasks: tasks that should be separated but are mixed with other tasks, making the separation of these tasks harder.

The next sections discuss, for each of these categories, why the tool did not detect these problems.

Linear tasks

Scope analysis was designed to find groupings of variables and by doing this find statements that could be separated into their own function. It failed in finding these groupings because student programs showed different patterns than expected. The different tasks in the function often had the same input variables and the output of the multiple tasks was often combined at a later point in the function. This results in scopes of variables that span the entire function. This meant scope analysis could not find these groupings.

A better approach for finding these tasks might be the block-based method used in [8]. This approach determines the number of parameters and outputs required if a piece of code is put into its own function. However, as discussed in the paper, this can produce a large number of suggestions, not all of which are valid. Moreover, creating a function that accurately determines the input and output variables of an arbitrary piece of code is non-trivial; in [8] the built-in functionality of the Eclipse IDE was used.


Other forms of output

[…] The TAs were all of the opinion that this should be done in a separate function. After the evaluation, a new analysis was programmed to detect these instances. This analysis determines how a plotting library was imported and then finds all uses of this library; it then checks what percentage of the statements of a function are part of this library. If this number is below a certain threshold, it clusters these uses together and advises putting them into their own plotting function. When the output of this new analysis was compared to the feedback of the TAs, it found all of the false negatives without introducing any false positives. The tool was also designed to easily change which library is tracked by this analysis, and could therefore also be used to find groupings of other libraries.
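The text does not spell out the exact metric, so the sketch below is a guess at its shape: recover the alias under which the plotting library was imported, then compute the fraction of each function's statements that touch that alias. Only `import matplotlib.pyplot as plt`-style imports are recognised, and the function name is mine:

```python
import ast

def plot_call_fraction(source, library="matplotlib.pyplot"):
    """Per function, the fraction of statements that use the plot library."""
    tree = ast.parse(source)
    # aliases under which the plotting library is available, e.g. {"plt"}
    aliases = {a.asname or a.name
               for node in ast.walk(tree) if isinstance(node, ast.Import)
               for a in node.names if a.name == library}
    fractions = {}
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        stmts = [s for s in ast.walk(func)
                 if isinstance(s, ast.stmt) and s is not func]
        # a statement counts as "plotting" if it touches a library alias
        plotting = [s for s in stmts
                    if any(isinstance(n, ast.Attribute)
                           and isinstance(n.value, ast.Name)
                           and n.value.id in aliases
                           for n in ast.walk(s))]
        fractions[func.name] = len(plotting) / len(stmts)
    return fractions
```

The source is only parsed, never executed, so the plotting library itself does not need to be installed to run the analysis.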

Duplication

At the start of this thesis it was decided that duplication would not be added to the detection methods. In the analysis of the feedback given by the TAs, however, it came forward as an important factor in whether they would give feedback on a function. Because of this, the detection of the tool could be improved by adding a form of duplication-detection analysis.

Mixed tasks

The mixing of tasks into one complex code block often requires complex refactoring and is hard to detect. To detect these tasks, a form of slicing could be used. However, these mixed tasks often require more complicated refactorings, and could be more easily explained by the TAs themselves.

5.3.3 False positives

There are three detection methods that produced false positives. I will discuss the causes and possible solutions to these false positives in the following sections.

Loop analysis

The false positives of the loop analysis are caused by two things: the detection of one- or two-line loops that, while technically a separate task, were too small a mistake for the TAs to give feedback on; and the reporting of multiple loops in a single function where a TA would suggest separating only one and leaving the other loop in the original function. The first problem could be improved by adding the minimum length requirement used in the scope analysis (at least three lines) to the loop analysis before giving feedback. This method was tried: it removed all but one of the false positives, but it also removed three true positives. These true positives were combinations of task and duplication, pointed out by the TAs because of the duplication. The second problem is harder to prevent; a possible approach could be evaluating the number of statements left in the function after loop extraction, or prioritising certain loops using the dependency difference used in [8].

Length analysis

The length analysis is only used when the other analysis methods cannot find any problems with the code. This appears to match the reasoning of the TAs, because feedback on length gives less information than the other types of feedback. The false positives arose when the tool reported a length problem while the TAs had more specific problems with the code that the tool did not detect.

Output analysis

The false positive produced by the output analysis was due to a pattern in the code where the result of every called function was printed, and the return value was also the result of a different function call. These two functions had the same argument, which triggered the tool. This could be prevented by ignoring the parameters of functions during the search for variables inside print and return statements.
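The proposed fix can be illustrated with a small `ast`-based sketch that collects the variable names used in print and return statements while excluding the function's own parameters. This is a hedged reconstruction for illustration; the function name is hypothetical and the actual implementation in the tool may differ.

```python
import ast

def shared_print_return_vars(func):
    """Variables that appear both in a print call and in a return statement,
    ignoring the function's own parameters (the fix suggested above)."""
    params = {arg.arg for arg in func.args.args}
    printed, returned = set(), set()
    for node in ast.walk(func):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id == "print"):
            printed |= {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
        elif isinstance(node, ast.Return) and node.value is not None:
            returned |= {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}
    # subtracting the parameters removes the false-positive pattern described above
    return (printed & returned) - params - {"print"}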


6 Conclusions

A tool was created to automatically give feedback to beginning programming students. The tool detects when functions perform more than one task and when functions are too long. It was evaluated by comparing its output to the feedback given by teaching assistants. It was concluded that the tool can detect certain types of bad decomposition. The most successful of the detection methods was the loop analysis, while the scope, output, and length analyses were less successful. With some improvements, such as a minimum size requirement for new functions, the loop analysis can be refined further, reducing the false positives. If methods for duplication detection were added, the tool would be able to detect a large part of the instances of bad decomposition in the tested files, and because of its low rate of false positives, it could be used to help students improve their code quality.


7 Discussion and Future work

7.1 Discussion

There are a few things that need future improvement. The number of test files used in the evaluation was low, because analysing a file takes a TA a significant amount of time and the TAs were only available for a limited amount of time. As a result, the analysis might not be representative of all student programs. The files also came from the same programming course, so certain patterns observed might show up more frequently in this specific course. The clearest example of this is the newly added plot analysis: it is only useful if the assignment involves plotting data, as was the case in most of the used assignments.

Feedback is only useful if students can actually use it to improve their code. To fully evaluate the effectiveness of the tool, a follow-up evaluation should therefore examine whether students can understand the feedback the tool gives and improve their code based on it.

Furthermore, if more detection methods are implemented and the need for the length analysis decreases, removing length analysis altogether could improve the quality of the feedback given to students. Its threshold does not have a good scientific backing, and it does not provide the student with as much information as the other detection methods.

7.2 Future work

The current feedback contains two of the three aspects of good feedback as defined in [6]. To give better feedback, the tool could be expanded with extra algorithms that score the detected problems into levels like the ones defined in [9]. Using these levels, the tool could then give students an indication of where they are in relation to the goal.

As discussed in the results of the evaluation, the tool could be improved by changing a few things in the current detection methods and by adding duplication and plot detection. The loop detection could be improved by adding the minimum size requirement from the scope analysis. This would most likely improve the analysis by removing one- and two-line loops that TAs would not see as a mistake. A second change would be to replace the shared-variable check with an analysis based on the similarity index used in [8]. This would allow the analysis to detect single loops by comparing the loop to the rest of the function instead of to other loops; it would also allow for more flexible comparison by taking into account the number of variables used in the loop. Adding a duplication analysis to the tool would also allow it to detect specific problems that are often a reason to split a function.
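A similarity-based check along these lines could, for instance, compare the set of variables a loop uses with the set used by the rest of the function. The Jaccard-style index below is an assumption chosen for illustration; the actual similarity measure in [8] is more elaborate, and the function names are hypothetical.

```python
import ast

def names_in(node):
    """All variable names occurring anywhere inside `node`."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}

def loop_similarity(func):
    """For each top-level loop, a Jaccard-style similarity between the variables
    it uses and those used by the rest of the function; a low score suggests
    the loop is a good extraction candidate."""
    scores = []
    for stmt in func.body:
        if isinstance(stmt, (ast.For, ast.While)):
            loop_vars = names_in(stmt)
            rest_vars = set()
            for other in func.body:
                if other is not stmt:
                    rest_vars |= names_in(other)
            union = loop_vars | rest_vars
            score = len(loop_vars & rest_vars) / len(union) if union else 0.0
            scores.append((stmt.lineno, round(score, 2)))
    return scores
```

A summing loop that feeds a total used elsewhere in the function would score higher (and stay put), while a self-contained printing loop would score lower and be proposed for extraction.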

As also discussed in the analysis of the output analysis, the current version is only capable of detecting output based on print statements. A new analysis was created after the evaluation to detect output based on a plotting library. This analysis looks for groupings of calls to the library and, if they do not make up the majority of a function's statements, recommends that the student put the plotting into its own function. Using this method, all of the false negatives that were caused by the mixing of plotting with other tasks were found without producing new false positives.


[1] Abadi, A., Ettinger, R., and Feldman, Y. Fine slicing for advanced method extraction. In 3rd workshop on refactoring tools (2009), vol. 21.

[2] Fowler, M., and Beck, K. Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.

[3] Hunt, A., and Thomas, D. The Pragmatic Programmer: From Journeyman to Master. Pearson Education, 1999.

[4] Martin, R. C. Clean Code: A Handbook of Agile Software Craftsmanship, 1 ed. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2008.

[5] McConnell, S. Code Complete. DV-Professional Series. Microsoft Press, 2004.

[6] Sadler, D. R. Formative assessment and the design of instructional systems. Instructional Science 18, 2 (1989), 119–144.

[7] Sharma, T. Identifying extract-method refactoring candidates automatically. In Proceedings of the Fifth Workshop on Refactoring Tools (New York, NY, USA, 2012), WRT '12, ACM, pp. 50–53.

[8] Silva, D., Terra, R., and Valente, M. T. Recommending automated extract method refactorings. In Proceedings of the 22nd International Conference on Program Comprehension (New York, NY, USA, 2014), ICPC 2014, ACM, pp. 146–156.

[9] Stegeman, M., Barendsen, E., and Smetsers, S. Towards an empirically validated model for assessment of code quality. In Proceedings of the 14th Koli Calling International Conference on Computing Education Research (New York, NY, USA, 2014), Koli Calling ’14, ACM, pp. 99–108.

[10] Truong, N., Roe, P., and Bancroft, P. Automated feedback for ”fill in the gap” programming exercises. In Proceedings of the 7th Australasian Conference on Computing Education - Volume 42 (Darlinghurst, Australia, Australia, 2005), ACE ’05, Australian Computer Society, Inc., pp. 117–126.

[11] Tsantalis, N., and Chatzigeorgiou, A. Identification of extract method refactoring opportunities for the decomposition of methods. Journal of Systems and Software 84, 10 (2011), 1757–1782.

[12] Weiser, M. Program slicing. In Proceedings of the 5th International Conference on Software Engineering (Piscataway, NJ, USA, 1981), ICSE ’81, IEEE Press, pp. 439–449.
