
Automated Docstring Generation for Python Functions

Michael Anthony Cabot
6047262

Master thesis
Credits: 42 EC
Master Artificial Intelligence

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: dr. M.W. van Someren

December 3, 2014

Abstract

This work presents a novel approach to automatically generate descriptive summaries for functions in software by viewing the generation of documentation as a task analogous to that of machine translation, i.e. the act of translating a text from one natural language to another. The source code of a function is treated as the language that needs to be translated (source language) and the function's corresponding description is treated as the language into which it needs to be translated (target language). Open source software projects were used to train a phrase-based Statistical Machine Translation (SMT) system that learns the relations between source code and description. The goal of this project is to automatically generate English descriptions, known as docstrings, for functions written in the programming language Python. The empty translation, i.e. the translation of a source phrase with an empty target phrase, was used to help generate a summary of a function's behaviour by ignoring unimportant parts of a function's source code that are not explicitly mentioned in the corresponding docstring. The results show an improvement in the quality of the generated docstrings when using empty translations.

Contents

1 Introduction

2 Related Work

3 Parallel Corpus
  3.1 Abstract Syntax Tree
  3.2 Docstring
  3.3 Cleaning the data

4 Statistical Machine Translation
  4.1 Word Alignments
  4.2 Phrase pair extraction
  4.3 Features
    4.3.1 Translation Model Features
    4.3.2 Language Model
    4.3.3 Heuristic features
  4.4 Decoder
  4.5 Optimization
  4.6 Overview

5 Experiments
  5.1 Data
  5.2 Filtering
  5.3 Evaluation
  5.4 Optimization
  5.5 Training Data
  5.6 Qualitative analysis
  5.7 Ranges
  5.8 Out-of-Domain Data

6 Discussion and Conclusion

A In-Domain Repositories

1 Introduction

Every programmer can appreciate well documented software. When you are continuing a project you have not looked at for a long time, or when you are trying to make use of a software library, documentation is essential to quickly and effectively understand the software you want to use or improve upon. Without documentation you are forced to delve into the implementation, which can take up valuable time and money. At the same time, writing and maintaining the documentation of your software is tedious work. Consequently, some may not find it necessary while others might prioritize finishing the code when facing an approaching deadline. Be it due to negligence or lack of time, poor upkeep of documentation will hurt in the long run. Functions without documentation will need to be inspected, while functions with outdated documentation will cause misuse. This paper explores the automated generation of documentation for functions in order to reduce the amount of time programmers need to spend on documenting their source code.

First, it is important to make a distinction between the type of documentation generation that is meant in this paper and the documentation generation tools that exist on the internet. Sphinx [1], for example, is a documentation generator which extracts documentation from software, i.e. comments in source code files, and converts it into HTML websites and other formats such as PDF. Similarly, the tools doc-o-matic [2], happydoc [3] and imagex extract documentation, as well as other information, from source code and use it to construct software documentation. Thus these tools do not generate the documentation of a software project from scratch; rather, they offer a means to extract existing documentation from software and generate a document that gives an overview of the software's functionality, e.g. an overview of all the classes, functions and the relations between different classes. This work, on the other hand, presents a novel approach to automatically generate descriptive summaries for functions in software by viewing the generation of documentation as a task analogous to that of machine translation, i.e. the act of translating a text from one natural language to another. The source code of a function is treated as the language that needs to be translated (source language) and the function's corresponding description is treated as the language into which it needs to be translated (target language). Open source software projects were used to train a phrase-based Statistical Machine Translation (SMT) system that learns the relations between source code and description. The goal of this project is to automatically generate English descriptions, known as docstrings, for functions written in the programming language Python.

In phrase-based statistical machine translation (SMT) (Koehn et al., 2003), bilingual sentences are decomposed into smaller units, words or phrases, which are aligned such that paired units constitute possible translations of one another. For example, the Spanish-English bilingual sentence Devolver los números pares - Return the even numbers could have the following alignment:

Devolver los números pares

Return the even numbers

Pairing the English word numbers with the Spanish word números implies that numbers becomes a candidate translation of números. The phrase pair números-numbers could then be applied when translating another sentence containing the word números. The alignments between source and target sentences are used to build statistical models that model the relation between the source and target language (translation model) and the structure of the target language (language model). Given a source sentence, the decoder searches among all possible candidate translations and retrieves the one that scores highest according to the statistical models.

Figure 1 shows a function whose documentation is equal to the English side of the aforementioned bilingual sentence. The function takes as input a list of numbers and returns only those numbers in the list that are even. Similarly to choosing alignments between natural language sentences, alignments can be found between source code and its documentation. For example, the word even could be aligned with the operation number % 2 == 0, which asserts whether a number is even. Another possibility is to align the phrase the even numbers with the variable even, onto which all even numbers are appended. Following the SMT methodology, these alignments are used to train statistical models that model the relation between a function's source code and the corresponding documentation. Alignments found in the training data can then be applied when generating the documentation for a new function. Figure 1 should also give an impression of the nature of the docstrings. Each docstring consists of one or more natural language sentences and corresponds to the source code of one function. Similarly, the assumption is made that the documentation in the software projects used to train the SMT system consists of sentences written in natural language.

A difference between natural language translation and documentation generation is the fact that documentation forms a summary of its function's behaviour. This means that not all parts of the source code are mentioned

[1] http://sphinx-doc.org/
[2] www.doc-o-matic.com
[3] http://happydoc.sourceforge.net

def even_numbers(numbers):
    """ Return the even numbers """
    even = []
    for number in numbers:
        if number % 2 == 0:
            even.append(number)
    return even

Figure 1: Function that returns all even numbers in a given list.

in its documentation. When using an SMT approach to generate documentation, this implies that not all parts of the source code need to be translated. For example, the for-loop statement in Figure 1, which iterates over the given numbers, is not explicitly mentioned in the documentation. The task of generating documentation for a function can thus be seen as both translating source code to natural language and summarizing its behaviour. The additional step of summarizing a function's behaviour makes documentation generation more difficult than natural language machine translation. When translating from one natural language to another, multiple valid target sentences might exist, all of which convey the full meaning of the source sentence. With documentation generation, not only could there be multiple ways to convey the same information, there is an additional degree of freedom as to what part of a function needs to be captured by the summary. It is likely that the level of detail captured by a function's summary would differ in the context of different projects or when written by different programmers. This greatly increases the difficulty of extracting phrase pairs between functions and their documentation. Additionally, assessing the validity of candidate summaries becomes more difficult since a valid candidate summary does not necessarily need to resemble the reference summary.

An SMT system is trained on a parallel corpus consisting of aligned bilingual sentences, where each source sentence corresponds to one target sentence. When dealing with function documentation, one function corresponds to one summary. However, there is no guarantee that the documentation of a function consists of a single sentence. Sometimes multiple sentences are used to describe a function's behaviour. This again stems from the inherent lack of structure in written documentation, i.e. there are no fixed rules as to how a function should be documented. To minimize the number of sentences used to describe the behaviour of a function, filters are used to remove unnecessary parts of the documentation when constructing the training set (see Section 3.2).

Another difference between natural language translation and documentation generation is the fact that source code does not have a well defined vocabulary. Some tokens, such as if-statements and for-loops, are defined by the programming language while others, such as function names and variable names, are defined by the programmer. Finding alignments between source code and documentation relies on consistent co-occurrences of source code and documentation terms. The success of the SMT system thus relies in part on naming conventions being enforced consistently throughout the software projects.

This paper details how a phrase-based SMT system is used to generate docstrings from the source code of a function. First, Section 2 gives an overview of related work. Section 3 describes how a parallel corpus is created from open source software projects. The parallel corpus consists of real world examples of documented functions. Section 4 describes how the parallel corpus is used to build the statistical models of the SMT system and how the statistical models are used to generate candidate docstrings for unseen functions. Section 5 shows the results of the experiments. Lastly, Section 6 discusses and concludes this work.

2 Related Work

Little research has been done on automating the generation of documentation in software. The only known work is that done by Sridhara et al. (2010), who describe a technique to automatically generate descriptive summaries for methods written in the programming language Java. The authors divide the act of creating a summary into two problems: content selection, selecting the parts of the source code to be used in the summary, and text generation, converting those pieces of source code into natural language. During content selection, the method's features, such as its Abstract Syntax Tree, Control Flow Graph and identifiers, are fed to SWUM: the Software Word Usage Model. SWUM identifies linguistic elements from a piece of code, such as its action, theme and secondary argument:

Based on common Java naming conventions, [Sridhara et al. (2010)] assume that most method names start with a verb. SWUM assigns this verb as the action and looks for the theme in the rest of the name, the formal parameters, and then the class.

Several heuristics are then applied to select those parts of the source code that seem most important. In the end, templates are used to construct phrases from the linguistic elements extracted by SWUM which are combined to form sentences. A survey was conducted to evaluate the accuracy, content adequacy and conciseness of the generated summaries. The results show that the generated summaries were accurate, did not miss important content, and were reasonably concise.

Sridhara et al. (2010) and this work share the same goal of generating descriptions for functions, but differ greatly in their approaches. The work of Sridhara et al. (2010) depends on heuristics and templates to select content from the source code and produce their summaries, respectively, whereas this work learns the importance of pieces of source code (content selection) and the relation between source code and natural language phrases (text generation) from human made descriptions. By learning from human made descriptions this work offers a framework for generating descriptions in any natural language for source code written in any programming language.

While little research has been done on documentation generation, comparisons can be drawn with summarization techniques. Banko et al. (2000) note that most previous work on summarization has focused on extractive summarization: selecting sentences from a document and combining these to form a summary document. Their work describes an alternative approach to summarization capable of generating summaries shorter than a sentence by viewing summarization as a problem analogous to statistical machine translation:

The issue then becomes one of generating a target document in a more concise language from a source document in a more verbose language.

Similar to Sridhara et al. (2010), Banko et al. (2000) divide the task of generating a headline for a document into two sub-tasks: content selection and surface realization. However, Banko et al. (2000) differ in their approach from Sridhara et al. (2010) in that they use statistical models, instead of heuristics and templates, to achieve their goal. The content selection task involves selecting words from a document that are to be used in the headline. This is modelled by the likelihood of a word w_i occurring in a headline H given that it appears in a document D. The assumption is made that the likelihood of a word in the headline is independent of all other words in the headline. The surface realization task involves assigning a probability to the ordering of the candidate headline. Banko et al. (2000) use a bigram language model, which conditions the probability of each word on its immediate left context. Additionally, a candidate headline is scored according to the number of words it contains. The overall probability of a candidate headline can then be computed using a log-linear model of (1) the likelihood of the selected words, (2) the probability of the summary's length and (3) the probability of the headline's word ordering. The best headline Ĥ for a document D can be found by searching through all possible candidate headlines and selecting the one with the highest probability:

$$\hat{H} = \arg\max_H \Big( \alpha \cdot \sum_{i=1}^{n} \log p(w_i \in H \mid w_i \in D) + \beta \cdot \log p(\mathrm{len}(H) = n) + \gamma \cdot \sum_{i=2}^{n} \log p(w_i \in H \mid w_{i-1} \in H) \Big)$$

where α, β and γ denote the weights of the three features. The Viterbi algorithm (Forney Jr, 1973) was used to efficiently solve this problem and find a near-optimal solution. Their system was evaluated on the word-error rate between the generated headlines and the reference headlines. The results show a word-error rate of around 0.30.

Similarly, this work adopts the statistical machine translation methodology: a log-linear model of similar features is used to assign probabilities to candidate summaries and the Viterbi algorithm is used to search for the most likely candidate summary given the source code of a function. The main difference between this work and Banko et al. (2000) is the required translation step from source language to target language. Banko et al. (2000)'s content selection model is estimated by examining the frequency with which a word occurs in a headline given that it occurs in the document. When translating a piece of source code, however, it is not immediately clear what parts of the source code correspond to what parts of the documentation and would thus constitute a valid translation. Section 4.2 explains how the translation model is altered to adopt the functionality of the content selection model used by Banko et al. (2000).

3 Parallel Corpus

The first requirement for building an SMT system is a parallel text corpus from which statistical models can be derived. In order to generate documentation for functions, the parallel corpus must consist of functions' source

code paired with their documentation. The following sections discuss the two main components of creating a parallel corpus: (1) representing a function's source code as a sentence and (2) extracting and filtering the relevant sentences from the function's documentation.

In this work all descriptive summaries are generated for functions written in the programming language Python. Explaining Python's syntax and semantics lies beyond the scope of this paper. To learn more about Python the reader is referred to https://www.python.org/. Note, however, that this framework is not restricted to Python software and can be applied to generate documentation in any natural language for functions written in any programming language. All that is required is enough training data containing documented functions of the required natural language and programming language pair.

3.1 Abstract Syntax Tree

An abstract syntax tree (AST) is a tree representation of a piece of source code. Each node in the AST refers to a construct in the source code, e.g. a function definition, an if-statement or a variable assignment. By converting a file containing Python source code into an AST, the ASTs of all functions can be extracted by traversing the file's AST and searching for nodes that correspond to functions. To convert a file into an AST, all that is required is that there are no syntax errors in the source code. Runtime errors have no impact since the code is not executed. The documentation of a Python function, henceforth known as a docstring, can also be found in the AST, since it is maintained during the runtime of a program.

Figure 2 shows the function sum, which returns the sum of its input, alongside a representation of its corresponding AST. The nodes are named after the corresponding classes of the Python AST module [6]. If a node contains content, it is shown below the class name. For example, the variable iterable belongs to the class Name and the number 0 belongs to the class Number. The source sentence consists of a linearised AST, which is constructed by traversing the AST top-down, left-to-right. When encountering a node with content, the node's content is added to the source sentence instead of the node's name. Nodes named Str, which contain a string, were not replaced by their content, to prevent large amounts of words being added to source side sentences. In Figure 2 the node Str contains the function's docstring. The nodes Expr and Str corresponding to the function's docstring were removed from all source sentences, since (1) they do not add information about the function and (2) they are not present in the AST of a function without a docstring. To further decrease the size of the source sentences, the context nodes (Param, Store and Load) were also removed. The function's linearised AST together with its corresponding docstring forms a pair of parallel sentences, as shown in Figure 2.
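As an illustration, the following is a minimal sketch of this linearisation using Python's ast module. It follows the description above (top-down, left-to-right traversal, content instead of class names where available, docstring and context nodes dropped), but the helper names and the exact node handling are assumptions made for this example rather than the thesis' implementation; it also targets the Python 3 ast module, whereas the thesis used Python 2.

import ast

CONTEXT_NODES = ('Load', 'Store', 'Param')  # context nodes are removed from the source sentence

def linearise(node):
    """Linearise an AST node top-down, left-to-right into a list of tokens."""
    name = type(node).__name__
    if name in CONTEXT_NODES:
        return []
    if isinstance(node, ast.FunctionDef):
        tokens = [node.name]              # use the function's name instead of 'FunctionDef'
    elif isinstance(node, ast.Name):
        tokens = [node.id]                # variable names are used as content
    elif isinstance(node, ast.arg):
        tokens = [node.arg]               # formal parameter names (Python 3 'arg' nodes)
    elif isinstance(node, ast.Constant):
        # numbers are expanded to their value, strings are kept as a bare 'Str' token
        tokens = ['Str' if isinstance(node.value, str) else str(node.value)]
    else:
        tokens = [name]
    for child in ast.iter_child_nodes(node):
        tokens.extend(linearise(child))
    return tokens

def documented_functions(source):
    """Yield (linearised AST, docstring) pairs for every documented function in a file."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            docstring = ast.get_docstring(node)
            if docstring:
                tokens = [node.name] + linearise(node.args)
                for statement in node.body[1:]:   # skip the docstring expression
                    tokens.extend(linearise(statement))
                yield tokens, docstring

Applied to the function in Figure 2, this yields the source sentence "sum arguments iterable Assign total 0 For item iterable Assign total BinOp total Add item Return total" paired with the target sentence "Return the sum of all items in iterable."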

3.2 Docstring

A docstring is a string that is used to document a piece of source code. Unlike traditional documentation, docstrings are not removed when the source code is compiled, allowing the program to access the documentation at runtime. As shown in Section 3.1, this allows a docstring to be extracted when a function's abstract syntax tree (AST) is constructed. Programming languages that do not adopt the docstring, for example Java, will require a different approach to extract descriptions from the source code.

When examining human made docstrings, the focus is put on the first few sentences that describe the behaviour of the corresponding function. These are the sentences that will need to be automatically generated by the SMT system. Before aligning a function's source code with its docstring, any additional information, such as side-effects or parameter descriptions, needs to be filtered out. The following filters were applied to remove unwanted pieces of documentation from a docstring.

remove doctests: A doctest [7] is a piece of text containing an interactive Python session which can be executed to verify the behaviour of a piece of code. The following is an example of a doctest that could be included in the docstring in Figure 2 to verify that the function returns 10 when given the list [1, 2, 3, 4]:

>>> sum([1, 2, 3, 4])
10

All doctests were removed from the docstrings.

keep first description: A docstring sometimes contains multiple sections. Manual inspection of the software projects showed that the first section of a docstring often gives an overview of the function's behaviour, while subsequent sections explain finer details, side effects, possible exceptions or requirements on input variables and the return value. This filter assumes that sections are separated by at least one empty line and removes all sections but the first, thus maintaining the part of the docstring that describes the behaviour of the function.

[6] https://docs.python.org/2/library/ast.html
[7] https://docs.python.org/2/library/doctest.html

def sum(iterable):
    """ Return the sum of all items in iterable. """
    total = 0
    for item in iterable:
        total = total + item
    return total

[The middle panel of Figure 2 shows the function's abstract syntax tree, with nodes such as FunctionDef (sum), arguments, Name (iterable, total, item), Number (0), Assign, For, BinOp, Add, Return, the docstring nodes Expr and Str, and the context nodes Param, Store and Load.]

source: sum arguments iterable Assign total 0 For item iterable Assign total BinOp total Add item Return total
target: Return the sum of all items in iterable.

Figure 2: A Python function that returns the sum of its input (top), the corresponding abstract syntax tree (middle) and the extracted parallel sentences (bottom).

remove wx wrappers: The wx module, one of the used Python projects, contains a large number of functions that are wrappers for functions written in C++ [8]. Their docstrings are formatted as follows:

name(argument_1, ..., argument_n) -> return_value

Functions with this type of docstring do not contain descriptions in natural language and were therefore not added to the parallel corpus [9].

remove parameter descriptions: As noted before, it is desirable that parameter descriptions are removed from docstrings. Some repositories have the convention of declaring a function's parameters as follows:

:param variable_name: description

Such lines were removed from the docstrings.

The goal of these filters was to obtain docstrings that best describe their corresponding function. To achieve the best result, each Python project should use custom filters that best reflect the structure and conventions of its documentation. However, in this work the four filters were applied to all Python projects. Creating specific filters for each repository is left as future work.
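A minimal sketch of these filters, applied to a single raw docstring, is shown below; the regular expressions and the function name are illustrative assumptions, not the thesis' code, and real projects would need the custom filters mentioned above.

import re

def filter_docstring(docstring):
    """Apply the Section 3.2 filters to one raw docstring (sketch)."""
    # keep first description: keep only the text before the first blank line
    first_section = docstring.strip().split('\n\n')[0]

    kept, in_doctest = [], False
    for line in first_section.splitlines():
        stripped = line.strip()
        if stripped.startswith('>>>'):               # remove doctests ...
            in_doctest = True
            continue
        if in_doctest and stripped:                  # ... including their expected output
            continue
        in_doctest = False
        if re.match(r':param\s+\S+\s*:', stripped):  # remove parameter descriptions
            continue
        kept.append(stripped)

    text = ' '.join(line for line in kept if line)
    # remove wx wrappers: signature-only docstrings such as 'name(arg_1, ...) -> value'
    if re.match(r'^\w+\(.*\)\s*->', text):
        return None                                  # the function is not added to the corpus
    return text or None

print(filter_docstring('Return the even numbers.\n'
                       ':param numbers: a list of integers\n'
                       '>>> even_numbers([1, 2, 3, 4])\n'
                       '[2, 4]\n\nFurther details that are dropped.'))
# Return the even numbers.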

3.3 Cleaning the data

After constructing the parallel corpus, the corpus was tokenized and cleaned using two preprocessing scripts from the Moses framework (Koehn et al., 2007): tokenizer.perl [10] and clean-corpus-n.perl [11]. The tokenizer was used on the docstrings to split words and separate punctuation. A custom tokenizer was used on the linearised AST that split compound words on underscores and camelCase. For example, the compound words remove_flags and removeFlags were both split into the phrase remove flags. The cleaning script was used to account for MGIZA's (a word alignment tool) requirements: a sentence pair was removed from the parallel corpus if one of the two sentences contained more than 100 words or if one sentence contained more than 9 times as many words as the other sentence. Both source and target sentences were reduced to lower case.
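The following is a small sketch of such an identifier tokenizer; the regular expression and function name are illustrative choices rather than the thesis' actual script.

import re

def split_identifier(token):
    """Split a compound identifier on underscores and camelCase boundaries."""
    words = []
    for part in token.split('_'):
        # a camelCase boundary lies before an upper-case letter that starts a new word
        words.extend(re.findall(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z0-9]+|[A-Z]+', part))
    return [word.lower() for word in words if word]

print(split_identifier('remove_flags'))  # ['remove', 'flags']
print(split_identifier('removeFlags'))   # ['remove', 'flags']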

[8] http://www.cplusplus.com/
[9] Furthermore, this type of documentation is discouraged by the Python Enhancement Proposals: http://legacy.python.org/dev/peps/pep-0257/.
[10] https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
[11] https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl

The result is a parallel corpus consisting of aligned source and target sentences, where each source sentence consists of a linearised AST and each target sentence consists of a corresponding docstring. Note that the filters applied to the docstrings do not guarantee that a docstring consists of a single natural language sentence. Consequently, a target sentence can contain multiple natural language sentences. This, however, has no consequences for the workings of the SMT system since the docstring is seen as one target sentence.

4 Statistical Machine Translation

This section details the phrase-based SMT system used to generate docstrings for functions. Firstly, it shows how statistical models are built from the parallel corpus. These statistical models, along with some heuristic features, are used by the decoder to search for the most likely docstring given a linearised AST. Lastly, it is shown how the performance of the decoder is optimized.

4.1 Word Alignments

In SMT, the task at hand is finding the target sentence e that best translates a source sentence f. For this we need to model the probability of a target sentence given a source sentence, p(e|f). Using Bayes' theorem this can be decomposed as

$$p(e|f) = \frac{p(e)\,p(f|e)}{p(f)} \tag{1}$$

Finding the target sentence with the highest probability, ˆe, becomes ˆ e = arg max e p(e)p(f |e) p(f ) (2) = arg max e p(e)p(f |e) (3)

which contains the following three computation problems:

language modelling problem: estimating the language model probability p(e).

translation modelling problem: estimating the translation model probability p(f|e).

search problem: constructing an efficient search algorithm to find the target sentence that maximizes the product of the language model probability and the translation model probability.

Brown et al. (1993) try to solve the translation modelling problem by introducing word alignments. An alignment a is defined as a set of connections between a source sentence of m words, f = f_1^m ≡ f_1 ... f_j ... f_m, and a target sentence of l words, e = e_1^l ≡ e_1 ... e_i ... e_l:

$$a \subseteq \{(j, i) : j = 1, \ldots, m;\ i = 0, \ldots, l\} \tag{4}$$

The translation model probability p(f|e) can be written in terms of the conditional probability p(f, a|e) as

$$p(\mathbf{f}|\mathbf{e}) = \sum_a p(\mathbf{f}, a|\mathbf{e}) \tag{5}$$

where a is an alignment between e and f. Different models have been proposed that decompose the probability p(f, a|e). MGIZA (Gao and Vogel, 2008), the word alignment tool used in this work, makes use of the IBM models (Brown et al., 1993) and the HMM model (Vogel et al., 1996), which are trained using the EM algorithm (Dempster et al., 1977). The EM algorithm consists of two steps:

E-step: For all sentences in the training corpus, calculate the expected number of times that target word e connects to source word f for the sentence pair (f, e) according to the alignment model:

$$c(f|e; \mathbf{f}, \mathbf{e}) = \sum_a p(a|\mathbf{f}, \mathbf{e}) \underbrace{\sum_{j=1}^{m} \delta(f, f_j)\,\delta(e, e_{a_j})}_{\text{number of times } e \text{ connects to } f \text{ in } a} \tag{6}$$

where δ is the Kronecker delta function, which returns one if both of its arguments are equal and zero otherwise.

M-step: Calculate the model parameters according to the counts of the word alignments:

$$p(f|e) \propto \sum_{s=1}^{S} c(f|e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}) \tag{7}$$

where (f^(s), e^(s)) is one of the S sentence pairs in the training corpus.

After optimizing the parameters of the chosen model, the best alignment â can be found by computing:

$$\hat{a} = \arg\max_a p(\mathbf{f}, a|\mathbf{e}) \tag{8}$$

Take, for example, the IBM-1 model (Brown et al., 1993). IBM-1 assumes all source sentence lengths to have equal probability and assumes all connections for each source word to be equally likely. This results in the following decomposition of the probability p(f, a|e):

$$p(\mathbf{f}, a|\mathbf{e}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j|e_{a_j}) \tag{9}$$

where ε is a small fixed number corresponding to the uniform distribution of the source sentence length, (l+1)^{-m} corresponds to the uniform distribution of the source word alignments [12] and t(f_j|e_{a_j}) is the lexical translation probability of the source word at index j given the target word with which the source word is aligned. Combining Equation 5 with Equation 9 gives the following optimization function:

$$p(\mathbf{f}|\mathbf{e}) = \frac{\epsilon}{(l+1)^m} \sum_a \prod_{j=1}^{m} t(f_j|e_{a_j}) \tag{10}$$

Maximizing Equation 10, constrained by Σ_f t(f|e) = 1, gives rise to the following two equations, which compute c(f|e; f, e) and t(f|e):

$$c(f|e; \mathbf{f}, \mathbf{e}) = \frac{t(f|e)}{\sum_{i=0}^{l} t(f|e_i)} \underbrace{\sum_{j=1}^{m} \delta(f, f_j)}_{\text{count of } f \text{ in } \mathbf{f}} \underbrace{\sum_{i=0}^{l} \delta(e, e_i)}_{\text{count of } e \text{ in } \mathbf{e}} \tag{11}$$

$$t(f|e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f|e; \mathbf{f}^{(s)}, \mathbf{e}^{(s)}) \tag{12}$$

where S is the number of sentence pairs in the training corpus and λ_e is a Lagrange multiplier which indicates that the translation probabilities must be normalized.

Using Equations 11 and 12, the EM algorithm can be applied to optimize the parameters t(f|e):

1. Initialize all t(f|e).
2. For each sentence pair (f^(s), e^(s)), use Equation 11 to compute c(f|e; f^(s), e^(s)).
3. For each source word f and target word e, use Equation 12 to calculate t(f|e).
4. Repeat steps 2 and 3 until t(f|e) converges.

After optimizing the IBM-1 model's parameters, t(f|e), the optimal alignments can be computed according to Equations 8 and 9.
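The following is a minimal sketch of IBM-1 EM training over a toy corpus; the function and variable names are chosen for this example, and the actual word alignments in this work were produced with MGIZA rather than code like this.

from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """Estimate IBM-1 lexical translation probabilities t(f|e) with EM.

    corpus: list of (source_words, target_words) sentence pairs. A NULL token is
    added to every target sentence so that source words may stay unaligned.
    """
    source_vocab = {f for f_sent, _ in corpus for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(source_vocab))   # uniform initialization

    for _ in range(iterations):
        counts = defaultdict(float)   # expected counts c(f|e)
        totals = defaultdict(float)   # normalization per target word e
        for f_sent, e_sent in corpus:
            e_sent = ['<NULL>'] + e_sent
            for f in f_sent:
                # E-step: distribute one fractional count of f over all target words
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    delta = t[(f, e)] / norm
                    counts[(f, e)] += delta
                    totals[e] += delta
        # M-step: re-estimate t(f|e) from the expected counts (Equation 12)
        for (f, e), count in counts.items():
            t[(f, e)] = count / totals[e]
    return t

# Toy usage: linearised AST tokens paired with docstring tokens
corpus = [
    (['Return', 'number', 'Mod', '2'], ['return', 'the', 'even', 'numbers']),
    (['Return', 'total'], ['return', 'the', 'sum']),
]
t = train_ibm1(corpus)
print(t[('Return', 'return')])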

4.2 Phrase pair extraction

The word alignments form the starting point for the phrase-based SMT system (Koehn et al., 2003). Given a training set of source sentences, target sentences and their word alignments, all possible phrase pairs that are in accordance with the word alignments were extracted from the parallel corpus. A phrase pair consists of one or more contiguous source words paired with one or more contiguous target words. Given a source sentence f and a target sentence e, a phrase pair (f̄, ē) is said to be consistent with the alignment a if and only if:

1. No source words in the phrase pair are aligned to words outside of the phrase pair: ∀f_j ∈ f̄ : (e_i, f_j) ∈ a ⇒ e_i ∈ ē.
2. No target words in the phrase pair are aligned to words outside of the phrase pair: ∀e_i ∈ ē : (e_i, f_j) ∈ a ⇒ f_j ∈ f̄.
3. The phrase pair contains at least one word alignment: ∃e_i ∈ ē, f_j ∈ f̄ such that (e_i, f_j) ∈ a.

[12] The number of target words l is incremented by one to account for a null alignment, i.e. a source word that is not connected to any target word.

[Figure 3 shows a word alignment between a source sentence f1 f2 f3 f4 f5 and a target sentence e1 e2 e3 e4, together with all phrase pairs that are consistent with it:]

source phrase     target phrase
f1 f2             e2
f1 f2             e1 e2
f4                e4
f5                e3
f4 f5             e3 e4
f3 f4 f5          e3 e4
f1 f2 f3          e2
f1 f2 f3          e1 e2
f1 f2 f3 f4 f5    e2 e3 e4
f1 f2 f3 f4 f5    e1 e2 e3 e4

Figure 3: Example of all phrase pairs that are in accordance with the alignment.

Figure 3 shows an example of all the phrase pairs that can be extracted given the word alignment between two sentences. Note that unaligned words can be grouped with adjacent phrases as they do not conflict with the word alignment.
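As an illustration of the three consistency conditions, here is a small brute-force sketch of phrase pair extraction; the enumeration strategy and the names are choices made for this example, not the extraction code used in this work.

def extract_phrase_pairs(f_words, e_words, alignment, max_len=7):
    """Enumerate all phrase pairs consistent with a word alignment.

    alignment: set of (i, j) pairs linking target word e_words[i] to source word f_words[j].
    """
    pairs = set()
    for j1 in range(len(f_words)):
        for j2 in range(j1, min(j1 + max_len, len(f_words))):
            for i1 in range(len(e_words)):
                for i2 in range(i1, len(e_words)):
                    links = [(i, j) for (i, j) in alignment
                             if j1 <= j <= j2 or i1 <= i <= i2]
                    # every alignment point touching either span must lie inside both spans
                    if links and all(j1 <= j <= j2 and i1 <= i <= i2 for (i, j) in links):
                        pairs.add((' '.join(f_words[j1:j2 + 1]),
                                   ' '.join(e_words[i1:i2 + 1])))
    return pairs

# The introduction's Spanish-English sentence with a one-to-one word alignment
f_words = 'Devolver los números pares'.split()
e_words = 'Return the even numbers'.split()
alignment = {(0, 0), (1, 1), (2, 3), (3, 2)}  # Return-Devolver, the-los, even-pares, numbers-números
for pair in sorted(extract_phrase_pairs(f_words, e_words, alignment)):
    print(pair)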

The translation model constructed from these extracted phrase pairs will be denoted as the normal translation model. Two additional translation models were constructed that contain empty translations, i.e. source phrases that are translated with empty target phrases. Applying an empty translation is equivalent to removing a phrase from the source code, thus allowing pieces of source code to be excluded from the translation.

empty: Unaligned source code words are seen as evidence for adding empty translations. In Figure 3, for example, the empty translation f3-<empty> would be extracted.

extra empty: In addition to the empty translations based on unaligned words, one empty translation is added for every source phrase that appears in a phrase pair. In Figure 3 that means that ten additional phrase pairs are added, i.e. one empty translation for each phrase pair.

The set of empty translations is similar to the content selection feature of Banko et al. (2000). Whereas the content selection feature determines whether a word in a document is used in its summary, the empty translation adds the ability to exclude source code words from the translation.

4.3 Features

Equation 1 shows how the probability p(e|f) can be decomposed into a translation model and a language model. This work turns away from the Bayesian approach and instead uses a log-linear model to find the most likely docstring ê for a function f:

$$\hat{e} = \arg\max_e \prod_i h_i(\mathbf{f}, \mathbf{e})^{\lambda_i} \tag{13}$$

where h_i is a feature and λ_i is its corresponding weight. This allows the log-linear model to utilize multiple statistical models, as well as different heuristic features, when scoring a candidate docstring. Section 4.5 shows how the feature weights are optimized.

The following subsections give an overview of the features used by this system.

4.3.1 Translation Model Features

A phrase-based SMT system translates a source sentence by decomposing the source sentence into source phrases and translating each source phrase with a target phrase. The probability assigned by a translation model feature h_TM is the weighted product of the probabilities of the phrase pairs (f̄, ē) used in the decomposition:

$$h_{TM}(\mathbf{f}, \mathbf{e}) = \prod_{\forall(\bar{f}, \bar{e}) \in D(\mathbf{f}, \mathbf{e})} TM(\bar{f}, \bar{e}) \tag{14}$$

where D(f, e) is a decomposition of (f, e).

Four translation model features were used: the conditional probabilities p(f̄|ē) and p(ē|f̄), weighted by λ_pfe and λ_pef, and the lexical weights lex(f̄|ē) and lex(ē|f̄), weighted by λ_lexfe and λ_lexef.

$$lex(\bar{e}|\bar{f}, \bar{a}) = w(e_1|f_1) \cdot \frac{1}{2}\big(w(e_2|f_2) + w(e_2|f_3)\big) \cdot w(e_3|\langle\text{empty}\rangle)$$

Figure 4: An example of how the lexical weight is calculated for a phrase pair (ē, f̄) with source words f1 f2 f3, target words e1 e2 e3 and their word alignment ā.

These features were estimated according to the phrase pairs extracted from a training set, as shown in Section 4.2. The conditional probability p(ē|f̄) was estimated by its relative frequency:

$$p(\bar{e}|\bar{f}) = \frac{count(\bar{f}, \bar{e})}{count(\bar{f})} \tag{15}$$

Since the only requirement for extracting a phrase pair is that it is in accordance with the word alignment, phrase pairs are often created that have little supporting evidence, i.e. only a few words between the phrase pairs are aligned while all other words are unaligned. Such phrase pairs often receive high conditional probabilities due to low frequency counts. For example, a phrase pair whose source phrase and target phrase each occur only once has the following conditional probabilities: count(f̄, ē) = count(f̄) = count(ē) = 1 ⇒ p(ē|f̄) = p(f̄|ē) = 1/1 = 1.0. The lexical weight is a feature that measures the quality of a phrase pair according to the underlying word alignment. This allows the use of phrase pairs that have both high conditional probabilities and high lexical weights. The lexical weight lex(ē|f̄) is calculated as follows:

$$lex(\bar{e}|\bar{f}) = \max_{\bar{a}} lex(\bar{e}|\bar{f}, \bar{a}) \tag{16}$$

$$lex(\bar{e}|\bar{f}, \bar{a}) = \prod_{i=1}^{length(\bar{e})} \frac{1}{|\{j : (i, j) \in \bar{a}\}|} \sum_{\forall(i, j) \in \bar{a}} w(e_i|f_j) \tag{17}$$

where ā is the word alignment between the phrases ē and f̄, and w(e_i|f_j) is the word translation probability estimated by the relative frequency of the word alignments. Figure 4 shows an example of how the lexical weight is calculated for a given word alignment. If multiple alignments exist for a phrase pair, the highest lexical weight is used. The mirror probabilities p(f̄|ē) and lex(f̄|ē) were calculated in a manner similar to Equations 15 and 16, respectively.
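A small sketch of the lexical weight computation in Equation 17 is given below; the word translation table and the example words are hypothetical values used only for illustration.

def lexical_weight(e_phrase, f_phrase, alignment, w):
    """Compute lex(e|f, a) for one phrase pair and one word alignment.

    alignment: set of (i, j) pairs linking target word e_phrase[i] to source word f_phrase[j].
    w: dict mapping (target_word, source_word) to the word translation probability w(e|f).
    Unaligned target words are scored against an '<empty>' source token.
    """
    weight = 1.0
    for i, e_word in enumerate(e_phrase):
        linked = [j for (ti, j) in alignment if ti == i]
        if linked:
            # average the word translation probabilities of all linked source words
            weight *= sum(w.get((e_word, f_phrase[j]), 0.0) for j in linked) / len(linked)
        else:
            weight *= w.get((e_word, '<empty>'), 0.0)
    return weight

# Hypothetical probabilities for a phrase pair aligned e1-f1, e2-f2, e2-f3, with e3 unaligned
w = {('the', 'height'): 0.6, ('height', 'arguments'): 0.2,
     ('height', 'self'): 0.4, ('.', '<empty>'): 0.3}
print(lexical_weight(['the', 'height', '.'], ['height', 'arguments', 'self'],
                     {(0, 0), (1, 1), (1, 2)}, w))  # 0.6 * 0.5*(0.2+0.4) * 0.3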

As noted in Section 4.2, there are three translation models (normal, empty and extra empty) that correspond to three different phrase extraction approaches. Consequently, each translation model will have different values for the translation model features p(f̄|ē), p(ē|f̄), lex(f̄|ē) and lex(ē|f̄). The conditional probabilities of the empty translations in the empty and extra empty translation models were calculated according to Equation 15. Since empty translations do not have any underlying alignment, their lexical weights were set to their conditional probabilities, i.e. lex(f̄|<empty>) = p(f̄|<empty>) and lex(<empty>|f̄) = p(<empty>|f̄).

4.3.2 Language Model

The language model feature p_LM(e), weighted by λ_lm, measures the probability of a generated docstring e given a model of the docstrings seen in the training set. The language model was created by extracting all n-grams, with n ≤ 3, from all target sentences. Their probabilities were calculated using their relative frequency. As backoff strategy, stupid backoff (Brants et al., 2007) was used to account for n-grams that did not appear in the language model:

$$p_{LM}(\mathbf{e}) = \prod_{i=1}^{l} p(e_i|e_{i-n+1}^{i-1}) \tag{18}$$

$$p(e_i|e_{i-n+1}^{i-1}) = \begin{cases} \dfrac{count(e_{i-n+1}^{i})}{count(e_{i-n+1}^{i-1})} & \text{if } count(e_{i-n+1}^{i}) > 0 \\[6pt] \alpha \cdot p(e_i|e_{i-n+2}^{i-1}) & \text{otherwise} \end{cases} \tag{19}$$

where l is the length of e, e_j^i denotes the contiguous words from index j to i and α is a fixed parameter set to 0.4. If all of the words in the n-gram are unknown to the language model, an ad hoc low log-probability of -10000.0 is assigned.

4.3.3 Heuristic features

In addition to the statistical features discussed so far, three heuristic features were used: the phrase penalty, the word penalty and the linear distortion, weighted by λ_pp, λ_wp and λ_ld, respectively. The phrase penalty assigns a cost based on the number of phrase pairs used in a translation:

$$h_{PP}(\mathbf{f}, \mathbf{e}) = \exp\Big(\sum_{\forall(\bar{f}, \bar{e}) \in D(\mathbf{f}, \mathbf{e})} -1\Big) \tag{20}$$

The word penalty assigns a bonus for each target word generated:

$$h_{WP}(\mathbf{f}, \mathbf{e}) = \exp\Big(\sum_{e \in \mathbf{e}} 1\Big) \tag{21}$$

The linear distortion measures the degree to which the translation deviates from a strictly monotonic left-to-right translation. A monotonic translation refers to a translation in which the source phrases are translated in the same order as they appear in the source sentence.

$$h_{LD}(\mathbf{f}, \mathbf{e}) = \exp\Big(\sum_{i=2}^{|D(\mathbf{f}, \mathbf{e})|} LD(\bar{f}_{i-1}, \bar{f}_i)\Big) \tag{22}$$

The linear distortion between two source phrases f̄_i and f̄_j is defined as

$$LD(\bar{f}_i, \bar{f}_j) = -1 \cdot |first\_pos(\bar{f}_j) - last\_pos(\bar{f}_i) - 1| \tag{23}$$

Take for example the source sentence f0 f1 f2 f3 f4 f5 of which the source phrase f̄_i = [f2, f3] is the only phrase that has been translated. There are four possible starting positions for the next source phrase f̄_j: f0, f1, f4 and f5. Since f4 is adjacent to the last position of f̄_i, f̄_j = [f4] would correspond to a monotonic distortion and have a cost of LD(f̄_i = [f2, f3], f̄_j = [f4]) = -1 · |4 - 3 - 1| = 0. If instead the next source phrase would start at position f0, the distortion cost would be LD(f̄_i = [f2, f3], f̄_j = [f0]) = -1 · |0 - 3 - 1| = -4.
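A direct translation of Equation 23 into code, using the example above; the helper name is an illustrative choice.

def linear_distortion(prev_span, next_span):
    """LD(f_i, f_j) = -|first_pos(f_j) - last_pos(f_i) - 1|; spans are (start, end) index pairs."""
    return -abs(next_span[0] - prev_span[1] - 1)

# the previous phrase covers f2 f3
print(linear_distortion((2, 3), (4, 4)))  # 0, the monotonic step to f4
print(linear_distortion((2, 3), (0, 0)))  # -4, the jump back to f0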

4.4 Decoder

The decoder is tasked with finding the most probable translation of a source sentence. Given the phrase pairs extracted from the training data, there may be many possible ways to decompose a source sentence into source phrases, as well as many possible orders in which the source phrases can be translated. The Viterbi algorithm (Forney Jr, 1973) is used to efficiently traverse all candidate translations and find the one that has the best score according to Equation 13. Due to the large number of candidate translations per source sentence, pruning techniques are used to focus the search on the most promising candidate translations.

The Viterbi algorithm was implemented using a stack decoder, which translates the linearised AST of a Python function (source) into a docstring (target). For each translation the number of stacks is equal to the number of words in the source sentence, and each stack contains all partial translations that have covered the same number of source words. A state contains the following properties:

- The log-probability of the partial translation.
- A history of the previous n-1 words in the translation, necessary to update the language model probability.
- A new target phrase that translates the current source phrase.
- A coverage vector denoting the source words that have been translated.
- A back-pointer to the previous state.
- The index of the last translated source word, used when calculating the linear distortion cost.
- An estimation of the future cost.

A state is expanded by translating a source phrase not yet covered by the coverage vector. The log-probability of a state S_j that expands state S_i is calculated by adding the logarithm of the weighted features, discussed in Section 4.3, to the log-probability of state S_i:

$$\log prob_{S_j} = \log prob_{S_i} + \lambda_{pfe}\log(p(\bar{f}|\bar{e})) + \lambda_{pef}\log(p(\bar{e}|\bar{f})) + \lambda_{lexfe}\log(lex(\bar{f}|\bar{e})) + \lambda_{lexef}\log(lex(\bar{e}|\bar{f})) + \lambda_{lm}\sum_{i=1}^{|\bar{e}|}\log(p(e_i|e_{i-n+1}^{i-1})) + \underbrace{\lambda_{pp} \cdot -1}_{\text{phrase penalty}} + \underbrace{\lambda_{wp} \cdot |\bar{e}|}_{\text{word penalty}} + \lambda_{ld}\,LD(\bar{f}_{prev}, \bar{f}) \tag{24}$$

where (f̄, ē) is the phrase pair that is used to translate the uncovered source phrase and f̄_prev refers to the previously translated source phrase. If an empty translation is used, no new target words are generated and thus the language model probability and word penalty are not updated:

$$\log prob_{S_j} = \log prob_{S_i} + \lambda_{pfe}\log(p(\bar{f}|\langle\text{empty}\rangle)) + \lambda_{pef}\log(p(\langle\text{empty}\rangle|\bar{f})) + \lambda_{lexfe}\log(lex(\bar{f}|\langle\text{empty}\rangle)) + \lambda_{lexef}\log(lex(\langle\text{empty}\rangle|\bar{f})) + \underbrace{\lambda_{pp} \cdot -1}_{\text{phrase penalty}} + \lambda_{ld}\,LD(\bar{f}_{prev}, \bar{f}) \tag{25}$$

Starting with the first stack, the decoder expands all states until it reaches the final stack. If two states are equal, i.e. they have the same coverage vector, history and index of the last translated source word, then only the state with the highest probability is kept. In the final stack the coverage vectors cover all source words and each state corresponds to a full translation of the source sentence. A full translation can be generated by recursively following the states' back-pointers and fetching each part of the translation. In the end, the decoder returns the full translation with the highest probability.

Figure 6 shows an example of how the decoder translates the function in Figure 5. The following 4 phrase translations are used by the decoder:

source phrase       target phrase
return              get
height              the height
height arguments    the height
arguments           <empty>

The coverage vector, indicated by the numbers 0 through 5, shows the indices of the source words that have been translated, i.e. indices that have been covered. A white background indicates an uncovered source word, a grey background indicates a covered source word and a black background indicates the index of the last translated source word. States can be identified by their unique id. Furthermore, each state shows its log-probability p, history h and target phrase ē. Initially there is only a single starting state, located in stack-0. Since the initial state has not translated any source words, it has a log-probability of 0.0, an empty history and an empty target phrase. The arrows point towards expansions of the initial state, three of which are placed in stack-1 and one of which is placed in stack-2. State 2, for example, expands the initial state by applying the phrase pair height-the height, which covers index 0. All states in stack-n are expanded before moving to stack-n+1. Similarly, an expansion of a state in stack-n will always produce a state in stack-n+1 or higher, since an expansion always involves covering one or more words in the coverage vector.

State 2 expands to state 5 by applying the empty translation arguments-<empty>. However this state was already constructed when expanding the initial state. The red cross between state 2 and 5 indicates that this expansion has a lower probability than the expansion between state 0 and 4. Therefore, state 4 stores a back-pointer to state 0. Similarly, state 6 can be the result of expanding state 1 as well as expanding state 4, i.e. by covering indices 0 and 1 with the phrase pairs height-the height and arguments-<empty> or by covering indices 0 and 1 simultaneously with the phrase pair height arguments-the height. The red cross between state 1 and 6 indicates this expansion has a lower probability than the expansion of state 4.

To improve the speed of decoding, two types of pruning methods are applied to the stacks:

Beam size: The beam size refers to the maximum difference in log-probability between the state with the highest log-probability and any other state in the stack. For example, in Figure 6 a beam size of 10.0 would have the effect of removing state 7 from stack-3, since its log-probability differs by more than 10.0 from the log-probability of state 6.

Stack size: A maximum number of states is allowed in each stack. If a stack exceeds its maximum number of states, the state with the lowest probability is removed.
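A minimal sketch of the two pruning rules just described is given below; the state representation and the default thresholds are illustrative assumptions.

import heapq

def prune_stack(stack, beam_size=10.0, max_stack_size=100):
    """Apply beam and stack-size pruning to one decoder stack.

    stack: list of states whose .logprob already includes the future cost estimate.
    """
    if not stack:
        return stack
    # beam pruning: drop states that score too far below the best state in the stack
    best = max(state.logprob for state in stack)
    stack = [state for state in stack if best - state.logprob <= beam_size]
    # stack-size pruning: keep only the highest-scoring states
    return heapq.nlargest(max_stack_size, stack, key=lambda state: state.logprob)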

While these restrictions improve the speed of decoding, they are also prone to introducing search errors. For example, a state that is removed during decoding due to its low probability could possibly have led to a final state, i.e. a state with a fully covered coverage vector, that has the highest probability and would thus produce the most likely translation. The future cost is used to help prevent such search errors from occurring by estimating the expected minimum cost of covering all uncovered words; or, put in other words, the expected maximum probability of covering all uncovered words. A state's current cost together with its future cost is then used when applying the aforementioned pruning methods. A state that currently has a low probability but has an expected high probability when accounting for the future cost is more likely to survive the pruning methods and avoid a search error.

def height(self):
    return self.height

height_0 arguments_1 self_2 return_3 height_4 self_5

Figure 5: Python function and its linearised form. The numbers denote the indices of the source words.

The future cost is estimated by first computing a score F[i, j] for each span of source words [i, j] that corresponds to a source phrase f̄:

$$F[i, j] = \begin{cases} \max_{\bar{e}} \log\big(p_{LM}(\bar{e})^{\lambda_{lm}} \cdot p(\bar{f}|\bar{e})^{\lambda_{pfe}} \cdot p(\bar{e}|\bar{f})^{\lambda_{pef}} \cdot lex(\bar{f}|\bar{e})^{\lambda_{lexfe}} \cdot lex(\bar{e}|\bar{f})^{\lambda_{lexef}}\big) & \text{if } \bar{f} \in TM \\ -10.0 + \log(p_{LM}(\bar{f})^{\lambda_{lm}}) & \text{else if } i = j \\ -10000.0 & \text{else} \end{cases} \tag{26}$$

Then for each span [i, j], such that 2 ≤ j − i + 1 ≤ n, where n is the number of source words:

$$F[i, j] = \max\Big(\max_{i \le k < j} \big(F[i, k] + F[k + 1, j]\big),\ F[i, j]\Big) \tag{27}$$

The future cost of a state is calculated by summing over all uncovered spans in its coverage vector and adding the weighted log of the minimum linear distortion. For example, the future cost of state 1 is:

$$F[0, 2] + F[4, 5] + \lambda_{ld} \cdot \Big(\underbrace{\log(h_{LD}(f_3, f_0))}_{-4} + \underbrace{\log(h_{LD}(f_2, f_4))}_{-1}\Big)$$

Additionally, restrictions are put on the translation model to reduce the number of target phrases per source phrase and thus further improve the speed of the decoder:

Phrase length: A restriction on the number of words in the source phrase of a phrase pair.

Top translations: Each source phrase has a maximum number of target translations. The target phrases used are those with the highest score according to the weighted conditional probabilities and lexical weights:

$$p(\bar{f}|\bar{e})^{\lambda_{pfe}} \cdot p(\bar{e}|\bar{f})^{\lambda_{pef}} \cdot lex(\bar{f}|\bar{e})^{\lambda_{lexfe}} \cdot lex(\bar{e}|\bar{f})^{\lambda_{lexef}}$$

Note that the top translations restriction has an impact on the extra empty translation model. Source phrases with more target phrases than the maximum are likely to exclude the extra empty translation. The source phrase of an extra added empty translation will often have a low frequency and thus a low probability compared to the probabilities of the non-empty target translations. This means that these source phrases will not have an empty translation during decoding.

The following two strategies were devised to account for words unknown to the translation model:

self: The unknown word is translated with itself, i.e. the unknown word is used in the target sentence without being translated.

empty: The unknown word is translated with an empty target phrase. This is equivalent to removing the unknown word from the source sentence.

Since an unknown source word does not have any conditional probabilities or lexical weights, these features were assigned an ad hoc low log-probability of −10000.0.

4.5 Optimization

A pattern search method was used to optimize the features' weights. In each iteration, a fixed step size is added to and subtracted from each of the feature weights. Each set of feature weights is assigned a score according to an evaluation metric that measures the quality of the candidate docstrings generated by the decoder using those feature weights. The pattern search method keeps track of the feature weights that achieved the highest score and uses these weights as the starting point in the next iteration. Once the score no longer improves, the step size is halved. This continues until either the difference between the current best and previous best score is small enough, a maximum number of iterations has been reached or a minimum step size is reached. This optimization technique belongs to the family of local search algorithms and is capable of finding a local optimum. However, it is not guaranteed to find the best set of feature weights, i.e. the global optimum. The pseudo-code is shown in Algorithm 1. The function get_score refers to the process of constructing and evaluating candidate docstrings.

[Figure 6 shows the decoder's stacks 0 through 3 for this example. Each state lists its unique id, log-probability p, history h, target phrase ē and coverage vector over the six source word indices; arrows denote expansions and red crosses mark expansions that lose to a higher-probability path.]

Figure 6: Example of decoding the function shown in Figure 5.

Algorithm 1: Pattern Search

Require: weights, get_score(·), step_size, min_diff, min_step, max_iterations

iteration ← 0
diff ← ∞
best_weights ← weights
best_score ← get_score(weights)
length ← length(weights)
while diff ≥ min_diff and iteration ≤ max_iterations and step_size ≥ min_step do
    previous_weights ← best_weights
    previous_score ← best_score
    for pos: 0 to length do
        step ← zeros(length)
        step[pos] ← step_size
        for weights ∈ {best_weights + step, best_weights − step} do
            score ← get_score(weights)
            if score > best_score then
                best_weights ← weights
                best_score ← score
            end if
        end for
    end for
    if previous_weights = best_weights then
        step_size ← step_size / 2
    else
        diff ← best_score − previous_score
    end if
    iteration ← iteration + 1
end while

[Figure 7 depicts the pipeline: software projects are converted into ASTs, documented functions are extracted, and filtering produces a parallel corpus of linearised ASTs and docstrings, which is split into train, tune and test data. The word alignment tool and the training data yield the statistical models p(f|e), p(e|f), lex(f|e), lex(e|f) and p(e), which the decoder uses to produce candidate docstrings; the optimizer scores them against the tune data to update the feature weights, and the test data yields the final evaluation score.]

Figure 7: Overview of the docstring generation system.

4.6 Overview

Figure 7 gives an overview of the phrase-based SMT system. First, annotated functions are extracted from software projects by converting the source code into ASTs and selecting the ASTs of functions that are documented. Next, the annotated functions are converted into a parallel corpus by linearising the ASTs. The filtering loop refers to the removal of unwanted pieces of documentation from the docstrings and the removal and tokenization of sentence pairs. The parallel corpus is then split into three datasets: the training set, tune set and test set. The training set is passed to the alignment tool, which generates word alignments between the sentence pairs. The word alignments together with the training set are used to build the statistical models. The decoder uses the statistical models and feature weights to construct candidate docstrings for the linearised ASTs of the tune set. A score is computed by comparing the candidate docstrings with the reference docstrings of the tune set. This score is passed to the optimizer, which uses the score to update the feature weights for the next iteration of candidate docstrings. After tuning the feature weights, the performance of the system is measured by evaluating the quality of the candidate docstrings generated for the test set. The candidate docstrings are compared with the reference docstrings of the test set to evaluate the performance of the system.

5 Experiments

This section details the experiments conducted to evaluate the performance of the candidate docstrings generated by the phrase-based SMT system. First, an overview is given of the software projects used to construct the parallel corpus, and the effect of the filters applied to the docstrings is shown. Next, a description is given of how the quality of the generated candidate docstrings is measured using the BLEU metric. The performance of the system is evaluated by optimizing the feature weights using the pattern search algorithm. This is done for three types of translation models and two methods of translating unknown words. Additionally, the following experiments were conducted: (1) an experiment on the influence the amount of training data has on the performance of the system, (2) a qualitative analysis of the generated candidate docstrings, (3) an experiment to improve the length of the candidate docstrings and (4) an experiment to determine the performance of the system on out-of-domain datasets.

5.1 Data

Python projects were taken from web-based hosting services for software development projects. Most projects were taken from Github [13], Bitbucket [14] and Sourceforge [15]. A full list of the Python projects used in this work and their hosts can be found in Appendix A.

The constructed parallel corpus contained 71200 lines. 2000 random lines were used as tune data and another 2000 random lines were used as test data. The remaining 67200 lines were used as train data. MGIZA (Gao and Vogel, 2008), a multi-threaded word alignment tool based on GIZA++ (Och and Ney, 2003), was used to automatically construct the word alignments. The optimization involved 5 iterations of IBM-1, 5 iterations of HMM, 3 iterations of IBM-3 and 3 iterations of IBM-4. The resulting word alignments were then used to train the conditional probabilities and lexical weights of the translation model. The reference docstrings in the training set were used to train the language model.

5.2 Filtering

Figure 8 shows the mean and standard deviation of the number of words in a docstring before (red) and after (blue) applying the docstring filters discussed in Section 3.2. The figure shows that applying the filters lowers the mean and standard deviation of the number of docstring words per source code word. Note that the lower bound of the standard deviation for the docstrings prior to filtering (red) often has a negative value. Obviously, this does not imply that there are docstrings with a negative number of words. It occurs when the standard deviation of the number of words in a docstring is larger than the mean, which can happen when one docstring has many more words than the others. Take, for example, a list of docstrings with the following numbers of words: [4, 5, 60]. From this list the mean μ and standard deviation σ can be calculated:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{4 + 5 + 60}{3} = 23$$

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2} = \sqrt{\frac{(4-23)^2 + (5-23)^2 + (60-23)^2}{3}} \approx 26.166$$

Since the standard deviation is greater than the mean, the lower bound lb will be negative: lb = µ − σ = 23 − 26.166 < 0.

Furthermore, the figure shows that there is only a small difference in docstring size between small functions and long functions. A function of 10 source code words and a function of 100 source code words have on average a summary of 14 and 25 words, respectively. This becomes problematic as the decoder will need to be able to simultaneously translate long source sentences into smaller target sentences and small source sentences into longer target sentences. This problem is further discussed in Section 5.7.

5.3 Evaluation

The quality of docstrings generated by this work were evaluated with BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2002), a metric commonly used to evaluate the quality of machine translated text. The BLEU score is based on the precision of n-grams in a translated sentence occurring in the reference sentence. Since it has been shown that the BLEU score correlates with human judgement, an improvement in BLEU score is taken as evidence for improvement in translation quality. Similarly, in this work an improvement in BLEU score will be seen as evidence for improvement in documentation quality16.

The BLEU score is calculated as the product of the brevity penalty (BP) and the geometric mean of the modified n-gram precisions:

BLEU = BP \cdot \exp\left(\frac{1}{N}\sum_{n=1}^{N} \log p_n\right) \qquad (28)

where N is set to 4.

13 https://github.com/
14 https://bitbucket.org/
15 http://sourceforge.net/

16 This is based on the assumption that the resemblance between a candidate docstring and a reference docstring can be measured in the same way as the resemblance between a candidate sentence and a reference sentence in a machine translation task. To verify this, a correlation must be shown to exist between automatically generated BLEU scores and human judgement scores of documentation quality. This is left as future work.


The brevity penalty penalizes translated sentences that contain fewer words than their reference sentence. The brevity penalty is computed over the entire corpus as follows:

BP = \begin{cases} 1 & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right) & \text{otherwise} \end{cases} \qquad (29)

where c corresponds to the number of words in all candidate translations and r corresponds to the number of words in all reference translations17.

The n-gram precision p_n is computed by dividing the total number of n-grams in each candidate translation that match an n-gram in the corresponding reference translation by the total number of n-grams in each candidate translation. The precision is modified such that the number of matches of each n-gram is clipped to the largest count of that n-gram observed in the reference translation. Similarly to the brevity penalty, the modified n-gram precision is calculated over the entire corpus:

p_n = \frac{\sum_{C \in \text{Candidates}} \sum_{ngram \in C} \text{Matches}_{clip}(ngram)}{\sum_{C \in \text{Candidates}} \sum_{ngram \in C} \text{Count}(ngram)} \qquad (30)

As an example, take the following candidate translations and reference translations:

candidate                      reference
extract even even numbers      return the even numbers
the sum of all items           return the sum of all items in iterable

The first candidate translation matches 3 out of 4 uni-grams: once for the word numbers and twice for the word even. However, since the candidate word even only occurs once in the reference translation, the number of matches for even is clipped from 2 to 1. Therefore, only 2 out of 4 uni-gram matches are counted. The first candidate sentence also matches 1 out of 3 bi-grams, but no tri-grams and no 4-grams. The second candidate sentence matches 5 out of 5 uni-grams, 4 out of 4 bi-grams, 3 out of 3 tri-grams and 2 out of 2 4-grams. This gives the following n-gram precisions:

p_1 = \frac{2+5}{4+5} = \frac{7}{9}, \quad p_2 = \frac{1+4}{3+4} = \frac{5}{7}, \quad p_3 = \frac{0+3}{2+3} = \frac{3}{5}, \quad p_4 = \frac{0+2}{1+2} = \frac{2}{3}

There are 9 candidate words and 12 reference words, hence the brevity penalty is applied when calculating the BLEU score:

BLEU = \exp\left(1 - \frac{12}{9}\right) \cdot \exp\left(\frac{1}{4}\left(\log\frac{7}{9} + \log\frac{5}{7} + \log\frac{3}{5} + \log\frac{2}{3}\right)\right) = 0.4920

This paper uses the %-BLEU notation. For example, 0.4920 BLEU will be reported as a BLEU (%) score of 49.20.
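As a sanity check, the worked example above can be reproduced with a minimal corpus-level BLEU implementation; this sketch is not the evaluation script used in this work, but it follows equations 28-30 directly:

from collections import Counter
from math import exp, log

def modified_precision(candidates, references, n):
    # Corpus-level modified n-gram precision with clipped counts (equation 30).
    matches, total = 0, 0
    for cand, ref in zip(candidates, references):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches += sum(min(count, ref_ngrams[ngram]) for ngram, count in cand_ngrams.items())
        total += sum(cand_ngrams.values())
    return matches / total

def bleu(candidates, references, max_n=4):
    # Brevity penalty (equation 29) times the geometric mean of p_1..p_4 (equation 28).
    c = sum(len(cand) for cand in candidates)
    r = sum(len(ref) for ref in references)
    bp = 1.0 if c > r else exp(1 - r / c)
    precisions = [modified_precision(candidates, references, n) for n in range(1, max_n + 1)]
    return bp * exp(sum(log(p) for p in precisions) / max_n)

candidates = ["extract even even numbers".split(), "the sum of all items".split()]
references = ["return the even numbers".split(), "return the sum of all items in iterable".split()]
print(bleu(candidates, references))  # approximately 0.492, i.e. a BLEU (%) score of 49.20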

5.4 Optimization

After building the models for the decoder, pattern search was used to optimize the feature weights. Table 1 shows the BLEU (%) scores on the tune set with the corresponding weights for the three types of translation models (normal, empty and extra empty) and the two approaches to translating unknown words (self and empty). The pattern search used a minimum step size of 0.25 and a minimum score difference of 0.1 BLEU (%). The decoder used a beam size of 1.0, a stack size of 100, a maximum phrase length of 7 and 10 top translations. The results show that all methods using empty translations improve the BLEU score. The empty translation model in combination with empty translation of unknown words achieved the highest BLEU (%) score of 14.51. Table 2 shows an additional BLEU (%) score of 14.6 for this combination when the minimum step size is set to 0.03125 and the minimum score difference to 0.01 BLEU (%). For all settings the optimal linear distortion weight λ_ld was 0.0, indicating that the best results are achieved when the order in which source code segments are translated is not constrained to a monotonic translation. This can indicate that (1) there is another linearisation of a function's AST that better reflects the order in which source code segments are translated or (2) the structure of a function's AST inherently does not reflect the order in which source code segments are translated. Analysing alternative linearisation strategies of a function's AST is left as future work.
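For reference, a minimal coordinate-wise sketch of such a pattern search is given below; the initial step size, the order in which weights are visited and the stopping details are assumptions here and may differ from the variant described in Section 4.5:

def pattern_search(score, weights, init_step=1.0, min_step=0.25, min_gain=0.1):
    # score(weights) decodes the tune set with the given weight vector and returns
    # its BLEU (%) score. Each weight is moved up and down by the current step size;
    # a move is kept only if it improves the score by at least min_gain. When no move
    # helps, the step size is halved until it drops below min_step.
    best = score(weights)
    step = init_step
    while step >= min_step:
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                trial_score = score(trial)
                if trial_score > best + min_gain:
                    weights, best, improved = trial, trial_score, True
        if not improved:
            step /= 2
    return weights, best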

Table 3 shows the BLEU (%) scores when using different decoder parameters. Overall, increasing the decoder's beam size improves the decoder's performance. Increasing the number of top translations has a positive effect on the normal translation model, but a mostly negative effect on the empty and extra empty translation models.

17 If a candidate translation has multiple reference translations, the reference translation with the closest number of words is used.


Figure 8: The y-axis shows the mean and standard deviation of the number of words in a docstring. The x-axis shows the corresponding number of words in the source code. Data before and after applying the filters are denoted in red and blue, respectively.

TM           unknown word   λ_pfe   λ_pef   λ_lexfe   λ_lexef   λ_lm   λ_pp   λ_wp   λ_ld   BLEU (%)
normal       self           6.0     0.75    0.0       0.5       1.0    1.0    1.0    0.0    10.80
normal       empty          6.0     0.5     0.0       0.5       1.0    1.0    1.0    0.0    11.11
empty        self           6.0     1.0     1.5       -0.5      -1.0   1.0    1.0    0.0    13.91
empty        empty          8.0     1.0     1.0       -1.0      1.0    1.0    1.0    0.0    14.51
extra empty  self           5.0     1.0     0.0       0.0       1.0    2.0    1.0    0.0    13.82
extra empty  empty          5.0     1.0     0.0       0.0       1.0    2.0    1.0    0.0    13.80

Table 1: BLEU (%) scores on the tune set and the corresponding feature weights after applying pattern search with a minimum step size of 0.25 and a minimum score difference of 0.1 BLEU (%).

This is most likely caused by an increase in the number of empty translations used during decoding. An increase in top translations allows empty translations that would otherwise have been removed to be used during decoding. To ensure that the increase in empty translations does not lead to overly short candidate translations, the feature weights would need to be re-optimized. Note, however, the number of seconds it takes to decode the tune set, shown between brackets in Table 3. An increase in any of the search parameters corresponds to a considerable increase in decoding time. Consequently, optimizing with increased search parameters would also take considerably longer.

Table 4 shows the results on the test set when using the optimized feature weights and decoder parameters. The extra empty translation model combined with the self method of translating unknown words reached the best BLEU (%) score of 14.94.

TM     unknown word   λ_pfe   λ_pef   λ_lexfe   λ_lexef    λ_lm   λ_pp      λ_wp   λ_ld   BLEU (%)
empty  empty          8.0     1.0     1.0       -0.96875   1.0    1.03125   1.0    0.0    14.6

Table 2: BLEU (%) score on the tune set and the corresponding feature weights after applying pattern search with a minimum step size of 0.03125 and a minimum score difference of 0.01 BLEU (%). λ_pp is the phrase penalty weight, λ_wp is the word penalty weight and λ_ld is the linear distortion weight.


                       unknown word: self                           unknown word: empty
TM           TT    beam 1         beam 10        beam 100       beam 1         beam 10        beam 100
normal       10    10.80 (280)    11.37 (588)    11.45 (737)    11.11 (283)    11.79 (622)    11.82 (723)
normal       50    10.95 (822)    11.55 (2040)   11.66 (2583)   11.26 (809)    12.02 (2064)   12.07 (2564)
normal       100   10.94 (1374)   11.56 (3593)   11.64 (4488)   11.27 (1375)   12.03 (3638)   12.05 (4499)
empty        10    13.91 (311)    14.49 (665)    14.59 (945)    14.51 (268)    14.80 (552)    14.84 (742)
empty        50    13.91 (893)    14.51 (2348)   14.50 (3180)   14.45 (744)    14.77 (1906)   14.73 (2630)
empty        100   13.91 (1446)   14.44 (4047)   14.48 (5553)   14.45 (1276)   14.80 (3313)   14.69 (4624)
extra empty  10    13.82 (324)    14.56 (623)    14.61 (781)    13.80 (319)    14.55 (628)    14.60 (770)
extra empty  50    12.82 (899)    13.71 (2085)   13.84 (2696)   12.68 (878)    12.60 (2125)   13.72 (2664)
extra empty  100   12.81 (1470)   13.82 (3530)   13.94 (4697)   12.68 (1487)   13.71 (3685)   13.82 (4674)

Table 3: BLEU (%) scores on the tune set for different decoder parameter settings. The number in brackets shows the number of seconds it took to generate the translations. TM is the type of translation model and TT is the number of top translations.

TM           unknown word   Top Translations   Beam Size   BLEU (%)
normal       self           50                 100         12.47
normal       empty          50                 100         12.74
empty        self           50                 100         14.58
empty        empty          10                 10          14.85
extra empty  self           10                 100         14.94
extra empty  empty          10                 100         14.92

Table 4: BLEU (%) scores on the test set. The optimal feature weights were set according to Table 1. The optimal number of top translations and beam size were set according to Table 3.

5.5 Training Data

As with any machine learning problem, the amount of training data can affect the performance of the system. To investigate the effect of the amount of training data on the quality of the generated docstrings, the training data was split into 10 random segments, which were then combined to form cumulative portions of the full training data. Figure 9 shows the performance on the test set for the different translation models and methods of translating unknown words. 100% corresponds to all 67200 lines of training data. The figure shows an improvement in BLEU score as the amount of training data increases. While the improvements gradually decrease with each increment of training data, a further increase in training data will still likely improve the system's performance. The figure shows only a small difference in performance between the empty and extra empty translation models, and an increasing difference between the normal translation model and the empty and extra empty translation models. When training on 0% of the training data, applying the self method of translating unknown words is equivalent to not translating the source code at all. This results in a BLEU (%) score of 0.26.
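A minimal sketch of how such cumulative portions can be constructed is shown below (the function name, the single shuffle and the fixed seed are illustrative assumptions):

import random

def cumulative_portions(train_lines, n_segments=10, seed=0):
    # Shuffle the training data once, cut it into n_segments random segments and
    # return the cumulative portions (10%, 20%, ..., 100%) used for retraining.
    lines = list(train_lines)
    random.Random(seed).shuffle(lines)
    size = len(lines) // n_segments
    segments = [lines[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(lines[(n_segments - 1) * size:])  # last segment takes the remainder
    portions, accumulated = [], []
    for segment in segments:
        accumulated.extend(segment)
        portions.append(list(accumulated))
    return portions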

5.6 Qualitative analysis

What follows is a qualitative analysis of some test set docstrings generated with the extra empty translation model and the self method for translating unknown words, the combination that scored the highest BLEU (%) score of 14.94 on the test set after the optimization. Each example shows the original Python function documented by the reference docstring alongside the alignments between the function's linearised AST (top) and the generated docstring (bottom). Each alignment corresponds to a phrase translation used by the decoder. Source code phrases that have no alignment indicate the use of an empty translation.

Figure 10 shows a function that deletes the children of a node. The alignment between children arguments self item and children of the given item intuitively seems to be good. Similarly, the alignment between keys nodes self and list of underlying nodes seems to be good, although in this context an empty translation would have been better. The generated docstring does not refer to any deletion since an empty translation was applied to delete. Similarly, in Figure 11 the candidate docstring misses important information by applying an empty translation to remove.

Figure 12 shows a case in which the reference docstring is a subset of the candidate docstring. The phrases line height, call get and scale self would be better off with an empty translation. The phrase line height, however, does have an agreeable alignment with item height. The phrases bin op and div, corresponding to the division operation, are correctly not used in the translation.

Figure 9: BLEU scores on the test set when using different portions of the training data. The feature weights were set according to Table 1. The number of phrase translations was set to 10 and the beam size was set to 1.

The candidate docstring in Figure 13 gives a more abstract summary of the function's behaviour than the reference docstring. The reference docstring names all the different time units that are returned by the function, while the candidate docstring captures this by simply stating that the function returns the time. While still descriptive of the function's behaviour, there is little overlap between the candidate and reference docstring, for which the overall BLEU score will be penalized. Also note that some of the alignments between the source and target phrases do not intuitively seem like good translations. The translation of year self month self day self hour to return does not seem to make much sense; however, it is an important part of the final candidate docstring.

Figure 14 shows another example of a candidate docstring that, while having little overlap with the reference docstring, still gives a reasonable description of the function. The candidate docstring describes the purpose of the method __lt__, i.e. determining the less-than relation between two objects, but does not explain how the less-than relation is evaluated, as illustrated by the sketch below. The fact that this phrase translation exists indicates that there are cases in which this phrase was used to document the less-than function.
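To make this concrete, the following is a hypothetical example (not the function from Figure 14) of a __lt__ method whose docstring states the purpose of the comparison without explaining how it is evaluated:

class Version:
    def __init__(self, major, minor):
        self.major = major
        self.minor = minor

    def __lt__(self, other):
        """Return True if this object is less than the other object."""
        # The docstring above describes what __lt__ is for, but not how the
        # comparison is evaluated (here: by major version, then minor version).
        return (self.major, self.minor) < (other.major, other.minor)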

Figure 15 shows an almost perfect translation of the function's linearised AST, thus showing that it is possible to generate the required reference docstring. Figure 16 shows another example in which the candidate docstring perfectly matches the reference docstring.

This analysis gives insight into some of the problems that are faced when constructing a candidate docstring. First of all, the empty translations applied by the decoder can improve as well as harm the quality of the candidate translations. The empty translations of delete and remove in Figures 10 and 11, respectively, show that the empty translation can cause valuable information about a function's behaviour to be excluded from the docstring. However, the decoder has no way of knowing that some words should be preferred over others. For example, the decoder can only choose between a candidate translation (a) that excludes the word delete but correctly ends the sentence with a period and a candidate translation (b) that includes the word delete but leaves out the period, according to the probabilities of the two candidate translations. It has no knowledge of how informative one candidate translation is compared to the other. Similarly, the BLEU score will not penalize one of these two candidate translations more heavily than the other. In order to differentiate between candidate translations (a) and (b), an evaluation metric must be used that takes into account how informative the n-grams are that co-occur in the reference translation. The evaluation metric NIST (Doddington, 2002) does exactly that. NIST is an adaptation of the BLEU metric which considers n-grams that occur less often to be more informative and weights them more heavily according to their information value:

\text{Info}(w_1 \ldots w_n) = \log_2 \frac{\text{count}(w_1 \ldots w_{n-1})}{\text{count}(w_1 \ldots w_n)} \qquad (31)
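A minimal sketch of computing these information weights from reference n-gram counts is given below; the helper name and the choice to count over the reference corpus are assumptions consistent with the NIST definition:

from collections import Counter
from math import log2

def nist_info_weights(references, max_n=5):
    # Count all n-grams up to length max_n in the reference corpus and weight each
    # n-gram according to equation 31: rarer n-grams receive a higher information value.
    counts = Counter()
    for ref in references:
        for n in range(1, max_n + 1):
            for i in range(len(ref) - n + 1):
                counts[tuple(ref[i:i + n])] += 1
    total_unigrams = sum(len(ref) for ref in references)
    info = {}
    for ngram, count in counts.items():
        prefix_count = counts[ngram[:-1]] if len(ngram) > 1 else total_unigrams
        info[ngram] = log2(prefix_count / count)
    return info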
