
Improve code embedding model by changing the code representation method

Dongliang Liu

ABSTRACT

In recent years, the techniques that have driven significant progress in natural language processing are being applied to source code. Although source code has many commonalities with natural language, it also possesses attributes not present in natural language that can be leveraged. Code embedding is a technique migrated from word embedding that maps code to a vector space. In this thesis, we aim to explore how we can utilize specific attributes of code to improve code embeddings for common downstream tasks. We plan to use abstract syntax tree (AST) paths to represent the code context and apply code-specific tokenization methods to improve performance.

We will use the path-attention network described in prior literature and train the model with our code representation data. Training approaches from NLP, such as Replaced Token Detection (RTD) and Masked Language Modeling (MLM), can be applied as our training objectives. We will evaluate the model in a variety of ways, including method name prediction, code similarity measurement, and code clone detection. Based on these criteria, we compare our model against the state of the art to evaluate whether our method results in better embeddings.

1 PERSONAL DETAILS

My email: dongliang.liu@student.uva.nl

My supervisors' emails: UvA: Simon Baars (s.j.baars@uva.nl); ING: luiz.vilas.boas.oliveira@ing.com

The wiki on my GitHub account: https://github.com/Dongliang7/CodeEmbedding

2 RESEARCH QUESTION

The research question of this thesis is: Can we improve code embedding models by utilizing code structure and grammar information? More specifically, we want to improve the performance of the state-of-the-art code embedding network [1] by proposing approaches for the data preparation, code representation, and training steps, and evaluate it on program-understanding tasks such as code clone detection and method name prediction.

In this thesis design we propose our plan to address the following research questions:

(1) What data sources offer code that can be used to train and verify our model? Large corpora of various programming languages are available as datasets, such as CodeSearchNet [2], Google Code Jam [3], and BigCloneBench [4]. We may also collect source code data ourselves if the currently existing data sources do not meet our training and experiment requirements.

(2) What code representations can be used to improve state-of-the-art embedding models? Previous research on code language processing has used sequences of tokens, abstract syntax trees (AST), and graph representations, for example data-flow and control-flow graphs. We need to figure out which approach best reveals the information inside the code and its structure.

(3) What training objectives are fit for our model? There are many training objectives, such as measuring code similarity, predicting method names, replaced token detection (RTD) [5], and masked language modeling (MLM) [6].

(4) What methods are available to evaluate our code embeddings? Do we plan to measure the performance in a qualitative or quantitative way? In the literature, several evaluation methods are presented, for example predicting code function names, code similarity measurement, and semantic clone detection. In this thesis we will propose our own appropriate evaluation plan.

3 RELATED LITERATURE

The programming language processing area has been expanding recently, and code embedding is one of the key techniques for analyzing source code. The following table lists some of the most relevant articles and works on programming language processing and code embedding.

In Code2seq [7] and Code2vec [1], the authors use Java projects from GitHub as their data source and represent a code snippet as a set of compositional paths in its abstract syntax tree (AST). Code2seq uses a bi-directional LSTM to encode the entire path sequence and an attention mechanism to select the most relevant paths while decoding. In Code2vec, a path-attention network is constructed. This attention-based neural network can identify the importance of multiple path-contexts and aggregate them accordingly to make a prediction. Each node of the AST paths also contains additional movement direction information (either up or down in the AST). Compared to standard tokenized code sequences, these AST path approaches are able to leverage the structural information of code fragments. The authors evaluate performance by predicting a Java method's name given its function body, and the results indicate that these models significantly outperform previous NMT and transformer models that were specifically designed for programming languages.

The path-attention network described in Code2vec is a good example of learning vectors from arbitrary-sized snippets of code. This model allows us to embed a program, which is a discrete object, into a continuous space, such that it can be fed into a machine learning pipeline for various tasks. The key point is that a code snippet is composed of a bag of contexts, and each context is represented by a vector whose values are learned. In Code2vec, each context is a triplet that contains the AST path and the two terminal nodes of the path. Every context vector is then formed by a concatenation of three independent vectors, and a fully connected layer learns to combine its components.


| Work | Data Source | Code Representation | Model | Evaluation |
|---|---|---|---|---|
| Code2seq (Alon et al. (2018)) | Java projects from GitHub | AST paths | BiLSTM with attention mechanism | Predicting Java methods' names |
| Code2vec (Alon et al. (2019)) | Java projects from GitHub | AST paths | Path-attention neural network | Predicting Java methods' names |
| Detecting Code Clones (Wang et al. (2020)) | Google Code Jam and BigCloneBench | Graph representation of syntax tree | GNN models | Measuring the similarity of code pairs |
| Neural Code Comprehension | Code in various source languages | Contextual Flow Graph (XFG) | RNN models | Analogies, algorithm classification |
| CodeBERT | CodeSearchNet | Token sequences | Pre-trained BERT model | Code search and code documentation generation |
| CuBERT | Python files from GitHub | Python tokenize package | Pre-trained BERT model | Classification tasks |

Table 1: Representative literature on code embedding

Finally, the attention mechanism computes a weighted average over the combined context vectors. The aggregated code vector, which represents the whole code snippet, is a linear combination of the combined context vectors weighted by their attention scores.
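To make the aggregation concrete, a minimal sketch (illustrative only, using NumPy rather than the authors' implementation) of the attention-weighted combination could look as follows.

```python
# A minimal sketch (illustrative, NumPy) of the attention aggregation described
# above: each combined context vector is scored against a learned global
# attention vector, and the code vector is the softmax-weighted average.
import numpy as np

def aggregate_code_vector(context_vectors: np.ndarray,
                          attention_vector: np.ndarray) -> np.ndarray:
    """context_vectors: (n_contexts, d); attention_vector: (d,); returns (d,)."""
    scores = context_vectors @ attention_vector   # one scalar score per context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax attention weights
    return weights @ context_vectors              # weighted average = code vector

# Example: three 4-dimensional context vectors.
contexts = np.random.randn(3, 4)
attention = np.random.randn(4)
print(aggregate_code_vector(contexts, attention))
```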

Figure 1: Path-attention network in Code2vec

Detecting semantic code clones [8] is another interesting program analysis task. To leverage control and data flow information in such a task, Wang et al. (2020) [9] construct graph representations of programs to reflect code information from ASTs. Then, they calculate vector representations for code fragments using graph neural networks. They applied their model to two Java datasets: Google Code Jam and BigCloneBench [4]. The results show that this approach outperforms most existing clone detection approaches.

Besides processing the code directly or using a syntactic tree representation, other novel processing techniques to learn code semantics have also been proposed. Ben-Nun et al. (2018) [10] use the LLVM Compiler Infrastructure to convert code in various languages to an Intermediate Representation (IR) and then process it into ConteXtual Flow Graphs (XFG). XFGs are constructed from both the data and control flow of the code and are used to train an embedding space for individual statements, which is fed to RNNs for high-level tasks.

CodeBERT [11] and CuBERT [12] are BERT-like models pre-trained with replaced token detection (RTD) [5] and masked language modeling (MLM) tasks. They use code functions from GitHub repositories as pre-training data and represent a piece of code as a sequence of tokens. These BERT models are fine-tuned for downstream tasks, such as natural language code search.

4 METHODOLOGY

Prior research has shown that NLP techniques can be applied to code embeddings. This thesis aims to explore an approach to improve the performance of code embeddings as measured by downstream tasks. This section explains how we plan to collect data, represent code snippets, train our model, and evaluate performance.

4.1 Data

In this thesis, I plan to use the CodeSearchNet [2] Java corpus as the data source. Some experiments, such as AST generation and tokenization, have already been conducted on this data in the first stage of my research. Other open-source programming language corpora are available as a backup plan.

CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code search using natural language. To enable evaluation of progress on code search, they collect the corpus by scraping GitHub repositories and pairing individual functions with their documentation as natural language annotation. They use libraries.io to identify all projects which are used by at least one other project, and sort them by “popularity” as indicated by the number of stars and forks. Then, they remove any projects that do not have a license or whose license does not explicitly permit the re-distribution of parts of the project. Finally, they tokenize all functions (or methods) using TreeSitter. The corpus contains about 6 million functions spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). Data statistics are shown in Table 2.

There are a number of preprocessing steps implemented to remove duplicates, auto-generated code, some cases of copy & pasting, constructors and standard extension methods, test functions, etc. More details on the CodeSearchNet corpus can be found in its paper [2].


| Language | Functions w/ documentation | All functions |
|---|---|---|
| Go | 347,789 | 726,768 |
| Java | 542,991 | 1,569,889 |
| JavaScript | 157,988 | 1,857,835 |
| PHP | 717,313 | 977,821 |
| Python | 503,502 | 1,156,085 |
| Ruby | 57,393 | 164,048 |
| All | 2,326,975 | 6,452,446 |

Table 2: CodeSearchNet size statistics

As mentioned in the related literature section, the CodeSearchNet corpus has already been used in related research work, for example to pre-train the CodeBERT model [11]; thus its feasibility for code embedding research has been demonstrated. We believe that using this existing, preprocessed, multi-language corpus will reduce our work in the data preparation step and will also make it convenient to compare performance with other code embedding models at the evaluation stage.
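The corpus is distributed as compressed JSONL files; a minimal sketch of iterating over it, assuming the per-function field names used in the CodeSearchNet release (e.g. func_name, code) and a hypothetical shard naming scheme, could look as follows.

```python
# A minimal sketch (assumed field names and shard naming) of loading the
# CodeSearchNet Java corpus for further processing.
import gzip
import json
from pathlib import Path

def load_functions(data_dir: str, language: str = "java"):
    """Yield (function name, source code) pairs from CodeSearchNet JSONL shards."""
    for shard in sorted(Path(data_dir).glob(f"{language}_*.jsonl.gz")):  # hypothetical shard names
        with gzip.open(shard, "rt", encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                yield record["func_name"], record["code"]

# Example usage: count Java functions in a split directory.
# print(sum(1 for _ in load_functions("data/java/train")))
```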

4.2 Code Representation

How to represent snippets of code is a core problem in code embedding work; we want to leverage the structural and grammar information of code through the representation. According to the related literature [13], there are three major methods: 1) token sequences, which can be produced with available tools or packages, for example the tokenize package in Python; 2) AST paths, which reveal more grammar and structural information than directly using code tokens; 3) graph representations, such as graphs constructed from control flow and data flow information.
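For Java, a token sequence can be produced in a similar way; a minimal sketch, assuming the javalang package (introduced below), of the token-sequence representation of a Java statement:

```python
# A minimal sketch (assuming the `javalang` package) of the token-sequence
# representation: lex a Java statement into (token type, value) pairs.
import javalang

code = "int count = 0;"
tokens = list(javalang.tokenizer.tokenize(code))

# Each token exposes its lexical value; its type is the token class name.
print([(type(t).__name__, t.value) for t in tokens])
# e.g. [('BasicType', 'int'), ('Identifier', 'count'), ('Operator', '='),
#       ('DecimalInteger', '0'), ('Separator', ';')]
```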

AST and graph representation methods are more promising for making use of structural information and have been used as input to achieve significant performance advantages in prior research. However, graph representation requires more work on constructing the graphs and has not been shown to outperform AST-based methods across various tasks. We therefore plan to use the AST path representation for code snippets in this thesis and propose several methods to improve it.

Figure 2: An example of AST paths generated from a Java method

Figure 2 shows what an AST looks like in Code2seq [7] and Code2vec [1]. The four differently colored paths represent the top-4 most relevant AST paths selected by the attention mechanism. In this thesis, we plan to generate the AST paths using javalang, a Python library that provides a lexer and parser targeting Java source code. Using javalang, AST paths with the value and type information of each node can be generated at the Java file, function, or statement level.
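A minimal sketch (assuming the javalang package; the walk shown is illustrative) of parsing a Java method and collecting the node information from which AST paths can be built:

```python
# A minimal sketch (assuming the `javalang` package) of parsing a Java snippet
# and collecting node type/value information for AST path construction.
import javalang

java_source = """
class Example {
    int countItems() { int count = 0; return count; }
}
"""

tree = javalang.parse.parse(java_source)

# Iterating a javalang tree yields (path, node) pairs; the path records the
# chain of ancestor nodes, which is the raw material for AST path contexts.
for path, node in tree:
    node_type = type(node).__name__                               # e.g. MethodDeclaration
    value = getattr(node, "name", None) or getattr(node, "value", None)
    print(len(path), node_type, value)
```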

In an AST, the paths are still represented by token sequences. In our work, in order to improve the code embedding performance, we will conduct experiments on tokenization. One experiment is adding the type information to the value of the nodes. For example, in the statement int count = 0;, int is a basic type, count is an identifier, = is an operator, and 0 is an integer value. We keep a map between these types and the Greek alphabet, such that each new node contains type information. Constructing new AST path representations that include these new nodes may improve the embedding performance. Another experiment is splitting some specific nodes in AST paths, for example a long method name or an identifier name. Subtokens are already applied in a lot of code embedding work, so in this thesis we will leave programming language keywords intact and produce subtokens according to common heuristic rules (e.g., snake_case or camelCase) or by using subword algorithms such as WordPiece. A sketch of both experiments is shown below.
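A minimal sketch of the two tokenization experiments (the Greek-letter mapping and the function names are hypothetical):

```python
# A minimal sketch (hypothetical mapping and names) of the two experiments:
# 1) tagging node values with a Greek letter that encodes the token type, and
# 2) splitting identifiers into subtokens by camelCase / snake_case heuristics.
import re

# Hypothetical mapping from token types to Greek letters.
TYPE_TO_GREEK = {
    "BasicType": "α",
    "Identifier": "β",
    "Operator": "γ",
    "DecimalInteger": "δ",
}

def tag_with_type(value: str, token_type: str) -> str:
    """Prepend the Greek letter for the token type, e.g. 'count' -> 'β:count'."""
    return f"{TYPE_TO_GREEK.get(token_type, 'ω')}:{value}"

def split_subtokens(name: str) -> list[str]:
    """Split an identifier on snake_case and camelCase boundaries."""
    subtokens = []
    for part in name.split("_"):
        # Insert a split point before every uppercase letter, then lowercase.
        subtokens += re.sub(r"([A-Z])", r" \1", part).split()
    return [t.lower() for t in subtokens if t]

print(tag_with_type("count", "Identifier"))   # β:count
print(split_subtokens("getUserName_v2"))      # ['get', 'user', 'name', 'v2']
```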

4.3 Model and Training

We will make use of the path-attention network mentioned in Section 3 to build our own model. The code, data, and trained models from the prior research work are available on GitHub. We will use the newly generated AST paths as the bag of path-contexts representing the code snippets (or code functions in our case) to generate code context vectors.

Here we propose to use a new training task, Replaced Token Detection (RTD), to train our model instead of the one used in Code2vec. The idea is to train the model with corrupted data in which some nodes or statements inside the code function may be replaced. The objective is to determine whether each token is the original one or one that we replaced. We can use a generator, typically a small masked language model, to produce the replaced tokens. We label each token to indicate whether it is corrupted, and our model is then trained to distinguish real input tokens from plausible but synthetically generated replacements. As a backup plan, we will apply the same training method as in Code2vec, predicting the tag/name of the code function, in case we fail to build the model using the RTD method.
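A minimal sketch (illustrative only; the replacement here is random sampling rather than a trained masked language model) of how such labeled RTD training examples could be constructed:

```python
# A minimal sketch of constructing RTD training examples: some tokens are
# replaced by a stand-in "generator" (random sampling here, a small masked LM
# in practice) and each position is labeled as original (0) or replaced (1).
import random

def make_rtd_example(tokens, vocab, replace_prob=0.15, rng=random):
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            new_tok = rng.choice(vocab)
            corrupted.append(new_tok)
            # As in ELECTRA, a sampled token identical to the original counts as original.
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = ["int", "count", "=", "0", ";"]
vocab = ["int", "long", "count", "index", "=", "0", "1", ";"]
print(make_rtd_example(tokens, vocab))
```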

4.4 Evaluation

The impact of implementing the newly proposed code tokenization plans can be evaluated at an earlier stage. Similar to a natural language processing task, we propose to use fastText as a baseline. fastText is a library for efficient learning of word representations and sentence classification that provides models (skipgram and cbow) for computing word vectors. We can compare the results between: 1) fastText; 2) fastText + type information; 3) fastText + WordPiece algorithms; 4) fastText + subtokens according to the snake_case or camelCase rule.
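A minimal sketch (assuming the fasttext Python package and hypothetical corpus files, one per tokenization variant) of how this comparison could be run:

```python
# A minimal sketch of the baseline comparison: train skip-gram vectors on each
# tokenization variant and inspect nearest neighbours of a token of interest.
import fasttext

variants = {
    "plain":       "corpus_plain.txt",        # hypothetical file names
    "type_tagged": "corpus_type_tagged.txt",
    "wordpiece":   "corpus_wordpiece.txt",
    "subtokens":   "corpus_subtokens.txt",
}

models = {}
for name, path in variants.items():
    # Unsupervised skip-gram training as provided by fastText.
    models[name] = fasttext.train_unsupervised(path, model="skipgram", dim=128)

for name, model in models.items():
    print(name, model.get_nearest_neighbors("count")[:3])
```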

The trained network can be used to perform a downstream task using the code vector. Section 3 lists a number of tasks that we can perform to compare the results between different models. One popular evaluation task is code clone detection, which aims to measure the similarity between two code snippets. In our case, we measure code similarity by measuring the similarity of the code vectors. In order to determine whether a pair of code functions is a clone pair, we set a threshold on the similarity of potential clone pairs: a pair is considered a true clone pair if its similarity score is larger than the threshold, and not a clone pair otherwise. For this task, we may want to introduce another data source, for example BigCloneBench, because it is a well-known code clone detection benchmark with many other approaches we can compare against.
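A minimal sketch (illustrative, using NumPy) of the threshold-based clone decision on code vectors:

```python
# A minimal sketch of threshold-based clone detection on code vectors
# produced by the trained model: cosine similarity above a threshold
# marks the pair as a clone.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_clone_pair(vec_a: np.ndarray, vec_b: np.ndarray, threshold: float = 0.8) -> bool:
    """Report a clone when the similarity of the two code vectors exceeds the threshold."""
    return cosine_similarity(vec_a, vec_b) > threshold
```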

5 RISK ASSESSMENT

At this moment, there are still many uncertainties in this thesis project.

(1) Is the CodeSearchNet dataset appropriate for our model? For some specific code understanding tasks we may need to use other datasets; for example, we may switch to BigCloneBench for a similarity measurement task. There are plenty of open-source corpora available for code embedding research, and the effort of processing them will not be costly.

(2) Is Replaced Token Detection the optimal option for training the network? This is a newly introduced pre-training objective that has not been applied in the majority of code embedding models. The outcome and complexity of applying RTD are uncertain at the moment, so we propose a backup plan of training the model with the method name prediction task.

(3) Is it realistic to implement all components of the design in four months? The design is overall manageable and achievable. We plan to start by only adding type information to the AST representation and to finish most of this work by April, to ensure we are able to deliver a valuable result. After that we will implement the node splitting method, evaluate the performance on more tasks, and compare our model with more state-of-the-art code embedding models.

6 PROJECT PLAN

The timeline of this project is as follows:

Figure 3: Roadmap for the thesis

Preparation (now - 3/31): Finish thesis design, clean the data source if necessary.

Step 1 (now - May): Type information addition method: implement an application that adds type information to the AST representation, evaluate it with fastText, apply it in the code2vec model, and finally evaluate the new model on the method name prediction task.

Step 2 (May - June): Experiment with the node splitting methods and also evaluate this representation with fastText. If time allows, we will evaluate the model on other tasks, such as clone detection, and compare it with more state-of-the-art code embedding models.

Step 3 (June): Write the final thesis and prepare presentation.

REFERENCES

[1] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):40, 2019.

[2] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. 2019.

[3] Gang Zhao and Jeff Huang. Deepsim: deep learning code functional similarity. 2018.

[4] Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal K. Roy, and Mohammad Mamun Mia. Towards a big data curated benchmark of inter-project code clones. 2019.

[5] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. Electra: Pre-training text encoders as discriminators rather than generators. 2020.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018.

[7] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. 2018.

[8] Lutz Buch and Artur Andrzejak. Learning-based recursive aggregation of abstract syntax trees for code clone detection. 2020.

[9] Wenhan Wang, Ge Li, and Bo Ma. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. 2020.

[10] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. Neural code comprehension: A learnable representation of code semantics. pages 3585–3597, 2018.

[11] Zhangyin Feng, Daya Guo, and Duyu Tang. Codebert: A pre-trained model for programming and natural languages. 2020.

[12] Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Learning and evaluating contextual embedding of source code. 2019.

[13] Zimin Chen and Martin Monperrus. A literature study of embeddings on source code. 2019.
