Statistical Lexical Analysis

(1)

Statistical Lexical Analysis

Ton Heijligers

March 31, 2017

Master’s Thesis

Supervisor: dr. Vadim Zaytsev

(2)

Abstract

How does a statistical lexer, created with a sophisticated Natural Language Processing (NLP) algorithm in the field of word segmentation and POS tagging, named Conditional Random Fields, compare to a deterministic one? Can it be used for determining source lines of code, multilingual lexing or as a language detection tool?

Segmenting code into tokens works best when the model learns if a character is a token beginning, token end, token internal or from a single character token as apposed to if a character is a token beginning or not. Labeling code is done by providing 1-gram, 2-gram and 3-gram fragments. Lexing could be done by doing segmentation and labeling at the same time (1-step method) or by first segmenting and then labeling (2-step method). The 1-step method is much more effective.

Creating a monolingual statistical lexer for Python, Java, HTML, Javascript, CSS and Rascal from up to 50 files has an overall f1-score between 0.89 en 0.99. When the statistical lexer is used to decide the lines of code, the results are disappointing. The statistical lexer is often confused about string literals, variable names and comments.

A multilingual statistical lexer for HTML, Javascript and CSS has low f1-scores. The multilingual lexer is not capable of recognising what languages are presented. Further investigation could improve those results.

Title: Statistical Lexical Analysis

Authors: Ton Heijligers, tonheijligers@gmail.com Supervisor: dr. Vadim Zaytsev

Date: March 31, 2017

Master Software Engineering University of Amsterdam

Science Park 904, 1098 XH Amsterdam http://www.science.uva.nl

(3)

1. Introduction

The high availability of open source software gives the possibility to extract knowledge through data mining and machine learning, that can be used for useful software engi-neering tools, as autocompletion [1], code templates [19], translation between natural language and source code [12].

Since software seems to be as natural as spoken language [1], the NLP research could be applied on software as well. This project focuses on applying a state of the art NLP algorithm for word segmentation and Part Of Speech (POS) tagging on software in order to tokenise it. This tokenisation is normally done by a compiler or interpreter and in particular by the lexer [2]. The lexical analysis converts code into a list of tokens with on every token a label that is meaningful for the syntactic analysis. In Table 1.1 a piece of Python code is separated into tokens and the tokens are labeled with a token type, and for the operators there is also an exact token type. This is done by the deterministic Python scanner from the module tokenize [3]. This project will give insight in the feasibility and performance of a statistical lexer.

(6)

Code Token Tokentype Exact tokentype # this is Python code ‘# this is Python code‘ COMMENT COMMENT

def method(self): ‘\n‘ NEWLINE NEWLINE

x = ”%s”% str(1) ‘def‘ NAME NAME

‘method‘ NAME NAME

‘(‘ OP LPAR

‘self‘ NAME NAME

‘)‘ OP RPAR ‘:‘ OP COLON ‘\n‘ NEWLINE NEWLINE ‘ ‘ INDENT INDENT ‘x‘ NAME NAME ‘=‘ OP EQUAL ‘”%s”‘ STRING STRING ‘%‘ OP PERCENT ‘str‘ NAME NAME ‘(‘ OP LPAR ‘1‘ NUMBER NUMBER ‘)‘ OP RPAR “ DEDENT DEDENT “ ENDMARKER ENDMARKER

(7)

2. Motivation

2.1. Monolingual statistical lexing

It is not very likely that the reason of existence for a statistical lexer is to replace the deterministic lexer. The deterministic lexer is a step in the compiler frontend or interpreter of a programming language. In order to be able to develop software, the lexer should be 100% accurate and fast. If for a certain code analysis task lexical information is needed and a deterministic lexer for the code is available, then a statistical lexer could only have value if the statistical lexer is faster and speed is more important than accuracy. But not for all programming languages are deterministic lexers available as a separate tool. In such a case a statistical lexer can help doing code analysis, where the results are as accurate as the accuracy of the statistical lexer. As an example, the statistical lexer could separate comments from code, and count the lines of code (LOC) [9] as a measure for the size of a software system [9]. An important condition is that the creation of the statistical lexer is easier than the creation of a deterministic version.

For a lot of code analysis tasks the lexical information is not enough and syntactical information is needed [9] [12]. The feasibility of a statistical lexer will make it easier to do statistical syntax analysis, where the code is not only tokenised but also converted to an abstract syntax tree. In the compiler frontend the lexical analysis and syntactical analysis are separate steps, mainly for simplicity [2], so that the syntax analysis does not have to be bothered with comments and whitespace. This simplicity-argument is valid for statistical analysis too.

2.2. Statistical multilingual lexing

Another application of statistical lexing is creating a statistical multilingual lexer. Such a lexer can make lexical sense out of a piece of code, without being told the language. The piece can be monolingual, with embedded code or contain fragments of multiple languages. If this lexer can make for multiple languages a difference between comment and code, it can be used to give an indication of the lines of code of a software project constructed with multiple languages.

When the statistical multilingual lexer is learning from examples with awareness from what language the examples are, it can be used as a language recognition tool. It could be able to recognise if a piece of code is monolingual or contains multiple languages, and what the languages are. Embedded languages may distort the recognition of a programming language by generative methods, like n-gram and Naive Bayes [20]. If there were a tool to recognise if a piece of code contains embedded code, the effect of

(8)

(9)

3. Research Method

1. Create statistical lexer with state of the art NLP techniques for Python, Java, HTML, CSS and Javascript, that have deterministic lexers as separate tools avail-able, and measure accuracy, precision and f1-score.

2. Find optimum for available variables in the segmentation and labeling techniques. 3. Create a statistical lexer for Rascal and determine lines of source code as proof of

concept.

4. Create a statistical multilingual lexer for HTML, CSS and Javascript as proof of concept.

5. Create a language recognition tool from a statistical multilingual lexer for HTML, CSS and Javascript as proof of concept.

(10)

4. Background and Context

In 2012 Hindle et al. [1] presented the hypothesis that software is as natural as English or any other spoken language. They support this vision by showing that the perplexity-measure for source code is not higher, and often much lower, than for English text bodies. This makes it very likely that other research done in the field of Natural Language Processing (NLP) is applicable to source code and tools to do statistical lexing, shallow parsing or statistical parsing become more and more indispensable.

The NLP word segmentation research was for a long time pushed by improving seg-menting Asian languages like Chinese and Japanese, since these language do not use explicit word delimiters [8] [23]. Word segmentation recently got a new impulse, be-cause of the need to make sense out of language used on the web. In tweets words are concatenated or abbreviated, for example ‘w84u’, meaning ‘wait for you’, should be seg-mented as [w8, 4, u]. Also in search queries, people type fast and forget spaces [22]. One way to divide the current state of research is supervised versus unsupervised segmenta-tion [8] [23] [22]. The words in the supervised models are called In Vocabulary (IV). Due to the constant change of a language there will always be Out Of Vocabulary (OOV) words. The unsupervised segmentation tries to segment those as well and amplify the supervised segmentation. Especially when the group of OOV words is big, then the unsupervised segmentation becomes more important [22].

From the field of NLP CRF emerges as the best way to segment words for Chi-nese [8] [23] and to tag part of speech [8] [10]. CRF was introduced in 2001 by Lafferty et al. [10]. This algorithm is not only optimal for word segmentation or POS tagging, but for labeling sequence data in general. Hidden Markov Model (HMM) was until 2001 considered the best algorithm [10].

The proposed CRF algorithm computes the conditional probabilities as follows:

P (Y | X) = argmaxΛ ( 1 Z(X)exp( X i λjFj(Y, X)) ) , with λ ∈ Λ (4.1) where Z(X) is called a normalising factor and F(Y, X) is the collection of feature functions. The learning process selects the set of feature weights from Λ that maximises the label sequence probability [10].

4.1. Segmentation

In this research a piece of source code is seen as a sequence of unicode characters, Cunicode, according to:

(11)

Segmentation is then labeling the characters with a label from a label set in LSeg =

{LSeg2, LSeg4}, where LSeg2= {B, I} and LSeg4 = {B, I, E, S}:

(c1, l1)(c2, l2)...(ck, lk), ∀ci ∈ C, lj ∈ LSeg2, k ∈ N (4.3)

or

(c1, l1)(c2, l2)...(ck, lk), ∀ci ∈ C, lj ∈ LSeg4, k ∈ N (4.4)

The labels are abbreviations for begin (B), internal (I), end (E) and single (S). Zhao et al. showed [23] that for Chinese word segmentation with four labels works better than two labels. For this research the effectivity of segmentation with LSeg2 is compared with

the one with LSeg4 by subsuming as in Equation 4.5.

f : LSeg4 → LSeg2, f (x) =

(

B, x = B ∨ x = S

I, x = I ∨ x = E (4.5)

Let YSeg denote the chunk label sequence, and XSeg denote the corresponding code

observation sequence. YSeg2 and YSeg4 are used to specify to what label set is inferred,

LSeg2 or LSeg4. The statistical segmenter needs to calculate the conditional probability

P ( YSeg | XSeg) and chooses the label sequence with the highest probability.

The observed pieces of code for this research are complete files and not lines. The reason for this decision is that a newline character is not always segmenting two tokens. For example a multiline comment that contains a newline character is for the lexer one token.

4.2. Labeling

Labeling a piece of code is creating a sequence from the set of tokens, T, where a token is a piece of code as described in Equation 4.2. A comment or a method name causes the vocabulary to be an uncountable set. Developers come up with their own names.

Different tokenisers come with different label sets. One tokeniser labels an integer to NUMBER and another to INTEGER. Let LLabbe the set of labeling label sets. In order

to be able to compare the results of the different tokenisers, a small base set is formed, LLab base ∈ LLab.

LLab base { COM M EN T, LAY OU T, N AM E, N EW LIN E, N U M BER, OP, ST RIN G }

(12)

Labeling is then adding a label to a token as follows:

code = t1t2...tk, ∀ti∈ T, k ∈ N (4.6)

(t1, l1)(t2, l2)...(tk, lk), ∀ti ∈ T, lj ∈ ls ∈ LLab, k ∈ N (4.7)

The statistical labeler needs to calculate P ( YLab | XLab), where X is an observation

of a list of tokens from T. The statistical labeler chooses the label sequence with the highest probability.

4.3. Lexing

With lexing in this research is meant that every character in the text gets a label from the cartesian product of a segmentation label set (∈ LSeg) and a labeling label set (∈ LLab),

called LLex, according to:

(c1, l1)(c2, l2)...(ck, lk), ∀ci ∈ Cunicode, lj ∈ LLex, k ∈ N (4.8)

An example is LLex 4 base= LSeg4

×

LLab base. A newline token is always a single

char-acter so (B, NEWLINE), (I, NEWLINE) and (E, NEWLINE) can be removed from the set. A newline character in a multiline comment is labeled in general as (I, COMMENT). There are two ways to create a statistical lexer, the one-step and two-step method. The one-step method learns one model from the examples how to give a label from LLex to a single character. When lexing code with the learned model it applies the

labels directly on the characters. The one-step method uses CRF to find the optimal P ( YLex| XLex)

The two-step method first learns from the examples a model how to segment and a model how to label. When inferring it on code the two-step lexer first segments the code and after that it labels these segmented tokens. As the last step the typed tokens are split into characters again. The two-step method uses CRF to find the optimal for P ( YSeg | XSeg) and P ( YLab| XLab)

Both methods are compared to find out if the methods structurally yield different f1-scores. Perhaps one of the methods is better than the other.

(13)

5. Research

5.1. Creating learning data

This research makes use of open source software available on the website GitHub.com. The experiments are executed on repositories containing Python, Java, HTML, CSS and Javascript source code. For every language 30 repositories, that have more than 1000 stars, are randomly picked, assuming that this high rating leads to the selection of representative code. Since 30 CSS repositories yielded less than 50 files, 100 repositories were selected. See Appendix A for the selected repositories.

The selected git projects are cleaned by throwing away all files that do not end with .py for Python, .java for Java, .html for HTML, .css for CSS and .js for Javascript to enhance the chance that the file contains the objected code. Empty files are removed as well. See Table 5.1 for the number of files per language.

The Python code is tokenised with the Python scanner from the module tokenize [3]. This breaks code into tokens and with every token comes a token type, and for the operators, there is also an exact token type. See for an example Table 1.1.

To convert those tokens into learning data the white space tokens of type LAYOUT have to be added to the list. On the other hand the tokens INDENT and DEDENT are removed from the list for simplicity, because they are assigned to an empty string. Leaving the indentation tokens in, would double the length of the token list, where all characters should be surrounded by an empty string token, that can be typed as INDENT, DEDENT or LAYOUT. See Tables 5.2, 5.3 and 5.4 for examples of the learning data.

The learning data for the languages Java, HTML, CSS and Javascript is created as from Python. See Table 5.5 what modules are used. How the label set of the tokeniser javalang is converted to LLab base is shown in Figure B.1. For the other languages see

Appendix B.

Language Number of files

Java 3785

Python 8733

HTML 6019

CSS 531

Javascript 4688

Table 5.1.: Number of files containing a specific language found in the repositories, Ap-pendix A

(14)

Code ‘x‘ ‘ ‘ ‘=‘ ‘ ‘ ‘”‘ % ‘s‘ ‘”‘ ‘%‘ ‘ ‘ ‘s‘ ‘t‘ ‘r‘ ‘(‘ ‘1‘ ‘)‘

Seg S S S S B I I E S S B I E S S S

Table 5.2.: Segmented piece of Python snippet with LSeg4

Code ‘x‘ ‘ ‘ ‘=‘ ‘ ‘ ‘”%s”‘ ‘%‘ ‘ ‘

Lab NAME LAYOUT OP LAYOUT STRING OP LAYOUT

Code ‘str‘ ‘(‘ ‘1‘ ‘)‘

Lab NAME OP NUMBER OP

Table 5.3.: Labeled piece of Python snippet with LLab base

Code ‘x‘ ‘ ‘ ‘=‘ ‘ ‘ ‘”‘

Lex (S, NAME) (S, LAYOUT) (S, OP) (S, LAYOUT) (B, STRING)

Code ‘% ‘ ‘s‘ ‘”‘ ‘%‘ ‘ ‘

Lex (I, STRING) (I, STRING) (E, STRING) (S, OP) (S, LAYOUT)

Code ‘s‘ ‘t‘ ‘r‘ ‘(‘ ‘1‘

Lex (B, NAME) (I, NAME) (E, NAME) (S, OP) (S, NUMBER)

Code ‘)‘

Lex (S, OP)

Table 5.4.: Lexed piece of Python snippet with LLex 4 base

Language Tokenizer name Label set name Python tokenize [3] LLab base

Java javalang [4] LLab javalang

HTML html-tokenizer [7] LLab html−tokenizer

CSS css-tokens [16] LLab css−tokens

Javascript js-tokens [17] LLab js−tokens

(15)

• KEY W ORD ⇒ N AM E • M ODIF IER ⇒ N AM E • BASICT Y P E ⇒ N AM E • LIT ERAL ⇒ ST RIN G • IN T EGER ⇒ N U M BER

• DECIM ALIN T EGER ⇒ N U M BER

• OCT ALIN T EGER ⇒ N U M BER

• BIN ARY IN T EGER ⇒ N U M BER

• HEXIN T EGER ⇒ N U M BER

• BIN ARY IN T EGER ⇒ N U M BER

• F LOAT IN GP OIN T ⇒ N U M BER

• DECIM ALF LOAT IN GP OIN T ⇒ N U M BER

• HEXF LOAT IN GP OIN T ⇒ N U M BER • BOOLEAN ⇒ N AM E • CHARACT ER ⇒ ST RIN G • ST RIN G ⇒ ST RIN G • N U LL ⇒ N AM E • COM M EN T ⇒ COM M EN T • J AV ADOC ⇒ COM M EN T • N EW LIN E ⇒ N EW LIN E • LAY OU T ⇒ LAY OU T • SEP ARAT OR ⇒ OP • OP ERAT OR ⇒ OP • AN N OT AT ION ⇒ OP • IDEN T IF IER ⇒ N AM E

(16)

5.2. Baseline performance

To determine if the CRF algorithm has added value, a baseline performance has to be de-termined for segmentation, labeling and lexing. This baseline performance is dede-termined on the Python corpus.

5.2.1. Baseline Segmentation

The vocabulary of the Python corpus, that is a subset of Cunicode, has around 180

ele-ments. From the learning examples per character is counted how often it is categorised to a label in LSeg2. Then the most used label is assigned to the character in the

base-line model. This yields, with 10-fold cross validation, a f1-score of 0.58 correct word beginnings.

5.2.2. Baseline Labeling

To determine the baseline performance for labeling, a baseline model is learned so that every token has a counter for every label ∈ LLab base. Because of the uncountability

of set T, this model can never be complete. In case the model does not have a most used type for a certain token, the overall most used label is selected. With the Python corpus, it turns out to be STRING. See Table 5.6 for the baseline results. This baseline implementation is doing already really good, so the statistical labeler has to be very good as well. The main challenge that emerges from this baseline model is the correct classification of COMMENT, NAME and STRING, when segmented correctly.

5.2.3. Baseline Lexing

The baseline score for lexing is determined for both the one-step and two-step method. Both methods use the label set LLex 2 base. See the baseline scores of the one-step

method in Table 5.7. The f1-score for the two-step method is a bit better in recognising (I, COMMENT), 0.242 versus 0.080, and (I, STRING), 0.392 versus 0.218, but on 7 other labels it has a performance of 0.

precision recall f1-score support

COMMENT 1.000 0.386 0.557 85,119 LAYOUT 1.000 1.000 1.000 1,902,555 NAME 1.000 0.909 0.952 1,990,948 NEWLINE 1.000 1.000 1.000 869,930 NUMBER 1.000 0.880 0.936 178,092 OP 1 1.000 1.000 2,587,250 STRING 0.565 1.000 0.722 331,947 avg/ total 0.982 0.968 0.970 7,945,841 Table 5.6.: Baseline results of labeling Python

(17)

precision recall f1-score support (B, COMMENT) 0.767 1 0.868 85,119 (I, COMMENT) 0.600 0.043 0.080 3,402,383 (B, LAYOUT) 0.000 0.000 0.000 1,902,555 (I, LAYOUT) 0.612 1.000 0.758 5,814,334 (B, NAME) 0.376 0.015 0.030 1,990,948 (I, NAME) 0.528 0.952 0.679 10,833,347 (B, NEWLINE) 0.884 1 0.938 869,930 (B, NUMBER) 0.339 0.312 0.325 178,092 (I, NUMBER) 0.207 0.178 0.191 398,146 (B, OP) 0.830 0.984 0.900 2,587,250 (I, OP) 0.000 0.000 0 28,411 (B, STRING) 0.000 0.000 0 331,947 (I, STRING) 0.438 0.145 0.218 8418,249 avg/ total 0.510 0.574 0.540 36,840,714

Table 5.7.: Results of baseline score for lexing code, 1-step method

5.3. Feature Templates

5.3.1. Segmentation

The feature templates for segmentation, as suggested for Chinese word segmentation [23], from which the feature functions of equation 4.1 are created, contain 1-gram, 2-gram, 3-gram and jump features, shown in Table 5.8. For this research trigram features are added.

5.3.2. Labeling

When for labeling 1-gram, 2-gram and 3-gram features as in Table 5.8 are used, but with tokens instead of characters, the STRING, COMMENT and NUMBER are confused a lot with NAME. It is the same confusion the baseline labeling had. This confusion can be mitigated with the addition of two more features: first characters and numerical majority as in Table 5.9.

type feature function

1-gram C−1, C0, C1 The previous, current and next character

jump C−1/C1 The previous and next characters

2-gram C−1/C0, C0/C1 The previous (next) and current characters

3-gram C−2/C−1/C0, C−1/C0/C1, C0/C1/C2 The previous (next) and current characters

(18)

type feature function/example

1-gram T−1, T0, T1 The previous, current and next token

jump T−1/T1 The previous and next Tokens

2-gram T−1/T0, T0/T1 The previous (next) and current tokens

3-gram T−2/T−1/T0, T−1/T0/T1, T0/T1/T2 The previous (next) and current tokens

first characters T0[: 1], T0[: 2] ‘#comment’=> [‘#’, ‘#c’]

numerical majority True if nr of num. chars > ‘3.4’=> True non-num. chars else False ‘4or’=> False Table 5.9.: Feature templates for CRF labeling

5.3.3. Lexing

For one-step lexing the same templates are used as for segmentation. Two-step lexing is built from segmentation and labeling and therefore does not need a separate feature template.

5.4. CRF algorithm selection

For the implementation of the CRF algorithm is used Pycrfuite [18], which is a shell around CRFSuite [13]. Pycrfuite/CRFSuite provides 5 algorithms to maximise: Limited-memory BFGS (LBFGS) [14], Stochastic Gradient Descent with L2 regularization (L2SGD) term [15], Averaged Perceptron (AV) [5], Passive Aggressive (PA) [6] and Adaptive Reg-ularization Of Weight Vector (AROW) [11].

To determine what is the best algorithm for the job, all algorithms are used on a set of 20 random files. LBFSG can be tuned by setting the coefficient for L1 and L2 regularisation. LBFSG is then applied on the files with all possible combinations of the values c1, c2 ∈ {0.0, 0.2, 0.4, ..., 2.0}. L2SGD can be tuned on the L2 regularisation. All values c2 ∈ {0.0, 0.2, 0.4, ..., 2.0} are used. The feature templates are for segmentation and one-step lexing fixed on 1-gram, 2-gram and jump. The features for labeling are fixed in 1-gram, 2-gram, jump, first characters and numerical majority. The algorithms are then compared by a 4-fold cross validated f1-score. It turns out that for segmentation and one-step lexing LBFSG can reach the highest accuracy, but the coefficients are not converging to a fixed value. Tuning these values can lead to overfitting, and so in further experiments the default values C1: 0.0 and C2: 1.0 are used for LBFSG. For labeling PA reaches structurally the highest f1-score.

5.5. Minimum Number of Files

The computational cost increases when learning from more files. This raises the question with how many files one can get already a reasonable result. This experiment is executed for every number of files between and including 4 and 50. The files that are bigger than 1MB are excluded from this experiment, because bigger files can cause the CPU to be

(19)

• unigram = 1-gram

• unigram+ = 1-gram + jump

• bigram = 1-gram + jump + 2-gram

• trigram = 1-gram + jump + 2-gram + 3-gram

Figure 5.2.: Feature sets, built from feature templates in Table 5.8 and 5.9, see section 5.3

overloaded when learning from around 50 files. The CPU that is used is 2,9 GHz Intel Core i7, memory 16 GB 1600 MHz DDR3. The experiment randomly picks the objected number of files. Then it calculates the weighted f1-score, with 4-fold cross validation, for the four different feature sets, called: unigram, unigram+, bigram and trigram as in Figure 5.2. For labeling and lexing are the first characters and numerical majority templates added to all sets. This results in 91 plots, shown in Appendix C, having the elements from {Python, Java, HTML, CSS, Javascript}, LSeg and LLab as variables.

See Figure 5.3 as an example of such a plot. Higher n-gram features for the same files are not always a recipe for higher f1-scores. By manually inspecting those cases where trigram results in the lowest f1-score, the set of files contain a lot of comments or tokens of type STRING and the f1-score suffers from this confusion more for higher n-grams. When the number of files is low, part of the confusion is solved by increasing the number of files, so that the model has more examples to learn from. The f1-score for trigram features becomes then in most cases higher than bigram, unigram+ and unigram. But it still occurs that the f1-score for trigram is the lowest when there are 30 or 40 files to learn from. In these cases the confusion could in theory be solved by adding even higher n-gram features. Further investigation is needed to support this idea. A downside of higher n-gram is that they come with the price of a substantially higher computational cost.

From the plots is extracted how many files are needed to get a result that has converged to a f1-range with a distance smaller 0.1. Formally described in Equations 5.1 - 5.4.

fk(n) = f 1(n, k), n ∈ [4, 50], k ∈ {unigram, unigram+, bigram, trigram} (5.1)

mk(n) = min{fk(n), fk(n + 1), .., fk(50)} (5.2)

Mk(n) = max{fk(n), fk(n + 1), .., fk(50)} (5.3)

(20)

Figure 5.3.:

One-step lexing Language: Python Segmentation set: LSeg2

Label set: LLab base

Algorithm: lbfgs C1: 0.0

C2: 1.0

Cross-validation: 4

The tables 5.10, 5.11 and 5.12 show for every of the plots the best range for the trigram feature set. Table 5.13 shows the averages and sample standard deviations for the 91 plots. It shows that for labeling are less files needed than for segmentation. Lexing needs even more files to get a well converged result. The deviation is high, so it is not possible to extract a trustworthy minimum number of files. It would be interesting to enlarge the range until 100 files, but this is computationally very expensive. And if for the statistical

(21)

lexer more than 50 files have to be manually tokenised, for some developers creating the deterministic version becomes more attractive.

Language Segmentation set Lowest number of files for best range Best range for trigram feature set Python LSeg2 21 0.829 - 0.965 Python LSeg4 25 0.919 - 0.966 Java LSeg2 19 0.914 - 0.965 Java LSeg4 7 0.891 - 0.984 HTML LSeg2 26 0.964 - 0.998 HTML LSeg4 4 0.973 - 0.999 CSS LSeg2 7 0.940 - 0.999 CSS LSeg4 4 0.904 - 0.998 Javascript LSeg2 27 0.894 - 0.993 Javascript LSeg4 45 0.951 - 0.980

Table 5.10.: Best f1-score range for segmentation, interpreted from plots in Appendix C.1

(22)

Language Labeling set Lowest number of files for best range Best range for trigram feature set

Python LLab base 4 0.962 - 1.000

Java LLab base 4 0.968 - 1.000

Java LLab javalang 4 0.923 - 0.999

HTML LLab base 48 0.996 - 1.000

HTML LLab html−tokenizer 40 0.966 - 1.000

CSS LLab base 4 0.944 - 1.000

CSS LLab css−tokens 5 0.975 - 1.000

Javascript LLab base - 0.764 - 1.000

Javascript LLab js−tokens 6 0.963 - 1.000

(23)

Language Lexing Seg- menta-tion set

Labeling set Lowest number of files for best range Best range for trigram feature set

Python 1-step LSeg2 LLab base 21 0.861 - 0.958

Python 2-step LSeg4 LLab base - 0.712 - 0.928

Java 1-step LSeg2 LLab base 37 0.918 - 0.986

Java 1-step LSeg2 LLab javalang 29 0.899 - 0.985

HTML 1-step LSeg2 LLab base 5 0.973 - 0.998

HTML 1-step LSeg2 LLab html−tokenizer 7 0.950 - 1.000

CSS 1-step LSeg2 LLab base 39 0.916 - 0.996

CSS 1-step LSeg2 LLab css−tokens 48 0.909 - 0.996

CSS 1-step LSeg4 LLab css−tokens - 0.992 - 0.992

Javascript 1-step LSeg2 LLab base 49 0.705 - 0.980

Javascript 1-step LSeg4 LLab base - 0.825 - 0.981

Javascript 1-step LSeg2 LLab js−tokens 48 0.859 - 0.964

(24)

Table 5.12.: Best f1-score range starting for lexing, interpreted from plots in Appendix C.3

unigram unigram+ bigram trigram

Avg nr. of files σ Avg best range Avg nr. of files σ Avg best range Avg nr. of files σ Avg best range Avg nr. of files σ Avg best range Segmen-tation 28.3 18.8 0.839 -0.928 20.7 14.3 0.839 -0.975 16.5 16.1 0.928 -0.980 18.5 13.2 0.918 -0.985 Labe-ling 9.0 19.2 0.979 -1.000 4.0 19.5 0.979 -1.000 9.0 14.5 0.977 -1.000 6.0 20.9 0.963 -1.000 1-step lexing 32.6 17.0 0.884 -0.976 33.4 17.5 0.894 -0.984 32.9 17.4 0.884 -0.983 32.2 17.4 0.891 -0.985 2-step lexing 45.9 5.1 0.626 -0.810 42.9 9.1 0.626 -0.912 42.2 7.6 0.815 -0.930 38.4 12.4 0.831 -0.944 Table 5.13.: Average and Sample Standard Deviation for the number of files needed to do

Segmentation, Labeling and Lexing, interpreted from the plots in Appendix C

(25)

6. Results

6.1. Segmentation

Table 5.10 shows that for Python, Java, HTML and Javascript LSeg4 is performing

bet-ter than LSeg2. CSS is doing a bit worse for LSeg4. This overall result was expected

because of the results for Chinese word segmentation [23]. For Chinese word segmen-tation word beginnings were recognised better, when characters are not only labeled as ’word beginning’ and ’not word beginning’ (LSeg2), but as ’word beginning’, ’word

ending’, ’character surrounded by other characters of word’ and ’single character word’ (LSeg4). Since tokens are the words for a source code language and Hindle et al. showed

that source code is as natural as English [1], one can expect better results with LSeg4

than with LSeg2. The usage of the trigram feature set pays off, because they are doing

in all cases better than the bigram and unigram feature sets. The usage has added value because the baseline implementation for segmentation was 0.58 and CRF can reach the f1-score range 0.89 - 0.99.

6.2. Labeling

The average baseline f1-score was 0.97. Table shows 5.11 shows ranges with a lower bound that are lower than 0.97, but when more than 40 files are used the f1-score is for all languages but Javascript structurally higher than the baseline implementation. The confusion that the baseline implementation has for COMMENT, NAME and STRING is gone as you can see in the Table 6.1 and 6.2, showing score per token type. The intuition would expect that it is easier to label to LLab base than a bigger label set, but

Table 5.11 does not show that.

precision recall f1-score support COMMENT 1.0000 1.0000 1.0000 1,123 LAYOUT 0.9999 0.9995 0.9997 17,288 NAME 0.9961 0.9984 0.9972 18,988 NEWLINE 1.0000 1.0000 1.0000 8,652 NUMBER 0.9718 1.0000 0.9857 1,032 OP 1.0000 0.9994 0.9997 21,618 STRING 0.9996 0.9796 0.9895 2,644 avg/ total 0.9985 0.9985 0.9985 71,345

(26)

precision recall f1-score support ANNOTATION 1.0000 1.0000 1.0000 121 BASICTYPE 1.0000 0.9093 0.9525 529 BOOLEAN 1.0000 1.0000 1.0000 139 COMMENT 1.0000 1.0000 1.0000 271 DECIMALFLOATINGPOINT 0.7000 0.3182 0.4375 44 HEXINTEGER 0.0000 0.0000 0.0000 1 IDENTIFIER 0.9872 0.9954 0.9913 13,141 JAVADOC 0.9969 1.0000 0.9984 320 KEYWORD 0.9930 0.9694 0.9811 2,650 LAYOUT 1.0000 1.0000 1.0000 54,502 MODIFIER 0.9980 0.9922 0.9951 1,025 NEWLINE 1.0000 1.0000 1.0000 7,964 NULL 0.9915 1.0000 0.9957 233 OPERATOR 0.9997 0.9917 0.9957 3,137 SEPARATOR 0.9996 0.9998 0.9997 17,820 STRING 0.9896 0.9761 0.9828 586 avg/ total 0.9973 0.9973 0.9972 102,910

Table 6.2.: Results of CRF labeling Java to LLab javalang, 50 files, 5-fold cross validation

6.3. Lexing

The results for lexing in Table 5.12 and 5.13 show clearly that one-step lexing has higher f1-scores than two-step lexing. When labeling is used in the two-step method, it benefits from the templates first characters and numerical majority, but suffers more from mistaken word beginnings in the segmentation step. The computational cost, however, is much higher for one-step lexing than two-step lexing. The duration can be up to 8 times longer. In one-step lexing the CRF algorithm has to determine probabilities for (|LSeg| × |LLab|)2 transitions. In 2-step lexing |LSeg|2+ |LLab|2 transitions has to be

calculated, which is for both LLab baseand LLab javalangless than the one-step transitions.

Table 5.12 shows also that the usage of LSeg4yields higher f1-scores than LSeg2for both

one-step and two-step lexing. See Table 6.3 and 6.4 for a score per token type. Though the confusion for COMMENT, STRING and NAME was solved by the templates first characters and numerical majority for labeling, it still exists for lexing.

precision recall f1-score support (B,COMMENT) 0.9740 0.9266 0.9497 1213 (E,COMMENT) 0.8660 0.7988 0.8310 1213 (I,COMMENT) 0.8940 0.8033 0.8462 50748 (S,COMMENT) 1.0000 0.4762 0.6452 63 (B,LAYOUT) 0.9814 0.9199 0.9497 1661 (E,LAYOUT) 0.9915 0.9163 0.9524 1661

(27)

(I,LAYOUT) 0.9844 0.9257 0.9541 13802 (S,LAYOUT) 0.9633 0.9746 0.9689 75261 (B,NAME) 0.9664 0.9630 0.9647 22154 (E,NAME) 0.9676 0.9643 0.9659 22154 (I,NAME) 0.9707 0.9543 0.9624 101010 (S,NAME) 0.7675 0.6945 0.7292 851 (S,NEWLINE) 0.9930 0.9726 0.9827 10530 (B,NUMBER) 0.9498 0.5638 0.7076 470 (E,NUMBER) 0.9498 0.5638 0.7076 470 (I,NUMBER) 0.8774 0.2386 0.3752 570 (S,NUMBER) 0.8873 0.7524 0.8143 1256 (B,OP) 0.9747 0.8221 0.8919 281 (E,OP) 0.9747 0.8221 0.8919 281 (S,OP) 0.9603 0.9491 0.9547 27354 (B,STRING) 0.9410 0.9413 0.9411 3202 (E,STRING) 0.9475 0.9475 0.9475 3202 (I,STRING) 0.8534 0.9671 0.9067 67985 avg / total 0.9387 0.9372 0.9366 407392

Table 6.3.: Results of CRF one-step lexing Python to LLex 4 base, 50 files, 5-fold cross

validation

(S,ANNOTATION) 1.0000 0.7205 0.8375 254 (B,BASICTYPE) 0.9918 0.8297 0.9036 1022 (E,BASICTYPE) 0.9918 0.8297 0.9036 1022 (I,BASICTYPE) 0.9958 0.8396 0.9111 2544 (B,BOOLEAN) 0.9856 0.8012 0.8839 171 (E,BOOLEAN) 0.9856 0.8012 0.8839 171 (I,BOOLEAN) 0.9886 0.8184 0.8955 424 (B,COMMENT) 1.0000 0.9632 0.9813 761 (E,COMMENT) 0.7967 0.7004 0.7455 761 (I,COMMENT) 0.7870 0.9443 0.8585 113318 (B,DECIMALFLOATINGPOINT) 0.0000 0.0000 0.0000 34 (E,DECIMALFLOATINGPOINT) 0.0000 0.0000 0.0000 34 (I,DECIMALFLOATINGPOINT) 0.0000 0.0000 0.0000 35 (B,DECIMALINTEGER) 0.5500 0.0651 0.1164 169 (E,DECIMALINTEGER) 0.5238 0.0651 0.1158 169 (I,DECIMALINTEGER) 0.1111 0.0068 0.0128 147 (S,DECIMALINTEGER) 0.5514 0.6747 0.6068 501 (B,HEXINTEGER) 1.0000 0.4917 0.6592 600

(28)

(E,HEXINTEGER) 0.9728 0.2383 0.3829 600 (I,HEXINTEGER) 0.9935 0.5142 0.6776 4726 (B,IDENTIFIER) 0.9585 0.9359 0.9471 24053 (E,IDENTIFIER) 0.9566 0.9346 0.9455 24053 (I,IDENTIFIER) 0.9737 0.9077 0.9396 150512 (S,IDENTIFIER) 0.9502 0.5738 0.7155 732 (B,KEYWORD) 0.9923 0.9826 0.9874 4473 (E,KEYWORD) 0.9923 0.9826 0.9874 4473 (I,KEYWORD) 0.9934 0.9822 0.9878 11807 (S,LAYOUT) 0.9780 0.9762 0.9771 95439 (B,MODIFIER) 0.9954 0.9827 0.9890 2191 (E,MODIFIER) 0.9954 0.9827 0.9890 2191 (I,MODIFIER) 0.9955 0.9818 0.9886 9247 (S,NEWLINE) 0.9645 0.9842 0.9743 12922 (B,NULL) 0.9912 0.9684 0.9797 348 (E,NULL) 0.9912 0.9684 0.9797 348 (I,NULL) 0.9912 0.9684 0.9797 696 (B,OPERATOR) 0.9828 0.7374 0.8426 773 (E,OPERATOR) 0.9828 0.7374 0.8426 773 (I,OPERATOR) 0.0000 0.0000 0.0000 6 (S,OPERATOR) 0.9407 0.7125 0.8108 3739 (S,SEPARATOR) 0.9787 0.9704 0.9745 35093 (B,STRING) 1.0000 0.7580 0.8624 624 (E,STRING) 0.9959 0.7837 0.8771 624 (I,STRING) 0.9780 0.4025 0.5703 7280 avg / total 0.9327 0.9251 0.9238 519860

Table 6.4.: Results of CRF one-step lexing Java to LLex 4 javalang, 50 files, 5-fold cross

validation

6.4. Lines of Comment

The statistical lexer can be used to determine LOC. A line is a line of code, when there is at least one non-whitespace or non-comment token on that line. For this experiment a statistical lexer is created from 50 files, one-step lexing and LSeg4. The model is

tested on 1000 files that weren’t used for the creation of the model, with 4-fold cross validation. Table 6.5 shows the results. The f1-score for comment is the combined scores for (B,COMMENT), (I, COMMENT), (E,COMMENT) and (S,COMMENT). The f1-score for source code is the combination of the tokens that are not in the set { (B,COMMENT), (I, COMMENT), (E,COMMENT),(S,COMMENT), (S,LAYOUT), (S,NEWLINE)}. The columns LOC actual and LOC predicted show the summed num-ber of lines from the four folds. The adaption of the definition of LOC on HTML yields 0

(29)

actual lines of code, because in the used files the comment tokens are always surrounded with code tokens. The difference between the actual lines of comment and the predicted lines of comment is in all cases substantial. This can be explained by the confusion the lexing has for the labels COMMENT, NAME and STRING.

(30)

Language LLex f1-score

LOC LOC Difference

actual predicted (%)

Python LLex 4 base

comment 0.9461 94041 91035 -3 source code 0.9674 729289 732414 0 total 0.9646 1005756 Java LLex 4 base comment 0.9138 40495 66387 64 source code 0.9434 548599 522950 -5 total 0.9332 802151 LLex 4 javalang comment 0.8842 1715 2191 28 source code 0.9762 575182 574734 0 total 0.9739 793078 HTML LLex 4 base comment 0.9014 0 0 0 source code 0.9979 1422249 1422249 0 total 0.9967 1588887 LLex 4 html−tokenizer comment 0.9099 0 0 0 source code 0.9919 651113 651136 0 total 0.9908 1120162 CSS LLex 4 base comment 0.7000 10599 21608 104 source code 0.9743 669325 659696 -1 total 0.9646 768239 LLex 4 css−tokens comment 0.7282 10494 10532 0 source code 0.9633 689527 689519 0 total 0.9553 794012 Javascript LLex 4 base comment 0.6722 39818 49992 26 source code 0.5182 542261 532601 -2 total 0.5391 660815 LLex 4 js−tokens comment 0.8246 40138 87364 118 source code 0.9510 540212 493392 -9 total 0.9390 655370 Rascal LLex 4 base comment 0.9823 34 32 -6 source code 0.9949 5139 5141 0 total 0.9956 7235 LLex 4 rascal comment 0.9721 34 32 -6 source code 0.9973 5139 5141 0 total 0.9966 7235

Table 6.5.: Result for determining LOC. The model has learned from 50 files and is tested on other 1000 files, with 4-fold cross validation. The f1-scores are normalised to results with only comment tokens and source code tokens. The rascal model is tested on only 27 files.

(31)

6.5. Proof of Concept 1: Statistical Lexing of Rascal

The lexing results for Python, Java, HTML, CSS and Javascript make it assumable that a statistical lexer for Rascal, can reach a f1-score between 0.89 and 0.99, when used one-step lexing, with LSeg4 and 50 files for learning. The lexer should be easier to produce

than creating a deterministic lexer. A Rascal project [21], written by master students Software Engineering, consisting of 77 files, is chosen for this experiment. To verify how well the statistical lexer performs, all the files have to be tokenised by hand. The code is first tokenised by the java lexer javalang and then corrected manually. Rascal syntax has a Java-like syntax [11] and therefore javalang does most of the work. See Table 6.6 for some similarities and differences. Let’s say that the rascal code is lexed to the label set LLab rascal as in Equation 6.1.

LLab rascal= LLab javalang∪ {LOCAT ION } (6.1)

Javalang lexes ’module’, ’list’, ’set’, ’str’, ’test’, ’loc’ e.a. as an IDENTIFIER, but in Rascal those are of type KEYWORD. Javalang makes ’::’ an operator, but it is in Rascal a separator. Then ’<-’ is one operator in Rascal and not two, as Javalang thinks. Rascal has a special syntax for instantiating a location of type loc: ’|project://xxxx|’. This is manually adjusted to a new type: LOCATION.

The first step of creating the statistical rascal lexer is lexing the code with an available lexer from another language, in this case the deterministic Java lexer javalang. This lexer has an f1-score around 0.80 for lexing Rascal. The second step is inspecting the lexed code and correcting it manually, to the labels from LLab rascal. Manually correcting

77 files took around 6 hours. Then a statistical lexer can be made for the label sets: LLex 4 rascal and LLex 4 base. The token types are converted from LLab rascal to LLab base

as in Figure B.1. The new token type LOCATION is converted to STRING. This is an incorrect summarisation, but the small label set exists to get a rough idea about the score between different languages, instead of getting an exact typed token. See Table 6.7 for a f1 score per token type.

The f1-score for of this statistical lexer reaches 0.99, see Figure 6.1. If this lexer is wrong about (I,STRING), it’s mostly lexed as (I,COMMENT) and the other way around. The f1-score for (I,COMMENT) is 0.83, which is not very high if the main goal for the statistical lexer is making the difference between comments and code. Both STRING and COMMENT can be whole sentences and the only difference between is their surrounding characters. This is the same confusion as discussed in Section 5.5. In theory higher n-gram features could solve this confusion, but higher n-gram features come with the price of computational cost. Further investigation is needed.

(S,ANNOTATION) 1.0000 0.7692 0.8696 13

(B,BASICTYPE) 0.9904 0.9829 0.9867 527

(E,BASICTYPE) 0.9904 0.9829 0.9867 527

(32)

(B,BOOLEAN) 1.0000 0.9504 0.9745 141 (E,BOOLEAN) 1.0000 0.9504 0.9745 141 (I,BOOLEAN) 1.0000 0.9340 0.9659 318 (B,COMMENT) 1.0000 0.9680 0.9837 469 (E,COMMENT) 1.0000 0.9168 0.9566 469 (I,COMMENT) 0.8426 0.8205 0.8314 5,683 (B,DECIMALFLOATINGPOINT) 1.0000 0.8872 0.9402 399 (E,DECIMALFLOATINGPOINT) 1.0000 0.8872 0.9402 399 (I,DECIMALFLOATINGPOINT) 1.0000 0.7609 0.8642 460 (B,DECIMALINTEGER) 1.0000 0.0962 0.1754 52 (E,DECIMALINTEGER) 1.0000 0.0962 0.1754 52 (I,DECIMALINTEGER) 1.0000 0.4211 0.5926 19 (S,DECIMALINTEGER) 0.9469 0.7759 0.8529 598 (B,IDENTIFIER) 0.9593 0.9829 0.9710 9,234 (E,IDENTIFIER) 0.9628 0.9865 0.9745 9,234 (I,IDENTIFIER) 0.9750 0.9977 0.9862 86,001 (S,IDENTIFIER) 0.9044 0.8309 0.8661 615 (B,KEYWORD) 0.9957 0.9727 0.9840 2,377 (E,KEYWORD) 0.9948 0.9727 0.9836 2,377 (I,KEYWORD) 0.9981 0.9469 0.9718 5,535 (S,LAYOUT) 0.9934 0.9934 0.9934 17,745 (B,LOCATION) 1.0000 0.9573 0.9782 117 (E,LOCATION) 1.0000 0.9402 0.9692 117 (I,LOCATION) 0.9998 0.9777 0.9886 6,135 (B,MODIFIER) 1.0000 1.0000 1.0000 222 (E,MODIFIER) 1.0000 1.0000 1.0000 222 (I,MODIFIER) 1.0000 1.0000 1.0000 979 (S,NEWLINE) 0.9975 0.9996 0.9986 5,535 (B,OPERATOR) 0.9928 0.9436 0.9676 727 (E,OPERATOR) 0.9928 0.9436 0.9676 727 (S,OPERATOR) 0.9794 0.9659 0.9726 1,819 (B,SEPARATOR) 1.0000 1.0000 1.0000 865 (E,SEPARATOR) 1.0000 1.0000 1.0000 865 (S,SEPARATOR) 0.9906 0.9847 0.9876 13,834 (B,STRING) 0.9732 0.6473 0.7775 224 (E,STRING) 0.9735 0.6562 0.7840 224 (I,STRING) 0.9547 0.8402 .8938 7,333 avg / total 0.9752 0.9753 0.9746 184,115

Table 6.7.: Results of CRF one-step lexing Rascal to LLex 4 rascal, 50 files, 5-fold cross

(33)

Figure 6.1.:

One-step lexing Language: Rascal

Segmentation set: LSeg4

Label set: LLab rascal

Algorithm: lbfgs C1: 0.0

C2: 1.0

(34)

type Java Rascal

Comment //Comment //Comment

Javadoc/Comment /*Comment*/ /*Comment*/

method declaration public void foo() {} public void foo() {}

variable declaration int var = 0; int var = 0;

Import import java.util.Math; import util::Math;

class/module declaration class A {} module A

for loop for(int n <- [1..5]) println(”<n>”); for(int n = 1; n < 5; n++) { sys.out.println(n); }

Location - loc file = |project://foo/src/1.java|

Table 6.6.: Some equalities and differences between Java and Rascal

Table 6.5 shows that the statistical lexer for Rascal shows substantial percentage differences on determining LOC. There must be noted that the number of files used to test it on for Rascal is small, only 27. The files have to be manually tokenised to be used for this test. The result is in line with the results for the other languages, namely the statistical lexer is not good in determining the lines of comment on a code base.

6.6. Proof of Concept 2: HTML/CSS/Javascript Lexer

A statistical multilingual lexer is able to segment and label a source file containing multiple languages or containing one language that has an embedded language. In this proof of concept a statistical lexer for HTML-CSS-Javascript is created, to lex embedded CSS and Javascript in HTML.

To create learning data the same corpora are used as mentioned in chapter 5.1. The labels for CSS and Javascript in the lexing sets get a language prefix. For example, a CSS comment becomes CSS COMMENT.

The html files are initially normally lexed with the model html-tokenizer [7]. After that it another pass is executed to lex the embedded CSS code. CSS can be found between script tags: <style>...</style> or inline <div style=”...”></div>. Javascript is found between script tags or inline behind a list of keywords, like onload, onclick, onchange and other.

The html files are divided on what embedded language they contain: none, CSS, Javascript or both. Then the lexer is created by learning from plain CSS and Javascript files and from HTML files with embedded from every category, See Table 6.8 how many files are used from every category.

The results for this lexer are shown in Table 6.9. What strikes is that the f1-score for Javascript is structurally low. When only focussing on how well comments are lexed

cor-rectly, then it shows that the results are poor. For the tokentype (S,HTML ATTRIBUTE NAME) both the precision and the recall are 0.000. On the selected corpus the single characters

attribute names are ’d’, ’x’ and ’y’. These values are confused with (S,JS NAME). The single character attribute names are rare, so the chance that a model has the opportunity

(35)

Language Number of learning files Number of test files CSS 20 60 Javascript 20 60 HTML 20 60 HTML+CSS 20 60 HMTL+Javascript 20 60 HMTL+CSS+Javascript 20 60

Table 6.8.: Number of files used for creating and testing statistical multilingual HTML/CSS/Javascript lexer

in one of the folds to learn from rare tokens is low as well. Instead of randomly choose files to learn from, one could carefully choose the files to learn from. If this leads to better f1-scores needs to be investigated further.

Label Preci-sion Recall F1 Support (B,CSS COMMENT) 0.9397 0.6653 0.7753 2175 (E,CSS COMMENT) 0.9404 0.7122 0.8059 2175 (I,CSS COMMENT) 0.7162 0.2766 0.3955 220616 (B,CSS NAME) 0.9902 0.9595 0.9743 449608 (E,CSS NAME) 0.9919 0.9610 0.9759 449608 (I,CSS NAME) 0.9927 0.9720 0.9821 2904296 (S,CSS NAME) 0.9935 0.9338 0.9616 7386 (S,CSS NEWLINE) 0.9971 0.9799 0.9884 205665 (B,CSS NUMBER) 0.9736 0.9715 0.9726 61013 (E,CSS NUMBER) 0.9715 0.9694 0.9704 61013 (I,CSS NUMBER) 0.9886 0.9865 0.9876 98998 (S,CSS NUMBER) 0.8978 0.9628 0.9225 63820 (B,CSS PUNCTUATOR) 0.9863 0.4938 0.6399 1867 (E,CSS PUNCTUATOR) 0.9863 0.4944 0.6405 1867 (S,CSS PUNCTUATOR) 0.9756 0.9657 0.9706 551969 (B,CSS STRING) 0.9994 0.9104 0.9520 13417 (E,CSS STRING) 0.9979 0.9116 0.9520 13417 (I,CSS STRING) 0.9913 0.7278 0.8329 99414 (B,CSS UNQUOTEDURL) 0.4602 0.1150 0.1676 113 (E,CSS UNQUOTEDURL) 0.4602 0.1062 0.1613 113 (I,CSS UNQUOTEDURL) 0.6425 0.8451 0.7078 99481 (S,CSS WHITESPACE) 0.9754 0.9736 0.9742 560032 (S,HTML ATTRIBUTE ASSIGNMENT) 0.9945 0.9936 0.9941 278904 (B,HTML ATTRIBUTE NAME) 0.9948 0.9935 0.9942 279258

(36)

(I,HTML ATTRIBUTE NAME) 0.9942 0.9852 0.9897 752391

(S,HTML ATTRIBUTE NAME) 0.0000 0.0000 0.0000 20

(B,HTML ATTRIBUTE VALUE) 0.9930 0.9937 0.9934 276228

(E,HTML ATTRIBUTE VALUE) 0.9929 0.9934 0.9932 276228

(I,HTML ATTRIBUTE VALUE) 0.9197 0.7980 0.8501 6101085

(S,HTML ATTRIBUTE VALUE) 0.9588 0.7900 0.8648 5339

(B,HTML CLOSING TAG) 0.9977 0.9990 0.9984 274653

(E,HTML CLOSING TAG) 0.9977 0.9990 0.9983 274653

(I,HTML CLOSING TAG) 0.9947 0.9970 0.9958 943834

(B,HTML COMMENT END TAG) 0.9915 0.9888 0.9901 5443

(E,HTML COMMENT END TAG) 0.9987 0.9985 0.9986 5443

(I,HTML COMMENT END TAG) 0.9987 0.9985 0.9986 5443

(B,HTML COMMENT OPENING TAG) 0.9940 0.9993 0.9966 5443

(E,HTML COMMENT OPENING TAG) 0.9906 0.9894 0.9900 5443

(I,HTML COMMENT OPENING TAG) 0.9901 0.9993 0.9946 10886

(B,HTML COMMENT TEXT) 0.9886 0.9679 0.9782 4458

(E,HTML COMMENT TEXT) 0.9895 0.9601 0.9744 4458

(I,HTML COMMENT TEXT) 0.6643 0.7338 0.6939 259905

(S,HTML COMMENT TEXT) 1.0000 1.0000 1.0000 985 (B,HTML DATA) 0.9824 0.9976 0.9898 170604 (E,HTML DATA) 0.9893 0.9980 0.9937 170604 (I,HTML DATA) 0.4993 0.9720 0.6332 4601197 (S,HTML DATA) 0.9985 0.9972 0.9979 16696 (B,HTML LAYOUT) 1.0000 0.9193 0.9563 3471 (E,HTML LAYOUT) 1.0000 0.9193 0.9563 3471 (I,HTML LAYOUT) 1.0000 0.9193 0.9563 13884 (S,HTML LAYOUT) 0.9984 0.9950 0.9966 1473124 (S,HTML NEWLINE) 0.9982 0.9981 0.9982 422067

(B,HTML OPENING TAG BEGIN) 0.9973 0.9981 0.9977 317243

(E,HTML OPENING TAG BEGIN) 0.9970 0.9976 0.9973 317243

(I,HTML OPENING TAG BEGIN) 0.9952 0.9890 0.9921 448745

(B,HTML OPENING TAG END) 0.9759 0.9793 0.9773 3572

(E,HTML OPENING TAG END) 0.9759 0.9793 0.9773 3572

(S,HTML OPENING TAG END) 0.9969 0.9979 0.9973 313661

(S,HTML OPENING TAG LAYOUT) 0.9949 0.9838 0.9893 309251

(B,JS COMMENT) 0.8413 0.3679 0.4816 3797 (E,JS COMMENT) 0.7744 0.3152 0.4255 3797 (I,JS COMMENT) 0.5926 0.3092 0.3453 241205 (B,JS NAME) 0.9156 0.2481 0.3432 320128 (E,JS NAME) 0.9152 0.2484 0.3434 320128 (I,JS NAME) 0.9198 0.2656 0.3612 1399245

(37)

(S,JS NAME) 0.8579 0.0689 0.1170 71742 (S,JS NEWLINE) 0.9409 0.2818 0.4225 116035 (B,JS NUMBER) 0.9641 0.0095 0.0158 150838 (E,JS NUMBER) 0.9545 0.0095 0.0157 150838 (I,JS NUMBER) 0.9622 0.0046 0.0067 434518 (S,JS NUMBER) 0.8635 0.0786 0.1290 59179 (B,JS PUNCTUATOR) 0.8720 0.1751 0.2843 24449 (E,JS PUNCTUATOR) 0.8728 0.1753 0.2845 24449 (I,JS PUNCTUATOR) 0.8691 0.1851 0.2972 2810 (S,JS PUNCTUATOR) 0.8668 0.1839 0.2803 689650 (B,JS REGEX) 0.5487 0.1004 0.1405 269 (E,JS REGEX) 0.7867 0.0892 0.1293 269 (I,JS REGEX) 0.5314 0.0346 0.0614 5347 (B,JS STRING) 0.8745 0.4030 0.5385 13788 (E,JS STRING) 0.8870 0.4033 0.5418 13788 (I,JS STRING) 0.6638 0.0341 0.0560 1522669 (S,JS WHITESPACE) 0.9289 0.2229 0.3528 1705108 avg 0.8697 0.7518 0.7325 31780272

Table 6.9.: Results of statistical multilingual lexing HTML, CSS and Javascript, 4-fold cross validation

6.6.1. Language recognition

Table 6.10 shows how well the characters are lexed to the correct language. This overview shows again that Javascript is not lexed to the correct language, but maybe well enough to determine if the html file contains CSS or Javascript as embedded language. A file can be categorised to one of the elements of the set: {HTML, CSS, Javascript, CSS+Javascript, HTML+CSS, HTML+Javascript, HTML+CSS+Javascript }. If a file contains one or more characters that are lexed to a language, then the file is categorised to an element with that language. For the category CSS+Javascript are no test files. Table 6.11 shows how well the language recognition works in this way. Table 6.12 shows the confusion matrix for this language recognition. That tells that plain CSS code is difficult to recognise. In almost half of the cases it is interpreted as something with Javascript or HTML. That plain Javascript is not recognised very well, can be caused by embedded HTML. The learning data is not aware of it.

The language recognition could be tuned in assigning the language category, by using a threshold. The current implementation already removes a category candidate, when only one character for a particular language is found.

(38)

CSS 0.9847 0.9447 0.9643 5868063

HTML 0.7868 0.9886 0.8662 18638163

Javascript 0.9179 0.1832 0.2877 7274046 avg/total 0.8534 0.7962 0.7519 31780272

Table 6.10.: Results of statistical multilingual lexing HTML, CSS and Javascript, to what language a character is lexed, 4-fold cross validation.

CSS 1.0000 0.5209 0.6782 240 HTML 0.7977 0.9250 0.8530 240 HTML+CSS 0.7613 0.8583 0.7967 240 HTML+CSS+Javascript 0.6108 0.8750 0.7155 240 HTML+Javascript 0.6505 0.8375 0.7292 240 Javascript 1.0000 0.3042 0.4576 240 avg/total 0.8034 0.7202 0.7050 1440

Table 6.11.: Results of language recognition with statistical multilingual lexing HTML, CSS and Javascript, 4-fold cross validation

Predicted CSS CSS HTML HTML HTML HTML JS Total +JS +CSS +CSS+JS +JS Actual CSS 125 17 2 40 56 0 0 240 CSS+JS 0 0 0 0 0 0 0 0 HTML 0 0 222 5 3 10 0 240 HTML+CSS 0 0 3 206 30 1 0 240 HTML+CSS+JS 0 0 1 24 210 5 0 240 HTML+JS 0 0 31 0 8 201 0 240 JS 0 6 23 2 42 94 73 240 Total 125 23 282 277 349 311 73 1440

Table 6.12.: Confusion matrix of language recognition with statistical multilingual lexing HTML, CSS and Javascript, 4-fold cross validation

(39)

7. Threats to Validity

The feature selection unigram, unigram+, bigram and trigram are just a start. The addition of the features first characters and numerical majority improve the results, but the addition of these is a bit random. To check if a quote is preceded with a space and an equal sign could have been more effective to decide if the characters are part of string literal or comment.

There is no effort put in optimising the data that is provided to the Conditional Random Fields program. Initially the research focussed on the feasibility of creating a lexer by manually tokenising some files, with 50 files as a maximum. Not converting the learning data into sparse arrays and see if higher number of files yields much better f1-scores is a missed chance.

Given the non optimised learning data, the lack of availability of high memory pro-cessors limited the research to create models with more than 50 files.

The file selection happens automated, by selecting repositories from gitHub that have more than 1000 stars and filtering the files with the right extension. From all the files a random selection is made to create learning data from. This method gives no certainty about the diversity of the cases the model learns from. If accidentally there are hardly files added with comments, then the model fails to interpret comments in the test files. Removing files that are bigger than 1MB, for the experiment Minimum number of files finding, can have lead to distorted results.

For creating the multilingual lexer, the javascript files could contain embedded HTML or CSS. This could have polluted the results for both multilingual lexing and the language detection tool.

(40)

8. Analysis and Conclusions

The conclusion of Zhao et al. that the four label set (L Seg4) gives better results for the segmentation of Chinese sentences than the two label set (L Seg2) [23] is applicable on source code too.

To create a statistical lexer with Condition Random Fields, one should use LSeg4

as segmentation label set and the one-step lexing method. To get the best score the trigram feature set should be used, including the first characters and numerical majority templates. The more files are used for learning the more reliable the lexer becomes. For Java, Python, HTML, CSS and Javascript and Rascal an overall f1-score between 0.89 and 0.99 can be reached, when learning from 50 files.

Though the overall f1-scores for lexing are high, the statistical lexer performs bad on determining the lines of comment, because there is still too much confusion between STRING, NAME and COMMENT. Further research could lead to a feature set that solves this confusion.

Creating a statistical lexer from up till 50 manually tokenised and labeled files is a few hours work. Manually tokenising and labeling is easier than creating a deterministic lexer, but it depends on the skills of a developer, which one is created faster.

The statistical lexer is not able to give high f1-score when it is used for multilingual lexing HTML, CSS and Javascript. The multilingual lexer is not reliable to detect if a file contains HTML, CSS, Javascript or a combination of those languages. Further research is needed to discard the multilingual statistical lexer as language recognition tool.

(41)

Bibliography

[1] Abram Hindle and Earl T. Barr and Zhendong Su and Mark Gabel and Premkumar T. Devanbu. On the naturalness of software. In Martin Glinz, Gail C. Murphy, and Mauro Pezz, editors, Proceedings of the 34th International Conference on Software Engineering, pages 837–847. IEEE, 2012.

[2] Aho, Alfred V. and Lam, Monica S. and Sethi, Ravi and Ullman, Jeffrey D. Com-pilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2006.

[3] Benjamin Peterson. Lib/tokenize.py. https://docs.python.org/3.5/library/ tokenize.html, 2016.

[4] Chris Thunes. Javalang. https://github.com/c2nes/javalang, 2012.

[5] Collins, Michael. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 1–8, Stroudsburg, PA, USA, 2002. Association for Computa-tional Linguistics.

[6] Crammer, Koby and Dekel, Ofer and Keshet, Joseph and Shalev-Shwartz, Shai and Singer, Yoram. Online Passive-Aggressive Algorithms. J. Mach. Learn. Res., 7:551–585, December 2006.

[7] greim. html-tokenizer. https://github.com/greim/html-tokenizer, 2015. [8] Hai Zhao and Chunyu Kit. Integrating unsupervised and supervised word

segmen-tation: The role of goodness measures. Inf. Sci., 181(1):163–183, 2011.

[9] Heitlager, Ilja and Kuipers, Tobias and Visser, Joost. A Practical Model for Measur-ing Maintainability. In ProceedMeasur-ings of the 6th International Conference on Quality of Information and Communications Technology, QUATIC ’07, pages 30–39, Wash-ington, DC, USA, 2007. IEEE Computer Society.

[10] Lafferty, John D. and McCallum, Andrew and Pereira, Fernando C. N. Condi-tional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learn-ing, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

(42)

[11] Mejer, Avihai and Crammer, Koby. Confidence in Structured-prediction Using Confidence-weighted Models. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 971–981, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[12] Miltiadis Allamanis and Daniel Tarlow and Andrew D. Gordon and Yi Wei. Bi-modal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Proceedings, pages 2123–2132. JMLR.org, 2015.

[13] N. Okazaki. Crfsuite: a fast implementation of conditional random fields (crfs). http://www.chokkan.org/software/crfsuite/, 2007.

[14] Nocedal, Jorge. Updating quasi-newton matrices with limited storage. Mathematics of computation, 35(151):773–782, 1980.

[15] Shalev-Shwartz, Shai and Singer, Yoram and Srebro, Nathan. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 807–814, New York, NY, USA, 2007. ACM.

[16] Simon Lydell. css-tokens. https://github.com/lydell/css-tokens, 2014-2015. [17] Simon Lydell. js-tokens. https://github.com/lydell/js-tokens, 2014-2015. [18] tpeng. Pycrfsuite. http://github.com/tpeng/python-crfsuite/tree/master/

pycrfsuite/, 2014. [Online; Latest commit 4e77222 on Apr 10, 2016].

[19] Tu, Zhaopeng and Su, Zhendong and Devanbu, Premkumar. On the Localness of Software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pages 269–280, New York, NY, USA, 2014. ACM.

[20] van Dam, Juriaan Kennedy and Zaytsev, Vadim. Software Language Identification with Natural Language Classifiers. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), volume 1, pages 624–628. IEEE, 2016.

[21] Verhoofstad, E. and Heijligers, A.M.A. nlamah/uva-software-evolution. https: //github.com/nlamah/uva-software-evolution, 2016.

[22] Wang, Kuansan and Thrasher, Christopher and Hsu, Bo-June Paul. Web Scale NLP: A Case Study on Url Word Breaking. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pages 357–366, New York, NY, USA, 2011. ACM.

[23] Hai Zhao, Chang-Ning Huang, and Mu Li. An Improved Chinese Word Segmenta-tion System with CondiSegmenta-tional Random Field. In Proceedings of the Fifth SIGHAN

(43)

Workshop on Chinese Language Processing, pages 162–165, Sydney, Australia, July 2006. Association for Computational Linguistics.

(44)

A. Selected repositories

1. Python https://github.com/vinta/awesome-python.git https://github.com/scrapy/scrapy.git https://github.com/pydata/pandas.git https://github.com/divio/django-cms.git https://github.com/numenta/nupic.git https://github.com/zulip/zulip.git https://github.com/nylas/sync-engine.git https://github.com/cyrus-and/gdb-dashboard.git https://github.com/aziz/PlainTasks.git https://github.com/donnemartin/dev-setup.git https://github.com/honza/vim-snippets.git https://github.com/lektor/lektor.git https://github.com/flask-admin/flask-admin.git https://github.com/DanMcInerney/wifijammer.git https://github.com/cython/cython.git https://github.com/mininet/mininet.git https://github.com/facelessuser/BracketHighlighter.git https://github.com/i-tu/Hasklig.git https://github.com/imwilsonxu/fbone.git https://github.com/felixonmars/dnsmasq-china-list.git https://github.com/graphite-project/carbon.git https://github.com/srsudar/eg.git

https://github.com/mhartl/rails tutorial sublime text.git https://github.com/michael-lazar/rtv.git

https://github.com/asciimoo/searx.git https://github.com/divio/django-filer.git https://github.com/srusskih/SublimeJEDI.git

(45)

https://github.com/timothycrosley/isort.git https://github.com/jrnewell/spotify-ripper.git https://github.com/jjlee/mechanize.git 2. Java https://github.com/winterbe/java8-tutorial.git https://github.com/clojure/clojure.git https://github.com/jgilfelt/SystemBarTint.git https://github.com/rzwitserloot/lombok.git https://github.com/SkillCollege/SimplifyReader.git https://github.com/java-native-access/jna.git https://github.com/square/javapoet.git https://github.com/JulienGenoud/android-percent-support-lib-sample.git https://github.com/gabrielemariotti/RecyclerViewItemAnimators.git https://github.com/apache/zookeeper.git https://github.com/Ramotion/folding-cell-android.git https://github.com/pedrovgs/Algorithms.git https://github.com/MinecraftForge/MinecraftForge.git https://github.com/johannilsson/android-actionbar.git https://github.com/wangdan/AisenWeiBo.git https://github.com/iPaulPro/aFileChooser.git https://github.com/alibaba/otter.git https://github.com/wordpress-mobile/WordPress-Android.git https://github.com/geftimov/android-pathview.git https://github.com/purplecabbage/phonegap-plugins.git https://github.com/mixi-inc/AndroidTraining.git https://github.com/jenkinsci/blueocean-plugin.git https://github.com/toddway/MaterialTransitions.git https://github.com/jgilfelt/android-mapviewballoons.git https://github.com/antoniolg/MaterializeYourApp.git https://github.com/txusballesteros/welcome-coordinator.git https://github.com/ManuelPeinado/MultiChoiceAdapter.git https://github.com/alibaba/mdrill.git https://github.com/couchbase/couchbase-lite-android.git https://github.com/Dreampie/Resty.git

(46)

3. HTML https://github.com/uikit/uikit.git https://github.com/cheeaun/life.git https://github.com/usmanhalalit/charisma.git https://github.com/andris9/mailtrain.git https://github.com/alexazhou/VeryNginx.git https://github.com/romannurik/LayerVisualizer.git https://github.com/dxa4481/Pastejacking.git https://github.com/zTree/zTree v3.git

https://github.com/sinaweibosdk/weibo android sdk.git https://github.com/IonicaBizau/gridly.git https://github.com/codrops/SeatPreview.git https://github.com/codrops/BookBlock.git https://github.com/me115/linuxtools rst.git https://github.com/konmik/konmik.github.io.git https://github.com/kenjis/php-framework-benchmark.git https://github.com/EzoeRyou/cpp-book.git https://github.com/shengxinjing/my blog.git https://github.com/dmytrodanylyk/dmytrodanylyk.git https://github.com/x3dom/x3dom.git https://github.com/TriumphLLC/Blend4Web.git https://github.com/Blizzard/d3-api-docs.git https://github.com/Aufree/ting.git https://github.com/dwyl/learn-hapi.git https://github.com/jonschlinkert/gulp-htmlmin.git https://github.com/madhur/PortableJekyll.git https://github.com/PolymerElements/seed-element.git https://github.com/MeCKodo/forchange.git https://github.com/MoOx/pjax.git https://github.com/johnkil/Android-Icon-Fonts.git https://github.com/maciej-gurban/responsive-bootstrap-toolkit.git

(47)

4. CSS https://github.com/twbs/bootstrap.git https://github.com/numbbbbb/the-swift-programming-language-in-chinese.git https://github.com/amazeui/amazeui.git https://github.com/json-api/json-api.git https://github.com/18F/web-design-standards.git https://github.com/mojotech/jeet.git https://github.com/barryclark/jekyll-now.git https://github.com/tapquo/Lungo.js.git https://github.com/BonsaiDen/JavaScript-Garden.git https://github.com/rstacruz/flatdoc.git https://github.com/wavded/humane-js.git https://github.com/maxogden/screencat.git https://github.com/justspamjustin/junior.git https://github.com/hakimel/Avgrund.git https://github.com/fians/marka.git https://github.com/vendocrat/PaymentFont.git https://github.com/nathansmith/formalize.git https://github.com/designmodo/startup-demo.git https://github.com/yanhaijing/zepto.fullpage.git https://github.com/flatlogic/awesome-bootstrap-checkbox.git https://github.com/dowjones/intentionjs.git https://github.com/cutestrap/cutestrap.git https://github.com/jasonlong/isometric-contributions.git https://github.com/maxogden/monu.git https://github.com/guari/eclipse-ui-theme.git https://github.com/pikock/bootstrap-magic.git https://github.com/geddski/csstyle.git https://github.com/codrops/NotificationStyles.git https://github.com/wintercn/dog-fucked-zhihu.git https://github.com/resin-io/etcher.git https://github.com/usds/playbook.git https://github.com/we-are-next/cssco.git

(48)

https://github.com/danielfarrell/bootstrap-combobox.git https://github.com/jlong/css-spinners.git https://github.com/necolas/css3-facebook-buttons.git https://github.com/konpa/devicon.git https://github.com/bagder/http2-explained.git https://github.com/typeplate/starter-kit.git https://github.com/jescalan/rupture.git https://github.com/reimertz/brand-colors.git https://github.com/malarkey/Rock-Hammer.git https://github.com/HubSpot/tooltip.git https://github.com/koudelka/visualixir.git https://github.com/DevTips/DevTips-Starter-Kit.git https://github.com/thomaspark/pubcss.git https://github.com/RichardLitt/awesome-conferences.git https://github.com/cbrandolino/camvas.git https://github.com/dnomak/rocket.git https://github.com/codrops/Blueprint-VerticalTimeline.git https://github.com/zellwk/typi.git https://github.com/ctfs/write-ups-2016.git https://github.com/Igosuki/compass-mixins.git https://github.com/robertpiira/ingrid.git https://github.com/zapier/resthooks.git https://github.com/crushlovely/skyline.git https://github.com/pinggod/hexo-theme-apollo.git https://github.com/jbranchaud/splitting-atoms.git https://github.com/tommy351/Octopress-Theme-Slash.git https://github.com/rupl/unfold.git https://github.com/silktide/cookieconsent2.git https://github.com/10up/Engineering-Best-Practices.git https://github.com/JoelSutherland/LESS-Prefixer.git https://github.com/codrops/Blueprint-SlidePushMenus.git https://github.com/KOWLOR/DaftPunKonsole.git https://github.com/transcranial/atom-transparency.git

(49)

https://github.com/c0bra/markdown-resume-js.git https://github.com/ecomfe/saber.git https://github.com/evernote/sass-build-structure.git https://github.com/zeljkoprsa/waterlee-boilerplate.git https://github.com/less/less-docs.git https://github.com/arnaudleray/pocketgrid.git https://github.com/danielstern/ngAudio.git https://github.com/mdo/table-grid.git https://github.com/Code52/metro.css.git https://github.com/space150/flyLabel.js.git https://github.com/chris-pearce/scally.git https://github.com/lzjun567/XiYuanFangApp.git https://github.com/walkor/web-msg-sender.git https://github.com/lolmaus/breakpoint-slicer.git https://github.com/kerphi/phpfreechat.git https://github.com/idibidiart/AllSeeingEye.git https://github.com/ezekg/flint.git https://github.com/popcorn-time/popcorn-time.github.io.git https://github.com/amail/Verimail.js.git https://github.com/codrops/AnimatedBorderMenus.git https://github.com/minixalpha/StrayBirds.git https://github.com/domenic/html-as-custom-elements.git https://github.com/18F/college-choice.git https://github.com/yearofmoo/ngAnimate-animate.css.git https://github.com/andjosh/naked-wordpress.git https://github.com/Layerful/sassy-flags.git https://github.com/sindresorhus/bower-components.git https://github.com/matt-harris/outline.git https://github.com/kenwheeler/guff.git https://github.com/luigiplr/netify-jump.git https://github.com/sahat/instagram-hackhands.git https://github.com/harvesthq/tick.git https://github.com/catc/iGrowl.git

Statistical Lexical Analysis