
Bachelor Informatica

Crawling online repositories

for OpenMP/OpenACC code

Derk van Gulick

June 8, 2018

Supervisor(s): Ana Lucia Varbanescu

Informatica
Universiteit van Amsterdam


Abstract

This document describes the design and implementation of a crawler for OpenMP and OpenACC code inside online repositories. We implement our crawler using the Github API, which allows us to search inside repositories for specific patterns. The system allows the user to search for language-specific directives and builds a database to show in which repositories they have been found. We show that we can successfully find OpenMP/OpenACC code on Github and report the efficiency of the system.


Contents

1 Introduction
1.1 Related work
1.2 Research question
1.3 Thesis structure and contributions

2 System design
2.1 Data sources
2.2 The Front-end
2.2.1 Pragmas
2.2.2 Project structure
2.3 The Crawler
2.3.1 Searching
2.3.2 Caching
2.4 Filtering

3 Implementation
3.1 Pragmas
3.2 Database
3.3 Crawler
3.4 Filter
3.5 Configuration file

4 Evaluation
4.1 Experimental setup
4.1.1 Metrics
4.1.2 System configuration
4.2 Case study 1: OpenMP query
4.2.1 Matches per repository
4.3 Case study 2: OpenACC query
4.3.1 Matches per repository
4.4 Other observations

5 Conclusion and future work
5.1 Conclusion
5.2 Future work
5.2.1 Alternative crawling strategy
5.2.2 Verifying completeness
5.2.3 Additional metrics

Appendices


CHAPTER 1

Introduction

High performance computing is done on many types of hardware, such as CPUs, GPUs and FPGAs. Developing applications for these platforms can be challenging, since many of them require specific toolchains and programming models. In recent years, new programming models such as OpenCL, OpenMP (4.5+) and OpenACC have been introduced to allow developers to target multiple hardware platforms using a single software implementation. This introduced the concept of "performance portability": a single implementation that will run on any hardware configuration with the same performance everywhere. Research on performance portability is still limited and the exact definition differs between communities. Pennycook et al. have recently proposed a measurable definition of performance portability and its implications [6].

In order to analyze the performance portability of the aforementioned programming models, a large amount of source code is required. Acquiring large amounts of code with specific directives and clauses by hand is a tedious task that we would like to automate. This project looks at the design, implementation and validation of a framework that automatically crawls code repositories containing OpenACC and OpenMP code, based on a user-defined search query. The query could be something as simple as the file size or the file extension, all the way up to specific patterns within the code or project structure.

1.1 Related work

A lot of research focuses on collecting code from online repositories. Most of these papers simply state "we have crawled x GB of code from Github" and then dive into different research altogether, without mentioning the process or method used for the collection. In many cases, the search is based only on analyzing metadata obtained from repository APIs, pulling only what is required, as in the work done by R. Hebig et al. [3]. Since we require knowledge of the content of a repository, a different approach is used, as discussed in section 2.1.

Allamanis et al. [1] mined 350 million lines of Java code to train a language model, which is then used to determine the quality of other software projects. They show that probabilistic language models are an effective way to analyze code. Wong et al. [7] used a similar approach, applying natural language models to find clones of pieces of code in order to provide automatic code commenting. However, our criterion for whether or not a repository is relevant is much simpler and does not require a language model. We therefore focus on effectively finding substrings representing OpenMP/OpenACC pragmas in the source code.

1.2 Research question

Manual search and retrieval of code from online repositories remains a difficult task. The task gets increasingly complex if users have specific requirements for the search, e.g., different programming models and/or different code patterns. In this work, we aim to address this challenge by providing an automated framework for searching and retrieving code from online repositories. In this context, the research question is:

Can we design and validate a framework for efficient crawling of online code repositories, based on a user-specified query?

To answer this question, we provide a design and implementation of a first prototype of a crawler for OpenMP/OpenACC code. We then assess its performance by running it for a period of time while we gather statistics.

1.3 Thesis structure and contributions

Chapter 2 presents our system’s design. Specifically, we introduce the main challenges and propose a system design/architecture to address them. Chapter 3 describes the implementation and the decisions that have been made to make it work. In chapter 4 we analyze the results and validate our design decisions. Chapter 5 discusses our main findings and future work directions.


CHAPTER 2

System design

This chapter describes the design of the OpenMP and OpenACC code crawling framework. Figure 2.1 shows an overview of the framework, which consists of three distinct parts: a front-end to accept user queries, a crawler to collect repositories, and a filter to keep only the repositories matching the query.

Figure 2.1: An overview of the framework

2.1 Data sources

While there are many online repositories that offer public APIs, this work specifically focuses on the Github API (v3). Although OpenMP and OpenACC are mature programming models, they are not considered programming languages by online code repositories. This means that a simple "search-by-language" query through the API will not be sufficient when looking for OpenMP and OpenACC code; instead, (almost) every C/C++ project has to be analyzed to determine whether it is useful for our goals. The Github API largely solves this problem with one important feature called "Code Search". The Code Search API is used by providing it with a string to search for inside the contents of repositories. This feature allows the generation of a list of results that is guaranteed to contain a string matching an OpenMP/OpenACC pragma.

The Code Search API itself is quite limited in the complexity of the patterns it can look for: there is no support for regex-like pattern matching. The results will therefore still require additional filtering, as there is no guarantee that a matched result actually contains parallelized code. For example, an OpenMP pragma could be commented out and do nothing at all, but this would not be known unless the repository is analyzed further. Details on this additional filtering step are described in section 2.4.

Furthermore, there are three other Code Search API drawbacks that could affect the quality of the search results:

• Code Search will only look in files smaller than 384KB.

• Repositories with 500,000 files or more are not searched.


• Code in forks is only searchable if the fork has more stars (Github's system to mark repositories as favorites) than the parent repository.

Lastly, the Github Code Search API is limited to 1000 results per query. We do not consider this a drawback, because there is a straightforward workaround: using date ranges in the search. Github's search API allows a query to be extended with a date range in which the repository was created. We use this to slice the query into multiple queries over different time segments. However, we need to be cautious with this approach: if the range for each segment is too wide, more than 1000 repositories may still be found in one segment, and only the first 1000 will be crawled; if the range is too small, the overhead of launching many (fine-grained) queries and patching their results together may become significant.
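To make the slicing concrete, here is a minimal Python sketch (our illustration, not code from the implementation; the function name and the fixed one-year granularity are assumptions) that extends a base query with Github's "created:" date-range qualifier:

from datetime import date

def yearly_queries(base_query, first_year, last_year):
    # Yield one sub-query per year, so that a single query is less
    # likely to exceed the 1000-result cap of the Code Search API.
    for year in range(first_year, last_year + 1):
        lo = date(year, 1, 1).isoformat()
        hi = date(year, 12, 31).isoformat()
        # restrict the search to repositories created in this year
        yield "%s created:%s..%s" % (base_query, lo, hi)

# Example: yearly_queries('"#pragma acc" parallel', 2010, 2018)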

2.2 The Front-end

The front-end of the framework allows a user to construct queries for more specific search results. A user can search for specific directives and clauses in pragmas, or specify details about a project's structure.

2.2.1 Pragmas

OpenMP and OpenACC use pragmas that are placed in front of blocks of code to instruct the compiler on how the corresponding block of code should be parallelized. A pragma consists of directives and clauses. Directives tell the compiler in what way the code block should be parallelized. Clauses are used to specify how data in memory is shared among parallel regions and to allow synchronization between regions.

Both the directives and the clauses that the user can specify are pre-selected from a finite list specified by the language specifications [5, 4]. The user is only allowed to query for these specific directives and clauses.
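A minimal sketch of how such a vocabulary check could look (the names below are illustrative, and the sets contain only a small subset of the directives and clauses defined in the specifications [5, 4]):

OMP_DIRECTIVES = {"parallel", "for", "sections", "simd", "task", "target"}
OMP_CLAUSES = {"private", "shared", "firstprivate", "reduction", "schedule"}

def validate_query(directives, clauses):
    # Reject any token that does not appear in the language specification.
    unknown = (set(directives) - OMP_DIRECTIVES) | (set(clauses) - OMP_CLAUSES)
    if unknown:
        raise ValueError("not in the language specification: %s" % unknown)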

2.2.2 Project structure

The user also has the possibility to provide additional specifications about the project itself. The ones we have selected for our proof of concept are:

• Project size

• Project has a Makefile (or means to generate makefiles)

• Allowed file extensions

2.3 The Crawler

The crawler connects with the repository host API(s) in order to retrieve repositories that might contain OpenMP or OpenACC code. Crawling works in two stages: searching and filtering.

2.3.1 Searching

The crawler starts out by querying the Code Search API, which returns a list of files that matched the query. From there, the information about the repository associated with each matched file is retrieved. From this information, the size of the master branch is determined. If the repository is small enough, the master branch is downloaded and the repository is added to the cache.
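A minimal sketch of this search stage, using the PyGitHub module that the implementation in chapter 3 also uses (the token placeholder and the query string are illustrative):

from github import Github  # PyGitHub

gh = Github("<token>")
for result in gh.search_code('"#pragma omp" parallel'):
    repo = result.repository             # repository behind the matched file
    print(repo.full_name, repo.size)     # size (in KB) decides the download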


2.3.2 Caching

If a repository is found that contains OpenMP/OpenACC code, it is useful to remember this for future queries, as a "shortcut" check to see if the repository also matches those. We propose a cache-like mechanism to keep track of these results and enable fast searching. Since any C/C++ repository could potentially hold relevant code, we also expect many repositories to contain nothing relevant to OpenMP or OpenACC at all. Keeping track of repositories that contained no relevant code is equally important, as it reduces the number of API calls that need to be made.

The crawler itself has little information about the contents of new repositories. Therefore, it only keeps track of repositories it has visited. The filtering phase will then update the cache to provide additional information about the contents of the repositories, to be used for future crawls. Once a new repository is found, it is downloaded and stored in the local filesystem. The system is now ready to proceed to the filtering phase.
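A sketch of the resulting decision logic (the names are illustrative; the dictionary stands in for the cache table described in chapter 3):

def should_download(cache, repo_name):
    # `cache` maps repository names to the has_pragma flag recorded
    # by the filter on earlier crawls.
    if repo_name not in cache:
        return True   # never visited: download and filter it
    # Visited before: the filter already recorded what it contained,
    # so downloading it again would only repeat work.
    return False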

2.4 Filtering

The final step in the crawling process is to filter all the collected repositories based on their contents. In this step, the code within the repository is ultimately analyzed to see if it contains OpenMP/OpenACC code that matches the user’s query. As soon as any OpenMP/OpenACC pragma is found, the cache will be updated to specify that this repository contains relevant code. If a pragma is found that matches the search query, the repository is considered a match.

When a match is found in a repository that has not been seen before, a new entry is created in the database with the following information:

• the directives/clauses this project contains;

• the associated code block/filename/line number;

• the location of the repository in the local file system;

• a flag to mark if the repository is already downloaded.

The addition of the final flag allows the system to keep a precise record of what was found in a repository without keeping the repository itself stored on disk. Repositories can then be re-downloaded if needed.
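As a sketch of creating such an entry with Python's built-in sqlite3 module (the helper name is ours; column names follow the matches table in appendix A.2, and the pragma object is assumed to expose the directives() and clauses() accessors from section 3.1):

import time

def record_match(conn, repo_id, pragma, path):
    # Directives and clauses are joined into single strings,
    # as described in appendix A.2.
    conn.execute(
        "INSERT INTO matches (repo_id, directive, clause, in_file, found_on)"
        " VALUES (?, ?, ?, ?, ?)",
        (repo_id, " ".join(pragma.directives()),
         " ".join(pragma.clauses()), path, time.time()))
    conn.commit()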


CHAPTER 3

Implementation

This chapter describes the implementation of our system, based on the design presented in the previous chapter. The implementation has been divided into several Python submodules, each serving its own purpose. Each of the following sections covers one of these submodules.

3.1 Pragmas

The pragma module implements a Pragma object that, upon initialization, separates a given pragma string into the directives and clauses it consists of. The following example code illustrates this operation:

>>> pragma = Pragma('#pragma omp parallel private(id)')
>>> pragma.directives()
['parallel']

The splitting of the input string is done based on the formatting described in the OpenMP and OpenACC language specifications [5, 4].
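A minimal sketch of how this splitting could look (our illustration, not the thesis code; the directive set is a small assumed subset, and the parsing is far less complete than the specification-based splitting used in practice):

import re

class Pragma:
    # Assumed subset of the directive vocabulary, for illustration only.
    KNOWN_DIRECTIVES = {"parallel", "for", "simd", "kernels", "loop"}

    def __init__(self, text):
        # Drop the '#pragma omp' / '#pragma acc' prefix.
        body = re.sub(r"^#\s*pragma\s+(omp|acc)\s+", "", text.strip())
        self._directives = [tok for tok in re.split(r"[\s(]+", body)
                            if tok in self.KNOWN_DIRECTIVES]
        # Clauses appear as 'name(arguments)'.
        self._clauses = re.findall(r"(\w+)\s*\(", body)

    def directives(self):
        return self._directives

    def clauses(self):
        return self._clauses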

3.2 Database

The database module serves two purposes: building the database schema, if not yet present, and providing an API for the other modules, allowing them to interact with the SQLite database more easily. The database schema consists of three tables:

cache The cache table (appendix A.1) holds information about all repositories that have been visited recently, to avoid visiting them again.

matches The matches table (appendix A.2) contains entries for every pragma found that matches the search query. Multiple matches can be associated with one repository. The purpose of the matches table is to provide a clear overview of the crawl results without having the need to look inside saved repositories again. Additionally, the full repositories often require a lot of disk space. By just storing what type of pragmas a certain repository contains as a database entry, it is now possible to delete the repository and only retrieve it if it is actually required.

repository The repository table (appendix A.3) contains information about every repository that matched the search query. Again, this information is used to keep track of repositories without having the need to keep them stored.

The database module also implements the DatabaseConnection class, which implements the functions that perform queries on the SQLite3 database. We made a dedicated class for this because both the crawler and the filter need to access the database; this way, all code concerned with database interaction is kept in one single class.
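A sketch of such a class (abbreviated to the cache table of appendix A.1; passing check_same_thread=False is our addition, needed because the crawler and filter threads share one connection):

import sqlite3

class DatabaseConnection:
    def __init__(self, db_name):
        # Both the crawler and the filter thread go through this
        # single access point to the database.
        self.conn = sqlite3.connect(db_name, check_same_thread=False)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "  id INTEGER PRIMARY KEY,"
            "  repo_name TEXT NOT NULL,"
            "  last_visit REAL NOT NULL,"
            "  has_pragma INTEGER DEFAULT 0)")

    def in_cache(self, repo_name):
        cur = self.conn.execute(
            "SELECT 1 FROM cache WHERE repo_name = ?", (repo_name,))
        return cur.fetchone() is not None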

3.3 Crawler

The crawler module takes care of all the interaction with the API provided by Github. It uses HTTP GET requests to api.github.com through the PyGitHub Python module. This module represents JSON responses from the API as Python objects, thus simplifying interaction with the API.

The crawler starts off by using the Code Search API with a basic query: "#pragma omp" or "#pragma acc", depending on which type of code is required. Surrounding the basic query with quotes is important, because we only want results that exactly match these strings. If a specific compiler directive is required, the query is extended with the name of the directive. Since a pragma can contain multiple directives, the directive name has to be added to the query outside of the quotes. For example, looking for all OpenMP pragmas that contain the "parallel" directive with "#pragma omp parallel" as the query would fail to find pragmas in which "parallel" does not directly follow "omp", such as #pragma omp target parallel for. Keeping the desired directive name outside of the quotes still matches these multi-directive cases.
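The query construction itself is then a one-liner; a sketch (the function name is ours):

def build_query(model, directive=None):
    # The exact phrase goes inside quotes; the directive stays outside,
    # so pragmas with multiple directives still match.
    base = '"#pragma omp"' if model == "omp" else '"#pragma acc"'
    return base + (" " + directive if directive else "")

# build_query("omp", "parallel")  ->  '"#pragma omp" parallel'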

Querying the Code Search API results in a list of repositories that matched the query. Besides this, the Code Search API also sends metadata about each repository. From this metadata we can determine the size of the master archive before downloading it. An upper bound for the repository size can be set in the configuration file (section 3.5) to avoid downloading extremely large repositories. For any given repository, only the master branch will be downloaded and filtered. If the size of the repository is below the user-defined limit, the repository is added to the cache and downloaded as a .zip file to a temporary location on disk. A tuple with the name of the repository and its location on disk is then added to a job queue, waiting to be filtered by the filter module.
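A sketch of this download step (the use of the requests module and all names are assumptions; the job queue is discussed in section 3.4):

import requests

def download_repo(repo, tmp_dir, max_kb, job_queue):
    if repo.size > max_kb:        # size limit from the configuration file
        return
    url = repo.get_archive_link("zipball", ref="master")
    path = "%s/%s.zip" % (tmp_dir, repo.name)
    with open(path, "wb") as f:
        f.write(requests.get(url).content)
    # Hand the repository over to the filter thread.
    job_queue.put((repo.full_name, path))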

In rare circumstances, the Github API would reject requests due to suspected abuse, making the API unavailable for several minutes. This happened even though the crawler respects the "Best practices" guidelines provided by Github [2]. To avoid such events, the crawler waits one second after each request. Even though this modification seems sufficient, an additional procedure is implemented that pauses the crawler until the abuse limit has expired. This ensures that the crawler continues to operate in the rare event that an abuse limit is triggered.
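A sketch of this procedure (the wrapper is our illustration; PyGitHub surfaces rejected requests as GithubException, and the back-off interval is an assumption):

import time
from github import GithubException

def polite(call, *args, **kwargs):
    while True:
        try:
            result = call(*args, **kwargs)
            time.sleep(1)   # wait 1 second after each request
            return result
        except GithubException:
            # Suspected-abuse rejection: pause until the limit expires.
            time.sleep(60)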

3.4 Filter

The filter module runs in a separate thread from the crawler module and waits for new incoming repositories to enter the job queue. Once a new repository has entered the queue, the filter module will extract the .zip file from disk into memory. All files matching the allowed file extensions configured by the user (see Section 3.5) will then be analyzed. For every pragma that is found to match the search query, a new entry is created in the matches table inside the database. After all files have been checked and all matches have been found, the entire repository will either be removed or moved to a more permanent location (again, this is configurable).

Our implementation only uses one filter thread, since the time the crawler takes to download repositories is larger than the time it takes to filter them. The implementation can be easily extended to support more filter threads if a single one proves to be insufficient. The job queue is created by using Python’s queue module. This module makes sure that the job queue can be read by multiple threads safely, without concerns for race conditions.
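A sketch of this queue wiring (process_repository stands in for the filtering logic described above):

import queue
import threading

jobs = queue.Queue()   # filled by the crawler with (name, zip_path) tuples

def filter_worker():
    while True:
        repo_name, zip_path = jobs.get()          # blocks until work arrives
        process_repository(repo_name, zip_path)   # hypothetical filter step
        jobs.task_done()

# One filter thread suffices here, but more can be started the same way.
threading.Thread(target=filter_worker, daemon=True).start()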

3.5 Configuration file

The configuration file determines several aspects of the system's behaviour and directly influences the search results, since it defines restrictions on allowed file extensions and on the repository size. The following is an example of a configuration file (it is also the configuration used to obtain all results in chapter 4):

[GLOBAL]
github_token = ****************************************
; The directory repositories containing
; matching code should be stored in
repo_dir = ./repos
; Temporary directory to store repositories before filtering
tmp_dir = ./.tmp

[DATABASE]
dbname = codecrawler.db

[CRAWLER]
; max size of a repository (in KB)
max_repo_size = 60000
allowed_extensions = .c,.cc,.C,.cpp
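Reading this file is straightforward with Python's configparser module; a sketch (file name and key spellings as reconstructed above):

import configparser

cfg = configparser.ConfigParser()
cfg.read("crawler.cfg")   # file name assumed
max_repo_size = cfg.getint("CRAWLER", "max_repo_size")
allowed_extensions = [ext.strip() for ext in
                      cfg.get("CRAWLER", "allowed_extensions").split(",")]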


CHAPTER 4

Evaluation

In this chapter we focus on the empirical evaluation of the performance of our search system using two different case-studies: one for OpenACC and one for OpenMP.

4.1 Experimental setup

4.1.1 Metrics

In order to evaluate the search system, we have formulated several metrics. Intuitively, we would like to know how many of the search results provided by the Code Search actually match the query, and how many individual matches are found in a single repository. Thus, we propose the following evaluation metrics:

Number of repositories The number of repositories that were filtered and contained a match. This metric allows us to determine how well the Code Search API performs when it comes to finding pragmas fitting the query.

Number of repositories skipped The number of repositories that were skipped due to being too large or already in the cache from earlier searches. It is important to know how many repositories we did not check and for what reason.

Number of matches per repository The average number of matches per repository. This metric gives insight in the distribution of pragmas over the repositories.

The system allows extension for future metrics that have not been described here. Section 5.2.3 goes into detail about metrics-related future work possibilities.

4.1.2 System configuration

The system is configured to run one crawling thread and one filter thread. The maximum allowed repository size has been set at 60MB. In addition, the filter is restricted to files with the following extensions: .c, .cc, .C and .cpp. For each of the case studies, the system runs for four hours. If the system is still completing a job at the four-hour mark, this job is finished before the crawling is terminated. If we run out of search results before the four-hour mark, the crawl is terminated early. The query is split into time segments of one year each.

4.2 Case study 1: OpenMP query

The first case study has been performed with a query for OpenMP code. The query looks for any OpenMP pragmas that contain the parallel compiler directive. The crawler ran for the entire four hours. During the crawl, the rate limit of the API was never reached before the reset time.


During the crawl, the system found 447 repositories that matched the query after filtering. In total, 4000 repositories were retrieved by the search. 424 of these repositories were too large and were therefore skipped, and 37 turned out to contain no matches after filtering. The vast majority of the results, 3092 in total, turned out to be duplicates. Table 4.1 shows an overview of these numbers and the corresponding percentages.

                              Amount   Percentage
Matching repositories            447        11.2%
Nonmatching repositories          37         0.9%
Too large repositories           424        10.6%
Already cached repositories     3092        77.3%

Total                           4000         100%

Table 4.1: Crawling results for the OpenMP query.

The high number of "Already cached repositories" seems strange, but it actually makes sense: the Code Search API returns the files that match, so we often get multiple results for the same repository. The filter stage checks every file in the repository the first (and only) time the repository is downloaded. At that point, every match has already been extracted, and there is no longer a point in visiting the repository again for its other file matches.

4.2.1 Matches per repository

We further analyze the matches per repository. Figure 4.1 shows a histogram of the number of matches per repository. We observe that the vast majority of the results contain fewer than 100 matches. Repositories with fewer than 10 matches make up 54.6% of all matching repositories. In total, 5920 matches were found inside 447 repositories.

[Histogram omitted; x-axis: number of matches, y-axis: frequency.]

Figure 4.1: Histogram of OpenMP matches

4.3 Case study 2: OpenACC query

The second case study has been performed with a query for OpenACC code. Once again we look for pragmas containing the parallel directive. For the OpenACC query, the search terminated before the four-hour mark, after 3 hours and 18 minutes. Once again, the rate limit of the API was never reached before the reset time, even though we actually processed more results during the OpenACC query, with a total of 5000. 585 of these repositories were too large. The number of repositories that contained no match after searching was higher, with a total of 135. 4219 results were already found in the cache and were therefore skipped. Table 4.2 shows an overview of these numbers and the corresponding percentages.

                              Amount   Percentage
Matching repositories             61         1.2%
Nonmatching repositories         135         2.7%
Too large repositories           585        11.7%
Already cached repositories     4219        84.4%

Total                           5000         100%

Table 4.2: Crawling results for the OpenACC query.

We observe that the percentage of matching repositories is much lower than for the OpenMP search. This seems to be caused by two main factors. First, the percentage of nonmatching repositories is almost three times larger than for the OpenMP query. The Github API sometimes seems to recognize OpenMP code as OpenACC code with our query; we suspect that the same problem appears the other way around, but that it is much less noticeable there, because there is far more OpenMP code on Github than OpenACC code. The filter does a good job of catching this, however, and ensures that the final results are all OpenACC matches. Second, the search results for the OpenACC query contain more duplicates than those for the OpenMP query. This simply means that we have fewer unique repositories to check and therefore fewer matching results.

4.3.1 Matches per repository

Again, we further analyze the matches per repository. Figure 4.2 shows a histogram of the number of matches per repository. Two repositories are not shown in this figure: one contained 1425 matches and the other 29,849. These two repositories are both compilers that implement OpenACC, which explains their extreme numbers of matches. If we exclude them, the distribution appears quite similar to that of the OpenMP query. Repositories with fewer than 10 matches make up exactly 50% of the matching repositories we found. In total, 31,897 matches were found inside the 61 repositories.

[Histogram omitted; x-axis: number of matches, y-axis: frequency.]

Figure 4.2: Histogram of OpenACC matches


4.4 Other observations

A crawl with the maximum repository size set to 200MB resulted in most results being giant repositories with very high numbers of matches. These repositories often turned out to be tool-chains with a C compiler packaged inside. The compilers themselves contain many matches for our query and therefore rank high in the Github API search results. Because we process the results starting from the highest-ranking result, the crawler becomes very slow due to these large repositories and the large number of duplicates. The crawling results also turn out to be rather poor, since many of them share identical matches, multiple tool-chains having packaged the same compiler. Ideally, the filter should be extended slightly to recognize and exclude tool-chain and compiler repositories, removing such false positives from the search.


CHAPTER 5

Conclusion and future work

This chapter summarizes our findings and proposes several promising directions for future work.

5.1 Conclusion

We have designed a system that automatically collects OpenMP and OpenACC code from Github. We have further provided a prototype implementation, which empirically demonstrates that the system is able to find code matching a user's query.

Our empirical analysis shows that, for the OpenMP query, the search results from the Github API are good: of the 484 repositories that were downloaded, 92% contained at least one match to the query. In contrast, only 31% of the repositories downloaded during the OpenACC crawl contained at least one match. It appears that queries for OpenACC code often get confused with OpenMP code, causing many more of the repositories in the search results to be filtered out as false positives. Both case studies showed that the Code Search API returns a lot of duplicate repositories: some projects contain many OpenMP/OpenACC pragmas across many files and therefore appear in the results repeatedly. We also found in both case studies that the majority of matched repositories contained fewer than 10 matching pragmas. The OpenACC study also showed that some repositories have a very high number of matches (a single repository accounted for 94% of all matches found), but since this repository is a compiler, it is not really useful.

In summary, our prototype provides empirical evidence that our research question can be answered positively: it is indeed possible to design and implement an automated search system for crawling Github repositories in search of OpenACC and OpenMP code. Although limited to two case studies, our validation experiments show promising results for the efficiency and usability of the tool.

5.2 Future work

5.2.1 Alternative crawling strategy

Github's Code Search API is a useful tool for finding specific patterns within individual files inside a repository, but other ways of finding repositories that may contain relevant code are available. One alternative would be to start the crawl at a repository that is known to contain useful code, and to move ahead by visiting the repositories owned by the contributors of this project. The idea is that developers contributing to repositories containing OpenMP/OpenACC code are more likely to own repositories with other OpenMP/OpenACC code that is also relevant for the query.

A drawback is that this strategy is expected to issue far more API requests than a single Code Search, and the share of repositories with no relevance at all is likely to be much higher. On the other hand, the number of duplicates should be lower, which could balance this out.


Future research into this strategy could still provide insight into whether or not this approach leads to a more efficient search.

5.2.2 Verifying completeness

The vast majority of the projects found in online code repositories will not match the OpenMP/OpenACC query. Using Github's Code Search API significantly increases the chances of finding a repository matching the query, compared to checking every repository exhaustively. However, the Code Search API applies restrictions to which repositories it will search in (section 2.1). A problem here is that, while the probability of a match increases, so does the probability that a matching repository is skipped. It would be useful to have an indication of how many matching repositories are missed when using the Code Search API or a different crawling strategy.

5.2.3 Additional metrics

Besides the metrics described in section 4.1.1, the system can also be extended to collect more. The Github API offers a wide range of information about repositories, which can be processed while the crawler runs. This information is all contained in the JSON response of the API, of which our implementation currently only uses the repository size. The filtering phase could also be extended to collect additional metrics based on the contents of the repositories.


APPENDIX A

Database schema

A.1 Database schema for the cache table

Field name   Type      Specification(s)   Clarification
id           INTEGER   PRIMARY KEY
repo_name    TEXT      NOT NULL           The name of the repository
last_visit   REAL      NOT NULL           Unix timestamp specifying the moment the entry was added to the cache
has_pragma   INTEGER   DEFAULT 0          Marks if a repository contained OpenMP/OpenACC pragmas (used for future crawls)

A.2 Database schema for the matches table

Field name   Type      Specification(s)   Clarification
id           INTEGER   PRIMARY KEY
repo_id      INTEGER   NOT NULL           The repository the match was found in
directive    TEXT                         The directive(s) found. Pragmas with multiple directives are stored as one string.
clause       TEXT                         The clause(s) found. Pragmas with multiple clauses are stored as one string.
in_file      TEXT      NOT NULL           The path to the file this match was found in, within the repository
found_on     REAL      NOT NULL           Unix timestamp specifying the moment the entry was added
has_pragma   INTEGER   DEFAULT 0          Marks if a repository contained OpenMP/OpenACC pragmas (used for future crawls)


A.3 Database schema for the repository table

Field name   Type      Specification(s)   Clarification
id           INTEGER   PRIMARY KEY
repo_name    TEXT      NOT NULL           The name of the repository
repo_url     TEXT      NOT NULL           URL of the repository
date_added   REAL      NOT NULL           The timestamp of when the repository was added


Bibliography

[1] Miltiadis Allamanis and Charles Sutton. “Mining source code repositories at massive scale using language modeling”. In: Proceedings of the 10th Working Conference on Mining Software Repositories. IEEE Press. 2013, pp. 207–216.

[2] Github. Best practices for integrators. https://developer.github.com/v3/guides/best-practices-for-integrators/. [Online; accessed on 2018-06-06].

[3] Regina Hebig et al. “The quest for open source projects that use UML: mining GitHub”. In: Proceedings of the ACM/IEEE 19th International Conference on Model Driven Engineering Languages and Systems. ACM. 2016, pp. 173–183.

[4] OpenACC. “The OpenACC Application Programming Interface v2.6”. In: (2017).

[5] OpenMP. “OpenMP Application Programming Interface 4.5”. In: (2015).

[6] Simon J Pennycook, JD Sewall, and VW Lee. “A metric for performance portability”. In: arXiv preprint arXiv:1611.07409 (2016).

[7] Edmund Wong, Taiyue Liu, and Lin Tan. “Clocom: Mining existing source code for automatic comment generation”. In: Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE. 2015, pp. 380–389.
