
Analysing Android testing techniques using the navigation flow

Sangam Kumar Gupta

sangam.gupta@student.uva.nl

August 24, 2018, 34 pages

Academic supervisor: Ivano Malavolta
Company supervisor: Kevin Bankersen

Host organisation: KPMG, https://home.kpmg.com

Universiteit van Amsterdam

Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering


Contents

Abstract
1 Introduction
1.1 Context
1.2 Problem statement
1.3 Research objective
2 Background
2.1 Automated testing techniques
2.2 Static analyses
2.3 Instrumentation
3 Experimental planning
3.1 Goals
3.1.1 Research questions
3.2 Experimental Units
3.3 Experimental material
3.4 Tools
3.4.1 Overview
3.4.2 Code coverage
3.4.3 Testing tools
3.4.4 Static analysis
3.4.5 Analysis procedure
4 Experimental execution
4.1 Adding instrumentation
4.2 Testing
4.3 Generating WTG
5 Result
5.1 Analysing the difference in coverage
5.2 Analysing the difference in transitions
5.3 Analysing the difference in screens
5.4 Analysing the missed widgets
6 Discussion
6.1 Evaluation of results
6.1.1 RQ1
6.1.2 RQ2
6.1.3 RQ3
6.1.4 RQ4
6.2 Threats to validity
7 Future work
8 Conclusion


Abstract

The Android research community developed many automated test input generation techniques to aid developers in their testing practices. These techniques allow developers to test their Android app without the need for test scripts, thus increasing its quality without much effort. Unfortunately, one of the issues these techniques have is that they vary in performance; it is still unclear why that is. In this research, therefore, we look at a new way of assessing two automated testing techniques (random and active learning) based on the navigation flow.

Specifically, we compare the test graphs created by the random and the active learning testing techniques on three aspects concerning the coverage difference. We look at activities, i.e. screens represented as nodes; transitions between activities represented as edges; widgets represented as user actions that trigger transitions. The navigation flow is depicted with a Window Transition Graph (WTG) [1]. WTG is a model representing the possible GUI window sequences and their associated events and callbacks.

We ask the following research questions:

RQ1 What is the coverage difference between random and active learning testing techniques?

RQ2 What effect does the difference in (ICC) transitions per app have on the delta coverage?

RQ3 What effect does the difference in the number of unique screens visited by each technique have on the delta coverage?

RQ4 What are the characteristics of the widgets that random and active learning testing techniques failed to execute?

We successfully ran tests on 412 industrial Android apps using AndroidRipper [2]. Also, we generated 375 WTGs using a custom solution on top of the Argus static analysis framework [3].

Our results show that random achieves slightly higher coverage and finds more transitions and screens than active learning. Additionally, we found that both techniques have difficulty with widgets that require more than one action (a complex interaction) to trigger a transition.

We conclude that:

• the navigation flow has a low influence on the difference in coverage between random and active learning testing techniques;

• both techniques have difficulty with complex interactions.

The contributions of this paper include a static analysis tool that is capable of producing a WTG, a tool for adding instrumentation to APKs without source code, and a new way of assessing the performance of testing techniques.


Chapter 1

Introduction

1.1 Context

Over the years mobile phones have undergone enormous changes. But not only has the phone changed, so has the Android platform, which currently sits at API level 27. Similarly, in the same period, practice and research developed various tools to aid developers in their Android app development. One of the core areas of research for Android is testing. Specifically, for this paper, we focus on testing stability using automated test input generation techniques. It is often clearly visible in reviews when an app is unstable: users flock to give the app a one-star review because they cannot use the app correctly. In fact, during Google I/O 2018, Google mentioned in one of its talks that 42% of one-star ratings mention stability and bugs, and 62% of users uninstall if they experience instability, i.e. crashes, freezes or errors [4]. The idea is that developers who use the tools that implement these techniques do not need to write test scripts, which can save time in the relatively short development cycles that apps usually have. In return, without much effort, they test stability and produce higher quality apps.

1.2 Problem statement

Unfortunately, the adoption of these tools leaves something to be desired. Developers prefer manual testing; some of the reasons include lack of knowledge of the tools and techniques, usability, the learning curve of the existing tools and lack of time [5–7]. One way in which the tools can become more attractive is by using the right techniques for the right apps. To determine the most effective testing technique for a particular app, we need to understand which characteristics have a high influence on the technique.

Currently, in most papers, the more common (and traditional) metrics are LOC, cyclomatic complexity and the number of activities (screens), or a combination thereof. With these, we can indicate the complexity and size of an app. So by running tests for varying sizes of apps with different complexities, we can assess the performance of the testing techniques, with the thought that these metrics will tell something about the coverage. However, due to the gesture-based nature of mobile apps, these metrics are not enough to understand why the techniques perform a certain way for an app. For example, one screen could contain most of the code (LOC), but if it is not reachable, then LOC as a measurement unit does not adequately represent the complexity of the app. In the same way, the number of screens does not mean anything if all the screens are merely reached by swiping left or right, which should not be difficult for most tools.

1.3 Research objective

We propose an approach that complements the previously mentioned metrics. The approach that we believe will improve our understanding of the techniques and of the complexity of an app is the (navigation) flow between screens: how does one navigate from one activity to another? Reaching all the screens is a huge challenge for the current set of techniques. Moreover, why is it so hard? In this research, therefore, we analyse what impact the navigation flow of an Android app has on automated test input generation techniques. As a result, we hope to gain a better understanding of the bottlenecks in the current techniques, introduce a new way of assessing the tools, and provide guidelines for developers when creating their navigation flow with respect to stability.

The thesis is structured as follows. Chapter 2 contains the necessary background information, followed by chapter 3 with the research plan. Next, chapter 4 covers the execution of the plan and chapter 5 displays its results. In chapter 6 we discuss our results, followed by future work in chapter 7. Last, chapter 8 contains the conclusion.


Chapter 2

Background

2.1 Automated testing techniques

There is a wide variety of automated test input generation techniques in research [8–30]. Amalfitano et al. [2] proposed a framework for comparing the many automated testing techniques, which we will use from here on forward. The authors separate two kinds of testing directions: offline and online. Offline testing entails creating the test cases before execution, whereas online testing entails generating test cases during execution. In this classification, automated test input generation, hereafter referred to as testing techniques, is an online, black-box form of testing: there is no need to write test cases before execution.

Most online testing can be further broken down into techniques that fall under random or active learning. Random testing techniques do not use a model to generate test cases; instead, the tools execute and generate test cases at random. Active learning, on the other hand, has a variety of implementations, but the crucial part is that it generates and executes tests methodically by exploiting a model. Thus these tests are repeatable and produce the same results every time. Amalfitano, Mahmood and Choudhary [2, 20, 31] showed in their comparison studies that random-based testing techniques on average achieve higher coverage than active learning. Conversely, in the papers introducing the tools, the authors claim that their tool outperforms random [13, 14, 21, 24–26, 29], thus contradicting the comparison studies. [32–34] further highlight the current challenges in Android testing. Notably, Rubin et al. [34] state the same challenges as we identify with Android testing: dynamic screens causing state explosion, inter-component communication (ICC) and complex interaction. We believe we can capture these issues by examining the navigational flow of an app and thus explain the difference in performance between the techniques and among papers.
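To make this distinction concrete, the following minimal sketch contrasts how a random strategy and a model-guided (active learning) strategy could pick the next event to fire. It is purely illustrative and not AndroidRipper's actual implementation; the Screen and Action types, and the assumption that every screen offers at least one action, are ours.

import java.util.*;

// Hypothetical types for illustration only.
interface Action {}
interface Screen {
    String id();
    List<Action> availableActions(); // assumed non-empty
}

interface ExplorationStrategy {
    Action nextAction(Screen current);
}

// Random testing: no model, any available action may be fired.
class RandomStrategy implements ExplorationStrategy {
    private final Random rng = new Random();

    public Action nextAction(Screen current) {
        List<Action> actions = current.availableActions();
        return actions.get(rng.nextInt(actions.size()));
    }
}

// Active learning: keep a model of which actions were already fired per screen
// and deterministically pick an unexplored one first, making runs repeatable.
class ActiveLearningStrategy implements ExplorationStrategy {
    private final Map<String, Set<Action>> fired = new HashMap<>();

    public Action nextAction(Screen current) {
        Set<Action> done = fired.computeIfAbsent(current.id(), k -> new HashSet<>());
        for (Action a : current.availableActions()) {
            if (done.add(a)) {          // first not-yet-fired action on this screen
                return a;
            }
        }
        return current.availableActions().get(0); // everything explored: revisit
    }
}

The crucial difference is that the second strategy's choices are a function of its model, which is why its runs are repeatable.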

Interestingly, both Song and Gu [14, 24] circumvent the issue of state explosion by directly going to the activity under test instead of going through all the steps necessary to reach that activity. This approach ignores state explosions, but also the traditional way in which a user would navigate to the activity, so it does not help with understanding and finding a solution to this issue.

2.2 Static analyses

An Android app consists of one or multiple activities. An activity is an isolated component and represents a single screen. Activities are glued together with intents at runtime. There are two forms of intents: explicit and implicit. Explicit intents start a specific activity within the scope of the app, and implicit intents start any component that can handle the intended action. The mix of implicit and explicit communication makes inter-component communication (ICC) and inter-app communication (IAC) hard to extract with static analysis, with ICC representing the communication between Android components (including activities) within the scope of the app, and IAC representing the communication between apps.
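As a minimal illustration of the two intent forms (standard Android API; DetailsActivity and the URL are placeholders):

import android.app.Activity;
import android.content.Intent;
import android.net.Uri;
import android.os.Bundle;

public class MainActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        // Explicit intent: the destination activity is named directly,
        // so the target within the app can be resolved statically.
        startActivity(new Intent(this, DetailsActivity.class));

        // Implicit intent: only an action and data are given; Android resolves
        // any component whose intent filter matches, possibly in another app.
        startActivity(new Intent(Intent.ACTION_VIEW, Uri.parse("https://home.kpmg.com")));
    }
}

// Placeholder destination activity for the explicit intent above.
class DetailsActivity extends Activity {}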

Li et al. [35] performed a systematic review of static analysers. Some notable findings are that most static analysers are used and built for security-related issues, and that they lack support for implicit communication.


It also names Amandroid, IC3 and IccTA as among the few analysers that can perform ICC analysis. Qiu et al. [36] performed a large, controlled and independent comparison of the three most prominent static analysis tools: FlowDroid [37] combined with IccTA [38], Amandroid [3], and DroidSafe [39]. All three analysers can perform ICC analysis. In their study, Qiu et al. presented the strengths and weaknesses of each tool. The results show that overall FlowDroid performed best, followed closely by Amandroid, with DroidSafe lagging behind mostly due to its limitation to SDK level 19.

FlowDroid [37] and Amandroid [3] are both made to perform taint analysis. This type of analysis finds potential security threats related to data input from all sources, for example data injection from an input field causing the app to crash. Both analysers can also be used as a library, thus providing an API to build a custom analysis tool for Android navigation. Both are state-of-the-art tools that can perform ICC analysis. Amandroid has built-in ICC analysis. FlowDroid, on the other hand, uses IccTA for ICC analysis. IccTA, in turn, uses IC3 to prepare an APK for analysis. FlowDroid uses Soot as the underlying static analysis framework with Jimple as its intermediate representation (IR) language. Amandroid is part of the Argus static analysis framework and uses Jawa as its IR language.

Yang et al. [1] proposed the window transition graph (WTG), a model representing the possible GUI window sequences and their associated events and callbacks. The authors implemented WTG analysis in Gator, a program analysis toolkit for Android. Gator, like FlowDroid, is built on top of Soot. They evaluated the model on 20 open-source applications using a mix of manual validation and test generation. With manual validation of 6 apps, the authors found that 6% of the routes found were infeasible across the applications.

2.3 Instrumentation

An Android APK does not have a built-in way to measure code coverage, so to track code coverage we need to use an external tool. For this, there are two popular options: EMMA and JaCoCo. EMMA was the popular (and the only) option for a long time; therefore many older tools use EMMA for instrumentation. JaCoCo, which is newer, is based on EMMA and receives updates regularly, whereas EMMA received its last update in 2005. It is recommended to use JaCoCo over EMMA because it is newer, and Android Studio (Google) also switched from EMMA to JaCoCo, although both are still supported. An alternative to EMMA and JaCoCo is to use a custom code coverage measurement tool. Song [24] created such a tool named Asc. Asc does not require decompiling the APK, and thus it can be used on all APKs; in comparison to EMMA and JaCoCo, which both require source code, this is a huge plus. The conventional way in which a developer would add coverage using either EMMA or JaCoCo is to produce an APK with instrumentation from the source code; such an APK can be built using a particular build configuration. Unfortunately, in our case, we only have the APKs of the respective apps without their source code, and thus we cannot use the conventional way. Also, Asc lacks a formal validation of its precision. Nor is the tool standalone, hence it cannot be combined with different testing tools.

Zhaun et al. [30] created a framework called BBoxTester that is able to generate code coverage reports and produce uniform coverage metrics in testing without the source code. The framework uses EMMA for instrumentation and was evaluated by comparing the different testing tools against BBoxTester's own test implementation. However, this approach does not evaluate the accuracy of the tool; it only tests whether the tool works. Nevertheless, by using EMMA, we know that it tracks code coverage accurately.


Chapter 3

Experimental planning

3.1 Goals

The goal of this study is to analyse the effects of the navigational flow of an Android app on automated test input generation techniques. As a result, we will be able to predict the performance of a technique based on the navigational flow, and it will also help us identify and understand the cause of the coverage difference among techniques. We analyse the impact by comparing two techniques, random and active learning, on their generated (test) flow against the coverage achieved. The test flow is the model (Guitree) that is generated to guide active learning during testing. We also compare the test flow against the WTG built by a static analyser so that we can detect the missing subtrees, potentially finding issues in the implementation of the tool. To formalise the goal:

Goal 1 Analysing the flow difference between the random testing technique and the active learning testing technique, to understand its effects on the coverage difference between techniques for Android apps.

The second goal, while not formal or measurable, is to advise developers concerning testing and the effect of their navigation flow.

3.1.1 Research questions

RQ1 What is the coverage difference between random and active learning testing techniques?

The purpose of this question is to re-evaluate the coverage difference between the two techniques on a larger dataset. We can only move forward with the study if there is a difference, as previous studies indicate. The difference in coverage, hereafter referred to as ∆coverage, is calculated per app as follows:

∆coverage = coverage_random − coverage_activelearning

The value of ∆coverage ranges between -100 and 100. Negative numbers like -50 indicate that active learning achieved 50% more coverage than random, and positive numbers like 50 indicate that random achieved 50% more coverage than active learning. We hypothesise that random will on average achieve higher coverage than active learning, similar to previous evaluation studies.

RQ2 What effect does the difference in (ICC) transitions per app have on the delta coverage?

We calculate the difference in transitions, hereafter referred to as ∆transitions, as follows:

∆transitions = uniquetransitions_random − uniquetransitions_activelearning

∆transitions is an integer value. Negative values indicate more transitions in favour of active learning and positive values indicate more transitions in favour of random. This metric will help us better understand the impact of transitions on the coverage, with ∆transitions being the dependent variable and ∆coverage being the independent variable. If one technique finds more transitions, it should result in higher coverage because more code is executed. Likewise, a transition also means that there is more code to test because more screens are found. We therefore hypothesise that ∆transitions has a positive effect on ∆coverage. Additionally, we hypothesise that random will find more transitions simply due to the lack of efficiency in active learning. Active learning is less efficient because it restarts after every test scenario; random does not and will not have any downtime, meaning that it will execute more events.

One last note: the difference in transitions is calculated using absolute values instead of the percentage of total transitions, because we know from previous studies that the coverage will not be high. Taking the relative value would therefore not clearly show the differences in coverage. For example, if active learning found two transitions and random testing found three transitions with the app having a total of four transitions, then the difference would be 25%, but this does not mean random will have 25% higher coverage than active learning. We do, however, think that a difference of one edge should mean (slightly) higher coverage.

RQ3 What effect does the difference in the number of unique screens visited by each technique have on the delta coverage?

We calculate the difference in unique screens found, hereafter referred to as ∆screens, as follows:

∆screens = uniquescreens_random − uniquescreens_activelearning

Like RQ2, ∆screens is an integer value. Negative values are in favour of active learning, and positive values are in favour of random. This metric will help us better understand if the difference in technique is due to one finding more screens than the other. In this relation, the ∆screen is the dependent variable and ∆coverage is the independent variable. Compared to RQ2 this should indicate a stronger relationship with ∆coverage because ∆transition is partially independent of the number of screens. For example, a higher number of transitions does not necessarily result in a higher number of screens found. It could indicate that there are multiple ways in which one can transition between screens. So conversely to transitions, a high difference in screens indicates more code found that can be tested and therefore resulting in a potentially higher coverage. Hence we hypothesise that a higher ∆screens should result in a higher ∆coverage.

Like RQ2 we use absolute values to calculate the difference between screens.

RQ4 What are the characteristics of the widgets that random and active learning testing techniques failed to execute?

For this research question, we focus on all widgets that lead to a different screen but were not executed during testing by the testing techniques. Widgets allow a user to interact with the application, and these widgets can also cause transitions. So if we want to analyse why the testing techniques do not execute specific transitions (actions), then we have to look at the widgets. As a result, we gain knowledge about the limitations of the tool or potentially of the technique. For example, we could find that unconventional widgets are often the cause of a technique stopping its discovery of the rest of the model, or that specific widgets like ListView can cause state explosions and as a result prevent a technique from finding more screens.

In comparison with the other research questions, this one is more exploratory. RQ3 represents the nodes and RQ2 the edges; here we look at what action causes a transition or, more interestingly, what stops a transition.
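To illustrate what such a complex interaction can look like, the sketch below uses the standard Android options-menu callbacks; the menu resource, item id and SettingsActivity are hypothetical. The transition only fires after two user actions: opening the menu and then tapping the item.

import android.app.Activity;
import android.content.Intent;
import android.view.Menu;
import android.view.MenuItem;

public class MainActivity extends Activity {

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // First user action: the menu must be opened before its items are visible.
        getMenuInflater().inflate(R.menu.main_menu, menu); // hypothetical menu resource
        return true;
    }

    @Override
    public boolean onOptionsItemSelected(MenuItem item) {
        // Second user action: only tapping this item triggers the transition.
        if (item.getItemId() == R.id.action_settings) {    // hypothetical item id
            startActivity(new Intent(this, SettingsActivity.class)); // hypothetical target
            return true;
        }
        return super.onOptionsItemSelected(item);
    }
}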

3.2 Experimental Units

For this study, we limited the mobile applications to Android. Android, compared to iOS, has a higher market share (85.9% vs 14.1%), is open source and has a more active research community; moreover, all prior research is limited to Android as well, and therefore the tools provided only work for Android.


Figure 3.1: Abstract overview

The dataset we used, without excluding any category, includes 15124 Android apps [40]. The sample size for this population, based on a confidence level of 95% and an error margin of 5%, requires testing at least 375 apps. The selection of apps was made in Excel via simple random sampling.
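For reference, this sample size follows from the usual formula with a finite-population correction, assuming maximum variability (p = 0.5) and z = 1.96 for a 95% confidence level:

n_0 = \frac{z^2\,p(1-p)}{e^2} = \frac{1.96^2 \cdot 0.25}{0.05^2} \approx 384.2,
\qquad
n = \frac{n_0}{1 + \frac{n_0 - 1}{N}} = \frac{384.2}{1 + \frac{383.2}{15124}} \approx 375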

3.3 Experimental material

All experiments were performed on Windows 10, with a Core i5-6500, an AMD 480 and 16GB RAM. To save time, each app was tested with the random and active learning strategies in parallel, on two emulators running the same configuration. The tools lacked support for hardware devices, hence the limitation to emulators. The emulators ran the latest non-beta Android OS at the time of the experiment, which was Android API level 27, also known as Android 8.1. The architecture type for the emulator was x86. ARM-based emulators had much difficulty running on the system, which resulted in many crashes. Since the system architecture does not have an impact on testing, we chose to limit testing to x86 emulators.

3.4 Tools

3.4.1 Overview

Figure 3.1 gives a general overview of the testing and static analysis process. It consists of three components, two custom made and one (AndroidRipper) made by Amalfitano [2].

The process consists of two parts: static analysis and testing. The former analyses a clean APK so that it can build a WTG. The latter first decompiles the clean APK, then adds EMMA instrumentation to the source code, then recompiles it into a new "instrumented" APK that is capable of tracking code coverage, and finally runs AndroidRipper twice, once for random and once for active learning, on the instrumented APK, resulting in two coverage reports and two test graphs.

The produced artefacts include the Window Transition Graph, Coverage report and the test graph. The following sub-chapters give a more in-depth description of each component and its alternatives.

3.4.2 Code coverage

Android does not have a built-in way to add instrumentation to an APK, as discussed in chapter 2.3. Therefore our first approach was to use BBoxTester [30]. Unfortunately, we could not get this working. There were two problems: the libraries used by the tool were outdated, and the tool mixed code for instrumentation and testing, which made splitting the code difficult. Nevertheless, BBoxTester provided much information to write our own implementation using EMMA v2.1. We specifically chose EMMA because it is what other papers have used in the past, we do not have any source code, and we could only find (limited) information on how to instrument APKs without source code.

Due to the limited information and lack of source code, we had many difficulties in producing an instrumented APK; it was a lot of trial and error. So we included a step-by-step process to instrument APKs. We think this will greatly benefit future work.

1. Decompile the APK using apktool version 2.3.3. It is essential to use the latest version for support of the latest API, in our case level 27. This step produces multiple files including the manifest, layout files and *.dex files.

2. Add the permission to write to external storage to the manifest, so that EMMA can generate and save coverage files on the SD card of the emulator.

3. The dex files produced at step 1 can be either one or many, depending on the size of the APK. Each dex file can hold 64k method references. We transform each dex file to a jar using the tool dex2jar version 2.1-nightly-28.

4. We unzip each jar file without modifying the folder structure. Retaining the structure of the jar is essential for rezipping later on.

5. This step involves adding EMMA instrumentation to the *.class files extracted from the jar. To add EMMA instrumentation to the source code and not to any library code, we look at the package name. The package name should reflect the folder structure. If this is the case, then we merely add EMMA instrumentation to all the classes inside the (package) folders. However, if the folder structure is different from the package name, then we exclude all well-known library packages and add instrumentation to the remaining classes (a simplified sketch of this classification follows after the step list). We also save potential libraries to an external file which we later review to check if we missed any libraries. If there are any missing libraries, we rerun the instrumentation process for all APKs with an updated library list.

6. After adding instrumentation to the class files, we zip them back to a jar.

7. We add EMMA runtime and configuration files to the jar and simultaneously generate new dex files. In case of multiple dex files, it is crucial to include the entry class in the main dex file using the main-dex-list. This list contains all method references that the main dex file will include. The structure saved during step 4 determines which files belong to the main dex. This step uses the dx tool in Android SDK API 27.

8. After we have produced the dex file, we recompile it into an APK using apktool version 2.3.3.

9. The last step is to sign and align the APK. We use the standard tools and debug certificate provided by the Android SDK API 27. The debug certificate is made for the development environment and cannot be used to upload an APK to the Play store. Using a debug certificate ensures that we cannot reupload the modified APK.

The result is an instrumented APK with its respective *.em file. We use these files to generate human-readable coverage reports.
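The sketch referred to in step 5 is given below. It is a simplified, hypothetical version of the classification; the class and list contents are illustrative, not the exact code we used.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class AppCodeFilter {

    // Excerpt of a well-known library prefix list; the real list is longer and
    // was extended manually during the experiment.
    private static final Set<String> KNOWN_LIBRARIES = Set.of(
            "com.google.", "android.support.", "androidx.", "com.facebook.", "okhttp3.");

    private final String appPackage; // the package declared in the manifest

    public AppCodeFilter(String appPackage) {
        this.appPackage = appPackage;
    }

    /** Decide whether a class (by fully qualified name) should be instrumented. */
    public boolean isAppCode(String className) {
        if (className.equals(appPackage) || className.startsWith(appPackage + ".")) {
            return true;                   // matches the declared package: app code
        }
        for (String lib : KNOWN_LIBRARIES) {
            if (className.startsWith(lib)) {
                return false;              // well-known library: do not instrument
            }
        }
        // Folder structure deviates from the package name and the class is not a
        // known library: instrument it, but log it for manual review afterwards.
        return true;
    }

    public List<String> classesToInstrument(List<String> classNames) {
        return classNames.stream().filter(this::isAppCode).collect(Collectors.toList());
    }
}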

3.4.3 Testing tools

The tool we used for testing is a customised version of AndroidRipper v2017.10. It is the tool created by Amalfitano et al. [2]. We chose this tool since it is customisable, works without modifying the APK source code, is made to compare testing strategies, and most importantly it supports both active learning and random testing.

However, we still made the following modifications to improve the workflow: adding support for multiple emulators; improving the overall stability by fixing bugs and catching errors thrown by the tester; generating a graph of all the screens that the tool visited; producing a coverage report using EMMA.

The tool-specific configuration was as follows. First, AndroidRipper is configured for both active learning and random to stop after 25 restarts; 25 is the number of restarts, determined by testing with a few apps, that gave us a test time per app of around half an hour. We chose a runtime of half an hour due to time constraints. Second, the active learning configuration was set to use a depth-first traversal, meaning that planned tasks execute in LIFO order. We chose this order because it results in a bigger graph: for a screen, its tasks are executed until a new screen is found, and then the new screen gains priority. In regard to coverage, Amalfitano et al. [2] showed that there is no difference in coverage between depth- and breadth-first. Last, random had no notable configuration.
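A minimal sketch of the difference between the two traversal orders, with a hypothetical Task type (AndroidRipper's real scheduler is more involved):

import java.util.ArrayDeque;
import java.util.Deque;

public class TaskScheduler {

    static class Task {
        final String description;
        Task(String description) { this.description = description; }
    }

    private final Deque<Task> planned = new ArrayDeque<>();
    private final boolean depthFirst;

    TaskScheduler(boolean depthFirst) { this.depthFirst = depthFirst; }

    // Newly planned tasks (e.g. for a freshly discovered screen) go to the tail.
    void plan(Task task) { planned.addLast(task); }

    // Depth-first takes the most recently planned task (LIFO), so the tasks of the
    // newest screen run before older ones; breadth-first takes the oldest (FIFO).
    Task next() { return depthFirst ? planned.pollLast() : planned.pollFirst(); }
}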

3.4.4 Static analysis

#Main Layout xml
<Button android:onClick="exampleCallbackMethod" />

#MainActivity
public void exampleCallbackMethod(View view) {
    methodA();
}

public void methodA() {
    startActivity(new Intent(this, NextActivity.class));
}

Listing 3.1: Android example

The goal of the static analyser is to build an accurate window transition graph of an app. This graph represents how a user can move from one screen to another [1]. Each node represents a screen, and each edge represents an action with its corresponding widget.

Compared to [1], the WTG we build is slightly different. Yang et al. built their WTG by exploiting the stack behaviour of Android. Android tracks the screens using a stack-like structure: every new screen a user visits in an app is put on top of the stack, and by pressing back, Android pops the last screen. Yang et al. exploit this behaviour to add back transitions to their WTG. We do not need these transitions because we are only interested in all new screens that a user can visit. By design, a back press returns to a state that was already known, without adding code coverage. Therefore we excluded back presses to simplify and speed up the static analysis.

To build the WTG, we have to extract information from an APK based on the activities and fragments, since these represent screens. For this, we can use a static analysis framework for Android, which gives an intermediate representation of the Java bytecode.

While there are many static analysers [35], we only focus on the three most prominent static analysis tools, as described by Qiu et al. [36]: FlowDroid [37] combined with IccTA [38], Amandroid [3], and DroidSafe [39]. In their study, Qiu et al. concluded that overall FlowDroid performed best, followed closely by Amandroid, with DroidSafe being the weakest. More importantly, DroidSafe does not support the latest Android API. Therefore, we only considered FlowDroid v2.0 and Amandroid v3.1.3. We also briefly examined Gator v3.4 [1, 41] since it produces a window transition graph, which is what we need. Unfortunately, the running time of Gator for some apps was more than 8 hours, at which point we killed the process. Furthermore, the graph produced was incomplete. While we could have fixed the latter issue, we had difficulty fixing the former. For comparison, the same apps would run in under 30 minutes with FlowDroid and Amandroid.

We wrote a static analyser on top of both frameworks to determine which framework would work best for building a window transition graph. We concluded that Amandroid is a lot easier to work with and produces better results compared to FlowDroid. This difference is for a large part due to FlowDroid using IccTA for ICC analysis. IccTA was released in 2015 and has not received an update since, so it could be that IccTA does not work well with newer apps. As a result, we could not find as many ICC links with IccTA (and therefore FlowDroid) as with Amandroid, even with the help of the authors. Moreover, Amandroid recently received a significant update [3], which could further explain the difference in performance between Amandroid and IccTA (and thus FlowDroid).

Implementation

We built a custom tool for extracting the WTG with Argus-SAF. Argus-SAF is the successor of Amandroid and incorporates both Amandroid and Jawa, with Jawa being their custom intermediate language.

Extracting a window transition graph using Argus can be separated into three phases.

Phase 1 The generation and extraction of ICC links. We start by loading the APK into Argus, which gives us raw information about the classes and their methods. To see the connections between methods, we need to use a callgraph. A callgraph in Argus can represent different abstraction levels, e.g. the control flow of a method versus that of a class. The one we focus on is the signature-based callgraph, which means that the call graph limits itself to methods and shows the flow between them.

To build such a callgraph, one needs to input the methods to include in the graph. It is necessary to include only the app methods and not any library methods, because this improves the performance and removes unnecessary data. Argus provides information on whether a method is part of a library or the app, so collecting app methods is a simple process. However, Argus distinguishes app code from library code using the package name but also using an external library list. We found that this list is incomplete and extended it with the libraries we found during the instrumentation process. We validated every potential library manually and only added it if it was truly an SDK of sorts.

With the callgraph built, we can next analyse its content and extract all methods that start a different activity, hereafter referred to as ICC methods. ICC methods are extracted using a built-in bottom-up approach which analyses the content of each method for method calls that start an activity, e.g. void startActivity(). If the content contains an explicit intent, then it resolves the destination, i.e. the next activity. In case of an implicit intent, we analyse the manifest to check if any component (activity) listens to the action of that intent. If so, it is added as the destination; otherwise we exclude that intent because it is outside the scope of the app. This phase results in a set of ICC methods with their destination activities. For example, based on the code in listing 3.1, we will now have (methodA, NextActivity).
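The destination resolution described above can be summarised by the following simplified sketch; the data structures are hypothetical stand-ins for what Argus and the parsed manifest provide, not the Argus API itself.

import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class IccDestinationResolver {

    // Parsed manifest: for every activity, the intent-filter actions it listens to.
    private final Map<String, Set<String>> activityActions;

    public IccDestinationResolver(Map<String, Set<String>> activityActions) {
        this.activityActions = activityActions;
    }

    /** Explicit intent: the destination class is named at the call site. */
    public Optional<String> resolveExplicit(String targetActivity) {
        return Optional.of(targetActivity);
    }

    /** Implicit intent: keep it only if some activity in the app handles the action. */
    public Optional<String> resolveImplicit(String action) {
        return activityActions.entrySet().stream()
                .filter(entry -> entry.getValue().contains(action))
                .map(Map.Entry::getKey)
                .findFirst();   // empty: the intent leaves the scope of the app
    }
}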

Phase 2 During this phase we collect all callback methods. With callback methods, we mean methods that contain listeners to widget actions or are directly called from a widget; in other words, the first method after a user action. We need to find the callback methods that lead to the ICC methods so that we know what user action causes a transition between activities. In most cases, the callback method is the ICC method. However, in some cases, the ICC method is in a different method or in a different class altogether. To cover these cases we analyse all callback methods and check whether an ICC method is one of the methods that can be reached from it; in other words, we find all transitive combinations of methods. It should be noted that we ignore the internal structure of a method, i.e. the conditional statements, because it is outside the scope of this study. Currently we also assume that a class that initialises the class containing an ICC method will also call that method. At this moment we only want to know that a user action can lead to a different screen. After the analysis, we will have a map of all the callback methods that lead to a set of ICC methods. With this information, we know what action a user can perform on a screen that will open a new screen. After phase 2 we will have the following based on listing 3.1: (exampleCallbackMethod, {methodA})
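Phase 2 essentially boils down to a reachability check on the call graph. A minimal sketch, with the call graph reduced to a plain adjacency map for illustration:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CallbackReachability {

    /** Returns the ICC methods transitively reachable from the given callback method. */
    public static Set<String> reachableIccMethods(String callback,
                                                  Map<String, Set<String>> callGraph,
                                                  Set<String> iccMethods) {
        Set<String> found = new HashSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.push(callback);

        while (!work.isEmpty()) {
            String method = work.pop();
            if (!visited.add(method)) {
                continue;                              // already processed
            }
            if (iccMethods.contains(method)) {
                found.add(method);
            }
            for (String callee : callGraph.getOrDefault(method, Set.of())) {
                work.push(callee);
            }
        }
        return found;  // e.g. exampleCallbackMethod -> {methodA} for listing 3.1
    }
}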

Phase 3 For all the callback methods found during phase 2, we have to find the widget or user action that triggers the callback. In most cases, a user causes the transition; however, it is also possible that the transition is automatic, for example a splash screen that loads specific resources before continuing. To collect all the widgets we had to modify Argus: Argus did not provide much information regarding the widgets in each class, nor did it provide which widget calls which method; it only gave a set of all widgets. The information to bind the set of widgets to their corresponding callback methods was, however, available during the first phase, but only internally. To be exact, during the generation of the signature callgraph, widgets were used to determine the (user) callback methods in the first place. By modifying the source code, this information was made public. Now, for each callback method, we can know what user action and widget caused the activity transition. As a result of phase 3, the output based on listing 3.1 now also includes the Button widget that triggers exampleCallbackMethod.


Figure 3.2: Example graph based on listing 3.1

With the results of all phases we build multiple graphs varying in detail. For listing 3.1 the graph looks like figure 3.2.

3.4.5 Analysis procedure

The data is analysed using Python and Pandas. Pandas is an open-source data structure and data analysis library. Additionally, we use visualisation libraries like Matplotlib. Before analysing the data, we perform a few transformations. First, we transform the raw test and WTG files into CSV files that are readable by Pandas. Next, we calculate the coverage per app. EMMA generates coverage reports at class, method and block level. We chose to calculate the coverage using the block report because EMMA measures in a binary fashion: a class, method or block is either 100% or 0% covered. EMMA counts a unit as covered as soon as it is called once, regardless of the actual code executed, so one class call results in 100% coverage for that class. The block is the smallest measurable unit and therefore the most accurate.

Last, we validate and clean the data.

All tools and the raw and transformed data are available on GitHub.


Chapter 4

Experimental execution

4.1 Adding instrumentation

In the first instance, we ran our instrumentation until we reached the desired 375 tested apps. After we reached this number, we kept running the experiment until we ran out of time. In the end, within the project time, we were able to run the instrumentation on 1106 apps, of which 741 succeeded, yielding a success rate of 67%. The number 1106 was not determined beforehand; it was merely the number at which we stopped because we ran out of time.

All 365 failures resulted from the recompilation step (step 7 in section 3.4.2) giving an error along the lines of "no such label found". We believe the error is due to intrusive and possibly incorrect changes made by EMMA in the class files. Zhaun et al. also noted the same error [30], indicating the limitations of the library. A workaround for the error is to ignore the files that produce it. Nonetheless, we chose not to ignore these files, because doing so can skew the coverage report. Instead, we noted these apps as failures.

4.2 Testing

Second, we ran random and systematic testing on the 741 apps. Of these apps, 418 completed testing successfully, yielding a success rate of 56%. Because each app ran twice, once for random and once for systematic, we were able to detect an anomaly: in some cases, one of the strategies would fail. In theory, this should not happen, but it could be due to the emulator crashing and thus resulting in a failure.

The other failures were due to various reasons, the most common one being that the instrumentation runner was unable to launch the app. We investigated this problem but could not find any solution. It was hard to debug the issue because the app did not show any error.

Additional reasons for failure were based on two conditions we put on the APKs. First, we skipped ARM-based apps, intending to test them later using a real device due to performance issues. Unfortunately, we ran out of time, and thus we listed them as failures. Second, some apps were outdated and therefore did not run correctly, e.g. the server was offline or a non-removable popup message said that the app should be updated. These apps were manually reviewed and listed as failures even though they tested successfully.

Furthermore, one error resulted from AndroidRipper not being able to determine the package name of an app due to corrupt manifest files from step 2. We did not fix this because it occurred in relatively few apps and there was no generic fix for this issue.

The remaining errors included constant crashing during testing, resulting in a timeout, and failing to generate coverage files. These issues were also hard to fix because we could not run a debugger, so the only way to find the problem is by logging every step. Even then, running the tool and waiting for a crash took such a long time that we did not try to fix these issues.

Lastly, we manually validated apps that had 0% coverage and, where needed, re-instrumented the app and reran the tests. All the other problems we found during testing were relatively simple to fix. These included updating libraries and the code to support Android API level 27, timeout tweaks and adding try-catch blocks to prevent crashes.

4.3 Generating WTG

The static analyses were only performed on the remaining 418 apps. Six apps were unable to finish the analysis within our timeout limit of 8 hours. Further investigation showed these apps as being stuck at a single point; therefore these were noted as failures.

Of the remaining 412 apps we filtered out all graphs that had no edges because we are only interested in apps that have a navigation flow. As a result, this gave us 296 apps with a viable WTG.


Chapter 5

Result

Of the 1106 apps tested and analysed, 412 were successful, yielding a success rate of 37%. This exceeds our intended sample size of 375 apps. Figure 5.1 shows that the 412 apps cover all categories.

5.1 Analysing the difference in coverage

Table 5.1 lists the coverage achieved for the 412 apps by random and active learning. Random on average achieved higher coverage than active learning, similar to the previous study [2], albeit with a smaller mean difference of 1.1%. The table also shows the standard deviation (std) and median (50%); while the std is very comparable between random and active learning, the median has a difference of 2.3%, higher than the mean.

Figure 5.2 further highlights the similarities in coverage between random and active learning, as is expected with such a low difference in coverage. In contrast, the std of the difference is 6.1% when comparing the individual apps, meaning that there is a difference between random and active learning per app, but it averages out to 1.1%. Furthermore, the maximum difference achieved was 42.4% in favour of active learning and 27.6% in favour of random.

5.2 Analysing the difference in transitions

Table 5.2 and figure 5.4 display the transitions found during testing by each strategy. Random and active learning find on average 2.5 and 1.5 transitions respectively. Consequently, on average random finds 0.9 transitions more than active learning, which is incredibly low. Similarly, the median for random is 1 and for active learning 0, implying that overall very few transitions occurred. Figure 5.3 is a scatterplot with the y-axis showing the ∆coverage and the x-axis showing the ∆transitions. The figure also contains a regression line with a standard 95% confidence interval. This plot shows that there is a weak positive linear relationship between ∆transitions and ∆coverage.

Table 5.1: Coverage analysis random and active learning

        random      active learning  ∆coverage
count   412.000000  412.000000       412.000000
mean    23.586942   22.020925        1.109471
std     17.458280   16.816771        6.069239
min     0.000000    0.000000         -42.345277
25%     10.144928   9.443588         -0.026845
50%     19.610996   17.314267        0.000000
75%     34.527209   32.597924        1.554236
max     100.000000  100.000000       27.551296

Figure 5.1: Analysed apps per category

Figure 5.2: Coverage comparison random and active learning

        random      active learning  ∆transitions
count   412.000000  412.000000       412.000000
mean    2.483010    1.548544         0.934466
std     3.754278    2.803382         3.117032
min     0.000000    0.000000         -13.000000
25%     0.000000    0.000000         0.000000
50%     1.000000    0.000000         0.000000
75%     3.000000    2.000000         2.000000
max     26.000000   18.000000        16.000000

Table 5.2: Transitions analysis random and active learning

Most of the points are gathered around the x- and y-axes, giving a seemingly independent relationship. Calculating the correlation using Spearman gives a weak correlation of 0.32 with a p-value of 5.95e-11.
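For reference, the Spearman rank correlation reported here is, in the absence of ties, computed as

\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)},

where d_i is the difference between the ranks of ∆transitions and ∆coverage for app i and n = 412.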

5.3 Analysing the difference in screens

Table 5.3 is consistent with the previous results, showing a small difference between random and active learning. Random on average finds 2.2 screens and active learning 1.7 screens, resulting in a mean difference of 0.5 screens in favour of random. Moreover, random has double the median (2) of active learning (1). In particular, active learning in most cases does not get past the starting screen. Additionally, figure 5.5 shows that random has double the range of active learning.

Compared to figure 5.3, plot 5.6 provides a less random scatter plot. The linear regression line indicates a stronger relationship between ∆screens and ∆coverage. We see a stronger but still weak positive linear relationship, suggesting that there is some correlation between these variables. We also see a high density around zero and one on the x-axis, similar to transitions. Calculating the correlation between these two variables using Spearman gives a correlation of 0.39 and a p-value of 6.05e-16.

This result, in combination with section 5.2, clearly shows that the Guitree built by AndroidRipper is incredibly limited. To demonstrate, table 5.4 shows the screens and transitions extracted from the WTG. In total the static analyser created 296 graphs. The table lists a median of ten transitions and seven screens, which is more than triple the median of transitions and screens found during testing. This contrast clearly expresses the limitations of AndroidRipper when it comes to creating a full graph.

Figure 5.3: Relation ∆transition and ∆coverage

Figure 5.4: Transitions comparison random and active learning

        random      active learning  ∆screens
count   412.000000  412.000000       412.000000
mean    2.220874    1.684466         0.536408
std     1.538556    0.916171         1.129621
min     1.000000    1.000000         -2.000000
25%     1.000000    1.000000         0.000000
50%     2.000000    1.000000         0.000000
75%     3.000000    2.000000         1.000000
max     10.000000   6.000000         8.000000

Table 5.3: Screens analysis random and active learning

Figure 5.5: Screen comparison random and active learning

Figure 5.6: Relation ∆screens and ∆coverage

        transitions  screens
count   296.000000   296.000000
mean    85.604730    10.976351
std     455.261924   13.854796
min     1.000000     1.000000
25%     3.000000     3.000000
50%     10.000000    7.000000
75%     34.250000    14.000000
max     7260.000000  129.000000

Table 5.4: WTG analysis


widget                      count  percentage
NaN                         1048   32.987095
android.view.View           260    8.183821
android.view.MenuItem       245    7.711678
android.widget.Button       189    5.949008
android.widget.ImageView    185    5.823104

Table 5.5: Top 5 widgets missed by active learning, total 3177

method              widget  count  percentage
onCreate            NaN     173    16.507634
onActivityResult    NaN     107    10.209924
onResume            NaN     96     9.160305
a                   NaN     71     6.774809
onStart             NaN     39     3.721374

Table 5.6: Top 5 NaN methods by active learning, total 1048

5.4 Analysing the missed widgets

Tables 5.7 (random) and 5.5 (active learning) list the top five widgets that each technique did not execute. These widgets are based on transitions that could be directly executed from one of the screens found. Both techniques show similar results, indicating that there is no significant difference between the techniques when it comes to finding widgets. Yet, there are a few notable findings. First, random missed more widgets (3788) than active learning (3177); this could be the result of random finding more screens and therefore increasing the total number of transitions it can find.

Second, for both techniques, most missed widgets are of type NaN. NaN is the result of Argus not finding the widget type, either because of a missing edge case we did not account for in our tool, a limitation of Argus, or the use of custom widgets.

Third, android.view.View is mostly due to limitations in our tool. It is the parent class for all widgets, so in this case the tool we built could not find the concrete widget because the widget is initialised outside the scope of the method. The tool currently only performs a points-to analysis within the scope of the method, thus missing the concrete widget.

Last, both tables contain the widget android.view.MenuItem. This widget usually requires two steps to execute, i.e. open the menu and then click on the MenuItem, so it could be classified as a complex interaction. As the tables show, random missed more of these widgets in absolute terms; however, this is due to random simply finding more widgets (3788 compared to 3177 for active learning). Looking at the percentage of total widgets, as also shown in the tables, active learning has slightly more difficulty (by 1.2 percentage points) with MenuItems.

Tables 5.8 (random) and 5.6 (active learning) dive deeper into NaN; they show the top five methods from which the transition occurs. Most of these methods can be classified as lifecycle methods. Further, the method "a" is due to Proguard changing method names; it too could have been a lifecycle method.

widget                      count  percentage
NaN                         1135   29.963041
android.view.View           272    7.180570
android.view.MenuItem       248    6.546990
android.widget.ListView     191    5.042239
android.widget.ImageView    185    4.883844

Table 5.7: Top 5 widgets missed by random, total 3788


method                   widget  count  percentage
onCreate                 NaN     173    15.242291
onActivityResult         NaN     114    10.044053
onResume                 NaN     94     8.281938
a                        NaN     69     6.079295
startActivityForResult   NaN     54     4.757709

Table 5.8: Top 5 NaN methods by random, total 1135


Chapter 6

Discussion

6.1 Evaluation of results

6.1.1 RQ1

What is the coverage difference between random and active learning testing techniques?

Random on average achieves higher coverage than active learning. As expected, because of its efficiency, random fires more events than active learning in the same amount of time, thus validating previous comparison studies. However, the coverage difference of 1.1% is a lot lower than we had anticipated. In fact, with such a low difference in coverage, it only gets harder to find the cause of the difference. The saving grace here is that the standard deviation is high, indicating that some apps perform better using one strategy than the other, which is what we are interested in. The coverage could be very close due to incorrect instrumentation, which is highly susceptible to the folder structure of an APK. We found that a lot of APKs deviate from the given application id; in other words, the package structure is different from the application id, which is not the case by default. As a consequence, instrumenting and statically analysing the right source code becomes hard. Granted, there are some studies that focus on library detection [42, 43]; however, looking at the source code of Argus, we found that these are not used. Specifically, we found that mobile app builders mostly use a package structure different from the application id, which is logical since generating the same package structure is more efficient and the differentiating factor here is the application id. Conversely, we can argue that a testing tool would only be used during development, where the source code is readily available, thus making this a non-issue under normal circumstances.

Despite the low coverage difference, it is hard to say which study is correct. Compared to the previous paper [2], this study worked with APKs without source code. In contrast, many other studies work with datasets of APKs with source code, thus making it easier to evaluate their tool. The downfall is that these tools are built for that dataset and lack evaluation on industrial applications. High coverage in small apps does not translate to high coverage in industrial applications; it only scratches the surface. As a consequence, the performance on industrial applications is lacking, and the usability is low due to not having a comprehensive test report or code coverage out of the box. Therefore, it becomes difficult to justify the implementation of AndroidRipper in a larger app development process. Instead, it should be used as a tool to compare different implementations of testing techniques.

6.1.2 RQ2

What effect does the difference in (ICC) transitions per app have on the delta coverage?

∆transitions has a weak effect on ∆coverage. The results show a correlation of 0.32 with a very weak positive relation, indicating that the difference in coverage is not due to one technique finding more transitions. This is interesting; it could be due to the distribution of code. The amount of code behind a transition is low compared to the rest, or most of the code is located in a different screen altogether, meaning that one screen can have more transitions but less code and vice versa. Moreover, there could be many transitions from an activity to different components but not from activity to activity, which is what we track. Additionally, the low ∆coverage simply does not amplify the issue, making it harder to find any correlation.

6.1.3 RQ3

What effect does the difference in the number of unique screens visited by each technique have on the delta coverage?

Similar to RQ2, we found that there is a weak positive relation between ∆screens and ∆coverage, with a low correlation of 0.39. Therefore we can conclude that, while finding screens has some effect on the coverage, it is not the main cause of the difference. Interestingly, this result also goes against our hypothesis. Finding more screens results in having more code to test, so it is logical that the coverage should be higher. However, the results show that this is not necessarily the case, which can be for various reasons. It could be, as in RQ2, that the code is distributed unevenly and the screen containing a large chunk of the code is not reached.

Alternatively, and this is what we believe is the cause, the technique reaches a new screen and crashes; therefore it is not able to thoroughly test the newly found screen. For example, for random testing, finding new screens is hard; if random finds a new screen and it crashes on that screen due to a bug or the tool itself, it will have to start all over again from the starting state, and reaching the earlier found screen again is difficult. Active learning, on the other hand, "knows" how to reach the new screen again, therefore testing it more thoroughly than random. So while both techniques found two screens, one has higher coverage.

Similarly, instead of crashing, the tool can get "stuck" due to state explosion or complex interaction. Looking at the missed widgets, we see that ListView is in the top 5. ListView can be seen as a cause of state explosion since its content can change. Likewise, the widget MenuItem is a case of complex interaction. Like ListView, it is often not found by AndroidRipper, possibly because of the two-step process. While these widgets block AndroidRipper from finding more screens, they do not necessarily prevent AndroidRipper from testing the already found screens thoroughly.

Last, it is important to note that we did not analyse fragments due to time constraints. Fragments are also screens, contained in activities: a single activity can consist of multiple screens using fragments. In such a case the tool would not be able to detect the transitions, and therefore the coverage could be higher for one technique because it finds more fragments, while on paper it looks like one screen. Furthermore, the difference in coverage could be solely due to one technique executing more events in the same amount of time and therefore gaining more coverage; then again, we expected a higher ∆coverage in that case. All in all, we can conclude that the navigation flow is not the cause of the difference between random and active learning; it has an even effect on both techniques.

6.1.4 RQ4

What are the characteristics of the widgets that random and active learning testing techniques failed to execute?

With RQ2 and RQ3 we looked at two aspects of the navigation flow: the nodes and the edges. With widgets, we want to find the initiator of the transition from one node to another. So while we can already conclude that the navigation flow is not necessarily the cause of the difference between the techniques, it is still interesting to see which widgets were not executed. Both random and active learning have a high number of transitions with no associated widget (NaN), caused by the inability of Argus to translate custom widgets to Jawa classes.

Looking closer at NaN, we see that a large number of these transitions come from the Android lifecycle methods. Since we did not look at the internal flow of a method, we cannot say why this is the case. It could be that the transition is behind a conditional statement, or that these methods initialise a class that contains many ICC methods. A more comprehensive method call analysis would resolve this. Still, it is peculiar that so many missed transitions originate from the lifecycle methods. One example where this makes sense is the splash screen. We could argue that in many cases the testing strategy does not get past the splash screen, resulting in low coverage, a low number of screens found and a low number of transitions; yet this should not be possible, because we manually excluded apps that did not start up correctly.
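As an illustration (a hypothetical sketch, not taken from any app in the dataset), a typical splash screen fires its only transition from a lifecycle method rather than from a clickable widget:

```java
import android.content.Intent;
import android.os.Bundle;
import android.os.Handler;
import android.support.v7.app.AppCompatActivity;

// Hypothetical splash screen: the transition to MainActivity is triggered
// from onCreate(), not by any widget the testing tool could interact with.
public class SplashActivity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_splash); // illustrative layout resource

        new Handler().postDelayed(new Runnable() {
            @Override
            public void run() {
                startActivity(new Intent(SplashActivity.this, MainActivity.class));
                finish(); // remove the splash screen from the back stack
            }
        }, 2000); // show the splash for two seconds
    }
}
```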

The next most common widget is View. These widgets mostly occur in methods that register a listener: if the view is not defined in the same method as the listener, we cannot determine the type of the widget. To fix this, we would need to improve the accuracy of the tool with a points-to analysis that works beyond the scope of a single method. Equally, it could also be that Argus does not find the type for the View widget.
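The hypothetical pattern below shows how this happens: the concrete widget type is only visible where the field is initialised, while the listener that triggers the transition is registered in another method that sees nothing more specific than a View (ProfileActivity and the resource names are invented).

```java
import android.content.Intent;
import android.os.Bundle;
import android.support.v7.app.AppCompatActivity;
import android.view.View;
import android.widget.Button;

// Hypothetical example: the field is declared as a plain View, and only
// onCreate() knows it actually holds a Button. An analysis limited to
// registerListeners() can therefore report the widget type only as "View".
public class SettingsActivity extends AppCompatActivity {

    private View submitControl;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_settings);
        submitControl = new Button(this); // concrete type known only here (in a real app: findViewById)
        registerListeners();
    }

    private void registerListeners() {
        submitControl.setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                // The transition initiated by this listener is attributed to a "View".
                startActivity(new Intent(SettingsActivity.this, ProfileActivity.class));
            }
        });
    }
}
```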

The third finding we consider interesting is that many missed transitions originate from the menu screen. Both random and active learning miss these transitions, so it could be that events targeting menu items have a low priority in AndroidRipper. What we believe, however, is that opening a menu and clicking on a menu item is a two-part task, in other words a complex interaction, and both techniques have difficulties with such interactions.
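The two steps are visible in a minimal options-menu sketch (all names are invented): the overflow menu has to be opened before the item click handled in onOptionsItemSelected can fire the transition.

```java
import android.content.Intent;
import android.support.v7.app.AppCompatActivity;
import android.view.Menu;
import android.view.MenuItem;

// Hypothetical options menu: reaching AboutActivity requires two actions in
// sequence, opening the overflow menu and then clicking the "About" item,
// which is exactly the kind of complex interaction discussed above.
public class HomeActivity extends AppCompatActivity {

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        getMenuInflater().inflate(R.menu.menu_home, menu); // step 1: the menu must be opened
        return true;
    }

    @Override
    public boolean onOptionsItemSelected(MenuItem item) {
        if (item.getItemId() == R.id.action_about) {       // step 2: the item must be clicked
            startActivity(new Intent(this, AboutActivity.class));
            return true;
        }
        return super.onOptionsItemSelected(item);
    }
}
```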

Next, ListView, which is also in the top 5, can be viewed as a widget that causes state explosion, because the content of a ListView is dynamic. This, however, depends on the comparison criterion for states. For example, if the comparison rule only looks at activity names, then the content does not matter; if it also takes the widgets into consideration, the exploration becomes more prone to state explosion. AndroidRipper is set to look at the activity structure; thus, unless the widgets in an activity change, a ListView is not a cause of state explosion. We believed that some widgets would cause state explosion and others would not. However, since there is no distinct difference between random and active learning, it is possible that the widgets causing such issues are few and far between; that is to say, the number of apps in which state explosion occurs is trivial.
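To make the comparison criterion concrete, the sketch below (illustrative only, not AndroidRipper's actual implementation) contrasts an activity-name-based check with one that also compares the widget structure and therefore turns every content change of a ListView into a new state.

```java
import java.util.List;
import java.util.Objects;

// Illustrative sketch of two possible state-equivalence criteria.
final class StateComparison {

    // Minimal description of an observed GUI state.
    static final class GuiState {
        final String activityName;
        final List<String> widgetIds; // ids/types of the visible widgets

        GuiState(String activityName, List<String> widgetIds) {
            this.activityName = activityName;
            this.widgetIds = widgetIds;
        }
    }

    // Activity-name criterion: a ListView whose content changes does not
    // create new states, so there is no state explosion.
    static boolean sameByActivity(GuiState a, GuiState b) {
        return a.activityName.equals(b.activityName);
    }

    // Widget-structure criterion: every change in the widget tree is a new
    // state, so dynamic content (e.g. a ListView) can explode the state space.
    static boolean sameByWidgets(GuiState a, GuiState b) {
        return a.activityName.equals(b.activityName)
                && Objects.equals(a.widgetIds, b.widgetIds);
    }
}
```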

Lastly, the number of different widget types found is high, indicating that the tool can be improved to account for widgets like ImageButton, GridView and PopupMenu, to name a few.

6.2

Threats to validity

One crucial metric for this study is code coverage. Code coverage should only be measured over source code; in many cases, source code can easily be identified by looking at the package name, which is the standard convention. However, developers sometimes deviate from this standard. In that case, we exclude all known libraries and consider the remainder source code. The precision of this method therefore depends entirely on the assumption that the library list is complete. We combated this issue by first collecting the most common libraries online and, second, logging all potential libraries during execution, which we manually validated afterwards. Even so, we know this is probably not enough because of ProGuard and other forms of code obfuscation. Fortunately, ProGuard does not change package names, and therefore the detection of libraries remains possible.
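A minimal sketch of this exclusion step is shown below; the package prefixes are placeholders standing in for the assembled library list, not the actual list used in the study.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the library-exclusion step: classes whose package matches a
// known library prefix are filtered out before coverage is computed.
final class LibraryFilter {

    private static final List<String> KNOWN_LIBRARY_PREFIXES = Arrays.asList(
            "com.google.", "android.support.", "com.squareup.", "com.facebook.");

    static boolean isLibraryClass(String className) {
        return KNOWN_LIBRARY_PREFIXES.stream().anyMatch(className::startsWith);
    }

    // Keep only classes that count as app source code for coverage purposes.
    static List<String> sourceClasses(List<String> allClasses) {
        return allClasses.stream()
                .filter(c -> !isLibraryClass(c))
                .collect(Collectors.toList());
    }
}
```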

For this study, we also used the WTG, which we generated using a custom tool. Since we did not formally validate the accuracy of the tool, we cannot claim that its output is the ground truth. Nevertheless, the tool was tested against Gator and Flowdroid, although with a small dataset, and through manual validation we found that it had the best accuracy for our use case. Furthermore, it is important to note that it is built upon the Argus framework: the tool stitches the information provided by Argus into a WTG. The tool therefore works within the boundaries of Argus, and any issues in Argus will also be visible in the tool.

One of the last issues we believe will cause some doubt is the stopping criterion we chose for AndroidRipper. It would have been better to modify the code to use a timeout; this way we could have guaranteed the same running time. Nevertheless, the running time based on our criterion is higher than the saturation point of around 10 minutes reported in [31], so slight variances in running time will not have a large influence on coverage.

The study is limited to the implementation of the testing techniques in AndroidRipper and is therefore not generalisable to other implementations of these techniques. Results between tools can differ; however, in general they should fall in the same range. AndroidRipper by design contains the implementations of the most popular tools. Amalfitano et al. broke down each technique into a general framework, making it easier to compare the different variable combinations that make up a complete testing technique. AndroidRipper, however, still lacks some implementations, but these are the less popular ones.

Last, the failure rate of apps is high. Unfortunately, we did not have the time to fix these issues; instead, we contacted the authors about them and, where possible, tried to help.


Chapter 7

Future work

Static analyser and Argus The static analysis framework and the tool we built can still be improved. First of all, we did not take the control flow of a method into consideration. Second, the points-to analysis can be improved to work outside the scope of a method. Last, we assume that if a class initialises another class, it will also call the ICC method. This assumption is not necessarily correct, so it would be better to bind methods by analysing (method) call statements instead. The framework also needs some work; for one, it lacks widget analysis. For both Flowdroid and Argus, we had to modify the library code to obtain information about widgets. As a result, we wonder whether there has been any comprehensive analysis of the use of widgets in Android apps. How do developers use widgets, and what impact does that have on the user? We believe that widgets can have a high influence on what the user will click and which part of the app he or she will explore. Analysing the effect of widgets on users could be significant for complex apps with thousands of users that need to highlight specific flows, for example by using a button instead of an image or a gesture to guide users to new features.
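Returning to the binding assumption above, here is a small hypothetical counter-example (PaymentFlow and CheckoutActivity are invented names): the helper class is instantiated, so the current approach adds an edge, but the ICC call is guarded by a condition and may never execute. Binding on actual call statements would avoid this false edge.

```java
import android.content.Context;
import android.content.Intent;

// Hypothetical illustration of the binding assumption failing: PaymentFlow is
// instantiated by some activity, so that activity gets linked to
// CheckoutActivity, yet the ICC method openCheckout() may never be called,
// or only under a condition that never holds during testing.
final class PaymentFlow {

    private final Context context;

    PaymentFlow(Context context) {
        this.context = context;
    }

    void openCheckout(boolean loggedIn) {
        if (loggedIn) { // the transition is guarded and may never fire
            context.startActivity(new Intent(context, CheckoutActivity.class));
        }
    }
}
```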

Instrumentation Another area we found to be unexplored is the instrumentation of APKs. It is essential to develop a tool like Asc [24] that can track the code coverage of an app without its source code. Such a library or tool would greatly benefit the research community by standardising coverage measurement.

Library detection For both the first and second points, we need a tool that can accurately find library code. There have been studies in this field [42, 43], so it would be interesting to evaluate them. Other questions regarding libraries that come to mind are: which libraries are popular and why? Most libraries are used to aid developers in their development; however, there are ad-based libraries that add activities to the app. These activities are part of the library and not the source code. How often are such libraries used, and do they influence the app rating? We can imagine that many ad screens can cause an app rating to drop.

AndroidRipper Next, this study was limited to AndroidRipper. Running the same study with different tools can produce different results. On the other hand, it would be better to implement a new testing technique using the framework provided by Amalfitano et al. in AndroidRipper, making it easier for future studies to compare the performance of the techniques. Furthermore, AndroidRipper itself can also be improved: it does not provide a comprehensive testing report and it does not track code coverage. Overall, its usability is low, so a future study could focus on improving the tool with regard to usability.

Fragments This research focused on transitions between activities and ignored fragments. So it would be interesting to incorporate fragments in the analysis.

WTG Last, there are also other applications for the WTG. For example, the WTG in combination with the spread of code can create a weighted test strategy that guides the testing tool. Alternatively, the WTG can be used to build a better flow for users, by analysing the flow users usually take and comparing that to the WTG, thereby potentially finding screens or widgets that are not used and improving upon them, and consequently creating a better experience for the users.


Chapter 8

Conclusion

In this study, we compared three aspects of the navigation flow (WTG) against the difference in coverage between two automated test input generation techniques, random and active learning, in the hope of finding an app characteristic that can explain the difference between them. In particular, we looked at nodes, represented as activities; edges, represented as transitions; and actions, represented as widgets. We generated the graph using a custom static analysis tool based on Argus, and for testing we used AndroidRipper. Our results, based on testing and analysing 412 industrial Android apps with both techniques, show that random achieves a higher coverage and finds more transitions and screens than active learning. However, despite generating a more complete graph, random does not achieve a significantly higher coverage than active learning; we therefore conclude that the difference in coverage is not due to the navigational flow.

Additionally, we found that both techniques have trouble with similar widgets when it comes to finding new screens and transitions. Notably, both techniques have difficulty with widgets that require a complex interaction, like MenuItems and AlertDialogs.

Furthermore, the study provides a static analysis tool capable of producing a WTG and a tool for adding instrumentation to APKs without source code. With these tools, this study can be expanded by also incorporating fragments. Besides that, the WTG can be used to generate a more optimal model to guide testing and thereby improve performance. Nevertheless, before increasing performance, it should be noted that the current set of tools is not yet usable in industry, so a focus on usability and maturity from a developer standpoint is just as important.


Acknowledgements

I would like to thank my academic and company supervisors, Ivano Malavolta and Kevin Bankersen, for their guidance and advice during my research.


Bibliography

[1] Shengqian Yang, Hailong Zhang, Haowei Wu, Yan Wang, Dacong Yan, and Atanas Rountev. Static window transition graphs for android (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 658–668. IEEE, 2015.

[2] Domenico Amalfitano, Nicola Amatucci, Atif M Memon, Porfirio Tramontana, and Anna Rita Fasolino. A general framework for comparing automatic testing techniques of android mobile apps. Journal of Systems and Software, 125:322–343, 2017.

[3] Fengguo Wei, Sankardas Roy, Xinming Ou, et al. Amandroid: A precise and general inter-component data flow analysis framework for security vetting of android apps. ACM Transactions on Privacy and Security (TOPS), 21(3):14, 2018.

[4] Improve app performance and stability with firebase (google I/O ’18), May 2018.

[5] Pavneet Singh Kochhar, Ferdian Thung, Nachiappan Nagappan, Thomas Zimmermann, and David Lo. Understanding the test automation culture of app developers. In Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference on, pages 1–10. IEEE, 2015.

[6] Mona Erfani Joorabchi, Ali Mesbah, and Philippe Kruchten. Real challenges in mobile app devel-opment. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on, pages 15–24. IEEE, 2013.

[7] Mario Linares-Vásquez, Carlos Bernal-Cárdenas, Kevin Moran, and Denys Poshyvanyk. How do developers test android applications? In Software Maintenance and Evolution (ICSME), 2017 IEEE International Conference on, pages 613–622. IEEE, 2017.

[8] Domenico Amalfitano, Anna Rita Fasolino, and Porfirio Tramontana. A gui crawling-based technique for android mobile application testing. In Software Testing, Verification and Validation Workshops (ICSTW), 2011 IEEE Fourth International Conference on, pages 252–261. IEEE, 2011.

[9] Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Salvatore De Carmine, and Gennaro Imparato. A toolset for gui testing of android applications. In Software Maintenance (ICSM), 2012 28th IEEE International Conference on, pages 650–653. IEEE, 2012.

[10] Domenico Amalfitano, Anna Rita Fasolino, Porfirio Tramontana, Bryan Dzung Ta, and Atif M Memon. Mobiguitar: Automated model-based testing of mobile apps. IEEE software, 32(5): 53–59, 2015.

[11] Saswat Anand, Mayur Naik, Mary Jean Harrold, and Hongseok Yang. Automated concolic testing of smartphone apps. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, page 59. ACM, 2012.

[12] Tanzirul Azim and Iulian Neamtiu. Targeted and depth-first exploration for systematic testing of android apps. In Acm Sigplan Notices, volume 48, pages 641–660. ACM, 2013.

[13] Wontae Choi, George Necula, and Koushik Sen. Guided gui testing of android apps with minimal restart and approximate learning. In Acm Sigplan Notices, volume 48, pages 623–640. ACM, 2013.
