
Imitation learning-based task completion with drones

Matthew van Rijn
10779353

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Tom Runia
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

2nd July 2017


Abstract

Imitation learning provides a novel method of teaching robots to perform tasks. In this thesis, previous successes in imitation learning are built upon by teaching a drone to perform a task that involves search and obstacle avoidance. The aim of the research, which is performed in a simulator, is to determine whether specific imitation learning methods can be used to complete the task. Example data is collected by an expert at the task and used to train two types of neural networks. An algorithm is then applied to introduce error recovery data into the dataset. Six models are evaluated using a combination of behavioural and statistical methods. The results indicate that while the models are not safe for application on a real drone, the method shows promise for future work.


Contents

1 Introduction
   1.1 Terms
   1.2 Thesis structure
2 Related Work
3 Method and Approach
   3.1 Simulation
      3.1.1 Real drone
      3.1.2 Robot Operating System
      3.1.3 Simulator
   3.2 Learning Environment
   3.3 Data
      3.3.1 Images
      3.3.2 Actions
      3.3.3 Collection
   3.4 Learning
      3.4.1 Feed-forward neural network
      3.4.2 Convolutional neural network
      3.4.3 DAGGER
      3.4.4 Data preprocessing
4 Evaluation
   4.1 Policy
   4.2 Environmental independence
   4.3 Classification
   4.4 Data requirement
5 Results
   5.1 Behavioural
   5.2 Classification
   5.3 Moving goal
   5.4 Training set size
6 Conclusion and Discussion
   6.1 Conclusion
   6.2 Discussion
   6.3 Future work

1 Introduction

Since its introduction in the mid 20th century, machine learning has played an increasingly large role in our lives. This development has been fuelled by the rapid increase in available computing power, and the rapid decrease in its cost. Over the years, many new and more advanced machine learning algorithms have been developed, many of which can run on everyday devices such as smartphones.

A more recent development is the increase in popularity of drones. These small flying machines are mostly used by hobbyists, but their versatility means they have many potential machine-learning applications. For example, drones could help search for survivors after a natural disaster that leaves roads blocked.

The combination of drones and machine learning has proven popular with researchers. Over the past few years, many learning methods have been used to make drones perform various tasks, and the field remains very active today.

One type of machine learning which has been used is imitation learning. (Argall, Chernova, Veloso & Browning, 2009) In imitation learning, the drone learns to complete a task by watching an expert demonstrate it. Imitation learning is far from the go-to method for teaching robots, but various studies have shown it can be successful. (Stadie, Abbeel & Sutskever, 2017) (Ross et al., 2013)

This project endeavours to replicate previous successes of imitation learning on a two-element task. Specifically, the task is for a drone to locate and fly to a large red square on the ground, while avoiding obstacles. This combines the elements of obstacle avoidance and search.

The combination of obstacle avoidance and search is interesting, because almost any autonomous application of a drone requires these two elements to succeed. Without obstacle avoidance, the drone will inevitably crash and without search, it will be unable to find its goal. Therefore, it is important to determine whether machine-learning methods can provide a drone with these abilities.

The research question is: "Can a drone be taught to perform a task involving search and obstacle avoidance by learning from examples using Imitation Learning?" If successful, this research will contribute a confirmation to the field that imitation learning can teach a drone to search and avoid obstacles. If unsuccessful, the insight provided into the problems may still lead to improvements in the algorithms used.

1.1 Terms

Throughout this thesis, some drone- and imitation-learning-related terms are used which may be unfamiliar to the reader. This section provides an overview of these terms and clarifies their meanings.

• Expert: a person with full understanding of the problem, who is able to perform the task faultlessly. In this thesis, the expert refers to the author, who flies the drone.


• Policy: a mapping from state to action. A model's policy indicates what action it believes to be optimal to reach the goal. Applying a policy on the drone is known as executing, or unrolling, the policy.

• Policy Derivation: a method of obtaining policy from expert demonstrations. Policy derivation is performed with algorithms, and yields a model.

1.2 Thesis structure

Including the introduction, this thesis is divided into six chapters, each detailing a different aspect of the research. First, chapter 2 reviews related research and relates findings from previous work to this project. Afterwards, chapter 3 discusses the methodology and approach used to answer the research question. Chapter 4 lays out the evaluation method and lists the experiments performed. The results of these are presented in chapter 5. Finally, a conclusion is given in chapter 6, along with a discussion of the results and pointers for future work.

2 Related Work

Imitation learning, also known as learning from demonstration, is a machine learning method where the robot derives its policy by observing expert demonstrations of a task. The observations can be made in first-person, using the robot’s own sensors, or with an external camera. The task of policy derivation must be performed by a separate machine learning method. (Argall et al., 2009)

Various different methods exist which are suitable for this. One of these is reinforcement learning. Reinforcement learning is a type of machine learning in which robots learn tasks through experimentation. (Sutton & Barto, 1998) It makes use of a predefined reward function, which indicates the expected return of an action in a given state. A robot using reinforcement learning will use trial and error to determine what actions lead to a maximisation of the reward function in both the short and long term, and use this knowledge to update a value function. This value function determines the robot’s policy.

Inverse reinforcement learning is a variant of reinforcement learning which does not require a predefined reward function. Instead, it uses the expert's demonstrations to approximate the true reward function of a task. (Argall et al., 2009) This method is suited to tasks for which a reward function cannot easily be defined.

Recently, most successful applications of reinforcement learning have used models based on deep neural networks, such as deep Q-networks. (Mnih et al., 2015) A drawback of these networks is that they often require large training sets to converge. (Mnih et al., 2013) This can also apply to feedforward and convolutional neural networks, which are two traditional network types. (Schmidhuber, 2015)

In this project, imitation learning is applied to a drone. Successful studies have been performed using imitation learning. (Ross et al., 2013) (Giusti et al., 2016) In these studies a drone is taught to perform a navigation task through a forested environment using imagery collected from the drone's cameras, or a close approximation thereof. Unlike in this project, however, the drones do not have a final goal.

Collecting sufficient and varied training data to use with a neural network is challenging, especially for imitation learning. Expert demonstrations do not contain many errors, so there is no information in the training set on how to recover from mistakes. The DAGGER algorithm attempts to solve this problem by having the robot execute its policy to create additional training data containing errors, which must be corrected by an expert. (Ross, Gordon & Bagnell, 2011) The expert correction step is potentially time-consuming, so studies that implement this algorithm generally allow the expert to correct the data as it is being recorded. DAGGER has previously been used to collect training data for a drone. (Ross et al., 2013)

The methods and findings of (Ross et al., 2013) and (Giusti et al., 2016) may be of interest for this project, given the overlap in subject matter. If this project is successful in applying one or more methods from these studies, it will indicate that these are suitable for performing multiple task types.

3 Method and Approach

The goal of this work is to research imitation learning with drones. There are three necessary steps to perform this research. Firstly, the expert demonstrations, which the drone must imitate, need to be recorded. Secondly, a model must be trained using the demonstrations to provide policy. Finally, the drone must be able to execute this policy.

This chapter details the approach used to achieve these steps, and explains the decisions made in the process. It is subdivided into four sections. Section 3.1 describes the simulation of the drone. The structure of the rest of the learning environment is shown in section 3.2. The format of the collected training data is given in section 3.3. Finally, section 3.4 explains which learning algorithms are used and how the models are trained.

3.1 Simulation

There are two options for collecting data and running policy. One is to use a real drone and the other is to use a simulator.

3.1.1 Real drone

Using a real drone allows all aspects of the real world to be taken into account when doing research. A simulator, on the contrary, can never simulate everything. A real drone provides a guarantee that any results achieved are applicable to the research field, which a simulator does not. It is, therefore, important to make sure that any chosen simulator simulates the aspects of the real world that affect the task that is learned. If this is not done, the gained knowledge may not be transferable to the real drone.

There are, however, several practical disadvantages to using a real drone over a simulator:

• A real drone can only fly for a limited amount of time before it must be recharged

• There are few places where drones can legally be flown outside

• Testing can only occur at these places

• It takes longer to record training trajectories, since the drone must be manually moved to a new location between recordings.

Due to these limitations, this research makes use of a simulator.

3.1.2 Robot Operating System

The Robot Operating System (ROS) (Quigley et al., 2009) is a framework that allows for the interaction of various tools used for controlling, monitoring and analysing robots. The individual tools are known as nodes. The nodes publish their data to channels, which can be read by other nodes. This common interface means nodes can be substituted without breaking the environment.

Figure 1: The Gazebo simulator environment

The flexibility of the ROS framework makes it well suited to this research. While the use of a real drone is not part of this research, the goal is to advance the field of research, which also concerns real drones. It is, therefore, desirable that the simulator can be replaced by the real drone. In ROS, this can be done by replacing the simulator node with a driver node for a drone, since they both listen to the same movement channel. It is also possible to switch between different simulators in this manner, should a problem arise.

3.1.3 Simulator

The simulator used in this research is the tum simulator, which simulates a Parrot AR.Drone in an environment provided by the Gazebo simulator. Figure 1 shows an impression of this environment. Gazebo allows models, such as houses, walls and other obstacles, to be added, moved or removed. This makes it possible to conduct experiments with the models in different positions, which is useful for the evaluation of learned policy.

For the purpose of this research, a large red square has been added to the world. This indicates the area that the drone must find and fly towards.

3.2 Learning Environment

The complete collection of software used to perform the imitation learning is referred to as the learning environment. Within this environment, the individual components (ROS, the simulator and the classifier) exchange data with each other. A schematic overview of these interactions is shown in figure 2.

Figure 2: Overview of the learning environment

Between ROS and the classifiers lies an interaction layer. This layer is responsible for managing the environment. During the recording of training data, it maps keyboard input to actions and writes these to disk together with the latest image. During policy execution, it sends the images from ROS to the classifier, and translates the resulting classification into the matching movement vector.

The classifier is run on a remote server equipped with a Titan X graphics card. This setup is required because of the large amount of video memory needed to load and train complex neural networks. A detailed overview of the software used is provided in appendix A.

3.3 Data

3.3.1 Images

The Gazebo simulator broadcasts 640×360 images from the drone's front-facing camera to ROS at 60Hz. Figure 3 shows two examples of such images. In image (a) the drone is facing the goal, with buildings in the background. In image (b) it is facing a window in one of the buildings. These images represent the drone's state within the state space.

Figure 3: Images recorded by the drone's front-facing camera. (a) The red square; (b) A window.

Table 1: Action space for recording

Action X Y Z Pitch Roll Yaw

FORWARD 1.0 0.0 0.0 0.0 0.0 0.0

CLOCKWISE 0.0 0.0 0.0 0.0 0.0 -1.0

ANTICLOCKWISE 0.0 0.0 0.0 0.0 0.0 1.0

HOVER 0.0 0.0 0.0 0.0 0.0 0.0

3.3.2 Actions

The simulator takes movement commands as a vector containing values for the x, y and z axes, as well as the pitch, roll and yaw, each within [−1, 1]. The potential action space, therefore, consists of a near-infinite number of combinations of these values. The task, however, can be completed with just two actions: one to move forward (FORWARD) and one to turn (CLOCKWISE). With these, the drone can reach any point on the plane at its take-off height, as long as there is a path that is not intersected by any obstacles.

Having just one action for turning severely limits the drone's ability to make corrections, since it would have to turn almost a full circle to make a slight correction in the opposite direction. For this reason, a second turn action is used to turn in the opposite direction (ANTICLOCKWISE). By default, the drone repeats the previous action indefinitely until a new one is given, so the HOVER action is used to stop the drone when no input is given. The x, y, z, pitch, roll and yaw values of each action are displayed in table 1.
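The mapping in table 1 translates directly into code. The sketch below illustrates how an action label might be converted into a ROS movement command; it is a minimal example rather than the thesis's actual interaction layer, and the /cmd_vel topic name, the use of geometry_msgs/Twist and the assignment of pitch, roll and yaw to the Twist fields are assumptions based on common AR.Drone driver conventions.

```python
# Minimal sketch of the action-to-movement mapping from table 1.
# Assumptions: rospy is available and the drone listens on /cmd_vel
# for geometry_msgs/Twist messages (a common AR.Drone convention).
import rospy
from geometry_msgs.msg import Twist

# (x, y, z, pitch, roll, yaw) values per action, as in table 1.
ACTIONS = {
    'FORWARD':       (1.0, 0.0, 0.0, 0.0, 0.0,  0.0),
    'CLOCKWISE':     (0.0, 0.0, 0.0, 0.0, 0.0, -1.0),
    'ANTICLOCKWISE': (0.0, 0.0, 0.0, 0.0, 0.0,  1.0),
    'HOVER':         (0.0, 0.0, 0.0, 0.0, 0.0,  0.0),
}

def action_to_twist(name):
    """Translate an action label into a Twist movement command."""
    x, y, z, pitch, roll, yaw = ACTIONS[name]
    cmd = Twist()
    cmd.linear.x, cmd.linear.y, cmd.linear.z = x, y, z
    # Field assignment for pitch/roll is an assumption of this sketch.
    cmd.angular.x, cmd.angular.y, cmd.angular.z = roll, pitch, yaw
    return cmd

if __name__ == '__main__':
    rospy.init_node('action_publisher')
    pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
    rospy.sleep(1.0)  # give the publisher time to connect
    pub.publish(action_to_twist('FORWARD'))
```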

3.3.3 Collection

The expert demonstrations used as training data are collected by recording trajectories in the simulator. Trajectories are arrays of drone states (images) linked with actions. Figure 4 displays a set of training examples extracted from a trajectory.

Figure 4: A selection of training examples from a trajectory

Each recording begins with the drone at a starting position in the simulator. From this position, the expert flies the drone to the goal using the arrow keys, via the route that is, in the expert's opinion, optimal. At a rate of 10Hz, state-action pairs are appended to the trajectory, using the most recent camera image and keyboard action. The trajectory ends when a land command is given.
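The recording step can be summarised in a few lines of code. The sketch below is an illustration only; the helpers get_latest_image() and get_keyboard_action() are hypothetical stand-ins for the interaction layer described in section 3.2.

```python
# Sketch of trajectory recording: at 10Hz, pair the most recent camera
# image with the most recent keyboard action and append the pair to the
# trajectory. The two helpers are hypothetical stand-ins for the
# interaction layer; a land command ends the trajectory.
import time

def record_trajectory(get_latest_image, get_keyboard_action, rate_hz=10):
    trajectory = []                      # list of (image, action) pairs
    period = 1.0 / rate_hz
    while True:
        action = get_keyboard_action()   # e.g. 'FORWARD', 'CLOCKWISE', ...
        if action == 'LAND':
            break
        trajectory.append((get_latest_image(), action))
        time.sleep(period)
    return trajectory
```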

3.4 Learning

Once the expert demonstrations have been collected, the next step in imitation learning is policy derivation. (Argall et al., 2009) In this research, two types of neural networks are used for policy derivation: a feed-forward network and a convolutional neural network.

For policy derivation, the action "HOVER" is omitted from the action space shown in table 1. The action causes the drone to hover in place, and thus does not change the state. Once in such a state, the drone would hover indefinitely and never reach the goal. The action is, therefore, never good policy.

3.4.1 Feed-forward neural network

A feed-forward network is the simplest type of neural network. It consists of an input layer, an output layer and any number of hidden layers in between. In such a network, layers are usually fully connected and connections only go forward, hence the name. Feed-forward networks can be used for image recognition, but the applications are usually simple, such as handwritten digit recognition. A feed-forward model, therefore, may provide a suitable baseline for the task.

For this research, the feed-forward neural network has four layers, of which two are hidden. The input layer has 4096 neurons, enough to accommodate a 64×64 single-channel input image. The hidden layers have 1024 and 128 neurons respectively. The output layer has three neurons, each representing one of the actions from the action space. Each hidden layer is followed by a dropout layer with drop probability 0.3, to prevent overfitting. (Srivastava, Hinton, Krizhevsky, Sutskever & Salakhutdinov, 2014) The network architecture is shown in figure 5.

Figure 5: Architecture of the feed-forward neural network
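A minimal Keras sketch of this feed-forward architecture is given below. The layer sizes and dropout probability come from the text; the activation functions and optimiser are not stated in the thesis and are assumptions.

```python
# Sketch of the feed-forward network: 4096 inputs (a flattened 64x64
# single-channel image), hidden layers of 1024 and 128 neurons each
# followed by dropout 0.3, and a 3-way softmax over the actions.
# Activations and optimiser are assumptions.
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(1024, activation='relu', input_shape=(4096,)),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(3, activation='softmax'),   # ANTICLOCKWISE, FORWARD, CLOCKWISE
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```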

3.4.2 Convolutional neural network

Convolutional neural networks are neural networks that specialise in processing images. By running various convolution operations, convolutional networks are able to recognise objects in the images. This should, theoretically, make a convolutional neural network more suitable than a feedforward network for this task, since it could learn to associate actions with the presence of certain objects.

For this research, the convolutional neural network has three convolution layers, followed by one 512-neuron fully connected layer with dropout and the same three-neuron output layer as in the feedforward network. The full network architecture is shown in figure 6. This network is inspired by one used for learning games through deep reinforcement learning. (Mnih et al., 2013)

Figure 6: Architecture of the convolutional neural network
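A Keras sketch of such a network is shown below. Only the number of convolution layers, the 512-neuron fully connected layer and the output layer are specified in the text; the filter counts, kernel sizes and strides follow the DQN-style network the thesis cites and may differ from the exact values in figure 6.

```python
# Sketch of the convolutional network: three convolution layers followed
# by a 512-neuron fully connected layer with dropout and a 3-way softmax.
# Filter counts, kernel sizes and strides are assumptions based on the
# cited DQN-style architecture, not values taken from figure 6.
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense, Dropout

model = Sequential([
    Conv2D(32, (8, 8), strides=4, activation='relu',
           input_shape=(64, 64, 1)),
    Conv2D(64, (4, 4), strides=2, activation='relu'),
    Conv2D(64, (3, 3), strides=1, activation='relu'),
    Flatten(),
    Dense(512, activation='relu'),
    Dropout(0.3),
    Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```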

3.4.3 DAGGER

The expert demonstrations used in imitation learning show a task being performed in a (close to) optimal manner. The dataset, therefore, contains a lot of information about how to complete the task from certain positions, but none from others. For example, if the drone finds itself facing a wall due to an earlier policy mistake, it will not know how to recover, since there is no training data of the drone facing the wall. This leads to compounding policy errors, and potentially a crash.

A potential solution to this problem is to apply the Dataset Aggregation (DAGGER) algorithm. This algorithm uses the error-prone policy, and combines it with expert input to expand the dataset. This introduces training examples into the dataset that show how to recover from errors.

Algorithm 1 DAGGER

1: dataset ← base dataset
2: policy_0 ← base policy
3: for i = 0 to N do
4:     Collect expert-labelled trajectories from policy_i as D_i
5:     dataset ← dataset + D_i
6:     policy_{i+1} ← new classifier trained on dataset

The DAGGER algorithm as used in this project is shown in algorithm 1. Traditionally, the expert labels the entire trajectory for step 4. In the learning environment described in section 3.2, this is done by giving continuous keyboard inputs as the policy is being unrolled. The actions from the keyboard inputs are combined with the images from the drone’s camera and added to the dataset.

A slight variation on DAGGER is also tested. Instead of providing constant input as the policy is being unrolled, input is only provided when the drone is about to make a critical mistake, such as crashing. Only these actions are recorded as new training data. Additionally, they override the actions from the current policy, allowing the drone to continue to attempt to reach the goal. This approach should have a smaller effect on the overall policy than DAGGER, but still prevent crashes.

DAGGER is tested with both the feedforward and convolutional neural networks. The original DAGGER algorithm is run until 40 new trajectories have been collected. With the alternate approach, the trajectories are far longer. Therefore, trajectories are collected until the number of new training examples is approximately 2000.
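The aggregation loop of algorithm 1 can be sketched in a few lines of Python. The helpers collect_labelled_rollout() and train_classifier() are hypothetical: the first unrolls the current policy while the expert supplies labels, the second fits a new network on the aggregated dataset.

```python
# Sketch of the DAGGER loop from algorithm 1. Both helpers are
# hypothetical stand-ins for the learning environment: one collects
# expert-labelled rollouts of the current policy, the other retrains
# a classifier on the aggregated dataset.
def dagger(base_dataset, base_policy, collect_labelled_rollout,
           train_classifier, iterations):
    dataset = list(base_dataset)
    policy = base_policy
    for i in range(iterations):
        # Unroll the current policy; the expert labels the visited states.
        new_examples = collect_labelled_rollout(policy)
        dataset.extend(new_examples)          # aggregate the dataset
        policy = train_classifier(dataset)    # retrain on all data so far
    return policy, dataset
```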

3.4.4 Data preprocessing

The raw 640×360 RGB images have 691,200 pixel values each. Performing imitation learning on these images is not feasible in terms of computational power, and especially memory. Fortunately, the resolution of the images can be reduced significantly without losing the information necessary to complete the task. This is possible because each feature in the original image is represented by hundreds or thousands of pixels.

The raw images are reduced to 64×64 single-channel images. To emphasise the goal, the conversion from RGB to single-channel is done using equation 1. This causes objects that are near-pure red, such as the goal, to appear far brighter than surrounding objects. Two examples of the transformation are shown in figure 7.

Figure 7: Reduction of 640×360 RGB images to 64×64 single-channel representation

f(R, G, B) = \begin{cases} R, & \text{if } R > 250 \wedge G < 25 \wedge B < 25 \\ \frac{1}{3}R, & \text{otherwise} \end{cases} \qquad (1)

The reduced images have 4,096 pixel values each, a reduction of 99.4%. Despite this, both the goal and the obstacles are still clearly recognisable.
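A sketch of this preprocessing step is given below. Pillow and numpy are assumptions (any image library would do), as is the choice to resize before applying the colour transform; the thesis does not state the order of the two operations.

```python
# Sketch of the preprocessing step: downscale a 640x360 RGB frame to
# 64x64 and apply equation 1, so that near-pure red pixels (the goal)
# keep their red value while everything else is dimmed to R/3.
# The libraries used and the resize-then-threshold order are assumptions.
import numpy as np
from PIL import Image

def preprocess(rgb_array):
    """rgb_array: uint8 array of shape (360, 640, 3). Returns a (64, 64) array."""
    small = np.asarray(Image.fromarray(rgb_array).resize((64, 64)))
    r = small[:, :, 0].astype(np.float32)
    g = small[:, :, 1]
    b = small[:, :, 2]
    red_mask = (r > 250) & (g < 25) & (b < 25)   # condition from equation 1
    out = r / 3.0
    out[red_mask] = r[red_mask]
    return out
```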

4 Evaluation

In chapter 3, several methods of policy derivation are introduced. These methods are used to train six models:

• FF, using the feedforward network from section 3.4.1 and trained using the 25 expert trajectories.

• CNN, using the convolutional network from section 3.4.2, also trained with the 25 expert trajectories.

• FF-DG, using the same method as the FF model with additional DAGGER data, as in section 3.4.3

• CNN-DG, using the same method as the CNN model with additional DAGGER data.

• FF-DG2, using the alternative approach to DAGGER introduced in section 3.4.3, applied to the FF model.

• CNN-DG2, using the alternative approach to DAGGER, applied to the CNN model.

Several experiments are performed using these models to evaluate their performance and answer the research question. The most important experiment tests the models' policies. All models are subjected to this experiment, which is explained in section 4.1. The best performing model is tested in an environmental independence experiment, explained in section 4.2.

Statistical measures are used to evaluate the models further in section 4.3. Their classification accuracies and confusion matrices are examined. These measures may provide an insight into the cause of unexpected or bad policy. Finally, the influence of the dataset size is tested on the two base models (FF and CNN) in section 4.4.

4.1 Policy

Good classification results do not necessarily mean that a model is able to complete the task. A small amount of misclassifications could cause the drone to fly into an obstacle. For this reason, the behaviour produced by the policy must be evaluated.

The behavioural evaluation consists of two parts. In the first part, a general description is given of the drone's behaviour. This includes descriptions of what the drone does when it encounters an obstacle, how it behaves when the goal is in sight and what it does when the goal is not in sight. In the second part, the drone's policy is observed from the ten predefined starting states shown in figure 8. The policy is evaluated based on the following criteria:

1. Goal achievement
2. Goal achievement time
3. Avoidance of crashes

Figure 8: Ten starting locations for behavioural evaluation

In some cases a model’s policy may bring the drone to a state from where it will never reach the goal, for instance when it flies away from all the buildings, or becomes stuck in a loop. In such a case, the drone is stopped after 60s and the goal achievement is considered to have failed.

The ten starting states have been manually selected to challenge the models' performance in different ways. The ability of a model to navigate past obstacles is tested in starting states one through five. Starting states six, seven and nine test its interaction with houses. The efficiency of a model when the goal is in sight is tested by states eight and ten. Finally, states one and five also show how the drone behaves when no objects are visible at all.

The use of predefined starting states over randomly chosen ones allows for better comparison between different models. Randomly chosen states would mean that different aspects of the behaviour are tested for each model.

4.2 Environmental independence

The layout of the training environment, as seen in figure 1, remains constant throughout training and testing. Therefore, it may be that the models can perform the task in this environment, but not in another. It is possible that at least some of the policy for reaching the goal relies on the relative position of the goal to certain buildings or objects.

To determine whether this is the case, or whether the models are environmentally independent, the best-performing model undergoes additional evaluation. In this evaluation the goal is moved, once to position two in figure 8, and once to position ten. The behaviour of the drone is then evaluated using the general evaluation described in section 4.1.

4.3 Classification

The behaviour-based policy test is good for determining how good the policy is, but does not explain it. A good way of doing this is by examining the classifiers' test accuracies and confusion matrices. If, for example, the drone displays overly confident behaviour, it might be explained by a bias in the classifier towards the forward action. Uncertain, wobbly behaviour, on the other hand, may be explained by a bias towards the other actions. The confusion matrices will also show the effect DAGGER and its variation have on the classifiers' performance. While the test-set accuracy does not directly explain any behaviour, recording it might uncover a possible correlation between it and the policy performance.
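These measures can be computed as sketched below. The use of scikit-learn is an assumption; the thesis does not state which tool was used for the statistical evaluation.

```python
# Sketch of the statistical evaluation: overall accuracy, a confusion
# matrix over the three actions, and per-class error rates. The use of
# scikit-learn is an assumption.
from sklearn.metrics import accuracy_score, confusion_matrix

LABELS = ['ANTICLOCKWISE', 'FORWARD', 'CLOCKWISE']

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    # Per-class error rate: 1 - (correctly classified / total) per true class.
    errors = 1.0 - cm.diagonal() / cm.sum(axis=1)
    return acc, cm, errors
```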

4.4 Data requirement

Neural networks, especially convolutional networks, are known for benefiting from large datasets. To determine whether this holds true for the neural networks used in this project, they are retrained while withholding data from the training set.

Both the feedforward and convolutional neural networks are trained with between one and twenty-five trajectories, using a 90/10 train/test split. The expected result of this experiment is an increasing test accuracy for each additional trajectory, but with diminishing returns, since every trajectory provides additional data from which to find patterns.
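The experiment amounts to the loop sketched below. The helpers build_examples() and train_and_score() are hypothetical wrappers around the preprocessing and Keras training steps; scikit-learn's train_test_split is used here for the 90/10 split as an assumption.

```python
# Sketch of the dataset-size experiment: retrain on the first k
# trajectories for k = 1..25, using a 90/10 train/test split, and record
# the test accuracy. Both helpers are hypothetical stand-ins.
from sklearn.model_selection import train_test_split

def dataset_size_curve(trajectories, build_examples, train_and_score):
    accuracies = []
    for k in range(1, len(trajectories) + 1):
        X, y = build_examples(trajectories[:k])
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
        accuracies.append(train_and_score(X_tr, y_tr, X_te, y_te))
    return accuracies
```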

5 Results

In this chapter, the results of the experiments and evaluation methods from the previous chapter are presented. Section 5.1 contains results from the policy test. Section 5.2 shows the accuracies and confusion matrices from the classification test. Section 5.3 gives the results of the changing environment experiment and finally, section 5.4 gives the results from the dataset size experiment. Only results and observations are shown in this chapter. An analysis of unexpected results is given in the discussion.

5.1 Behavioural

The results from the behavioural policy test from section 4.1 are displayed in table 2. A description of the policy of each model follows:

• FF: The drone shows a limited amount of exploration. When the goal is visible, the drone flies towards it, stopping occasionally to correct its course. When faced with an obstacle, the drone will most often crash into it, except if it is the concrete barrier or the blue house.

• CNN: The drone shows a large amount of exploration, alternating small anticlockwise turns with extended forward motion when nothing is visible. When the goal is visible, the drone flies towards it in a straight line, and only stops to correct its course just before reaching it. When faced with the concrete barriers, blue house or brown staircase, it turns to face away. When faced with the white house or grey wall, however, it will sometimes crash.

• FF-DG: The drone does a reasonable amount of exploration. When the goal is visible, it always flies forward and does not make course corrections. The obstacle avoidance behaviour is unchanged from the base FF model. Crashes are prevented by the drone’s tendency to fly straight out of the world without turning.

• CNN-DG: The drone does little exploration, far less than with the base CNN model. When the goal is visible, the drone will fly forward, but often veer to one side and lose sight of it. The drone will crash into anything but the brown staircase. It is generally very indecisive, which is shown by it turning a lot without extended periods of forward motion.

• FF-DG2: The drone explores in a similar manner to the base FF model. When the goal is visible, it flies forward but does not make any corrections. The drone will avoid the concrete barriers and white house, but crash into other obstacles. When no objects are visible, it turns in circles.

• CNN-DG2: The drone explores very little, opting instead to turn in circles until something of interest is visible. It will avoid the white house and the concrete barriers, but will crash into other obstacles. When the goal is visible, it flies towards it and makes corrections once nearby.

Table 2: Results from the behaviour test. G - Goal achieved, C - Crash, T - Time

FF CNN FF-DG CNN-DG FF-DG2 CNN-DG2
State G C T G C T G C T G C T G C T G C T
1 × 22s × 32s × 34s
2 × 18s × 20s × 20s × 25s × 35s
3 × 10s × 4s × 26s × 38s × 19s
4 × 38s × 26s × 8s × 4s
5 × 7s × 50s × 15s × 35s
6 × 9s × 18s × 23s × 20s
7 × 18s × 7s × 13s
8 × 13s × 17s × 6s × 25s × 9s × 21s
9 × 22s × 22s × 6s × 7s × 23s × 6s
10 × 21s × 23s × 11s × 17s × 56s × 25s
Total 3 6 -3 1 4 -3 2 2 0 2 8 -6 4 4 0 3 7 -4

The last row of table 2 shows the number of times the drone achieved the goal or crashed with each model. Alongside those totals, the number of times the goal was achieved less the number of crashes is shown. According to this measure, the best performing models are the FF-DG and FF-DG2 models. The FF-DG model is also the safest model, with 2 crashes. The FF-DG2 model, in turn, is the most successful model, with 4 goal achievements.

There is no obvious pattern to be seen in the task completion or crash times. The goal is achieved too infrequently to make any statement about which model is fastest.

5.2 Classification

The confusion matrices and accuracies of all six models are displayed in table 3. Below is a brief description of the results for each model type (base, DAGGER and alternative DAGGER).

Table 3: Confusion matrices and test set accuracies of each model. AC - Anticlockwise, FW - Forward, CW - Clockwise, E - Error rate

FF (accuracy 0.80)
Classified as →   AC   FW   CW     E
AC                 1   13    7  0.95
FW                 0  150    1  0.01
CW                 2   18   13  0.61

CNN (accuracy 0.82)
Classified as →   AC   FW   CW     E
AC                 9    9    1  0.53
FW                 1  151    1  0.01
CW                 9   16    8  0.76

FF-DG (accuracy 0.49)
Classified as →   AC   FW   CW     E
AC                58  129   23  0.72
FW                 7  239    5  0.05
CW                27  179   58  0.78

CNN-DG (accuracy 0.58)
Classified as →   AC   FW   CW     E
AC                65   44   46  0.58
FW                23  206   31  0.21
CW                46   49   63  0.60

FF-DG2 (accuracy 0.72)
Classified as →   AC   FW   CW     E
AC                17   26   27  0.76
FW                 7  234   16  0.09
CW                11   27   37  0.51

CNN-DG2 (accuracy 0.80)
Classified as →   AC   FW   CW     E
AC                 3   23    7  0.73
FW                 0  259   17  0.06
CW                 5   25   47  0.39

• Base models: The base models (FF and CNN) both have the vast majority of training examples classified as forward. This leads to very high error rates for the clockwise and anticlockwise actions of between 0.53 and 0.95. The dataset these models were trained on has significantly more forward examples than clockwise and anticlockwise examples, so some of these numbers are based on as few as 19 examples. Due to the near-perfect classification of forward examples, the overall accuracies are 0.80 and 0.82 for the FF and CNN models respectively.

• DAGGER models: The clearest difference between the base models and the DAGGER models is seen in the classification accuracy. At 0.49 and 0.58, these are far lower than the base models. What appears to be a factor in this difference is the new balance of the training set. For the base models, clockwise and anticlockwise make up a small proportion of the data, but DAGGER has introduced so many of these examples that the datasets are now close to balanced between the three actions. There is also a significant difference in how the clockwise and anticlockwise examples are classified by the CNN-based model compared to the FF-based one. The CNN-DG model classifies a plurality of those examples correctly, while the FF-DG model classifies the vast majority as forward. This explains the significant difference in accuracy between the two models.

• Alternative DAGGER models: Unlike the DAGGER models, these models have only slightly lower accuracies than the base models. This appears to be explained by a return to a forward-dominated dataset, the examples of which are classified with low error rates of 0.09 and 0.06. Most clockwise and anticlockwise examples are classified as a turning action, but both models prefer clockwise.

In conclusion, the base models show a strong preference for the forward action. DAGGER introduces data that consists mostly of clockwise and anticlockwise examples, and leads the models to far lower classification accuracies. Finally, the alternative approach to DAGGER has a far smaller impact on the statistical measures than the regular algorithm.

5.3 Moving goal

The results of the moving goal experiment described in section 4.2 are given below. The experiment was performed with the FF-DG2 model, since it shared the best score from the behavioural evaluation with the FF-DG, but has a more positive general description.


• Position 2: With the goal in the new position, the drone no longer targets the original position. From some directions, the drone targets the new goal with noticeably less confidence than before. It also needs to be closer to the goal before it will lock onto it.

• Position 10: In this position, the drone's tendency to fly to the goal is severely diminished, though not entirely absent. It looks as if there is a conflict between the drone's desire to fly to the goal and its desire to fly away from the green building right next to it.

To conclude, the model's performance is reduced when the goal is moved. The reduction appears to be far more severe when the goal is placed alongside an object the drone would previously avoid, such as the house next to position 10.

5.4 Training set size

The results from the experiment of section 4.4 are shown in figure 9. It shows an increase in the test accuracies as the number of trajectories in the dataset increases. The increase is rapid at first, but the returns are diminishing. The training accuracies show a similar trend for the first five trajectories, after which they start to decrease.

Figure 9: The effect of the training set size on the classification accuracy.

There is no significant difference between the accuracies of the FF and CNN models in this test. They perform similarly at equal dataset sizes. The increase in the CNN model’s accuracy, however, is more stable. This is best exemplified by trajectories 2 to 6, where the FF model’s accuracy falls by 0.2.

6 Conclusion and Discussion

This final chapter wraps up the research by relating the results to the research goal and question, and discussing them. It starts with the conclusion in section 6.1, which answers the research question. Afterwards, several aspects of the results are discussed in section 6.2. Finally, section 6.3 gives recommendations for future work.

6.1 Conclusion

The results of the behavioural policy test show that across all models, the drone reaches the goal in 15 out of 60 cases (25%). Recall the research question: "Can a drone be taught to perform a task involving search and obstacle avoidance by learning from examples using Imitation Learning?" The answer to this question is yes. A 25% success rate is not impressive, but it is sufficiently high that it cannot reasonably be attributed to luck. Besides that, the exploratory behaviour of most models shows that they can search, and the behaviour around the concrete barriers shows that they can avoid obstacles.

On the other hand, the performance of the models is not sufficient to use on a real drone, as would be necessary to declare the methodology a complete success. The drone crashes in 31 out of 60 cases (52%). Furthermore, the drone sometimes becomes stuck in a loop.

The conclusion, then, is that while the current results are far from optimal, imitation learning shows a lot of potential. With further research into the underlying problems of the current models, imitation learning should be suitable for the task.

6.2 Discussion

The results reveal some topics that warrant further discussion, primarily where the results were poor or unexpected. These are discussed in this section.

• DAGGER performance

After applying several iterations of DAGGER to the base models, the classification accuracies were reduced dramatically and the policy behaviour did not improve. This result is unexpected, as DAGGER was expected to reduce the number of crashes and the time needed to reach the goal. The confusion matrices reveal that the data introduced by DAGGER is primarily from the clockwise and anticlockwise classes. This is unsurprising: the drone is rarely flying directly at the goal, so the expert is constantly telling the drone to turn. What this leads to, however, is that the amount of data of the drone being corrected from mistakes quickly outnumbers the amount of data of the drone performing the task. This may explain the poor performance. Another possible explanation, related to the imbalance in the data, is that insufficient iterations of DAGGER were run. As the model develops, it should perform the task better, which in turn results in more examples of the task being performed properly being added to the dataset.

The alternative approach to DAGGER contributes far less new data to the dataset. This resulted, as expected, in the classification behaviour staying more similar to the base models. However, this approach also made no significant improvement to the performance of the models. Crashes are just as common, if not more so than with the base models. This may be due to the relatively small amount of data added by the alternative approach, something that could also be resolved by running more iterations.

• Undoing actions

A major problem with the behaviour of the drone is that it gets stuck in a loop. Most of the time, this is due to the model calling for a turn in one direction until encountering a new object, after which it calls for a turn in the opposite direction. In this situation, the drone continues turning back and forth indefinitely.

This problem is in part caused by the similarity between the states that call for clockwise and those that call for anticlockwise turns. For example, a state where the drone is facing a wall directly could belong to either action. A simple solution would be to remove one of the two turning actions. While this would solve the contradictory movement, it would severely limit the mobility of the drone. A small turn in one direction would require an almost 360° turn in the opposite direction.

Two solutions may eliminate the problem without limiting the drone's mobility. The first is using sequences of images to train and test the models. This would allow the model to learn a different policy depending on from where the object is encountered. This method is well established in machine learning, and is used in the study that provided the base for the convolutional neural network. (Mnih et al., 2013)

The second solution is to utilise the class probabilities from the softmax output layers of the neural networks. Currently, the class with the highest probability is selected as the action. Instead, a method that penalises changes in movement direction could be implemented: if, for example, the drone is currently turning clockwise, the anticlockwise classification probability is reduced. This should cause the drone to switch directions less, allowing it to escape from the loop (a minimal sketch of this idea is given at the end of this section). It should be noted that a similar approach to extend the length of forward motions is likely to result in an increase in crashes.

• Environmental independence

The result of the moving-goal experiment shows that both the goal and the surrounding buildings are being used to locate the goal. This reliance on buildings causes the performance to be reduced when the goal is moved. The second test, with the goal in position 10, shows that sometimes the desire to avoid a building is stronger than the desire to fly to the goal. This is an unexpected result, as the representation of the state (see section 3.4.4) is designed to make the goal dominate it.

The reliance on buildings to find the goal may be combated by generating a random world for each recorded trajectory, as well as for each test. Unfortunately, this is not feasible within the current learning environment, as there are no commands to move objects.
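The direction-change penalty proposed under "Undoing actions" could be implemented as sketched below. This is an illustration only: the class order, the penalty factor of 0.5 and the helper structure are assumptions rather than part of the thesis implementation.

```python
# Sketch of the proposed direction-change penalty: scale down the softmax
# probability of the turn opposite to the current one before taking the
# argmax. The class order and the penalty factor are assumptions.
import numpy as np

CLASSES = ['ANTICLOCKWISE', 'FORWARD', 'CLOCKWISE']
OPPOSITE = {'ANTICLOCKWISE': 'CLOCKWISE', 'CLOCKWISE': 'ANTICLOCKWISE'}

def select_action(probs, previous_action, penalty=0.5):
    """probs: softmax output over CLASSES; previous_action: last action taken."""
    probs = np.array(probs, dtype=float)
    if previous_action in OPPOSITE:
        probs[CLASSES.index(OPPOSITE[previous_action])] *= penalty
    return CLASSES[int(np.argmax(probs))]
```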

6.3 Future work

As noted previously in this thesis, the results of this kind of research are more reliable when obtained on a real drone. Replicating this research on a real drone would, therefore, be an important avenue for future research. However, it may be sensible to try to improve the current method first. The conclusion and discussion highlight issues, but also propose solutions to them. Attempting these solutions is also a good path to follow for future work.

Future work may also seek to expand the abilities of the drone, for example by expanding the action space. Adding movement actions such as up and down would allow the drone to fly over and under obstacles. It is likewise possible to implement different policy derivation algorithms, such as inverse reinforcement learning.


References

Argall, B. D., Chernova, S., Veloso, M. & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483.

Giusti, A., Guzzi, J., Cireşan, D. C., He, F.-L., Rodríguez, J. P., Fontana, F., ... others (2016). A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 1(2), 661–667.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... others (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

Quigley, M., Gerkey, B., Conley, K., Faust, J., Foote, T., Leibs, J., ... Ng, A. (2009, May). ROS: an open-source Robot Operating System. In Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA) Workshop on Open Source Robotics. Kobe, Japan.

Ross, S., Gordon, G. J. & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS (Vol. 1, p. 6).

Ross, S., Melik-Barkhudarov, N., Shankar, K. S., Wendel, A., Dey, D., Bagnell, J. A. & Hebert, M. (2013). Learning monocular reactive UAV control in cluttered natural environments. In 2013 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1765–1772).

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.

Stadie, B. C., Abbeel, P. & Sutskever, I. (2017). Third-person imitation learning. arXiv preprint arXiv:1703.01703.

Sutton, R. S. & Barto, A. G. (1998). Reinforcement learning: An introduction (Vol. 1) (No. 1). Cambridge, MA: MIT Press.

Appendix A: Software

An overview of the software used in this project is shown in table 4 below.

Table 4: Software used

Purpose           Software            Comment
Operating System  Ubuntu 14.04        14.04 required for ROS version.
Robot interface   ROS Indigo Igloo    Indigo required for tum simulator. Use full install.
Simulation        Tum simulator [1]   No working version for ROS Kinetic.
Simulation        Gazebo 2.2.3        Packaged with ROS.
Interaction layer Self-written [2]    Using Python 2.7.6, required for converting ROS images.
Machine learning  Keras               Python library, TensorFlow backend.

[1] https://github.com/dougvk/tum_simulator
[2]
