
MSc Artificial Intelligence

Master Thesis

Learning Action Coordination

for Multi-Arm Robots

by

David Speck

12372773

July 2, 2020

48 credits November 2019 - July 2020

Supervisor:

Dr. Elia Bruni

Assessor:

Dr. Efstratios Gavves


Abstract

When multiple robotic arms work together on a shared task, a coordinated plan is required to increase productivity and avoid interference between the arms. In this thesis, we formalize this planning problem as a constrained multiple Traveling Salesman Problem (m-TSP) and develop a deep reinforcement learning-based approach. We base our neural network architecture on the approach that Nazari et al. introduced in 2018 for the Vehicle Routing Problem (VRP) and apply it to the more general problem class of the constrained m-TSP. We develop and evaluate our approach in the context of harvesting apples with multiple robotic arms. We use this context to frame the task and to develop simulations for training and evaluation. However, our approach is generally applicable to all instances of the constrained m-TSP and not limited to our specific application. Our experiments show that our approach is competitive with industry-standard solutions such as the OR-Tools routing solver. Furthermore, we explore the transition to more complex problem instances and real-world applications with pilot experiments, and we start a discussion on how to push the boundaries of our research.


Contents

1 Introduction
2 Problem Definition
3 Technical Setup
   3.1 Software
   3.2 Hardware
4 Related Work
   4.1 Automated Fruit Harvesting
   4.2 Coordinated Dual-Arm Manipulation
   4.3 Robustness and Transfer from Simulation to the Real-World
   4.4 Learning to Plan
5 Methods
   5.1 m-TSP as a Reinforcement Learning Problem
   5.2 Neural Network Architecture
   5.3 Reinforcement Learning Method
6 Simulation
   6.1 Analytic Simulation
   6.2 Physical Simulation
   6.3 Apple Distribution Generator
7 Experiments
   7.1 Baselines and Upper Bounds
   7.2 Configuration and Hyperparameter Search
   7.3 Evaluation of Our Approach
   7.4 Scaling to Harder Tasks
      7.4.1 Larger Problem Instances
      7.4.2 More Realistic Distributions
      7.4.3 Physical Simulation
8 Results
   8.1 Configuration Search
   8.2 10 target m-TSP
   8.3 Scaling to Harder Tasks
      8.3.1 Larger Problem Instances
      8.3.2 More Realistic Distributions
      8.3.3 Physical Simulation
9 Discussion
   9.1 Configuration Search
   9.2 Multiple Traveling Salesman Problem (m-TSP)
   9.3 Larger Problem Instances
   9.4 More Realistic Distributions
   9.5 Physical Simulation
10 Conclusion
   10.1 Future Work


Acknowledgement

I would like to express my sincere gratitude to Dr. Elia Bruni, my research supervisor, for his patient guidance, enthusiastic encouragement, and useful critiques of this research work. His willingness to give his time so generously has been very much appreciated.

I would also like to extend my thanks to Michiel van der Meer and Maximillian Knaller, who helped to frame the problem at the beginning and provided technical support, as well as Leon Eshuijs and Vivien van Veldhuizen, who supported the development of the Physical Simulation.

Finally, I wish to thank my family and friends for their support and encouragement throughout my study.


1 Introduction

Robot arms play a crucial role in the automation of manufacturing pipelines. Other industries, such as health care, logistics, and agriculture, have also started deploying robot arms to automate processes. However, to properly scale production, robots must be able to work together in coordination, which is a challenging task. In this thesis, we make progress on the task of multiple robot arm coordination by formulating it as a multiple Traveling Salesman Problem, casting it as a reinforcement learning problem, and then evaluating it in the context of automatic apple harvesting.

In the literature, there are two main categories of multi-arm robot coordination: bimanual and goal coordinated. Bimanual coordination is the most studied; it looks at maneuvers that require two or more arms. For example, holding a bottle in one hand and screwing open the cap with the other is an instance of a bimanual coordination task. In this thesis, we instead focus on goal coordinated cooperation, which deals with the coordination of multiple arms that share a common goal but do not require physical interaction. It covers any use case where a single arm performing the task would suffice, but having multiple arms work in parallel could improve productivity: for example, picking and placing objects from bins or conveyor belts, or picking fruit from a tree (our use case).

Goal coordinated multi-arm robots have received less attention than bimanual multi-arm robots in recent years. However, increasing productivity is vital for many applications to become economically viable, and effective parallelization with multiple arms can be one way to achieve that. Furthermore, other ways of improving productivity, such as developing faster hardware, benefit directly from the gains obtained through multi-arm parallelization; hence, it combines well with other efforts to increase productivity.

In recent literature, some approaches explicitly use goal coordinated multi-arm robots, such as the strawberry harvesting robot by Ya Xiong et al. [1]. Their prototype splits the workspace in half and assigns one arm to each part; the arms then harvest their assigned workspaces from one side to the other. Another example is Hohimer et al., who placed second at the 2017 Amazon Robotics Challenge for bin-picking [2]. They used target priority and collision avoidance in a 2D projected space to coordinate two robotic arms. Both examples achieved a 20 - 30% performance improvement through the second arm. In this thesis, we go beyond these simple coordination methods and aim for higher gains. With a theoretical upper bound of two arms working twice as fast as a single arm (a 100% performance improvement), the problem of goal coordination for multi-arm robots has room for improvement and constitutes an interesting problem to investigate with machine learning methods.

We develop our approach starting from the real-world application of apple harvesting. Despite decades of research, the automation of fruit harvesting is still in development. The rise of deep learning enabled significant advances in the sensing capabilities of robots, which brought automated fruit harvesting closer to a successful commercial application. Any push to productivity is crucial to move prototypes over the threshold of paying off economically. That is why we identified improving the coordination of multiple robotic arms as a relevant contribution to this development.

We consider the practical application of apple harvesting throughout this thesis and how to transfer the solution to the real world. We describe the technical setup in hardware and software to provide an applied context for our approach. Further, we explain how the task of harvesting apples with multiple robotic arms can be formulated as an abstract mathematical problem and identify it as the constrained multiple Traveling Salesman Problem (m-TSP). Based on this abstraction, we developed an analytic simulation that allowed us to train and iterate quickly on our approach.

Our approach is not limited to fruit harvesting applications, though; an essential criterion for our approach is to allow for arbitrary custom constraints. This flexibility makes our approach applicable to any problem that can be modeled as a constrained m-TSP, which includes all goal coordinated multi-arm robot applications: for example, harvesting tasks, pick-and-place tasks in storage halls, manufacturing, and 3D printing, to name a few. Beyond that, the constrained m-TSP is a generalization of the Traveling Salesman Problem (TSP), which is the special case of a single traveling salesman (or robot arm). Its only constraints are that all locations are visited and that no location is visited twice. The TSP has several applications in planning, logistics, and the manufacturing of microchips. Another subset of the constrained m-TSP, known as the Vehicle Routing Problem (VRP), has various transportation applications. This family of optimization problems usually operates on a street network and has different constraints depending on the application: for example, demand at nodes, limited vehicle capacity, and delivery time windows.

We choose deep reinforcement learning as our approach to solve the m-TSP. By letting the algorithm learn from experience instead of handcrafting an expert heuristic, our approach remains more general and requires less domain knowledge to be applied. In particular, it can become computationally costly to predict collisions between robot arms ahead of time. Our approach can avoid this prediction during training in the simulation and instead learn that collisions are undesired and how to avoid them through planning. To solve combinatorial optimization problems, reinforcement learning is paired with new neural network architectures. We build specifically on a variation of the Pointer Network developed by Nazari et al. to solve the VRP [3].

As part of this thesis, we investigate methods to make reinforcement learning time-efficient. A big challenge in reinforcement learning is sample efficiency: out of the box, most methods require large numbers of samples to learn good solutions. Part of the challenge of applying reinforcement learning to a given problem is finding configurations and hyperparameters that allow for fast and stable convergence. A particular focus lies on the trade-off between variance and bias. In particular, we use Monte Carlo updates [4], which, with sparse rewards, lead to high variance and, consequently, slow convergence or sometimes even divergence. We investigate multiple methods for reducing the variance to stabilize and accelerate the training without introducing too much bias.

The key contributions of this thesis are:

• Providing a reinforcement learning solution to the constrained m-TSP;

• Describing a formalization of the apple harvesting task as a constrained m-TSP;

• Detailing the setup and pipeline to train and apply our methods to a real-world application;

• Investigating practical steps to improve the learning stability and speed of reinforcement learning.

2 Problem Definition

Our apple harvesting robot is a mobile platform with cameras and multiple robotic arms attached. The robot traverses an orchard in discrete steps: move to a new position, scan for apples, harvest apples, repeat. The scan operation locates ripe apples on the tree and transforms their positions into 3D coordinates relative to the robot platform. Given these coordinates, our objective is to find a plan for all available robotic arms that allows the robot to harvest the apples as fast as possible. A plan in our context is a set of sequences of apples to pick, one sequence per arm. In addition to the objective of minimizing the execution time of the whole plan, we must satisfy the following constraints: no apple is picked twice, and no collision between arms occurs during execution. A formalization of our problem description follows.

Each robot arm starts at its individual home position. Let $x_1, \dots, x_m$ be the home positions of the arms, where $m$ is the number of arms. Apples are called targets, and we use $x_{m+1}, \dots, x_{m+n}$ to denote their positions, where $n$ is the number of given targets. Together, $X = \{x_1, \dots, x_m, x_{m+1}, \dots, x_{m+n}\}$ is the set of planning targets.

A partial plan $p$ is a sequence of target coordinates $p_i \in X$, $i = 1, \dots, |p|$, of length $|p|$. Each partial plan must include exactly one arm starting position. Formally:

$$\exists i : p_i \in \{x_1, \dots, x_m\} \;\wedge\; \forall j, j \neq i : p_j \notin \{x_1, \dots, x_m\} \tag{1}$$

This starting position defines which arm the partial plan is for. We call a set of partial plans $P$ a plan if every target is visited exactly once by exactly one partial plan.


Figure 1: From left to right: an example picture of apples on a farm; potential targets detected by the camera; the abstract representation of targets that is the basis for our planning problem; an example of a plan where different colors indicate trajectories for different robotic arms.

These constraints define a plan. Depending on the real-world application, additional constraints can be formulated. Our robot design for the apple harvesting task comes with multiple robotic arms on a shared vertical rail. All other joints move the end-effector within the horizontal plane, which means they do not change the height (vertical position) of the end-effector. Hence, the only way a collision can occur is when two arms move to the same vertical position on the rail. Let $z_{i,t}$ be the height of arm $i \in \{1, \dots, m\}$ at time $t$; then

$$\forall i \neq j : z_{i,t} \neq z_{j,t} \quad \forall t \in \mathbb{R} \tag{3}$$

is the formal description of this constraint. Since the time factor here is continuous, it is difficult to evaluate this constraint. We use different approaches to make the evaluation of the collision constraint computationally efficient (see Section 6 for more details). We consider every plan that satisfies all our constraints a solution to our task.

We compare two different plans by the time their execution takes. For a partial plan, the execution time is the sum of the times of all its steps:

$$\sum_{i=1}^{|p|-1} d(p_i, p_{i+1}) \tag{4}$$

where $d(x_i, x_j)$ is a distance measure between $x_i$ and $x_j$. In our application, it is the time to harvest a target at $x_j$ when the end-effector is currently at $x_i$. Specific to our application, we can break this distance down into two parts: moving the end-effector from $x_i$ to $x_j$, denoted as $m(x_i, x_j)$, and harvesting the target, denoted as $e$. We consider the harvest operation a constant time effort: it is independent of the start position of the end-effector, while the move operation depends on both start and end position.

$$d(x_i, x_j) = m(x_i, x_j) + e \tag{5}$$

Since all partial plans are executed in parallel, the time to execute the full plan is the execution time of the longest partial plan:

$$\max_{p \in P} \sum_{i=1}^{|p|-1} d(p_i, p_{i+1}) \tag{6}$$

Our optimisation objective is to find a plan that minimizes this execution time:

$$\min_P \; \max_{p \in P} \sum_{i=1}^{|p|-1} d(p_i, p_{i+1}) \tag{7}$$
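As a concrete illustration of Equations (4)-(7), the following Python sketch (hypothetical helper names; Euclidean move time plus a constant harvest time $e$ as in Equation (5)) evaluates the execution time of a candidate plan:

```python
import math

def step_time(x_i, x_j, e=1.0):
    # d(x_i, x_j) = m(x_i, x_j) + e: Euclidean move time plus constant harvest time
    return math.dist(x_i, x_j) + e

def partial_plan_time(p, e=1.0):
    # Equation (4): sum of step times along one arm's trajectory
    return sum(step_time(p[i], p[i + 1], e) for i in range(len(p) - 1))

def plan_time(P, e=1.0):
    # Equations (6)/(7): arms work in parallel, so the slowest partial plan
    # determines the execution time of the full plan (the quantity we minimize).
    return max(partial_plan_time(p, e) for p in P)

# Example: two arms, each partial plan starting at the arm's home position
plan = [
    [(0.0, 0.0, 0.5), (0.2, 0.1, 0.6), (0.3, 0.2, 0.9)],
    [(0.0, 0.0, 1.5), (0.1, 0.3, 1.4)],
]
print(plan_time(plan))
```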


In the special case that we have only a single arm, the full plan consists of only one partial plan. Then the objective simplifies to:

$$\min_p \sum_{i=1}^{|p|-1} d(p_i, p_{i+1}) \tag{8}$$

This is the objective function of the Traveling Salesman Problem (TSP), a well-studied combinatorial optimisation problem. It is known to be NP-complete [5], which means it is only feasible to solve it exactly for small numbers of targets. That is why most practical applications use heuristic solvers, which find solutions of good quality within limited time.

The generalization of the TSP to the objective in Equation (7) is known as the multiple Traveling Salesman Problem (m-TSP) [6]. While the m-TSP is studied less than the TSP, a family of constrained m-TSPs known as Vehicle Routing Problems (VRPs) has received much attention.

With this formalization of the robotic fruit harvesting task, we arrive at a family of well-defined and well-studied mathematical problems. This problem definition enabled us to find literature and existing approaches from which we developed our method.

3 Technical Setup

3.1 Software

The pipeline is built on the Robot Operating System (ROS) [7]. ROS provides a framework for communication between multiple processes, called nodes. ROS allows distributing nodes over multiple machines within a network, enabling remote control of a robot via WiFi. Further, ROS is popular in the research and open-source community, so it comes with a vibrant ecosystem of packages. The aim of using ROS is to develop a pipeline for the simulation that can run without changes on the physical robot.

To handle motion planning and execution, we use the ROS package MoveIt [8]. This package provides a framework to interface with motion planners, including inverse kinematics solvers, and executes the resulting motion plans. MoveIt represents motion plans as joint trajectories. To use MoveIt for a given robotic arm, one has to create an additional configuration, including information such as self-collision matrix, move groups, gripper, and controllers.

MoveIt uses ROS topics to communicate the joint velocities to controller nodes, which use this input to control the actuators through their respective APIs. In our project, we used actuators from different manufacturers for different parts of the robot arm. To enable MoveIt to control the full robot arm, we had to integrate all of their APIs.

To test the pipeline during development, we integrated the robotic simulator Gazebo [9]. Gazebo integrates well with ROS and uses the same interfaces that the physical robot will use. This integration allows switching from simulation to the real robot with minimal changes in the setup. The simulator is also the basis for the training environment for robotic fruit harvesting. The environment simulates picking fruits from an orchard of trees, with Gazebo providing an accurate simulation of the robot arm dynamics. To train within this environment, we used the OpenAI Gym [10] interface with the openai_ros extension (http://wiki.ros.org/openai_ros).

For faster training and iteration, we developed a simplified simulation using OpenAI Gym [10] and Python. We describe in more detail how we model this environment in Section 5.

Lastly, we use Docker (https://www.docker.com/) to run the whole pipeline in a container. Docker enables us to run and deploy quickly on different machines and provides independence from the underlying operating system.

3.2 Hardware

The prototype robot arms used in this thesis are of type Selective Compliance Articulated Robot Arm (SCARA). The 3 degrees of freedom (DoFs) of the SCARA arm can move the end-effector to any position and rotation in the 2D plane.

Figure 2: Schema of the high-level software pipeline (target coordinates → sequencer → motion planning → motion execution → controllers → simulation). We assume the target coordinates are provided, e.g., from a computer vision pipeline. The sequencer module is the focus of this thesis.

This arm is mounted on a vertical prismatic joint, giving the whole arm 4 DoFs to position the end-effector. See Figure 3.

For the actuators of the SCARA part of the arm, we use X-series actuators from HEBI Robotics (https://www.hebirobotics.com/x-series-smart-actuators). HEBI Robotics offers a configuration tool to produce the description files for custom configurations (https://www.hebirobotics.com/robot-builder). We use their C++ API, ROS integration example code, and Gazebo plugin to control the actuators in simulation and on the physical robot.

The vertical prismatic joint uses parts from Dunkermotoren (https://www.dunkermotoren.com/). We use the BG 75X25 motor together with the PLG75 gearbox and E90R brakes. Each robot arm has its own vertical rail. The rails of paired robot arms are assembled next to each other; hence, for the planning problem it makes no difference that each arm has its own vertical prismatic joint.

The robot arm has a gripper designed in Fusion 360 (https://www.autodesk.com/products/fusion-360/overview) as an end-effector. We are using Robotis Dynamixel actuators (http://www.robotis.us/dynamixel/) for the gripper. A third-party plugin for Fusion 360 (https://discourse.ros.org/t/fusion-360-add-in-to-export-urdf/6944) allows us to export the design to URDF. We then manually combine the different description files of the robot arm parts into a single URDF describing the whole arm.

During training in simulation and on the robot in the field, we run the software pipeline on an NVIDIA AGX Xavier (https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/). This device offers an energy-efficient GPU designed for autonomous machines. The Xavier has different power modes to control power efficiency; we used power setting six for maximum performance from the GPU and from the CPU for single-threaded processes.



Figure 3: This screenshot shows the robot arm simulated in Gazebo. The gray pole represents the vertical prismatic joint. All revolute joints are highlighted in red. We consider the shoulder, elbow, and wrist joints part of the arm; the remaining two joints are part of the gripper.

4 Related Work

4.1 Automated Fruit Harvesting

Many aspects of robotic harvesting are actively researched, including grasping patterns for apple picking [11], localization of fruits [12], and the benefits of workspace simulation of a robotic arm for its design process [13]. Bac et al. reviewed the state of the art and challenges of automated fruit harvesting a few years ago [14]. One of the challenges the authors identify is "enhancing the robot"; specifically, they suggest improving the ability to learn and adapt. In our approach, we address this point by using a learning method that can be refined with data from real-world harvests to improve future iterations.

The industry is highly interested in automated fruit harvesting as well, and plenty of start-ups work on solutions. Some examples for automated apple harvesting are the companies FFRobotics (https://www.ffrobotics.com/) and Abundant Robotics (https://abundantrobotics.com/). RootAI (https://root-ai.com/) is developing a solution for automated tomato harvesting, and Agrobot focuses on automated strawberry harvesting.

One direction for harvesting fruits hanging from trees is to shake the tree or branch and catch the falling fruits [15][16][17]. However, the fall can damage sensitive fruits; hence these techniques cannot be applied widely. The direction we are following is to pick each fruit individually with a robotic arm. Many of these approaches focus on a single robotic arm to avoid disturbance of the tree by another robotic arm and the challenges of collision avoidance and coordination [18][19][20]. Instead, we explore the more difficult but potentially more rewarding challenge of coordinating multiple robotic arms in a shared workspace.

Bloch et al. evaluated different robot and work environment designs for efficient apple harvesting [20]. They found that modern orchards (rows of apple trees) designed to ease human labor (like the tall spindle [21] and Y-trellis [22] trees) also ease automation with robots. Further, they identified the time it takes the robot to move from one picking position to another as an essential factor for the overall performance.



A dual-arm strawberry harvester proved to be more efficient than a single-arm version of the same design [1]. However, the single-arm setup took only 1.3 times as long as the dual-arm setup. The system coordinated both arms by splitting the workspace evenly and deciding whether each arm starts from the left or the right. With ideal coordination, the dual-arm setup would be up to twice as fast as a single arm. This theoretical upper bound indicates room for improvement through more sophisticated planning techniques that utilize additional arms more effectively.

Zion et al. looked explicitly into coordinating multiple arms for melon harvesting at constant driving speed [23]. Their robot design limits the arms to lateral movement, which allows modeling the coordination problem as a k-colorable subgraph problem. They assumed all fruit positions are known in advance and found feasible algorithms to plan which arm picks which melons. Higher numbers of arms increased the number of successfully picked melons and therefore supported higher driving speeds. This study is the closest to our work, as they specifically focus on coordinating multiple arms for a harvesting task. We also consider a continuously moving platform; however, we do not have a strict limitation to lateral movement of the arms. The more complex arm movements prevent us from applying their model to our case.

4.2 Coordinated Dual-Arm Manipulation

This thesis focuses on multi-arm coordination to improve the efficiency of harvesting robots. This section reviews related work on multi-arm coordination.

There are two types of coordinated dual-arm manipulation: bimanual and goal-coordinated manipulation [24]. Bimanual manipulation mimics humans' ability to manipulate a single object with both arms simultaneously: for example, one arm picks apples from the tree while another catches the picked apples and transports them to the collector via a hose [25]. Most research in two-arm coordination follows this direction. Goal-coordinated manipulation covers situations where two arms work together on a common goal without physically interacting: for example, two arms pick apples in parallel and have to coordinate who picks which apple. Our work uses goal-coordinated manipulation.

Another example of a goal-coordinated dual-arm robot is the second-place contribution to the Amazon Robotics Challenge 2017 [26]. The pick-and-place tasks of the challenge are similar to apple picking. The two arms in their system shared their workspace and coordinated by selecting targets based on scores provided by the vision pipeline. To avoid collisions between arms, they mapped the current position and future waypoints of each arm into the top-down plane. A new trajectory was only permitted if it maintained a safe distance to all waypoints in the plane. This strategy still led to arms waiting frequently, and a single arm took only about 1.2 - 1.3 times as long as the two arms together, which is similar to the strawberry harvesting robot's results.

4.3 Robustness and Transfer from Simulation to the Real-World

The problem of coordinating robot arms to harvest apples is dynamic. Picking an apple might set the branch into a swinging motion, and similarly, wind and rain affect the positions of apples over time. There are also technical factors that add to the problem’s natural dynamics, such as a continuously moving platform to improve harvesting speed and inaccuracies in the perception pipeline and actuator controls. Hence, extra effort is required to ensure that solutions developed in simulation work well in the more complex real world.

Classical task and motion planning solutions assume the environment to be static. These solutions represent plans as joint-angle trajectories. When the environment changes even slightly between the planning and the execution of such a static plan, it becomes invalid: the final pose in the plan no longer matches the changed environment. Recent research instead represents the plan as a goal pose relative to the target object [27]. This way, the target pose remains valid after changes in the environment. Reactive controllers take such abstract plans and update the target joint angles in a high-frequency update loop. The fast update loop enables them to adapt quickly to changes over time and allows these techniques to be applied in a dynamic environment.

Machine learning methods can also provide robustness to a noisy environment. Combining reinforcement learning with classical task and motion planning can lead to more robust and faster solutions [28]. In this paper, a mobile robot has to navigate an office floor and requires help from humans to open doors. Relying on human help makes the time needed to pass a closed door uncertain. The rooms to navigate stay the same over multiple episodes, which allows transferring experience from one episode to another. This model cannot be transferred to apple harvesting directly, since the positions of apples vary continuously between planning instances.

Reinforcement learning techniques have proven to work well in multiple environments with different rules without manual tuning, such as playing Chess, Shogi, and Go with one set of hyperparameters [29]. Classical routing tasks more similar to our situation (such as the traveling salesman or vehicle routing problem) can be solved efficiently with recurrent neural networks [30] and graph attention models [31] when trained with reinforcement learning.

However, it remains challenging to apply reinforcement learning to real-world problems [32]. Recently, automated domain randomization successfully enabled transfer and robustness in robotics tasks like dexterous hand manipulation [33]. Forcing the learner to deal with many variations of the environment during training helped it generalize so well that it could still perform its tasks when some limbs were malfunctioning. While our task is different from dexterous hand manipulation, the same methods may improve our application's robustness to unforeseen environment dynamics.

4.4 Learning to Plan

In Section 2, we formulated the fruit harvesting task with multiple robot arms as a combinatorial optimization problem. While it is often infeasible to compute the optimal solution to larger problem instances, there is a large family of heuristic search methods that can obtain a good solution at a low computational cost. However, handcrafting such heuristics is an expensive process. The rise of deep learning came with the promise that such heuristics can be learned. Kool et al. showed that an attention model can learn heuristics for the Traveling Salesman Problem (TSP) and the Vehicle Routing Problem (VRP) using reinforcement learning [31]. They managed to outperform previously handcrafted heuristic search methods with a comparable computational budget. Another neural architecture developed for combinatorial optimization problems is the Pointer Network by Vinyals et al. [34]. The Pointer Network is a recurrent neural network that uses an attention mechanism to point to input items as answers instead of using a fixed answer vocabulary. This feature enables the network to handle different problem sizes. Nazari et al. built on the idea of Pointer Networks to solve VRPs [3]. Their variation no longer uses a recurrent encoder, which makes the network invariant to the order of the inputs. Since the order of nodes is irrelevant to the solution of TSPs and VRPs, this property is desirable. Further, the authors add dynamic features to each node to effectively encode information such as capacity and demand, in order to solve VRPs with additional constraints. Due to the similarity of our problem to VRPs, we choose the architecture developed by Nazari et al. for this project.

5 Methods

In Section 2, we framed our fruit harvesting task as a constrained multiple Traveling Salesman Problem (m-TSP). We approach the m-TSP as a Markov Decision Process (MDP), which allows us to use reinforcement learning to learn a heuristic solver.

5.1 m-TSP as a Reinforcement Learning Problem

We can frame m-TSP as a deterministic MDP. A finite set of states, a finite set of actions, and a reward function for state-action-pairs define a deterministic MDP.

We use the set of node coordinates $X$ (including arm start positions and target positions) together with a potentially incomplete plan $P'$ as the state $s = (X, P')$. An action $a_t$ adds a step to any of the incomplete trajectories in $P'$: $a_t = (i, x_j)$, where $i$ is a trajectory index and $x_j \in X$. The new incomplete plan after an action must satisfy all given constraints (see Section 2). If all planning targets in $X$ are in $P'$, the plan is complete and the state is terminal. We define our reward function as follows:

$$R(a_t, s_t) = \begin{cases} -\max_{p \in P} \sum_{i=1}^{|p|-1} d(p_i, p_{i+1}) & \text{if } s_t \text{ is terminal} \\ 0 & \text{otherwise} \end{cases}$$


Figure 4: Schema of our adaptation of the model proposed by Nazari et al. [3]. An encoder network creates embeddings $e_1, \dots, e_{m+n}$ for the features of each target $x_1, \dots, x_{m+n}$. A recurrent decoder network takes these embeddings as its initial hidden state. At each step, the output of the decoder together with the embeddings is used by an attention mechanism to point at one of the input embeddings.

Using the negative execution time as the reward makes maximising the reward equivalent to minimising the execution time – our objective function from Section 2.
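The following minimal sketch (a hypothetical class, not the thesis implementation) illustrates this MDP framing: the state holds the node coordinates and the partial plan, an action appends a target to one arm's trajectory, and the reward is the negative makespan at the terminal state and zero otherwise.

```python
import math

class MTSPState:
    """Minimal sketch of the m-TSP MDP described above. State s = (X, P'):
    all node coordinates plus a partially built plan, one trajectory per arm."""

    def __init__(self, home_positions, targets, grasp_time=1.0, velocity=1.0):
        self.targets = list(targets)
        self.trajectories = [[h] for h in home_positions]  # P': one partial plan per arm
        self.visited = set()
        self.grasp_time = grasp_time
        self.velocity = velocity

    def step(self, arm, target_idx):
        # Action a_t = (i, x_j): append target x_j to trajectory i.
        assert target_idx not in self.visited, "constraint: no target picked twice"
        self.visited.add(target_idx)
        self.trajectories[arm].append(self.targets[target_idx])
        return self.reward()

    def is_terminal(self):
        return len(self.visited) == len(self.targets)

    def reward(self):
        if not self.is_terminal():
            return 0.0  # sparse reward: zero for intermediate steps
        # Terminal reward: negative makespan, so that maximising the return
        # is equivalent to minimising the plan execution time (Equation (7)).
        def arm_time(p):
            return sum(math.dist(p[i], p[i + 1]) / self.velocity + self.grasp_time
                       for i in range(len(p) - 1))
        return -max(arm_time(p) for p in self.trajectories)
```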

5.2 Neural Network Architecture

Solutions to VRPs using neural networks often impose the constraints as a binary mask that restricts the set of actions to valid ones. An example is the Pointer Network variation by Nazari et al. [3]. Since the binary mask, and not the neural network, handles the constraints, we can use the same architecture to learn solutions for our constrained m-TSP by expressing our custom constraints in the binary mask.

Figure 4 shows a schema of the architecture we are using. Our implementation is based on the open-source implementation by Matthew Veres.

We are using the coordinates of each target as features. Each target is embedded individually through a 1-dimensional convolution layer. We use the embeddings as the initial hidden state for the GRU [35] decoder. The decoder output at each step and the embeddings are the inputs to an attention mechanism [36], which produces a pointer to one of the inputs. Since the network does not use recurrence for the embeddings, it is invariant to the order of the input elements. Further, the attention mechanism allows it to handle input sets of varying lengths, since it is not limited to a fixed-length vocabulary. This property is essential because the number of targets within a planning window is not constant in real-world fruit harvesting. Furthermore, it enables the network to deal with larger target sets than seen during training.

Different from Nazari et al., we only made use of these static features and did not add any dynamic features to each target. A potential extension to this work is to investigate dynamic features. For more details, refer to [3].
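The sketch below (PyTorch, simplified relative to the Nazari et al. model and the open-source code our implementation builds on) illustrates the main ingredients: a per-target 1-D convolution embedding, a GRU decoder, and an additive attention mechanism that points back at one of the masked input embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerPolicy(nn.Module):
    """Simplified sketch of the pointer-style policy described above; not the
    exact thesis implementation. Coordinates are embedded per target with a
    1-D convolution; a GRU decoder plus additive attention points back at one
    of the inputs, and a mask removes invalid (already picked) targets."""

    def __init__(self, coord_dim=3, hidden=128):
        super().__init__()
        self.embed = nn.Conv1d(coord_dim, hidden, kernel_size=1)  # per-node embedding
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.v = nn.Parameter(torch.randn(hidden) * 0.01)
        self.W_enc = nn.Linear(hidden, hidden, bias=False)
        self.W_dec = nn.Linear(hidden, hidden, bias=False)

    def forward(self, coords, last_choice_emb, dec_hidden, mask):
        # coords: (B, N, coord_dim); mask: (B, N) bool, True for valid targets
        emb = self.embed(coords.transpose(1, 2)).transpose(1, 2)   # (B, N, H)
        out, dec_hidden = self.decoder(last_choice_emb.unsqueeze(1), dec_hidden)
        query = out.squeeze(1)                                     # (B, H)
        # Additive attention scores over the input embeddings ("pointer")
        scores = torch.tanh(self.W_enc(emb) + self.W_dec(query).unsqueeze(1)) @ self.v
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.log_softmax(scores, dim=-1), emb, dec_hidden
```

During decoding, the embedding of the most recently selected node would be fed back as the next decoder input, and the mask would be updated so that picked or colliding targets cannot be selected again.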

5.3 Reinforcement Learning Method

We use the reinforcement learning method advantage actor-critic (A2C), a popular synchronous version of the asynchronous advantage actor-critic (A3C) introduced by Mnih et al. in 2016 [37]. A2C is empirically more cost-efficient than A3C on a single-GPU machine (https://openai.com/blog/baselines-acktr-a2c/). Since we had no large cluster available for training, A2C was the preferred choice.

A2C is a policy gradient method. In policy gradient methods, we compute the gradient of a cost function $J(\theta)$ for the current policy $\pi_\theta(a_t|s_t)$. The policy computes the probability of an action $a_t$ in a given state $s_t$. The basic policy gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t\right] \tag{10}$$

$G_t$ is the discounted cumulative reward for step $t$, and $T$ is the number of steps in this episode. The vanilla policy gradient has high variance and struggles with updates where the cumulative reward $G_t$ is zero. One way to address this issue is to introduce a baseline $b(s_t)$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\,\big(G_t - b(s_t)\big)\right] \tag{11}$$

A good baseline reduces the gradients such that only the useful information of the reward remains and offsets from zero are mitigated. The ideal baseline is the true value function, $b(s) = V^{\pi^*}(s)$. We do not have the true value function, but we can approximate it with a neural network, $b(s) = V_\phi(s)$. This network is called the critic, and the policy network $\pi_\theta$ is called the actor. We can use $\mathbb{E}_{\pi_\theta}[G_t] = Q_\theta(a_t, s_t)$ to rewrite the gradient equation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\,\big(Q_\theta(a_t, s_t) - V_\phi(s_t)\big)\right] = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A(a_t, s_t)\right] \tag{12}$$

where $A(a_t, s_t) = Q_\theta(a_t, s_t) - V_\phi(s_t)$ is known as the advantage function. To stabilise the training process, we use the advantage to compute the gradient. The advantage can be understood as the improvement of the observed rewards over the expected value of a state. The critic estimates the value function and acts as a baseline; the actor uses this estimate to update the policy in the direction the critic suggests. (The article "Understanding Actor Critic Methods and A2C" by Chris Yoon is a good resource to learn more about these methods.)
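A minimal sketch of the resulting A2C update, assuming per-step log-probabilities, value estimates, and Monte Carlo returns have already been collected:

```python
import torch

def a2c_losses(log_probs, values, returns):
    """Sketch of the A2C losses described above. The critic V_phi(s_t) serves
    as baseline; the actor is weighted by the advantage A_t = G_t - V_phi(s_t)."""
    advantages = returns - values.detach()        # no gradient through the baseline
    policy_loss = -(log_probs * advantages).mean()
    value_loss = torch.nn.functional.mse_loss(values, returns)  # critic regression
    return policy_loss, value_loss
```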

The following techniques are recent extensions of the actor-critic method. Looking for an effective method for our task, we explored the effects of their combinations. For more details on the configuration search, see Section 7.2.

Schulman et al. [38] introduced generalized advantage estimation (GAE) in 2015. GAE uses an exponentially weighted estimator for the advantage function:

$$A^{GAE(\gamma,\lambda)}(a_t, s_t) = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta^V_{t+l} \tag{13}$$

where $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. GAE can reduce the variance and stabilise the training even more than the advantage estimate in standard A2C. According to Schulman et al., with the right choices for $\gamma$ and $\lambda$, GAE reduces the variance drastically with only a low increase in bias, making the trade-off worthwhile. The discount factor $\gamma$ is an environment parameter; however, the authors argue that choosing a $\gamma$ smaller than the one given by the environment can help to learn faster.
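A sketch of how the GAE advantages of one finite episode can be computed from rewards and value estimates (assuming a bootstrap value, zero for a terminal state, is appended to `values`):

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Generalized advantage estimation (Equation (13)) for one episode;
    `values` holds V(s_0..s_T) with the bootstrap value V(s_T) appended."""
    T = len(rewards)
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta^V_t
        gae = delta + gamma * lam * gae                         # exponentially weighted sum
        advantages[t] = gae
    return advantages
```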

Ilyas et al. [39] mention advantage normalisation (AN) as an extension of GAE. It can help reduce the variance even further but may introduce a bias that is sensitive to outliers in the current batch. The equation for AN is given by:

$$A' = \frac{A - \mu(A)}{\sigma(A)} \tag{14}$$


where $\mu(\cdot)$ is the mean and $\sigma(\cdot)$ the standard deviation. In practice, to avoid division by zero, a small value $\epsilon > 0$ is added to the divisor:

$$A' = \frac{A - \mu(A)}{\sigma(A) + \epsilon} \tag{15}$$
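In code, advantage normalisation is a one-liner over the current batch (Equation (15)):

```python
import torch

def normalize_advantages(advantages, eps=1e-8):
    # Standardize the advantages of the current batch; eps avoids division by
    # zero when the batch has (near-)constant advantages.
    return (advantages - advantages.mean()) / (advantages.std() + eps)
```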

Lastly, we considered using Proximal Policy Optimisation (PPO), introduced by Schulman et al. [40]. The authors suggested multiple variations of PPO; we specifically consider clip-PPO. The main objective of this variant is:

$$L^{CLIP}(\theta) = \mathbb{E}\left[\min\big(r_t(\theta)\, A(a_t, s_t),\; \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, A(a_t, s_t)\big)\right] \tag{16}$$

$\epsilon$ is a hyperparameter, and $r_t(\theta) := \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ denotes the probability ratio between the new and the old policy. The idea is to perform multiple gradient steps per sample and to scale each gradient step by the probability ratio between the old policy (which obtained the sample) and the new policy. If the current policy assigns a higher probability to good actions than the old one, we can take larger gradient steps. However, the further the new policy differs from the old policy, the less informative the obtained samples are. Clipping the ratio prevents the algorithm from becoming overly confident in the state-action distribution of the new policy.

Trust Region Policy Optimisation (TRPO) inspired the PPO objective. The core idea here is to maximize the update step size, but consider how certain we are about it. Larger steps have low certainty since the obtained samples are less informative about the new policy.

In addition to the clipped ratio, PPO stops early if the new policy differs too much from the old policy. The authors use an approximation of the Kullback-Leibler divergence (KL divergence) and a threshold hyperparameter to decide when to stop the gradient steps early.

This concept of taking multiple gradient steps per observed sample within a safe region around the old policy makes PPO much more sample-efficient. The repeated gradient steps are particularly valuable when obtaining new samples is costly, for example when complex simulations or real-world executions are required. Our Physical Simulation described in Section 6.2 is an example of such a case.

For a more in-depth explanation of PPO variants and implementation details, we refer to the original paper by Schulman et al. [40] and OpenAI's Spinning Up tutorials [41].
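A sketch of the clipped surrogate loss and the approximate KL used for early stopping, assuming log-probabilities under the old and new policies for the same state-action pairs:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO objective (Equation (16)); minimising this loss maximises
    the clipped surrogate."""
    ratio = torch.exp(new_log_probs - old_log_probs)             # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def approx_kl(new_log_probs, old_log_probs):
    # Rough KL estimate used to stop the inner gradient loop early.
    return (old_log_probs - new_log_probs).mean()
```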

6 Simulation

Training a policy on a real robot arm is expensive, slow, and potentially dangerous. Hence, simulations are used for training and evaluation of methods before deploying them on real-world robots. Some properties of a simulator are particularly important:

• speed: An ideal simulation requires little compute and allows for faster iteration and training.

• accuracy: The simulator's model must represent relevant features of the world accurately so that models trained in the simulation transfer to the real world.

• customisation: A simulation that can be easily tuned and extended helps to iterate quickly and test new ideas.

Each simulation must make a trade-off between speed and accuracy. On one extreme of the spectrum is the real world, which provides perfect accuracy but is extremely expensive to use. The more abstractly we model the task, the faster the simulation becomes, but at the cost of accuracy.

To balance these properties, we deploy two different simulations for training and evaluation: the Analytic Simulation uses a highly abstract description of the task and robot arm to allow fast early experimentation and pre-training of models, while the Physical Simulation uses the same pipeline as the real robot and the robotic simulator Gazebo [9] to accurately simulate the robot arm (see the following subsections for details on each simulation).


Further, we handcrafted a generator to approximate the distribution of apples on a tall spindle apple tree; due to limited data, it was not possible to learn this distribution. The generator is used in both simulations to produce the target samples. See Section 6.3 for more details.

6.1 Analytic Simulation

The Analytic Simulation uses an abstract description of the fruit harvesting task to reduce the amount of computation needed while maintaining the task's relevant properties. The simulation computes the result of an action immediately, and the next decision point is after the full execution of that action. We do not simulate intermediate time steps, which allows us to run the simulation fast. At each step the observation consists of:

• target coordinates: 3D coordinates of all targets in this episode.

• active targets: A binary mask indicating which targets are already picked and which are still available.

• arm trajectories: The sequence of visited nodes for all arms, including their current position. All nodes are either arm start positions or target coordinates.

• arm times: The total time each arm was busy so far in this episode.

Given this observation, the agent has to pick an arm $i$ and a target $x_{target}$ as the next action. The time $t$ spent on this action is computed as follows:

$$t = \frac{d(x_{current}, x_{target})}{v} + g \tag{17}$$

$d(x_{current}, x_{target})$ is a distance measure between the current location of the arm and the chosen target. We use Euclidean distance in our experiments; other choices of distance measure can model different movement properties of the robot arm. $v \in \mathbb{R}$ is a constant hyperparameter that models the velocity of the robot arm, and $g \in \mathbb{R}$ is a constant hyperparameter that models the grasp time and other constant factors of the picking action.

At the end of an episode, we return the highest arm time $T_i$ as the reward. The arm time $T_i$ for arm $i$ is the sum of the times all its actions took. All intermediate steps in the episode return a reward of zero. The objective is to minimize the total time spent on an episode by effectively distributing the workload between all robot arms; this reward signal models the objective directly and faithfully. While giving intermediate rewards could help with training, it is hard to produce a helpful signal that cannot be exploited, since the overall performance during an episode cannot be known ahead of time. For example, if we chose the negative distance as a reward at each step, the model would be incentivized to select close targets early in the episode, which may lead to sub-optimal plans later in the episode.

Lastly, the simulation rejects actions that would cause the last segments of the trajectories of any two arms to overlap in the z-direction. This check models the constraint that the arms cannot move past each other in height. For simplicity, the simulation does not calculate the positions of all arms at intermediate time steps. Instead, we consider the whole segment from an arm's previous position to its current target as a blocking area for all other arms. This blocking area covers past positions (the previous position), the current position, and future positions (the current target). Checking blocking areas keeps the computational cost of collision detection low while guaranteeing that all potential collisions are detected.
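A sketch of this conservative check (hypothetical helper names; each segment is given as a (z_start, z_end) pair describing the last trajectory segment of an arm):

```python
def z_interval(segment_start_z, segment_end_z):
    # The blocking area of an arm covers all heights between its previous
    # position and its current target.
    return min(segment_start_z, segment_end_z), max(segment_start_z, segment_end_z)

def intervals_overlap(a, b):
    # Two blocking areas collide if their height intervals overlap.
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi

def action_allowed(candidate_segment_z, other_active_segments_z):
    """A new trajectory segment for one arm is rejected if its height interval
    overlaps the blocking interval of any other arm's last segment."""
    cand = z_interval(*candidate_segment_z)
    return all(not intervals_overlap(cand, z_interval(*seg))
               for seg in other_active_segments_z)
```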

This Analytic Simulation models the task of picking fruits with multiple arms as a multiple Traveling Salesman Problem with additional collision constraints.

6.2 Physical Simulation

The Physical Simulation uses the same ROS-based pipeline described in Section 3.1 and the same robot description as the physical robot. We use the same information as observation at each step; however, the difference to the Analytic Simulation is that the time $t$ for an action is obtained by simulating the picking motion in the robotic simulator Gazebo. Gazebo also handles collision detection instead of the simple predictive heuristic described in the previous section. This simulation can only detect a collision when it occurs; hence, actions that fail due to collision produce a time penalty before being rejected.

This environment introduces noise and provides much more accurate estimates of execution time and collisions. These improvements in simulation accuracy come at the cost of higher computational expense and make the task overall more challenging. However, this also makes the simulation well suited to evaluate and refine policies obtained through training in the Analytic Simulation.

6.3 Apple Distribution Generator

The distribution of target positions for the task is orthogonal to the accurate simulation of the robot arm dynamics. Having access to a good approximation of the real-world spatial distribution of apples to harvest can help the model learn faster and transfer more robustly to the real world. Additionally, the obtained evaluation results are more realistic. To obtain such a distribution, we need data from real-world apple trees on farms.

We found a dataset from Victor Bloch, Amir Degani, and Avital Bechar which contains data about the branches and apples of a tall spindle apple tree. See Figure 5 for a photo of this apple tree type. This tree is similar to the ones used on the farm where we intend to test the robot. Since only one data point is available, we cannot directly fit a distribution to the data; instead, we consider the given data point a prototype and vary its properties with random noise to create a generator for apple distributions. We use the branch structure data to generate virtual trees and distribute the apples randomly on these virtual trees. This way, we guarantee that the resulting distributions are biologically plausible. We analyzed the prototype tree to obtain all hyperparameters for angles and segment lengths.

Figure 5: Photo of a tall spindle apple tree orchard. Source: https://commons.wikimedia.org/ wiki/File:Intensive_apple_orchard.jpg

All random numbers are sampled uniformly from intervals defined by hyperparameters. The hyperparameters are chosen as the minimum and maximum values that occur in the prototype tree (see Table 1 and Table 2).


Hyperparameter | Min Value | Max Value
length of trunk segment | 0.456 m | 0.614 m
angle around y-axis for trunk segment | -7.576° | 7.576°
number of segments per branch | 2 | 4
length of first branch segment | 0.05 m | 0.19 m
length of second branch segment | 0 m | 0.28 m
length of third branch segment | 0.08 m | 0.23 m
length of fourth branch segment | 0.13 m | 0.14 m
rotation around x-axis of first branch segment | 54° | 100°
rotation around x-axis of second branch segment | 36° | 98°
rotation around x-axis of third branch segment | 19° | 40°
rotation around x-axis of fourth branch segment | 24° | 24°
rotation around z-axis of first branch segment | 12° | 27°
rotation around z-axis of second branch segment | 0.6° | 20°
rotation around z-axis of third branch segment | 0.6° | 31°
rotation around z-axis of fourth branch segment | 35° | 35°
valid height of branch start on trunk | 0.45 m | 1.5 m

Table 1: Random sampling intervals (hyperparameters) for tree generation

Hyperparameter | Value
number of trunk segments | 4
number of branch segments | 11
min branch distance | 0.01 m
number of apples | 30
min height of apples | 0.3 m
max height of apples | 1.75 m

Table 2: Non-random hyperparameters for tree and apple generation

As the first step, we generate the tree trunk. Just like in the prototype tree, the trunk consists of 4 segments. Each segment has a random length and is rotated around the x-axis by a random angle.

As the second step, we add branches to the tree. We choose a random start position on the trunk for each branch, with the constraint that two branches cannot start too close to each other. Each branch is 2 to 4 segments long. The segments rotate around the y- and z-axes to describe an overall downward arc.

Lastly, we sample random positions on the branches and trunk for all apples. Additional constraints prevent apples from hanging too low or high on the tree or overlapping with other apples. Figure 6 shows plots of all generation steps next to the prototype tree.
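The following toy sketch illustrates the three generation steps; the geometry is heavily simplified and the numeric ranges only loosely follow Tables 1 and 2, so it should be read as an illustration rather than the actual generator:

```python
import math
import random

def sample_tree(rng=None):
    """Toy sketch of the apple distribution generator described above; not the
    exact geometry or hyperparameters used in the thesis."""
    rng = rng or random.Random(0)

    # Step 1: trunk of four stacked segments with random length and small tilt.
    trunk = [(0.0, 0.0, 0.0)]
    for _ in range(4):
        length = rng.uniform(0.456, 0.614)
        tilt = math.radians(rng.uniform(-7.576, 7.576))
        x, y, z = trunk[-1]
        trunk.append((x + length * math.sin(tilt), y, z + length * math.cos(tilt)))

    # Step 2: branches start at random (not too close) heights on the trunk
    # and consist of 2-4 segments arcing downwards.
    branch_heights = []
    while len(branch_heights) < 11:
        h = rng.uniform(0.45, 1.5)
        if all(abs(h - other) >= 0.01 for other in branch_heights):
            branch_heights.append(h)
    branches = [{"start_z": h, "n_segments": rng.randint(2, 4)} for h in branch_heights]

    # Step 3: apples at random positions near the branches, inside a height band.
    apples = []
    while len(apples) < 30:
        branch = rng.choice(branches)
        z = branch["start_z"] - rng.uniform(0.0, 0.4)
        if 0.3 <= z <= 1.75:
            apples.append((rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5), z))
    return trunk, branches, apples
```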

7 Experiments

This section describes the experiments we used to develop and evaluate our approach.

7.1 Baselines and Upper Bounds

To understand the gain through coordination, we compare the multi-arm setup to a single arm. With ideal coordination, the time for the multi-arm solution is the time of the single-arm solution divided by the number of arms, i.e., two arms would take half the time a single arm needs.

Let us assume we always find the perfect sequence for a single arm for any workspace. Let $T$ be the time of the optimal single-arm solution for a workspace. We divide this solution into two parts, where the first part ends at the node at which the second part starts, and name the times of the partial solutions $T_1$ and $T_2$; then $T = T_1 + T_2$. If we further assume that executing these partial solutions does not violate any constraints of our m-TSP, then the time needed to execute them in parallel with two arms is $\max(T_1, T_2)$. The optimum is reached at $T_1 = T_2$, which means two arms working in parallel would be twice as fast as a single arm given a perfectly coordinated plan. The closer a solution gets to this upper bound, the higher the payoff for adding the second arm to the robot.

We also compare the scheduler performance against the following symbolic baselines:

Random: This baseline chooses the next target for each arm by randomly sampling from the available and reachable targets. It avoids collisions and never picks targets multiple times. Each arm gets a new target assigned as soon as it completed its previous action. This baseline acts as a lower bound. Any randomly initialized neural network should do at least as well as Random.

Divide: This baseline splits the targets evenly between sub-workspaces. The split happens along the z-axis (height) to satisfy the collision constraint. Within its workspace, each arm picks all targets from bottom to top. Other papers used this method, and it is a simple yet effective improvement over Random. We aim to beat this baseline significantly in all experiments.

Nearest Neighbour: A simple yet powerful baseline method for the TSP is Nearest Neighbour. In our case, the home position of an arm is the start target; the trajectory is then extended by greedily picking the closest remaining target by Euclidean distance. The arm with the shortest incomplete trajectory in terms of time chooses next, again greedily. These steps repeat until no targets are left. Note that Nearest Neighbour uses a heuristic (Euclidean distance in our case) to determine the nearest target. In our Analytic Simulation, this heuristic correlates perfectly with the time needed for the picking operation; however, that is not necessarily the case in the Physical Simulation, where distances in different dimensions may contribute differently to the time required. For example, the translational joint may be slower than the others, making movement in z more costly than in other dimensions. To overcome this limitation, we would need an accurate estimate of the movement time to determine the nearest target.
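A sketch of the Nearest Neighbour baseline for multiple arms (collision checking omitted; Euclidean distance as the heuristic):

```python
import math

def nearest_neighbour_plan(home_positions, targets, velocity=1.0, grasp_time=1.0):
    """Greedy multi-arm Nearest Neighbour: the arm whose partial plan is
    currently shortest in time picks the closest remaining target next."""
    plans = [[h] for h in home_positions]
    times = [0.0] * len(home_positions)
    remaining = list(targets)
    while remaining:
        arm = min(range(len(plans)), key=lambda i: times[i])  # shortest trajectory chooses
        current = plans[arm][-1]
        target = min(remaining, key=lambda t: math.dist(current, t))
        remaining.remove(target)
        plans[arm].append(target)
        times[arm] += math.dist(current, target) / velocity + grasp_time
    return plans, max(times)
```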

OR-Tools: The open-source optimization research tools library by Google, known as OR-Tools (https://developers.google.com/optimization), provides, among others, solvers for TSPs and VRPs. The routing solver of OR-Tools uses Local Search to iteratively improve an initial solution. The solver can be limited in time at the cost of solution quality; with more time given, the solver will find higher-quality solutions.

Note that we use OR-Tools with the hyperparameters suggested in its TSP example. Due to time constraints, we did not make any effort to find an optimal configuration for our task; hence, we expect that OR-Tools could produce slightly better results if one made that effort.

The OR-Tools routing solver allows for a set of common VRP constraints. However, to the best of our knowledge, it cannot consider arbitrary constraints such as the collision avoidance between robot arms in our application. Hence, we use OR-Tools without any custom constraints. We treat the resulting solutions as an upper-bound reference, since the total travel time of these solutions might be lower (better) than that of the best solution respecting the given constraints. We compare our approach to the solutions obtained by OR-Tools both without a time limit and with a time limit similar to the inference time of our approach. The first result serves as an upper bound, while the second gives us a baseline under a comparable computational budget. Limitations in compute and power consumption are highly relevant since we deploy our approach on a robot.
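The sketch below shows how such an unconstrained reference solution could be obtained with the OR-Tools routing solver, following the structure of Google's TSP/VRP examples. The integer distance matrix, the shared depot, and the global-span cost used to encourage balanced routes are simplifying assumptions, not necessarily the exact configuration used in our experiments.

```python
from ortools.constraint_solver import pywrapcp, routing_enums_pb2

def solve_with_or_tools(distance_matrix, num_vehicles=2, depot=0, time_limit_s=1):
    """Reference solution via the OR-Tools routing solver (no custom
    collision constraints); distance_matrix holds integer travel times."""
    manager = pywrapcp.RoutingIndexManager(len(distance_matrix), num_vehicles, depot)
    routing = pywrapcp.RoutingModel(manager)

    def transit(from_index, to_index):
        return distance_matrix[manager.IndexToNode(from_index)][manager.IndexToNode(to_index)]

    transit_idx = routing.RegisterTransitCallback(transit)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_idx)

    # A "Time" dimension with a global span cost nudges the solver towards
    # balanced routes, approximating our min-max (makespan) objective.
    routing.AddDimension(transit_idx, 0, 10**9, True, "Time")
    routing.GetDimensionOrDie("Time").SetGlobalSpanCostCoefficient(100)

    params = pywrapcp.DefaultRoutingSearchParameters()
    params.first_solution_strategy = routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC
    params.local_search_metaheuristic = routing_enums_pb2.LocalSearchMetaheuristic.GUIDED_LOCAL_SEARCH
    params.time_limit.FromSeconds(time_limit_s)
    return routing.SolveWithParameters(params)
```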

7.2 Configuration and Hyperparameter Search

The following experiments aim to find a functional configuration of reinforcement learning techniques and hyperparameters within a limited time. An important consideration for us was the wall time needed to train the model, to enable scaling the approach to larger numbers of targets.

All experiments in this subsection use the Analytic Simulation described in Section 6.1. We use two benchmarks with 2D coordinates and 3D coordinates. Both benchmarks use a small problem instance with ten targets sampled uniformly. Additionally, the 2D benchmark uses a grasp time of g = 0 to make the results comparable with the TSP results of other papers, for example, Nazari et al. [3]. For a complete list of hyperparameters for both benchmarks, see Table 3. Figure 7 shows a render of both benchmarks.


Hyperparameter | 2D Benchmark | 3D Benchmark
coordinate dimensions | 2 | 3
distribution | uniform[0, 1] | uniform[0, 1]
number of targets | 10 | 10
number of arms | 1 | 1
velocity | 1 m/s | 1 m/s
grasp time | 0 s | 1 s

Table 3: Environment hyperparameters for the two benchmarks used in the configuration and hyperparameter search

Figure 7: Screenshots from the two benchmark settings. Left: 2D benchmark; right: 3D benchmark. Crosses are targets that are already visited, and circles mark targets that are still available. The last segment of the trajectory is highlighted with a darker color.

At first, we tried different batch sizes with A2C in the 3D benchmark. The batch size is essential for converging faster and being able to run more experiments. We tested batch sizes of 1, 32, 64, and 128 and picked, for all further experiments, the batch size that provided the fastest convergence in terms of wall time without hurting solution quality.

Next, we compared different combinations of the reinforcement learning methods described in Section 5.3. The configurations we investigated are:

• Base A2C
• A2C + GAE
• A2C + GAE + AN
• A2C + GAE + PPO

To quickly obtain insights into how these configurations perform, we trained them with 2D coordinates. This benchmark closely relates to our target problem family, since it is the special case of using only one arm and a lower-dimensional coordinate space. We trained each configuration three times for 20,000 updates with different seeds to obtain reliable results and insight into the variance of the training performance.

We took the two best configurations from the 2D benchmark and trained them with 3D coordinates to move closer to our target problem. Again, we ran each configuration three times with different seeds. The configuration that performs better on the 3D benchmark is then used for all following experiments.

Due to limitations in time and computational resources, we could not perform an exhaustive hyperparameter search. In all experiments, we use the Adam optimizer [42] with a learning rate of 0.001. For the reinforcement learning methods, we use the default hyperparameters suggested in the original papers or implementations. For GAE, we use γ = 0.99 and λ = 0.97. For PPO, we use a clip ratio of 0.2, a KL threshold of 0.01 for early stopping, and an upper limit of 80 gradient steps per sample.


Hyperparameter          Value
sample size             1024
batch size              64
learning rate           0.001
discount factor γ       0.99
GAE hyperparam. λ       0.97
embedding size          128
hidden size             128
number of GRU layers    1

Table 4: Overview of hyperparameters we use for all training after the Configuration Search.

AN does not introduce any new hyperparameters. All experiments for the configuration search were trained for 20,000 updates with a sample size of 100 and evaluated on 10 validation episodes every 100 updates.
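As an illustration of how the advantage estimates enter training with these settings, a minimal sketch of GAE with γ = 0.99 and λ = 0.97 could look as follows; the function and variable names are placeholders and not taken from our implementation.

```python
import numpy as np


def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation for a single finished episode.

    rewards: per-step rewards r_0 .. r_{T-1}
    values:  critic estimates V(s_0) .. V(s_T)
             (one extra entry for the state after the last step)
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    last = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        last = delta + gamma * lam * last
        advantages[t] = last
    # Lambda-returns, usable as regression targets for the critic
    returns = advantages + values[:-1]
    return advantages, returns
```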

When training in the Analytic Simulation, we can revisit the same problem instances. Training repeatedly on a single problem instance proved valuable for debugging: the low variance speeds up convergence, but the bias is so high that the trained model generalizes poorly. We therefore investigate this bias-variance trade-off by varying the sample size. Due to constraints in computing and project time, we aimed to find a sample size that produces models that converge fast in terms of wall time while still performing well on a held-out validation set.

We choose a log scale to cover a broad range of potential sample sizes, using the values [2^8, 2^9, 2^10, 2^12, 2^14]. Once the algorithm has gone through all samples, we shuffle them and go through them again. To compare convergence in terms of wall time, we train for a fixed number of updates. That keeps the wall time comparable, while each sample is revisited more often for smaller sample sizes.
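A minimal sketch of this fixed-pool training loop, assuming the pool of problem instances fits in memory, is shown below; the names are placeholders for our actual pipeline.

```python
import numpy as np


def iterate_sample_pool(pool, batch_size, num_updates, seed=0):
    """Yield one batch of problem instances per update.

    Once every instance in the pool has been used, the pool is shuffled
    and traversed again, so smaller pools revisit each instance more
    often within the same (fixed) number of updates.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(pool))
    cursor = 0
    for _ in range(num_updates):
        if cursor + batch_size > len(pool):
            order = rng.permutation(len(pool))  # reshuffle after a full pass
            cursor = 0
        yield [pool[i] for i in order[cursor:cursor + batch_size]]
        cursor += batch_size


# Example usage with a pool of 1024 dummy instances and batch size 64.
pool = [{"instance_id": i} for i in range(1024)]
for batch in iterate_sample_pool(pool, batch_size=64, num_updates=3):
    print(len(batch))  # 64
```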

7.3 Evaluation of Our Approach

We train our approach on 10 targets for 50,000 updates, for one and two arms separately. Table 4 shows an overview of all hyperparameters we use for training; we motivate the choice of sample size and batch size in Section 9.1. With these hyperparameters, our model has 165,888 weights to learn. We run each training three times with different seeds and take the model checkpoint that achieved the best performance on the 100-sample validation set. To compare our approach to the other methods, we evaluate all approaches on a test set of 100 unseen samples in the 3D benchmark. Since low power usage and limited compute matter for robotic applications in the field, we also compare all approaches in terms of inference time.

We also compare the performance with two arms against the performance with one arm. To justify the deployment of an additional arm and of our approach, the performance gain from the second arm is vital. We evaluate this gain for different velocity and grasp-time values in the Analytic Simulation to gain insight into how different robot arms affect the improvement ratio. We test all combinations of velocities [0.1, 0.2, ..., 1] m/s and grasp-times [0, 0.5, ..., 4.5] s. The values are chosen to cover interesting theoretical edge cases (a grasp-time of 0) as well as a realistic range for these parameters.
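The evaluation grid can be summarized by the following sketch. Both the stand-in cost model and the definition of the improvement ratio used here (relative reduction with respect to the two-arm time) are assumptions made for illustration; the real numbers come from rolling out the policies on the 100-sample test set.

```python
import itertools


def mean_completion_time(num_arms, velocity, grasp_time, num_targets=10):
    # Dummy stand-in: grasping parallelizes across arms, travel only partly.
    # In practice this would be a policy rollout in the Analytic Simulation.
    travel = 0.5 * num_targets / velocity
    return travel / (1 + 0.5 * (num_arms - 1)) + num_targets * grasp_time / num_arms


velocities = [round(0.1 * k, 1) for k in range(1, 11)]   # 0.1 .. 1.0 m/s
grasp_times = [0.5 * k for k in range(10)]               # 0.0 .. 4.5 s

improvement = {}
for v, g in itertools.product(velocities, grasp_times):
    t_one = mean_completion_time(1, v, g)
    t_two = mean_completion_time(2, v, g)
    improvement[(v, g)] = t_one / t_two - 1.0            # 1.0 = twice as fast

print(f"{improvement[(1.0, 1.0)]:.2f}")
```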

7.4 Scaling to Harder Tasks

We run pilot experiments to investigate three directions of scaling our approach to harder tasks.

7.4.1 Larger Problem Instances

To investigate larger problem instances, we train on 20 targets for 100,000 updates, with one and two arms. The architecture we use can handle a varying number of targets without re-training. We investigated how well the model generalizes in this respect by evaluating the trained models on test sets with 5, 6, ..., 30 targets.


Figure 8: Solution length during training with different batch sizes on the 3D benchmark. Short solutions are better. All training runs were performed on the same hardware to compare training wall time.

We look in particular at how the relative performance compared to baseline methods such as Divide changes as we move further from the number of targets seen during training.
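The reason the trained weights carry over to different numbers of targets is that the decoder scores each target individually. The following is a generic, simplified pointer-style decoding step (plain dot-product attention with a mask over visited targets); it is a sketch in the spirit of the Nazari et al. style decoder, not our exact architecture, and all names are placeholders.

```python
import numpy as np


def next_target_distribution(decoder_state, target_embeddings, visited_mask):
    """One pointer-style decoding step over a variable number of targets.

    decoder_state:     shape (hidden,), current decoder hidden state
    target_embeddings: shape (num_targets, hidden), one embedding per target
    visited_mask:      shape (num_targets,), True for already-visited targets

    Because the score is computed per target and normalized with a softmax,
    the same weights apply regardless of how many targets there are.
    """
    scores = target_embeddings @ decoder_state          # (num_targets,)
    scores = np.where(visited_mask, -np.inf, scores)    # never revisit a target
    scores -= scores[~visited_mask].max()               # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()


# The same (random) weights work for 10 or 25 targets without retraining.
rng = np.random.default_rng(0)
state = rng.normal(size=128)
for n in (10, 25):
    emb = rng.normal(size=(n, 128))
    mask = np.zeros(n, dtype=bool)
    print(next_target_distribution(state, emb, mask).shape)
```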

7.4.2 More Realistic Distributions

As another aspect of scaling, we investigate how the model performs on targets from the Apple Distribution Generator. Since the models are trained on target samples from a uniform distribution, this gives us insight into how well the learned solution generalizes to the target distribution.

7.4.3 Physical Simulation

Lastly, we investigate the transfer to a more realistic scenario by evaluating the model trained on 10 targets in the Physical Simulation. There, we use the Apple Distribution Generator with an offset such that all targets lie within the robot arm's workspace. The focus here lies on how well the abstractly trained model handles the more realistic and noisy simulation, as an indicator of how well it transfers to the real-world robot.

8 Results

8.1 Configuration Search

Training in batches allowed us to utilize the GPU of our NVidia AGX Xavier. Training for ten updates with batch size 64 takes 25 s when only using the CPU; with GPU training enabled, the same ten updates take 16 s (56% faster). Training on the same 640 samples with batch size 1 takes 91 s on the CPU and is even slower when enabling the GPU. However, the number of updates needed until convergence grows with increasing batch size. Training with a single sample per batch showed good convergence behavior on the validation set, but the performance on the training samples was unstable. The largest batch size tested (128) was the most stable; however, despite more effective use of the GPU, convergence was slower than for smaller batch sizes. We found that a batch size of 64 worked best for our task and used it in all following experiments.

Figures 9 and 10 show the results of the reinforcement learning configuration search. We completed only one run for the PPO variants because the achieved performance was not promising enough and the training time increased drastically with the number of gradient steps. The training time for PPO with up to 80 gradient steps (purple line) was over four times longer (over 16 hours) than for the other configurations (around 4 hours).

We chose GAE and GAE + AN as winners of the first round and compared them in the 3D benchmark, as seen in Figure 10. In the 2D benchmark, advantage normalization seemed to help with early convergence. We do not see this effect in the 3D benchmark. GAE without advantage normalization shows both faster convergence and better solution quality by a small margin.


Figure 9: Solution length during training on the 2D benchmark. Short solutions are better. Plots are obtained by averaging three runs and the colored area shows one standard deviation. The PPO variants have only one run.

Figure 10: Solution length during training on the 3D benchmark. Short solutions are better. Plots are obtained by averaging three runs and the colored area shows one standard deviation.


Figure 11: Solution length during training on the 3D benchmark with different sample sizes. For smaller sample sizes all samples are revisited more often.

Figure 12: Solution length during training on the 3D benchmark with a sample size of 1024, with and without shuffling between passes over the samples.

Based on these observations, we decided to use A2C with GAE and no further components for all remaining experiments.

Figure 11 shows the results of training with different sample sizes. We see that a higher sample size increases the variance. We achieve the best performance within 20,000 updates with a sample size of 2^10 = 1024, which is neither the highest nor the lowest value tested. We did an additional test on how shuffling affects the training, see Figure 12. Both training runs use 1024 samples, and the run without shuffling presents the training samples repeatedly in the same order. The run without shuffling converged faster during the first few updates; however, both runs reached the same level of performance and variance towards the later updates.

8.2 10 target m-TSP

Our approach achieved the best and second-best performance on the test set in our main evaluation cases of 10 targets with one and two arms, respectively; see Table 5. Note that OR-Tools ignores collisions in the two-arm scenario, hence we consider its performance an estimate of the upper bound. Figure 13 shows the validation performance throughout training. Our approach surpasses Divide early on and converges steadily to OR-Tools-level performance.

Nearest Neighbour comes out as the strongest baseline, even in the two-arm case that Divide was designed for. In the one-arm scenario, Nearest Neighbour performs strongly, reaching 99% of the OR-Tools performance. For two arms, Divide's relative performance improves while Nearest Neighbour's decreases compared to the one-arm case, but their relative order remains the same.

The relative improvement of two arms over one arm depends on the environment variables grasp-time and velocity described in Section 6.1. Figure 14 shows the improvement for our approach and Nearest Neighbour. In the hypothetical case of a grasp-time of 0, our approach achieves only a 50% improvement with two arms. When the grasp-time is greater than 0, the improvement ratio increases with increasing grasp-time and velocity.


10 targets

Method              1 arm           2 arms
our approach        14.16 ± 0.52    8.03 ± 0.69
Random              16.94 ± 0.88    9.83 ± 1.41
Divide              15.60 ± 0.92    8.36 ± 0.51
Nearest Neighbour   14.25 ± 0.56    8.10 ± 0.72
OR-Tools            14.17 ± 0.50    7.71 ± 0.38

Table 5: Performance of our approach compared to other methods for 10 targets. We use the 100 sample test set in the Analytic Simulation to obtain these results. The environment variables are: velocity 1 m/s and grasp-time 1 s.

Figure 13: Validation performance of our approach compared to other methods while training on 10 targets with one and two arms.

For the highest tested values, it achieves an almost 90% improvement rate. Nearest Neighbour also shows an increasing improvement ratio with increasing grasp-time and velocity; however, the slope is much shallower, with improvements ranging from 73% to 78% across all tested environment configurations.

Figure 15 shows how the absolute performance changes as grasp-time and velocity are varied in the Analytic Simulation. As expected, the performance improves with higher velocity and lower grasp-time. However, different from the one-arm scenario, the change is non-linear. Our approach and Nearest Neighbour are affected differently by velocity and grasp-time, which leads to a shift in their relative performance with the environment variables; see Figure 16. For lower velocity and lower grasp-time, Nearest Neighbour performs better than our approach. When grasp-time or velocity increases, our approach does relatively better.

8.3 Scaling to Harder Tasks

We consider the following results a pilot study for further work; they would greatly benefit from additional computational resources and time.

8.3.1 Larger Problem Instances

In Figure 17, we see the validation performance throughout training on 20 targets. Our approach levels off before surpassing Nearest Neighbour. We make the same observation on the test set results in Table 6, where our approach beats Divide comfortably but performs worse than Nearest Neighbour. As expected, OR-Tools beats all other approaches for 20 targets as well. We also tested our trained models on numbers of targets different from those in the training set. For the 10-target model, the performance relative to our baselines remains stable between 8 and 12 targets; see the left graph of Figure 19. The tipping point where the model trained on 20 targets becomes better than the model trained on 10 targets is around 15 targets. From that point onward, the 20-target model does increasingly better. Relative to Nearest Neighbour and OR-Tools, the 20-target model maintains its performance from 18 targets upwards.


Figure 14: Relative improvement of using two arms instead of one for different environment settings. The improvement is calculated from the performance on the 10 target test set in the Analytic Simulation. The matrix on the left compares our approach for two arms with our approach with one arm. On the right we compare Nearest Neighbour for two arms and one arm.

Figure 15: Performance using two arms for different environment settings with 10 target test set in the Analytic Simulation. Left: our approach, Right: Nearest Neighbour. Lower values are better.

Figure 16: Performance difference between our approach and Nearest Neighbour for different environment settings. Computed on the 10 target test set in the Analytic Simulation.


20 targets

Method              1 arm           2 arms
our approach        27.27 ± 0.74    14.83 ± 1.01
Random              33.71 ± 1.14    17.64 ± 1.37
Divide              30.91 ± 1.18    16.09 ± 0.71
Nearest Neighbour   26.75 ± 0.62    14.45 ± 0.97
OR-Tools            26.44 ± 0.57    13.90 ± 0.33

Table 6: Performance of our approach compared to other methods for 20 targets. We use the 100 sample test set in the Analytic Simulation to obtain these results. The environment variables are: velocity 1 m/s and grasp-time 1 s.

Figure 17: Comparison of our approach on the m-TSP with 2 arms to other methods.


In Figure 20, we see how inference time increases with the number of targets. The OR-Tools inference time scales exponentially with the number of targets, reaching 989 ms per episode for 50 targets. For our approach, we used the same model size for all target counts. Its inference time also increases super-linearly, however at a significantly slower rate than OR-Tools; at 50 targets, our approach takes 149 ms for inference. Since there is no build of OR-Tools for the ARM architecture, we could not record the inference time on the Xavier. Instead, the inference times were recorded on a 2014 MacBook Pro using only the CPU (2.5 GHz quad-core Intel Core i7). The reported values are averages over 100 episodes.
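For completeness, the following is a minimal sketch of how such per-episode inference times can be measured; `solver` is a placeholder for either a forward pass of our model or a call to OR-Tools, and the dummy usage below does no real work.

```python
import time


def mean_inference_time(solver, episodes):
    """Average wall-clock seconds per episode over the given test episodes."""
    start = time.perf_counter()
    for episode in episodes:
        solver(episode)
    return (time.perf_counter() - start) / len(episodes)


dummy_episodes = [None] * 100                      # stands in for the test set
t = mean_inference_time(lambda episode: None, dummy_episodes)
print(f"{t * 1e3:.3f} ms per episode")
```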

Figure 18: Absolute performance with two arms of different models and methods over increasing number of targets.


Figure 19: Relative performance with two arms over increasing target number. On the left our model trained on 10 targets is compared to other methods and our model trained on 20 targets. On the right we compare our model trained on 20 targets with the others.

Figure 20: The graph shows how inference time scales with increasing number of targets. The time was recorded on a 2014 MacBook Pro using only CPU.
