Layout: typeset by the author using LaTeX.

Deep reinforcement learning for PID autotuning on a robotic arm

Leon Eshuijs 11866071

Bachelor thesis, Credits: 18 EC
Bachelor Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dr. E. Bruni
Institute for Logic, Language and Computation
Faculty of Science


Abstract

In recent years the combination of robotics and Artificial Intelligence has proven to be an evolving and promising research area. This paper presents an exploratory study into improving the movement of a robotic arm used for apple harvesting. The arm is optimized by using Reinforcement Learning to learn appropriate parameters for its control system, the PID controller. In a constructive manner, Actor-Critic learning is applied to the robot system with the goal of improving its functioning by finding appropriate parameters per motion, based on the location of the apple. It was concluded that, under the constraints of the complex environment, the algorithm improved the accuracy of the controller. Finally, evidence is established that gain finding based on goal locations has the potential to further increase performance and is a worthwhile pursuit.


Contents

1 Introduction
  1.1 Collaboration

2 Literature Review

3 Theoretical background
  3.1 The PID controller
  3.2 Reinforcement Learning
    3.2.1 Markov Decision Processes
    3.2.2 Actor-Critic

4 Set-up
  4.1 Construction of the robotic arm
  4.2 Control segmentation
  4.3 Applications of PID control
  4.4 Software
  4.5 Robot construction
  4.6 Path planning by OMPL
  4.7 PID types

5 Method
  5.1 Exploratory research
  5.2 Actor Critic
    5.2.1 Reward
    5.2.2 Activation

6 Experiments
  6.1 Preliminary Experiments
  6.2 Robot Control Hyperparameters
  6.3 Simulation constraints
  6.4 Sampled locations
  6.5 Further network specifications
  6.6 Evaluation methods
  6.7 Final Software Drawbacks

7 Results
  7.1 Single joint, single apple
  7.2 Two joints, single apple
  7.3 Two joints, multiple apples

8 Discussion

9 Conclusion

References


Acknowledgments

Throughout the project a number of people have provided crucial contributions. Firstly, I would like to thank my supervisors Elia Bruni and David Speck, as well as my colleague Vivien van Veldhuizen. Thanks to our combined brainstorming sessions and the guidance of the supervisors, the complex problem shifted from an overwhelming one to a structured one. Secondly, a special thanks goes out to the ROS community, in particular Gijs van der Hoorn, whose correspondence clarified the deeper mechanics of ROS and related software. Furthermore, the assistance of the HEBI Robotics team should be commended, given their auxiliary correspondence about the control and implementation of their actuators. Lastly, I would like to thank my colleague and

1 Introduction

The ubiquitous fear of Artificial Intelligence (AI) taking over jobs has penetrated many a workspace. Even though many types of manual labor were automated during the industrial revolution, the current combination of AI and robotics appears fruitful for automation in this branch yet again. This paper is part of an overarching research project into the use of robotic arms for apple picking. An important factor for the adoption of such systems is their speed in comparison to that of manual labor. Optimizing for speed, however, can cause unstable functioning and lead to critical errors. Dependable control systems are in place to prevent this from happening by adjusting the actuator motion throughout the execution of a planned trajectory. The control system evaluated in this paper is a PID controller, which is tuned to improve the accuracy of the motion that brings the robotic arm to and from an apple.

To achieve the desired behavior of the PID, its three gains, called the proportional, integral and derivative gain, are tuned. Their interdependent behavior makes tuning a single gain undesirable, and Deep Reinforcement Learning is applied to discover the optimal values. This learning is achieved by an Actor-Critic network, which is trained to find the right gains for the right movement, based on the predicted apple location. This learning goal is deconstructed into three sequential research questions to evaluate the progress of the setup. Firstly, is it possible to use Actor-Critic learning to find appropriate gains for an actuator? Secondly, can the motion of a robotic arm be improved by simultaneously tuning the gains of two actuators with a single Actor-Critic model? Finally, can the accuracy of the arm be further improved by training on multiple goal locations?

1.1 Collaboration

Due to external circumstances, this project was carried out by two students separately. The projects were kept independent primarily so that each student could write a separate paper. Nevertheless, the project is part of an overarching project, and collaboration was consequently necessary with the project leaders and the other student, Vivien van Veldhuizen. Therefore, at least the setup of the project (which was a significant part) was a joint effort. Furthermore, since the project specified implementing RL, the implementation of the Actor-Critic was also divided between us to assure efficacy. Notwithstanding this alliance, the interpretations and conclusions of the results and experiments are on an individual basis.

2 Literature Review

The control mechanism called PID is a common approach to the control loop problem. The first approach to setting the three gains of the PID was a method named after its inventors, Ziegler-Nichols (ZN) [1]. It was a great first step for auto-tuning, yet the PID still caused a lot of overshoot. Thereafter, intelligent systems were used to set the gains and they outperformed the ZN method. However, in real life PID controllers are influenced by the operational environment and complex dynamics, which require the gains to be re-tuned. This problem called for a more adaptive online method. Neural Networks have been shown to provide a good solution for autotuning [2]. However, as aforementioned, the system needs to be more adaptable to be suitable for real life. Genetic Algorithms have also had great success with PID tuning and are favored because they need little prior knowledge, yet they have been unable to realize adaptive real-time optimization [3].

Besides these approaches, many others have appeared that showed great accuracy, like the Whale Optimization Algorithm [4] and Particle Swarm Optimization [5]. However, important problems that persist are those of slow convergence, long execution time and overall inefficiency [6][7]. At present, Reinforcement Learning has been shown to overcome these problems. One method, called Q-SLP, used Q-learning to optimize the weights of the novel Swarm Learning Process (SLP) algorithm [7]. Another model, called Fuzzy MAS Q-learning, initializes the gains using the ZN method and uses multiple agents, one for each gain, for optimization [8]. The last important approach used the Asynchronous Advantage Actor-Critic (A3C) algorithm, whereby multiple agents with different sets of parameters work and measure independently of each other for a diverse and more complete result.

An interesting point is that all the recent works on PID gain tuning used some form of Actor-Critic learning or Q-learning, and contained a part where multiple agents were used. The presented works, however, were almost all based on online tuning, which given the complexity of the project was deemed, in hindsight justifiably so, too excessive for this research. The success of Reinforcement Learning methods was used as a basis and, because the gains form a continuous action space, the Actor-Critic architecture was used for the implemented network.

3 Theoretical background

To clarify the understanding of the experiments for the reader, a few techniques and their functions are presented. First, the most central part of the research, the PID controller, is explained. Following that, the inner workings of Reinforcement Learning are described to show why it is suitable for the project.

3.1 The PID controller

A block diagram of the PID controller is shown in Figure 1. The controller takes as input the error e(t), which is the difference between the goal value r(t) and the real value y(t) of a signal [9]. As output, the signal u(t) is created as a response to minimize the error. To accomplish this computation, it utilizes the three gains K_p, K_i and K_d, each of which contributes to a different part of the control. The interaction of all these terms is depicted in Formula 1:

u(t) = K_p e(t) + K_i \int_0^t e(t')\, dt' + K_d \frac{de(t)}{dt}    (1)

The proportional gain K_p causes the control to be directly affected by the error. To eliminate the offset error, the integral term with gain K_i is taken into account. The side effect of this term is that it increases the overshoot of the action, meaning that right after the error reaches zero, it increases in the opposite direction because of its momentum. The derivative term with gain K_d counteracts this overshoot and assures that the intensity of the action is proportional to the size of the error.
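As a concrete illustration of Formula 1 (not the controller implementation used on the robot), a discrete-time PID can be sketched in Python; the gain values, time step and plant model below are placeholders.

```python
class PID:
    """Minimal discrete PID: u = Kp*e + Ki*integral(e) + Kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        # Proportional term reacts to the current error directly.
        p = self.kp * error
        # Integral term accumulates past errors to remove steady-state offset.
        self.integral += error * dt
        i = self.ki * self.integral
        # Derivative term damps the response and counteracts overshoot.
        d = 0.0 if self.prev_error is None else self.kd * (error - self.prev_error) / dt
        self.prev_error = error
        return p + i + d


# Illustrative use: drive a measured value towards a setpoint of 1.0.
controller = PID(kp=15.0, ki=0.0, kd=1.0)
y, dt = 0.0, 0.01
for _ in range(100):
    u = controller.update(error=1.0 - y, dt=dt)
    y += u * dt  # stand-in for the actuator/plant response
```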

3.2 Reinforcement Learning

Reinforcement Learning (RL) is a Machine Learning method concerned with finding the best action for a given state. In contrast to many other forms of Machine Learning, the action is not instructed, nor given as an example; rather, it is discovered during training [10]. An agent chooses an action to optimize the reward, and this decision is based on its policy function. After the reward is received, the policy function is updated to improve future rewards.

Another popular area of Machine Learning is supervised learning, where decisions are learned based on labeled examples. In its first stages, the progress of RL was constrained by its limited ability to scale. This was mainly due to the look-up table that was used for state-action pairs; these are fine for small or discrete state spaces, but when the state space becomes huge or continuous, look-up tables become too inefficient or inappropriate [11]. To overcome this scalability issue, generalization techniques from supervised learning were used, resulting in function approximations. Because of its general applicability, RL has excelled at tasks where the probabilities and rewards are unknown [10]. This skill arises from its special way of learning by interaction between the agent and the environment. Inspired by biology, its reward-based learning imitates the associative processes in animal behavior [12]. Its roots can be traced back to Thorndike's Law of Effect, which describes a positive relationship between the gratification experienced after a behavior and the frequency of that behavior thereafter [13]. In RL, positive and negative reinforcement function as pleasure and pain respectively.

A conundrum that arises in RL but is hardly present in other forms of learning is the exploration-exploitation trade-off. If an agent were to take the seemingly optimal action every time, it would likely never achieve a good policy. To make sure that local optima are surpassed and the global optimum is found, Reinforcement Learning algorithms have an exploration policy. One common method is the epsilon-greedy algorithm, where random actions are sometimes taken to ensure exploration. The frequency of the random actions decreases over time, when it is expected that the algorithm has explored a considerable part of the action space [14].
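A minimal sketch of the epsilon-greedy rule described above; the Q-values, action set and decay schedule are illustrative and not taken from this project.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

epsilon = 1.0
for step in range(1000):
    action = epsilon_greedy(q_values=[0.1, 0.5, 0.2], epsilon=epsilon)
    epsilon = max(0.05, epsilon * 0.995)  # decay exploration over time
```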

3.2.1 Markov Decision Processes

The mathematical framework Reinforcement Learning agents use for describing the decision-making process is a Markov Decision Process (MDP). Such a process is formalized by specifying the set of states S, the set of possible actions A, a state transition function P(s'|s, a), which specifies the probability of each next state given the current state and action, and lastly the reward function R(s'|s, a) for a state, action, new-state combination.

A premise of an MDP is that the next state and reward depend only on the current state and action, and not on the history; it is therefore considered a memoryless property [15]. This key assumption is called the Markov Property and is formally expressed as:

P(s_{t+1}, r_{t+1} \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0, r_0) = P(s_{t+1}, r_{t+1} \mid s_t, a_t)    (2)

Furthermore, nearly all RL algorithms are based on estimating value functions [10]. These functions estimate how desirable it is to be in a certain state. This desirability is expressed by the future rewards that can be expected from that state. Consequently, these future rewards depend on which action the agent takes, i.e. on the policy function. The quintessential properties of any RL algorithm are then its policy function, defined as \pi(a, s), and its value function, formalized as V^\pi(s).


3.2.2 Actor-Critic

Actor-Critic architectures are structured to represent the policy function independently of the value function. The policy function is embedded in the Actor and the value function in the Critic. This differs from other common Reinforcement Learning methods like Q-learning, where each state's action-value, or Q-value, is calculated. The policy gradient technique that underlies the Actor-Critic employs a more direct method by finding which action the agent should prefer, instead of finding the value of each action [10]. Because the state-value is not directly reliant on the action, the actor and critic components can be executed in parallel. In the classical assembly, both the actor and the critic have their own neural network.

Actor-Critic networks are model-free, which means that they do not model the environment, nor do they consider how it is affected by their actions. This way of learning can simplify tasks and thereby decrease computation. Actor-Critic networks have achieved great success on tasks where an agent needs to take actions in an environment where the reward is only acquired at the end [10]. This is achieved by distinguishing the relevant state-action pairs through a method called the Temporal Difference error [10]. The Temporal Difference error has primarily been renowned for its successes in online Actor-Critic networks. What makes these algorithms online is that multiple actions are executed throughout a training episode. In this paper, however, an offline method is applied whereby one action is chosen, to ascertain the potential of the network and pave the way for the implementation of an online one. Before the inner workings of the used Actor-Critic network are explained further, the implementation and setup of the experiments are outlined. This way the connection between the setup and the algorithm is clarified.

4 Set-up

The PID tuning of the actuators in this project is implemented on a robotic arm designed for fruit harvesting. Although the training and testing on this arm are done in simulation, the simulation is based on a real version that resides in Italy. The results and findings are therefore asserted to be comparable to those that would be obtained if the experiments were run on the physical arm.

Figure 2: The robotic arm and gripper simulated in Gazebo, with the HEBI joints in red and the gripper in grey. (a) The entire robotic arm; (b) the gripper.

4.1 Construction of the robotic arm

The robot arm can be divided into three parts: the rail, the hebi-arm and the gripper. The distinction between these groups is mainly based on their type of actuators. The rail group connects the hebi-arm to the rail via a special actuator called the prismatic joint, which regulates the vertical movement of the entire arm. Even though the actuator of the prismatic joint is, like the gripper servomotors, of the DYNAMIXEL brand (http://www.robotis.us/dynamixel/), its function is very different and worth evaluating separately. The hebi-arm actuators are made by the company HEBI Robotics (https://www.hebirobotics.com) and are accompanied by many software options (https://docs.hebi.us/tools.html#) to optimize their motors. Two such actuators make up the hebi-arm, and they rotate to create the horizontal movement that gets the gripper to the right place. The first one, called the shoulder joint, is attached to the rail and can turn 360 degrees around it; the elbow joint, on the other hand (pun intended), cannot move a full 180 degrees, in order to prevent possible collisions of the gripper with the arm. Lastly the gripper: this group comprises a horizontal wrist joint, followed by a rotational wrist joint, and finally two finger joints to clamp the apple. In real life the gripper will be extended with a soft flexible material, so the gripping motion does not have to be a delicate movement. The dimensions, masses and control functions of the actuators and materials are all based on the real robotic arm to increase the utility of the results.

4.2 Control segmentation

Although the construction of the entire arm has been examined, another segmentation should be established for the control. Since the project aims to improve the movement of the arm, this movement will in practice consist of multiple sequential movements. The first one is the movement of the arm to get the gripper to the right place, after which the gripper opens, and so on. It is clear that the latter does not depend on the location of the apple, i.e. this motion is the same for every apple. Accordingly, the former motion is performed by the actuator set consisting of the prismatic joint, the shoulder joint, the elbow joint and the horizontal wrist joint. These actuators are the relevant ones that are part of the optimized movement in this experiment and, for clarity, will be referred to as the Arm group. As mentioned, the HEBI actuators use controllers that provide extensive software options. For the optimization these proved key, and retrieving comparable data from the other actuators was not attained. Accordingly, the only two actuators that were tuned by the network were the two HEBI actuators.

4.3 Applications of PID control

For clarity of the project, a final specification of the location of the PID controllers inside the robot should be noted. Even though all PIDs in the robot are part of a control system that minimizes a control signal, it should be established which PID controllers are referred to in this paper. The first kind is a lower-level controller with an elementary function, like the regulation of the current that is fed into the actuator; here the control system assures the current is converted correctly for the motor. The place of this controller inside the entire system is illustrated in Figure 3, where in the real robot block a subsection named Embedded Controllers is shown. The PIDs considered in this research are the controllers that are used for the execution of a motion plan. The place of these PIDs in the robot system is shown in the yellow Controller block of Figure 3, and they are directly controlled through the follow_joint_trajectory interface. In this instance, the error consists of the difference between the planned and the actual state of an actuator; these states, depending on the hardware implementation of the PID, are expressed in position, velocity or effort (i.e. torque).

Figure 3: ROS control system schematic

4.4 Software

One of the most essential pieces of software used for the implementation of the arm is the Robot Operating System (ROS, http://wiki.ros.org/melodic). This is a collection of software packages designed to simplify the creation of complex and robust robot systems. ROS uses a graph-like architecture where different parts of the system can cluster together to function as a node. These nodes structure the system and make it easy to transfer functionalities. A node communicates with other nodes by passing messages over topics, to which it subscribes or publishes. Figure 4 illustrates the inner workings of ROS for our project: here the /move_group node communicates with the /gazebo node through the topic /rail_controller/follow_joint_trajectory/goal to communicate the movement of the prismatic joint to the simulator. The simulation is run using the Gazebo simulator (http://gazebosim.org). This 3D simulator uses a high-performance physics engine to calculate the physical constraints on the robot and simulates the given environment.

Figure 4: ROS Nodes and Topics of the project
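For readers unfamiliar with ROS, the publish/subscribe pattern behind the nodes and topics in Figure 4 can be sketched as follows; the topic name and message type are illustrative (the actual trajectory topics use control_msgs action messages rather than a plain String).

```python
import rospy
from std_msgs.msg import String

def callback(msg):
    # Called every time a message arrives on the subscribed topic.
    rospy.loginfo("received: %s", msg.data)

rospy.init_node("example_node")
pub = rospy.Publisher("/example_topic", String, queue_size=10)
rospy.Subscriber("/example_topic", String, callback)

rate = rospy.Rate(1)  # publish at 1 Hz
while not rospy.is_shutdown():
    pub.publish(String(data="hello"))
    rate.sleep()
```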

Another important software package is MoveIt! (https://moveit.ros.org). This computes and executes the movements of the robot from the current to a goal position. The Python commander package of MoveIt! communicates with the robot through a ROS interface on topics like the aforementioned /follow_joint_trajectory subtopic. The messages on this topic are handled by the FollowJointTrajectory controller, which is used for all the actuators in the setup. Moreover, the MoveIt! setup assistant was used to complete the files for the connection of the arm to the gripper and to subdivide the arm into useful groups. The last important program that was employed is Rviz, which connects with MoveIt! for the execution of simple predefined poses and visualizes planned movements. Rviz played a central role in the construction of the arm because it allows for a clear overview of the setup and provided an analysis of the connections and joint control.

4.5 Robot construction

The schematics of the robot are stored in a Unified Robot Description Format (URDF) file, which is written in the Extensible Markup Language (XML). In the URDF, the robot is defined as a set of separate pieces, called links, and their connections, known as joints. For every link, all the physical properties like shape, mass and collision boundaries are specified. For the joints, the kinematic and dynamic properties are noted and comprise elements like the type of joint (e.g. rotational, prismatic), the parent and child link, and the places where the joints attach. For a URDF to be applicable to Gazebo, additional constraints have to be defined, such as friction coefficients for the links. Gazebo then converts the URDF to a Simulation Description Format (SDF).

The MoveIt! setup assistant reads the URDF and requests additional settings, e.g. which joints form groups. It constructs the final .launch files, which start the Gazebo world, the robot in it, as well as a control structure. Along with the .launch files, the configuration files are created and saved as .yaml files. These files specify the control settings as well as kinematic constraints, all of which are gathered from either the original URDF files or the additional settings.

4.6 Path planning by OMPL

The standard motion planner of MoveIt!, the Open Motion Planning Library (OMPL), was used. It is purely a kinematic planner, which only generates paths, i.e. without timing information, and it is a probabilistic method that samples trajectories. This means that, from the same start and end point, the found paths might differ. This way of trajectory planning can create noise, whereby the same set of gains can be evaluated on different paths and thus receive different rewards. Early experiments, however, showed that using the same path for one set of start and end points is undesirable, because a suboptimal path can favor specific gains and therefore reduce the generality of the results. Fixing the path also creates the problem of making sure the fixed paths are suitable. Nevertheless, the overall difference between sampled paths appears minimal, since the movements are often straightforward. Therefore, the path was re-planned for every iteration.
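A sketch of how such a plan can be requested through the MoveIt! Python commander; the planning group name and target coordinates are placeholders, not the project's actual configuration.

```python
import sys
import rospy
import moveit_commander

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("plan_example")

# "arm" is a placeholder for a planning group defined with the MoveIt! setup assistant.
group = moveit_commander.MoveGroupCommander("arm")

# Ask OMPL for a path to an (x, y, z) goal; because the planner samples,
# repeated calls from the same start and end point may yield different paths.
group.set_position_target([0.0, 0.625, 0.5])
success = group.go(wait=True)   # plan and execute in one call
group.stop()
group.clear_pose_targets()
```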

4.7 PID types

The resulting trajectory is a sequence of waypoints. For each waypoint the expected position, velocity, acceleration and effort are defined. All these different expected values allow for different types of trajectory controllers. ROS provides a generic controller for this, called the joint_trajectory_controller, which supports multiple hardware interface types, namely position, velocity and effort controlled joints (http://wiki.ros.org/joint_trajectory_controller). For the joint_trajectory_controller, the expected value that is compared in the PID is the position. When this trajectory controller is position controlled, the expected value is per definition equal to the real value; this results in a constant zero error, and the influence of the PID is expunged. Therefore, the effort controlled joint_trajectory_controller was used for the DYNAMIXEL joints, as it is the default set by the MoveIt! setup assistant.

The HEBI joints that were used contain several unique PID control setup options, whereby all three of the mentioned control inputs are combined. Figure 5 shows control strategy 3, which was chosen for its relatively clear structure of PID combination. To generalize the results of the experiments, two of the three PIDs were eliminated by setting their gains to 0. Contrary to the joint_trajectory_controller, the HEBI controller is influenced by the position control, and this PID was selected as the input control for the joint. This selection was slightly arbitrary, but the default gains provided by HEBI Robotics for the position control were set far higher than those for velocity or effort. From this arguable evidence it was inferred that the HEBI Robotics team considered the position control of greater value. Moreover, early experiments showed a substantial amount of unstable movements when using the other two PIDs, which would further complicate the experiments. Also contrary to the joint_trajectory_controller, the HEBI joint contains its own trajectory controller.

Figure 5: Hebi PID, control strategy 3


5 Method

5.1 Exploratory research

Even though previous works have substantiated that dynamically changing the gains improves PID functioning, and that the optimal gains can change due to the dynamic nature of the processes involved, this research establishes whether different gains are optimal for different movements. This hypothesis seems especially valid given the mechanics of the arm, where joints further from the base are turned while under the motion and acceleration exerted by the joint closer to the base. Overshoot of the arm thus becomes a bigger problem due to these external forces on the joints. By using the same neural network for multiple actuators, we expect the system to account for these interdependent forces and find the optimal gains for each actuator.

5.2 Actor Critic

The applied Actor-Critic method consists of two neural networks, one for the actor and one for the critic. The input layer of each consists of three floats that represent the goal location. These nodes are fully connected to the ensuing hidden layer, which consists of 256 nodes. The last layer is also fully connected and consists of one node for the critic network, and of two nodes per gain for the actor network.

The networks are implemented using PyTorch 1.4.0; this version is the newest release that still works with Python 2, which was required since ROS as well as the MoveIt! Python packages currently only work with Python 2.

The PID gains are not output by the network directly; instead, for each gain the network outputs a mean and a variance. The gains are then estimated by sampling from the Gaussian distribution constructed from this mean and variance. This method of parameterizing the output with a probability distribution is used to include the exploration paradigm. One advantage of using this method instead of epsilon-greedy is that over time the distribution becomes narrower and eventually near-deterministic, whereas epsilon-greedy always keeps an ε probability of selecting a random action.
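A minimal PyTorch sketch of the networks described above. The layer sizes (3 inputs, 256 hidden units, two outputs per gain for the actor and one output for the critic, matching experiment 1 with a single joint) follow the text; the hidden activation, the softplus used to keep the variance positive, the mean/variance layout of the output and the scaling constants are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a 3D goal location to a mean and variance per PID gain."""

    def __init__(self, n_gains=3, hidden=256):
        super().__init__()
        self.hidden = nn.Linear(3, hidden)
        self.out = nn.Linear(hidden, 2 * n_gains)   # one mean and one variance per gain

    def forward(self, goal_xyz):
        h = torch.relu(self.hidden(goal_xyz))       # hidden activation is an assumption
        mean, raw_var = self.out(h).chunk(2, dim=-1)
        var = nn.functional.softplus(raw_var) + 1e-5  # keep the variance positive
        return mean, var

class Critic(nn.Module):
    """Maps a 3D goal location to a single state-value estimate."""

    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, goal_xyz):
        return self.net(goal_xyz)

# Sample gains for one goal location: the sigmoid bounds the action to (0, 1),
# after which it is scaled by a per-gain constant (the scales here are placeholders).
actor, critic = Actor(), Critic()
goal = torch.tensor([0.0, 0.625, 0.5])
mean, var = actor(goal)
dist = torch.distributions.Normal(mean, var.sqrt())
raw_action = dist.sample()
gains = torch.sigmoid(raw_action) * torch.tensor([30.0, 1.0, 2.0])
log_prob = dist.log_prob(raw_action).sum()
```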


The way the Actor-Critic network updates its functions shows that it is a policy gradient method. Firstly, it computes the eligibility vector, also known as the log probability ∇ log π(a, s). The eligibility vector is part of the policy gradient algorithm REINFORCE. In the basic REINFORCE algorithm, the policy function is updated using this derivative of the log probability and the reward [10]. High rewards, however, increase the variance of the update, which results in noisy gradients and causes erratic learning. An improved version of this algorithm subtracts a baseline from the reward, which reduces the size of the gradients. These smaller gradients reduce the variance and consequently increase the learning speed. In Actor-Critic methods this baseline is based on the critic value, and in the one-step algorithm the Temporal Difference (TD) error, as shown in Figure 6, is as follows:

\delta = \text{reward} + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)    (3)

The "one-step" in the algorithm name refers to the fact that only the value of the first next state, V^\pi(s_{t+1}), is taken into consideration, discounted by a factor γ. But as mentioned, our algorithm consists of one action; the second state is therefore always the terminal state, after which no new reward is received. Moreover, the training is an episodic task, since the terminal state is the same for every starting state, namely the home position of the arm. Formally the model can thus be regarded as a zero-step episodic task, but it is more generally defined as a single-action Actor-Critic algorithm. Because there is only one action and a fixed end state, the bootstrap term of the error falls away, and the Temporal Difference error and corresponding loss functions become:

\delta = \text{reward} - V^\pi(s_t)    (4)

\text{actor\_loss} = \delta \times \nabla \log \pi(a, s)    (5)

\text{critic\_loss} = \delta^2    (6)

In Algorithm 1 the entire setup of the training and testing of the network is depicted. Here, the update functions are reflected in the network.learn() function that is called. Moreover, the negated accuracy of the movement is supplied, since this is the reward the network tries to maximize.
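A sketch of how the update inside network.learn() could look for this single-action case, following Equations 4 to 6; negating the actor loss (because PyTorch optimizers minimize) and using two separate Adam optimizers are assumptions of this sketch.

```python
import torch

def learn(actor_opt, critic_opt, value, log_prob, reward):
    """One update of the single-action Actor-Critic, with delta = reward - V(s)."""
    delta = reward - value                      # Equation 4: no bootstrapped next state

    actor_loss = -delta.detach() * log_prob     # Equation 5, negated so minimizing ascends
    critic_loss = delta.pow(2)                  # Equation 6

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

# Example wiring with the Actor/Critic sketch above and the experiment 1 learning rates:
# actor_opt  = torch.optim.Adam(actor.parameters(),  lr=5e-4)
# critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-4)
# learn(actor_opt, critic_opt, critic(goal), log_prob, reward=-accuracy)
```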

Algorithm 1: Single-action Actor-Critic algorithm

  initialize ActorCritic network
  for every apple_coordinate do
      reset_simulation()
      gains = ActorCritic(apple_coordinate)
      set_gains(gains)
      accuracy = run_simulated_movement()
      network.learn(-accuracy)
  end

5.2.1 Reward

The PID enforces that the planned trajectory is followed as accurately as possible, yet as illustrated in Figure 7 there is always some error. The top image of Figure 7 shows the planned and executed trajectory of the shoulder joint, and the lower image shows the resulting error between them. The HEBI Python API was used to log the performance of the two trained HEBI joints during a learning iteration, from which the cumulative error was computed. The absolute value of the errors was used, so that positive and negative errors cannot cancel each other out. Even though Figure 7 shows continuous lines for the plots, the real data consists of a sequence of data points. To ensure that the influence of the sampling frequency on the reward is limited, an interpolation method is applied between the points. To finalize the accuracy measurement, and thus the reward, the error is summed by integrating the interpolated absolute error over the executed trajectory. The reward is then obtained by negating this integrated, interpolated absolute error.

Figure 7: Top: the planned and executed paths in radians for the shoulder (J1) joint. Bottom: the position error between the paths in radians.
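A sketch of this reward computation: interpolate the logged absolute error onto a regular time grid and integrate it. The NumPy-based function below is illustrative and not the project's exact implementation.

```python
import numpy as np

def trajectory_reward(t, planned, executed, n_samples=1000):
    """Negated integral of the absolute position error between planned and executed paths."""
    error = np.abs(np.asarray(planned) - np.asarray(executed))
    # Interpolate onto a regular grid so the sampling frequency of the log
    # has limited influence on the integral.
    t_uniform = np.linspace(t[0], t[-1], n_samples)
    error_uniform = np.interp(t_uniform, t, error)
    integrated = np.trapz(error_uniform, t_uniform)
    return -integrated  # a higher (less negative) reward means a more accurate motion
```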

5.2.2 Activation

The action sampled from the probability distribution is passed through the sigmoid activation function, as shown in Figure 8. By using the sigmoid, a continuous action space is created with boundaries 0 and 1. Because it is a continuous space, the output for each gain is scaled by multiplying it with a constant. Even though the sigmoid limits the search space, it was chosen over activation functions that permit unbounded search spaces, like the ReLU function. The reason is that unbounded action spaces have been known to cause problems like exploding gradients [17], which make the training unstable.

Figure 8: The sigmoid activation function before scaling

6 Experiments

For clarity of the experiment descriptions, some final definitions are established: the starting position and the goal position will be referred to as the basket and the apple, respectively.

To answer the research questions, three different sets of experiments were evaluated. The first experiment evaluated whether the Actor-Critic network was applicable for gain finding of individual joints for one apple position. The second experiment evaluated the combined tuning of the shoulder and elbow joint for one apple position. Lastly, once the combined tuning was established, the algorithm was trained for both actuators and a set of 100 apples.

6.1 Preliminary Experiments

At the start of the project, the construction of the robotic arm in simulation was still a work in progress. To accommodate this, as well as the time-intensive nature of the physics simulation, the applied Actor-Critic network was initially constructed and fine-tuned on a functionally similar application. To this end, the algorithm was developed during preliminary experiments on the OpenAI Gym lunar landing environment, in a PID-controlled variant (https://github.com/wfleshman/PID_Control). At first, a discrete action space was used, where the algorithm was trained to either increase or decrease one of its gains by a fixed amount. After learning capabilities were established, the algorithm was successfully applied using a continuous action space.

6.2 Robot Control Hyperparameters

Since the PID is responsible for following the planned trajectory, its efficacy is greatly dependent on the planned trajectories themselves. For this reason, the control settings that directly influence the planned trajectory of the robotic arm should be considered. Accordingly, the maximum velocity and acceleration were inspected and tested over different ranges. Increasing these constraints, however, proved problematic, because the speed boundaries are also used by the trajectory execution to determine successful execution. Consequently, increasing the speed limits to speed up the planned trajectory entails a premature termination of the trajectory execution. This happens because the broad speed limits expand the boundaries within which the execution control considers the motion successfully finished. This deficiency of the HEBI trajectory controller was brought to the attention of HEBI Robotics and is scheduled to be addressed in an upcoming software version (https://forums.hebi.us/viewtopic.php?f=12&t=68). The acceleration maximum especially had a great influence on this issue, and it, as well as the velocity maximum, was therefore left at the default values supplied by HEBI Robotics for both joints: an acceleration of 1 radian/sec² and a velocity of 30 radians/sec. The other, more general control settings that were modified are the trajectory execution parameters. These adjustments are more straightforward: the allowed_execution_duration_scaling and allowed_goal_duration_margin were changed from 1.2 and 0.5 respectively to 10 for both. This modification prevents runs with faulty gains from being aborted and contributes to a consistent learning approach.

6.3 Simulation constraints

Since the simulation runs continuously, physical changes inside the simulation accumulate, which can interfere with the experiments. To this end, multiple aspects of the simulation were evaluated and appropriate measures were taken. One of the issues is the temperature build-up inside the HEBI joints. The feedback of these joints showed the temperature increasing throughout training. The documentation of the HEBI control, however, showed that the control is only actively limited above a certain temperature. The Python HEBI API provides a flag that indicates when this boundary is surpassed, and further movements are halted until the flag has disappeared. A second and more important limitation of the continuous running is crash control. Throughout training there were considerable occurrences of aborted motions. The reason these motions could not go through was that the arm was not steady when the next motion was attempted. These errors were first discovered when the maximum velocity and acceleration were varied, as explained in Section 6.2. During those experiments a movement could be marked as succeeded prematurely, and the next movement was thus attempted while the arm was still in motion. The persisting aborted motions were suspected to be due to crashes of the robot arm, after the error was replicated by setting high proportional and integral gains and a derivative gain of zero. With this set of gains the overshoot becomes so large that the trajectory control can malfunction and the arm starts spinning uncontrollably. After such a crash, subsequent movements are more prone to crash, independent of the new gains. It was, however, also observed that the abort error can happen with arbitrary gains at any time during learning, without a previous crash. Figure 9 shows a trial run of different gain combinations and their resulting rewards, where a canceled movement received a reward of -3. Neglecting any convergence of the gains, it is apparent that the abort error can occur throughout the gain distribution; nonetheless, patterns in the crashes are not excluded.

Although crashes might be due to the prolonged simulation time, they are still an important part of the research. The simulation allows for exploration of gain areas that could lead to harmful behavior of the system, but which might surround other promising gains.
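A sketch of the reset_simulation() step from Algorithm 1, using the standard Gazebo ROS services; whether the project resets only the world state or the full simulation is an assumption, and the function presumes a ROS node has already been initialized.

```python
import rospy
from std_srvs.srv import Empty

def reset_simulation():
    """Put the Gazebo world back into a known state between learning iterations."""
    rospy.wait_for_service("/gazebo/reset_world")
    reset = rospy.ServiceProxy("/gazebo/reset_world", Empty)
    reset()
```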

Figure 9: All three PID gain combinations and the distribution of crashes (purple).

6.4 Sampled locations

Because the prismatic joint and the horizontal wrist joint could not be added to the tuning within this project, the apples were set at the same height as the basket, 0.5 meters from the ground. For all three experiments, the sampled apple locations were based on the range of locations in which the real robotic arm will find apples. Figure 10 shows this area in green and the radius of the robotic arm in red. The rectangle of possible apples is 1 meter wide and 0.25 meters deep, starting 0.5 meters from the rail center. The basket location for the real arm is still undetermined. However, early experiments showed that if the arm is bent at the basket position, the angle of this bend can change following a crash, since the trajectories are calculated every iteration for a coordinate and not for a specific joint configuration. Such a crash can change the arm at the basket position from a left-handed to a right-handed configuration, which can interfere with the estimation of the value function. The simplest solution was to straighten the arm by setting the basket position at the maximum range along the x axis. The apple location for experiments 1 and 2 was chosen in the middle of the area (i.e. x=0.0, y=0.625). For the last experiment, 100 samples were randomly chosen from the apple area.

Figure 10: Locations within range of the arm (red), and the basket and sampled apples (green). The green area is 1 meter wide and extends from 0.5 meter to 0.75 meter from the pole center.

6.5 Further network specifications

Possibly the most important hyperparameter of the two networks is the learning rate. Even though good values were found after several training runs on the preliminary OpenAI Gym environment, the learning rates proved an important aspect for this setup as well. The optimizer used for both networks was Adam, which has a default learning rate of 0.001. However, the noise that arises due to the aborted movements complicates training; to limit its interference, a lower learning rate proved more robust. Extensive tuning, however, was limited by the computational cost of the training. At least 1000 steps were deemed sufficient for learning in experiment 1, but this demanded on average 2 hours of training time. For experiment 1, a learning rate of 0.0005 for the actor and 0.0001 for the critic worked best. For experiments 2 and 3, lowering these by a factor of 10 gave the best results. A summary of the learning rates and the other most essential specifications of the network for the three experiments is shown in Table 1.

               Output Actor  Output Critic  Epochs  Batch size  Learning rate Actor  Learning rate Critic
Experiment 1        6              1         1000        1           0.0005               0.0001
Experiment 2       12              1         3000        1           0.00005              0.00001
Experiment 3       12              1          100      100           0.00005              0.00001

Table 1: Network specifications for the three experiments. Experiment 1: training a single actuator on one apple coordinate. Experiment 2: training two actuators on a single apple coordinate. Experiment 3: training two actuators on 100 apple locations. The output of the networks is given in number of nodes.

6.6 Evaluation methods

Because this research is exploratory, experiments 1 and 2 are trained on only one apple and thus one state; evaluation on unseen test states is therefore not relevant there. The network outputs and resulting rewards are, however, compared to those obtained with baseline gains. As mentioned, the HEBI controller combines three different types of PIDs, so the default values for just the position controller might rely on the other PIDs being active. Nevertheless, the position proportional gains were kept at their default values of 15 for the shoulder joint and 30 for the elbow joint. The integral and derivative gains were decided after an inspection similar to the ZN method, whereby the PID response was monitored and the gains were increased until they either stopped improving or performed worse. For the integral gains no improvement was found, so these were left at 0; the derivative gains stopped improving after being set to 1 for both joints. To clarify comparison with the results, these baseline gains are shown in Table 2.

            Kp   Ki   Kd
Shoulder    15    0    1
Elbow       30    0    1

Table 2: Baseline gains for the position PIDs of the shoulder and elbow actuators
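As an illustration, the baseline gains of Table 2 could be written to the actuators through the HEBI Python API roughly as sketched below; the family and module names are placeholders, and the gain field names (position_kp and friends) are written from memory of that API, so they should be checked against the HEBI documentation.

```python
import hebi

# Placeholders: family/module names of the two tuned actuators.
lookup = hebi.Lookup()
group = lookup.get_group_from_names(["Arm"], ["Shoulder", "Elbow"])

cmd = hebi.GroupCommand(group.size)
cmd.position_kp = [15.0, 30.0]   # proportional gains (shoulder, elbow)
cmd.position_ki = [0.0, 0.0]     # integral gains
cmd.position_kd = [1.0, 1.0]     # derivative gains
group.send_command_with_acknowledgement(cmd)
```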

6.7 Final Software Drawbacks

After extensive training for experiments 2 and 3, the Actor-Critic model started outputting Not a Number (NaN) values for all the gains. During training of experiment 2 this malfunction went unnoticed and was assumed to be due to the extensive training or the physics simulation constraints. However, during training of experiment 3 the learning sped up, and the issue became an apparent problem. The NaN output happened right after the output converged, when some of the gains converged to their bounds. The sigmoid assures that the maximum bound is only approximated when all the input nodes of the output layer reach their limit. It is therefore believed that exploding gradients occurred, whereby the sigmoid is overburdened and outputs a NaN. To counteract this issue, gradient clipping was applied, which means the gradients cannot be updated past a predefined bound. After applying this technique, however, another complication appeared: the execution of the movements was aborted significantly more often when gradient clipping was applied. No connection was found between the gains chosen with or without clipping, and a connection between the premature finishing of movements due to the HEBI controllers and gradient clipping appears very unlikely. A tentative hypothesis is that the clipping consumed extra computation power, which in turn interfered just enough to disturb the error-prone HEBI controllers. This drawback made longer training impossible, and experiment 3 therefore reached a maximum of 9 instead of 100 epochs over the 100 apples.
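Gradient clipping as mentioned here can be sketched with the standard PyTorch utility; the clipping bound and the stand-in model are placeholders, as the thesis does not state the values that were used.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 6)                        # stand-in for the actor network
loss = model(torch.randn(1, 3)).sum()          # stand-in loss
loss.backward()
# Clip the gradient norm before the optimizer step, to counter the exploding
# gradients suspected of driving the sigmoid output to NaN.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # placeholder bound
```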

7 Results

7.1 Single joint, single apple

The learning process of the networks for the tuning of the two actuators separately is displayed in Figures 11(a) and 11(b). The red lines in each figure show the average error over 100 runs using the baseline gains. For the shoulder joint this error was 0.17995 with a maximum deviation of 0.01375, and for the elbow joint it was 0.04189 with a maximum deviation of 0.00480. Given these deviations, the averages were deemed representative enough.

Figure 11: The reward received over 1000 steps, aborted movements excluded. (a) Shoulder joint; (b) elbow joint.

7.2 Two joints, single apple

Figure 12 displays the reward received over 3000 steps. As a comparison, the red line displays the average error using the baseline gains. This average error was calculated during a trial of 100 steps, during which 2 crashes occurred. Not taking the crashes into account, the average reward of this run was approximately -0.2032, with a maximum deviation of 0.013175.

Figure 12: The reward of the trajectory to and from the apple; crashed iterations not included.

In Figures 14 and 16 the convergence of the gains throughout the training session is shown. For evaluation of this convergence, the rewards for the applied gains are shown in the figures above them, i.e. Figures 13 and 15 respectively. For Figures 13 and 15, the aborted runs, which were awarded a reward of -3, are not shown, to enhance focus on the rest of the plot. There were 316 of these crashes throughout the training session. For Figures 14 and 16, these aborted runs were not removed, to show the transition of the gains.

Figure 13: Experiment 2: rewards as a function of the gains for the shoulder joint; crashes left out. Panels: (a) reward over Kp gains, (b) reward over Ki gains, (c) reward over Kd gains.

Figure 14: Experiment 2: chosen gains over steps throughout training for the shoulder joint; crashed iterations included. Panels: (a) Kp gains over steps, (b) Ki gains over steps, (c) Kd gains over steps.


Figure 15: Experiment 2: rewards as a function of the gains for the elbow joint; crashes left out. Panels: (a) reward over Kp gains, (b) reward over Ki gains, (c) reward over Kd gains.

Figure 16: Experiment 2: chosen gains over steps throughout training for the elbow joint; crashed iterations included. Panels: (a) Kp gains over steps, (b) Ki gains over steps, (c) Kd gains over steps.

The distribution of selected actions, i.e. PID gains, for each joint, and the resulting rewards, are illustrated in Figure 17.

           (a) Aborted movements excluded   (b) Aborted movements included
                    Total Error                      Total Error
J1 Kp                0.000162                         0.001609
J1 Ki                0.018437                        -0.120123
J1 Kd               -0.003177                        -0.002428
J2 Kp                0.000235                         0.000773
J2 Ki                0.074919                         0.130234
J2 Kd               -0.002065                        -0.003016

Table 3: Coefficients for using two actuators independently. (a) Aborted movements are excluded; (b) aborted movements are included.

7.3 Two joints, multiple apples

As mentioned in Section 6.7, the training session of the multiple-apples experiment terminated prematurely, more precisely at step 946, after which its actions became stuck at NaN. The run, however, still achieved improvements in the gathered rewards, as is shown in Figure 18. The smaller blue dots show all the rewards at each learning step. Since different apple locations can result in different errors, a direct comparison between two arbitrary runs is not meaningful. The learning progress is therefore better illustrated by the red dots in Figure 18, which show the average reward over the finished epoch (i.e. the previous 100 runs). Again, the aborted runs were excluded, both for the individual rewards and for the calculation of the average reward.

Figure 18: The reward of the trajectory to and from the apple; crashed iterations not included.

Similar to experiment 2, the gains chosen during the training session are depicted in Figures 19 and 20. As briefly explained in Section 6.7, the convergence of the gains appeared much earlier than in experiment 2.


Figure 19: Experiment 3: chosen gains over steps throughout training for the shoulder joint; crashed iterations included. Panels: (a) Kp gains over steps, (b) Ki gains over steps, (c) Kd gains over steps.

Figure 20: Experiment 3: chosen gains over steps throughout training for the elbow joint; crashed iterations included. Panels: (a) Kp gains over steps, (b) Ki gains over steps, (c) Kd gains over steps.

To further inspect the functioning of the algorithm over states, the achieved reward for the different training apples is depicted in Figure 21(b). Because the model was still learning from step 800 till 900, and it never finished the epoch from 900 till 1000, the data shown in Figure 21(b) was gathered by testing the training apples on the network that was saved after step 900. To be precise, during this testing on the training apple locations the gains were obtained from the network, but the update functions were disabled to prevent the model from learning further. As a baseline, Figure 21(a) shows the received rewards over the different coordinates when the baseline gains were applied. The mean of the error is -0.22154, with a maximum deviation of 0.16191. This maximum deviation results from the run displayed at the right-most and lowest coordinate, which achieved the highest reward of -0.05963. These apples are the closest to the basket position, which is at x=1, y=0. From the 100 apple locations that were reached using the PID-tuned model, none resulted in an aborted movement.


Figure 21: The reward received after motions to the trained apple locations. (a) Baseline gains: rewards when the baseline gains were applied. (b) Model-found gains: rewards achieved when the gains were found by the model based on the apple coordinate.

The difference between the rewards of the baseline run of Figure 21(a) and the rewards acquired by the network-tuned robot arm is depicted in Figure 22(a). The baseline reward is subtracted from the reward gathered during the trained run. To further highlight where the network achieves the best improvement, only the 66 apple locations that improved with respect to the baseline are shown in Figure 22(b).

Figure 22: The improvement of reward with respect to the baseline, received after motions to the trained apple coordinates. Positive values show where the model improved upon the baseline. (a) Reward improvement with respect to the baseline; (b) only positive improvements.

7.3.1 Test Coordinates

To inspect how well the network is able to perform on unseen input states, the same network that was used for the evaluation of the trained apples is applied to a set of test apples. Figure 23(a) shows the rewards achieved as a function of the test apple coordinates, similar to Figure 21(a). In Figure 23(b) we see the rewards when the network is applied, and Figure 24 shows the difference between the two. Of the 95 test apples that were reached without an aborted movement, 75 achieved an increased reward with regard to the baseline gains.

Figure 23: The reward received after motions to the test apple locations, when different gains were applied. (a) Baseline gains: rewards when the baseline gains were applied. (b) Model-found gains: rewards achieved when the gains were found by the model based on the apple coordinate.

Again, the improvement of the rewards when the Actor-Critic network was applied, with respect to the run with baseline gains, is illustrated in Figure 24(a). Furthermore, the 75 locations with improved rewards are shown adjacent, in Figure 24(b).

8 Discussion

The first and most general aspect of the conducted experiments that should be discussed is the limited training setup. Due to the complex control architecture of the robotic arm, the problems concerning the discussed aborted movements and crashes became apparent too late in the project. This made rearranging the control structure infeasible. Consequently, the experiments and their results are all restricted in some way, e.g. through reduced convergence and thus limited reward improvements. I conclude this issue with the notion that the resulting approach had to cope with these limitations in various ways, such as a fail-safe to account for crashes and a limited speed maximum.

Due to the considerable number of figures and results, an evaluation is in order to signify their importance. In Section 7.1, the learning capabilities of the Actor-Critic network were established, and its potential to surpass the performance of the baseline gains was confirmed.

In Section 7.2, by utilizing a lower learning rate, simultaneous PID tuning of two actuators, and thus six gains, was also shown to be successful. For a deeper evaluation of the network output, Figures 13 and 15 should be considered. These figures show the relative importance of the network outputs for the reward. Specifically, it appears that, for both actuators, a high derivative gain and a low proportional gain are limiting factors for the reward. For the integral gain the importance is less straightforward; consequently, it is the last output of the network to converge. Nevertheless, there seems to be an optimum between 0.1 and 0.9 for the shoulder joint. The limited importance of the integral gain for the elbow joint is more apparent in Figure 17(b), where the difference in reward along the Ki axis appears indistinguishable. Another key feature of the 3D figures is that they substantiate the statement about the interdependencies of the PID gains. Notwithstanding the limited training of experiment 3, Section 7.3 validates that learning over the training set is still attained (Figure 18). Remarkable is the unforeseen increase in learning speed when training is performed on multiple input states. Within 100 steps the output converges to some degree for all gains, and for both proportional gains to within at least a tenth. This phenomenon highlights the importance of the number of input states and the applied learning rates.

A possibly far more essential aspect arising from Section 7.3 is that the effect of PID tuning for different movements gains more credence. From Figures 21(a) and 21(b) it is apparent that the reward, and thus the PID performance, increases in proportion to the x coordinate. Since the basket is placed closest to the right and lower corner, this means the reward decreases as the movement gets longer. However, in Figure 22(a), and even more distinctly in Figure 22(b), it is visible that these motions to points further from the basket improve the most. A last remark concerning the influence of the gains on the global setup is that the test set improved the reward on more points than the training set (see Figures 24(a) and 24(b)). Noticeable is, however, that some apple coordinates in the test set received a reward as low as -0.6, whilst the lowest for the training set was half of that.

Pertaining to the performance of the different applications of the Actor-Critic network, primarily the rewards over steps are of importance (Figures 11(a), 11(b), 12 and 18). From all these figures, the performance of experiment 1 appears the most unstable. Likely due to the lowered learning rate, experiment 2 showed increased stability and less susceptibility to a leveling-off of the reward due to bound convergence. The most impressive results, however, came from the network application of experiment 3. Moreover, since the ultimate goal is PID tuning over multiple apples, the most important aspect of future research concerning the Actor-Critic network is improving upon this implementation. One of the first recommendations for improvement is increasing the bounds of the proportional and integral gains. These were kept limited due to a lack of convergence when higher bounds were applied. However, this was only tested for experiments 1 and 2, since the experiments were executed sequentially. The second parameter for further tuning is the learning rate. Even though a multitude of different learning rates was tested, these too were only applied to the first two experiments. Moreover, given the unforeseen speed increase of experiment 3, tuning the learning rate for multiple input states is expected to be a promising endeavor.

9 Conclusion

The goal of this thesis was to apply deep reinforcement learning to improve the movement of a robotic arm by tuning PID gains. An important focus was therefore on the workings of the PID and its possibilities for improvement. Furthermore, to answer this research goal, three research questions were proposed: Firstly, is it possible to use Actor-Critic learning to find appropriate gains for an actuator? Secondly, can the motion of a robotic arm be improved by simultaneously tuning the gains of two actuators with a single Actor-Critic model? Finally, can the accuracy of the arm be further improved by training on multiple goal locations?

This constructive research has shown that Actor-Critic learning is a good fit for the single-task application, and that combining multiple actuators and lowering the learning rate can further improve the stability of learning. Furthermore, the assumption that the accuracy of the trajectory depends on the type of movement is substantiated. Lastly, training the Actor-Critic network on multiple locations has been shown to improve the reward of the more difficult movements.

References

[1] John G Ziegler, Nathaniel B Nichols, et al. Optimum settings for automatic controllers. trans. ASME, 64(11), 1942.

[2] Reza Jafari and Rached Dhaouadi. Adaptive pid control of a nonlinear servomechanism using recurrent neural networks. Adv. Reinforcement Learning, pages 275–296, 2011.

[3] Xue-Song Wang, Yu-Hu Cheng, and Sun Wei. A proposal of adaptive pid controller based on reinforcement learning. Journal of China University of Mining and Technology, 17(1):40–44, 2007.

[4] Ahmed M Mosaad, Mahmoud A Attia, and Almoataz Y Abdelaziz. Whale optimization algo-rithm to tune pid and pida controllers on avr system. Ain Shams Engineering Journal, 10(4): 755–767, 2019.


[8] Panagiotis Kofinas and Anastasios I Dounis. Online tuning of a pid controller with a fuzzy reinforcement learning mas for flow rate control of a desalination unit. Electronics, 8(2):231, 2019.

[9] Antonio Visioli. Practical PID Control. Springer, 2006.

[10] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition, 2018.

[11] T. Yakhno. Advances in Information Systems: Second International Conference, ADVIS 2002, Izmir, Turkey, October 23-25, 2002. Proceedings. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2002. ISBN 9783540000099. URL https://books.google.nl/books?id=LkwYhgV-UgEC.

[12] Robert A Rescorla, Allan R Wagner, et al. A theory of pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical conditioning II: Current research and theory, 2:64–99, 1972.

[13] Edward L Thorndike. Animal intelligence: an experimental study of the associative processes in animals. The Psychological Review: Monograph Supplements, 2(4):i, 1898.

[14] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. 1989.

[15] Franciszek Grabski. Semi-Markov processes: Applications in system reliability and mainte-nance. Elsevier, 2014.

[16] Anis Koubaa. Robot Operating System (ROS): The Complete Reference (Volume 1). Springer, 2016.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
