
Efficiency Optimisation

in a Virtual Environment


Layout: typeset by the author using LaTeX.


Dylan J. Prins
11247126
Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. H. van Hoof
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 907
1098 XG Amsterdam


Abstract

Classic electric motor controllers require optimisation by experts. Research has shown the potential of modern machine learning techniques in the field of electrical engineering and in many other fields such as image processing and speech recognition. This research provides an answer to whether electric motor control can be optimised in a virtual environment using these modern machine learning techniques and whether doing so is beneficial compared to a classic PI-Controller. The results show improved efficiency and reliability of electric motor control by implementing an agent using reinforcement learning. The agent is trained using a Deep Q Network. Electric motor control using this Deep Q Agent instead of a classic PI-Controller shows an improvement in stability and performance when controlling the motor's torque and current. Further research on electric motor control using various reinforcement learning algorithms is encouraged to further improve efficiency.


Contents

1 Introduction
2 Related Work
  2.1 Electric Motor Control
  2.2 Deep (Reinforcement) Learning
3 Method
  3.1 Gym Electric Motor Toolbox
  3.2 Deep Q Network
4 Results
  4.1 Network Sizes
  4.2 Learning Rates
  4.3 GPU vs CPU
  4.4 Epsilons (ε's)
  4.5 AI vs PI
5 Conclusion


1 Introduction

Humankind exhausts the world's resources to supply its daily needs of transport, energy and electricity. Crude oil is processed to fuel cars, motorcycles and planes. In addition to transportation, oil is burned to generate electricity, which powers the planet. Other resources are used where oil is inapplicable, for instance to produce mobile phones, computers or medicines. The waste products of the aforementioned processes consist of substances which cause ozone depletion.

While it is not feasible to cease this manufacturing entirely, such processes can nevertheless be optimised. This research focuses on the optimisation of an electric motor in a virtual environment. While these motors are electric, they are nevertheless often powered by planetary resources. The exponential growth of electric vehicles, as shown in figure 1, indicates the importance of their efficiency.

Figure 1: Worldwide number of battery electric vehicles in use from 2012 to 2018 (in 1.000s)[15]

Besides vehicles, electric motors are used in a significant number of household items such as hairdryers, vacuum cleaners, toothbrushes and blenders.

The goal of this research is to provide an efficient yet powerful solution which optimises these kinds of processes using an Artificial Neural Network (ANN) built with modern machine learning techniques. Current optimisation of efficiency is done by humans. Such optimisation is a tedious process and can only be carried out by experts. These experts are scarce and still do not guarantee the most efficient calibration. Artificial intelligence (AI) has shown that it can provide solutions superior to those devised by humans, like the AI AlphaGo Zero, which became world champion not by learning from human input but by becoming its own teacher[12].

An agent will be trained to efficiently control the motor's behaviour using an ANN. Examples of current motor controllers are PI-Controllers and Model Predictive Control. Contrary to an agent being trained using Reinforcement Learning (RL), a Model Predictive Controller requires real-time optimisation[13], which is a computationally costly[3] process in need of continuous correction by the aforementioned experts. In the case of the trained agent, the optimisation is done automatically in the learning process. The same RL model can be applied in a multifaceted manner, implying one model applicable to a wide variety of electric motors without any further modification.

An answer will be given to whether electric motor control can be optimised in a virtual environment using machine learning techniques and whether it is beneficial compared to a classic PI-Controller.

2 Related Work

2.1 Electric Motor Control

Various methods of electric motor control have been examined and tested in previous work. Classic PI-Controllers and more advanced methods like Model Predictive Control are used predominantly in the field of electrical engineering. Despite their dominance, electric motor control using neural networks is showing its potential, and a transition to these modern methods is expected.

A previous attempt to improve classic methods of electric motor control[16] has been made by Mohamed S. Zaky. The author proposes a self-tuning PI-controller which controls the speed of electrical motor drives by following a reference. The results of the self-tuning PI-controller were promising compared to the classic PI-controller. It not only exceeded the performance of the classic PI-controller concerning reference tracking, but is additionally unaffected by load inertia variations and load torque disturbances, while still being mathematically simple and therefore easy to implement.

Despite the relative simplicity of the above-mentioned approach, it still does not provide an optimal solution. The parameters of the self-tuning PI-controller have to be updated online, whereas an agent trained by RL has minor online complexity[6] and is adaptive in the sense that it is applicable to multiple electric motors without major adjustments.

The Department of Power Electronics and Electrical Drives at Paderborn University in Germany has provided a proof of concept with an approach to controller design for electrical drives using Deep RL[11]. The authors state that the control performance of conventional control methods correlates profoundly with the developing engineer's experience and prior education. A Deep RL controller is designed that controls the phase currents of a permanent magnet synchronous motor in a field-oriented framework. The controller shows perfect dynamic behaviour and the RL algorithm itself utilises the whole control variable space. Performance was comparable to classic controllers while being less computationally heavy and not in need of prior knowledge of the electric motor's environment. The main goal of the research was to prove the effectiveness of Deep RL as a solution for motor control. The authors encourage further research, such as not only controlling the motor's current but additionally controlling its torque and velocity, to further corroborate the potential of Deep RL in the field of electrical engineering and in other applications like complex production lines and microgrids.


2.2 Deep (Reinforcement) Learning

Machine Learning (ML) can be classified into a wide variety of learning methods, like supervised, unsupervised and reinforcement learning.

Earlier work[10] of Vladimir Nasteski provided an overview of supervised machine learning methods. These methods all share the same need for labeled "true" data. As stated by Nasteski ([10] page 2): "The various algorithms generate a function that maps inputs to desired outputs". An answer y for a question x is needed prior to the learning process, where supervised learning maps the new x to the already known y. Earlier research describing the recent advancements and applications of unsupervised learning for numerous learning tasks[14] explains its benefits over traditional supervised learning. Unsupervised learning lacks the need for labeled data and manual handcrafted feature engineering, which results in a flexible and general model applicable over a broad range of problems.

RL is based on the concept of taking actions in an environment which maximise the cumulative reward. It works by performing an action by interacting with the environment and determining its next action based on the reward of the previous action. Thus, RL algorithms learn from experience in a trial-and-error manner. An introduction to RL[5] shows the similarities between RL and unsupervised learning in the sense that RL does not need knowledge of the environment prior to the learning process itself. They both share the difference compared to supervised learning that the answer y is not already known: RL learns from experience.

ML in general used to be very sensitive to the quality and size of its input. High-dimensional input like raw images was laborious to learn from and required policies defined prior to the learning itself.

Introducing Deep Neural Networks (DNNs) to tackle the aforementioned issue has shown its potential in many fields like computer vision and speech recognition. Image classification problems like classifying potentially cancerous tumors[4], or using a DNN for speech recognition with far better results[9] than other solutions, substantiate this claim.

Likewise, the field of RL has seen many enhancements throughout the years, mainly due to the introduction of DNNs. Researchers at DeepMind used a convolutional neural network combined with their variant of Q-learning to allow an agent to learn and play seven Atari 2600 games[7], surpassing human expert level on three of these games.

The previous work explained the differences between the possible approaches to the problem concerned with this research, furthermore stating the potential of Deep RL and the desired research to be done in the field of electric motors. The suggestions made by the previous work (M. Schenke et al.[11]) are taken up and executed in this research.

3 Method

The agent is trained to control the electric motor's current (i), velocity (ω) and torque (T). A Deep Q Network is implemented as the brain of the agent. The Gym Electric Motor Toolbox (GEM)[1] is used to simulate the electric motor. Furthermore, a classic PI-controller is implemented to compare against the performance of the control method provided by this research. All code is written in Python 3.7.7 and PyTorch 1.4.0 is used as the Deep Q Network's framework. The first section of this method provides more insight into the GEM, showing its capabilities and application in this research. The second section explains how the Deep Q Network is implemented.

3.1 Gym Electric Motor Toolbox

The GEM has been used to simulate the electric motor. GEM is a toolbox built upon the OpenAI Gym environments[2] and is therefore highly suitable for RL. The motor's states and parameters are accessible and can be altered through single function calls, making it a flexible framework which can be used with a broad range of RL algorithms.
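As an illustration of this interface, the sketch below creates a GEM environment and interacts with it through the standard OpenAI Gym calls. The environment id and the number of steps are assumptions and may differ between GEM versions; this is not the thesis's actual setup code.

    import gym
    import gym_electric_motor  # registers the GEM environments with OpenAI Gym

    # The environment id below is an assumption; ids differ between GEM versions.
    env = gym.make('DcSeriesDisc-v1')

    observation = env.reset()                    # motor state combined with its reference
    for _ in range(1000):
        action = env.action_space.sample()       # a random discrete switching action
        observation, reward, done, info = env.step(action)
        if done:                                 # e.g. a motor limit was violated
            observation = env.reset()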

Figure 2 shows how the agent has control over the environment. The agent controls the converter between the simulated voltage supply and motor.

Figure 2: Physical Structure of the Environment’s Components[1]

GEM has various converters built into its framework. These converters may be controlled with either a discrete or continuous action space. The discrete action space has been used together with a single quadrant converter, as shown in figure 3. The discrete actions consist of the direct switching states of the transistors. Consequently, the agent has control over the supplied voltage and decides whether it will be fed to the motor or not.

Figure 3: Single Quadrant Converter (1QC)[1]

Figure 4 visualises the control flow of the agent and its electric motor environment. The supplied voltage is used to calculate the next state s_{t+1} using a mathematical representation of the electric motor. s_{t+1} is then used to determine the reward for the agent's action. A limit observation is built into the framework to teach the motor's limits to the agent. This is mandatory when deploying the agent on a real-world electric motor. The agent receives a substantially low reward when exceeding these limits and the training episode ends. The observation, containing the motor's state and reference, is combined with the calculated reward and delivered to the Deep Q Network. Together with the observation, the reward is vectorised prior to feeding it into the network.


Figure 4: Control Flow of a Step Cycle of the Environment[1]

Figure 5: DC Series Motor[1]

Various electric motors are built into the GEM, all compatible with a continuous or discrete action space. The motors are represented by a system of differential equations (ODE system 1). The agent controls the DC series motor shown in figure 5, which consists of circuits connected in series. Thus, the armature and excitation voltages are summed ($u_\text{in} = u_A + u_E$) and the current is the same ($i_\text{in} = i_A = i_E$) throughout the circuit.

\[
\begin{bmatrix} \dfrac{di}{dt} \\[4pt] \dfrac{d\omega}{dt} \end{bmatrix}
=
\begin{bmatrix}
\dfrac{1}{L_A + L_E}\left(-L'_E\, i\, \omega - (R_A + R_E)\, i + u\right) \\[6pt]
\dfrac{1}{J}\left(L'_E\, i^2 - T_L(\omega)\right)
\end{bmatrix}
\qquad (1)
\]

The states are specific to each kind of motor, since they are the concatenation of the motor's state combined with its torque and input voltages. The DC series motor's state (Eq. 2) is built from the velocity, torque, current, input voltage and the voltage prior to conversion.

\[ s_\text{dcSeries} = [\omega, T, i, u_\text{in}, u_\text{sup}] \qquad (2) \]
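To make the role of ODE system 1 concrete, the following sketch performs one explicit Euler integration step of Eq. 1. All numerical parameter values and the load torque model are illustrative assumptions, not values taken from the GEM or the thesis.

    # Illustrative motor parameters (assumptions, not the GEM defaults)
    L_A, L_E, L_E_prime = 6.3e-3, 1.6e-3, 0.5e-3   # inductances [H]
    R_A, R_E = 2.78, 1.0                            # resistances [Ohm]
    J = 0.017                                       # rotor inertia [kg m^2]

    def load_torque(omega):
        # Simple speed-proportional load, assumed for illustration
        return 0.01 * omega

    def euler_step(i, omega, u, dt=1e-4):
        """One explicit Euler step of the DC series motor ODE (Eq. 1)."""
        di_dt = (-L_E_prime * i * omega - (R_A + R_E) * i + u) / (L_A + L_E)
        domega_dt = (L_E_prime * i ** 2 - load_torque(omega)) / J
        return i + dt * di_dt, omega + dt * domega_dt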

Each state variable k is normalised by its limit to a range of [-1, +1] or [0, +1], depending on whether negative values are physically possible. The reward for state variable k is defined by the error between the reference (the desired state s*_{k,t}) and the actual state s_{k,t}, which is the state resulting from the action made by the agent. When a specific action a_t is executed, a change in the environment is perceived. This change consists of the transition to the next state. The reward is then based on one of the four following errors: the weighted sum of absolute error (Eq. 3), the weighted sum of squared error (Eq. 4), the shifted weighted sum of absolute error (Eq. 5) or the shifted weighted sum of squared error (Eq. 6).


The weighted sum of absolute error (WSAE)

\[ R_t = -\sum_{k=0}^{N} w_k \, |s_{k,t} - s^*_{k,t}| \qquad (3) \]

the weighted sum of squared error (WSSE)

\[ R_t = -\sum_{k=0}^{N} w_k \, (s_{k,t} - s^*_{k,t})^2 \qquad (4) \]

the shifted weighted sum of absolute error (SWSAE)

\[ R_t = 1 - \sum_{k=0}^{N} w_k \, |s_{k,t} - s^*_{k,t}| \qquad (5) \]

the shifted weighted sum of squared error (SWSSE)

\[ R_t = 1 - \sum_{k=0}^{N} w_k \, (s_{k,t} - s^*_{k,t})^2 \qquad (6) \]

All error formulae were experimented with, but the SWSAE has been used as the final choice to maintain readability since it always returns a positive value.
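A minimal sketch of the SWSAE reward of Eq. 5 is given below, assuming normalised state and reference vectors and weights w_k that sum to one so the reward stays within [0, 1].

    import numpy as np

    def swsae_reward(state, reference, weights):
        """Shifted weighted sum of absolute error (Eq. 5)."""
        state, reference, weights = map(np.asarray, (state, reference, weights))
        return 1.0 - np.sum(weights * np.abs(state - reference))

    # Example: the reward is 1 when every state variable matches its reference exactly
    # swsae_reward([0.2, 0.5], [0.2, 0.5], [0.5, 0.5]) == 1.0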

A reference function is present in order to determine s*_{k,t}. The Wiener Process Reference Generator has been used to determine the desired state. The aforementioned reference is characterised by a degree of randomness and noise, which makes the control more challenging and comparable to a real-world setting. The Wiener Process Reference Generator draws pseudo-random samples from a normal (Gaussian) distribution (Eq. 7) with mean µ = 0 and a standard deviation σ drawn uniformly from the interval (3·10⁻³, 3·10⁻²). A sample is drawn at each time step t, determining the difference between the reference value at time t and the next reference value at time t + 1.

\[ G(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( \frac{-(x - \mu)^2}{2\sigma^2} \right) \qquad (7) \]

The Wiener Process Reference Generator does not violate the electric motor’s limits. This is due to the generator truncating the drawn value to a safe value if the motor’s limit is exceeded.
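A sketch of such a reference generator is shown below. How often σ is redrawn and the exact truncation rule are assumptions, since the text only states the distribution of the increments and that the motor's limits are respected.

    import numpy as np

    def wiener_reference(n_steps, limit=1.0, rng=None):
        """Wiener-process-like reference: Gaussian increments, truncated to the motor's limit."""
        rng = rng or np.random.default_rng()
        sigma = rng.uniform(3e-3, 3e-2)             # standard deviation, drawn once here (assumption)
        reference = np.empty(n_steps)
        value = 0.0
        for t in range(n_steps):
            value += rng.normal(0.0, sigma)         # difference between the reference at t and t + 1
            value = min(max(value, -limit), limit)  # truncate the drawn value to a safe value
            reference[t] = value
        return reference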

In order to compare the DQN Agent to a classic method of motor control, a classic PI-Controller has been implemented. The PI-Controller framework is available within GEM. The cumulative reward per episode has been collected over 200 episodes. Results for a stock and a fine-tuned environment were gathered to visualise the difference between an optimised and a non-optimised PI-Controller. This additionally ensures a realistic comparison between the DQN Agent and the classic method of motor control. The fine-tuned environment is based on the load parameters as stated in the examples included in the GEM, which differ from the default load settings of the environment. The DQN agent is trained using the default environment.


3.2 Deep Q Network

A Deep Q Network is built on the elementary principles of reinforcement learning. In contrast to unsupervised learning, whose goal is to find similarities and divergences between data points, reinforcement learning aims to find an optimised action model that maximises the total cumulative reward of the specific agent.

A couple of complications arise when transferring classic Q-learning to Deep Q-learning. The nonlinear deep neural network used to approximate the Q-values is sensitive to the high correlation between the samples (s_t, a_t, R_t, s_{t+1}) at time step t and those at t + 1. The decision made by the agent at time t has a significant effect on the next state at time t + 1.

Experience replay is implemented as shown in listing 1 to deal with this correlation. This prevents forgetting previous experience and reduces the correlations between the experiences themselves. A buffer of 15000 (state, action, reward, next state) tuples was maintained during training. This buffer represents a memory containing past experience. Random batches of size 32 for training on the CPU, and of size 128 on the GPU, were sampled during each update phase of the network. This process breaks the sequence dependence and thereby reduces the correlations between the samples at time steps t and t + 1.

Listing 1: Experience Replay

    import random
    from collections import deque

    class ExperienceReplay:
        def __init__(self, memory_size, burn_in):
            # Dynamic replay memory: the oldest entry is dropped when the memory is full
            self.memory = deque(maxlen=memory_size)
            self.burn_in = burn_in  # number of pseudo-random samples saved before learning starts

        # Returns a random batch of batch_size entries of the replay memory
        def sample_batch(self, batch_size):
            return random.sample(self.memory, batch_size)

        # Adds a transition with all its values to the memory
        def add_to_memory(self, state, action, reward, done, next_state):
            self.memory.append((state, action, reward, done, next_state))

The replay memory warms up at the start of the learning phase, saving 5000 pseudo-random samples. The replay memory is dynamic, meaning that the oldest entry is deleted when the replay memory is full while new experience is added. This is desired because older experience can be seen as less relevant than newer experience.
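This warm-up can be sketched as below, assuming the Gym-style environment env from section 3.1 and the ExperienceReplay class of listing 1; the old Gym step signature (observation, reward, done, info) is assumed.

    # Fill the replay memory with pseudo-random transitions before learning starts
    replay = ExperienceReplay(memory_size=15000, burn_in=5000)
    state = env.reset()
    for _ in range(replay.burn_in):
        action = env.action_space.sample()               # pseudo-random action
        next_state, reward, done, _ = env.step(action)
        replay.add_to_memory(state, action, reward, done, next_state)
        state = env.reset() if done else next_state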

The second complication occurs when updating the network itself. The classic Q-learning equation (Eq. 8) contains an update for the Q-value of state s_t and action a_t. In Deep Q Networks, the update of the Q-value also causes a change in the action for the target value at s_{t+1} (max_{a'} Q(s_{t+1}, a')).

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left( R_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right) \qquad (8) \]

This prevents the network from learning, since the target is continuously changing with each update. In order to solve this problem, a copy of the neural network is used as a target network to stabilise max_{a'} Q(s_{t+1}, a'), allowing time for the network to properly backpropagate by only updating the target network every n steps.


The Q-function is translated to a loss function (Eq. 9) to allow for backpropagation.

\[ L(\theta) = \left( R_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta_t) - Q(s_t, a_t; \theta) \right)^2 \qquad (9) \]

θ and θ_t represent the network's and the target network's parameters. The backpropagation itself is performed by PyTorch's backward() function.
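A sketch of how Eq. 9 can be computed and backpropagated for one sampled batch is shown below. The discount factor γ = 0.99, the averaging over the batch and the masking of terminal states are assumptions and common refinements rather than details stated in the text.

    import torch
    import torch.nn.functional as F

    def dqn_loss(network, target_network, batch, gamma=0.99):
        """Loss of Eq. 9 averaged over a batch; gamma is an assumed discount factor."""
        states, actions, rewards, dones, next_states = batch
        # Q(s_t, a_t; theta): Q-values of the actions that were actually taken
        q_values = network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():                      # theta_t (target network) is kept fixed
            next_q = target_network(next_states).max(dim=1)[0]
            targets = rewards + gamma * next_q * (1.0 - dones)
        return F.mse_loss(q_values, targets)

    # loss = dqn_loss(net, target_net, batch)
    # net.optimizer.zero_grad(); loss.backward(); net.optimizer.step()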

The network is implemented in a manner that allows it to interact with an OpenAI Gym environment, by reason of the GEM being built on this environment as well. The network is represented as a class, as shown in listing 2, to provide all needed functionality in the same object. This class contains all functions to obtain an action and the Q-values of a state. The actions are drawn in an ε-greedy manner, meaning that a fraction ε of the actions are drawn randomly instead of picking the one with the highest expected reward. This eliminates the problem of premature convergence by allowing for exploration, decreasing the probability of sub-optimal behaviour caused by not considering other potentially superior actions. Epsilon decreases every 100000 steps. This gradually halts exploration, which is necessary to ensure that no wrong actions are chosen once the agent has gained enough knowledge of which action to take when.

The network's input and output sizes are fixed, with the input size being the number of variables in each state and the output size being the number of possible actions.

Listing 2: Network

    import random
    import torch
    import torch.nn as nn

    class Network(nn.Module):
        def __init__(self, device, n_inputs, n_outputs, learning_rate):
            super().__init__()
            self.device = device          # 'cpu' or 'cuda'
            self.n_outputs = n_outputs    # size of the action space
            # Hidden layer sizes follow the shape chosen in section 4.1; ReLU activations are an assumption
            self.model = nn.Sequential(
                nn.Linear(n_inputs, 16), nn.ReLU(),
                nn.Linear(16, 16), nn.ReLU(),
                nn.Linear(16, 4), nn.ReLU(),
                nn.Linear(4, n_outputs))
            if device == 'cuda':
                self.model.cuda()
            self.optimizer = torch.optim.Adam(self.model.parameters(), lr=learning_rate)

        def forward(self, x):
            # Q-values of every action for the given (vectorised) state
            return self.model(x)

        # Function that returns an action given the state
        def get_action(self, state, epsilon):
            if random.random() < epsilon:
                return random.randrange(self.n_outputs)      # explore
            with torch.no_grad():                            # greedy
                state = torch.as_tensor(state, dtype=torch.float32, device=self.device)
                return int(self.model(state).argmax().item())

The Deep Q-Learning algorithm is implemented as presented by DeepMind[8]. Each episode is 50000 steps long. For each step in each episode an (ε-)greedy action a_t is taken, and the reward R_t and next state s_{t+1} are observed. This transition is stored in the replay memory. Every four steps the network is updated by calculating the loss on a random batch from the replay memory and backpropagating over the network using that loss. The network's parameters are then updated using gradient descent.


Listing 3: Deep Q-Learning Algorithm with Experience Replay

    Initialize replay memory, network, target network and parameters
    for each episode:
        Reset environment
        for each step in the episode:
            action = network.get_action(state, epsilon)
            Take the action and observe the reward and next state
            replay.add_to_memory(state, action, reward, done, next_state)
            if total_steps % 4 == 0:
                batch = replay.sample_batch(batch_size)
                Calculate the loss on the batch
                Backpropagate over the network and update the network's parameters
            Every n (2500 in this case) steps, set the target network equal to the network
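A minimal Python sketch of listing 3 is given below, assuming the ExperienceReplay and Network classes from listings 1 and 2, the dqn_loss sketch from earlier in this section, and a Gym-style environment env. The value of n_inputs, the fixed epsilon (the 100000-step decay is omitted) and the flattening of the GEM observation into a plain state vector are simplifying assumptions.

    import copy
    import numpy as np
    import torch

    def to_tensors(batch):
        """Stack sampled (s, a, r, done, s') tuples into batched tensors."""
        s, a, r, d, s2 = map(np.array, zip(*batch))
        return (torch.as_tensor(s, dtype=torch.float32),
                torch.as_tensor(a, dtype=torch.int64),
                torch.as_tensor(r, dtype=torch.float32),
                torch.as_tensor(d, dtype=torch.float32),
                torch.as_tensor(s2, dtype=torch.float32))

    n_outputs = env.action_space.n        # number of discrete converter switching actions
    n_inputs = 7                          # length of the vectorised observation (assumption)
    net = Network('cpu', n_inputs, n_outputs, learning_rate=1e-4)
    target_net = copy.deepcopy(net)
    replay = ExperienceReplay(memory_size=15000, burn_in=5000)
    epsilon, total_steps = 0.20, 0

    for episode in range(200):
        state = env.reset()
        for step in range(50000):
            action = net.get_action(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            replay.add_to_memory(state, action, reward, done, next_state)
            state = next_state
            total_steps += 1

            if total_steps % 4 == 0 and len(replay.memory) >= 32:   # update every four steps
                loss = dqn_loss(net, target_net, to_tensors(replay.sample_batch(32)))
                net.optimizer.zero_grad()
                loss.backward()
                net.optimizer.step()
            if total_steps % 2500 == 0:                             # refresh the target network
                target_net.load_state_dict(net.state_dict())
            if done:
                break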

The shape of the network has been chosen by experimentation with different sizes. Multiple short learning phases of 10 episodes with random network sizes were executed to determine three appropriate candidates. These candidates were then examined more closely; the one obtaining the highest rewards over 50 episodes was used as the final network size. The curve that grows the fastest was also taken into consideration. The following final candidates were examined:

1. N_inputs × 8 × 8 × 4 × N_outputs

2. N_inputs × 16 × 32 × 16 × 4 × N_outputs

3. N_inputs × 16 × 16 × 4 × N_outputs

The value of ε has been selected by comparing the curves of the cumulative rewards over 50 episodes. The curve growing most stably and obtaining the highest cumulative reward has been chosen. The following values of ε were examined:

1. 10% (0.10)

2. 20% (0.20)

3. 30% (0.30)

The learning rate α has a great influence on the final result. α has been selected in the same manner as ε, by comparing the curves and the rewards after 50 episodes. The following learning rates were examined:

1. 0.01

2. 0.001

3. 0.0001

4. 0.00001


Training computation can be executed on a CPU or a GPU. The preferred device has been determined by comparing the execution time of 50 episodes in combination with the maximum cumulative reward obtained after these episodes for both the devices. An Nvidia RTX 2070 eGPU and an Intel Core i7-7920HQ @ 3.1GHz CPU have been used.

4 Results

4.1 Network Sizes

Figure 6: Cumulative Rewards per episode for various Network Sizes

The various network sizes (Fig. 6) show distinctly different behaviour. The bigger network shape of N_inputs × 16 × 32 × 16 × 4 × N_outputs seems inapplicable to this research, showing consistent growth in the first 3 episodes but decreasing and prematurely converging within episodes 3-10. Although the smaller network shape of N_inputs × 8 × 8 × 4 × N_outputs shows more desirable behaviour by not converging at a poor cumulative reward, the network shape of N_inputs × 16 × 16 × 4 × N_outputs shows substantially higher stability by growing expeditiously to an adequate cumulative reward and then gradually converging to its maximum observed cumulative reward. A network shape of N_inputs × 16 × 16 × 4 × N_outputs has therefore been chosen for the remainder of this research.

4.2 Learning Rates

Figure 7: Cumulative Rewards per episode for various Learning Rates

The various learning rates show (Fig. 7) diverse growth of their cumulative rewards per episode. Even though 1 × 10⁻³ seems the most rewarding at first glance, it is not the preferred choice for an extended learning phase: the learning rate of 1 × 10⁻³ has the highest reward at the start of its learning phase. Learning rates of 5 × 10⁻³ and 1 × 10⁻² show the same behaviour but converge at a lower cumulative reward compared to 1 × 10⁻³. Learning rates of 1 × 10⁻⁴ and 1 × 10⁻⁵ show contrasting behaviour, even decreasing within the first 10 episodes. This phenomenon occurred in various experiments and is caused by the warm-up of the replay memory: if the pseudo-random actions are suboptimal during this start-up phase, the agent learns the wrong behaviour and has to unlearn this experience. All experiments in this research have been repeated multiple times and, despite this behaviour, did converge at the same maxima. The aforementioned smaller learning rates show the most stable growth over an extended period of learning; they keep increasing by the time the others have already converged. Figure 8 shows episodes 80-200 for learning rates 1 × 10⁻³ and 1 × 10⁻⁴. The learning rate of 1 × 10⁻⁴ has shown the most stable behaviour and the highest rewards when learning for longer periods of time by not converging prematurely. The learning rate of 1 × 10⁻⁴ has therefore been used for the remainder of this research.

Figure 8: Cumulative Rewards per episode (80-200) for Learning Rates 1 × 10⁻³ and 1 × 10⁻⁴

4.3 GPU vs CPU

Figure 9: Duration of execution and obtained cumulative rewards for GPU and CPU

Training using the GPU drastically increases training time in comparison to using the CPU (Fig. 9) for this specific problem. The GPU exceeds 2 hours of training when reaching a cumulative reward of 47000, whereas the CPU takes about 1 hour to obtain the same reward. Each episode lasts approximately 30 seconds when training is executed on the CPU, while the GPU takes roughly 2.5 minutes per episode. Based on these results, all further training has been performed using the CPU.

4.4 Epsilons (ε's)

The results for training with various values of ε (Fig. 10) show a clear best trade-off for the exploration-exploitation dilemma. An excessively low probability of exploration or an excessively low probability of exploitation both show unstable behaviour and converge too soon. Using ε = 10% in the ε-greedy algorithm shows inconsistent behaviour for the first 13 episodes: the cumulative rewards decrease rapidly in the first phase of training and thereafter show unstable growth. Likewise, ε = 30% shows the same instability but, in contrast, has a higher cumulative reward at the start of training. ε = 20% provides the best balance between exploration and exploitation: the cumulative rewards do not converge too soon and the graph is characterised by stable growth. ε = 20% has been chosen for the remainder of this research.

4.5 AI vs PI

Combining all prior experiments to set up the full DQN, the agent and the algorithm, composing the environment with the highest-rewarding network size, learning rate, epsilon and training hardware, results in motor control as shown in Fig. 11.

Figure 11: Electric Motor Control using a DQN Agent

The DQN agent's control of the motor's torque has a maximum cumulative reward of 42000 and its control of the motor's current has a maximum cumulative reward of 48200. The electric motor's torque controlled by the PI-Controller (Fig. 12) shows acceptable but unstable results: the minimum cumulative reward is approximately 35000 and the maximum roughly 43000. Fine-tuning the PI environment shows a slight increase in performance (Fig. 12b), but stability has not been achieved. The PI-Controller's cumulative reward maxima for the motor's torque are not exceeded by the DQN agent. Nonetheless, notice the difference in stability: the DQN agent's control converges at approximately 100 episodes, after which the agent fluctuates with a negligible error away from and back to its maximum. The average over all episodes of the PI-controlled torque is approximately 40000 (Fig. 12a & 12b). The DQN agent maintains a higher reward after a period of training compared to this average, without fluctuating in an unreliable manner.


(a) Torque (T ) PI Controlled (b) Torque (T ) PI Controlled Fine-tuned

Figure 12: Cumulative Rewards per episode Controlling Torque (T ) using PI-Controller

(a) Omega (ω) PI Controlled (b) Omega (ω) PI Controlled Fine-tuned

Figure 13: Cumulative Rewards per episode Controlling Omega (ω) using PI-Controller

An improvement is present when comparing the PI control of the motor's velocity to the aforementioned PI control of its torque. The fine-tuned environment shows a maximum cumulative reward of 48000 (Fig. 13b). The stability issue is still present, even though this particular result shows almost perfect motor control, with on average 96% (48000/50000 = 0.96) of the steps being accurate. In contrast to these results, the control of the motor's velocity by the DQN agent is insufficient. The average over the cumulative rewards of all episodes in the fine-tuned PI-controlled environment is approximately 46500, while the DQN agent converges at roughly 39800. The PI-controller even shows maxima of 48000.

(a) Current (i) PI Controlled (b) Current (i) PI Controlled Fine-tuned

Figure 14: Cumulative Rewards per episode Controlling Current (i) using PI-Controller

The motor's current has been PI controlled the most reliably when compared to the PI control of the motor's torque and velocity. The difference between the maximum and minimum, 43550 − 41625 = 1925, is the smallest error achieved in all PI-controlled experiments in this research. Nevertheless, notice how the DQN agent's control of the motor's current exceptionally exceeds the maximum cumulative reward of the PI-controlled current (Fig. 14a). Even the fine-tuned environment (Fig. 14b) has considerably lower maxima than those obtained by the DQN Agent.

Visualisations have been made to provide more tangible results. Figure 15 represents the control of the motor's torque by the DQN agent. A prominent change is present: the agent (blue line) is clearly not able to follow the desired reference (green line) prior to the learning phase, while afterwards almost perfect motor control is achieved. In line with the results shown in cumulative rewards, the control of the motor's current gives even more accurate results. As expected, controlling the motor's velocity does not show satisfactory results. Still, a slight improvement compared to the control prior to the learning phase is noticeable: the slope of the agent's control matches the reference after training.


(a) Torque (T ) control prior to training phase

(b) Torque (T ) control after learning phase

Figure 15: Torque (T ) control before (15a) and after (15b) training, the green line represents the reference and the blue line represents the DQN agent’s control

(a) Current (i) control prior to training phase

(b) Current (i) control after learning phase

Figure 16: Current (i) control before (16a) and after (16b) training, the green line represents the reference and the blue line represents the DQN agent’s control

(a) Omega (ω) control prior to training phase

(b) Omega (ω) control after learning phase

Figure 17: Omega (ω) control before (17a) and after (17b) training, the green line represents the reference and the blue line represents the DQN agent’s control

All PI motor control has sufficient but unstable performance. Fine-tuning the environment results in a slight upward shift of the graphs, presenting motor control with higher efficiency compared to the prior environment.

The DQN agent controls the motor's current and torque more efficiently and reliably than the PI-Controller. Exporting the network's "knowledge" allows for immediately improved performance and efficiency, compared to the PI-Controller, when deploying it on a real-world motor.

5 Conclusion

The purpose of this research was to determine whether electric motor control can be optimised using modern machine learning techniques, and whether it is beneficial compared to a classic PI-Controller. The results have shown significant improvements in the control of the motor's current and torque compared to the classic PI-Controller. The DQN agent improves efficiency and reliability by following the reference more closely and by providing control with higher stability, maintaining a small range of error at each step after the training phase. The demand for adjusting the controller manually is eliminated, meaning that the need for experts is greatly reduced. This is due to the agent optimising its control itself. After training, the knowledge can be exported, allowing for immediately improved motor control compared to the classic PI-Controller when deploying the agent in a real-world environment.

The DQN agent was not able to properly control the motor's velocity, raising the question whether DQN is the correct machine learning algorithm to control all of the motor's attributes. The Q-function has difficulty mapping the long-term consequences of an action, which is the result of the delay between the action and the speed-up of the motor's velocity. When speeding up an electric motor (or even a conventional combustion engine), there is a significant delay between the time of increasing its supply of power and reaching the desired speed. There is no delay when controlling the motor's current, since the agent controls the converter between the voltage supply and the motor itself: when increasing the voltage, an immediate increase of current is perceived, which directly corresponds with the motor's torque and allows for the same immediate change. There are various reinforcement learning algorithms which should be able to cope with this delay. It is desirable to research electric motor control using various reinforcement learning algorithms; this would provide an answer to whether other algorithms are a better fit to the problem. The reference itself could also be an inappropriate fit for the motor's velocity because of its randomness. Sharp peaks do not correctly match the velocity's movement throughout time: the movement of the velocity of an electric motor is characterised by smoother slopes than those generated by a Wiener Process reference.

Theoretically, the agent should be applicable to various kinds of electric motors without further adjustment; this has not been examined in this research. Because of the promising results, it is encouraged to perform further research on different kinds of electric motors. Furthermore, the results in this research did not show the repeated experiments. It is suggested to compare the different learning trajectories of the same experiments to map the stability of the implemented DQN. This may further substantiate the results of this research.


References

[1] Gym Electric Motor (GEM): An OpenAI gym environment for electric motors. https://github.com/upb-lea/gym-electric-motor, 2020.

[2] OpenAI Gym. https://gym.openai.com/, 2020.

[3] Aldo Balestrino, Andrea Caiti, Vincenzo Calabro, Emanuele Crisostomi, and Alberto Landi. From basic to advanced PI controllers: A complexity vs. performance comparison. In Valery D. Yurkevich, editor, Advances in PID Control, chapter 5. IntechOpen, Rijeka, 2011.

[4] S. Das, O. F. M. R. R. Aranya, and N. N. Labiba. Brain tumor classification using convolutional neural network. In 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), pages 1–5, 2019.

[5] Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G. Bellemare, and Joelle Pineau. An introduction to deep reinforcement learning. Foundations and Trends in Machine Learning, 11(3-4):219–354, 2018.

[6] Daniel Görges. Relations between model predictive control and reinforcement learning. IFAC-PapersOnLine, 50(1):4920–4928, 2017. 20th IFAC World Congress.

[7] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. DeepMind, 2013.

[8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei Rusu, Joel Veness, Marc Bellemare, Alex Graves, Martin Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

[9] A. B. Nassif, I. Shahin, I. Attili, M. Azzeh, and K. Shaalan. Speech recognition using deep neural networks: A systematic review. IEEE Access, 7:19143–19165, 2019.

[10] Vladimir Nasteski. An overview of the supervised machine learning methods. HORIZONS.B, 4:51–62, 12 2017.

[11] M. Schenke, W. Kirchgässner, and O. Wallscheid. Controller design for electrical drives by deep reinforcement learning: A proof of concept. IEEE Transactions on Industrial Informatics, 16(7):4650–4658, 2020.

[12] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, and et al. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.

[13] Arne Traue, Gerrit Book, Wilhelm Kirchgässner, and Oliver Wallscheid. Towards a Reinforcement Learning Environment Toolbox for Intelligent Electric Motor Control. arXiv, eess, 2019.

[14] Muhammad Usama, Junaid Qadir, Aunn Raza, Hunain Arif, Kok-lim Alvin Yau, Yehia Elkhatib, Amir Hussain, and Ala Al-Fuqaha. Unsupervised machine learning for networking: Techniques, applications and research challenges. IEEE Access, 7, 2019.


[15] I. Wagner. Worldwide number of electric cars 2018. https://www.statista.com/statistics/270603/worldwide-number-of-hybrid-and-electric-vehicles-since-2009/.

[16] Mohamed Zaky. A self-tuning PI controller for the speed control of electrical motor drives. Electric Power Systems Research, 119, 02 2015.
