
The Deep Learning Hype

Joel Cornelje

10820361

Bachelor thesis

Credits: 12 EC

Bachelor Information Science

University of Amsterdam

July 2018

Supervisor

Gerard Alberts

Second examiner

Arjan Vreeken



Acknowledgements

I would like to express my sincere gratitude to my supervisor Gerard Alberts, without whose constructive feedback and suggestions I would have been lost. Furthermore, I would like to thank my mother for her enduring support. Also, I would like to thank Frank van Harmelen for his time and the valuable information provided. Lastly, I would like to thank Patrick Schelb for introducing me to deep learning and suggesting it as a subject for my thesis.



Abstract


This paper is concerned with the current optimism in 'Artificial Intelligence' (AI), and more specifically in its subfield deep learning. In the past, periods of optimism in AI have resulted in disappointment because high expectations could not be met. Many breakthroughs have recently been achieved with deep learning technology; examples are the improvements in image and speech recognition and the victory of AlphaGo over the best human Go player. The basis of deep learning technology has been around for quite some time. Algorithms were made that were inspired by the brain, and though the resemblance between the algorithms and the brain was very loose, metaphors of the brain such as 'neural network' and 'neuron' were used to describe the techniques. The techniques only caused recent breakthroughs because of a combination of three events: new hardware, more data and better algorithms. Periods of optimism have emerged in AI which caused large investments and great attention to the field. Overoptimistic goals were set, which caused periods of disappointment in which funding was retracted and interest in the field was lost. The technical side of deep learning is covered, including an explanation of deep neural networks and convolutional neural networks. Limitations of deep learning are discussed; examples are the low interpretability of the techniques, the great amounts of data needed, the fact that current deep learning techniques are artificial narrow intelligence, that neural networks can be deceived, and that neural networks are not able to recognise appropriate similarities. A possible cause of the overoptimism in AI, leading to periods of disappointment, is that the field is named 'Artificial Intelligence'. Researchers set unrealistic goals, aiming towards true artificial intelligence or something close to it, leading to inevitable failure. It is difficult to predict whether the current optimism in AI will cause another period of disappointment in the near future. Looking at the history of AI, it does seem wise to be realistically optimistic rather than overoptimistic, so that a potential period of disappointment would be less disappointing.



Contents

Chapter 1 Introduction
  1.1 Algorithms
  1.2 Deep Learning Algorithms
  1.3 Optimism in AI

Chapter 2 History of Deep Learning
  2.1 Artificial Intelligence
  2.2 Machine Learning
  2.3 Deep Learning
  2.4 Deep Learning Revolution
  2.5 AI Winters
  2.6 First-Step Fallacies

Chapter 3 Deep Learning Techniques
  3.1 Deep Neural Network
    3.1.1 Identifying number 9 on an image
    3.1.2 'Learning' to identify a 9
  3.2 Convolutional Neural Networks
    3.2.1 Coping with larger images
    3.2.2 How convolutional neural networks 'learn'
  3.3 Alternative Learning and Neural Networks

Chapter 4 Limitations of Deep Learning
  4.1 Interpretability of Algorithms
    4.1.1 Education
    4.1.2 Journalism
  4.2 Other Limitations of Deep Learning

Chapter 5
  5.1 Findings
  5.2 Discussion

Reference list

Appendices
  A Summary
  B Interview (Dutch)


Chapter 1 Introduction

Chapter 1.1 Algorithms

Algorithms are widely used for a variety of online applications: from determining the news people see on Facebook or the movies and series recommended on Netflix, to the moves a computer makes in a game of chess. All are equipped with algorithms to make these decisions. For the general public it can be unclear what an algorithm exactly is and does. An algorithm is nothing more than a set of instructions to solve a problem. Algorithms are made up of code, which is the medium used to tell the computer what to do. An algorithm bases the decisions it makes on data. David Korteweg, researcher at Bits of Freedom, gave a simple example: an algorithm could be made to give a glass of water to the oldest person in a group of people. The algorithm will look at the data, which in this case consists of the ages of all the people in the group. It will find which person is associated with the highest age and decide that person should receive the glass of water.[1]
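As a minimal sketch of Korteweg's example (the names and ages below are invented for illustration), the decision rule might look like this in Python:

```python
# A toy version of the water-glass algorithm: scan the data (everyone's age)
# and decide who receives the glass of water.

def oldest_person(ages):
    """Return the name associated with the highest age in the data."""
    oldest_name = None
    highest_age = float("-inf")
    for name, age in ages.items():
        if age > highest_age:          # keep the highest age seen so far
            oldest_name, highest_age = name, age
    return oldest_name

group = {"Anna": 34, "Bram": 71, "Chris": 58}   # the data: ages of the group
print(oldest_person(group), "receives the glass of water")  # -> Bram
```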

Algorithms can be very large and complex, basing their decisions on large amounts of data. Their merit is that they can analyse large amounts of data and make multiple decisions in a short period of time. This makes a great variety of tasks possible, such as determining the images Google shows, deciding which orders Deliveroo drivers receive, or even acting as a judge in court.

Katleen Gabriels, a university lecturer in philosophy and ethics at Eindhoven University of Technology, describes how some problems can arise when using algorithms. There was some controversy in 2016 concerning Google Images. When searching for 'three white teenagers', images would be shown of white teenagers being happy and having a good time. When searching for 'three black teenagers', mug shots of black people were shown. It demonstrates that algorithms do not always give neutral results and that prejudices can slip into the data.[2]

Algorithms can also act as a manager. Journalist Rens Lieman started working for two food delivery services, UberEats and Deliveroo, in order to write a book on how the companies, and the algorithms they use, work. Lieman also wanted to find out what it is like to be an employee of these companies. In both jobs an algorithm essentially tells the drivers what to do and when to do it. There is minimal contact with a human supervisor because the algorithm already acts as one. It introduces a shift in the dynamics between employer and employee. Sometimes the choices made by the algorithms could be seen as unreasonable by drivers. For example, when picking up customers with the taxi-like service of Uber, pick-up points would be far away, or drivers would be encouraged to visit a certain area where rides would carry a financial bonus, only to find the bonuses gone on arrival. It would lead to annoyance and frustration because of the wasted time and effort. A study showed that when Uber drivers are taught how the algorithm works and why it makes certain decisions, they are more amenable to working with it, even when a decision seems unreasonable at first sight.[3]

[1] Korteweg, David and Bart Krull. "Verslaafd aan het Algoritme". Reading, Pakhuis de Zwijger, Amsterdam, April 25, 2018.
[2] Gabriels, Katleen and Bart Krull. "Verslaafd aan het Algoritme". Reading, Pakhuis de Zwijger, Amsterdam, April 25, 2018.
[3] Lieman, Rens and Bart Krull. "Verslaafd aan het Algoritme". Reading, Pakhuis de Zwijger, Amsterdam, April 25, 2018.

Some companies use algorithms to make legal decisions. e-Court is such a company: it functions as a private judge using an algorithm. Journalist Tim Staal researched how it works. e-Court is enabled when there are payment problems between customers and health insurance companies. Unpaid bills from customers are taken over by e-Court and the customer information is then fed to the algorithm, which outputs a judgement. The judgement is then signed by a human judge with a digital signature. Staal and his team found that e-Court's algorithm skips steps human judges would normally take into consideration, such as checking whether the policy of the health insurer conflicts with European rules, or considering the financial situation of the defendant. Since the discovery of these discrepancies, legal charges have been brought against e-Court.[4]

There is, however, positive potential in the use of algorithms within the legal system. When experiencing a legal problem, an algorithm could be used as a tool to decide whether it is useful to consult a lawyer. Hiring a lawyer can be expensive, so a prediction of whether it is beneficial to hire one and make an appeal can be valuable.

Algorithms can also be used to select applicants for a job. Co-founder of Seedlink Tech Rina Joosten-Rabou described how Seedlink helps companies decide who would be the best applicant for a specific job. Seedlink uses an algorithm which looks at subconscious behaviour, primarily by analysing language. The algorithm analyses speech from the best performing employees to find what they have in common. The speech of applicants is then analysed for similarities with the best performing employees, and based on these similarities the algorithm recommends whom to hire. Important to note is that the algorithm does not necessarily check which words are chosen but the way something is said. For example, when the application was in English, no difference was found in the acceptance rate between native and non-native speakers. Furthermore, in some companies that use Seedlink, the diversity of accepted employees has increased by 40%.[5]

The examples show that the use of algorithms can have positive as well as negative effects on people. It depends on how they are used and for what purpose. Algorithms can make it possible to be selected for a dream job which otherwise would not have been probable, or lead to high debts because a court ruled that certain rules had been violated. Even though algorithms only process data to obtain the best solution, this does not mean the solution is a neutral one. Data is a way of representing reality, and certain choices are made when choosing how to generate the data. Therefore, a bias can already be formed as a result of those choices, resulting in a biased algorithm.

[4] Staal, Tim and Bart Krull. "Verslaafd aan het Algoritme". Reading, Pakhuis de Zwijger, Amsterdam, April 25, 2018.
[5] Rabou, Rina-Joosten and Bart Krull. "Verslaafd aan het Algoritme". Reading, Pakhuis de Zwijger, Amsterdam, April 25, 2018.

Chapter 1.2 Deep Learning Algorithms

Many breakthroughs have been achieved recently in the use of algorithms. Image and speech recognition have improved significantly. Facebook is able to suggest which person is in a photo when it is uploaded to the network.[6] Voice response features in smartphones and on websites have become noticeably better: in 2001 an accuracy of about 80% for US English was achieved;[7] in 2017 Google increased the accuracy to 95.1%.[8] Language processing has seen great improvements. In 2016 Google Translate supported 103 languages and translated 140 billion words a day.[9] In the same year Google's AlphaGo beat world champion Lee Sedol in the board game Go.[10]

It has become possible to superimpose a person's head on a video of someone else's body, making it look as if they are doing something they have never done. The algorithm can create anything from celebrity fake porn and fake news to making yourself the lead character in a movie. The phenomenon goes by the name of 'deep fakes'. In January 2018 the application 'FakeApp' was launched, making it possible for people to create their own videos with little knowledge of programming. Though the outcomes are not always entirely realistic, it is significantly cheaper and easier than similar techniques.[11]

A similar application is VoCo, a program which is able to make someone say something they have not said, using only an audio recording of that person.[12] Thus, with deep fakes it can look like someone has done something they have never done; using VoCo it can sound like someone has said something they have never said.

The techniques used here are part of the field named 'Artificial Intelligence' (AI). The goal of AI is to make machines behave 'intelligently'. This can include behaviour such as image and speech recognition, playing chess or problem-solving activities. The technique which suddenly makes it possible for algorithms to reach all the recent breakthroughs is named 'deep learning'. Deep learning is a subfield of AI. It uses techniques inspired by how neurons in the brain interact, though the resemblance is very loose.

Deep learning techniques consist of algorithms which can 'learn' from the data they are given without human interference. For example, in image recognition a deep learning algorithm will be shown thousands of images of a monkey. The algorithm will find its own way to identify monkeys in other images without a human instructing it how to do so. This is very useful, as programmers do not need to spend time on how the algorithm recognises the features of monkeys. Because all the images of the monkeys can simply be fed to the algorithm, more use can be made of computing power.

[6] Simonite, Tom. "Facebook Can Now Find Your Face, Even When It's Not Tagged". Wired. 2017. Retrieved from https://www.wired.com/story/facebook-will-find-your-face-even-when-its-not-tagged/
[7] Pinola, M. "Speech Recognition Through the Decades: How We Ended Up With Siri". PCWorld. 2011. Retrieved from https://www.pcworld.com/article/243060/speech_recognition_through_the_decades_how_we_ended_up_with_siri.html?page=2
[8] Protalinski, E. "Google's speech recognition technology now has a 4.9% word error rate". VentureBeat. 2017. Retrieved from https://venturebeat.com/2017/05/17/googles-speech-recognition-technology-now-has-a-4-9-word-error-rate/
[9] Wong, Sam. "Google Translate AI invents its own language to translate with". New Scientist. 2016. Retrieved from https://www.newscientist.com/article/2114748-google-translate-ai-invents-its-own-language-to-translate-with/
[10] Borowiec, S. "AlphaGo seals 4-1 victory over Go grandmaster Lee Sedol". The Guardian. 2016. Retrieved from https://www.theguardian.com/technology/2016/mar/15/googles-alphago-seals-4-1-victory-over-grandmaster-lee-sedol
[11] Jongsma, Pieter and Ewa Scheifes. "Sign of Time #19, Real Fake". Reading, Pakhuis de Zwijger, Amsterdam, April 24, 2018.
[12] Jongsma, Pieter and Ewa Scheifes. 2018.

The graph from Google Trends in figure 1.1 demonstrates how 'deep learning' has become a popular search term. The graph shows the relevance of the term based on Google searches. It indicates that Google searches for 'deep learning' started increasing in 2012, with a peak being reached recently.

[Figure 1.1: Graph of the worldwide popularity of the term 'deep learning' based on Google Trends, from January 1, 2010 to July 1, 2018.]

In the not too distant future, the expectation is that deep learning technology will be able to solve a variety of problems in different sectors. Self-driving cars could reduce the number of car accidents by reducing the number of human errors which cause them. On European Union roads alone there were 25,600 fatalities in 2016,[13] and it is believed that more than 90% of car crashes in the US are caused by human error.[14] The errors are often due to distracted drivers, drunk driving and speeding.[15] Self-driving cars could eliminate human errors, bringing the number of accidents down. At the moment deep learning seems the leading method for creating self-driving cars.[16] Even so, the technology to fully replace all human-driven cars with self-driving cars remains something for the future.


Some researchers believe the technology to solve some important issues is very near. Max Welling, professor of machine learning at the University of Amsterdam, argues that AI could help prevent a lot of medical mistakes. Unfortunately, such mistakes are common: in the US alone, medical error is the third leading cause of death.[17] By eliminating the human errors causing these mistakes, many lives could be saved.

Computers can already solve certain tasks at the same level as humans. One deep learning program developed to identify melanomas, a serious type of skin cancer, was able to detect them as well as 21 dermatologists.[18] Another deep learning program was able to successfully detect mitosis (cell division). Being able to detect and measure mitosis is important in cancer prognoses.[19]

[13] CARE. EU road accidents database. European Commission. 2017.
[14] Singh, S. Critical reasons for crashes investigated in the National Motor Vehicle Crash Causation Survey. (Traffic Safety Facts Crash Stats. Report No. DOT HS 812 115). Washington, DC: National Highway Traffic Safety Administration. 2015.
[15] NHTSA Public Affairs. USDOT Releases 2016 Fatal Traffic Crash Data. Washington, DC: NHTSA Media. 2017.
[16] Ackerman, Evan. "How Drive.ai Is Mastering Autonomous Driving With Deep Learning". IEEE Spectrum. 2017. Retrieved from https://spectrum.ieee.org/cars-that-think/transportation/self-driving/how-driveai-is-mastering-autonomous-driving-with-deep-learning
[17] BMJ 2016;353:i2139. Retrieved from https://doi.org/10.1136/bmj.i2139
[18] Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. "Dermatologist-level classification of skin cancer with deep neural networks". Nature. 2017. Retrieved from https://www.nature.com/articles/nature21056

Such detections are normally done by trained experts, making it an expensive and time-consuming process. Also, experts are not always able to identify mitosis.[20] Doctors do not always have a lot of time to check everything about their patients. Long working hours leave little time to keep up with current literature. Therefore, patterns in complex data may not be recognised by doctors, especially when rare diseases appear. Programs could be developed to handle all the data and make a prognosis instantly.[21]

Another prediction is that efficiency and productivity can be increased in factories using AI. Quality control can be challenging when the variety of products is large or the production volume of a particular item is high. It is also necessary to check not only the quality of the product but also that of the machinery used. AI programs could be developed to check both and to identify a problem, along with its cause, as soon as it occurs. Equipping the program with sensors and computer vision algorithms means equipment no longer has to be checked at fixed intervals, as it is constantly monitored. Manufacturers can therefore repair problems faster, preventing expensive delays.[22]

Chapter 1.3 Optimism in AI

As seen, many problems in various sectors could be solved in the future with the help of deep learning. In turn this increases the optimism in the AI community in general. In spite of the appearance of realistic predictions, optimism can exceed performance in the AI community. According to a survey of machine learning researchers named "When Will AI Exceed Human Performance? Evidence from AI Experts", there is a 50% chance of AI outperforming humans in all tasks within 45 years.[23] This involves computers being able to accomplish every task better and more cheaply than human workers, and is called 'high-level machine intelligence' (HLMI). Other predictions are that humans will be outperformed in tasks such as generating a song which will reach the Top 40 by 2027, beating the best Go player by 2033 after practising on no more games than the best Go player has played in their whole life, and writing a bestselling book by 2049.

The survey was based on responses from 352 international AI researchers. They were asked to give predictions on when certain milestones within AI would be reached. The survey was held to help policymakers and researchers anticipate future trends in AI.[24] The assumptions on which the predictions are based were not asked for in the survey, even though the subtitle reads 'Evidence from AI Experts'. It seems to imply that because AI experts predict it, it must be a good indication, even without the underlying assumptions being explained.

[19] Cireşan, Dan C., Alessandro Giusti, Luca M. Gambardella, and Jürgen Schmidhuber. "Mitosis detection in breast cancer histology images with deep neural networks." In International Conference on Medical Image Computing and Computer-assisted Intervention, pp. 411-418. Springer, Berlin, Heidelberg, 2013.
[20] Schmidhuber, Jürgen. First Deep Learner to win a contest on object detection in large images, First Deep Learner to win a medical imaging contest, First Deep Learner to win cancer detection contests. 2013. Retrieved from http://people.idsia.ch/~juergen/deeplearningwinsMICCAIgrandchallenge.html
[21] Welling, Max. "Een pitbull waakt voor het laaghangend fruit". NRC Live. 2016. Retrieved from https://nrclive.nl/opinie_welling/
[22] Smith, Geoff. "Machinations Of Efficiency: Robots And AI On The Factory Floor". MBT Magazine. 2018. Retrieved from https://www.mbtmag.com/article/2018/06/machinations-efficiency-robots-and-ai-factory-floor
[23] Grace, Katja, John Salvatier, Allan Dafoe, Baobao Zhang & Owain Evans. "When Will AI Exceed Human Performance? Evidence from AI Experts". Oxford: Oxford University, 2017.

Another indication that a great deal of optimism exists is seen in the large investments in AI. Many companies have invested heavily in AI research and startups. This has been especially so in the past five years, which have seen an increase in the acquisition of AI startups by leading commercial entities such as Amazon, Apple, Facebook, Google and Microsoft. As seen in figure 1.2, such companies have acquired more AI startups each year than the previous year since 2013.[25]

[Figure 1.2: Number of AI startups acquired by companies each year.][26]

In the history of AI there have been periods of optimism which were followed by periods of disappointment. In the periods of disappointment, funding in the field was withdrawn and interest was lost by many. This paper is written from a certain concern: whether the current optimism in AI is justified, or whether another period of disappointment will follow.

This paper investigates the optimism surrounding AI, and specifically the optimism in its subfield deep learning. Chapter 2 will cover how deep learning has reached the position it finds itself in today, why it has gained such vast popularity recently, and what the consequences of optimism in AI were in previous years. In chapter 3 the technical side of deep learning will be analysed to reach a deeper understanding of the technology. Chapter 4 will study whether deep learning carries any limitations and what effects such limitations could have. Finally, in chapter 5 the findings will be discussed.


[24] Grace, Katja, John Salvatier, Allan Dafoe, Baobao Zhang & Owain Evans. 2017.
[25] CB Insights. "The Race For AI: Google, Intel, Apple In A Rush To Grab Artificial Intelligence Startups". CB Insights. 2018. Retrieved from https://www.cbinsights.com/research/top-acquirers-ai-startups-ma-timeline/
[26] CB Insights. 2018.

Chapter 2 History of Deep Learning

Chapter 2.1 Artificial Intelligence

In 1948 Norbert Wiener published the book 'Cybernetics: Or Control and Communication in the Animal and the Machine'.[27] It stated that information theory, in which information is the central concept, was better able to explain how certain phenomena worked than the then more commonly used Newtonian model, in which energy is the central concept. The phenomena included how humans, animals and machines control and communicate.[28]

Alan Turing published the paper 'Computing Machinery and Intelligence' in 1950.[29] The paper considered whether machines could be distinguished from humans. The question 'Can machines think?' was proposed. Because the terms 'think' and 'machine' are difficult to define, a different approach was taken. Turing's test was created as an alternative to the question whether machines can think. An interrogator would communicate via a teletype with a person or a machine. When the interrogator could not distinguish whether it was in dialogue with a person or a machine, the machine would win.[30] It allowed researchers to set the goal of making machines do something that would normally require thinking.

A conference was held in 1956 at Dartmouth College, bringing scientists together from different fields to discuss their work on making machines behave intelligently. Their common interest was the belief that thinking could be possible outside the human brain by understanding it in a scientific way. It was agreed by all that the best non-human instrument to achieve this was the digital computer. There was a dispute about what the new field of study should be named. McCarthy, one of the organisers of the conference, argued strongly for 'Artificial Intelligence'. While not everybody agreed, the term stuck.[31] 'Artificial intelligence' (AI) would become a wider field with different uses, causing different subfields to emerge.

Chapter 2.2 Machine Learning

One of the subfields of AI was introduced by Arthur Samuel in 1959. Samuel published the paper 'Some Studies in Machine Learning Using the Game of Checkers', in which he tried to create a program that could play the game of checkers. The paper was concerned with:

'the programming of a digital computer to behave in a way which, if done by human beings or animals, would be described as involving the process of learning'[32]

[27] Wiener, Norbert. "Cybernetics: Or Control and Communication in the Animal and the Machine." Scientific American 179, no. 5. 1948.
[28] McCorduck, Pamela. Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters/CRC Press, 1979, p. 52.
[29] Turing, A. "Computing machinery and intelligence". Mind 59, no. 236. 1950.
[30] Turing, A. 1950.
[31] McCorduck, Pamela. 1979, p. 111-115.

The 'learning' would take place when a machine was not explicitly programmed to perform a task but made decisions based on data using statistical techniques.[33] It would become a subfield of artificial intelligence and grow under the term 'machine learning', introduced by Arthur Samuel in the paper. Just as with AI, machine learning would develop different subfields, mostly based on using statistical techniques to get the machine to 'learn'.

Chapter 2.3 Deep Learning

One of the subfields of machine learning emerged from a combination of different events. In 1943 McCulloch and Pitts published a paper which argued that modelling the system of neurons in the human brain could make it possible to imitate human behaviour. Because of the on-off behaviour of neurons in the brain, it was thought possible to model it as a neural system. A mathematical model associated with nerve behaviour was created, which they called a 'neural net'. The neural net was supposed to be able to simulate any algorithm. The model of the net, with synapses connecting the all-or-nothing neurons, was found to be oversimplified in comparison to the human brain. At the time little was known about the operations of the neurons in the brain.[34]

Frank Rosenblatt invented a hypothetical nervous system named the 'perceptron' in 1958. It was an algorithm intended to illustrate some of the fundamental properties of an intelligent system, such as recognising images.[35] It was a type of shallow 'artificial neural network' (ANN). The term shallow implies that the perceptron contains, apart from the input and output layer, at most one other layer. Figure 2.1 shows an illustration of the perceptron. It is a shallow network, as it contains no layers other than the input and output.

[Figure 2.1: Rosenblatt's perceptron with three inputs, three weights and an output.]

[32] Samuel, Arthur L. "Some studies in machine learning using the game of checkers." IBM Journal of Research and Development 3, no. 3. 1959, p. 211-229.
[33] Samuel, Arthur L. 1959, p. 211.
[34] McCorduck, Pamela. 1979, p. 55-57.
[35] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological Review 65, no. 6. 1958.

Briefly explained, the perceptron is an algorithm with inputs, named 'McCulloch-Pitts neurons', and an output, named an 'activation unit'.[36] The inputs are connected to the output by what are called 'weights'. Each weight carries a certain value, which multiplies the value of its input. The outcomes are then summed. When a certain threshold is reached, the neuron will 'fire', activating the unit in the output. The signal moves in one direction, from input to output, making it a feedforward network. The perceptron carries no layers other than the input and output layer, making it a single-layer network. What it means to have one or multiple layers, and how they work exactly, will be explained further in chapter 3.
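As a minimal sketch of this forward pass (the weights and the threshold below are illustrative values, not ones Rosenblatt used):

```python
import numpy as np

def perceptron(inputs, weights, threshold):
    """Multiply each input by its weight, sum, and 'fire' if the threshold is reached."""
    weighted_sum = np.dot(inputs, weights)         # inputs multiplied by weights, then summed
    return 1 if weighted_sum >= threshold else 0   # 1 = the activation unit fires

x = np.array([1.0, 0.0, 1.0])   # three inputs (the McCulloch-Pitts neurons)
w = np.array([0.4, 0.6, 0.3])   # three weights, one per connection
print(perceptron(x, w, threshold=0.5))   # -> 1, since 0.4 + 0.3 = 0.7 >= 0.5
```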

Special hardware was made for the perceptron to function. The machine was named the 'Mark I Perceptron' and was given image recognition tasks.[37] It is an example that, at the time, research in AI was not only theoretical but also practical.

In 1965 Ivakhnenko and Lapa published work on creating a perceptron-like network.[38] It was a feedforward network, and it consisted of multiple layers, making it a multilayered network. The extra layers are called 'hidden layers'. Briefly explained, the layers mean the input is not processed just once but multiple times: the inputs are multiplied by weights in one layer, and the outcomes are not summed immediately but multiplied again by other weights in the next layer. The process can be repeated multiple times until the outcomes are summed in the output layer. In feedforward networks the depth can be seen from the number of hidden layers. A measurement used is the 'credit assignment path' (CAP).[39] The CAP depth is the number of hidden layers plus the output layer. Generally, networks are considered 'deep' when the CAP depth is larger than 2, meaning more than one hidden layer. Hence, the network in figure 2.2 is considered a 'deep neural network' (DNN). The use of deep neural networks can be combined with different techniques to achieve better results.

[Figure 2.2: A deep neural network with two hidden layers.]

[36] Crevier, Daniel. AI: The Tumultuous History of the Search for Artificial Intelligence. Basic Books, 1993, p. 104.
[37] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer. 2006, p. 196.
[38] Ivakhnenko, A. G. and Lapa, V. G. Cybernetic Predicting Devices. CCM Information Corporation. 1965.
[39] Schmidhuber, Jürgen. "Deep learning in neural networks: An overview." Neural Networks 61. 2015, p. 85-117.

In 1959 Hubel and Wiesel found two types of cells in the visual cortex: simple cells and complex cells.[40] They proposed a model in which the cells function in a hierarchical manner. Simple cells would identify features such as edges. Complex cells would receive inputs from simple cells and respond to patterns regardless of their location.[41]

Inspired by Hubel and Wiesel's model, Fukushima proposed a model for visual pattern recognition with deep neural networks in 1979. An artificial multilayered neural network was created, named the 'neocognitron'.[42] The neocognitron was supposed to function in a way similar to how the simple and complex cells function in the visual cortex.

The network of the neocognitron consisted of 'S-cells', which had to extract features, and 'C-cells', which would be unaffected by positional errors of the features. S-cell maps would be created from the original image. The S-cell maps would contain features of the image; the process would later be known as 'convolution'. From an S-cell map a C-cell map would be created to reduce positional errors in the features; the process from S-cell maps to C-cell maps is known as 'downsampling' or 'subsampling'. S-cell maps could then be created from C-cell maps, C-cell maps from S-cell maps again, and so on. The process could be repeated multiple times.[43][44] The processes are similar to modern techniques, which will be discussed further in chapter 3; a rough sketch is given below.
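As a rough sketch of these two operations (the 3x3 edge filter, the toy image size, and the choice of max-downsampling are assumptions made here for illustration, not details of the neocognitron itself):

```python
import numpy as np

def convolve(image, kernel):
    """Slide the kernel over the image; each output value is one 'S-cell' response."""
    k = kernel.shape[0]
    rows, cols = image.shape[0] - k + 1, image.shape[1] - k + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

def downsample(feature_map, size=2):
    """Shrink the map by keeping the strongest response in each size x size block,
    making the result less sensitive to the exact position of the feature."""
    rows, cols = feature_map.shape[0] // size, feature_map.shape[1] // size
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = feature_map[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.random.rand(8, 8)                    # a toy 8x8 input image
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # responds to vertical edges
s_map = convolve(image, edge_filter)            # 6x6 'S-cell' feature map
c_map = downsample(s_map)                       # 3x3 'C-cell' map
print(s_map.shape, c_map.shape)                 # -> (6, 6) (3, 3)
```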

LeCun et al. named the combination of deep neural networks and techniques similar to those of the neocognitron 'convolutional neural networks' (CNNs) in 1999.[45] Convolutional neural networks will be explained further in chapter 3. In reality, S-cells and C-cells differ greatly from actual simple and complex cells in the visual cortex. Still, the techniques used by the neocognitron are similar to modern, contest-winning image recognition techniques, which also use convolution and downsampling.[46]

In 1970 a master's thesis by Linnainmaa described the basics of minimising errors in a network, similar to how a modern neural network 'learns'.[47] Though the general techniques used were not new at the time, they were now implemented more efficiently.[48] First, it is calculated to what degree a prediction from the network is incorrect. The technique is then used to calculate, from output back to input, how to lower the incorrectness of the network. This is achieved by strengthening certain connections between cells which decrease the incorrectness of the network, whilst weakening other connections which increase the incorrectness. The technique is called 'back-propagation' and will be further explained in chapter 3.
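A minimal numerical sketch of this idea, shrunk to a 'network' with a single connection (the squared-error measure and all values here are assumptions for illustration):

```python
# The network is one connection: prediction = w * x. Repeatedly adjusting w
# against the gradient of the error strengthens or weakens the connection,
# lowering the network's incorrectness.

x, target = 2.0, 1.0    # input and the desired output
w = 0.1                 # initial connection strength
learning_rate = 0.05

for step in range(20):
    prediction = w * x
    error = (prediction - target) ** 2          # how incorrect the prediction is
    gradient = 2 * (prediction - target) * x    # d(error)/dw, computed from output back to input
    w -= learning_rate * gradient               # adjust the connection to reduce the error

print(round(w, 3))   # -> approximately 0.5, where w * x equals the target
```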

[40] Hubel, David H. and Torsten N. Wiesel. "Receptive fields, binocular interaction and functional architecture in the cat's visual cortex." The Journal of Physiology 160, no. 1. 1962, p. 106-154.
[41] Ringach, D. L. "Mapping receptive fields in primary visual cortex." The Journal of Physiology 558, no. 3. 2004, p. 717-728.
[42] Fukushima, K. "Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron." Trans. IECE J62-A, no. 10. 1979, p. 658-665.
[43] Schmidhuber, Jürgen. 2015, p. 10.
[44] Fukushima, K. "Neocognitron". Scholarpedia. 2007.
[45] LeCun, Yann, Patrick Haffner, Léon Bottou, and Yoshua Bengio. "Object recognition with gradient-based learning." In Shape, Contour and Grouping in Computer Vision, pp. 319-345. Springer, Berlin, Heidelberg, 1999.
[46] Schmidhuber, Jürgen. 2015, p. 10.
[47] Linnainmaa, Seppo. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis (in Finnish), University of Helsinki. 1970.
[48] Schmidhuber, Jürgen. 2015, p. 11.



Back-propagation is somewhat similar to the theory of how neurons adapt in the human brain when learning.[49] The theory can be described as 'cells that fire together wire together'.[50] In other words, cells tend to become associated with each other when they are repeatedly active at the same time. Eventually, activity in one cell will then activate the other cell. The theory is known as Hebbian theory and was proposed by Donald O. Hebb.[51]

The term used today to describe the use of deep neural networks is 'deep learning'. Though the term has not been completely defined, it is mostly used for deep neural networks which can process relatively raw data, with the network learning features from the data with limited prior knowledge. In comparison, developers of shallow learning are often required to provide some prior knowledge and engineer specific features which may help the network.

Deep learning is considered a subfield of machine learning. While deep neural networks have been around for quite some time, it was not until 1986 that the term deep learning was introduced to machine learning, by Dechter, and seemingly not until 2000 to artificial neural networks, by Aizenberg.[52]

[49] 3Blue1Brown. "But what *is* a Neural Network? | Chapter 1, deep learning". 2017. Retrieved from https://www.youtube.com/watch?v=aircAruvnKk
[50] Hebb, Donald O. The Organization of Behavior: A Neuropsychological Theory. 1949.
[51] Hebb, Donald O. 1949.
[52] Schmidhuber, J. "Deep Learning". Scholarpedia. 2015.

Different techniques have been developed with the aim of creating machines which can imitate humans, using the brain as a role model. The way cells in the brain work and the way 'cells' in artificial neural networks work differ greatly. Nonetheless, considerable achievements have already been made, especially recently with deep learning.

Chapter 2.4 Deep Learning Revolution

Deep learning techniques have existed for quite some time. The basic ideas for modelling the actual neural networks of the human brain date back as early as the 1940s. Even though the model was found to be oversimplified, it was still metaphorically named a neural net. The use of metaphors continued with Rosenblatt's perceptron, an algorithm which is seen as a simple neural network; its inputs were named McCulloch-Pitts neurons. It seems that when researchers in the field of AI were talking about neural networks and neurons, they were speaking metaphorically, not about actual neurons and neural networks. The use of these metaphors will be elaborated on in chapter 3. Deep neural networks, meaning networks containing more than one hidden layer, date back to 1965.

The question which arises now is why these deep learning techniques were not used back then to achieve the current breakthroughs. Why is deep learning only now receiving such attention? Frank van Harmelen, professor in Knowledge Representation & Reasoning in the AI department at Vrije Universiteit Amsterdam, speaks of the 'deep learning revolution' starting around 2012-2013. In the history of artificial neural networks, periods of optimism and disappointment have alternated. After an optimistic period, artificial neural networks would show limitations, which would then be followed by a period of disappointment and therefore less interest in the field. The limitations could be theoretical or practical, and if some limitations could be fixed, others would come to light. For example, the network would not be able to 'learn', or if it could learn it would take a great amount of time. Another reason could be that the network needed such a great amount of training data that it was not practical to use. Van Harmelen suggests that a combination of three innovations has recently made the deep learning revolution possible.[53]

The first is that computing power has increased greatly over the last decades. Graphical processing cards, which were used in gaming machines, were unexpectedly found to be useful for the processing of artificial neural networks. Secondly, data is no longer scarce, due to the digitalisation of society and the availability and accessibility of large amounts of data on the internet. Websites keep track of all click streams, people upload photos to different sites, and banks have millions of bank transfer records. Thirdly, new algorithms have been created which have led to better results. These three innovations, new hardware, more data and new algorithms, have made the revolution possible.[54]

What is seen by some researchers as the starting point of the revolution are the results of the annual computer vision contest ImageNet in 2012. The competition was started by AI researcher Fei-Fei Li in 2009. She published a public database, 'ImageNet', with 3 million labeled images to be used for training. For example, an image of a dog would have the label 'dog' and an image of a cat the label 'cat'. Researchers then made algorithms to identify the objects in the images. The winners in 2012, who used a deep learning technique, achieved an error rate almost half that of the entry which took second place.[55] Since then, new milestones have been achieved, resulting in a deep learning revolution. Van Harmelen states that artificial neural network researchers are currently producing breakthroughs which still amaze experts on a weekly to monthly basis.[56]

Chapter 2.5 AI Winters


It is not the first time there has been such optimism about AI. As early as 1954 the Georgetown-IBM experiment demonstrated 'machine translation' (MT). In the experiment 60 sentences were translated from Russian to English. The system was not complete, as it contained only six grammar rules and a vocabulary of 250 words. Nonetheless, the experiment received a great deal of public interest and resulted in front page news in the New York Times. During the following months machine translation received further publicity from various other top American newspapers. Leon Dostert, who was in charge of the experiment, foretold that within five years it would be possible to translate several languages using electronic processes. Financial investments by US agencies in machine translation research became more prominent in 1956, during the Cold War, when the significance of automatic translation from Russian to English became apparent. In the following decade progress was slow, and the 'Automatic Language Processing Advisory Committee' (ALPAC) report of the US Government caused a retraction of funding in 1966.[57]

[53] Van Harmelen, F. (2018, June 25). Personal interview with G. Alberts.
[54] Ibid.
[55] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[56] Van Harmelen, F. 2018, June 25.

In 1973 another report, known as the 'Lighthill report', announced some pessimistic predictions for AI in general. The paper was intended to aid the British Science Research Council in evaluating requests to finance further AI research. As a result, the British Government retracted funding from many British universities.[58]

In 1974 the 'Defense Advanced Research Projects Agency' (DARPA) of the United States also retracted grants, of which 15 million dollars had already been given over five years. It concerned DARPA's 'Speech Understanding Research' (SUR) program. It is believed the desired speech understanding DARPA wanted to achieve was not met.[59]

The financial retractions unveiled disappointment in AI technology. Multiple high expectations that had been set could not be met. It started a period of low funding and less interest in AI research. The period would later be named the first 'AI winter' and lasted from around 1974 to 1980.[60]

The term 'AI winter' was first introduced in 1984 at the annual meeting of the 'Association for the Advancement of Artificial Intelligence' (AAAI). The term was taken from 'nuclear winter' as a worst-case scenario for AI researchers. A characteristic way such a winter would commence is that first the optimism in the research community would diminish. Then public opinion would follow, leading to AI researchers being ridiculed and funding and investments being stopped abruptly. At the AAAI meeting it was discussed that another AI winter could be possible in the near future. In that same year, Schank and Minsky also worried that the enthusiasm about AI technology was becoming uncontrollably large.[61]

At the beginning of the '80s AI began achieving breakthroughs again. Expert systems, which emulated the decision-making process of humans, flourished and were being used for real-life situations such as medicine and construction projects.[62] Optimism was growing in AI again, and in 1981 the Japanese Ministry of International Trade and Industry created a plan to develop the 'fifth generation computer'. The computer would be able to behave in ways similar to humans: recognising speech and images, which would make it possible to hold a casual conversation or interpret pictures. Also, it would be able to learn and make decisions. Once in production, it would be affordable enough to be used in households.[63][64] Another promising event were the new insights into the processing of artificial neural networks by John Hopfield in 1982. Also, a new method to train the networks was popularised. The two events led to the revival of the use of artificial neural networks.[65] By 1985, 150 companies were investing 1 billion dollars in total in the development of AI applications, and hardware and software companies were being set up to support the trend, with sales reaching 425 million dollars in 1986.[66]

[57] Hutchins, W. John. "The Georgetown-IBM experiment demonstrated in January 1954." In Conference of the Association for Machine Translation in the Americas, pp. 102-114. Springer, Berlin, Heidelberg, 2004.
[58] Smith, Chris, Brian McGuire, Ting Huang, and Gary Yang. "The history of artificial intelligence." University of Washington. 2006, p. 18.
[59] Crevier, Daniel. 1993, p. 115-117.
[60] Howe, Jim. "Artificial Intelligence at Edinburgh University: a Perspective". University of Edinburgh. 2007.
[61] Crevier, Daniel. 1993, p. 203-204.
[62] McCorduck, Pamela. 1979, p. 419.
[63] McCorduck, Pamela. 1979, p. 436.
[64] Crevier, Daniel. 1993, p. 211-212.

But by the end of the '80s the trend was again decreasing. In 1987 Apple and IBM both developed computers which were more powerful and cheaper than the specialised hardware made for AI. It was fatal for the specialised AI hardware industry. Furthermore, expert systems did not seem as useful to human experts as expected, and the goals set for the fifth generation computer, which had already received investments of 850 million dollars, were still not met.[67] The expectations had exceeded the possibilities. Again, funding was cut and interest in AI diminished, causing another AI winter in the late 1980s.[68]

Chapter 2.6 First-Step Fallacies

Since John McCarthy introduced the field of making machines behave intelligently under the name 'Artificial Intelligence' (AI) in 1956,[69] many overoptimistic goals have been set, with many failures. AI optimism was established after only a few successful goals had been reached, and these successes were seen as the basis for reaching the ultimate goals of AI researchers, such as creating a machine at least as intelligent as a human. Only, this basis was not a valid one. Logician Yehoshua Bar-Hillel noted that the AI optimism was based on the 'first-step fallacy'.[70] First-step fallacies have the characteristic that when a successful first step is established, the idea of a successful last step towards a certain goal is also created, without any argument provided in support. Hubert Dreyfus, philosopher and professor at the University of California, Berkeley, compares it with climbing a hill: a successful first step up a hill does not mean one can reach the sky by keeping on going.[71]

The overoptimistic goals set in AI research do not merely cause some disappointment when they are not achieved. They can have financial consequences, make the possibilities of the field less credible, and waste the time and effort of the people involved. Because of the optimism, researchers may not recognise that some goals are impossible to achieve in the first place. Researchers grind away at a solution until they find that the method used is unable to provide one. This is characteristic of a first-step fallacy. Dreyfus has analysed six first-step fallacies which have been made in the history of AI.[72] By analysing these fallacies, it can be understood how such a fallacy emerges, which can help identify current or future first-step fallacies in projects relating to AI.

[65] Crevier, Daniel. 1993, p. 215.
[66] Crevier, Daniel. 1993, p. 199-200.
[67] Crevier, Daniel. 1993, p. 210-212.
[68] McCorduck, Pamela. 1979, p. 442 & 532.
[69] McCorduck, Pamela. 1979, p. 111-115.
[70] Dreyfus, Hubert L. A History of First Step Fallacies. Berkeley: Springer Science+Business Media, 2012.
[71] Ibid.
[72] Ibid.

1. Good Old-Fashioned AI

At the end of the '50s, two computer scientists, Newell and Simon, claimed that the human mind and digital computers could be understood as physical symbol systems. Their initial success was that they were able to program computers to perform problem-solving activities. On that basis they claimed that intuition, insight and learning were not exclusive to the human brain, but that any high-speed computer could be programmed to show the same skills. They foresaw that in the near future computers and humans would be able to handle the same range of problems. The false assumption made was that physical symbol systems could be a basis for computers to handle problems the way humans do. The predictions failed to come to fruition because of the common-sense knowledge problem: computers do not have the ability to perform common-sense reasoning. The problem will be explained further in the following first-step fallacy. The use of symbolic AI would later be known as Good Old-Fashioned AI (GOFAI).

2. The Frame Problem


Another problem related to the common-sense problem that AI researchers encountered was knowing which facts are relevant in a specific situation. When an action occurs, such as closing the blinds in a room, how does a computer determine which facts remain unchanged and which should be updated? The intensity of illumination will have changed, shadows will have changed, but the number of chairs in the room has not. This is known as the 'frame problem'. Minsky introduced a solution where programmers could use 'frames'. A frame would allow computers to know which facts are relevant in a specific situation. When going to a birthday party, for instance, the 'birthday party frame' would be used. Only facts relevant to birthday parties would be employed, which could include balloons, cake, children receiving presents and so on. Minsky predicted in 1968 that within a generation intelligent computers like HAL in the film '2001: A Space Odyssey' would exist. However, problems soon emerged using the frames. Consider the following story presented to a computer:

'It was Jack's birthday. Mary and Jane were going to Jack's. "Let's take him a kite" says Mary. "No" said Jane "he's already got one, he'll make you take it back".'[73]

Because the birthday party frame would be used, the computer would be able to know Mary and Jane were going to Jack's for a birthday party. Though it was not explicitly said, using the frame it would also be possible to deduce that the kite was a present for Jack. The real difficulty emerged when deciding what 'it' referred to. Grammatically it could refer to the kite Jack already had or the one Mary wanted to give Jack. Which rules could be set to make 'it' refer to the new kite? One could be: 'You have to take the new one back, you cannot take the old one back'[74] or 'If you've already got one you do not want another one just like it'.[75] But the rules do not always apply. Not wanting something just like something else does not apply to dollar bills, for instance. Presumably, many pieces of knowledge like this exist.

[73] Dreyfus, Hubert L. "Artificial Intelligence: The Common Sense Problem". YouTube video, posted by INTELECOM, 11 Apr. 2018. https://www.youtube.com/watch?v=SUZUbYCBtGI
[74] Ibid.
[75] Ibid.

Suppose everything were stored in one long list; then the relevance of when which rule should apply would cease to exist.[76] Common sense is impossible to formulate: no multitude of facts will suggest a solution.

Another problem is that to recognise the relevant facts for a particular situation, the computer must first be able to recognise the situation and distinguish it from other situations, a step which had to be reapplied continuously. It indicated a flaw in the whole approach, demonstrating that Minsky falsely assumed the frame problem could be solved by using frames alone.

3. Expert systems


Expert systems were created to function in closed environments. In certain domains the systems were nearly able to act as well as human experts. In 1983 Feigenbaum advised the US to invest in expert systems so as not to fall behind Japan's fifth generation computer. The success of expert systems was seen as a step towards the goal of genuine expertise. But it turned out expert systems did not function as well as expected, and there had been no valid basis to believe this was even possible in the first place.

4. Cog

In 1991 Brooks published an article on 'ant-like' devices he had created, named 'animats'. Animats used sensors to pick up small sets of relevant features. The success Brooks had with these devices resulted in the overoptimistic step of making a humanoid robot named 'Cog'. Cog was supposed to show features such as speech and self-regulatory activities. The reason the animats were successful was that they were not required to respond to context, so they did not have to solve the frame problem. Thus, the frame problem remained present for Cog. The false assumption Brooks made was that artificial insect intelligence was a first step towards human intelligence.

5. Cyc

Douglas Lenat undertook a ten-year project to store and organise common-sense knowledge in a system called 'Cyc'. Lenat tried formalising what he called 'consensus reality': 'the things we assume everybody knows'.[77] Lenat was determined to define all common sense. It turned out to be a very difficult goal, as discussed in the second first-step fallacy, and Lenat has still not been able to achieve it.

6. Singularity is near

Chalmers argues that once AI becomes more intelligent than humans, an intelligence explosion will occur. As every AI generation creates a more intelligent AI generation, the technological singularity would be reached. The advancement of technology suggests to Chalmers that there will be successful AI, and once that is reached, AI+ will exist. AI+ would then create AI++, and so on. Chalmers assumes successful AI will be created in the not too distant future, though there is still no evidence to believe progress is being made towards successful AI. Thus, the step to AI+ is made on a false assumption.

[76] Ibid.
[77] Lenat, Douglas B., and Ramanathan V. Guha. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley, 1989.

Looking at the first-step fallacies, a pattern is perceived: they often start with a success. For example, performing successful problem-solving activities, breakthroughs within expert systems, or the creation of the animats. The achievements are then used to overestimate a prediction or to choose the next goal. A direct success is not always necessary: predictions and goals can also be based on the great amount of optimism in the technology.

Optimism about the progress of AI is seen in all of the first-step fallacies. Naturally, optimism about progress does not automatically make something a first-step fallacy; it can encourage people to work harder and achieve goals. It is therefore important to determine whether it is justified optimism or overoptimism. This can be examined by researching the assumptions on which the optimism is based. When an assumption is false, the optimism based on it can be seen as overoptimistic, and the goal or prediction resting on that overoptimism cannot be achieved. Thus, the most important way to identify a first-step fallacy is by finding a false assumption on which a prediction or goal is based. Analysing the assumptions makes it possible to deduce whether something rests on a first-step fallacy or not.


Chapter 3 Deep Learning Techniques

Chapter 3.1 Deep Neural Networks

In this chapter the technical side of deep learning is discussed. Deep learning makes use of neural networks. These are not actual neural networks from the brain; the term is used metaphorically. The neural networks in deep learning are algorithms. Neural network algorithms can differ in architecture and use, but all function in a similar way. The neural networks used in deep learning are inspired by the brain but resemble it only very loosely. Nonetheless, the algorithms are named 'artificial neural networks' (ANNs). Though the term was probably created to avoid confusion between actual neural networks and artificial ones, people within AI usually speak of 'neural networks' when they mean artificial neural networks. Because the resemblance between artificial neural networks and actual neural networks is low, it is debatable whether 'artificial neural network' is a correct term to use: the algorithms function very differently from actual neural networks. Many different artificial neural network architectures and methods exist. One such network is illustrated in figure 3.1.

[Figure 3.1: An artificial neural network.][78]

The network in figure 3.1 consists of five layers. The first layer receives all the inputs and is named the 'input layer'. The last layer produces the outputs of the network and is known as the 'output layer'. In between, three layers exist which are known as 'hidden layers'. The number of hidden layers can vary. Networks consisting of two or more hidden layers are considered 'deep neural networks' (DNNs); the network in figure 3.1 is therefore a deep neural network. Some deep neural networks consist of thousands of hidden layers. When a neural network has one hidden layer or none, it is considered a 'shallow network'. The data fed to the network flows from the input layer through the hidden layers and ends in the output layer, making it a 'feedforward network'.

The units in the input of the perceptron by Rosenblatt, discussed in chapter 2.1.3, were called McCulloch-Pitts neurons. In the artificial neural networks currently used, the term for the units which make up the layers of the network has become 'neuron'. As with the term 'neural network', 'neuron' is used metaphorically: the neurons are inspired by biological neurons, but resemble them only very loosely. They are therefore named 'artificial neurons', although within the field people simply speak of 'neurons'. Each neuron in the network represents a value.

All neurons are connected with all neurons in the next layer, but not with neurons in the same layer. Thus, every neuron in the input layer is connected with all neurons in the first hidden layer, every neuron in the first hidden layer with all neurons in the second hidden layer, and so on. This makes the network a 'fully connected network'. The connections between neurons are named 'weights' and also carry a value. Finally, all neurons except those in the input layer carry a value which acts as a threshold, named the 'bias'. The functions of these entities and how the neural network works will be explained in the following example.

Chapter 3.1 is mostly inspired by the explanations of Nielsen [79] and 3Blue1Brown [80], while chapter 3.2 is mostly inspired by the explanation of Ujjwalkarn. [81]

[78] Nielsen, Michael A. Neural Networks and Deep Learning. Determination Press, 2015.
[79] Nielsen, Michael A. 2015.
[80] 3Blue1Brown. 'But what *is* a Neural Network? | Chapter 1, deep learning', 'Gradient descent, how neural networks learn | Chapter 2, deep learning' and 'What is backpropagation really doing? | Chapter 3, deep learning'. 2017. Retrieved from https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
[81] Ujjwalkarn. 'An Intuitive Explanation of Convolutional Neural Networks'. Data Science Blog, 2016. Retrieved from https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

Chapter 3.1.1 Identifying number 9 on an image

Deep learning techniques can function in different ways, but they build on the basics of the following techniques. The techniques are especially used in computer vision but can also be applied in other domains. First, an example will be given of how a network can identify whether an image contains a 9 or not. Then, an explanation will be given of how a network is able to 'learn' to perform this task. The images will be black and white on a 7 by 7 matrix, so each consists of 49 pixels. Possible ways a 9 could be illustrated are:

Figure 3.2 Examples of number 9 on a 7x7 matrix.

All the pixels will be associated with a value for the network to process. Here, the value is either 1 or 0: black being 1 and white being 0. Thus, number 9 could be 'seen' in the following way by a network:


Figure 3.3 Number 9 on a 7x7 matrix visualised in 1's and 0's.
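To make this representation concrete, the following sketch (assuming NumPy; the pixel pattern itself is only a hypothetical 9) turns such a 7x7 matrix into the row of 49 values the network receives:

```python
import numpy as np

# A 7x7 black-and-white image of a 9 (1 = black, 0 = white); the
# exact pixel pattern is a hypothetical stand-in for figure 3.3.
image = np.array([
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
])

# The input layer receives these 49 pixel values in a row:
# one neuron per pixel.
input_layer = image.flatten()
print(input_layer.shape)  # (49,)
```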

All the values of the 49 pixels are used for the first layer of the network: the input layer. In all networks, the input layer is where the processing of information starts. Because the image contains 49 pixels, the input layer of the network will consist of 49 neurons: each neuron in the input layer represents a pixel. The input layer with the values of the matrix in figure 3.3 can be illustrated as follows. Note that not all 49 neurons are visualised, but only the first seven and the last seven:

Figure 3.4 The input layer illustrated with 49 neurons. Only the first seven and last seven neurons are shown; the remaining neurons lie in between.

Further, the network will consist of two hidden layers, each with ten neurons. In reality, networks might have more hidden layers with any other number of neurons; two hidden layers of ten neurons are chosen to simplify the example. The output layer consists of only one neuron. All neurons are connected with the neurons in the next layer; the connections between neurons are named 'weights'. The network can be seen in full in figure 3.5. While not every neuron in the input layer is visualised, every neuron in the input layer is connected with all neurons in the first hidden layer.

Figure 3.5 A DNN with 49 neurons in the input layer, two hidden layers with 10 neurons each and 1 neuron in the output layer.
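As a sketch of how such a network could be set up in code, the following builds the same 49-10-10-1 architecture with randomly initialised weights and biases, assuming NumPy and a sigmoid activation so that every neuron carries a value between 0 and 1; it is an illustration, not the thesis's exact model:

```python
import numpy as np

def sigmoid(z):
    # Squashes any value into the range 0..1.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer sizes: 49 input neurons, two hidden layers of 10, one output neuron.
sizes = [49, 10, 10, 1]

# One weight matrix and one bias vector per connection between layers.
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def feedforward(x):
    # Each layer's values are a weighted sum of the previous layer's
    # values, shifted by the bias and squashed by the activation.
    for w, b in zip(weights, biases):
        x = sigmoid(w @ x + b)
    return x

x = rng.integers(0, 2, 49)   # a stand-in for the 49 pixel values
print(feedforward(x))        # one value between 0 and 1
```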

The following will demonstrate a way the network could identify the number 9 on a 7x7 image, shown in four steps. The first step the network takes is to check all the values of the pixels: which pixel has a value of 1 and which a value of 0. The next step could be to combine the values to identify edges or stripes:

Figure 3.6 Two different edges and a stripe.

The step after that could be to combine the edges and stripes to identify loops and lines:

Figure 3.7 A loop and two different lines.

The final step could be to combine the loop and lines to identify the 9:

Figure 3.8 A 9 on a 7x7 image.

When imagining the steps in the layers of the network, the input layer covers the values of the pixels, the first hidden layer recognises the edges and stripes, the second hidden layer recognises the loops and lines, and the output layer recognises whether it is a 9 or not.

Figure 3.9 Function of each layer illustrated, without the connections between the layers shown.

Every neuron in the network has a different function. In the input layer, neurons have the function of representing the values of the pixels: each checks whether its pixel has a value of 1 or 0. In the first hidden layer, the function of a neuron could be to identify an edge or stripe. In the second hidden layer, the function of a neuron could be to combine the edges or stripes of the previous neurons into a loop or line. The neuron in the output layer could then combine the loops and lines to recognise a 9. The value of the output neuron lies between 0 and 1: when it is above 0.5, the network identifies the image as containing a 9; below 0.5, as not containing a 9. The output value also shows the certainty of the network: an output of 0.9 means the image looks more like a 9 than an output of 0.6 does.
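Reusing the feedforward sketch from above, the decision rule at the output neuron is a simple threshold (the 0.5 cut-off and the variable names come from that earlier, illustrative sketch):

```python
output = feedforward(x)[0]    # a certainty between 0 and 1, e.g. 0.83
contains_nine = output > 0.5  # above 0.5: classified as containing a 9
print(f"certainty {output:.2f}, contains a 9: {contains_nine}")
```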

The connections between the neurons, as seen in figure 3.5 and figure 3.10, are named 'weights'. They also represent a value, which is multiplied with the value of the neuron; the higher the value, the more important the connection. In the example, the weights help to identify an edge or stripe. A pixel which could represent part of an edge has an important connection with the neuron which identifies that edge; for the explanation, this neuron will be named the 'edge neuron'. Here, weights with a value of 2 or more are considered important connections.

A pixel which does not represent that edge but is connected with the edge neuron will have a connection which is not important: for the edge neuron it does not matter what value that pixel has. A weight with a value of 0 is considered an unimportant connection.

Figure 3.10 Four neurons of the input layer connected with the 'edge neuron' in the first hidden layer, showing the important connections.

Also, every neuron after the input layer has a certain threshold which must be passed for the neuron to become active. This threshold is called a 'bias'. Only if the bias is met will the neuron become active and 'fire' to the next neurons; in other words, the bias acts as a predisposition to remain inactive. For example, the edge neuron in the first hidden layer could have a bias of 4. If two neurons (pixels) in the input layer associated with that edge neuron have a value of 1 and are connected to it by weights with a value of 2, the values of the neurons are multiplied with the weights and added up: 1*2 + 1*2 = 4. The total value of that edge neuron would then be 4. To pass the bias, the value of the neuron should be 4 or higher, which is the case here. Having passed its bias, the neuron can fire to the next neurons.
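The arithmetic of this example can be replayed directly; a minimal sketch of the simplified threshold model used here, with the values from the example:

```python
# Two relevant pixels, both black (value 1), each connected to the
# edge neuron with a weight of 2; the edge neuron's bias is 4.
pixels = [1, 1]
weights = [2, 2]
bias = 4

# Weighted sum: 1*2 + 1*2 = 4.
total = sum(p * w for p, w in zip(pixels, weights))

# In this simplified threshold model, the neuron only 'fires'
# when the weighted sum reaches the bias.
fires = total >= bias
print(total, fires)  # 4 True
```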

The other neurons in the input layer (pixels) connected with that edge neuron but not relevant to it will have weights with a low value. When the pixels surrounding the edge have to be white, emphasising the edge, the weights associated with those surrounding pixels may even be negative.

Figure 3.11 Four neurons of the input layer connected with one neuron in the hidden layer, showing the biases.

The values of the weights from the input layer to the edge neuron in the first hidden layer can also be visualised in a matrix. Only the weights with a high value are relevant for the edge neuron; here, a high value is considered 2 and a low value 0.

Figure 3.12 Matrix showing the values of the input layer and a matrix showing the values of the weights connected to the 'edge neuron'.
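Seen as matrices, the edge neuron's total input is just an element-wise multiplication of the image matrix with the weight matrix, followed by a sum; a sketch assuming NumPy, with a hypothetical edge pattern in the top row:

```python
import numpy as np

# 7x7 input values (1 = black) and the weights connecting those 49
# pixels to the edge neuron, both hypothetical, as in figure 3.12.
image = np.zeros((7, 7), dtype=int)
image[0, 1:6] = 1                  # a horizontal stripe in the top row

weights = np.zeros((7, 7))
weights[0, 1:6] = 2                # only the stripe's pixels matter

# The edge neuron's total input: multiply matching entries and sum.
total = np.sum(image * weights)
print(total)                       # 10.0 here; the neuron fires if this reaches its bias
```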

When multiple edge neurons 'fire', together making up a loop, the loop neuron in the second hidden layer will be able to pass its bias and 'fire' to the output layer. When the loop neuron and the other neurons making up a 9 'fire', the neuron in the output layer will be able to recognise the 9.

A simplified version of how a deep neural network works has been given. The number of neurons and layers depends on the task, and experimenting is needed to find the best architecture. Deep neural networks can have thousands of hidden layers with millions of neurons. In reality, the network might not find edges and then loops, like most humans might, but will discover its own way to identify the 9's. This leads to the next point: how a network 'learns' to identify a 9.
