
Face Identification pre-trained with a Siamese network

submitted in partial fulfillment for the degree of

master of science

Emiel van Dongen

11845597

master information studies

data science

faculty of science

university of amsterdam

Date of defence: 2018-07-19

Internal Supervisor: dr. Thomas Mensink (UvA, FNWI, IvI)

External Supervisor: Robbert van Hintum (Xomnia)


Face Identification pre-trained with a Siamese network

Emiel van Dongen

Informatics Institute

University of Amsterdam

emiel.dongen@gmail.com

abstract

Face identification has become an increasingly large domain in the field of computer vision. It is widely researched as well as applied in real life. A face identification system for, e.g., surveillance cameras should be able to detect faces as well as identify them. A new dataset is created containing 46 surveillance videos recorded with a camera at a high angle. This paper analyzes three methods for face detection: the Deep Neural Network face detector of OpenCV (the Open Source Computer Vision library), Histograms of Oriented Gradients (HOG) and the "Finding Tiny Faces" detection method, after which the best method is used to extract faces from the videos. Using a train and test set with 764 and 1270 of the extracted faces respectively, we evaluate the effectiveness of two straightforward identification methods. Furthermore, a demo is created that takes a video as input and outputs a list of names of the people present in the video. Lastly, we have tried to improve our baseline (generated by the models mentioned above) by training four Convolutional Neural Networks (CNNs), of which two were pre-trained with a Siamese network.

1 introduction

Face identification is widely used in various areas with different objectives. A general description of face identification can be formulated as follows: given still or video images of a scene, detect and identify one or more persons in the scene using a stored database of faces [3]. Facebook applies this technique to the pictures people post online and sends a notification if a picture is posted with your face in it. Another large social media platform, Snapchat, lets users alter their face after Snapchat has detected it. There are even experiments to take attendance at schools automatically.

Since it is used in so many areas, it has been studied substantially as well [8]. More and more large-scale datasets have become available in recent years to train and test face detection and identification algorithms. Different methods have been examined, varying from relatively simple methods [16] to complex neural networks, as well as the application on different datasets such as still images of frontal faces and videos captured from a distance. This project consists of two major goals: creating a face identification demo for a surveillance camera and researching the possibility of improving a basic face identification algorithm with a Siamese network.

The first goal includes building a face detection model that iterates through the frames of a video and saves the faces that appear in each frame, after which a face identification model identifies the people in the extracted images. To create this demo, two experiments have been completed: a test with three different face detectors (the Deep Neural Network face detector of OpenCV, Histograms of Oriented Gradients and the "Finding Tiny Faces" detection method) and a test with two basic methods that can be used for identification, which creates a baseline for the second goal of this project. The demo adopts the best performing detection and identification methods and converts the input (a video) to a list containing the names of the people in the video.

The second goal is to improve the baseline by testing four different convolutional neural networks: a CNN trained from scratch, a CNN pre-trained on the ImageNet dataset, a CNN pre-trained with a Siamese network, and a CNN pre-trained on the ImageNet dataset as well as with a Siamese network. All of the neural networks are based on the ResNet (residual network) framework, which has proven to perform well on image and face classification [10, 8].

A Siamese network aims to return similar vectors for similar images and vice versa. Intuitively, a neural network pre-trained with a Siamese network will already have learned to recognize important features of faces. Furthermore, a Siamese network takes pairs of images, which means that a small dataset can increase drastically in size. To our knowledge, there have not been any experiments applying this way of pre-training a CNN to a face dataset.

With these experiments we aim to answer the following main research question:


• Using a small face dataset, how can we achieve an identification performance high enough for the model to be used in real time?

With the following sub research questions:

• Which face detection method has the highest performance on our dataset?

• Does a complex convolutional neural network perform better than much simpler methods?

• Does adding a Siamese network to the pre-training process add to the performance of the model?

2 related work

Face identification has been a popular subject in computer vision for decades. New methods have been developed to detect faces in videos or images as well as to identify them. This section first reviews recent research on face detection, after which we discuss methods for face identification.

2.1 Face detection

Face detection has become one of the most studied subjects in computer vision [20, 4, 11] due to the enormous number of applications that require it. Researchers have experimented with numerous methods; the first face detection method that became popular is the Viola & Jones method [18]. Their classic method rests on three key contributions: an image representation called the "Integral Image", a simple and efficient classifier built using the AdaBoost learning algorithm, and a method for combining classifiers in a cascade. The first contribution allows the features to be computed fast, the second creates the opportunity to select a small number of critical visual features, and the third allows background regions of the image to be quickly discarded in order to spend more computation time on promising face-like regions.

A newer and improved detection method uses a Histogram of Oriented Gradients (HOG). This method creates a representation of an image by replacing the pixels with intensity vectors. HOG has proven to be a successful object detection method since it was created by Dalal & Triggs [5]. Deniz et al. [6] have experimented with HOGs for face detection and achieved 95% accuracy on the FERET dataset (which contains frontal face images).

Hu & Ramanan [11] focus their research on three aspects of the face detection problem: the role of scale invariance, image resolution and contextual reasoning. The vast majority of face detection research aims to be scale invariant, but the cues for recognizing a 300 pixel tall face are fundamentally different from those for recognizing a 3 pixel tall face. Features extracted from multiple layers of a single feature hierarchy can be used to detect tiny faces in a picture. The best results were generated by training heatmap predictors using a fully convolutional network defined over a ResNet architecture. This algorithm is currently one of the newest face detection methods.

OpenCV [2] released version 3.3.0 with a highly improved deep neural networks module [15]. The deep learning face detector of OpenCV is based on the Single Shot Detector (SSD) framework with a ResNet base network. Liu et al. [19] originally developed the SSD, which localizes the objects in a frame as well as classifies them; the SSD does these two tasks in a single forward pass of a network.

Our work evaluates the use of HOGs and two of the face detection methods based on convolutional neural networks. We compare the results of an older and simpler method to one of the newest and far more complex methods available, as well as a method that can easily be adopted using an open source Python library.

2.2 Face identification

One of the oldest and simplest classification and pattern recognition methods is the k-Nearest Neighbor (k-NN) algorithm [12], which can be used for face identification as well [16]. Although other methods like Support Vector Machines have been proven to achieve a better performance, k-NN is usually chosen due to its fast execution time on a small dataset combined with a reasonable performance [7].

Ahonen et al. [1] used Local Binary Pattern Histogram (LBPH) representations for face identification. They split an image containing a face into smaller parts and extract the LBP histograms from these parts. This paper led to multiple other studies and findings in the same field, including the findings of Zhang et al. [21], who experimented with the LBPH method and added the AdaBoost learning algorithm in order to select the LBP based features and check the similarity between two images.


Table 1: Data

           Videos   Frames    Hours   Persons   Unique persons   Faces
Training   41       914,760   118     NA        48               764
Testing    5        99,030    13      149       48               1270

Krizhevsky et al. [14] achieved a breakthrough by classifying the ImageNet database with a deep convolutional neural network. He et al. [10] have created a way to train even deeper neural networks without an increase in error and computation complexity, making use of residual learning. Gruber et al. [8] experimented with residual learning for the face identification task, achieving over 90% identification rate on the Casia-WebFace database (containing public frontal face photos). Koch et al. [13] have used a CNN in a Siamese network architecture with one-shot learning for image classification, which resulted in 92% accuracy on the Omniglot dataset (which contains handwritten characters from different languages).

Our work differs in experimenting with these methods on a dataset from surveillance cameras, where face images are not always frontal or of high quality. Additionally, implementing the Siamese network architecture for the task of pre-training has, to our knowledge, not yet been experimented with.

3 data

In total, the dataset contains 46 videos, which corresponds to roughly 130 hours of data. These videos are used to create one training and two test sets. A test set is required for the face detection part, and both a training and a test set are required for the identification part.

The test set for detection contains one week of data, which is equal to thirteen hours. Two frames per second were extracted from the test set, which resulted in 99,030 frames. The motivation to extract two frames per second was that the computation time of the detectors would become too high if every frame was used, and that a person walks up the stairs in around five seconds, which means that a person who walks past the camera appears in approximately 10 frames. There are 149 people in the test set and we aim to extract at least one picture of every person. Figure 2 shows what the camera records if no one is in the frame and figure 3 shows what it looks like when someone walks up the stairs.

Figure 2: Empty frame

After the first experiment, the best performing face detector is used to detect and extract images of faces in all of the videos. The extracted images of the same week as in the first experiment are used as an (unaltered) test set for the second experiment. The rest of the videos are used to extract images for the training set. Since we want a more balanced training set as well as images with a high resolution, we have handpicked the images per person. The training set contains 49 classes (48 people and 1 class for negative images) and the classes consist of different amounts of images, varying from 3 to 32 (figure 1). The test set has the same 49 classes, but is much more imbalanced, because some people appear more in the videos than others. A number of these images have been filtered out in the training set, but not in the test set, which is also why the test set is larger than the training set. Eventually, a training and a test set were created with respectively 764 and 1270 grayscale images of size 224x224.

The required dataset for a Siamese network consists of pairs of images, similar and dissimilar. Since an image can be paired with every other image in the dataset, a small dataset like ours can be significantly increased in size this way. We have created a dataset with 50,000 image pairs, with a ratio of 0.35 similar and 0.65 dissimilar (far more dissimilar pairs of images can be made). Data augmentation has been applied: a horizontal flip with a probability of 0.3 and a random rotation between 0 and 20 degrees.
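As an illustration of how such a pair dataset could be assembled, below is a minimal sketch. The function name, the dictionary layout and the sampling strategy are our own illustration, not the thesis code; the augmentation parameters and the label convention (0 = similar, 1 = dissimilar, matching the contrastive loss in section 5.3.2) follow the text above.

```python
import random
from torchvision import transforms

# Augmentations as described above: horizontal flip (p=0.3), rotation of 0-20 degrees.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.RandomRotation(degrees=(0, 20)),
])

def make_pairs(images_by_class, n_pairs=50_000, similar_ratio=0.35):
    """Sample (image1, image2, label) tuples from a dict mapping each class
    to its list of PIL face images. Label 0 = similar, 1 = dissimilar."""
    classes = list(images_by_class)
    pairs = []
    for _ in range(n_pairs):
        if random.random() < similar_ratio:
            c = random.choice(classes)
            a, b = random.sample(images_by_class[c], 2)  # two images, same person
            label = 0
        else:
            c1, c2 = random.sample(classes, 2)           # two different people
            a = random.choice(images_by_class[c1])
            b = random.choice(images_by_class[c2])
            label = 1
        pairs.append((augment(a), augment(b), label))
    return pairs
```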


Figure 1: Distribution of train classes

4 face detection methods

Three different face detection methods are tested. The first is the OpenCV face detector, trained with a deep neural network. The second is the Dlib face detector, which uses Histograms of Oriented Gradients; this is one of the simplest and earliest developed methods for object detection. Lastly, we test the "Tiny Face Detector".

4.1 Histogram of Oriented Gradients

The local intensity gradient is a directional change of the intensity of the color in an image. HOG features are based on the idea that the appearance and shape of a person can be characterized by these gradient vectors. This method is robust to changes between pictures of the same person (e.g. lighting or positioning), since creating the histograms gives translational invariance [17].

Figure 3 shows how an image is converted to a HOG representation. The contours of the person in the image are clearly visible, as are the eyes and mouth (as well as some objects in the background).

Figure 3: HOG representation
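For reference, Dlib's HOG-based detector, the implementation compared in section 6, can be invoked in a few lines. This is a sketch; the file path and the upsampling factor are illustrative.

```python
import dlib
import cv2

# Dlib's frontal face detector: HOG features plus a linear SVM.
detector = dlib.get_frontal_face_detector()

img = cv2.imread("frame.jpg")                    # illustrative path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
boxes = detector(gray, 1)                        # 1 = upsample once, helps smaller faces
for box in boxes:
    print(box.left(), box.top(), box.right(), box.bottom())
```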

4.2 OpenCV DNN face detector

An RGB (colored) frame is the input for this detector. The three color dimensions are cut off and the image features are pre-processed by creating a blob (Binary Large OBject), an object that contains all the relevant information about each contour [2]. The pre-processing steps that are conducted when the blob is created are: supplying the spatial size that the network expects and subtracting mean values. A preloaded CNN is used to detect faces in this blob.

4.3 Tiny Face Detector

Hu et al. developed this detector in order to solve the challenge of detecting small objects. They searched for the best template, which can be seen as a scanning-window detector, to be used for a fixed-size image, treating object detection as a binary heatmap prediction problem where pixel position (x, y) specifies the confidence of a fixed-size detection centered at (x, y).

Hu et al. discuss the role of scale, resolution and context. Additional context (enlarging the template) helps specifically for small faces; it is no longer beneficial for faces larger than 300 pixels. They also found a way to improve test results based on the resolution of an image, by downsampling a large image (larger than 140 pixels in height) by 2 and upsampling a small image (smaller than 40 pixels) by 2, using a medium sized template for both. Images of a size between 40 and 140 pixels keep their original resolution.
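That resolution rule is compact enough to state as code. This is our paraphrase of the heuristic, using the pixel thresholds quoted above.

```python
def rescale_factor(face_height_px: int) -> float:
    """Hu & Ramanan's resolution heuristic: downsample large faces by 2,
    upsample small faces by 2, keep medium faces at their original size."""
    if face_height_px > 140:
        return 0.5
    if face_height_px < 40:
        return 2.0
    return 1.0
```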

The model makes use of multi-task learning, which lets the model perform detections at different scales while sharing information across the scales. The architecture of the model is divided into a 3-level image pyramid, with the up- and downsampling included in the pyramid. An image is fed into the pyramid, up- and downsampled, after which the three images are fed into three convolutional neural networks. These networks predict template responses, which are then merged and given as output for the final detection.

5 face identification methods

This section covers the three different methods (with CNNs implemented in four different ways) that are used to build a functional face identification model.

5.1 Local Binary Pattern Histograms

This recognizer uses the LBP combined with histograms to create a vector representation of the face images. The LBP operator labels the pixels by sliding a filter over the image, where the center pixel $(x_c, y_c)$ functions as a threshold for the surrounding pixels [1]:

$$LBP(x_c, y_c) = \sum_{m=0}^{n-1} s(i_m - i_c)\, 2^m$$

where $m$ loops over the $n$ neighbors of the center pixel, $i_m$ and $i_c$ are the gray level values of the surrounding and center pixel, and $s(x)$ is 1 if $x \geq 0$ and 0 otherwise. If the result of $s(i_m - i_c)$ is 1, it contributes the corresponding integer $2^m$. The filter then moves one pixel further until the whole image is converted (figure 4).

Figure 4: LBP representation

To create an LBP representation, there are four parameters: the radius, the neighbors, grid X and grid Y. The radius equals the number of pixels between the center and the circle of data points. The neighbors parameter represents how many pixels are used to calculate the new binary value of the center pixel. Grid X and Y are the number of cells in the horizontal and vertical direction respectively; these two numbers tell the recognizer how large the grid has to be for each histogram. The image is then divided into multiple grids, and the histograms are extracted and concatenated. This is done for the training and test data, and the recognizer compares a test image with every image in the training data and returns the similarity between single images as a result.
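These four parameters map directly onto the LBPH recognizer in OpenCV's contrib package (opencv-contrib-python). A minimal sketch with dummy data; the parameter values are the best-performing ones from section 7.1, and the dummy images are placeholders for the real 224x224 grayscale faces.

```python
import cv2
import numpy as np

# Tiny dummy training set standing in for the 224x224 grayscale faces.
train_images = [np.random.randint(0, 256, (224, 224), dtype=np.uint8) for _ in range(4)]
train_labels = np.array([0, 0, 1, 1])

# radius, neighbors, grid_x and grid_y as described above.
recognizer = cv2.face.LBPHFaceRecognizer_create(radius=8, neighbors=12,
                                                grid_x=5, grid_y=5)
recognizer.train(train_images, train_labels)

# predict() returns the best-matching class and a distance-like confidence
# (lower means the test histogram is closer to that class).
label, confidence = recognizer.predict(train_images[0])
```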

5.2 K-Nearest Neighbor

k-NN is a data classification method which can be used as a face identification method [16]. The M×N-dimensional (row-column) features are converted to an MN-dimensional vector. Each of these vectors represents a data point in an MN-dimensional space. Ideally, the vectors of the same person are near each other. When a new data point is added to the space, we can determine what its nearest neighbors are. K stands for the number of nearest neighbors to look at, which is the only parameter for this algorithm. The distance between the new data point and the neighbors is usually calculated with the Euclidean distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{N} (q_i - p_i)^2}$$

where $N$ is the number of variables in the vector, and $q_i$ and $p_i$ are the values of the $i$th variable at points $p$ and $q$ respectively. The prediction is the class to which the majority of the nearest neighbors belong.

5.3 Convolutional neural networks

As mentioned in the related work section, convolutional neural networks have been used for face detection and identification on more than one occasion. In this research, we have experimented with four different CNNs. Before we dive into the architecture of the network, let us first explain what a convolutional neural network is.

A CNN is a neural network with an input layer, hidden layers and an output layer. The hidden layers that typically appear in a CNN are convolutional, pooling, activation and fully connected layers [14]. So-called building blocks contain these kinds of layers and together form a CNN architecture. These blocks transform the input layer to an output layer, which then continues to the next block. A convolutional layer takes an input and slides a filter of, e.g., 3x3x3 over every possible area of the image (no partial filters) with a certain stride. The stride determines the steps that the filter moves over the image (usually 1). The pixels in the area of the image covered by the filter are multiplied by the numbers in the filter; these multiplications are summed and the result is one number at the position of the pixel in the middle. Pooling works the same way as a convolutional layer: sliding a filter over every possible area of the image. Only with pooling, the action the filter performs depends on the kind of pooling: with max pooling, the filter simply takes the highest value of every subregion. For pooling, a filter of 2x2 and a stride of 2 are typically used. Next, the activation layer determines if a neuron in the network should be activated or not; in other words, it gives a weight to the neurons. There are multiple ways to add an activation layer, but the fastest is a ReLU layer [14], which applies the function f(x) = max(0, x). With this function, the neuron becomes 0 if it is negative and keeps its value if it is positive. At last, a fully connected layer is added to the network to convert the input (the output of the last layer) to an N-dimensional vector (N is the number of classes).
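To make these building blocks concrete, a minimal sketch in PyTorch (the framework used in section 7); the channel counts are illustrative, and N = 49 follows the class count from section 3.

```python
import torch
import torch.nn as nn

# One building block (convolution, activation, pooling) plus a classifier head.
block = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),  # slide 3x3 filters over the image
    nn.ReLU(),                                             # f(x) = max(0, x)
    nn.MaxPool2d(kernel_size=2, stride=2),                 # keep the max of every 2x2 region
)
head = nn.Linear(16 * 112 * 112, 49)   # fully connected layer to N = 49 class scores

x = torch.randn(1, 1, 224, 224)        # one grayscale 224x224 image
scores = head(block(x).flatten(start_dim=1))
```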

5.3.1 Residual network

Very deep neural networks are difficult to train, due to exploding gradients, which occur when accumulated gradients become too large; deeper neural nets therefore do not always result in a higher performance. Residual learning solves this problem by using skip connections between layers: extra connections between nodes in different layers of the network that skip one or more layers.
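A minimal residual block, sketched in PyTorch. This is illustrative only; the actual ResNet18 blocks also use batch normalization and strided convolutions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The input skips past two convolutional layers and is added back."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # the skip connection: add the input back in
```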

5.3.2 Siamese network

A Siamese network does not classify its inputs like a normal neural network. Instead, it learns to detect similarities between them. A Siamese network consists of two (convolutional) neural networks that are identical in architecture as well as in weights and biases. Two input images each go through the layers of one of the networks, and the last layers are fed into a loss function, which determines the similarity between the two images. The loss is backpropagated through the network, where the parameters are updated in the same way, and new pairs of images are fed into the two networks.

Contrastive Loss: we use a contrastive loss function where similar and dissimilar pairs contribute separately to the loss:

$$L(W, Y, \vec{X}_1, \vec{X}_2) = (1 - Y)\,\frac{1}{2}(D_W)^2 + (Y)\,\frac{1}{2}\{\max(0, m - D_W)\}^2 \qquad (1)$$

where $m > 0$ is a margin, $Y$ is the label (0 for a similar pair, 1 for a dissimilar pair, following [9]), $D_W$ is the Euclidean distance between the vectors $\vec{X}_1$ and $\vec{X}_2$ (the outputs of the neural network) and $W$ are the parameters of the function $G_W$, which is used in calculating $D_W$:

$$D_W = \left\| G_W(\vec{X}_1) - G_W(\vec{X}_2) \right\|_2$$

This contrastive loss function is employed to pull neighbors together and push non-neighbors apart by learning the parameters of the neural network [9].
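A sketch of equation (1) and the weight sharing in PyTorch. The embedding network below is a stand-in for the actual CNN, and the margin of 2 matches the implementation details in section 7; everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveLoss(nn.Module):
    """Equation (1): Y = 0 for similar pairs, Y = 1 for dissimilar pairs."""
    def __init__(self, margin: float = 2.0):
        super().__init__()
        self.margin = margin

    def forward(self, out1, out2, y):
        d = F.pairwise_distance(out1, out2)  # Euclidean distance D_W
        return ((1 - y) * 0.5 * d.pow(2)
                + y * 0.5 * torch.clamp(self.margin - d, min=0).pow(2)).mean()

# Both inputs pass through the SAME module G_W, so the "twin" networks share
# weights and biases by construction.
embed = nn.Sequential(nn.Flatten(), nn.Linear(224 * 224, 128))  # stand-in for the CNN
x1 = torch.randn(8, 1, 224, 224)
x2 = torch.randn(8, 1, 224, 224)
y = torch.randint(0, 2, (8,)).float()
loss = ContrastiveLoss(margin=2.0)(embed(x1), embed(x2), y)
loss.backward()  # the shared parameters receive gradients from both branches
```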


6 detecting faces in custom dataset

Three face detectors were tested with different parameters. The results are evaluated with a precision and a recall score. Since the goal is to extract every person who is present in the frames, and because we can incorporate a class "negative" in the dataset for the face identification part, the recall is the most interesting score:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where TP is the number of unique persons extracted and FN is the number of unique persons missed. Precision is calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

where FP is the number of negative images (not a face). More insight into the performance of the detectors is gained by calculating a second recall score, where TP is the number of faces extracted and FN is the number of faces missed. A partial dataset of 312 faces (in 300 frames) was created to determine this performance. These frames were manually chosen (all of them contain one or multiple faces).

Implementation details: All of the face detectors require a confidence level determining how certain the model has to be before extracting a face from the frame. This level has to be between 0 and 1 (100% certainty). The experiment is executed with confidence levels of 0.80, 0.50 and 0.20.

6.1 Results

Table 2: Performance of face detectors (%)

Detector     Conf. Level   Prec.   Recall-1   Recall-2
OpenCV DNN   0.80          97.6    72.8       57.1
OpenCV DNN   0.50          92.8    81.3       69.9
OpenCV DNN   0.20          71.1    85.2       84.6
Tiny Faces   0.80          100.0   88.6       89.1
Tiny Faces   0.50          98.9    93.3       92.6
Tiny Faces   0.20          98.1    95.3       93.6
Dlib HOG     0.80          92.1    48.3       32.7
Dlib HOG     0.50          84.5    55.7       35.9
Dlib HOG     0.20          79.2    61.1       45.5

Based on the precision and recall on a one-week sample of the dataset, the tiny face detector of Hu et al. exceeds the performance of the other methods. Additionally, the tiny face detector performs best on the second recall test. Since there is not much difference between the precision scores at the three confidence levels of the tiny face detector, the one with the highest recall is preferred. This means that the dataset for experiment 2 is created using the tiny face detector with a confidence level of 0.20.

7 identifying extracted faces

The second experiment consists of two parts: testing the simpler k-NN and LBPH identification methods, and trying to achieve a higher performance by experimenting with the four proposed CNNs.

The proposed methods have to work with an imbalanced dataset during training as well as during testing. In order to observe the performance when the methods can train and test in a more ideal situation, the same tests are executed on the top 20 classes (where each class has more than 15 images in the training set).

During these experiments, the score metrics are accuracy (the percentage of correctly classified faces) and the F1-score, which is calculated as follows:

$$F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Implementation details part 1: The images in both the training and the test set are flattened, after which the k-NN algorithm is tested with k = [1, 2, ..., 14, 15]. The parameters of the LBPH recognizer have been tested as follows: r = [5, 6, 7, 8, 9], n = [10, 12, 14], X = [4, 5, 6] and Y = [4, 5, 6].
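The sweep itself is a plain grid search; a sketch follows. The `evaluate` function is a hypothetical placeholder for training a recognizer with given parameters and scoring it on the test set; only the parameter grids come from the text above.

```python
from itertools import product

def evaluate(radius, neighbors, grid_x, grid_y):
    """Hypothetical placeholder: train an LBPH recognizer with these
    parameters and return its accuracy on the test set."""
    return 0.0

# The grids described above: r in 5..9, n in {10, 12, 14}, X and Y in {4, 5, 6}.
best_params, best_acc = None, -1.0
for r, n, gx, gy in product(range(5, 10), (10, 12, 14), (4, 5, 6), (4, 5, 6)):
    acc = evaluate(r, n, gx, gy)
    if acc > best_acc:
        best_params, best_acc = (r, n, gx, gy), acc
```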

Implementation details part 2: In this part of the experiment, we employ the four CNNs mentioned in section 1. The ResNet18 [10] architecture is loaded four times from the PyTorch package in Python with the pre-trained parameter set to True or False. This gives four models: two pre-trained on the ImageNet dataset and two with random weights. One of each is used for training and testing on our dataset directly; the other two are first pre-trained with the Siamese network structure. We will henceforth refer to these as res, res-ImageNet, res-SiameseNet and res-All: the network without pre-training, with pre-training on ImageNet, with pre-training on the Siamese network, and with pre-training on ImageNet as well as on the Siamese network, respectively. Table 4 shows the results and figure 5 shows the learning curves for each of the convolutional neural networks.

A contrastive loss function with a margin of 2 and the Adam optimizer with a learning rate of 0.0001 are applied to res-SiameseNet and res-All. These models are trained for 20 epochs with batch size 128.

We have applied a cross-entropy loss function and the Adam optimizer with a learning rate of 0.001 to all the models for the classification training. A scheduler with a multiplicative factor of 0.1 and a step size of 7 is applied to the learning rate. The models are trained with batch size 64 for 25 epochs.
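Put together, the classification setup just described could look roughly as follows in PyTorch. This is a sketch based on the hyperparameters listed above; the 49-class head follows section 3, and the variable names are ours.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# ResNet18 with or without ImageNet weights (the torchvision API of 2018).
model = models.resnet18(pretrained=True)         # pretrained=False for res/res-SiameseNet
model.fc = nn.Linear(model.fc.in_features, 49)   # 49 classes: 48 people + negatives

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
# Multiply the learning rate by 0.1 every 7 epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
```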

7.1 Results part 1

Table 3: Performance of face recognizers (%)

Algorithm         Dataset   Accuracy   F1-score
KNN-9             Top 20    55.4       49.6
KNN-5             All       38.2       35.3
LBPH-[8,12,5,5]   Top 20    83.3       83.5
LBPH-[8,12,4,5]   All       72.9       72.2

Table 3 shows the accuracy and F1-score of the k-NN and the LBPH method on both the entire dataset and on the top-20 dataset. It is interesting to see that reducing the dataset to 20 classes results in a noticeable increase in performance of over 10%. Evidently, the LBPH recognizer exceeds the performance of the k-NN classifier, with a best result of 72.9% on the entire dataset.

7.2 Results part 2

We were not able to improve on the performance of the LBPH classifier, but there are some interesting results. First of all, there is one enormous outlier (bottom row of table 4). This model, with an accuracy of 88.3%, was produced by res-ImageNet. Unfortunately, we could not reproduce the result, which implies this was a lucky shot. Except for the outlier, the results of res, res-ImageNet, res-SiameseNet and res-All were consistent throughout five separate training rounds. Table 4 shows the models with the best performance.

Table 4: Performance of CNNs (%)

Algorithm        Dataset   Accuracy   F1-score
Res              Top 20    58.4       58.1
Res              All       53.4       51.3
Res-ImageNet     Top 20    77.3       78.9
Res-ImageNet     All       64.4       63.8
Res-SiameseNet   Top 20    61.3       62.5
Res-SiameseNet   All       48.3       49.2
Res-All          Top 20    80.4       80.3
Res-All          All       67.2       67.9
Outlier          All       88.3       88.8

None of the performances exceeded the LBPH classifier, and pre-training on the Siamese network shows a decrease in accuracy. Res-SiameseNet and res-All both fall short compared to res and res-ImageNet respectively, suggesting that the proposed pre-training is unhelpful for a convolutional neural network. Figures 6 and 7 show the normal distributions of the predicted Euclidean distances of 128 image pairs (similar and dissimilar) with and without training. The Siamese network clearly learns from training whether two images belong to the same class, but not as well as it should. The loss does not decrease below 0.39 and it reaches this point after 300 iterations in the first epoch, which implies that it stops learning after that moment. Since a margin of 2 is used for the contrastive loss function, a loss of 0.39 suggests that the model is not able to learn similarities well enough on this dataset.

Figure 5: Training
Figure 6: Normal distribution without training
Figure 7: Normal distribution with training

8 conclusions and future work

We have collected a new dataset containing videos, face images and annotations for training and evaluating face detection and identification based on surveillance footage (given the angle and placement of our camera). The HOG detection method, which has proven to work well on frontal face images, performs drastically worse than the other methods. The adopted face detection algorithm, finding tiny faces, reaches the highest performance, as shown in table 2. We have applied multiple face identification algorithms and proposed to pre-train the CNNs with a Siamese network. The LBPH classifier reached the highest performance, and the convolutional neural networks were not able to improve on this result on the test set (except for one lucky shot). If we compare the results of the CNNs that were not pre-trained with a Siamese network with the two CNNs that were, we see a decrease in performance when the ResNet18 model is loaded without pre-training on ImageNet and a slight increase if the model is pre-trained on ImageNet. However, it is interesting to see that res-SiameseNet and res-All both perform better when a more balanced dataset with fewer classes is used. For further research we would suggest using a more balanced dataset with more high-quality images per class. The architecture of a Siamese network with one-shot learning could be explored; however, since our Siamese network stops learning so quickly, this would require a better dataset as well. Beyond that, other algorithms or methods can be applied for face detection and face identification.

9 acknowledgements

I would like to thank my supervisor Thomas Mensink from the University of Amsterdam for his help with the design and execution of the experiments, and my supervisor Robbert van Hintum from Xomnia for his support in the process of writing this thesis.

References

[1] T. Ahonen, A. Hadid, and M. Pietikainen. Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, Dec 2006.

[2] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.

[3] R. Chellappa, C. L. Wilson, and S. Sirohey. Human and machine recognition of faces: a survey. Proceedings of the IEEE, 83(5):705– 741, May 1995.


[4] Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun. Joint cascade face detection and alignment. In ECCV 2014, pages 109–122, 2014.

[5] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.

[6] O. Déniz, G. Bueno, J. Salido, and F. De la Torre. Face recogni-tion using histograms of oriented gradients. Pattern Recogn. Lett., 32(12):1598–1603, September 2011.

[7] Dhriti and Manvjeet Kaur. K-nearest neighbor classification approach for face and fingerprint at feature level fusion. International Journal of Computer Applications, 60(14):13–17, December 2012.

[8] Ivan Gruber, Miroslav Hlaváč, Miloš Železný, and Alexey Karpov. Facing face recognition with resnet: Round one. In Andrey Ronzhin, Gerhard Rigoll, and Roman Meshcheryakov, editors, Interactive Collaborative Robotics, pages 67–74, Cham, 2017. Springer International Publishing.

[9] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Con-ference on Computer Vision and Pattern Recognition (CVPR’06), vol-ume 2, pages 1735–1742, 2006.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[11] Peiyun Hu and Deva Ramanan. Finding tiny faces. CoRR, abs/1612.04402, 2016.

[12] A. G. Jivani. The novel k nearest neighbor algorithm. In 2013 In-ternational Conference on Computer Communication and Informatics, pages 1–4, Jan 2013.

[13] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, 2015.

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.

[15] Adrian Rosebrock. Face detection with opencv and deep learning. https://www.pyimagesearch.com/2018/02/26/face-detection-with-opencv-and-deep-learning/. Accessed: 06-05-2018.

[16] Eko Setiawan and Adharul Muttaqin. Implementation of k-nearest neighbors face recognition on low-power processor. volume 13, pages 949–954, 2015.

[17] C. Shu, X. Ding, and C. Fang. Histogram of the oriented gradient for face recognition. Tsinghua Science and Technology, 16(2):216–224, April 2011.

[18] Paul Viola and Michael J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, May 2004.

[19] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV 2016. arXiv:1512.02325.

[20] Stefanos Zafeiriou, Cha Zhang, and Zhengyou Zhang. A survey on face detection in the wild: past, present and future. Computer Vision and Image Understanding, 138, 2015.

[21] Guangcheng Zhang, Xiangsheng Huang, Stan Z. Li, Yangsheng Wang, and Xihong Wu. Boosting local binary pattern (lbp)-based face recognition. In Advances in Biometric Person Authentication, pages 179–186, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.
