Fraud detection in video recordings of exams using Convolutional Neural Networks



Bachelor Informatica


Aiman Kuin

June 20, 2018












With the digitalisation in education and the use of digital platforms like BlackBoard and Canvas there exists a way of taking exams in these digital environments. Instead of answering a set of questions with pen and paper, these digital exams allow students to type their answers in the digital environment used by their university. To keep the integrity of the exam in place during these exams we have to consider the options of students trying to cheat on their exams. During these exams it is difficult to keep track of every student’s screen at the same time to check if anyone is showing fraudulent behaviour. Even when recording all students’ screens it is very labour intensive to inspect every student one by one. This thesis uses three different convolutional neural networks on students’ screen recordings to determine if they show fraudulent behaviour or not. Convolutional neural networks are widely known for their effectiveness in image classification and showcase this in this paper, allowing us to build a framework to automate this labour intensive process. The results look promising, achieving an accuracy of 96.8% when classifying frames into fraud and not fraud.



1 Introduction
1.1 Research question

2 Theoretical background
2.1 Proctoring digital exams
2.2 Image classification

3 Neural Networks
3.1 The workings of an artificial neural network
3.1.1 Function and structure
3.1.2 Training a neural network
3.2 Convolutional neural networks
3.3 Convolutional layer
3.4 Pooling layer
3.5 ReLU layer
3.6 Fully connected layer
3.7 Training a convolutional neural network
3.7.1 Backpropagation
3.8 Evaluation

4 Applying convolutional neural networks to screen recordings of digital exams
4.1 Our dataset
4.2 Annotating and preprocessing for training
4.3 Methodology
4.3.1 VGG16 model
4.3.2 Inception-v4
4.3.3 MobileNets
4.4 Fine tuning three models
4.5 Experimental setup
4.6 Episode-constrained cross-validation
4.7 Training our models

5 Results and validation
5.1 VGG16 model
5.2 Inception-v4
5.3 MobileNets
5.4 Overview

6 Stitching the framework together
6.1 Evaluating the framework

7 Conclusions and future work
7.1 Conclusions




The University of Amsterdam allows students to take their digital exams on their own laptop. Additionally, the university allows students to take an exam from anywhere in the world and not just in a closed-off lecture hall at the university [1]. This gives students a lot of freedom to choose in which environment and with which tools they want to take their exams. Furthermore, it allows students who do not live near the university, and who may have followed a class from halfway across the globe, to take their exams off campus. One must think about integrity when exams are taken off campus. When students are not taking an exam in a closed-off lecture hall with a proctor sitting behind them, how does the university, more specifically the proctor and professor, preserve integrity?

There are multiple ways to cheat with a laptop both on and off campus as described in [2]. One example of this is a USB keyboard hack. This allows a student to save all their notes into a USB key injector which is attached to a personal external keyboard. Once the exam starts, the student opens any variant of a text box and the keyboard will type out all their notes for them. There are many ways to attempt to monitor and detect all these different ways of cheating such as not allowing these external tools. Another approach by Sgaard proposed the use of software called Safe Exam Browser, monitoring everything students do in their browser [3]. However, using this becomes difficult when allowing search engines such as Google during an exam which can also be seen in [4]. Some of the most notable issues described in this study were that whitelisting every website or resource was impossible. As a result of having such wide Internet access comes the danger of students communicating with others during their exam.

An attempt to go even further in detecting fraud in digital exams has been made by a tool known as ProctorExam. This tool records the entire monitor of the student during the exam. In addition to this, two cameras are set up. One is placed such that the face of the student is clearly visible. The other is placed behind the student to view their surroundings. This should allow the proctor to check whether the student’s screen recording matches the screen seen from behind the student with the second camera. These three video streams are then examined by someone to determine whether the student has committed fraud [5].

However, the task of watching all these videos is very labour intensive, as there has to be a trusted proctor watching all the video streams. This task can be automated by building a framework where teachers only receive a list of students that have shown suspicious behaviour at certain time stamps. This framework could use different classification methods to detect a student cheating.

We propose to create a framework that allows students to take an exam using any resources they want, including search engines such as Google, similar to one of the scenarios mentioned in the case study done by Brouwer et al. in [4]. The framework will consist of three parts: an interface, video processing and frame classification. The interface would send videos of all the students’ screen recordings to a pipeline that consists of a series of methods. This pipeline would be separated into video processing and frame classification. The first part, video processing, would shorten the video from a two to three hour long video down to a few thousand frames, leaving as few duplicates or similar looking frames as possible. The second part of the pipeline, which this thesis is focused on, will be to build a series of methods that are able to take in the processed videos from the first half of the pipeline and classify the frames with a trained convolutional neural network (CNN) into normal or suspicious behaviour. Finally, the results would be visible in the interface for teachers to see all the instances of fraudulent behaviour and give a final judgement of whether or not the students cheated on their exam.


Research question

The goal of my project is to create a series of methods that take in processed videos and examine the frames in order to detect and identify potentially fraudulent frames. The result should show which frames contain suspicious activity so that the corresponding time stamp can be found. I attempt to do this by fine tuning existing convolutional neural networks (CNNs) with images of screen recordings. The goal of the entire project would be to create a framework that can be used in practice and would reduce the workload of the professors and proctors such that they can manually operate and oversee the entire process themselves instead of using external services. As such, I attempt to answer the following research question:

“Can an existing convolutional neural network be fine tuned with a sparse dataset to detect potential fraud of online exams?”

To answer this research question we also answer the following subquestions:

“Can an additional pipeline be created to annotate frames of a video with ease in order to train new convolutional neural networks?”

“How well does the fine tuned convolutional neural network perform on the dataset of screen recordings?”



Theoretical background


Proctoring digital exams

Fraud detection in digital exams has been studied several times. Cluskey Jr, Ehlen, and Raiborn mention that students who cheat consider the cost of getting caught and punished to be lower than the benefit of a higher grade [6]. They propose a solution called Online Exam Control Procedures (OECPs) that would severely reduce the students’ ability to cheat. However, making use of all these procedures seems unrealistic. One of the procedures they mention is to let every student take their exam at the same time. This would prevent the students from sharing questions and answers between groups set to take the tests at different times. This is not always feasible with tests for online courses where students are dispersed around different parts of the world.

Another paper describes a framework with a similar setting to ProctorExam, having the students set up two webcams [7]. However, instead of placing the second webcam behind the student, they attach it to a pair of glasses that the student wears during the exam. This camera then captures the view of the student. The webcam also has a built-in microphone to capture audio along with the visual cues of the student. This data is then processed through a series of methods to generate high-level features to train and test their classifier, a Support Vector Machine (SVM). This paper also mentions the option where the student only has access to Blackboard’s Respondus LockDown Browser (RLB) for their exam. This is not an optimal solution when exams require Internet access to certain sites, as is the case with open book exams like ours. The solution mentioned in this paper is to check how many windows a student has active; when this exceeds one, it assumes that the student is cheating. This seems impractical when students have an online source open alongside their test, totalling at least two windows. Thus, another solution has to be found. Other methods involving multiple cameras from multiple angles, such as described earlier, have also been used [8].

From the paper of Atoum et al. we could gather that despite some of their decisions for active window detection, the use of a support vector machine seemed promising [7]. SVMs are widely used for machine learning purposes, such as image recognition and classification. We could use one of these machine learning algorithms to classify frames of the screen recordings of the students in our project.


Image classification

Many algorithms exist to classify images. Lu and Weng give an overview of different methods that are used for image classification [9]. They define different categories based on criteria and characteristics of the different methods. These categories include supervised and unsupervised learning, which are later discussed in section 3.1.2.


Databases such as ImageNet, CIFAR-10 and CIFAR-100 contain thousands of images of multiple objects and animals that can be used by researchers to train and test their machine learning algorithms. These databases differ, however. CIFAR-10 and CIFAR-100, as the names suggest, classify their images into ten and one hundred types of objects respectively, referred to as classes. ImageNet goes more in depth than this. Where CIFAR classifies a dog and stops, ImageNet takes it further and classifies a multitude of breeds of dogs. This results in thousands of classes, or synsets as they are called.

Challenges built around these big datasets started to surface to see whose model performs best. Because all models are built and trained on the same dataset, they can be compared directly. These challenges have pushed the state-of-the-art networks to evolve further and further into what they are today. Models that make use of rectified activation units showcase amazing results, often being able to classify these images better than humans [10].

Many models are available and offered for use by deep learning frameworks such as TensorFlow and Keras. They allow for easy and quick employment of deep learning applications for programmers. These models often come pre-trained on one of the datasets mentioned above.




Neural Networks


The workings of an artificial neural network


Function and structure

Before we can discuss convolutional neural networks we must first describe the Artificial Neural Network (ANN), usually referred to as a ‘regular’ neural network, which owes its name to its similarity to a biological neural network. In a biological neural network (as seen in figure 3.1), signals flow from the dendrites through the cell body, activating it, which creates another signal that passes through the axon into other dendrites.

Figure 3.1: Schematic image of a biological neuron.

ANNs consist of three parts: an input layer, a hidden layer, and finally an output layer, where each layer consists of a certain number of nodes, as can be seen in figure 3.2. The information flow is much like that of a biological neural network, where a signal can activate a node (similar to a cell) and set off the node’s activation function, which creates a new signal that is passed on to the next layer [11].









Figure 3.2: Example of an artificial neural network with 1 hidden layer (input layer, hidden layer, output layer). The weighted sum of node $X^1_0$ is calculated using nodes $X^0_0$, $X^0_1$, $X^0_2$ and their connections (outlined in red) $\sigma_0$, $\sigma_1$, $\sigma_2$.

A new signal is created by taking the sum of all the input signals and passing it forward to the next layer of nodes, where this process repeats until the final layer is reached and the final output is given. However, the values received from the prior layer are not all equally important. The combination of certain nodes may mean more than others. Because of this, we have to keep track of the importance of all incoming connections of a node, which are sourced from the nodes of the previous layer. This is done by assigning a weight to every connection, which turns the sum of inputs into a weighted sum. The weighted sum of node $X^1_0$ in figure 3.2 would be defined as $\sigma_0 \cdot X^0_0 + \sigma_1 \cdot X^0_1 + \sigma_2 \cdot X^0_2$.
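As a minimal sketch of the weighted sum for one node with three incoming connections, the following uses made-up input values and weights; the function name `weighted_sum` is chosen here for illustration only.

```python
# Illustrative values only; none of these numbers come from the thesis.
inputs = [0.5, -1.0, 2.0]    # values of the three nodes in the input layer
weights = [0.8, 0.2, -0.5]   # weights of the three incoming connections


def weighted_sum(values, connection_weights):
    """Sum of each input multiplied by the weight of its connection."""
    return sum(w * x for w, x in zip(connection_weights, values))


x_1_0 = weighted_sum(inputs, weights)
print(x_1_0)
```

The result is the raw signal of the node before any activation function is applied.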

The newly created signal’s value can vary greatly in range due to the set of weights and input values. This can result in a great range of numbers as the final result in the output layer. Interpreting each of these values would be cumbersome. Furthermore, when working with probabilities it is common to work with numbers that cover the range of 0-100%, often expressed as numbers between zero and one. This is where the activation function comes in. A simple example of an activation function is the sigmoid function [12]. This function takes any number as input and returns a value between zero and one. A more commonly used activation function is the softmax function, which reduces all numbers to values between zero and one in such a way that their sum equals one.
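The two activation functions mentioned above can be sketched in a few lines. The input scores are made up, and subtracting the maximum inside `softmax` is a common numerical precaution rather than something the text prescribes.

```python
import math


def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Map a list of numbers to positive values that sum to one."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, -3.0]  # made-up output-layer values
probabilities = softmax(scores)
print(sigmoid(0.0), probabilities)
```

Softmax preserves the ordering of the scores, so the largest raw output still receives the largest probability.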


Training a neural network

After creating a layout of an artificial neural network we need to initialise the values of the weights for all connections. This can be done by either setting the weights explicitly or by training the network starting from a random set of weights [11]. Most often there is no prior knowledge of the network, so it has to be trained from scratch, meaning we have to try different values and see what works best. Three major learning paradigms can be used to do this: supervised learning, unsupervised learning and reinforcement learning.

Supervised learning takes in examples of input that have to be classified, such as images of objects, together with their desired output, such as the class they belong to. This is done such that all classes have their own output node. These examples are then inserted into the network to generate an output, which will most likely be nonsense at first with random weights. This output is then compared to the expected output and an error, often the mean squared error, is calculated for every class. Through algorithms such as backpropagation, later explained in section 3.7.1, the weights are changed to result in a smaller error and the process is repeated. This continuous process is known as training the network [13].

Unsupervised learning takes in examples of input without the desired output. There is no predetermined set of categories for the network to classify. Multiple algorithms are able to find small patterns and blobs which are then used to create clusters. Unsupervised learning is used to develop features from unlabelled data [14].

Reinforcement learning is different in a way that it does not take in any examples of input or output, but instead it focuses on what to do with any data. There are events with results and the aim is to find the actions that offer the greatest reward [15].


Convolutional neural networks

Convolutional neural networks are a class of regular neural networks that are made to minimise processing [16]. They require fewer parameters and nodes, making them easier to train with only a small loss in performance. CNNs have an input and output layer with one or several hidden layers, similar to the ANNs discussed in section 3.1. The difference lies in the hidden layers as seen in the CNN in figure 3.3. In the initial hidden layers, features are defined that consist of edges and corners. These features gradually become bigger and start to have the ability to recognize shapes. Following the example in figure 3.3, these shapes that make up parts of the input image can consist of visual objects like square windows and pointy leaves. At the end of the hidden layers these features are used to classify the input image into various classes.

Figure 3.3: Illustration of a convolutional neural network [17].


Convolutional layer

Like the name suggests, one of the features of a convolutional neural network is convolving the values received from the nodes of the previous layer. Convolution allows us to detect shapes and edges with the use of certain filters that are able to locate big changes in values at adjacent indices. An example is the Sobel operator, which consists of two 3x3 filters as seen in figure 3.4. One filter is used to detect horizontal edges and the other is used to detect vertical edges. Convolving images with these filters gives an approximation of their derivative, resulting in an image where the edges are highlighted, as seen in figure 3.5.


Figure 3.5: Example of applying both Sobel filters to an image: (a) input image, (b) result of the x filter, (c) result of the y filter.
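To make the filtering step concrete, here is a small sketch that convolves a toy image with the two standard Sobel kernels. It assumes NumPy is available; the 5x5 image and the helper `convolve2d` are illustrative and not part of the thesis code.

```python
import numpy as np

# Standard Sobel kernels: SOBEL_X responds to intensity changes along the
# x axis (vertical edges), SOBEL_Y to changes along the y axis.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def convolve2d(image, kernel):
    """'Valid' 2D convolution: the kernel is flipped and slid over the image."""
    flipped = kernel[::-1, ::-1]
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out


# Toy 5x5 image: dark on the left, bright on the right (one vertical edge).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

gx = convolve2d(image, SOBEL_X)  # non-zero where the vertical edge lies
gy = convolve2d(image, SOBEL_Y)  # zero everywhere: no change along y
print(gx)
print(gy)
```

In a convolutional layer the kernel values are not fixed like the Sobel kernels but are learned during training.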

This type of layer with a lot of convolution operations is called a convolutional layer. This layer convolves the input with a filter that is defined for every node, much like a weight is defined for every connection in an ANN. On top of this, CNNs make use of parameter sharing schemes to share these filters between different nodes and layers, resulting in a feature map. Sharing certain filters decreases the number of free parameters and reduces the amount of memory needed for the network.


Pooling layer

The next type of layer is called the pooling layer. There are two main types of pooling: max-pooling and average-pooling. As the names suggest, max-pooling picks the highest number in the window and average-pooling averages all the numbers in the window. The goal of pooling is to reduce the data between two layers and to generalise. It does this by sliding a window of a set size over the data and reducing the numbers inside this window to one number; the stride determines how far the window slides before this process is repeated. An example of this is shown in figure 3.6. Sometimes when the window reaches the end of a row or column, the window is too big and won’t fit on the remaining row or column. There are several ways to solve this issue, which are known as padding strategies. One way is to not apply pooling to these numbers and pretend they never existed in the first place. Another way is to pad the input with additional rows or columns filled with zeros, so that the last row or column is always taken into account.

Figure 3.6: Example of max-pooling with a 2x2 window and stride of 2.
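A minimal sketch of max-pooling with a 2x2 window and a stride of 2 follows, assuming NumPy. The input values are made up and are not those of figure 3.6; the function uses the first padding strategy described above, dropping windows that do not fit.

```python
import numpy as np


def max_pool2d(data, window=2, stride=2):
    """Max-pooling without padding; windows that do not fit are dropped."""
    out_h = (data.shape[0] - window) // stride + 1
    out_w = (data.shape[1] - window) // stride + 1
    out = np.empty((out_h, out_w), dtype=data.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = data[r:r + window, c:c + window].max()
    return out


# Made-up 4x4 input; the values are not taken from figure 3.6.
data = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]])
pooled = max_pool2d(data)
print(pooled)
```

Replacing `.max()` with `.mean()` (and a float output array) would give average-pooling instead.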

These two types of layers leave us with a few parameters such as window and filter size that are constant throughout the network and must be specified upfront. These parameters are known as the hyperparameters of the network.


ReLU layer

The next type of layer is known as the ReLU layer (Rectified Linear Units). This layer processes every value x of an image and applies a simple non-saturating, nonlinear function f(x) = max(0, x) [16]. This can be compared to the sigmoid functions of an ANN, but without many of its downsides as described in [18].



Fully connected layer

Finally we have the fully connected layers. All neurons in this layer are connected to all neurons in the previous layer, just like in ANNs. This layer actually classifies our input image based on the results from the previous layers, in other words, based on which combination of features has been found. Just like with ANNs, certain nodes are expected to have a high value for certain classes, so these layers also make use of weights for the connections between their nodes.


Training a convolutional neural network

Training a CNN is similar to training an ANN. Both make use of backpropagation algorithms to train their weights. The difference between the two is that a CNN can train not just the weights in the fully connected layer, but also the values of the filters found in the convolutional layers.

Most of the trainable parameters of a CNN can be found in the fully connected layers. Because of this, these layers are the most prone to overfitting. To reduce the chance of overfitting a method called dropout is used [19]. During each training iteration, this method removes every node, together with its connections, from the network with probability 0.5. The model is then trained on roughly half of its nodes for that iteration. During the next iteration all the nodes and connections are inserted back into the network and this process is repeated. This forces the network to not rely on the presence of one node or feature, but to spread out its weight across several connections.
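The effect of dropout on one layer can be sketched as a random mask, assuming NumPy. The activations are made up, and the sketch omits the rescaling that practical implementations usually apply so that the expected activation stays the same between training and testing.

```python
import numpy as np


def dropout(activations, drop_probability=0.5, rng=None):
    """Zero each activation independently with the given probability."""
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= drop_probability
    return activations * keep_mask, keep_mask


rng = np.random.default_rng(seed=0)   # fixed seed so the sketch is repeatable
layer_output = np.ones(1000)          # made-up activations of one layer
dropped, mask = dropout(layer_output, 0.5, rng)
kept_fraction = mask.mean()           # close to 0.5 for a large layer
print(kept_fraction)
```

A fresh mask is drawn for every training iteration, so a different half of the nodes is silenced each time.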



Backpropagation

When feeding our training set to our network once, we are left with some output that is often not optimal. When we make use of supervised learning, we can compare the received output with the desired output and calculate the cost of our network. We want to minimise this cost in order to receive a more favourable output from our network. Backpropagation is the algorithm that describes this process [20] [21]. Every output node has its own error value, which is then propagated back through the network. Every node in the hidden layers then has an associated error value that it uses to update its weights. The effect this error has on the change of the weight depends on the learning rate of the network. A high learning rate rapidly changes the weights, while a low learning rate makes for slower, but more accurate changes. This is done for all the data in the training set, with each example calling for its own changes to the network. Updating the network for every training example separately takes a long time. Because of this, it is common to process the training set in smaller batches.
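The update loop can be illustrated on the smallest possible network: one sigmoid node trained with mini-batches on the mean squared error. This is a sketch under stated assumptions (NumPy, synthetic data, a hand-picked learning rate and batch size), not the training procedure used for the models in this thesis.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy data (an assumption for illustration): the label is 1 when x0 + x1 > 1.
inputs = rng.random((200, 2))
targets = (inputs.sum(axis=1) > 1.0).astype(float)

weights = rng.normal(size=2)
bias = 0.0
learning_rate = 0.5
batch_size = 20


def forward(x, w, b):
    """Weighted sum followed by a sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def mean_squared_error(pred, target):
    return float(np.mean((pred - target) ** 2))


initial_cost = mean_squared_error(forward(inputs, weights, bias), targets)

for epoch in range(200):
    for start in range(0, len(inputs), batch_size):
        x = inputs[start:start + batch_size]
        t = targets[start:start + batch_size]
        y = forward(x, weights, bias)
        # Error at the output node, propagated back through the sigmoid.
        delta = 2.0 * (y - t) * y * (1.0 - y) / len(x)
        weights -= learning_rate * (x.T @ delta)
        bias -= learning_rate * delta.sum()

final_cost = mean_squared_error(forward(inputs, weights, bias), targets)
print(initial_cost, final_cost)
```

In a network with hidden layers the same error term is passed further back with the chain rule, one layer at a time.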



Evaluation

There are several metrics to measure the performance of our network. We will do this by use of a confusion matrix as shown in figure 3.7. There are three metrics which will be used when evaluating these models. The first, and most important, is accuracy. This tells us how accurate our network is in classifying images to their right class and is calculated as follows:

$$\text{accuracy} = \frac{TP + TN}{\text{all observations}}$$

The second metric is recall. This shows how many cases of fraudulent behaviour were found by our network. The higher the value, the better our network is able to recognize fraud. Recall is calculated as follows:

$$\text{recall} = \frac{TP}{TP + FN}$$


The third metric is precision. This shows what portion of fraudulent instances found by our network should actually be classified as fraud. A lower precision would increase the workload of teachers that have to make the final assessment at the end of our pipeline, as mentioned in section 1. We calculate precision as follows:

$$\text{precision} = \frac{TP}{TP + FP}$$

                      Predicted class
                      True                  False
Actual class   True   True positive (TP)    False negative (FN)
               False  False positive (FP)   True negative (TN)

Figure 3.7: Confusion matrix.
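The three metrics follow directly from the four cells of the confusion matrix. The sketch below uses made-up labels for eight frames, with True standing for fraud; these are not results from the thesis.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN, treating True as the fraud class."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    fp = sum(p and not a for a, p in zip(actual, predicted))
    tn = sum(not a and not p for a, p in zip(actual, predicted))
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + fn + fp + tn)


def recall(tp, fn):
    return tp / (tp + fn)


def precision(tp, fp):
    return tp / (tp + fp)


# Made-up labels for eight frames (True = fraud); not results from the thesis.
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]
tp, fn, fp, tn = confusion_counts(actual, predicted)
print(accuracy(tp, fn, fp, tn), recall(tp, fn), precision(tp, fp))
```

Recall and precision are undefined when their denominators are zero, for example when no frame is predicted as fraud; a production version would need to handle that case.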



Applying convolutional neural networks to screen recordings of digital exams


Our dataset

The dataset used in this paper consists of three screen recordings of students taking a digital exam. Every video is two hours long and shows a student following a protocol in which they pretend to be taking a test and have to commit fraud by communicating with others. Examples of this are going to websites such as Facebook or WhatsApp to chat with friends, or to Gmail to check their email. These activities would be classified as fraud during an exam. Besides this, the students also work in a specific Blackboard exam environment where they read questions and fill in their answers. In addition to this, there are multiple examples of non-fraudulent activities, such as using Google to find the answer to a question and writing down notes in Microsoft Word.


Annotating and preprocessing for training

To detect fraud we first need images of examples of both fraud and not fraud so that we can use supervised learning. To do this we need to gather many training images on which to train our network. We can obtain these images by saving every frame of the videos of students’ screen recordings as an image and label these with either fraud or not fraud. This is very labour intensive and takes too long to do for every student.

Tools exist to label parts of videos, but no application offers everything needed for this process. One of these tools is ANVIL, which we use because of its simple user interface and its ability to place labels between two points in a video, as seen in figure 4.1 [22]. After all labels have been placed, the user can save these with ANVIL and create a .anvil file. Despite its extension, this file is nothing more than regular XML. Because no program exists that fits into this process and gives us our desired output, we have made a script to do this. However, to do this effectively the user has to specify the start and end frames alongside the label for every ‘clip’, as ANVIL only provides a start and end time, which can lead to inaccurate calculations of the frame numbers. Our script then parses the label, start frame and end frame along with the video that was labelled and creates a directory with a subdirectory for every class. Every subdirectory is filled with the images of all frames between the two specified frame numbers.
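The directory layout produced by such a script can be sketched as follows. The XML element and attribute names below are hypothetical and do not reproduce the real .anvil schema, and decoding the actual video frames is left out; the sketch only shows how labelled frame ranges map to one subdirectory per class.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical annotation layout; the real .anvil schema is NOT reproduced here.
EXAMPLE_XML = """
<annotation video="student1.mp4">
  <clip label="fraud" start_frame="120" end_frame="122"/>
  <clip label="not_fraud" start_frame="300" end_frame="301"/>
</annotation>
"""


def plan_frame_files(xml_text, output_root):
    """Return (frame number, target path) pairs, one class subdirectory per label."""
    root = ET.fromstring(xml_text)
    planned = []
    for clip in root.iter("clip"):
        label = clip.get("label")
        first = int(clip.get("start_frame"))
        last = int(clip.get("end_frame"))
        for frame in range(first, last + 1):   # the end frame is inclusive
            target = Path(output_root) / label / f"frame_{frame:06d}.png"
            planned.append((frame, target))
    return planned


plan = plan_frame_files(EXAMPLE_XML, "dataset")
for frame, target in plan:
    print(frame, target)
```

A full script would additionally read each listed frame from the video and write it to the planned path.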


Figure 4.1: Labelling videos in ANVIL.




VGG16 model

The first model of the three is the VGG16 [23]. Its architecture can be divided into a few components as seen in figure 4.2. It consists of 16 layers with ReLU-activations between every layer, after which the softmax function is applied and the final predictions are shown.

Figure 4.2: The VGG16 architecture.

The first two layers are convolutional layers with a width of 64 that apply a convolution with a 3x3 filter. A 3x3 filter is chosen because it is the smallest size that is able to capture the notion of left and right [23]. Following these two layers is a max-pooling layer with a window size of 2x2 and a stride of 2. This pattern of convolutional layers followed by a max-pooling layer occurs five times in total. After these five sets there are three fully connected layers that are regularized with dropout. Two of these layers have 4096 channels (nodes) and the final layer has 1000 channels for the 1000 classes to identify within the dataset. As previously stated, the softmax function is then applied to normalize the output and show the final predictions.



Inception-v4

The second model is Inception-v4, also known as InceptionResNetV2 [24]. This model has significantly more layers than VGG16, adding up to 467 layers. Despite this, Inception-v4 has fewer than half the number of parameters of the VGG16 network. This model makes use of inception blocks as seen in figure 4.3. These blocks allow the use of different filter sizes rather than having to commit to one. The network can train on multiple hyperparameters at once.

Figure 4.3: Inception block A of the Inception-v4 [24].



MobileNets

The final model is MobileNets, a small and efficient model made for mobile and embedded vision applications [25]. This model stands out next to the two other models as it has by far the fewest trainable parameters. It also makes use of depthwise convolution, which is much more efficient than regular convolution [26] [25]. These two features combined make for a small and efficient network that can be used in small systems. This model could increase performance when expanding our framework to run in real time and scaling it to analyse 100 students at the same time instead of one by one.
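The saving in parameters can be made concrete with a short calculation. The sketch assumes the depthwise separable form, in which a depthwise convolution is followed by a 1x1 pointwise convolution, and uses an illustrative layer size rather than a specific layer of MobileNets.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights of a regular convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel, in_channels, out_channels):
    """One kernel per input channel, then a 1x1 convolution to mix channels."""
    depthwise = kernel * kernel * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Illustrative layer size, not a specific layer of MobileNets.
regular = standard_conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
ratio = separable / regular
print(regular, separable, ratio)
```

For a 3x3 kernel the separable form needs roughly one eighth to one ninth of the weights of the regular convolution when the number of output channels is large.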

MobileNets is made out of many convolution layers and fully connected layers that make up most of the network as seen in figure 4.4. There is only one occurrence of average-pooling at the end of the convolutions before forwarding the data to the fully connected layer.



Fine tuning three models

There is one important difference between what these networks output and what we want for our framework. We have only two classes, while all these networks are trained to classify the 1000 classes of the ImageNet dataset mentioned in section 2.2. We fine tune these three models by removing the last fully connected layer and replacing it with a new layer with only two nodes for our two classes, fraud and not fraud. The weights in this new last layer are then trained on our dataset of 25000 images, of which 12000 are labelled as fraud and 13000 as not fraud. The frames of the videos are resized to 224x224 as these models were all trained on images of this size. The other layers are not retrained on our data and keep their respective weights as trained in [23] [24] [25].
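The idea of keeping the earlier layers fixed while training a new two-node output layer can be sketched with NumPy alone. Everything here is a stand-in: the frozen layers are replaced by a fixed random projection, the data is synthetic, and the learning rate and step count are arbitrary. It is not the Keras setup used for the three models.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Stand-in for the pretrained layers: a fixed random projection with ReLU.
# In the thesis this role is played by VGG16, Inception-v4 or MobileNets.
frozen_weights = rng.normal(size=(64, 16)) / 8.0


def frozen_features(images):
    return np.maximum(0.0, images @ frozen_weights)


def softmax(logits):
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)


# Synthetic two-class data (assumption): class 1 has a shifted mean.
labels = rng.integers(0, 2, size=400)
images = rng.normal(size=(400, 64)) + labels[:, None] * 0.8
one_hot = np.eye(2)[labels]

# New final layer with two nodes; only these parameters are updated.
head_weights = np.zeros((16, 2))
head_bias = np.zeros(2)
features = frozen_features(images)
frozen_before = frozen_weights.copy()

for step in range(300):
    probs = softmax(features @ head_weights + head_bias)
    grad_logits = (probs - one_hot) / len(labels)   # cross-entropy gradient
    head_weights -= 0.1 * (features.T @ grad_logits)
    head_bias -= 0.1 * grad_logits.sum(axis=0)

probs = softmax(features @ head_weights + head_bias)
train_accuracy = float((probs.argmax(axis=1) == labels).mean())
print(train_accuracy)
```

Because the frozen part never changes, its output can be computed once per image and reused in every training step, which is part of what makes this form of fine tuning cheap.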


Experimental setup

As explained in sections 4.1 and 4.4, the dataset consists of three videos of two hours each. However, the first 15-30 minutes are usually filled with the participant setting up their screen recording and waiting for the exam to start. These parts have been cut from the videos as they are not part of the digital exam itself. The participants that made these videos were instructed to follow a protocol with a list of actions that they had to perform while taking their test. Because of this, all three videos showcase the same kind of behaviour and actions. In our case, this results in most cases of fraud arising from communication with someone through the WhatsApp web application. A small number of other instances of fraud are shown, such as communicating with someone through Facebook, Skype or email applications such as Gmail. This results in an uneven dataset, which is a big problem in the field of machine learning as such datasets create unfavourable results [27]. In addition to this, the videos contain more instances of not fraud than fraud, creating an even more unbalanced dataset. To counteract this problem and keep the dataset more balanced, we did not fully label these three videos of two hours, but only parts of each video. Many instances of not fraud and of WhatsApp fraud have been left unlabelled and were therefore not parsed and added to our dataset. This resulted in a relatively small dataset of 25000 images from six hours of videos. The dataset was divided into a train, validate and test set with 50%, 25% and 25% of the images respectively.

Of these 25000 images, 12000 are labelled as fraud and 13000 as not fraud. The fraud images mostly consist of the participant accessing sites that enable communication, like WhatsApp, Facebook, Skype and Gmail. A small part of the dataset consists of small popups of these applications when looking at another application that would be classified as not fraud. This small popup shows that there is an application running that could be classified as fraudulent and has the potential to communicate information through these small windows. We want to see if we can make our network robust enough to even detect these small differences. The not fraud class mostly consists of behaviour like looking at the exam itself or looking up answers through Google. The remainder of this class consists of instances where the participants open applications like Microsoft Word, Excel and Paint. These applications do not allow communication with other people and are therefore allowed to be used by the student.

The traditional way to estimate the performance of a classifier is to train the classifier on one set and test it on another, independent set [28]. Another approach is k-fold cross-validation, which rotates the available data over the different sets k times and averages the performance at the end. This paper applies both methods, using 2-fold cross-validation, to the two different datasets discussed in sections 4.5 and 4.6. We chose 2-fold cross-validation because of the time it takes to train these models. Testing different values of k would require several more days of training, which was not possible within our time constraints.
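As an illustration of how k-fold cross-validation rotates the data, the sketch below generates the index sets for each fold and averages a score over them. The helper names and the `train_and_score` callback are hypothetical; in practice the callback would train a network on the training indices and return its accuracy on the held-out fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every fold is the test set once."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train_idx, test_idx)
              for train_idx, test_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# With k=2 the data is halved and each half is used for testing once.
folds = list(k_fold_indices(10, 2))
```

With k=2 every image is used for testing exactly once and for training exactly once, so only two training runs are needed.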


Episode-constrained cross-validation

When working with videos, one has to keep the narrative of all the shots in mind. A shot is a sequence of frames that runs for an uninterrupted period of time. When shots that share a narrative are separated over a train and a test set, our models could over-perform and show higher results than they should [28]. Van Gemert, Veenman, and Geusebroek proposed partitioning the data into episodes instead of shots. Every episode is a collection of shots with the same narrative.

Because of this we test our models with two different datasets: one where all these shots are spread across our train, validation and test sets, and one where all shots of the same narrative are kept together. The latter prevents near-identical information from leaking into another set, which could make our model show better results than it should. However, since we are dealing with a small dataset we do not have a great variety of episodes. We divided our dataset into a few episodes, one for each type of application used in the videos.
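The idea of keeping whole episodes together can be sketched as follows. This is a simplified illustration and not the exact procedure of [28]: it assumes every frame carries an episode label (here the application name) and assigns complete episodes to folds with a greedy size-balancing rule. The frame identifiers and episode names are hypothetical.

```python
from collections import defaultdict


def episode_constrained_folds(episode_of_frame, k):
    """Assign whole episodes to k folds so no episode is split across folds."""
    frames_by_episode = defaultdict(list)
    for frame, episode in episode_of_frame.items():
        frames_by_episode[episode].append(frame)

    folds = [[] for _ in range(k)]
    # Greedy balancing: place the largest episodes first, each into the
    # fold that currently holds the fewest frames.
    ordered = sorted(frames_by_episode,
                     key=lambda e: (-len(frames_by_episode[e]), e))
    for episode in ordered:
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(frames_by_episode[episode])
    return folds


# Hypothetical frames, labelled with the application shown on screen.
episode_of_frame = {
    "f0": "whatsapp", "f1": "whatsapp", "f2": "whatsapp",
    "f3": "facebook", "f4": "exam", "f5": "exam", "f6": "word",
}
fold_a, fold_b = episode_constrained_folds(episode_of_frame, 2)
```

Because an application never appears on both sides of the split, the network is tested on applications it has not seen during training, which is a much harder task than the traditional split.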


Training our models

When training our model we have to choose an optimizer and a cost function that the network will use. We use the Adam optimizer, which performs as well as typical stochastic gradient descent, but is more efficient and has lower memory requirements [29]. We mostly use the default parameters as stated in [29], with the exception of a smaller learning rate of 0.001, as there is little variation in our images, as discussed previously. The cost function used in our models is binary cross entropy, as we only have two classes.
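To make the two choices concrete, the following sketch implements binary cross entropy and a single Adam update for one scalar parameter in plain Python. It is an illustration only: the networks in this thesis were trained with a deep learning library, not with this code. The learning rate is the value stated above, and the remaining constants are the ones suggested in [29].

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy; y_pred holds probabilities of the fraud class."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def adam_step(theta, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t counts from 1)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# A classifier that outputs 0.5 for every frame scores ln(2) on this loss.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected moments cancel the magnitude of the gradient, so the parameter moves by roughly the learning rate regardless of how large the gradient is.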

Finally, we train our model for three epochs, which means that all the data is fed to the network three times. Additionally, we use a small batch size of ten. This is done because of the limited computational power and time at hand; when initially testing the network with smaller datasets, this seemed sufficient to produce good results.
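The number of weight updates implied by these settings follows from simple arithmetic. The sketch below assumes a training set of 12500 images, which is 50% of the 25000 images in the traditional split; the function name is hypothetical.

```python
import math


def updates_per_run(n_train, batch_size, epochs):
    """Return (steps per epoch, total weight updates) for a training run."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps, total = updates_per_run(n_train=12500, batch_size=10, epochs=3)
print(steps, total)  # 1250 3750
```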



Results and validation

An overview of the results can be found at the end of this chapter in section 5.4.


VGG16 model

Our first model, the VGG16, performs exceptionally well when choosing the traditional method that uses distinct sets as described before. It yields results of 96.8% accuracy. On the other hand, using the episode-constrained cross-validation technique looks less promising, yielding a result of 67.1% accuracy.

The difference in these results can most likely be explained by the nature of the images in the training set. Applications such as Facebook and Gmail are not customizable. They have a strict predefined layout that is consistent across all users. This makes it easy for the network to recognize features in images of these applications that might not be directly recognizable in the images of other applications. An example of this would be trying to recognize Facebook after only showing the network images of WhatsApp. This can be seen when calculating the precision of the episode-constrained method, which is merely 60.5%.

We can see in figure 5.1a that most of the performance lost when using the traditional method is due to the low recall. This is most likely because of the small number of images with a small popup on top of an application that would otherwise be classified as not fraud, as discussed in section 4.5. Our model is likely unable to recognize these small popups as features and thus classifies these images as not fraud. We could attempt to solve this issue by increasing the size of our input images from 224x224 to something big enough for these popups to be found easily.
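The accuracy, precision and recall discussed here can be derived from the four counts of a confusion matrix. The sketch below uses the counts of the episode-constrained VGG16 run from figure 5.1b. Reading the four numbers as true positives, false negatives, false positives and true negatives (with fraud as the positive class) is an assumption, but it reproduces the reported 67.1% accuracy and 60.5% precision.

```python
def metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall) with fraud as the positive class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Counts of the episode-constrained VGG16 run (figure 5.1b).
acc, prec, rec = metrics(tp=10904, fn=1096, fp=7121, tn=5879)
print(f"{acc:.3f} {prec:.3f} {rec:.3f}")  # 0.671 0.605 0.909
```

Precision answers how many of the frames flagged as fraud really are fraud, while recall answers how many of the fraudulent frames were flagged at all.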

(a) Results of the traditional method

                Predicted True   Predicted False
Actual True              11143               857
Actual False                21             12979

(b) Results of the episode-constrained method

                Predicted True   Predicted False
Actual True              10904              1096
Actual False              7121              5879

Figure 5.1: Results of the fine tuned VGG16 network.




Inception-v4 model

Just like the first model, the Inception-v4 performs exceptionally well, and almost as well as the VGG16 model. It yields results of 96.0% accuracy with the traditional method that uses distinct sets and 46.8% when applying episode-constrained cross-validation. Additionally, the precision of this model when using the episode-constrained method is lower than that of the VGG16: it classifies almost everything as fraud, scoring a precision of 47.3%. The confusion matrices of these results can be seen in figure 5.2. The reasons for these performance differences are likely to be tied to those of the VGG16, with the difference being that it also does not recognize instances of the university’s email application when using the traditional method. This might be because the logo of the University of Amsterdam is displayed in this application, which is mostly found in applications like BlackBoard whose images are labelled as not fraud. As for the episode-constrained method, the model is likely overfitting on the small set of features of an episode which were also present in the test set, resulting in the model classifying almost everything as fraud and giving us a low precision.

(a) Results of the traditional method

                Predicted True   Predicted False
Actual True              11049               951
Actual False                26             12974

(b) Results of the episode-constrained method

                Predicted True   Predicted False
Actual True              11639               361
Actual False             12948                52

Figure 5.2: Results of the fine tuned Inception-v4 network.



MobileNets model

Our final model, MobileNets, produces underwhelming results compared to the two other models, scoring an accuracy of 48.8% with the traditional method and 48.2% with the episode-constrained cross-validation method. The confusion matrices of these results can be seen in figure 5.3. The metric that stands out is the recall of 6.7% with the traditional method. These results were partially expected, as MobileNets does not produce state-of-the-art results, but focuses on creating a small and efficient model that trades a reasonable amount of accuracy for reduced size and latency [25]. However, in this case, the loss of accuracy causes worse-than-random performance. This underperformance, combined with the low precision, makes this model not worthwhile for use in our framework.


(a) Results of the traditional method

                Predicted True   Predicted False
Actual True                804             11796
Actual False              1596             11404

(b) Results of the episode-constrained method

                Predicted True   Predicted False
Actual True              11957                43
Actual False             12903                97

Figure 5.3: Results of the fine tuned MobileNets network.




























Figure 5.5: Overview of recall of every model for both methods.















Stitching the framework together

As mentioned in section 1, classifying frames from students’ screen recordings is only one part of the framework. There is also a video processing part which reduces each shot to a single frame. This reduces the entire video to a much smaller number of frames, reducing the memory and computational power needed during training. Finally we have an interface that calls these two processes and communicates the results directly to the teachers using this framework. Figure 6.1 shows the framework as a whole. The process starts with a video of a student’s screen being fed into our pipeline through the interface. The interface then sends the video to the video processing part, which reduces the video to a single frame per shot, leaving us with a few thousand frames. The frame numbers of these shots are then fed to a script that cuts these frames from the video and saves them as images. These images are then classified as fraud or not fraud with (at least) one of the networks discussed in this paper. An overview of the frames classified as fraudulent is then displayed to the teacher through the interface, where they decide whether or not the student committed fraud.
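The flow through the framework can be summarised in a small sketch. All names here are hypothetical stand-ins: frames are represented by the name of the application on screen, shot detection is reduced to collapsing runs of identical frames, and the classifier is a stub in place of a fine tuned network.

```python
from itertools import groupby


def reduce_to_shots(frames):
    """Return the index of the first frame of every run of identical frames.

    Stand-in for the video processing part; real shot detection would
    compare pixel content rather than test for equality.
    """
    indices, position = [], 0
    for _, run in groupby(frames):
        indices.append(position)
        position += len(list(run))
    return indices


def classify_frame(frame):
    """Hypothetical classifier stub; the framework uses a fine tuned CNN here."""
    return frame in {"whatsapp", "facebook", "skype", "gmail"}


def review_video(frames):
    """Return (frame_index, frame) pairs flagged as fraud for the teacher."""
    flagged = []
    for index in reduce_to_shots(frames):
        if classify_frame(frames[index]):
            flagged.append((index, frames[index]))
    return flagged


video = ["exam"] * 3 + ["whatsapp"] * 2 + ["exam"] * 4 + ["word"] + ["gmail"] * 2
print(review_video(video))  # [(3, 'whatsapp'), (10, 'gmail')]
```

The teacher only sees the flagged frames and their positions in the video, and makes the final decision on whether fraud was committed.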



Evaluating the framework

Due to time constraints and the goals of this project we were unable to evaluate the framework as a whole. We mentioned earlier in section 1.1 that our goal was to reduce the workload enough so that it would be feasible for teachers to do this entire process by themselves. To evaluate this we would need to set up our framework and record another video of a student taking an exam and cheating. We would then ask a teacher who has not seen this video to use our framework to determine whether the student cheated or not.



Conclusions and future work



The University of Amsterdam allows students to take their digital exams on their own laptop in digital exam rooms. We created a framework that allows students to take these exams outside the digital exam rooms. This framework processes a video of a student’s screen and reduces the number of frames of a two-hour video down to a few thousand frames. These frames are then classified into fraud or not fraud by means of fine tuned convolutional neural networks. An overview of all fraudulent behaviour would then be presented to the teacher in an interface that holds it all together. The second part of this process, which this thesis has focused on, makes use of three existing convolutional neural networks to classify these images.

In this paper, we applied and fine tuned three models, VGG16, Inception-v4, and MobileNets, to the video recordings of digital exams. The VGG16 and Inception-v4 lived up to their reputation and produced near state-of-the-art results, classifying images of screen recordings into fraud and not fraud very well. These models achieved an accuracy of up to 96.8% when using the traditional way of training the classifier on one set and testing it on another independent set. However, using episode-constrained cross-validation seemed less promising, achieving results of up to 67.1% accuracy. This is likely due to the nature of these applications. Features of one application might not directly apply to another, so a network trained on one set of applications is unable to classify all images of another correctly. The MobileNets model performed worse than random and is therefore deemed unsatisfactory for our framework. These results allow us to create a framework that can hopefully be used to automate fraud detection for proctoring digital exams within the University of Amsterdam. This would eliminate the existing, very labour intensive process of having to go through all these videos by hand. We have also created a separate process that allows us to easily annotate videos to train our models by using ANVIL with a series of scripts for preprocessing.


Future work

Despite showing promising results, this framework still needs major improvements before it can be used in practice. The dataset consisted of merely three videos, each of which follows the same predefined protocol. This resulted in a very sparse dataset which would need to be extended with a greater variety of actions to retrain our models. Furthermore, there is a lot of room to experiment with the input sizes of images. This could allow the network to find a different set of features in the images, which could lead to better results. When doing this we could also look at the training results of our networks and see if there are any differences in them.

In addition to this we would need to extend our framework to also make use of different cameras to track what the student is doing during the test. If they are taking their exam at home we need processes that can verify that the correct student is taking the test and that the student is in an isolated room without others to assist them during the exam. This would require a mix of video and audio processing that was not within the scope of this paper.


Finally, as discussed in section 6.1, we need to evaluate the entire framework to see how it performs against the current method of using ProctorExam.



[1] Brouwer et al. “Cheat me not: automated proctoring of digital exams on Bring-Your-Own-Device.” In: (2018).

[2] Phillip Dawson. “Five ways to hack and cheat with bring-your-own-device electronic examinations”. In: British Journal of Educational Technology 47.4 (2016), pp. 592–600.

[3] Thea Marie Søgaard. “Mitigation of Cheating Threats in Digital BYOD exams”. MA thesis. NTNU, 2016.

[4] Natasa Brouwer, André Heck, and Guusje Smit. “Proctoring to improve teaching practice.” In: MSOR Connections 15.2 (2016).

[5] ProctorExam. ProctorExam - Live Proctoring. url: https://www.proctorexam.com/products/#proctoring (visited on 04/23/2018).

[6] GR Cluskey Jr, Craig R Ehlen, and Mitchell H Raiborn. “Thwarting online exam cheating without proctor supervision”. In: Journal of Academic and Business Ethics 4 (2011), p. 1.

[7] Yousef Atoum et al. “Automated online exam proctoring”. In: IEEE Transactions on Multimedia 19.7 (2017), pp. 1609–1624.

[8] Kenrie Hylton, Yair Levy, and Laurie P Dringus. “Utilizing webcam-based proctoring to deter misconduct in online exams”. In: Computers & Education 92 (2016), pp. 53–63.

[9] Dengsheng Lu and Qihao Weng. “A survey of image classification methods and techniques for improving classification performance”. In: International journal of Remote sensing 28.5 (2007), pp. 823–870.

[10] Kaiming He et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 1026–1034.

[11] Ajith Abraham. “Artificial neural networks”. In: handbook of measuring system design (2005).

[12] Robert Hecht-Nielsen. “Theory of the backpropagation neural network”. In: Neural networks for perception. Elsevier, 1992, pp. 65–93.

[13] Ian Goodfellow et al. Deep learning. Vol. 1. MIT press Cambridge, 2016.

[14] Quoc V Le. “Building high-level features using large scale unsupervised learning”. In: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE. 2013, pp. 8595–8598.

[15] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Vol. 1. 1. MIT press Cambridge, 1998.

[16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep convolutional neural networks”. In: Advances in neural information processing systems. 2012, pp. 1097–1105.

[17] Yufeng Ma et al. “Effects of user-provided photos on hotel review helpfulness: An analytical approach with deep leaning”. In: International Journal of Hospitality Management 71 (2018), pp. 120–131.


[18] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. “Deep sparse rectifier neural networks”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, pp. 315–323.

[19] Geoffrey E Hinton et al. “Improving neural networks by preventing co-adaptation of feature detectors”. In: arXiv preprint arXiv:1207.0580 (2012).

[20] Jürgen Schmidhuber. “Deep learning in neural networks: An overview”. In: Neural networks 61 (2015), pp. 85–117.

[21] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning representations by back-propagating errors”. In: nature 323.6088 (1986), p. 533.

[22] Michael Kipp. “Multimedia annotation, querying and analysis in ANVIL”. In: Multimedia information extraction 19 (2010).

[23] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).

[24] Christian Szegedy et al. “Inception-v4, inception-resnet and the impact of residual connections on learning.” In: AAAI. Vol. 4. 2017, p. 12.

[25] Andrew G Howard et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861 (2017).

[26] François Chollet. “Xception: Deep learning with depthwise separable convolutions”. In: arXiv preprint (2016).

[27] Eva Alfaro-Cid, Ken Sharman, and Anna I Esparcia-Alcázar. “A genetic programming approach for bankruptcy prediction using a highly unbalanced database”. In: Workshops on Applications of Evolutionary Computation. Springer. 2007, pp. 169–178.

[28] Jan C Van Gemert, Cor J Veenman, and Jan-Mark Geusebroek. “Episode-constrained cross-validation in video concept retrieval”. In: IEEE Transactions on Multimedia 11.4 (2009), pp. 780–785.

[29] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: CoRR abs/1412.6980 (2014). arXiv: 1412.6980. url:



