Determining Efficient Shot Boundary Detection on Screen-Recorded Videos of Digital Exams

Academic year: 2021


Bachelor Informatica

Determining Efficient Shot

Boundary Detection on

Screen-Recorded video’s of

Digital Exams

Kyrian Maat

June 8, 2018

Informatica

Universiteit van Amsterdam


Abstract

The ever-growing use of BYOD in university settings allows for digital exams anywhere on site. Moderating these exams for fraudulent behaviour is a necessity to make this a reality. Currently, human proctoring is one of the most common approaches to evaluating potential fraud, but it is labour-intensive and costly. In this paper a framework is proposed that detects changes in screen recordings, flags fraudulent behaviour automatically and presents these instances to the examiner. This thesis primarily focuses on comparing four approaches for Shot Boundary Detection (SBD) on screen recordings of digital exams to reduce the number of frames needing inspection for fraudulent behaviour. The approaches are tested on their accuracy, frame reduction efficiency, resizing flexibility and speed. To evaluate these attributes, a test set was created from 6 hours of video of a digital exam recording with various changes visible on screen to simulate shot boundaries. After experimentation, the Edge Change Ratio is measured as the fastest but least accurate SBD method, the Mean Square Error as the most efficient in frame reduction, and the Image Hashing method is concluded to be the most practical approach for SBD on screen recordings of digital exams.


Contents

1 Introduction
  1.1 Research question
2 Theoretical background
  2.1 Online Proctoring
    2.1.1 General overview
    2.1.2 Fraud in online proctoring
      2.1.2.1 Fraud prevention level one
      2.1.2.2 Fraud prevention level two
      2.1.2.3 Fraud prevention level three
    2.1.3 Current proctoring tools
  2.2 Shot boundary detection
    2.2.1 Histograms
    2.2.2 Pixel differences
    2.2.3 Edge detection
    2.2.4 Image hashing
    2.2.5 Machine learning
3 Shot Boundary Detection for digital exam recordings
  3.1 Data
    3.1.1 General overview
    3.1.2 Annotation
  3.2 Algorithms
    3.2.1 Pixel differences
    3.2.2 Histogram comparisons
    3.2.3 Feature similarities (ECR)
    3.2.4 Image hashing
  3.3 Experimental Setup
    3.3.1 General method
    3.3.2 Fault tolerance window
    3.3.3 Histogram bin count
    3.3.4 Altering video resolution
4 Results
  4.1 General experiment on unaltered videos
  4.2 Fault tolerance
  4.3 Histogram bin count
  4.4 Effects of re-sized resolution
5 Link to framework
  5.1 Receiving videos
  5.2 Transferring frames to NN
6 Conclusions


CHAPTER 1

Introduction

With the rapid digitization of the world, it would be highly beneficial to create environments anywhere on campus that enable students to take their digital exams and use online resources to aid them [1]. Using their own laptops to participate in the exams mimics their familiar working environment and could help students perform well [4]. The flexibility of examination offered by bring your own device (BYOD) stands in stark contrast to the present situation, in which universities need special examination halls for digital exams. There is therefore an interest in introducing this new approach to digital exams on university grounds via BYOD, allowing exams anywhere on campus, which enhances not only the flexibility but also the scalability of digital examination [4].

However, one cannot expect all students to participate fairly in digital exams if they are allowed access to their own material. There is thus a need to protect the integrity of these exams by preventing students from engaging in fraudulent behaviour. Fraudulent behaviour can include the usage of chat applications to communicate with others or opening emails with pre-written answers [21] [5]. Moderating the behaviour of the student taking the exam is thus a necessary step for allowing digital exams.

Moderation of digital exams currently exists in the form of proctoring, which monitors the student's activity either in real time, as with Kryterion, or post-exam, as with Tegrity [9]. This is done by an examiner or proctor and needs to be done for each individual assessment. Considering that modern digital exams usually have a time limit of 2 hours, one might need to search through tens to hundreds of 2-hour videos to check for fraudulent behaviour. This is time-consuming and requires external resources, i.e. trained proctors, to be used for digital exams.

There is thus a severe need for automation of fraud checking in digital exams, so as to free up the resources currently being spent on it. To fully automate the processing of digital exam videos, a framework consisting of three parts is proposed.

The interface is the core component that the examiner uses to check for fraudulent behaviour. The interface accepts videos and acts as a video player that allows the examiner to skim through the video.

To properly summarize the video and show its points of interest, a technique known as Shot Boundary Detection (SBD) is used. This technique attempts to find transitions between scenes in videos; the framework uses it to summarize the video in key frames and allow the examiner to skim through the video more effectively.

After the video has been reduced to key frames, a Neural Network is employed to check for fraudulent behaviour. A Convolutional Neural Network (CNN), trained on images containing communication software such as WhatsApp or Facebook, is used to analyze the frames that the SBD technique discovered.

The framework as a whole thus attempts to split a video sequence of a screen recording from a digital exam into a smaller subset of images. These images are found using SBD and analyzed via a CNN. The interface shows the detected instances of fraud and allows an examiner to verify the exam. The pipeline approach of the framework can be observed in Figure 1.1.


This framework will allow for a fast and automated approach to fraud checking and verification, compared to the current tools.

Figure 1.1: Overview of the proposed framework

This thesis covers the video processing section of the framework. To detect transitions, SBD techniques such as histogram comparisons and mean square error are utilized and compared to determine which algorithm most accurately finds the shot boundaries. By finding the shot boundaries of the video sequence, the entire video can be "summarized" in a fraction of its total frames. The accuracy of the SBD and the reduction of frames are the key components on which the algorithms are compared.

Because the frames that should be detected need to give an indication of possible fraud, it is also important to define instances of fraud in our framework. We consider fraudulent behaviour to be the usage of any email client such as Outlook, of file-sharing software such as Dropbox, and of any chat application such as Facebook or WhatsApp. The SBD algorithms should therefore be able to recognize the differences between webpages and chat boxes to appropriately summarize the video sequence.

1.1 Research question

To accomplish the goal of minimizing frames by finding shot boundaries, this research aims to answer the following question: "Which Shot Boundary Detection video processing algorithm efficiently reduces the number of frames of the input video such that all shot boundary frames can be displayed on the interface and propagated to the Neural Network?"

This research question will be answered through the sub-questions present in the question itself: how well the SBD algorithm finds shot boundaries, how well it reduces the total number of frames, and how fast the algorithm works.


CHAPTER 2

Theoretical background

2.1 Online Proctoring

2.1.1 General overview

To allow for examination anywhere in the world one has to ensure that the student does not engage in fraudulent behaviour. Normally, an examiner would check for this behaviour, but that is especially cumbersome if students are spread out rather than present in the same exam hall. A flexible solution that offers protection against fraudulent behaviour is online proctoring, which is currently used to protect the integrity of digital exams. This can mean only showing a video feed of the browser that the student uses to complete the exam, but it can also be extended with a front-facing camera stream and a tertiary video stream of the surroundings, which are especially important for students doing their exams at a different location than the exam location [4]. With these set-ups, students are constantly monitored to prevent fraudulent behaviour during the digital exams. In a more relaxed environment with less surveillance, a student might employ certain tricks to circumvent the precautions. For example, if only the screen is recorded, a user could use a Virtual Machine (VM) to circumvent the screen recording of their browser [5]. It is also possible to let someone else take the exam for them if they are not monitored. The services currently available check for fraudulent behaviour and try to prevent extraordinary situations such as VM usage by adding extra monitoring cameras.

2.1.2 Fraud in online proctoring

Proctoring can come in multiple fraud prevention levels; we borrow the definition described in [22], which identifies three levels. Detailed here are these three levels and the possible attacks on the set-up used at each level.

2.1.2.1 Fraud prevention level one

The first level consists of a screen feed and one camera. This set-up is the weakest in terms of fraud detection, but is easy to set up and can be used readily. Depending on the position of the camera, certain attacks are possible; we detail two camera set-ups and the attacks possible against each:

The first set-up has a camera facing the student taking the assessment, which could be a laptop webcam or a camera at a slightly elevated height so that it records the face of the student, see Figure 2.1a. This set-up introduces the aforementioned problem of not being able to verify that the screen recording captures what the user is actually doing, thus allowing the student to use a VM to hide their actions on the computer. Another possible attack is that the area around the student is not recorded, which opens up the possibility for students to use external tools. These could include notes on the test situated on the keyboard of the laptop or, in more extreme cases, a USB keyboard that allows for instantaneous appearance of previously written notes [5].

(a) Fraud prevention level one with webcam facing student

(b) Fraud prevention level one with camera observing environment

Figure 2.1: Two possible scenarios with the first fraud prevention level

The other set-up has a camera that monitors the environment, but might not be able to see what is on the student's screen or see the student's face, see Figure 2.1b. This immediately opens up the possibility of hiring another student or confidant to take the test in the name of the assessed. It is not possible to confirm that the student is in fact the person they claim to be [21]. In addition, this also enables students to use cheating methods that are common even outside digital exams, such as graphical calculators that contain hints or answers to exam questions, or smart watches that enable text messaging [17].

2.1.2.2 Fraud prevention level two

The second level consists of screen capture and two cameras, combining Figures 2.1a and 2.1b; with a proper set-up and audio, most of the previously mentioned fraud attempts are detected. We assume in this interpretation of the fraud prevention levels that the cameras are situated as described in the scenarios of the previous level, so one camera observing the environment and one monitoring the face of the student. However, because the student usually knows in advance where the cameras are situated, this combination still allows some of the methods to be used in an unmonitored region. Additionally, methods such as remote access to the laptop or PC are still hard to detect and again allow the student to bypass the need to make the exam on their own. Even the automatic appearance of notes on screen is still hard to detect with two cameras, as it would need to be combined with text recognition or logging to identify who wrote the lines appearing on screen [22].

2.1.2.3 Fraud prevention level three

The third and final level uses full logging, screen capture, two cameras and live proctoring, which should deter most possibilities of fraud. It still allows for more obscure techniques that compromise the content of the exam. An example mentioned in [5] is the "cold-boot attack", where the student fakes a hardware failure and lets the system shut down. Between booting up and getting back to work on the exam, the contents of the RAM are dumped to a file on the student's hard disk [12], which can later be used to distribute the content of the exam.

This level is, however, very costly, and not all of these techniques are fully automated for fraud detection; they might still require examiners or reviewers to manually check frames for suspicious activity. This holds especially for the second and third fraud prevention levels, where screen-syncing, image recognition and audio thresholds need to be used to fully combat the possibilities of fraud [22].


2.1.3 Current proctoring tools

Proctoring tools are currently used relatively effectively; in this section a few proctoring services are mentioned to examine what the average proctoring service entails. The services mentioned in this paragraph were reviewed in [9], which lists the features and specifications of each proctoring service, split into three separate groups: online proctoring features, lockdown features and authentication.

We observe that the proctoring tools mentioned in this paper have varying degrees of flexibility. Kryterion and Loyalist, for example, allow the proctor to pause or suspend the exam at any time, but also do not allow proctors to examine the screen of the examined. None of the tools employ automatic fraud detection, but Kryterion does automate the detection of inappropriate keystrokes and audio levels, and offers real-time forensics. The proctoring in these tools is mostly done by sending screen feeds of the exams to proctors to validate the integrity, but as mentioned previously, this is insufficient to guarantee such integrity on its own [22].

Lockdown features are constraints that proctoring services use to achieve the previously mentioned integrity, often sacrificing usability for the examined. Tegrity, Respondus and Kryterion, for example, prohibit the use of browser navigation and control buttons and do not allow other applications to run while the exam is in progress. They also restrict the regular control schemes of users by preventing right-clicking, copying and pasting text, and the use of function keys. These features are complementary to the screen capturing and proctoring, as on their own they can be circumvented as seen in the fraud paragraph.

Lastly, the proctoring services have to authenticate that the person participating as a student in the exam remains the same person throughout. It is stressed that this is not the same process as identification, which most proctoring services avoid because performing it correctly brings a plethora of privacy and legal issues. All tools require some form of authentication, primarily through username/password combinations given for the test. Some also require a government-issued ID, and in rare cases a proctoring service utilizes facial recognition (Kryterion) or a fingerprint reader (Software Secure). To further authenticate the entire process, video logs of webcams, which are used in every tool described in [9], are stored, and most services review the session with the video afterwards and flag incidents.

None of the tools mentioned in the paper used automated detection technology, although recent developments have shown that such frameworks can be created and work quite well [1]. These frameworks do, however, require external cameras aside from a webcam, which the proctoring tools mentioned previously do not need.

In the end, a compromise needs to be made between cost and integrity. The more the assessor is sure that the students will not partake in malicious behaviour, the lower the cost for monitoring will be. Integrity can never be guaranteed without a live examiner and proper constraints, but this in turn limits the student.

2.2 Shot boundary detection

To guarantee the integrity of the digital exams, we have to analyze the videos for fraud. Before we can do that, however, we need to find the key points in the video, to minimize the processing and only consider frames with change. We therefore first partition the video sequence of a digital exam into shots to accurately describe the sequence. A shot is typically defined as a series of related frames taken from one camera that detail the same sequence until it is broken up by a transition, boundary or cut. The structure of videos and shots can be observed in Figure 2.2. It is essential to stress that a shot is continuous in time and space for its sequence of frames. A difference between two shots can thus be seen as a change between two frames such that the visual content of the frames differs significantly.


Figure 2.2: Common structure of a video1

To detect shot boundaries one first has to detect the different types of transitions between shots. The cut, for example, is a shot change that occurs quickly, encompassing one or two frames. Others, such as fades or dissolves, use changes in brightness to carry on to the next shot. These are so-called gradual transitions, as they usually span more than a few frames before the next shot is present, while the cut can be referred to as a hard cut, as it occurs almost instantaneously [2]. An example of a hard cut can be found in Figure 2.3a.

Earlier work has been done on detecting cuts and/or gradual transitions in all sorts of media with a myriad of techniques. We will focus on detecting cuts, as hard cuts will be more prevalent than other transition types in the scenarios of a digital exam. We detail five different method classes. The first is based on histogram comparisons, where frames are first converted to (colour) histograms and these histograms are then compared against each other. The second is pixel differences, which include pair-wise pixel comparison but also pixel region comparisons. The third is edge detection methods that subdivide the frames into features that are then compared. The fourth is image hashing techniques, which hash both frames and allow for comparison between the hashes to identify similar or possibly forged images using the statistics and features present in both images. The last is techniques that employ Machine Learning to find shot boundaries, such as the Convolutional Neural Network, which simulates visual processing akin to humans.

(a) Example of a hard cut, where the content is changed significantly between the 2nd and 3rd frame ([31])

(b) The histograms of gray-level pixel values corresponding to the first three frames of Figure 2.3a ([31]), where the second and third histogram differ significantly

Figure 2.3: Frames showcasing a hard cut and the corresponding gray-level histograms of the first 3 frames


2.2.1 Histograms

One of the most common techniques implemented for SBD is the usage of histograms. The histograms of the frames being compared are computed, either gray-scale or colour, and compared with each other. If the difference between the histograms is above a certain threshold, a shot boundary is detected [2]. The main advantage of using histograms is that, given a constant background, histogram differences are considerably more resistant to object motion, as histograms ignore spatial changes within a frame [13]. A problem that might occur as a consequence of this spatial ignorance is that two images can have similar histogram values while being completely different in features and shape. As an example, consider two images that both have 50 high intensity values and 50 low intensity values. One image has all 50 high intensity values on the left side and all 50 low intensity values on the right side, while the other has these values spread throughout the picture. The histogram values are similar, yet the visual content is different. The probability of this scenario is very low, but it is still a drawback of the method. A possible solution is splitting the image up into regions and comparing the histograms of those regions instead of the entire image.

The usage of gray-scale histograms or colour histograms for SBD have both enjoyed success, yet colour histograms are more frequently mentioned as an approach. This is most likely related to colour histograms being more accurate in detailing the picture, due to having three histograms to work with. It is, however, computationally more expensive to work with three histograms instead of one.

Whichever representation of the image is used, the methods that work on the histograms remain similar. The differences between the histograms of two frames can be computed with a variety of methods, including simple distance metrics such as the Euclidean or Manhattan distance. A slightly more complex, commonly used method is the chi-squared test, also referred to as χ², which determines the distance between two histograms based on the binned data, making it sensitive to the number of bins chosen. This approach is quite useful for detecting hard cuts as it enhances the difference between frames, although that also makes it more sensitive to camera and/or object motion [19]. A different approach is using the sum of absolute differences just as in the pixel approach, except now applied to the histograms. An appropriate threshold can be chosen, and [31] proposed to normalize the difference by the bins used and the total number of pixels in the image.

A simple SBD method utilizing histograms can be constructed quite easily. Take Figure 2.3a and compute gray-scale histograms of the frames, resulting in the histograms in Figure 2.3b. We can observe that the difference between the 2nd and 3rd histogram is quite large, and using a distance metric we can classify these frames as belonging to different shots. If we consider the entire video that the four frames of Figure 2.3a belong to, transform all frames to histograms and then apply the distance metric, we obtain the result in Figure 2.4. The histogram difference between two frames indicates a shot boundary if it exceeds a pre-set threshold.
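The method described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the thesis's implementation: the bin count of 64 and the threshold of 0.5 are assumptions that would need tuning per recording, and the difference is normalized by the pixel count in the spirit of [31].

```python
import numpy as np

def hist_diff(frame_a, frame_b, bins=64):
    """Normalized sum of absolute differences between gray-level histograms.

    frame_a, frame_b: 2-D uint8 arrays (gray-scale frames of equal size).
    Returns a value in [0, 2]; 0 means identical histograms.
    """
    h_a, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    h_b, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    # Normalize by pixel count so the metric is resolution-independent.
    return np.abs(h_a - h_b).sum() / frame_a.size

def detect_cuts(frames, threshold=0.5, bins=64):
    """Flag a shot boundary between frames i and i+1 when the histogram
    difference exceeds the (heuristically chosen) threshold."""
    return [i for i in range(len(frames) - 1)
            if hist_diff(frames[i], frames[i + 1], bins) > threshold]
```

Because a fade or dissolve spreads its change over many frames, each successive difference may stay below the threshold; this simple detector therefore targets hard cuts only, matching the focus of this thesis.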

Figure 2.4: The histogram differences of a sequence of frames. Hard cuts and gradual transitions are both present, but hard cuts can easily be detected with a proper threshold ([31]).


Apart from choosing a threshold, the decisions one has to make when using histograms are the number of bins used for the histograms and the method to compare these histograms to decide whether a shot boundary was detected. In the results found in [23], which compares frame-comparison approaches of different universities using colour histogram comparisons, the common approach is to take a fixed number of bins, on average 512 bins (8 x 8 x 8). Some of the participants also opted to use quantization for binning, although this did not seem to give better results. The methods used for histogram comparisons included the ones previously mentioned; however, many participants deemed the choice of distance metric not very substantial for the overall performance. As seen in [6], using the Euclidean or Manhattan distance for comparisons performs well and accurately in detecting hard cuts.

Histograms are a good middle-ground for accuracy and speed [31], but might need proper threshold tweaking to consistently detect hard cuts according to the specifications of the user.

2.2.2 Pixel differences

The simplest algorithms for SBD make use of the sum of absolute differences between pixels, which yields a metric that is then compared against a threshold to determine whether a shot boundary is detected. The technique is quite sensitive to camera motion, and finding the correct threshold can be quite expensive [2]. To combat the sensitivity to motion, a 3x3 averaging filter was used in [31] before the pair-wise comparison between pixels. By smoothing with the neighbouring pixels, the effect of motion is substantially decreased, as fewer pixels will be judged as changed due to movement. Since the smoothing lessens the intensity values, the pair-wise difference between the two frames becomes smaller and does not exceed the threshold with which a "change" is detected. This method is simple to implement and works reasonably well for detecting hard cuts, as large changes occur between the frames when such a cut is encountered. It also allows for more specific thresholding depending on how pixels are influenced by their neighbourhood. A few configurations can be observed in Figure 2.5.
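A minimal sketch of pair-wise pixel differencing with prior smoothing follows; the replicated border padding and the use of the mean absolute difference as the metric are illustrative assumptions, not the exact choices of [31].

```python
import numpy as np

def smooth3x3(frame):
    """3x3 averaging filter; dampens the effect of small object motion."""
    f = frame.astype(np.float64)
    padded = np.pad(f, 1, mode="edge")  # replicate border pixels
    acc = np.zeros_like(f)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            acc += padded[dy:dy + f.shape[0], dx:dx + f.shape[1]]
    return acc / 9.0

def pixel_diff_metric(frame_a, frame_b):
    """Mean absolute pair-wise pixel difference after smoothing both frames.
    The result is compared against a threshold to decide whether a cut occurred."""
    return np.abs(smooth3x3(frame_a) - smooth3x3(frame_b)).mean()
```

A hard cut pushes this metric up sharply for a single frame pair, while slow cursor movement on an otherwise static exam screen barely registers after smoothing.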

Figure 2.5: Different neighbourhood options for pixel differencing ([29]), which can allow for more specialized thresholding.

Another method described in [31] and [2] is the usage of region matching instead of pair-wise pixel comparisons. The frames are subdivided into blocks (regions) of pixels, and statistical measures, such as the mean or standard deviation of the intensity values, of corresponding regions are compared. If the statistical measures differ beyond a given threshold, the region is declared "changed". If enough regions are considered "changed", exceeding another threshold, a shot boundary is declared. For example, consider Figure 2.6, where we might consider the scene of a person. Small changes between the frames would need to be detected as a new shot.


Figure 2.6: Image split up in 16 regions, which can help detect new shots by small changes such as eye or mouth movement [7].

The approach of using regions, similar to the smoothing filter mentioned previously, aids in increasing tolerance against camera motion and object motion, as long as the object is small enough for the considered regions. It might, however, be significantly slower due to possible complexity of the statistical measures.

In these approaches, one can see that thresholds are commonly introduced to detect a shot boundary. These thresholds can be pre-set based on a statistical distribution, or can be heuristically chosen thresholds that previously worked for a certain approach. Both of these are, however, global thresholds: chosen too high, they will leave many shot boundaries undetected, while a global threshold chosen too low will incite many false detections [13], as in Figure 2.7.

Figure 2.7: The x-axis is the frame number and the y-axis the difference coefficient. With a pixel differencing approach, the hard cuts might not be easily distinguished due to variability between frames [26]. Using a pre-set threshold, where the difference between two frames needs to exceed a constant value, would result in either smaller peaks being missed or only large peaks being detected.


Therefore, an adaptive threshold approach may be advised, which matches the local activity of camera/object motion [26]. A sliding window examines m successive frame differences, and the presence of a shot boundary is checked at the middle of every window position. Simply put, if the metric observed between frames k and k+1 is the largest of the current window and is at least α times larger than the second largest metric in the window, there is a cut between the two observed frames [13], as can be observed in Figure 2.8. The adaptive threshold approach mentioned in [26] is slightly limited because α is again chosen heuristically. In response, [14] describes that the Gaussian distribution mentioned in [31] can be used to determine α indirectly, yet this brings the problem of non-regulation for missed detections, as the distribution measured on boundaries is not taken into consideration [13].
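The sliding-window rule can be sketched directly on a list of frame-difference values; the window size of 7 and α = 2 are illustrative, since [26] chooses them heuristically.

```python
def adaptive_cuts(diffs, window=7, alpha=2.0):
    """diffs[k] is the difference metric between frames k and k+1.
    A cut is declared at k when diffs[k] is the maximum of the window
    centred on k and at least alpha times the second-largest value."""
    half = window // 2
    cuts = []
    for k in range(half, len(diffs) - half):
        win = diffs[k - half:k + half + 1]
        if diffs[k] != max(win):
            continue  # the boundary check happens only at the window's middle
        second = sorted(win)[-2]
        if diffs[k] > 0 and (second == 0 or diffs[k] >= alpha * second):
            cuts.append(k)
    return cuts
```

Unlike a global threshold, a moderate peak surrounded by small differences is still detected, while the same peak amid heavy motion is ignored.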

The approaches for pixel differences thus range from simple yet sensitive to complex but computationally expensive, and are still a good starting point for the detection of hard cuts.

Figure 2.8: Using an adaptive threshold, it is possible to detect hard cuts depending on the current visual information, instead of using a hard global threshold ([13]).

2.2.3 Edge detection

Edges in images are points where sharp changes in image brightness can be observed; they are usually summarized as curved lines that mark the difference between two areas of brightness. When a cut or dissolve takes place, new intensity edges appear and old edges disappear in the difference between two frames. By searching for appearing and disappearing edges and counting them, one can detect a cut. This is the so-called edge change ratio (ECR) method, mentioned in [28], which uses this structural representation of the images to perform SBD.

The key steps of ECR, as mentioned in [28], are the following. First, the motion between frames is compensated for by shifting the images so that they best overlap. To compute the best overlap, an algorithm is used to maximize the sum of all pair-wise pixel comparisons such that similar pixels are weighted most. Algorithms mentioned for the motion compensation are census transform correlation and the Hausdorff distance, both outlier-tolerant methods.

After shifting the images, the next step is calculating the edge change fraction. This is done by computing the percentage of edges that enter and exit between the frames. These are detected by first using the Canny edge detector to calculate the edge map of the frame [27] and then comparing pixel values in a radius around detected edge pixels to compute the entering and exiting edges. Two ratios are defined: the ratio of edge pixels in frame k whose distance to the closest edge pixel in frame k+1 exceeds a specified threshold (exiting), and the ratio of edge pixels in frame k+1 whose distance to the closest edge pixel in frame k exceeds the same threshold (entering) [13]. The edge change ratio is then defined as the highest of these two ratios, and if it forms a local peak, a shot boundary is encountered. The algorithm thus searches for peaks within a window of changing frames to detect shot boundaries; cuts are very pronounced as they lead to a singular peak in ratio value among low values [28]. An example of the process for ECR can be seen in Figure 2.9.
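A simplified, self-contained sketch of the edge change ratio follows. Motion compensation is omitted, and a crude gradient-threshold edge map stands in for the Canny detector of [28]; the gradient threshold and pixel radius are assumptions.

```python
import numpy as np

def edge_map(frame, grad_thresh=32):
    """Crude stand-in for Canny: mark pixels whose finite-difference
    gradient magnitude exceeds grad_thresh."""
    f = frame.astype(np.float64)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, :-1] = f[:, 1:] - f[:, :-1]
    gy[:-1, :] = f[1:, :] - f[:-1, :]
    return np.hypot(gx, gy) > grad_thresh

def dilate(mask, radius=1):
    """Grow edge pixels by `radius`, so edges within that distance in the
    other frame count as matched (wrap-around at the borders is ignored here)."""
    out = mask.copy()
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def ecr(frame_a, frame_b, radius=1):
    """Edge change ratio: the larger of the exiting-edge fraction of
    frame_a and the entering-edge fraction of frame_b."""
    e_a, e_b = edge_map(frame_a), edge_map(frame_b)
    if not e_a.any() or not e_b.any():
        return 1.0 if e_a.any() != e_b.any() else 0.0
    exiting = (e_a & ~dilate(e_b, radius)).sum() / e_a.sum()
    entering = (e_b & ~dilate(e_a, radius)).sum() / e_b.sum()
    return max(exiting, entering)
```

A local peak of this ratio over successive frame pairs, relative to its neighbours, then signals a cut, as in the window-based peak search described above.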


Figure 2.9: Main computation steps of ECR ([28])

The problem with the edge detection approach is that it is computationally the most expensive of the methods mentioned so far, while it does not significantly improve SBD compared to histogram comparisons [27]. The method is, however, more invariant to sudden illumination changes than colour histograms, which makes it a good approach for detecting flashlights.

2.2.4

Image hashing

A different approach to SBD could be the usage of image hashing. Hashes are typically used to encode large amounts of data into a smaller fixed-size representation. Image hashing uses a technique called perceptual hashing [18] to create a hash based on the visual information in the image, which tries to create hashes such that if two images look similar to the human eye, the hashes are similar as well. This is in stark contrast to cryptographic hashes, where small changes can influence the entire hash encoding. The main advantage of image hashing is that once the hashes have been created, comparisons can be done quickly while retaining a high level of content representation from the original images [24].
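The contrast with cryptographic hashes can be illustrated directly: changing a single byte of the input produces a completely unrelated digest, so cryptographic hashes carry no notion of visual similarity. A minimal sketch using Python's standard `hashlib` (the byte strings are arbitrary placeholders):

```python
import hashlib

# Two inputs differing in a single byte
a = b"screenshot-frame-0001"
b = b"screenshot-frame-0002"

digest_a = hashlib.sha256(a).hexdigest()
digest_b = hashlib.sha256(b).hexdigest()

# Count hex positions where the two digests happen to agree;
# for a cryptographic hash this is essentially random.
matching = sum(x == y for x, y in zip(digest_a, digest_b))
print(digest_a != digest_b, matching)
```

A perceptual hash aims for the opposite behaviour: near-identical inputs should map to near-identical hashes.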

The process of image hashing also relies on features, just as the ECR method described in the previous paragraph. The features used for the hash can be computed in a variety of ways, and the many possibilities in feature extraction and hash generation give rise to a multitude of image hashing approaches. We discuss the approach of [32] because it uses the important elements of image hashing and comparison, but it is important to note there are countless other possibilities.


This approach is split into a 2-step process, which starts with the creation of the image hash. This is done by first re-scaling the image to a fixed size and converting it to the same representation, so all hashes are created in the same way. After that, features are detected both globally and locally. [32] makes use of Zernike moments of luminance/chrominance components to create a global vector, which describes the image as a whole. Local features are extracted from salient regions, which are regions of visual attention. The K largest regions are extracted and, because most images do not have more than 6 salient regions, K = 6 is chosen. In these salient regions, local texture features are computed. The paper describes the usage of coarseness and contrast, but other local features such as directionality, line-likeness, regularity, and roughness could also be used [25]. The process of detecting salient regions and finding local features based on visual information is shown in Figure 2.10.

Figure 2.10: Salient region detection and local feature extraction: (a) Original image. (b) Saliency map. (c) Salient region. (d) Four regions of interest with visual information ([32]).

After finding the salient regions, the hash is computed as the combination of the Zernike moments, the position and size of the salient regions, and the local texture features. For encryption purposes, the intermediate hash containing the Zernike moments vector and the intermediate hash of the salient regions with their corresponding local texture features are combined with a secret key and a modulus operation is performed. In total the paper uses 3 secret keys to compute the image hash. The block diagram of the entire image hashing method can be observed in Figure 2.11.

Figure 2.11: The block diagram of the image hashing method which combines global and local features into one hash, with secret keys to secure the content ([32]).

To compare images, one first has to compute both hashes using the same method. After obtaining the hashes, one can use the secret keys to recompute the combination of Zernike moments, salient regions and local texture features. The best first step is to check whether the salient regions match between the two images. After comparing the regions, one can compare the distance between the image hashes using the global feature vector of Zernike moments and the local texture features. A simple distance metric, such as the Euclidean distance, can then be used to find the hash distance; if the hash distance is below a certain threshold, the images are considered similar. In practice, [32] mentions that the Zernike moments alone are enough to determine whether images are similar. As in the previous paragraph, a threshold has to be chosen, which means it might not be optimal.


The approach mentioned in this paragraph is but one example of the many image hashing methods that can be constructed. However, most if not all methods rely on some form of feature extraction, as the visual information in the image is the most important input for the created hash.

Image hashing might be computationally expensive in a case-by-case scenario, but it offers a comparison method between two images that is close to how a human would perceive their difference. It also adds to the security and integrity of the process.

2.2.5

Machine learning

One can also view SBD as a pattern recognition problem and use tools from the machine learning field to detect shot boundaries. Machine learning is built upon the principles of training and testing. Training is done to learn from examples that have known outcomes, i.e. we have a frame which we know starts a new shot and tell the learning algorithm just that. After learning from the training data, the algorithm can be evaluated on a test set: the algorithm outputs whether a frame starts a new shot, and this output is compared against the known outcome to determine how well the trained algorithm works.

[27] mentions two kinds of machine learning methods to use for SBD: discriminative and generative classifiers. Generative classifiers can incorporate additional information and can usually explain their shot transition decisions, while discriminative approaches are more like black boxes but can be used when the correctness of the assumptions made by generative classifiers cannot easily be proven. Examples of discriminative methods are K-means, KNN, SVMs and CNNs for SBD, where the parameters and classification boundaries are automatically created during training.

Utilizing machine learning methods leads to two problems: creating a training set that has a broad range of positive and negative examples for SBD, and constructing features for the classifiers of the machine learning algorithm.

Creating a training set can be simplified by generating the shot boundaries artificially; a small video dataset can then be subsampled or bootstrapped to create small samples of a few frames. In [16] dissolves are generated from a proper dataset and bootstrapped using a classifier. In [11] a small dataset of around 4 hours is used and subsamples of 10 frames are created as training data, either containing a single shot or artificially combined with a transition to create shot boundary transitions.

The second problem is the construction of features for the machine learning algorithms. [8] mentions the usage of an SVM where the features are based on wavelet coefficient vectors within a sliding window: the wavelet coefficients of blocks in the temporal window form the features. [27] mentions using continuity signals in temporal intervals as features for KNN and/or SVMs. Because shot transitions are random temporal processes, it is important for the features of the classifiers to also be temporal.

[11] makes use of a CNN and poses the SBD problem as a binary classification problem that tries to predict whether a frame is part of the same shot as the previous frame. The CNN is trained with a cross-entropy loss minimized by stochastic gradient descent and, after training on the aforementioned dataset, achieves high processing speed and performance in detecting shot boundaries. However, the training time the CNN required before the experiments was not mentioned.

An example of how a CNN can be employed in the detection of shot boundaries is found in Figure 2.12. It also indicates that CNNs are a lot less flexible to use, as a system has to be built around them, whereas a quickly programmed histogram difference can be used instantly.


Figure 2.12: Shot boundary system making use of a 3D CNN using spatio-temporal information. The video is split in segments of 16 frames with 8 frames of overlap. Consecutive fragments with the same label are merged and a post-processing step using histogram-differences is used to reduce false positives ([15]).

Machine learning algorithms can create large speed-ups once properly configured and tested. The hardest part is finding the correct metrics to compare and incorporating them into the chosen classifier.
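To make the discriminative route concrete, the sketch below frames SBD as binary classification with a k-nearest-neighbour vote over temporal feature vectors. The feature choice (a short window of frame-difference scores) and the toy training data are purely illustrative assumptions, not the setup of any cited paper:

```python
import numpy as np

def knn_predict(train_X, train_y, x, k=3):
    """Classify feature vector x by majority vote of its k nearest training samples."""
    dist = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return int(round(train_y[nearest].mean()))

# Hypothetical training data: 3-frame windows of frame-difference scores.
# Flat windows (label 0 = same shot) vs. windows with a spike (label 1 = boundary).
train_X = np.array([[0.1, 0.2, 0.1], [0.0, 0.1, 0.0],
                    [0.1, 0.9, 0.1], [0.2, 0.8, 0.2]])
train_y = np.array([0, 0, 1, 1])

print(knn_predict(train_X, train_y, np.array([0.1, 0.85, 0.15])))  # → 1 (boundary)
```

An odd k avoids ties in the binary vote; in a real system the feature windows would come from the continuity signal of the video.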


CHAPTER 3

Shot Boundary Detection for digital exam recordings

3.1

Data

3.1.1

General overview

To test the accuracy of the SBD techniques, videos that mimic the digital exam environment of the students are used. 6 hours of video are used in this thesis, split up into three videos of 2 hours each. The videos were recorded at a 640x480 resolution and captured at 25 FPS. The content of these videos consists of screen-captures of a student working on a digital exam, while also occasionally visiting other websites and chat applications, which indicates fraudulent behaviour. The recordings do not follow a defined set-up such as the previously mentioned fraud prevention levels; however, the screen-recording is an integral part of preventing fraud in digital exams, and this experiment uses that information as a guideline to be improved upon.

3.1.2

Annotation

The videos on their own are not suitable as a test set, as no ground truths are encoded in the video. We first have to define the ground truths for two cases: what is fraudulent behaviour, and what is a new shot? After answering these two questions, we can annotate the exact frames on which these events occur to create a test set on which the SBD can be tested.

To define fraudulent behaviour in our videos, we have to consider that the student browses a multitude of different webpages and other screens, such as the Explorer. An instance of fraudulent behaviour is marked when the student visits pages such as Facebook or the WhatsApp browser application. Any communicative environment should be flagged as a possible instance of fraud.

To capture these instances, shot boundaries are to be detected for every new page the student visits. However, in the test set we also count a new shot when a significant portion of the screen changes on the same page, such as a pop-up from Skype or a new ad displayed on Facebook. This not only tests the SBD techniques, but also mimics a real-life scenario of a student getting a message from a friend on their laptop that might contain hints for their exam. Excluded from the definition of a new shot are shots that would result from scrolling, unless the scroll introduces a new image or uses a larger font that properly differentiates it from the previous shot.

With the definitions set, it is necessary to annotate the data to obtain our known shot boundaries and fraud instances. Annotation of the videos was done via ANVIL, which lets the user annotate video files according to a specification format.


The specification format for the test set uses 2 annotation types. Fraud instances are marked by noting the start and end frames of the fraud instance while shot boundaries are annotated with a singular notification at the transition frame. In Figures 3.1a and 3.1b we can see how ANVIL can be used to annotate for individual frames, where the primary type track is used for annotating fraud periods and the point track is used to indicate particular transition frames for a new shot.

(a) Frame observed in ANVIL, showing the start of a transition from a blank page to Facebook. The current content and frame of the video is displayed and the annotation track is annotated with fraud regions and shot boundaries

(b) Next frame observed in ANVIL, which shows the loading of the Facebook page. The fraud instance is marked such that the entirety of the Facebook session is noted and new shots are introduced when the page loads

Figure 3.1: A shot boundary annotated in ANVIL

To accommodate ANVIL, the videos are converted to mp4 format and split into one-hour portions, as a two-hour video is not accessible in ANVIL. After annotation is finished, an ANVIL file is produced that contains XML data. The XML data contains the start and end time of each fraud instance; for shot boundaries it notes the exact time in the video.
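Reading the annotations back for comparison then amounts to parsing this XML and converting the annotated times to frame numbers. The sketch below uses Python's standard `xml.etree.ElementTree` on a hypothetical export; the actual tag and attribute names depend on the ANVIL specification format used and are assumptions here:

```python
import xml.etree.ElementTree as ET

# Hypothetical ANVIL-style export; real element names may differ.
xml_text = """
<annotation>
  <track name="fraud" type="primary">
    <el start="12.40" end="95.00"/>
  </track>
  <track name="shots" type="point">
    <el time="12.44"/>
    <el time="13.08"/>
  </track>
</annotation>
"""

FPS = 25  # the recordings were captured at 25 FPS

root = ET.fromstring(xml_text)
# Fraud instances: (start, end) in seconds.
fraud = [(float(e.get("start")), float(e.get("end")))
         for e in root.find("track[@name='fraud']")]
# Shot boundaries: annotated times converted to frame numbers.
shot_frames = [round(float(e.get("time")) * FPS)
               for e in root.find("track[@name='shots']")]
print(fraud, shot_frames)
```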

3.2

Algorithms

3.2.1

Pixel differences

The SBD method for pixel differencing is based on the Mean-Square Error (MSE) comparison between two frames. The MSE is one of the simplest algorithms for pixel difference comparison and is a suitable approach to test how well pixel differencing works for SBD on screen-recordings. It is defined as follows:

$$\mathrm{MSE} = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}\left(X(i,j) - Y(i,j)\right)^{2}}{M \cdot N}$$

where $M$ and $N$ are the dimensions of the frames (which must be equal) and $X$ and $Y$ are the corresponding frames to be compared. This is purely a metric on its own, so a threshold has to be introduced: a small MSE means the images are similar, so a shot boundary is detected when the threshold is exceeded.
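A direct implementation of this metric is straightforward with NumPy; the thresholding step is shown as a separate helper, and the frames are assumed to be equally-sized arrays:

```python
import numpy as np

def mse(frame_x, frame_y):
    """Mean-square error between two equally-sized frames."""
    x = frame_x.astype(np.float64)
    y = frame_y.astype(np.float64)
    return ((x - y) ** 2).mean()

def is_new_shot(frame_x, frame_y, threshold):
    """Detect a shot boundary when the MSE exceeds the threshold."""
    return mse(frame_x, frame_y) > threshold

# Tiny synthetic 'frames' to illustrate the metric
a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 10, dtype=np.uint8)
print(mse(a, b))  # → 100.0
```

Casting to float64 before subtracting avoids the wrap-around that unsigned 8-bit subtraction would cause.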

3.2.2

Histogram comparisons

Histogram comparisons between two frames can be made if both frames are first converted to histograms and then compared via a distance metric. The histograms in this thesis are colour histograms, with 16 bins per colour by default (other bin counts are tested in the experiments). The range used for pixel intensity values is 0 to 255, preserving the original intensities. After the histogram is calculated, it is normalized so the histograms can be compared on the same scale.


After the histograms are formed for the two frames, a variety of comparison methods can be used. In this thesis we chose the Correlation method from openCV to compare the histograms, as it results in a value between 0 and 1 that indicates the similarity between frames, which is easier to threshold than the Chi-Square or Hellinger distance. The formula for the Correlation method is defined as follows:

$$d(H_1, H_2) = \frac{\sum_{I}\left(H_1(I) - \bar{H}_1\right)\left(H_2(I) - \bar{H}_2\right)}{\sqrt{\sum_{I}\left(H_1(I) - \bar{H}_1\right)^{2} \sum_{I}\left(H_2(I) - \bar{H}_2\right)^{2}}}$$

in which $\bar{H}_k = \frac{1}{N}\sum_{J} H_k(J)$, $N$ is the total number of bins used and $H_i$ indicates a histogram.

Using this formula we retrieve a similarity score between 0 and 1, and our threshold for determining whether a new shot is detected lies within that range. In the theoretical background it was mentioned that the total number of bins does not make a significant difference, but this thesis will experiment with 8/16/32 bins for colour histograms, as the small differences in screen-recordings possibly allow for greater variance. If the bin count is not explicitly mentioned, 16 bins are used.

Important to note in this experiment is that the correlation value between whole-frame histograms is extremely similar when comparing two frames of a screen-recording: the correlation coefficient was often above 0.999 even when the visual content of the frames is quite significantly different. To circumvent this problem, the frame is subdivided into 16 regions. The histogram of each region is then individually calculated for the first frame and compared against the corresponding region of the other frame.
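The region-based comparison can be sketched as follows. The 4×4 grid, the per-channel 16-bin histograms, and the use of the minimum per-region correlation as the frame score are illustrative assumptions; openCV's `compareHist` could replace the hand-written correlation:

```python
import numpy as np

def correlation(h1, h2):
    """Normalized correlation between two histograms (the Correlation metric above)."""
    a, b = h1 - h1.mean(), h2 - h2.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 1.0

def frame_similarity(f1, f2, grid=4, bins=16):
    """Minimum per-region histogram correlation between two colour frames."""
    h, w = f1.shape[:2]
    scores = []
    for r in range(grid):
        for c in range(grid):
            ys = slice(r * h // grid, (r + 1) * h // grid)
            xs = slice(c * w // grid, (c + 1) * w // grid)
            hists = []
            for region in (f1[ys, xs], f2[ys, xs]):
                hist = np.concatenate([
                    np.histogram(region[..., ch], bins=bins, range=(0, 256))[0]
                    for ch in range(3)]).astype(float)
                hists.append(hist / hist.sum())  # normalize to the same scale
            scores.append(correlation(hists[0], hists[1]))
    return min(scores)  # the most-changed region decides
```

Taking the minimum makes a localized change (a pop-up in one corner) visible even when the rest of the screen is identical; averaging the regions would be a plausible alternative.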

3.2.3

Feature similarities (ECR)

For comparisons of frames based on features and edges, we use the Edge Change Ratio (ECR). This method makes use of the Canny Edge Detector to detect the edges $E_i$ in both frames, which have to be converted to gray-scale beforehand. We make a copy of the edges, which is then dilated by a preset radius $r$, in our case 5. This provides us with $\bar{E}_i$, where every edge pixel is replaced with a diamond whose width and height are $2r + 1$ pixels. We determine the exiting and entering edges by observing whether pixel values in $E_i$ and $\bar{E}_i$ have similar intensity values. This gives us the following equations for $p_{out}$ and $p_{in}$, respectively:

$$p_{out} = 1 - \frac{\sum_{x,y} E_1[x,y]\,\bar{E}_2[x,y]}{\sum_{x,y} E_1[x,y]} \qquad p_{in} = 1 - \frac{\sum_{x,y} E_2[x,y]\,\bar{E}_1[x,y]}{\sum_{x,y} E_2[x,y]}$$

where $x$ and $y$ index the pixels in the edge maps $E_1$ and $E_2$. We assume that the images have the same scale and are thus aligned without need for translation. We take the maximum of the exiting $p_{out}$ and entering $p_{in}$ ratios as the ECR metric.

To detect a new shot, we check whether the computed metric is above the specified threshold. The edge change ratio is a number between 0 and 1, but similarly to the histogram correlation method, the thresholds need to be extremely small because consecutive frames of screen-recordings are very similar. Thresholds between 0 and 0.1 are therefore used.
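Given binary edge maps (e.g. produced by `cv2.Canny` and thresholded to booleans), the ratio computation can be sketched with NumPy alone; the diamond dilation is implemented naively here for clarity, and the edge maps are assumed to be already aligned:

```python
import numpy as np

def dilate_diamond(edges, r):
    """Dilate a boolean edge map with a diamond of 'radius' r."""
    h, w = edges.shape
    out = edges.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r + abs(dy), r - abs(dy) + 1):  # diamond shape
            shifted = np.zeros_like(edges)
            ys = slice(max(dy, 0), h + min(dy, 0))
            xs = slice(max(dx, 0), w + min(dx, 0))
            ys0 = slice(max(-dy, 0), h + min(-dy, 0))
            xs0 = slice(max(-dx, 0), w + min(-dx, 0))
            shifted[ys, xs] = edges[ys0, xs0]
            out |= shifted
    return out

def ecr(e1, e2, r=5):
    """Edge change ratio: max of exiting (p_out) and entering (p_in) edge fractions."""
    d1, d2 = dilate_diamond(e1, r), dilate_diamond(e2, r)
    n1, n2 = e1.sum(), e2.sum()
    p_out = 1.0 - (e1 & d2).sum() / n1 if n1 else 0.0
    p_in = 1.0 - (e2 & d1).sum() / n2 if n2 else 0.0
    return max(p_out, p_in)
```

Identical edge maps yield an ECR of 0, while edges that move farther than $r$ pixels count as both exiting and entering, pushing the ratio towards 1.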

3.2.4

Image hashing

The image hashing approach used in this thesis follows the pHash (perceptual hash) approach with a 64-bit hash. It is similar to the aHash (average hash) approach, which is worth discussing first to grasp the pHash algorithm. aHash shrinks the image down to a smaller size, for example 16x16 or 8x8. For every pixel in this shrunk image we keep track of a bit in an array, which will form the hash itself. The shrunk image is converted to grayscale and the mean colour value is computed from the intensities of the pixels. Following that, all pixels are compared to the mean and if a pixel has a higher value than the mean, the bit for that pixel is set to 1. If an 8x8 shrunk image was used, the hash consists of 64 bits based on the intensity values of the pixels in the 8x8 image. aHash is fast and quite accurate in distinguishing images.

However, pHash is built more directly on the perceptual hashing paradigm, i.e. similar images should have similar hashes. aHash can produce more misses in detection when gamma correction or histogram adjustments are applied to the image. For this thesis the pHash approach was therefore chosen, and it contains the following steps:

It reduces the image size similarly to aHash, but a slightly larger shrunk image is usually used, such as a 32x32 format. Afterwards the image is converted to gray-scale, but instead of computing the mean colour value we apply the Discrete Cosine Transform (DCT) to separate the image into frequencies. The top-left values of the DCT are used, as these contain the low frequencies, to which the human eye is most sensitive. This leaves us with 8x8 DCT values, excluding the DC coefficient. Similar to the aHash algorithm, we now compute the mean (except now over DCT values). We loop through the 8x8 DCT values and compare each value against the average; if the DCT value is higher than the average, the corresponding bit is set in the 64-bit binary array. The hashes that this approach produces are similar if the images are similar to each other [30].

Important to note is that the allocation of bits to pixels needs to follow a consistent ordering, so that the hash is always computed in the same manner. This approach always starts at the top-left-most pixel and proceeds row by row.

To determine whether a new shot has been found via the pHash algorithm, we can use a simple distance function such as the Hamming distance: the hashes are compared bit-by-bit and the number of differing positions gives us the hash difference. Because the frames of a screen-recording are very similar, the distance between the hashes will not be very large. Therefore, thresholds between 1 and 10 will be used.
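The pHash pipeline can be sketched compactly with NumPy; the orthonormal DCT-II matrix is built explicitly so no extra dependencies are needed, and the input is assumed to be an already-shrunk 32x32 grayscale array:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    m[0] /= np.sqrt(2.0)
    return m

def phash(gray32):
    """64-bit pHash of a 32x32 grayscale image, returned as a boolean array."""
    d = dct_matrix(32)
    freq = d @ gray32 @ d.T          # 2-D DCT
    low = freq[:8, :8].flatten()     # low-frequency 8x8 block, row by row
    mean = low[1:].mean()            # mean excluding the DC coefficient
    return low > mean

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(h1 != h2))
```

A uniform brightness change only affects the DC coefficient, so the resulting hashes differ by at most one bit (up to floating-point rounding); this is precisely the robustness that aHash lacks.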

3.3

Experimental Setup

3.3.1

General method

Testing of the different SBD algorithms is done in Python with openCV 2.0 to handle the videos. The annotated frames are read from the ANVIL specification files and compared against the frames that the SBD methods find. The general method for checking for new shots involves reading a frame from the openCV VideoCapture and comparing it against the frame of the previous shot. At the start of the experiments, the only known frame is the first frame. The method currently being tested creates a score based on that first frame and the current frame being read; if the threshold is exceeded, a new shot is found. The frame for this shot is saved and becomes the frame that subsequent frames are scored against. This repeats until all frames of the original video are processed, leaving us with an array of frame numbers that have been detected as shot boundaries. However, the frame count of Python/openCV sometimes randomly advances by a few frames, so it does not always match up with our original test set processed in ANVIL. A fault tolerance margin is therefore added to the found shot boundaries when comparing them to the annotated frames in ANVIL. If no special mention is made, the margin of tolerance used in this thesis is 2 frames before and after the currently processed frame. Thus if the found frame is i, we consider [i − 2, i − 1, i, i + 1, i + 2] as counting towards a match with the original annotated frame. Considering the original videos are recorded at 25 FPS, a 5-frame window should be suitable, as we expect that students are not able to commit fraud within a 0.2 second window.
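The detection loop and the fault-tolerance comparison can be sketched independently of openCV; `score` stands for any of the four metrics, and the 1-D "frames" in the toy example are placeholders for real frame arrays:

```python
def detect_shots(frames, score, threshold):
    """Return frame numbers whose score against the last shot frame exceeds the threshold."""
    boundaries = []
    reference = frames[0]          # at the start only the first frame is known
    for i, frame in enumerate(frames[1:], start=1):
        if score(reference, frame) > threshold:
            boundaries.append(i)
            reference = frame      # the new shot becomes the comparison frame
    return boundaries

def true_positives(found, annotated, margin=2):
    """Count annotated boundaries matched by a found boundary within +/- margin frames."""
    return sum(any(abs(f - a) <= margin for f in found) for a in annotated)

# Toy example: 1-D 'frames' with absolute difference as the score
frames = [0, 0, 0, 5, 5, 9, 9]
found = detect_shots(frames, lambda a, b: abs(a - b), threshold=2)
print(found, true_positives(found, annotated=[4, 5]))  # → [3, 5] 2
```

The margin parameter implements the ±2 fault tolerance window described above.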

Utilizing the found frames and the window, we can infer two key statistics. The first is the true positive count, which indicates how many shot boundaries were correctly found from the test set using one of the four SBD approaches. The true positives are expressed as a percentage of the total number of shot boundaries annotated in the original video: the more true positives are detected, the better the accuracy of the approach.


SBD also introduces false positives, i.e. frames that were detected as shot boundaries but are not. The total frames found by the SBD can also be expressed as a percentage, this time of the total number of frames in the original video. This total frames metric is useful for determining the reduction capabilities of the approach: the original video sequence can be up to 100 thousand frames, and a reduction to one thousand frames is a significant improvement, summarizing the video in its entirety with far fewer frames. Keeping these two metrics in mind, we test the aforementioned four approaches to see which method performs most accurately and efficiently.

The standard and most important experiment is performed on the test set that was created using ANVIL. The unaltered video data is fed to the different algorithms, which all produce a list of frame numbers that are detected as shot boundaries. These frame numbers are then compared with the annotated frame numbers from the test set via the fault tolerance window, which gives us a true positive percentage and a frame reduction percentage.

Because the different methods all use a threshold value that alters the end result, 5 low-value thresholds were selected for each method. The threshold values were very lenient, as the screen-recordings are very similar and two different frames can therefore easily be mistaken for identical ones. The step between consecutive thresholds was incremented linearly, to determine whether the accuracy or frame reduction rate would also grow linearly. The structure for determining the effectiveness of an SBD technique can be found in Figure 3.2.

Figure 3.2: The pipeline in which the shot boundary detection methods are compared.

3.3.2

Fault tolerance window

The fault tolerance window for the general experiment uses a window of 5 frames. To test what effect this window has on the results, experiments are run using 1-frame, 3-frame and 7-frame windows. These represent the frame regions [i], [i − 1, i, i + 1] and [i − 3, i − 2, i − 1, i, i + 1, i + 2, i + 3] respectively, with i being the detected new-shot frame. The experiments are performed using the threshold that yielded the highest true positive rate in the general approach. The goal is to find out whether the 5-frame window is suitable compared to the 1, 3 and 7 frame windows, as the window size might drastically affect the true positive rate.

3.3.3

Histogram bin count

In the theory, the bin count was usually specified at 8 bins for every colour, and it was mentioned that the bin count did not significantly impact the result of SBD. With screen-recordings, however, small effects can have a large impact on the eventual shots that are detected, i.e. a small threshold change can already raise the true positive percentage by 5 percent. We therefore test what happens when 8, 32 or 64 bins are used for every colour instead of 16 bins. The aim is not only to check whether the bin count affects the true positive rate, but also how much the bin count truly affects the speed of the histogram differencing technique.


3.3.4

Altering video resolution

Videos are recorded in a variety of ways and the video resolution will not always be 640x480 (VGA) as in our test set. We tested the effect of resizing the original video to 320x240 (QVGA) to see how resistant the approaches are to different video feeds with pre-set thresholds. We deliberately chose a downscaled version, as upscaling might introduce artifacts. We also want to determine whether there is a substantial speed-up from downscaling the videos.

Figure 3.3 visually describes the framework that is used for the experimental set-up, with all the possible options that are used in testing.

Figure 3.3: The complete pipeline in which the shot boundary detection methods are compared. The arrows show the general flow through the pipeline, while the double lines show the links of components that have configurable parameters, such as the choice between different SBD algorithms or resizing


CHAPTER 4

Results

4.1

General experiment on unaltered videos

The results for the unaltered videos, compared to the annotated frames from our test set with the chosen thresholds, are displayed in Figure 4.1, which shows the percentage of true positives found and the percentage of total frames left after SBD.

(a) Histogram results (16 bins) (b) MSE results

(c) ECR results (d) Hash results

Figure 4.1: Comparison between the different SBD approaches showcasing the true positives and total frame reduction each approach yields


True positive percentage Total frame percentage Speed (sec)

Histogram 78 11 1292

MSE 77 2 7019

ECR 74 23 1989

Image hashing 80 7 3212

Table 4.1: Comparison between SBD approaches using the highest true positive thresholds

In Table 4.1 we additionally see the speed of the approaches using the thresholds that yield the highest true positive percentages. We can see from Figure 4.1 that the MSE (4.1b) and ECR (4.1c) approaches show a seemingly linear growth in true positive percentage. However, the ECR technique grows more exponentially in total frames compared to the MSE method. In stark contrast, the true positive percentage of the Histogram (4.1a) and Image hash (4.1d) methods is heavily dependent on the chosen threshold, as a stricter threshold decreases the true positive percentage quite significantly. Similarly to the ECR method, the total number of frames detected as shot boundaries also rises exponentially when the threshold is chosen more leniently in the Histogram and Image hash methods. Based on the figures alone, we can summarize the different approaches:

We can infer from Figure 4.1a that the Histogram method performs the fastest and yields a reasonable frame reduction, with a lower total frame percentage than the ECR method and one similar to the Image hash method. It is highly dependent on a low tolerance threshold to find shot boundaries, which might prove dangerous when working with different video formats.

The MSE method in Figure 4.1b shows a linear decrease in true positive percentage while maintaining a low total frame percentage throughout. The technique is promising in terms of minimizing the total number of frames, but the speed might be off-putting at high resolutions, as it is slow due to the number of pixel comparisons that have to be made.

The ECR method in Figure 4.1c performs the worst in true positive percentage and especially in total frame reduction, its only saving grace being its reasonable speed and performance even with stricter thresholds. If a faster approach than MSE is required and the thresholds can be affected by uncertain video formats, the ECR method might prove useful.

The Image Hashing in Figure 4.1d utilizing pHash performs well if the threshold is lenient, performing best in finding true positives and having a lower total frame percentage than the Histogram technique. The speed of hashing is significantly slower than the Histogram method, however, but it should be more resistant against different video sizes due to its use of the DCT.

4.2

Fault tolerance

The fault tolerance windows were tested and compared for the four SBD techniques using the most lenient thresholds, as these yielded the highest true positive percentages. The results can be observed in Table 4.2.

1 frame 3 frames 5 frames 7 frames

Histogram 68% 75% 78% 96%

MSE 70% 76% 77% 99%

ECR 65% 72% 74% 93%

Image hashing 71% 78% 80% 99%

Table 4.2: True positive percentages of the four SBD approaches for different fault tolerance window sizes


We can infer from Table 4.2 that using a window size of 7 leads to too many positives compared to the other fault tolerance windows. This is mostly attributed to detected shot boundaries overlapping with other annotated shot boundaries, as the window covers too much of the video. The MSE and Image hashing techniques still perform relatively well despite the openCV and ANVIL syncing problems, finding above 70 percent of the true positives with a window size of only one frame. We will keep using a window of 5 frames in the other experiments, as it does not cause a sudden jump in performance like the 7-frame window does and is slightly more accurate than the 1-frame window due to the synchronization issues.

4.3

Histogram bin count

The effect of using a bin count of 8/16/32 bins per colour for histogram differencing was tested; the results can be seen in Figure 4.2.

(a) Histogram results with 8 bins, obtaining an average speed of 1208 seconds

(b) Histogram results with 16 bins, obtaining an average speed of 1292 seconds

(c) Histogram results with 32 bins, obtaining an average speed of 2132 seconds

Figure 4.2: Comparison between the different histogram bin counts affecting the performance of the histogram SBD method

The true positive detection using 32 bins outperforms the other two approaches by quite a lot. The speed difference is not negligible, however, but the overall performance is higher, and 32 bins should be used in the histogram approach if the SBD needs to be as accurate as possible. The graphs suggest that adding bins has a similar effect to making the threshold more lenient, while sacrificing speed. It might be possible to keep increasing the bin count while keeping the threshold lenient to acquire a high true positive detection rate.

Using 64 bins for every colour, however, was incredibly slow: testing one threshold took over 18500 seconds, which is significantly worse than all the other techniques. It also detected only around 30 percent of the true positives on the lowest threshold, hinting at decreased performance with additional bins.


4.4

Effects of re-sized resolution

After resizing the videos to 320x240, the video feeds were fed to the SBD algorithms and tested on their true positive percentage and frame reduction rate. The results are detailed in Figure 4.3 and Table 4.3.

[Figure 4.3: Comparison between the different SBD approaches, showing the true positives and total frame reduction each approach yields at 320x240 resolution. Panels: (a) Histogram, (b) MSE, (c) ECR, (d) Image hashing.]

Method          True positive %   Total frame %   Speed (sec)
Histogram       65                4               1065
MSE             76                2               1715
ECR             63                11              938
Image hashing   79                8               1578

Table 4.3: Comparison between SBD approaches using the highest true positive thresholds at 320x240 resolution

The results with the changed video resolution (Figure 4.3) reveal an interesting trend compared to Figure 4.1. The Histogram and ECR approaches are significantly less effective, scoring at least 10 percentage points lower than on the original resolution. The MSE results are slightly lower, but the speed gain is substantial, becoming comparable to the image hashing method. The image hashing method itself performs well, almost unchanged from the original video while also gaining speed from processing fewer pixels. This indicates that the MSE and image hashing methods adapt better to changing conditions, without requiring new threshold values.
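Why MSE transfers across resolutions can be seen from a small sketch: because the error is averaged over pixels, the same threshold remains meaningful after downsampling. The helper names below are ours; in practice openCV's resize would perform the downsampling.

```python
import numpy as np

def resize_nearest(frame, new_w, new_h):
    """Nearest-neighbour downsample, standing in for an openCV resize."""
    h, w = frame.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return frame[rows][:, cols]

def mse(frame_a, frame_b):
    """Mean squared error between two equal-sized frames; averaging
    over the pixel count keeps the value comparable after resizing."""
    a = frame_a.astype(np.float64)
    b = frame_b.astype(np.float64)
    return ((a - b) ** 2).mean()
```

For two frames that differ uniformly, the MSE is identical at full and reduced resolution, which matches the observation that MSE kept most of its accuracy at 320x240 without threshold changes.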


CHAPTER 5

Link to framework

The aim of the framework is to combine an interface, the frame reduction and video summarizing aspects of SBD, and the recognition of fraud instances via a CNN. By combining these elements, we aim to speed up the examination of digital exams for potential fraud by automatically processing videos and allowing examiners to quickly skim through the important scenes of the video. This thesis covered the SBD aspects, but we would also like to describe how the different parts of the framework connect and discuss how to evaluate the resulting product. For a visual reference of how the framework works, we again refer to Figure 1.1.

5.1 Receiving videos

In the framework, the videos are loaded via the interface and then sent to the SBD section by specifying the video file that needs to be processed. The video file is read with openCV and the SBD section is launched. This uses the image hashing technique (see Conclusions) to detect shot boundaries and sends the frame numbers back to the interface as an array. These frame numbers serve as a video summarization: they act as guideposts for browsing quickly through the video.
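A minimal sketch of this step: an average-hash fingerprint per frame and a bitwise comparison between consecutive frames. The use of average hashing specifically and the helper names are our illustration of the idea; in the framework the frames would come from openCV's video reader.

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Average hash: nearest-neighbour downsample to hash_size x
    hash_size, then threshold each pixel against the mean, yielding a
    64-bit boolean fingerprint for hash_size=8."""
    h, w = gray.shape
    rows = np.arange(hash_size) * h // hash_size
    cols = np.arange(hash_size) * w // hash_size
    small = gray[rows][:, cols].astype(np.float64)
    return small > small.mean()

def detect_boundaries(frames, threshold=10):
    """Return frame numbers whose hash differs from the previous
    frame's hash in more than `threshold` bits."""
    boundaries, prev = [], None
    for idx, frame in enumerate(frames):
        cur = average_hash(frame)
        if prev is not None and np.count_nonzero(cur != prev) > threshold:
            boundaries.append(idx)
        prev = cur
    return boundaries
```

Because the hash is computed on a fixed-size downsample, the comparison is largely independent of the input resolution, which is the flexibility argued for in the Conclusions.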

5.2 Transferring frames to the NN

The frame numbers are also forwarded to the Neural Network via the framework. The Neural Network then checks for the appearance of fraud based on the contents of the forwarded frames, and the fraud instances are marked. The processed data of the Neural Network is sent back to the framework, where the interface can flag frame numbers at which a fraud instance was encountered. The combination of SBD for summarization and the Neural Network for fraud detection makes navigating the digital exam seamless.
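The hand-off can be reduced to a few lines. The function below is a hypothetical glue sketch, where `classifier` stands in for the CNN; any callable returning True for fraudulent frames works.

```python
def flag_fraud(frame_numbers, frames, classifier):
    """Forward each shot-boundary frame to the fraud classifier and
    return the frame numbers the classifier flags as fraudulent."""
    return [n for n, frame in zip(frame_numbers, frames) if classifier(frame)]
```

The interface would then highlight the returned frame numbers alongside the full summarization list.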

5.3 Evaluation

After combining the individual parts of our framework, we need a method to evaluate its performance. This would be done by creating new videos in which volunteers adhere to the rules of a digital exam and cheat by using applications or webpages they are not allowed to use. We would then test whether our framework detects these frame instances and how quickly it finishes processing.

We would then compare these statistics on the same videos against currently existing tools. This means testing against a human proctor who reviews the recordings of the digital exams, and comparing both the detection of fraud instances and the speed at which the task is performed against our framework.


CHAPTER 6

Conclusions

In this thesis we have attempted to determine which SBD method is most suitable for screen-recordings of digital exams, by comparing the methods primarily on their accuracy (in terms of true positive percentage) and their frame reduction rate. We also considered the implications that each method's speed has on its practicality for SBD.

Disappointingly, there is no clear-cut answer to which SBD method is best. For speed, ECR is more appealing than the other methods; for frame reduction, MSE may be more suitable. Each method has pros and cons, so users should know their priorities before choosing one. For robustness, however, either MSE or image hashing is preferable, as they remain more consistent when the window size is lowered or the image is resized. For practicality, the choice is between histograms and image hashing: histograms are simple to implement and tune and are quite fast, while image hashing offers consistency. The method we propose for the framework is image hashing, as it works well in most situations and flexibility is important given that laptop resolutions in digital exams vary.

This decision allows the framework to detect new shots in the most consistent manner, as the image hashing technique performs average to high in all categories. The other techniques could be switched on when needed: ECR when speed is paramount, and MSE for a slightly more compact summarization of the digital exam.


CHAPTER 7

Future Work

With the advancing technologies in AI and Machine Learning, shot boundary detection with Neural Networks is an open area of experimentation. Convolutional Neural Networks are fast and very powerful, but require a vast number of training examples. Neural Networks for shot boundary detection have been mentioned [11], and pre-trained Image Classification networks have been re-used for SBD [20], but neither focuses on SBD with hard cuts in screen-recordings. More research into CNNs for SBD on screen-recordings in particular might improve both speed and accuracy.

There is also room for improvement in the methods discussed in this thesis. The histogram approach can be further optimized by finding the right bin sizes, and the image could be pre-processed with a Gaussian blur to possibly aid the Canny edge detection in the ECR approach. There are further possibilities to alter the image beforehand or to change parameters, as in the image hashing approach.

Another possibility is to experiment with combinations of the previously mentioned SBD techniques. The CNN usage mentioned in [15] also combined the CNN with a post-processing step that uses histogram differencing to eliminate false positives. A similar approach could combine the frame reduction of MSE with the robustness and accuracy of the image hashing method. A first step in testing how well these methods can aid each other is to measure the overlap and differences between the true positives detected by the different methods.
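That first step amounts to simple set arithmetic on each method's detected true positives. The sketch below uses our own naming and assumes the boundaries have already been matched to annotations.

```python
def boundary_overlap(tp_a, tp_b):
    """Split two methods' true positives into the boundaries both
    detected and the boundaries unique to each method."""
    a, b = set(tp_a), set(tp_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)
```

A large number of uniquely detected boundaries on both sides would suggest the methods are complementary and worth combining; near-total overlap would suggest combining them adds little.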

The test set should also be extended. A 6-hour sample offers around 300 thousand frames to work with, but only 1150 true positives to detect. To properly test the algorithms, the number of true positives needs to be enlarged. Such a test set could be generated automatically by creating video sequences from frames known to belong to different shots; however, this might be too artificial to mimic real-life digital exams. Annotating videos, and recruiting volunteers to take digital exams so that new videos can be recorded, might be needed.

Lastly, the possibilities of higher fraud prevention levels, preferably 1 or 2, should be considered. The approach mentioned in [1] is used to great effect with its own set-up but requires specialized hardware; a more intermediate solution using only a webcam might be preferable. Combining the detection of fraud instances on screen-recordings with detection of the surroundings via a webcam or a single camera can protect the integrity of digital exams to a reasonable level and should be explored further.
