
A joint visual-textual approach for predicting social popularity of Animated GIFs.

submitted in partial fulfillment for the degree of Master of Science

Anass Fellali
12310468

Master Information Studies (Data Science)
Faculty of Science, University of Amsterdam

2019-07-08

First Supervisor: Dr Pascal Mettes (UvA)
Second Supervisor: Dr Andrew Brown (UvA)


A joint visual-textual approach for predicting social popularity of Animated GIFs.

Anass Fellali

anass.fellali@gmail.com
University of Amsterdam
Science Park 904, 1098 XH Amsterdam, the Netherlands

ABSTRACT

Even though traces of GIFs can be found in the early days of the Internet, we, and especially younger generations, are still incorporating GIFs into our daily communication. GIFs can compress far more information than text or still images into something that is easily shareable. Given their popularity and the role they can play in expressing emotional responses, it is no surprise that many brands are learning to use GIFs in their online communications to stay culturally in touch with today's connected consumers. For that reason, being able to predict the social popularity of a GIF could find many real-life applications, such as helping companies optimize their brand storytelling, finding new ways of engaging with customers, or even understanding what role a brand plays in digital culture [2]. In this work, we contribute to the current research being conducted around Animated GIF analysis. One of the main challenges is to join two kinds of analysis (textual and visual) in a single architecture to predict GIFs' social popularity. Then, as we intend to rely on several datasets, one embedding the textual and visual data and the other containing the features around social popularity, the second big challenge is to transfer knowledge across these datasets to take advantage of multiple features. To do so, we rely on several publicly available datasets embedding GIFs, textual descriptions of the clips, and features such as the number of likes and reblogs that we use as proxies for social popularity.

KEYWORDS

video captioning, CNN, LSTM

1 INTRODUCTION

A GIF is a bitmap image format that is widespread on the Internet due to its wide compatibility and portability [6]. Nowadays, GIFs have become a key communication tool in contemporary digital societies. As explained by the authors in [8], on top of being a standard to encode and decode a string of 1s and 0s, the GIF, or Graphics Interchange Format, has a utility, an aesthetics and sometimes an evolving context. The GIF has no maximum resolution and can display up to 256 colors out of a palette of millions. All GIFs share common characteristics: a brief duration, a looping behaviour, the absence of sound, and emotional expressiveness. All of this brings particular challenges to their analysis [16]. The goal of this project is to design and implement a joint visual-textual analysis to predict the social popularity of Animated GIFs. This work is divided into two main parts. The first consists in designing and implementing a convolutional network as well as a long short-term memory recurrent network on the first dataset, in order to generate captions for the second dataset, where information on the social popularity of GIFs is available.

Figure 1: Joint CNN-LSTM Architecture. Figure 1 depicts the joint visual and textual architecture that we use to predict the social popularity of a GIF. One convolutional network is built to learn the visual features: after being pre-processed, the frames are passed through a pre-trained VGG16 network. Next to that, we learn the textual features with LSTMs, an improved version of recurrent neural networks. Both model outputs are then concatenated to retrieve information from both inputs and predict the right social popularity metric.

Describing the visual content embedded inside videos can find many real-life applications, such as increasing the accuracy of interactions between humans and robots. Even though this technology has already received a lot of attention in image captioning, developing it for videos could also be of great help for blind people. Once the yielded captions describe the new GIFs well enough, we move on to the second part, which proposes a method to predict the social popularity of GIFs represented by the number of likes and reblogs. To achieve that, we compare the effectiveness of three different models. The first attempts to predict the social popularity based on the visual aspect of the GIFs only. The second is based on the textual features


that have been generated in the first step. Finally, a joint visual-textual analysis is performed in order to predict the likes and reblogs in a way that should outperform the two previous methods.

2 RELATED WORK

Several studies have been conducted around the concept of GIF analysis. Below are short descriptions of, and references to, the state-of-the-art literature on the subject.

2.1 GIFs

To start with, the authors in [4] have analyzed the engagement of over 3.9 million GIFs by relying on the number of likes and reblogs and concluded that animated GIFs are significantly more engaging than other kinds of media. There also seems to be a strong link between high engagement and GIFs that contain faces, as well as those with high resolution and frame rate. In [10], the authors conducted a study to discover what makes a GIF interesting. They created and annotated a dataset with affective labels such as aesthetics, pleasantness and curiosity. Relying on this dataset, the authors designed and implemented a predictive model that can estimate GIF interestingness. On top of that, they also found that interestingness is linked to likability but not to social popularity. In [5], the authors aimed at predicting the emotions perceived by humans based on the GIF content. Their method used 3D convolutional neural networks to extract spatio-temporal features from GIFs instead of only the spatial information captured by simpler CNNs. They managed to outperform previous techniques in predicting crowd-sourced intensity scores of 17 different emotions. This paper may be useful for a complementary perceived-sentiment analysis. The authors in [11] introduce a method to automate the entirely manual process of GIF creation by leveraging user-generated GIF content. To do that, they developed a model that, given a video, generates a ranked list of its segments according to their suitability as a GIF. Their approach managed to capture patterns frequently present in popular GIFs, such as segments with people.

2.2 Image and video captioning

In [22], the authors have designed and implemented a neural network on images for sentiment analysis and a paragraph vector model to represent and analyze sentiments embedded in text. Results show that a joint visual-textual analysis can achieve better performance than textual and visual sentiment analysis algorithms alone. We will use the same kind of technique, but to predict the popularity of GIFs rather than sentiments embedded in images. In [17], the author contributes to the work on video captioning by developing robust captioning frameworks that introduce a temporal attention steering mechanism, whose goal is to guide attention using frame-level visual concepts based on the current video's properties rather than on training-data trends. They built fully end-to-end models that learn from video features to determine the temporal bounds of video clips and generate text descriptions of an entire video. The authors in [7] propose a method based on the encoder-decoder framework, named Reference-based Long Short Term Memory (R-LSTM), aiming to lead the model to generate a more descriptive sentence for a given image by introducing reference information. They achieve that by assigning different weights to the words according to the correlation between words and images during the training phase. Results show that, through the introduction of reference information, their model can learn the key information of images and generate more relevant words for them.

2.3 Video classification

In [12], the authors studied the performance of convolutional neural networks on large-scale video classification problems. One of their results shows that a single-frame model already displays very strong performance, which suggests that local motion information may not be critically important. The majority of studies on the social popularity of images [13] [15] [19] show that social features have the greatest predictive power, whereas visual content features are less powerful. However, visual content features are useful when no user metadata, such as tags, is present. In [9], the authors use visual sentiment features together with three novel context features to predict a popularity score for social images.

3 DESIGN AND ARCHITECTURE

In this section, we document the process of designing the architectures of the models used to generate captions and to learn the social popularity features from the GIFs and the related generated captions. There are mainly two ways to tackle video captioning. The first is a template-based language model, which requires special rules for language grammar to split the sentence into several parts, e.g. subject, verb and object. The second relies on sequence learning models such as RNNs and LSTMs that aim at directly learning an interpretable mapping between video content and captions [20]. This work focuses on the second method, as it seems to allow more flexibility and relies on deep neural architectures, which is also an adequate direction for this paper. In the social popularity part, three different classification methods are explored. The joint part relies mainly on the architecture designed for the video captioning problem.

3.1 Caption Generation

The first dataset [14] embeds one description for each GIF. After having put the links and related descriptions coming from a TSV file into a dataframe, some text pre-processing is required.

3.1.1 Text pre-processing

In order for our model to learn to generate text, we need to transfer it from a human language to a machine-readable format for further processing. To do that, we started with text normalization, a process that embeds multiple steps. The first is to convert all letters to lower case so that the model does not distinguish between, for example, "Machine" and "machine". Then, we removed all non-alphabetical characters, such as punctuation and numerical characters, from every description. Finally, we removed stop-words that do not add much value to a sentence, such as "a", "the" and "is". All of these steps allow us to simplify the task and reduce the vocabulary, represented by all unique words of the 100,000 descriptions, which is used for the long short-term memory network.
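As an illustration, a minimal normalization routine along these lines might look as follows; the helper name, the NLTK stop-word list and the example caption are assumptions for this sketch, not taken from the original code:

```python
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def normalize_caption(caption):
    """Lower-case, strip non-alphabetical characters and drop stop-words."""
    caption = caption.lower()
    # keep only alphabetical characters and spaces
    caption = re.sub(r'[^a-z ]', ' ', caption)
    tokens = [w for w in caption.split() if w not in STOPWORDS]
    return ' '.join(tokens)

print(normalize_caption('A man dunks a basketball, 2 times!'))
# -> 'man dunks basketball times'
```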


Table 1: Sequence generation fed to the model.

X1 (Frames)   X2 (Text Sequence)        y (Word)
Frame1        startseq                  otter
Frame1        startseq, otter           dunks
Frame1        startseq, otter, dunks    basketball

The objective is to have a vocabulary that is as representative of the GIF captions as possible while remaining as small as possible, so that we can implement a smaller model that trains faster. After this normalization, we still needed a more standardized way of representing the text in a format readable by a computer. For this, the model needs to know when a sentence starts and ends, which is why we added a STARTSEQ token at the beginning of every sequence and an ENDSEQ token at the end. This way, we can tell the model when to stop while generating new words. Once this was done, we could transform the sequences into vectors. We want the output of the model to be a probability distribution over each word in the vocabulary of the whole corpus. On top of that, the generated sequence should be as long as the longest GIF description. The captions first need to be integer encoded, where each word in the vocabulary is assigned a unique integer so that sequences of words can be replaced with sequences of integers. The integer sequences then need to be one-hot encoded to represent the probability distribution over the vocabulary for each word in the sequence. Moreover, as the network requires all output sequences to have the same length for training, we pad all encoded sequences to the length of the longest encoded sequence by appending 0 values after the words encoded as integers. Finally, we store these padded sequences into arrays, adding one word to the padded array at each iteration until the last word of the caption is read, as displayed in Table 1. This way, each frame is linked to a list of arrays embedding the representation of the captions.
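The encoding and padding steps described above could be sketched with the Keras text utilities as follows; the helper name and the toy caption are illustrative assumptions, and frame_features stands for whatever per-frame representation is paired with the caption:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# captions are assumed to already carry the startseq/endseq tokens
captions = ['startseq otter dunks basketball endseq']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1
max_length = max(len(c.split()) for c in captions)

def make_pairs(frame_features, caption):
    """Build (frame, partial sequence, next word) training triples as in Table 1."""
    X1, X2, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length, padding='post')[0]
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
        X1.append(frame_features)
        X2.append(in_seq)
        y.append(out_word)
    return np.array(X1), np.array(X2), np.array(y)
```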

3.1.2 GIFs pre-processing

After having pre-processed the captions, we also needed to perform similar steps for the animated GIFs. In fact, not all GIFs had the same number of frames, represented by the first dimension of the video, which made the length of the videos vary. On top of that, the GIFs did not all have the same width and height, which are the second and third dimensions. Finally, the GIFs were also represented by a fourth dimension holding the colours and the opacity, and we may not need all of these channels. That is why some normalization steps were also required for the convolutional neural network to process the GIFs in a standardized way. To do this, we first had to load all the GIFs into memory, which took quite a long time given the huge number of GIFs. After that, we sampled the distribution of the number of frames per GIF in order to find the optimal number of frames to load into arrays. Regarding the size, as we intend to pass the frames through a pre-defined feature extraction model, namely VGG16, a state-of-the-art deep image classification network trained on ImageNet that requires frames of size 224x224, we resized the arrays representing the GIFs to the required format. Finally, as there were four channels embedded in the arrays - three for the RGB colours and one alpha channel representing the opacity of the colours - we reduced this dimension to three channels by removing the opacity channel.
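A possible sketch of this pre-processing is shown below, assuming OpenCV's VideoCapture is used to decode the GIFs (its FFmpeg backend generally handles them) and that shorter GIFs are padded by repeating their last frame; the function name and the padding strategy are assumptions:

```python
import cv2
import numpy as np

def load_gif_frames(path, n_frames=5, size=(224, 224)):
    """Read a GIF, keep at most n_frames, resize to VGG16 input size, drop alpha."""
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frame = frame[:, :, :3]            # keep only the colour channels
        frame = cv2.resize(frame, size)    # VGG16 expects 224x224 inputs
        frames.append(frame)
    cap.release()
    # pad by repeating the last frame if the GIF is shorter than n_frames
    while frames and len(frames) < n_frames:
        frames.append(frames[-1])
    return np.array(frames)                # shape: (n_frames, 224, 224, 3)
```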

3.1.3 Time-Distributed CNN-LSTM Architecture

The main challenge of this research topic lies in the design of an architecture capable of reading the GIFs, learning visual features from each frame as well as textual features from the captions, word by word. For this reason, we decided to build two separate deep neural networks and to add the last layers of both models together to generate a sentence based on all the words in the vocabulary. Another option could have been to rely only on a recurrent neural network aimed at generating sequences, by modifying the architecture to obtain a model that encodes both image and textual features. However, it has been shown that this approach yields a very large neural network with many parameters to learn and only few performance advantages. Instead, it is suggested that both models be built separately and merged in a later phase. The decoding model merges the vectors from both input models using an addition operation. This is then fed to a Dense layer with 256 neurons and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.

As state-of-the-art methods already exist for image captioning and classification, it was unnecessary to develop a new convolutional network from scratch; this approach achieved poor results anyway, and using already pre-trained networks was a better solution to our problem. However, one issue arose when designing this network next to the LSTM model used to learn textual features: these pre-trained networks are designed for image classification, whereas videos also have a temporal dimension. In other words, we need to use 2D CNNs for the VGG16 layers to work, which is why using 3D CNNs was not an option here, even though it could have been one for networks built from scratch. Fortunately, Keras enables us to wrap layers inside TimeDistributed layers, which allows us to use pre-trained convolutional networks on videos as well. The only modification performed on the pre-trained VGG16 network was to remove its last layer, a dense layer representing all the possible classes the network was trained on. After having deleted this last layer, we added a global average pooling layer, which aims at preventing overfitting by reducing the total number of parameters to learn. Finally, the last layer is a dense layer with 256 units acting as the dimension of the output space that is used when we add the convolutional network to the long short-term memory network described next. To sum up, the video feature extractor model expects input video features to be a vector of shape (frames, 4096). These are processed by a Dense layer to produce a 256-element representation of the frame. We designed the same network in two different versions, one taking only one frame as input and the other taking 5 frames, in order to observe the differences in performance and accuracy of the results.
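One plausible reading of this set-up, sketched with the Keras functional API, is shown below. Truncating VGG16 at its fc2 layer yields the 4,096-d per-frame vectors mentioned above; averaging the 256-d frame representations into a single video vector is an extra assumption made here so that the later merge with the 256-d text vector has matching shapes:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import (Input, TimeDistributed, Dense,
                                     GlobalAveragePooling1D)
from tensorflow.keras.models import Model

N_FRAMES = 5  # the second variant of the network uses a single frame

# VGG16 with its 1,000-class output layer removed: the fc2 layer then yields a
# 4,096-d vector per frame, matching the (frames, 4096) shape mentioned above
base = VGG16(weights='imagenet', include_top=True)
frame_encoder = Model(base.input, base.layers[-2].output)
frame_encoder.trainable = False

video_input = Input(shape=(N_FRAMES, 224, 224, 3))
x = TimeDistributed(frame_encoder)(video_input)          # (N_FRAMES, 4096)
x = TimeDistributed(Dense(256, activation='relu'))(x)    # (N_FRAMES, 256)
video_vector = GlobalAveragePooling1D()(x)               # single 256-d video vector
visual_branch = Model(video_input, video_vector, name='visual_branch')
```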

Our model also includes another input, which is the caption of the video. As a reminder, these captions have been pre-processed to simplify the text as much as possible, and we added start and end tokens around each caption so that the model knows when to start and stop learning or predicting in further steps. Finally, all captions are padded to the same length, as the descriptions do not all have the same length. The first layer of the textual network receives these padded sequences, whose shape has two dimensions: the first is the number of examples, which depends on the number of videos and captions we want to process; the second is the length of the arrays embedding the padded sequences, which is equal to the maximum length over all captions. The output of this layer is then sent to an embedding layer. This embedding layer, which is required to be the first hidden layer of the network, is initialized with random weights. The goal is to learn an embedding in which words are represented by dense vectors, each vector being the projection of a word into a continuous vector space. The model learns the position of each word based on the corpus and the surrounding words [1]. We learn this embedding from scratch in order to use it to predict new captions. Finally, the output of the embedding layer is fed to a long short-term memory layer with 256 units. As we are dealing with captions, which are sequences of words just as videos are sequences of frames, we needed to tackle the issue of temporality as well. RNNs achieve this better than conventional feed-forward networks such as CNNs because they introduce hidden states between the cells that help the network remember past information. The presence of these feedback loops is the reason why these networks are called recurrent. LSTMs are variants of RNNs that tackle a drawback encountered when training RNNs, namely that the network tends to forget what it learned first because the same weights are used at each time step when the network makes a prediction and backpropagates. LSTMs introduce an architecture with gates that control the memorizing process, allowing the network to forget unnecessary information, to decide which new information should be updated or ignored, and to choose what is output as a prediction. Once training has been performed, we can generate new captions, word by word. We pre-process the new GIF the same way as in the training part. The prediction starts with the same start token used in the pre-processing part, which serves as the first input to all predictions. Then, we call the predict function of our model, feeding it the frames of the video as well as the start token; these serve to predict the next word of the sentence. In the next step, the model does the same, except that it takes the two words, startseq and the predicted word, as input to predict the third word. The process is repeated until the model reaches a break word such as endseq or until the maximum length has been reached [3].
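A compact sketch of the merge model and the word-by-word decoding loop, reusing visual_branch, tokenizer, vocab_size and max_length from the earlier sketches, could look as follows; the layer sizes follow the description above, while the dropout rate and optimizer settings are assumptions:

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout, add
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# textual branch: padded integer sequences of length max_length over vocab_size words
caption_input = Input(shape=(max_length,))
emb = Embedding(vocab_size, 256, mask_zero=True)(caption_input)
emb = Dropout(0.5)(emb)
text_features = LSTM(256)(emb)

# decoder: element-wise addition of the two 256-d vectors, then the softmax head
merged = add([visual_branch.output, text_features])
decoder = Dense(256, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(decoder)
caption_model = Model([visual_branch.input, caption_input], output)
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')

def generate_caption(video, tokenizer, max_length):
    """Greedy word-by-word decoding, stopping at endseq or max_length."""
    text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length, padding='post')
        probs = caption_model.predict([video[np.newaxis], seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == 'endseq':
            break
        text += ' ' + word
    return text.replace('startseq', '').strip()
```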

3.2 Social Popularity Prediction

Once we have managed to generate captions for the second dataset, which embeds the information on social popularity, we can dive into the design of the three models that will be compared in the results. Before diving into the architecture of the neural networks, the social popularity metrics need further investigation. The proxies used to represent social popularity are the number of likes and reblogs. These two metrics are continuous variables that range, for example, between 0 and 2014 likes. For the sake of simplifying the problem, we decided to focus only on the number of likes, and we regrouped them into 5 categories ranging from Not Popular to Highly Popular.
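For illustration, one way to derive five roughly balanced popularity classes from the raw like counts is a quantile-based binning such as the one below; the intermediate class names and the df['likes'] column are assumptions, as the thesis only specifies the two extreme labels:

```python
import pandas as pd

labels = ['Not Popular', 'Slightly Popular', 'Moderately Popular',
          'Popular', 'Highly Popular']

# rank first so ties (e.g. many GIFs with 0 likes) still split into five
# roughly equal groups, matching the balanced classes aimed for in Section 4.3
ranks = df['likes'].rank(method='first')
df['popularity_class'] = pd.qcut(ranks, q=5, labels=labels)
```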

3.2.1 Visual Learning Features

For the ConvNet part of the model, we use the same VGG16-style architecture as in the previous parts, without the LSTM network. Once we have one-hot encoded all the labels representing the class of each GIF, we can simply modify the last dense layer of the model to output a probability for the 5 possible classes. In this part, we decided to re-use the two models taking one or five frames as input, to observe whether the model can learn more patterns from multiple frames rather than just one.

3.2.2 Textual Learning Features

One of the supervised models that will be compared relates to text classification. We have a global corpus containing generated captions and their labels that is used to train a classifier. The intensity of caption pre-processing depends on the quality of the features: in case high-quality captions are generated, we may reduce the information to learn by applying the same techniques as in the previous section; however, in case the generated captions are not very explicit, we keep as much information as possible. On top of that, the generated captions are based on a vocabulary that has already been fitted to a simplified corpus of captions. Regarding feature engineering on the captions, we focus on different feature representations, such as the count vector and TF-IDF methods. The count vector method is a representation of the dataset in which every row represents a caption from the corpus and every column represents a term from the whole corpus; every cell then holds the frequency count of a particular term in a particular document. We also use the TF-IDF score, which depicts the relative importance of a term in the caption and in the entire corpus. The TF-IDF score is composed of two terms: the first computes the normalized term frequency, equal to the number of times a term appears in a caption divided by the total number of terms in that caption; the second is the inverse document frequency, computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears. These two representations are used in different machine learning algorithms such as SVMs, Naive Bayes classifiers and a boosting model, namely Extreme Gradient Boosting. Next to these machine learning algorithms, we also rely on a deep neural network in the form of the LSTM that we used in the captioning problem, with the same embedding, dense and dropout layers.
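A minimal scikit-learn sketch of this comparison, assuming captions and popularity_classes hold the generated captions and their class labels, could look as follows (an xgboost.XGBClassifier and the LSTM would slot into the same loop):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    captions, popularity_classes, test_size=0.2, random_state=42)

for feat_name, vectorizer in [('count vector', CountVectorizer()),
                              ('tf-idf', TfidfVectorizer())]:
    Xtr = vectorizer.fit_transform(X_train)
    Xte = vectorizer.transform(X_test)
    for clf_name, clf in [('SVM', LinearSVC()), ('Naive Bayes', MultinomialNB())]:
        clf.fit(Xtr, y_train)
        acc = accuracy_score(y_test, clf.predict(Xte))
        print(f'{feat_name} + {clf_name}: {acc:.3f}')
```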

3.2.3 Joint Visual-Textual Learning Features

This joint architecture aims at combining, or concatenating, the visual and textual features learned in the previous steps. The difference with the first model, where the goal was to predict the most probable next word of a sentence, is that here the objective is to classify the GIF into one of the classes designed to represent its social popularity, relying on both its visual and textual features. A useful property is that we can reuse the weights learned by the previous models instead of having to re-train the whole network. In terms of implementation, and thanks to the Keras functional API, we can simply change the last layer of the previous CNN-LSTM.


Figure 2: Joint Visual-Textual Architecture

In the previous model, we added two inputs of the same shape together to output a single tensor with the shape of the maximum caption length. In this model, we slightly modify that layer by concatenating the two layers instead. We finally use two dense layers that generate the outputs predicting the social popularity class of the GIF.
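Reusing the tensors from the captioning sketches, the modification described here might be expressed as below; the layer sizes mirror the earlier description, and reusing the trained weights is assumed to amount to rebuilding the model on the same, already-trained branches:

```python
from tensorflow.keras.layers import Concatenate, Dense
from tensorflow.keras.models import Model

# reuse the visual and textual branches (and their trained weights) from the
# captioning model, but replace the word-prediction head with a class head
joint = Concatenate()([visual_branch.output, text_features])
joint = Dense(256, activation='relu')(joint)
class_output = Dense(5, activation='softmax')(joint)    # five popularity classes

popularity_model = Model([visual_branch.input, caption_input], class_output)
popularity_model.compile(loss='categorical_crossentropy', optimizer='adam',
                         metrics=['categorical_accuracy'])
```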

4 SETUP

4.1 Implementation details

Dealing with video data requires huge resources to be highly accurate. Having two datasets of approximately 100,000 GIFs each forced us to rely on an external hard drive to deal with the memory issue. However, having to rely on such a tool rather than working fully in local memory made the training process even longer, as the data was not located near the local processors. That is also the reason why we limited the training data to 5 frames per GIF and relied only on the grayscale of the frames rather than using all three channels. Even with all these limitations, the training time of the caption generator model on half of the dataset was one hour per epoch. Most of the project was built in Python on a local Jupyter Notebook kernel. Relying on Kaggle kernels did decrease the training time, but we found ourselves limited by the memory available for uploading the GIFs. We relied on the Keras functional API, which allows us to easily build deep models that require multiple inputs and outputs. To deal with the GIF pre-processing, we relied mainly on the cv2 Python package, which allows us to read a GIF and manipulate its properties. When implementing the first caption generator network, we had started by running the frames through the full VGG model during the training phase of the network. However, this way of working was not very efficient and consumed a lot of memory. A faster and more efficient approach was to pre-compute the video features using the pre-trained VGG16 model and save these video representation arrays to a file. That way, we could load these features later and feed them into our model as the interpretation of a GIF. This method almost halved the training time of an epoch.
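The feature pre-computation step could be sketched as follows, reusing frame_encoder and load_gif_frames from the earlier sketches and assuming a gif_paths mapping from GIF id to local file path:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import preprocess_input

# frame_encoder: truncated VGG16 (4,096-d fc2 output) from the Section 3.1.3 sketch
# gif_paths: {gif_id: local file path} mapping, assumed to exist
features = {}
for gif_id, path in gif_paths.items():
    frames = load_gif_frames(path, n_frames=5)               # (5, 224, 224, 3)
    frames = preprocess_input(frames.astype('float32'))      # VGG16 input scaling
    features[gif_id] = frame_encoder.predict(frames, verbose=0)   # (5, 4096)

np.save('vgg16_gif_features.npy', features)
# later: features = np.load('vgg16_gif_features.npy', allow_pickle=True).item()
```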

Table 2: Basic statistics of the TGIF dataset.

Splits            Train       Validation   Test      Overall
#GIFs             80,000      10,708       11,360    102,068
#Frames           3,258,373   431,359      453,718   4,143,450
#Shots            204,553     26,559       27,933    259,338
Duration          81h         11h          12h       103h
#Sentences        80,000      10,708       34,101    125,782
#Tokens           911,593     19,831       34,101    1,418,555
#Unique Tokens    10,685      4,755        7,083     12,228
Avg. Term Freq.   85.32       25.72        54.31     116

4.2 Datasets

1. TGIF Dataset. This dataset was extracted from Tumblr from randomly selected posts published between May and June 2015. It contains 100,000 animated GIFs and 120,000 sentences describing the visual content of the GIFs. We intend to use this dataset to train our models for description techniques. The GIFs last 3.6 seconds on average, which gives us more than 100 hours of GIFs to analyze. In the sentences related to the GIFs, we can find more than 12,000 unique tokens with an average term frequency of 116 [14]. Table 2 lists basic statistics of the TGIF dataset.

2. Tumblr Animated GIF Dataset (CHI 2016). Another publicly available dataset gives us information about the social popularity of 99,844 GIFs. Next to the available GIFs, we can find the total numbers of likes and reblogs, which can be used as proxies for likability and social popularity. On top of that, metadata of the GIFs is also available, containing features like the number of pixels, the number of frames or the duration of the GIF in seconds [18].

4.3 Hyper-parameter tuning

When dealing with deep neural networks, a huge obstacle is the large number of parameters that need to be tested and for which there are no specific rules. As our work consisted in building four different networks, we needed to set priorities in our hyper-parameter tuning process. The two biggest parts of the work were the caption generator network and the joint visual-textual network for social predictions. As these two networks embed both videos and captions, they were the largest networks to train, with more than 130 million parameters to learn. As explained in the design section, we decided to rely mainly on the standard VGG16 architecture for the CNN part. This enabled us to focus on more sensitive parameters such as the batch size or the number of epochs, rather than building the model from scratch, which would have required intensive parameter tuning to find the right number of layers or neurons.

As the most important issue we encountered was related to hardware limitations, we quickly noticed that we would have to figure out the maximum number of videos, frames and corresponding captions that we could use for training. Bearing in mind that the more data there is, the better the model will learn and generalize, we still had to choose a reasonable training duration per epoch to be able to test different network possibilities and other hyper-parameters in due time.


To base our decision, we relied on the ETA (Estimated Time of Arrival) information that Keras displays when running a deep neural network, i.e. the expected duration of one round of training over all of our samples. We started by testing the CNN based on the VGG16 network using only one frame per video. In that case we could go up to 30,000 different videos using only the first frame of each video. Another option was to use fewer videos but to rely on data augmentation on the single frame by shifting, rotating and performing other modifications on the image. Then, we performed the same tests on our time-distributed convolutional network by transforming 5 frames into arrays. As expected, we could then process fewer samples to keep a reasonable ETA. For that reason, we trained the network on 12,000 videos with 5 frames each. Doubling the number of frames would certainly help the model capture more information on dynamic features, but it also increased the ETA to 4 hours and 30 minutes per epoch. Regarding the other hyper-parameters, we tested different batch sizes and different learning rates for our Adam optimizer. Based on our experiments, we ended up with a batch size of 120 samples and a learning rate of 0.005. The smaller the batch size was, the lower the loss started, but we had to increase the learning rate for the model to learn faster.

During the video classification experiments, the amount of data to process and the number of weights to learn were somewhat reduced, as there was no need for the captions. We started by focusing on the ideal number of classes to split our labels into. We tested splitting the social popularity proxy, represented by the number of likes a GIF had, into 3 or 5 classes ranging from not popular at all to very popular. As we were not using all the videos of our dataset, we balanced the classes as much as possible. The results showed a better accuracy with 3 classes and, surprisingly, the training duration was not very different, but 5 classes seemed more accurate to represent the data.

Next to that, we started digging into the hyper-parameter tuning of the text classification machine learning algorithms. As this was the least resource-intensive part, we could fine-tune a larger selection of hyper-parameters. We started by looking at the different kinds of classification algorithms we could use for this task. As the best results were obtained with the Support Vector Machine classifier, we focused on it. The hyper-parameters we could play with were whether or not to use the IDF part of the TF-IDF method, whether it was better to train the model on uni-grams, bi-grams or tri-grams, and which penalty parameter was best among 0.01, 0.001 and 0.0001. The hyper-parameters maximizing the accuracy were a penalty of 0.01, uni-gram training and the use of IDF.
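The thesis does not name the exact estimator, but the parameters listed (use of IDF, n-gram range, and a penalty of 0.01/0.001/0.0001) match the standard scikit-learn text pipeline with an SGD-trained linear SVM, where alpha plays the role of the penalty parameter; under that assumption, the grid search might look like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', random_state=42)),  # linear SVM
])

param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],   # uni-, bi-, tri-grams
    'tfidf__use_idf': [True, False],                 # with or without the IDF part
    'clf__alpha': [1e-2, 1e-3, 1e-4],                # the tested penalty values
}

search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_)   # best run: uni-grams, IDF on, penalty 0.01
```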

4.4 Evaluation Metrics

For the first part, the purpose is to measure the quality of the text generated by the neural network. However, judging whether a text describes video content well is quite subjective and not as easy as comparing numerical outputs, where measures such as precision and recall already give a good approximation of performance. That is why we decided to use the following two metrics:

Figure 3: Top 20 word frequencies. Highly frequent words such as man, and, woman heavily impacted the first models, as the generated captions were based only on these words in order to minimize the loss as much as possible, which yielded poor METEOR results.

• BLEU Score. The Bilingual Evaluation Understudy score compares a generated sentence to a reference sentence, yielding a score of 1.0 for a perfect match and 0.0 for a perfect mismatch. The metric works by counting matching n-grams between the candidate translation and the reference text.

• METEOR Score. METEOR computes unigram precision and recall, extending exact word matches to include similar words based on WordNet synonyms and stemmed tokens. While the BLEU metric is based on precision alone, the METEOR metric uses both precision and recall. It also focuses on achieving a better correlation with human judgment at the sentence or token level.
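Both metrics are available in NLTK; a small sketch on a toy reference/candidate pair is shown below (the METEOR scorer needs the WordNet corpus downloaded, and recent NLTK versions expect pre-tokenized inputs):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # requires nltk.download('wordnet')

reference = 'a man dunks a basketball'.split()
candidate = 'man dunks basketball'.split()

bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0),
                      smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference], candidate)
print(f'BLEU-1: {bleu1:.3f}  METEOR: {meteor:.3f}')
```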

5 RESULTS AND DISCUSSION

In this section, we detail the results of the experiments on both the caption generation and the social popularity prediction problems. The first results compare the generated captions with the ground truth using the METEOR metric. The second results compare the accuracy of three different models for predicting the social popularity proxies.


Figure 4: Examples of generated captions and their ground truths. As we can observe, the model catches some relevant words but cannot generate a description of actions or movements. On top of that, the model has difficulties building sentences that are grammatically correct and meaningful.

               Bleu-1   Meteor
WordNet        27.8%    9.6%
FrameNet       34.3%    14.1%
LSTM (S2VT)    51.1%    16.1%
Our CNN-LSTM   4.1%     2.1%

Table 3: Benchmark results of generated captions for three baseline methods [14], all outperforming our CNN-LSTM.

5.1 Caption Generation

To start with, one can see in Figure 3 that some words clearly appear more often than others. This bar chart was computed on cleaned captions where we kept as much relevant information as possible. This term frequency is highly important, as we have already seen in the text classification problem: as the words are not equally represented, it may lead to an overfitting problem. As we can observe in Table 3, the results from the CNN-LSTM built in this research are not very accurate compared to other methods, which means that the generated captions do not correspond well to the reference captions. The first generated results displayed the exact same sentence, consisting of a single word repeated for the whole sentence. The word "man" was used over and over, which is understandable as it was the most frequent word in the text. Later generations of the model, trained on more data, led to sentences that were grammatically correct but did not describe the GIF properly. On top of that, the sentences only slightly differed between GIFs, as only one or two words changed between them. Finally, one of the best performing models produced descriptions that caught objects present in the scene, such as a cat or a basketball. In fact, as the two models we designed used either one or five frames, the model was not catching the actions and movements happening in the GIF. However, different sentences were generated for different unseen GIFs. This may be because our last model used 5 frames with the same caption for all 5 frames, which should have helped the model move from a one-shot problem, with only one frame to learn from per GIF, to being able to learn from 5 frames for the same caption.

                  Regression                Classification
                  Precise   Margin Error    Categorical Accuracy
VGG16 - LSTM      -         -               59.6%
VGG16 - TFP       -         -               61.5%
1-Frame VGG16     4%        18.5%           23.6%
5-Frames VGG16    6.2%      20.6%           38.4%

Table 4: Accuracy results of our two different CNNs for video classification, and benchmark methods both using 10 frames [21]. Our model using 5 frames outperforms the one using 1 frame; however, the other models, which use more frames, have a better categorical accuracy.


5.2 Social Popularity Prediction

5.2.1 Visual Learning Features

Regarding the video classification results, we implemented the two different networks for two different problem formulations. We suspected a classification problem would take much less time and be more accurate than a regression one; however, to be sure, we tackled both and observed which one to focus on. For the regression problem, we measured the accuracy with two different methods. The precise method calculates the accuracy using the exact difference between the prediction and the ground truth. The margin error method computes the accuracy in the same way but allows the model an error of 80 likes. As we can observe in Table 4, for both models the accuracy was higher in the classification problem, and the whole training process took less time to reach a lower loss. One can observe that the accuracy is better for the classification neural network, with a huge difference compared to the precise regression results; however, training the neural networks predicting continuous variables did take more time. Finally, we can compare the results of the two neural networks on both problems. The classification model using 5 frames shows an accuracy of 38.4% and outperforms all our other models. The second best performing network is the single-frame classifier with 23.6% of correct predictions. The model taking 5 frames as input seems to capture more information by getting to look at several examples coming from the same video; however, it also multiplied the training time by more than 1.5. Finally, the regression networks behave differently from the classification ones, as the increase in accuracy when using more frames is not as large.

5.2.2 Textual Learning Features

This section describes the results observed for the social popularity classification of the generated captions. The results in Table 5 display a low prediction accuracy. This is due to the fact that the sentences generated for the new GIF dataset were not relevant enough for the model to find interesting patterns linked to the social popularity of the caption. The SVM classifier outperformed all the other models on the two different feature representations. The LSTM that we used in the captioning model was the second best model, and the weights trained for this network are the ones reused in the joint visual-textual architecture for social popularity prediction.


              Count Vector              TF-IDF
              Margin Error   Cat. Acc.  Margin Error   Cat. Acc.
SVM           3.7%           13%        3%             12.4%
LSTM          6.2%           10.2%      6.2%           10.2%
XGBoost       3%             9%         8%             7.8%
Naive Bayes   2.4%           8.1%       4%             6.8%

Table 5: Accuracy results of three different textual classification algorithms and an LSTM. The Support Vector Machine classifier using the count vector method displays the best performance. The regressors do not display a high accuracy even with a margin error of 80 likes.

                    Accuracy
                    Margin Error   Classification
1-Frame CNN-LSTM    4.1%           10.2%
5-Frames CNN-LSTM   9.2%           18.1%

Table 6: Accuracy results of the joint CNN-LSTM network for social popularity prediction.

5.2.3 Joint Visual-Textual Learning Features

This section describes the results of the concatenated visual and textual network architecture for predicting the social popularity of a given GIF. As we can observe in Table 6, neither the classification results nor the regression ones improved compared to the results in Table 4. This is probably because the model catches and learns wrong information from the generated captions, which were not relevant enough to describe the GIFs. However, as in the video classification problem, we do observe an increase in accuracy when the model gets the chance to learn from more frames. We observe this tendency in the regression problem as well, which was not the case in the video classification problem.

6 LIMITATIONS & FUTURE WORK

Given that our work involved learning video as well as text features for predicting continuous numerical values, the biggest limitation we encountered was having sufficient memory and time to process all the GIFs. As the available datasets only provided the web links redirecting to the GIFs, our memory was mainly solicited when fetching the GIFs from the web and loading them into memory. On top of that, not all frames were used, as we needed to normalize the length of the videos for the model to be able to process them. Moreover, even though we were using a pre-trained network, we had to pass all images through that network in order to get the representation vectors, which also took time on top of the training duration.

As this work was intended to answer some questions, it also raised other issues that may be worth investigating. As the results were biased by the low performance of the caption generator, we could envision a similar work where transfer learning is applied the other way around. In this work, we transferred knowledge from captions to another dataset embedding social popularity predictors. Another possibility would be to learn the same thing from text data, such as Twitter data embedding statistics like retweets and likes. We would first learn the visual features from one GIF dataset that has social popularity features, then learn the textual features from another dataset, such as tweets describing videos, and finally combine the two dense informative vectors in order to predict the social popularity of any other GIF. Another interesting direction would be to focus more on the classification issue. As we used the dataset only partially, we decided to create balanced classes to prevent our model from learning to predict only one over-represented class. However, it could be possible to rely on unsupervised learning techniques, such as clustering algorithms, to find other patterns in the distribution of the GIFs' social popularity proxies. In that way, the data may be even more representative of reality, and we would certainly have to find methods to prevent overfitting.

7 CONCLUSION

In conclusion, it has not been clearly shown that we can learn new information by concatenating the visual and textual learning features. In fact, the visual features used in the video classification network already displayed better results than the rest. As state-of-the-art image classification techniques already exist and we used a network pre-trained on objects, this result makes sense. Moreover, the results coming from the caption generator network were not reliable enough to be used to generate new captions for a whole new GIF dataset. As the model was barely catching words describing objects, and not the parts of the sentence linked to the action or the dynamics of the video, we cannot conclude that no additional information can be retrieved from the caption of a video. In fact, this work shows that the design and implementation are feasible, but that the models require more capacity to learn more information.

ACKNOWLEDGEMENTS

Foremost, I would like to express my sincere gratitude to my advisor Dr. Pascal Mettes for his continuous support of my research. His guidance helped me throughout the research and writing of this thesis.

REFERENCES

[1] How to Use Word Embedding Layers for Deep Learning with Keras. https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/. Accessed: 2019-05-18.

[2] The Surging Popularity of GIFs In Digital Culture. https://medium.com/ipg-media-lab/the-enduring-popularity-of-gifs-in-digital-culture-54763d7754aa. Accessed: 2019-06-05.

[3] Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/. Accessed: 2019-05-25.

[4] Saeideh Bakhshi, David Shamma, Lyndon Kennedy, Yale Song, Paloma de Juan, and Joseph 'Jofish' Kaye. 2016. Fast, Cheap, and Good: Why Animated GIFs Engage Us. 575–586. DOI:http://dx.doi.org/10.1145/2858036.2858532

[5] Weixuan Chen and Rosalind W. Picard. 2016. Predicting Perceived Emotions in Animated GIFs with 3D Convolutional Neural Networks. In Media Lab.

[6] W. Chen, O. O. Rudovic, and R. W. Picard. 2017. GIFGIF+: Collecting emotional animated GIFs with clustered multi-task learning. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). 510–517. DOI:http://dx.doi.org/10.1109/ACII.2017.8273647

[7] Guiguang Ding, Minghai Chen, Sicheng Zhao, Hui Chen, Jungong Han, and Qiang Liu. 2018. Neural Image Caption Generation with Weighted Training and Reference. Cognitive Computation (08 Aug 2018).

[8] Jason Eppink. 2014. A brief history of the GIF (so far). Journal of Visual Culture 13, 3 (2014), 298–306. DOI:http://dx.doi.org/10.1177/1470412914553365

[9] Francesco Gelli, Tiberio Uricchio, Marco Bertini, Alberto Del Bimbo, and Shih-Fu Chang. 2015. Image Popularity Prediction in Social Media Using Sentiment and Context Features. In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). ACM, New York, NY, USA, 907–910. DOI:http://dx.doi.org/10.1145/2733373.2806361

[10] Michael Gygli and Mohammad Soleymani. 2016. Analyzing and Predicting GIF Interestingness. In ACM Multimedia.

[11] Michael Gygli, Yale Song, and Liangliang Cao. 2016. Video2GIF: Automatic Generation of Animated GIFs from Video. CoRR abs/1605.04850 (2016). http://arxiv.org/abs/1605.04850

[12] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14). IEEE Computer Society, Washington, DC, USA, 1725–1732. DOI:http://dx.doi.org/10.1109/CVPR.2014.223

[13] Aditya Khosla, Atish Das Sarma, and Raffay Hamid. 2014. What Makes an Image Popular?. In Proceedings of the 23rd International Conference on World Wide Web (WWW '14). ACM, New York, NY, USA, 867–876. DOI:http://dx.doi.org/10.1145/2566486.2567996

[14] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. 2016. TGIF: A New Dataset and Benchmark on Animated GIF Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Philip J. McParlane, Yashar Moshfeghi, and Joemon M. Jose. 2014. "Nobody Comes Here Anymore, It's Too Crowded"; Predicting Image Popularity on Flickr. In Proceedings of International Conference on Multimedia Retrieval (ICMR '14). ACM, New York, NY, USA, Article 385, 7 pages. DOI:http://dx.doi.org/10.1145/2578726.2578776

[16] Kate M. Miltner and Tim Highfield. 2017. Never Gonna GIF You Up: Analyzing the Cultural Significance of the Animated GIF. Social Media + Society 3, 3 (2017), 2056305117725223. DOI:http://dx.doi.org/10.1177/2056305117725223

[17] Thang Huy Nguyen. 2017. Automatic Video Captioning using Deep Neural Network. Thesis.

[18] Saeideh Bakhshi, David Shamma, Lyndon Kennedy, Yale Song, Paloma de Juan, and Joseph 'Jofish' Kaye. 2016. Fast, Cheap, and Good: Why Animated GIFs Engage Us. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems.

[19] Luam Catao Totti, Felipe Almeida Costa, Sandra Avila, Eduardo Valle, Wagner Meira, Jr., and Virgilio Almeida. 2014. The Impact of Visual Attributes on Online Image Diffusion. In Proceedings of the 2014 ACM Conference on Web Science (WebSci '14). ACM, New York, NY, USA, 42–51. DOI:http://dx.doi.org/10.1145/2615569.2615700

[20] Zuxuan Wu, Ting Yao, Yanwei Fu, and Yu-Gang Jiang. 2016. Deep Learning for Video Classification and Captioning. CoRR abs/1609.06782 (2016). http://arxiv.org/abs/1609.06782

[21] Zuxuan Wu, Ting Yao, Yanwei Fu, and Yu-Gang Jiang. 2018. Frontiers of Multimedia Research. Association for Computing Machinery and Morgan & Claypool, New York, NY, USA, Chapter: Deep Learning for Video Classification and Captioning, 3–29. DOI:http://dx.doi.org/10.1145/3122865.3122867

[22] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2015. Joint Visual-Textual Sentiment Analysis with Deep Neural Networks. In ACM Multimedia.
