
MSc Artificial Intelligence

Track: Natural Language Processing

Master Thesis

Interactive Search and Exploration

in Social Multimedia Networks

by

Iva Gornishka

10415548

26th September, 2018

42 EC August 2017 - July 2018

Supervisor:

Dr Stevan Rudinac

Assessor:

Dr Efstratios Gavves

Abstract

In this thesis we present a novel interactive multimodal learning system which facilitates search and exploration in large networks of social multimedia users. The system allows the analyst to identify and select users of interest, and to find similar users in an interactive learning setting. Our approach is based on novel multimodal representations of users, words and concepts, which we simultaneously learn by deploying a general-purpose neural embedding model. We show these representations to be useful not only for categorizing users, but also for automatically generating user and community profiles. Inspired by traditional summarization approaches, we create the profiles by selecting diverse and representative content from all the different modalities (i.e. text, image and user modality). The usefulness of the system is evaluated using artificial actors, which simulate user behavior in a relevance feedback scenario. Multiple experiments were conducted in order to evaluate the quality of our multimodal representations, to compare different embedding strategies, and to determine the importance of different modalities. We demonstrate the capabilities of the proposed system on two different multimedia collections originating from the violent online extremism forum Stormfront and the microblogging platform Twitter, which are particularly interesting due to the high semantic level of the discussions they feature.


Acknowledgements

First of all, I would like to thank Dr. Rudinac for being the most patient and extremely optimistic supervisor. He not only did not abandon his ‘runaway student’, but even kept on pushing harder, in his own gentle way. His enthusiasm, constant motivation and warm support were invaluable throughout this journey.

I would also like to thank Dr. Gavves for agreeing to be a member of my defense committee.

A huge Thank you/Dank je wel/Obrigada to all the amazing Nmbrs people, especially to Luis, Marlieke and Thijmen, for giving me the chance to be part of the Nmbrs family and for enduring my part-time presence for exactly 2 years to date.

Next, I would like to thank Cassandra for taking to heart the self-volunteered duty of asking me for thesis updates every single Wednesday morning for weeks and months, and for being a constant reminder that I need to keep on delivering.

Most of all, I would not have been able to achieve this without the moral (and financial!) support of my family through my long student years! Благодаря ви for always helping me achieve my dreams, for believing in me at each and every step, and for always reminding me that I’m not alone even if I am 2000km away.

And finally, there are simply not enough words to express my eternal gratitude to Isaac for the endless brainstorming sessions, discussions and advice; for his constructive feedback and improvement ideas; for proof-reading this whole thesis a few times, and for acting not only as daily, but also as a nightly supervisor in the last few weeks; and last but not least, for keeping me sane and happy, and for constantly providing me with so-much-needed little distractions, laughter and warm food!

This work was partially funded by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 312827 (NoE VOX-Pol).


Contents

1 Introduction 1
    1.1 Research Questions . . . 2
2 Related Work 4
3 Approach Overview 6
4 Data Collection and Analysis 9
    4.1 Datasets . . . 9
    4.2 Content Analysis . . . 12
    4.3 Data Indexing . . . 14
    4.4 Community Detection . . . 14
5 User and Content Representation 16
    5.1 Unimodal user representations . . . 16
    5.2 Multimodal user and content representations . . . 17
6 User and sub-community profiles 20
7 Interactive learning 23
8 ISOLDE 24
9 Qualitative Analysis 28
    9.1 StarSpace Embeddings . . . 28
    9.2 User and community profiles . . . 32
10 Experimental Setup 34
    10.1 Evaluation Framework . . . 34
    10.2 Evaluation Metrics . . . 34
11 Evaluation: System Parameters 36
    11.1 Artificial Actors . . . 36
    11.2 Results . . . 37
12 Evaluation: User Representations 42
    12.1 Baseline Experiments . . . 42
    12.2 Additional Experiments: Specific Stormfront categories . . . 45
        12.2.1 National Chapters . . . 45
        12.2.2 Alternative Metrics and Modified Actors . . . 46
        12.2.3 Individual Categories . . . 52
    12.3 Additional Experiments: Twitter Communities . . . 56
    12.4 Conclusion . . . 57
13 Conclusion and Discussion 58
    13.1 Future Work . . . 58


List of Figures

1 ISOLDE: High-level system overview . . . . 6

2 ISOLDE: Initial View . . . . 8

3 ISOLDE: Interactive Session View . . . . 8

4 Example of visual concepts extracted from an image . . . 13

5 ISOLDE: Search and Filtering Functionality . . . 14

6 General schemes of early and late fusion . . . 16

7 Setups for creating StarSpace embeddings . . . 18

8 ISOLDE: Profile View . . . 22

9 Comparison of visualization approaches . . . 25

10 ISOLDE: Profile Icons per Item Type . . . 26

11 Embeddings Visualization: Different Modalities . . . 28

12 Embeddings Visualization: Most Common Concepts . . . 30

13 Example Profiles (Representative user content) . . . 33

14 Evaluation Framework . . . 34

15 Evaluation: System Parameters (Classifier) . . . 37

16 Evaluation: System Parameters (Negative Sampling Threshold) . . . 38

17 Evaluation: System Parameters (Top N highlighted results) . . . 39

18 Evaluation: System Parameters (Number of sessions) . . . 39

19 Evaluation: System Parameters (Dimensionality of User Representations) . . . . 40

20 Performance: Baseline Experiments (Stormfront and Twitter) . . . 43

21 Performance: Stormfront National Chapters . . . 46

22 Performance: Stormfront All categories vs National Chapters (Distribution) . . 48

23 Performance: Stormfront National Chapters (Modified Actors, Ranking) . . . . 50

24 Performance: Stormfront National Chapters (Modified Actors, Distribution) . . 51

25 Performance: Stormfront National Chapters (Individual results) . . . 52

26 Performance: For Stormfront Ladies Only . . . 53

27 Performance: Stormfront en Français . . . 55

28 Performance: Twitter Communities . . . 56

List of Tables

1 Existing Visualization Tools and Analytics Systems . . . 5

2 Stormfront Dataset (Number of posts per category) . . . 10

3 Twitter Trolls (Seed hashtags) . . . 11

4 Twitter Dataset (Number of posts per subcollection) . . . 11

5 Datasets (Summary) . . . 12

6 Examples of entities extracted from sentences . . . 12

7 Most common extracted concepts . . . 14

8 Examples of Nearest StarSpace Neighbors (Training Setup Comparison) . . . 29

9 Examples of Nearest StarSpace Neighbors . . . 31

10 Example Profiles . . . 32

11 System Parameters . . . 36

12 Results: System Parameters . . . 41

13 Final Performance: Baseline Experiments (Stormfront and Twitter) . . . 42

14 Final Performance: Stormfront All categories vs. National Chapters . . . 45


1 Introduction

In recent years much of our communication has moved to social multimedia platforms, which allow users to connect with others and to exchange ideas and content about topics of their interest. Such fora commonly host lengthy discussions and fierce debates about social issues, and they have become a widely adopted place for cooperation, activism, promotion of different ideologies, and even organization of offline events and activities all over the world. Furthermore, they have become invaluable for members of various social movements and groups of like-minded people, since they not only bring people with shared views and interests together, but also create a feeling of belonging to a community.

The public nature of many social media platforms also provides social scientists with unprecedented possibilities to study various social phenomena, such as the emergence of movements, the adoption and evolution of ideologies, community development - processes which no longer require physical interaction, and predominantly happen in virtual environments. Furthermore, when it comes to ideologies and processes with a potentially negative societal impact, such as extremism and radicalization, this research is not only interesting for social scientists, but also becomes crucial for aiding governments and policymakers in creating counter strategies. As pointed out by Conway [13], even basic descriptive or explanatory research is missing when it comes to determining the role of the Internet in violent extremism. Further emphasized is the lack of analysis of the individual user’s activity and the structures in which they operate. In fact, the individual users are naturally one of the most common subjects of interest in any social media research - the people themselves, their behavior and the communities they form; what is it that they like or believe in, and why; what content or ideas do they share, and how does it influence others; how do they relate to one another and how do they interact with each other; what is their role as members of a certain community, and how did they become part of this community.

While there has been an increasing amount of empirical research in the area, most of it stems from the field of psychology and social science, and much of it is still done manually, which puts a practical limit on the amount of content that can be analyzed. Klausen [23], for example, used their own Twitter accounts from January through March 2014 to identify and collect information about 59 Western fighters and their followers, which were used to initialize a snowball crawler. Furthermore, it is explicitly mentioned that the capacity of the researchers’ analytical platforms was exceeded by the final dataset of 29k accounts, making the desired, more sophisticated analytical description of the social networks impossible. O’Callaghan et al. [32] made use of Twitter’s API in order to collect an initial set of 911 unique accounts associated with the Syria conflict, after which they manually analyzed all profiles in order to create a final dataset of 652 accounts directly involved in events in Syria. Their further use of YouTube’s API for collecting information about channels mentioned in the tweets and annotations for the shared videos suggests that the researchers would have made use of tools for automatically identifying relevant accounts if such were available. Finally, Berger and Morgan [5] manually reviewed the profile pictures of ISIS supporters within their analyzed set of 49k Twitter accounts. They also hand-coded a set of 6k Twitter accounts as ISIS supporters or non-supporters in order to make use of a machine learning approach taking into account the users’ profile descriptions.

As recently pointed out by Fernandez et al. [18], more work is needed to bridge the gap between advances in social science and computational approaches, and the same conclusion was drawn based on our own survey of existing approaches to analysis, detection and prediction of phenomena such as extremism and radicalization. Furthermore, we argue that one of the most urgent issues that need to be addressed is the automatic identification of relevant content, and more specifically of users of interest.


Currently existing approaches to automatic collection of social multimedia data commonly rely on predefined lexicons and are platform and domain specific. Standard supervised machine learning approaches are effective at detecting users of interest, but require large annotated datasets. Ferrara et al. [19], for example, made use of 26k accounts which were manually verified as ISIS supporters by Twitter’s team. Unfortunately, such datasets are rarely available, and the existing ones are usually too general. Social scientists are commonly interested in specific movements, and new groups of interest emerge daily, which requires continuously annotating new datasets. Semi-supervised approaches, such as the one proposed by Agarwal and Sureka [2], aim at decreasing the amount of data to be hand-coded (e.g. by pre-filtering the dataset using specific words), but are by and large still done manually. Fully unsupervised approaches require even more domain knowledge or are too general when fully automated. Scrivens et al. [42], for example, heuristically combined sentiment analysis with scores reflecting the amount, severity and duration of negative posts in order to identify radicalized users on the Dark Web forums Islamic Awakening, Islamic Network and Turn to Islam. They suggest that domain experts should manually replace their automatically created list of keywords used for sentiment analysis in order to adapt the algorithm to more specific domains and tasks.

However, once a suitable dataset is obtained, social scientists commonly desire to see an overview of the collection and more detailed information about its content in order to better understand the underlying structures and dynamics. While manually inspecting the individual items might be appropriate for small collections, analyzing large-scale datasets requires advanced tools and techniques. Furthermore, this analysis needs to account for the heterogeneous nature of social media platforms, where ideas and beliefs can be captured in text, visual content or even interactions with other users.

Over the years, analytics systems have been shown to facilitate analytical reasoning and to increase the human capacity to perceive, understand, and reason about complex data [14]. Unfortunately, to the best of our knowledge, no analytics system has been developed yet which facilitates gaining deeper insight into large collections of social multimedia users and their role as part of a community.

1.1 Research Questions

The goal of this thesis is to develop a novel analytics system which aids domain experts in analyzing large collections of social multimedia users. Its main purpose is to enable easy, on the fly categorization of the users and the discovery of new users of interest, in order to assist annotating large datasets for frequently changing communities. Furthermore, the proposed system needs to facilitate search and exploration in large heterogeneous collections. Lastly, the system should fulfill basic requirements for analytics systems — as outlined by Zahálka et al. [62], some of the most important ones are interactivity, scalability, relevance and comprehensibility.

Thus, we identify a few components which are crucial for this system. First, compact but meaningful multimodal content representations are needed to ensure the interactivity of the system and the relevance of the produced results. These representations need to be suited for diverse and heterogeneous collections, incorporating available information from the text, visual content and user interactions. Second, multimodal user and community summaries (profiles) would allow for easy navigation through the space and support insightful analysis of the different social networks and their members. Finally, the core component of the system will be an interactive learning framework, which will allow the users to tailor categorization to their needs.


In the scope of this thesis, we develop a prototype of such a system, aiming to answer the following question:

How to facilitate gaining deeper insight into the role of a user or a group in an online community?

And more specifically,

How to create compact multimodal user and content representations, which can serve as the basis for visualization, categorization and profiling?

The following section discusses existing visualization tools and analytics systems, as well as their pros and cons. Section 3 gives a high-level overview of the proposed system and its intended use. Sections 4, 5, 6 and 7 describe its individual components in more detail, and in Section 8 we explain how they are all combined into a single interactive interface. In Section 9 we present a short qualitative analysis of some of ISOLDE’s individual components before we proceed to formally evaluating the overall performance of the system. Section 10 presents our general evaluation framework, and Sections 11 and 12 contain an in-depth analysis of the system and the proposed user representations. Finally, Section 13 summarizes the findings of this research and discusses future work ideas.


2 Related Work

Over the last few years, analyzing multimodal data and its underlying structure has become a subject of many studies, and a great number of excellent visualization tools and analytics systems have been proposed to aid domain experts in their research. While some of them were specifically developed to support network analysis tasks, others, although not directly applicable, are characterized by plenty of useful features and could serve as a good starting point for developing a novel analytics system for search and exploration in large collections of social multimedia users.

Through the years many researchers have addressed the problem of bringing graph visualization to non-technical users with simple tools such as NodeXL [45]. More elaborate systems such as GraphViz [17] make it possible to visualize large graphs with multiple different layouts, and even provide simple methods for reducing the visual clutter caused by the huge amount of vertices and edges. The main disadvantage of such drawing packages is that they are usually static and do not provide any analysis of the collection. More advanced visualization tools make use of various algorithms in order to improve the readability of the graphs, to provide better overview and to enable insight gain. PIWI [59] integrates multiple community detection algorithms used to present the user with tag words describing each community. Newdle [58] applies graph clustering with a similar purpose, but it also allows the user to adjust the algorithm parameters, and provides search functionality based on the cluster tags, as well as simple tag analysis.

One of the most commonly used visualization tools to date is Gephi - an open source software package for graph and network analysis on complex datasets [4]. Gephi does not only allow for interactive exploration of the collection, but also provides filtering, manipulation and clustering functionality, which makes it a great tool for dynamic network visualization.

While all of the aforementioned tools are great for discovering the overall structure of the collection, they only provide very basic analysis of the content and the individual items, which is mainly done in a preprocessing step. Furthermore, search capabilities are limited and the systems’ lack of flexibility does not easily enable more complicated analytic tasks. Last but not least, none of these tools have been developed with the problems of the highly heterogeneous multimedia collections in mind.

Aggregated search engines such as CoMeRDA [9] provide excellent search and filtering functionality in large multimedia databases, but fail to give an overview of the whole collection and the underlying relationships between items. Furthermore, limiting the user to a set of filters and the need to specify a precise search query make identifying relevant content and serendipitous discovery particularly difficult when the user does not have a well-defined information need.

Multimedia analytics systems, such as Multimedia Pivot Tables [56], ICLIC [51], City Melange [61] and Blackthorn [62], facilitate search and exploration in large collections of multimedia data as well as interactive multimodal learning. Zahálka et al. [61], for example, combine semantic information from the visual domain with latent topics extracted from the text domain in order to provide content-based venue recommendations. They learn the implicitly provided user preferences on the fly from the user’s interactions with the system in a relevance feedback framework. Yet, none of these systems were designed with the analysis of social graphs in mind. Considering the strong and weak points of existing systems, we identify the following list of requirements for an analytics system allowing analysis of large social multimedia networks:

Overview – allow user to get a grip of the whole collection

Network Analysis – facilitate finding groups and communities of interest

Content Analysis – provide deeper analysis of the content of the items in the collection

Online Learning – allow on the fly categorization of items in the collection


As can be seen in Table 1, none of the existing systems fulfill all four requirements; thus, facilitating multimedia analytics on social graphs requires a novel solution.

System/Tool          Overview    Network Analysis    Content Analysis    Online Learning
NodeXL [45]
GraphViz [17]
PIWI [59]
Newdle [58]
Gephi [4]
CoMeRDA [9]
City Melange [61]
Blackthorn [62]
ISOLDE

Table 1: Existing Visualization Tools and Analytics Systems


3 Approach Overview

In this work, we propose ISOLDE – a novel multimodal analytics system which allows social scientists to explore large collections of social multimedia users, to identify users of interest and find more similar ones in an interactive framework. In this section we provide a high-level overview of ISOLDE, its main components and their importance for the user’s interactive sessions.

ISOLDE’s full pipeline is conceptually depicted in Figure 1 and consists of three main phases: data preprocessing, preparation of the individual system components and finally – the main purpose of the system – the interactive user sessions.

Figure 1: High-level overview of ISOLDE’s full preprocessing pipeline, system components and interactive user sessions

Data preprocessing

The first step is collecting social multimedia data and analyzing the content in order to provide additional context from the text, visual, or any other available modality. In this work we use two different datasets, originating from the Internet forum Stormfront [6] and the microblogging platform Twitter1, and we extract visual concepts from all images, as well as entities from the text. Afterwards, the annotated data is indexed in order to provide search and filtering functionality, using Elasticsearch in the back-end [1]. Section 4 discusses our approach to data collection and analysis in detail; however, ISOLDE’s framework is general and can easily be adapted to any social multimedia platform or content analysis approach. In Section 4, we also present our approach for detecting communities within the two collections. Although social network analysis is not included in ISOLDE’s current standard pipeline, since we retain this information for evaluation purposes only, one could later choose to integrate it as well.


System Components

The following phase in ISOLDE’s pipeline is setting up all individual components of the system. First, we need user representations which will be later used to visualize the users and to classify them. While any standard technique can be used to represent the users, our proposed approach is based on a novel general-purpose neural embedding model which allows us to learn representations not only for users, but also for the multimodal content. Section 5 discusses this approach and its advantages in more detail.

Next, user and community summaries (profiles) are required to reduce the amount of content that needs to be processed by the interacting user and to support insightful analysis of the different social networks and their members. Our approach to creating such summaries is inspired by standard summarization techniques and relies on identifying diverse and representative items from all modalities. This process is enabled by our multimodal content representations and more details about it can be found in Section 6.

Lastly, an interactive learning component enables the discovery of new users of interest based on previously provided examples. Section 7 describes our relevance feedback framework and provides more information about the used model.

Interactive user sessions

Finally, all components of the system are combined into a single interactive user interface. While Section 8 describes this process and looks at the user sessions from the system perspective, here we summarize how the sessions look from the point of view of the interacting user (commonly called an actor from now on in order to distinguish from social multimedia users).

Step 1: Collection overview, search and exploration. The actor is initially presented with an aggregated overview of all users in the collection, grouped by their topical similarities (Figure 2). They can start interacting with the collection, easily navigating by zooming in and out, filtering the items and searching by text queries. Automatically created sub-community profiles enable quick understanding of the space and the overall structure of the collection.

Step 2: User interactions and profiling. Once potential users of interest are identified, the actor can inspect their automatically generated multimodal user profiles, and mark them as relevant or irrelevant (Figure 3).

Step 3: Identifying similar users. When the actor has marked a number of users as relevant, they can look for more similar users. At this moment, an interactive classifier is trained in real time and used to score each item (i.e. user) in the collection. All user nodes are colored in order to reflect the scores, and the top-N users considered most relevant by the system are highlighted in the interface. Although it is preferable that the actor explicitly indicates the relevance of the top-N highlighted results, they are not obliged to do so and can simply continue with another action.

Throughout the whole session, the interacting user can seamlessly switch between the aforementioned actions and continue iteratively refining their preferences until satisfied with the outcome.


Figure 2: A screenshot of ISOLDE’s initial view

Figure 3: A screenshot of ISOLDE’s interface with the detected topical communities in the background and automatically generated user profile in the foreground.


4 Data Collection and Analysis

4.1 Datasets

We demonstrate the potential of the proposed system using two different datasets, related to the violent online extremism domain and originally collected for the VOX-Pol Project2.

Stormfront

The first collection consists of around 2 million posts from the white nationalist, white supremacist, antisemitic neo-Nazi Internet forum Stormfront [55]. After disregarding all posts of suspended users and users with less than 3 posts, the final dataset contains posts from 40 high-level categories, generated by 29,279 users in the period between 1st September, 2001 and 1st February, 2015. Table 2 contains detailed information about the categories and the corresponding number of posts in each one of them. For the purpose of this research, we distinguish between two different subsets of categories – the General Chapters (such as Politics & Continuing Crises and Dating Advice), containing discussions on various topics, and the National Chapters (such as Stormfront Italia and Stormfront Britain), which are commonly frequented by users interested in a specific geographic area and topics related to it.

Twitter

The second collection consists of almost 2.7 million messages from the microblogging platform Twitter3. The tweets are in 54 languages and were shared by 9,501 users involved in discussions about different extremist ideologies. The collection was crawled in 4 separate cycles using Twitter’s Search and Streaming APIs4, and contains the following subsets of tweets:

• XFR – the timelines of 174 users confirmed by social scientists to be supporters of far right ideologies. The timelines were collected on 28th September, 2016 and are restricted to the official limit of 3200 most recent tweets provided through the Twitter Search API.

• Jihadi – the content generated by 86 users confirmed by social scientists to be jihadis (ISIS supporters in particular). Due to Twitter’s strict policies of account suspension, the data was only collected in the period between 26th September, 2016 and 30th September, 2016.

• TwitterKurds – tweets collected by using the seed hashtag #TwitterKurds, as well as variations of the hashtag. The crawl contains only tweets mentioning the specified hashtags; no additional content or other tweets by the encountered users have been collected.

• TwitterTrolls – tweets collected by using a number of seed hashtags related to trolling behavior5 provided by social scientists (the full list is presented in Table 3). The crawling was done in multiple cycles between 17th May, 2016 and 16th November, 2016 and involved iteratively analyzing commonly co-occurring hashtags, as well as the most active users, in order to expand the subset of query terms at every iteration.

2 http://www.voxpol.eu/
3 http://www.twitter.com/
4 https://developer.twitter.com/en/docs.html

5 According to Al-Rawi [3], although there is no agreed upon definition of the term trolling, it is meant as a distraction from the main online discussion in a forum or platform by diverting attention to another issue which is mostly irrelevant. Reuter et al. [36] analyze the use of similar hashtags in the context of ‘the fight against terrorism in social media’.


Category                                          #Posts    #Users

General Chapters
Opposing Views Forum                              711662    15849
Politics & Continuing Crises                      163837    7224
Lounge                                            126835    7086
Talk                                              77979     8815
Questions about this Board                        52563     7996
For Stormfront Ladies Only                        47165     2711
Events                                            44220     4670
Strategy and Tactics                              42454     5797
Local and Regional                                42218     6762
Ideology and Philosophy                           36327     3503
Suggestions for this Board                        27083     4683
Newslinks & Articles                              18483     1886
Classified Ads                                    18309     2996
Multimedia                                        14080     3238
eActivism and Stormfront Webmasters               13820     3952
Legal Issues                                      12378     2951
New Members Introduce Yourselves                  12345     3451
General Questions and Comments                    3379      712
The Truth About Martin Luther King                3222      950
The Eternal Flame                                 2566      885
Dating Advice                                     1076      384
Announcements                                     236       58
Introduction and FAQ                              158       70
Fourth Annual Stormfront Smoky Mountain Summit    63        32
Stormfront Downunder                              42        32
Guidelines for Posting                            1         1

National Chapters
Stormfront en Français                            114219    1875
Stormfront en Español y Portugués                 68118     1426
Stormfront Russia                                 58518     3617
Stormfront Ireland                                49448     3037
Stormfront Baltic / Scandinavia                   38167     3310
Stormfront Europe                                 27647     1418
Stormfront Britain                                20719     2038
Stormfront Italia                                 11596     410
Stormfront Srbija                                 7638      424
Stormfront Canada                                 4420      704
Stormfront Croatia                                3416      196
Stormfront Hungary                                3360      455
Stormfront South Africa                           3016      457
Stormfront Nederland & Vlaanderen                 89        48

Total (unique):                                   1882872   29279

Table 2: Stormfront Dataset (Number of posts per category)


#isischan, #isis_chan, #isis, #isisちゃん, #knivesareforcuttingmelon
#daeshbags
#opisis, #opisil, #opdaesh, #opjihadi, #opiceisis, #iceisis
#BinarySec, #ctrlsec

Table 3: Twitter Trolls (Seed hashtags)

Subcollection      #Posts     #Users

TwitterTrolls      2049133    874
XFR                409155     174
TwitterKurds       179878     8512
Jihadi             4488       74
Total (unique):    2642654    9501

Table 4: Twitter Dataset (Number of posts per subcollection)

Here, we also omit all tweets of people who had less than 3 posts in the original dataset. Table 4 contains more information on the number of posts and users in each of the subcollections in the final dataset.

Dataset comparison

Our datasets were chosen to represent two different types of social multimedia platforms, with the following main differences between them:

Structure: Stormfront is a typical forum, where posts are structured in categories, each one of them divided into fine-grained threads. Users can reply to each other’s posts, typically by fully or partially quoting the original post, rather than creating separate comment threads. Twitter, on the other hand, does not have a well-defined structure; the only means of message organization are the manually added hashtags, but users are able to comment on each other’s messages, creating full discussions around certain content.

Social networks: Unlike Twitter users, Stormfront users are not able to explicitly follow each other or like one another’s content, hence explicit social network information is not available in the forum. Furthermore, Twitter allows users to tag (mention) each other, providing additional means of user interaction and useful information for analyzing social networks and influential users.

Content: Stormfront posts are usually much longer than tweets - the average Stormfront post consists of 101 tokens in contrast to the average 11 tokens in a tweet. Due to Twitter’s hard limit on the number of characters that can be used in a tweet6, users commonly express their opinions and ideas using visual content or by sharing external links - our Twitter collection contains a few times more images.

Language: One of the most challenging differences between Stormfront and Twitter is the language used by contributors. Stormfront is a smaller community, where discussions predominantly happen in English, with the only exceptions being the few national chapters (e.g. Stormfront en Français, Stormfront Srbija, etc.). Twitter, on the other hand, has become one of the most popular microblogging platforms in the world, naturally attracting native speakers from various countries. In addition, due to Twitter’s character limit, users commonly use slang and abbreviations, omit a lot of context, and even disregard grammatical and syntactic rules in order to convey a message in as short a text as possible – phenomena only rarely appearing on Stormfront.

Table 5 contains an overview and summary statistics for the two datasets.

                         Stormfront     Twitter
#Users                   29279          9501
#Posts                   1.88M          2.64M
Average #tokens/post7    101            11
Images (Unique)8         120K (86K)     510K (330K)
Average #images/user     ≈11            ≈55

Table 5: Datasets (Summary)

4.2 Content Analysis

When analyzing social multimedia data, scientists usually start by annotating the content – they commonly label the individual posts and images with concepts that describe them. This process is often done manually, which makes it labor intensive and time consuming. In order to mimic this process, we automatically extract concepts from the textual and visual domain as follows.

Entity linking

In the preprocessing step, we extract entities (i.e. topics, people, organizations and locations) mentioned in the Stormfront posts using the Semanticizer [21, 33], which links text to English Wikipedia articles. The process resulted in 65,240 unique entities, which are usually much easier to interpret than alternatives such as latent topics. Examples of sentences from the dataset together with the extracted entities can be seen in Table 6.

Although the Semanticizer is based on an approach for adding semantics to microblog posts, and tweets in particular [31], we argue that semantic linking might result in sparse and noisy annotations. As can be seen in Table 6, Stormfront’s posts are predominantly well-formed and grammatically correct, as opposed to Twitter’s content, which contains relatively short messages, use of slang and non-standard abbreviations. Yet, the Semanticizer already fails to detect simple entities in this clean text, such as Montenegro, US (troops) and Anti-Islamization. Furthermore, our Twitter dataset contains around 100K posts which consist only of an image, url, hashtags, tags of other people or a combination of those, hence no other text at all is available for analysis. Thus, for the Twitter collection the choice was made to use hashtags in order to provide additional context for the text modality - hashtags, just like entities, commonly refer to named entities, and are commonly used by users to self-label their content, as noted by Teevan et al. [49].

Sentence                                                                                              Extracted entities
This map still shows Serbia & Montenegro as one country and as hosting US troops.                    Serbia
Anti-Islamization leader steps down amid uproar over Hitler selfie                                    Adolf Hitler
It would be nice to see a ”webinar” or web seminar done maybe for those that can’t attend in person. Web conferencing
Notice the antifa trash are all covered up in shame.                                                  Anti-fascism
World oil prices have risen to a new record above US$96 a barrel                                      United States dollar

Table 6: Examples of sentences and entities extracted from them

7 including hashtags and user mentions in tweets
8 Number of urls which could be accessed and whose downloaded content was an image from which visual concepts could be extracted

Visual concepts

For each image in both datasets we extract 346 TRECVID semantic concepts9 as described by Snoek et al. [46]. A large number of TRECVID concepts are related to intelligence and security applications, and have been shown to be more useful for categorizing extremism content than the semantic concept detectors trained on general-purpose image collections such as ImageNet [39]. After obtaining confidence scores per image for all of the concepts, we only select the top 5 highest-ranking concepts to represent it. Figure 4 shows an example image from our Twitter collection and its extracted TRECVID concepts. Table 7 presents the most commonly appearing hashtags, entities and visual concepts in the two datasets.
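As an illustration of this selection step, the sketch below keeps the 5 highest-scoring concepts per image; the concept names and confidence scores are placeholder values, not the actual detector output used in this work.

import numpy as np

# Placeholder detector output: (n_images, 346) confidence scores for the TRECVID
# concepts; the real scores come from the detectors described by Snoek et al. [46].
rng = np.random.default_rng(0)
concepts = np.array([f"concept_{i}" for i in range(346)])
scores = rng.random((2, 346))

# Keep the 5 highest-ranking concepts per image, best first.
top5_idx = np.argsort(scores, axis=1)[:, -5:][:, ::-1]
top5_per_image = [concepts[idx].tolist() for idx in top5_idx]
print(top5_per_image[0])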

Finally, it is important to mention that one could apply different approaches to providing additional information from the text and visual modalities. Our previous work included annotating both Twitter and Stormfront posts and users with extracted LDA topics [7], but as already mentioned, they are harder to interpret than entities. We also made use of Indri [48] in order to annotate Stormfront users with a predefined list of domain-specific topics, however, for this research, we chose to make use of more general approaches applicable to any social multimedia platform in order to showcase the vast applicability of our system and easily compare performance on our two datasets.

Figure 4: Example of an image with extracted visual concepts Building, Car, Scene Text, Ground Vehicle and Urban Scenes

9 Full list of the 346 TRECVID concepts (selected from the original set of 500 TRECVID concepts for the TRECVID 2011 Semantic indexing task) is available at https://www-nlpir.nist.gov/projects/tv2011/tv11.sin.346.concepts.simple.txt


Stormfront: Entities | Stormfront: Visual Concepts | Twitter: Hashtags | Twitter: Visual Concepts
Jews 47727 | Graphic 26187 | ISIS 122837 | Graphic 131311
Race and ethnicity in the United States Census 44126 | Animation Cartoon 17618 | TwitterKurds 78726 | Scene Text 95225
Iran 9962 | Synthetic Images 16871 | ISIL 43357 | Text 94179
Dôn 8280 | Adult 16201 | OpISIS 43295 | Overlaid Text 86651
Ron Paul 7641 | Adult Male Human 15253 | ISISchan 38308 | Computer Or Television Screens 71481
David Duke 7556 | Charts 13850 | TeamVene10 37869 | Animation Cartoon 69613
Mestizo 6324 | Commercial Advertisement 12999 | IslamicState 37690 | Synthetic Images 63763
Gendèr 4935 | Still Image 12859 | targets 36766 | Commercial Advertisement 63142
White pride 4821 | 3 Or More People 12764 | IS 36407 | Charts 56232
Adolf Hitler 4799 | Scene Text 12092 | iceisis 34654 | Animal 50413

Table 7: Top 10 most common entities, hashtags and visual concepts from both datasets with the number of times they occur in the corresponding dataset

4.3 Data Indexing

Finally, the collected and annotated data is indexed – for this purpose, we use Elasticsearch in the back-end [1]. The indexing procedure helps us create facets in the interface, allowing users to filter based on most common concepts, entities or hashtags (Figure 5). Furthermore, users can search for text within the content, as well as for specific terms occurring in the user names or the annotations.

Figure 5: A close-up of ISOLDE’s search box and facets for filtering users based on Hashtags (expanded) and Visual Concepts (collapsed)
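For illustration, the sketch below shows how such an index and its facets could be built with the elasticsearch-py client; the index name, field names and example document are hypothetical, v8-style calls are assumed, and this is not ISOLDE's actual indexing code.

from elasticsearch import Elasticsearch

# Illustrative sketch only: index one annotated user and compute facet counts
# like the ones shown in Figure 5. Index and field names are made up.
es = Elasticsearch("http://localhost:9200")

user_doc = {
    "user_name": "example_user",
    "text": "all posts of the user, concatenated",
    "hashtags": ["isis", "opisis"],
    "entities": ["Adolf Hitler"],
    "visual_concepts": ["Graphic", "Scene Text"],
}
es.index(index="users", id="example_user", document=user_doc)

# Facets: most common hashtags and visual concepts (terms aggregations),
# optionally restricted by a free-text query over the users' content.
resp = es.search(
    index="users",
    size=0,
    query={"match": {"text": "webinar"}},
    aggs={
        "hashtags": {"terms": {"field": "hashtags.keyword", "size": 10}},
        "visual_concepts": {"terms": {"field": "visual_concepts.keyword", "size": 10}},
    },
)
print(resp["aggregations"]["hashtags"]["buckets"])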

4.4 Community Detection

Although performing social network analysis in advance and integrating the results of it into ISOLDE’s standard pipeline could be of great importance for its performance, we currently retain this information for evaluation purposes only. Here, we discuss our approach to detecting communities within our collections of Stormfront and Twitter users, and later, in Section 12, we describe how we make use of the detected communities in order to automatically evaluate ISOLDE.

We find communities using the Louvain Method proposed by Blondel et al. [8]. We construct a graph where each node corresponds to a user and the edges are weighted based on the similarity between the users. First, we compute similarities based on replies and retweets. Using generalized Jaccard similarity, the similarity of any two users x and y is

J(x, y) = \frac{\sum_{i} \min(x_i, y_i)}{\sum_{i} \max(x_i, y_i)} \qquad (1)

where for every user i in the collection, x_i and y_i are the number of times x and y replied to i (for Stormfront) or retweeted i (for Twitter). Additionally, we compute similarities for Stormfront users based on the forum categories – we use Jaccard similarity again, with the only difference that x_i and y_i are now the number of times x and y posted to a category i.

Finally, preserving all edges results in a dense graph yielding only a few communities with a large number of members, which does not allow for analysis of the finer structures of the graph. Therefore, for every user, we only preserve the edges to their 10 most similar users.
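A minimal sketch of this procedure is given below, assuming networkx and the python-louvain package; the reply-count matrix is randomly generated placeholder data rather than the actual Stormfront or Twitter statistics.

import numpy as np
import networkx as nx
import community as community_louvain  # pip install python-louvain

# Entry [x, i] of `reply_counts` is how often user x replied to (or retweeted) user i.
def generalized_jaccard(x, y):
    denom = np.maximum(x, y).sum()
    return np.minimum(x, y).sum() / denom if denom > 0 else 0.0

rng = np.random.default_rng(0)
n_users = 50
reply_counts = rng.poisson(0.2, size=(n_users, n_users))

# Pairwise user similarities as in Equation (1).
sim = np.zeros((n_users, n_users))
for a in range(n_users):
    for b in range(a + 1, n_users):
        sim[a, b] = sim[b, a] = generalized_jaccard(reply_counts[a], reply_counts[b])

# For every user, keep only the edges to its 10 most similar users.
G = nx.Graph()
G.add_nodes_from(range(n_users))
for a in range(n_users):
    for b in np.argsort(sim[a])[::-1][:10]:
        if sim[a, b] > 0:
            G.add_edge(a, int(b), weight=float(sim[a, b]))

# Louvain method [8] on the resulting weighted graph.
partition = community_louvain.best_partition(G, weight="weight")
print(len(set(partition.values())), "communities")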


5 User and Content Representation

Learning meaningful user representations is crucial for our system since they are used both for automatically generating user and sub-community summaries, and for training interactive learning models. Furthermore, they are used for initially clustering the data and positioning items on the screen, which means that uncovering the data structure, quickly identifying interesting parts of the topical space and relevant items in them, as well as efficiently using the screen space and avoiding visual clutter, all directly depend on the quality of representation. Finally, compact representations are needed to ensure the interactivity of the system during the user sessions.

When it comes to solving tasks using content with multiple channels or modalities, two main approaches are commonly used - early and late fusion (Figure 6). In the first case, features are extracted from each modality and fused into a single content representation before a learning algorithm is applied. Alternatively, separate learners can be used to produce results from each individual modality before fusing all results into a single, final one.

Figure 6: General schemes of early fusion and late fusion

Early fusion approaches are commonly used in the context of social multimedia data, but they are highly dependent on the chosen weighting scheme. Weights are most commonly picked heuristically, through experimentation – Clements et al. [12] used a grid search to find optimal parameters for their probabilistic weighting scheme, and Rudinac et al. [38] heuristically assigned weights based on the performance of each individual modality on its own. Unfortunately, these approaches produce weights optimized for the specific evaluation task which was picked, rather than general performance on a variety of possible tasks.

Late fusion, on the other hand, does not require weighting of the separate modalities. But while it has been shown to commonly outperform early fusion on tasks such as semantic video analysis [47] and multimedia retrieval [54], sparsity and imbalanced features across the modalities tend to introduce a lot of noise. In turn, major disagreements between the results from the different modalities are prone to produce unreliable final results independent of the fusion approach.

Although we experiment with unimodal features and compare the performance of our system for both of these approaches, we argue that a different solution is needed when analyzing social multimedia content. Thus, we present a novel approach, which allows us to directly learn multimodal user representations, overcoming the challenges presented by fusing features or results in the two mainstream schemes.

5.1 Unimodal user representations

As a baseline approach to representing a user based on all modalities (text, visual concepts, entities and hashtags) we use TFIDF representations [41]. Due to the specifics of our automatic approaches to extracting visual concepts and entities from the posts, some concepts and entities are extracted particularly often – this could present an issue for alternatives such as bag-of-concepts representations, but is naturally handled by TFIDF representations.

In order to represent a user in the textual modality, we treat all the posts generated by the user as a single document d. Then, for each term t in the vocabulary:

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \left( \log \frac{1 + n_d}{1 + d(t)} + 1 \right) \qquad (2)

where tf(t, d) is the term frequency of t in document d, n_d is the total number of documents and d(t) is the number of documents containing t. In a similar fashion, using the standard definition of TFIDF, we separately create user representations in all other modalities by simply treating all visual concepts, entities and hashtags as terms – that is, tf(t, d) in Equation 2 is the number of times a concept t was used by the user and d(t) is the number of users that used the same concept t.

In this manner, we produce user representations having the dimensionality of the corresponding vocabulary – that is, the number of words or distinct concepts. Such high-dimensional vectors are not suitable for interactive systems like ISOLDE since they increase the computational time needed for classification, and hence the amount of time that the user waits for a response from the system. Therefore, we reduce the dimensionality of the obtained unimodal user representations using PCA [30].

Preliminary experiments showed that in our evaluation setup early fusion clearly outperforms late fusion approaches independent of the deployed fusion technique. Thus, in all future experiments with TFIDF representations based on multiple modalities, we assume the representations have been normalized and concatenated in advance.
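The following sketch illustrates this baseline with scikit-learn: one "document" per user and modality, smoothed TFIDF as in Equation (2), PCA per modality, then L2-normalization and concatenation (early fusion). The two toy users and the target dimensionality are placeholders, not the settings used in our experiments.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Toy data: each string stands for all content of one user in one modality.
users_text = [
    "world oil prices have risen to a new record",
    "it would be nice to see a webinar or web seminar",
]
users_concepts = [
    "United_States_dollar Charts Graphic",
    "Web_conferencing Scene_Text",
]

def tfidf_pca(docs, n_components=100):
    X = TfidfVectorizer().fit_transform(docs).toarray()   # smoothed idf matches Eq. (2)
    n_components = min(n_components, *X.shape)
    return PCA(n_components=n_components).fit_transform(X)

text_repr = normalize(tfidf_pca(users_text))
concept_repr = normalize(tfidf_pca(users_concepts))
user_repr = np.hstack([text_repr, concept_repr])   # early-fused user representation
print(user_repr.shape)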

However, separately modeling the individual modalities does not only yield sparse representations in some of them, but it also fails to capture important dependencies between them. When analyzing social multimedia data, analysts are often interested in the co-occurrence of specific topics and the relations between the different modalities within the same posts – phenomena which are not captured by simple unimodal representations which essentially aggregate all of the content produced by the user.

5.2 Multimodal user and content representations

In order to overcome some of the problems posed by unimodal user representations, we propose learning multimodal user and content representations using StarSpace [57] - a general-purpose neural embedding model which was recently shown to be effective for a variety of tasks. While most of these tasks can be adapted in order to learn user representations, we chose to deploy a multilabel text classification task, which allows us to simultaneously learn embeddings not only for users, but also for words, entities, hashtags and visual concepts. We expect such embeddings to be general enough for solving a wide range of tasks.

Formally, StarSpace receives as input document-label pairs (a, b) and minimizes the loss

\sum_{(a,b)\in E^{+},\; b^{-}\in E^{-}} L^{batch}\left(\mathrm{sim}(a,b),\ \mathrm{sim}(a,b^{-}_{1}),\ \cdots,\ \mathrm{sim}(a,b^{-}_{k})\right)

where sim is cosine similarity; L^{batch} is the hinge loss and the negative labels b^{-}_{i} are generated using a k-negative sampling strategy; documents are represented as bags of words and their embedding is simply the sum of the embeddings of the words they contain.


Figure 7: Different setups for creating user and content embeddings using StarSpace [57]

Since user and content embeddings can be learned through such a multilabel text classification task in several ways, in the scope of this research we have chosen to compare four different setups, conceptually depicted in Figure 7.

In all four setups we generate training examples per post and assign the corresponding user as a positive label. However, the setups differ in two ways - first, in the way we generate the input documents for every example (post), and second, in whether or not we add additional labels next to the user label.

In our first baseline setup (StarSpace W-U), the documents fed to the neural embedding model consist of the bag of words (W) contained in the post itself, and the labels are only the corresponding users (U). We argue that this setup is intuitively similar to using TFIDF vectors as user representations, due to the fact that the model would try to minimize the distance between the embeddings of a user and the words they used, but at the same time, the embeddings of those words which were also commonly used by other users will be ‘pulled’ further away in space.

Our second baseline setup (StarSpace C-U) represents a user by only taking into account information about visual concepts, entities and hashtags (collectively referred to as ‘concepts’ from now on). Thus, the examples fed to the neural model consist of the bag of concepts (C) associated with a post and the contained images, and the labels are only the corresponding users (U).

Both of our baseline models differ from simple approaches for representing a user with TFIDF or BOW vectors in that the model receives additional information about the co-occurrence of words or concepts within the same post.

Next, we propose two different setups incorporating simultaneously information about text and concepts. StarSpace CW-U combines the two baseline approaches by providing the model with two separate examples per post - one containing the bag of concepts associated with the post (C), and one containing the bag of words (W). For both examples, the user (U) is a single assigned label.

Finally, we experiment with one more setup - StarSpace W-UC. In this setup, examples consist again of the bag of words (W), however every post is labeled not only with the user (U), but also with each of the associated concepts (C). In this way, we put more importance on the concepts – such a setup implicitly minimizes the distance between users and the concepts that they commonly use, by simultaneously minimizing the distance between user and a post, and each of the concepts and the post.
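To make the four setups concrete, the sketch below writes the corresponding StarSpace training lines (fastText file format with the default "__label__" prefix) for a single made-up post; the post text, concepts, user id and file names are illustrative only.

# Illustrative sketch: training lines produced for one post under the four setups.
post_words = "world oil prices have risen to a new record"
post_concepts = ["United_States_dollar", "Charts"]   # entities / hashtags / visual concepts
user = "user_12345"

def label(x):
    return f"__label__{x}"

training_lines = {
    # W-U: bag of words of the post, labeled with the user only
    "starspace_w_u.txt": [f"{post_words} {label(user)}"],
    # C-U: bag of concepts of the post, labeled with the user only
    "starspace_c_u.txt": [f"{' '.join(post_concepts)} {label(user)}"],
    # CW-U: two separate examples per post, both labeled with the user
    "starspace_cw_u.txt": [
        f"{post_words} {label(user)}",
        f"{' '.join(post_concepts)} {label(user)}",
    ],
    # W-UC: bag of words, labeled with the user and with every associated concept
    "starspace_w_uc.txt": [
        f"{post_words} {label(user)} " + " ".join(label(c) for c in post_concepts)
    ],
}

for path, rows in training_lines.items():
    with open(path, "w") as f:
        f.write("\n".join(rows) + "\n")

Each resulting file can then be used as the training input of StarSpace's multilabel text classification mode.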

Section 9 provides more insight into the resulting embeddings, their quality and the main differences between StarSpace CW-U and StarSpace W-UC. Sections 11 and 12 contain a comparison of the performance of our system using unimodal and multimodal representations, as well as a detailed comparison of the performance using each of the four different StarSpace setups.


6 User and sub-community profiles

One of the main challenges with analyzing social media data is the amount of content that needs to be processed. As recently noted by Rudinac et al. [40], efficient summarization approaches are needed to facilitate exploration in such large and heterogeneous collections. Given the diversity of available social media platforms, we identify the following requirements for a desired summarization approach: first, it needs to be able to summarize content from multiple (possibly unrelated) documents (posts); the quality of the summaries should not depend on the length or content of the individual documents; the summarization process needs to simultaneously handle all available modalities so that it can capture important dependencies between them; and last but not least, it needs to be able to summarize content even in the absence of one or many of the possible modalities.

Although plenty of summarization approaches have been proposed through the years, most of them were developed with a single modality in mind and fail to meet the requirements of multimodal datasets, and social multimedia data in specific. Nevertheless, approaches such as Carbonell and Goldstein [10] can be seen as a step towards multimodal summarization since they only rely on the notions of relevance and diversity rather than modality-specific properties. Most existing approaches to summarizing multimodal data commonly summarize content of a single modality by taking into account another one. Rudinac et al. [38] and van den Berg et al. [50] create visual summaries by taking into account features extracted from the associated text. Pang et al. [34], on the other hand, first select location-representative tags from the textual modality, and then select diverse and representative images to visualize those tags in order to create summaries of tourist destinations. Even if they produce truly multimodal summaries, these approaches usually over-rely on the availability of certain modalities, consequently failing one of the mentioned requirements.

Existing approaches to summarizing multimedia data in specific are commonly developed for microblogging platforms such as Twitter, and often rely on picking whole key posts in order to summarize a bigger set – e.g. the approaches proposed by Chakrabarti and Punera [11] and Ren et al. [35]. Unfortunately such approaches are not suitable for summarizing content of platforms such as Stormfront, where individual posts are considerably longer, which makes reading even a few key posts already impractical.

After all, we identify that most of the mentioned approaches, independent of the type of content or modalities that they have been developed for, are based on the criteria of selecting diverse and representative content. Given the multimodal embeddings proposed in Section 5, we are able to directly use these criteria for the creation of multimodal user and community summaries, which meet all of the above-mentioned requirements.

Generating user profiles using StarSpace embeddings

For each user we automatically generate a short profile consisting of words, textual and visual concepts, as well as the other users which are close to them in the embedding space. Our multimodal embeddings make it possible to directly compare items across the different modalities, which allows us to simultaneously pick representative and diverse content from all of them. Furthermore, representing a user as a bag of semantic concepts and similar users yields short and easy to interpret profiles, and also works well regardless of the amount and type of content associated with the user.

We generate user profiles by following Algorithm 1. We start by collecting all possible items that can be included in the summary – that is, the union of WU (the set of words used by the user), CU (the concepts extracted from their posts, i.e. visual concepts, entities or hashtags) and RU (the set of users closest to them in the embedding space).

Algorithm 1 Generating User Profiles

1: procedure Profile(U, nn)                      ▷ Profile of U containing at most nn items
2:     P ← WU ∪ CU ∪ RU
3:     S ← ∅
4:     while |S| < nn AND P ≠ ∅ do
5:         for i ∈ P do
6:             SUi = Ui                           ▷ The number of times item i was used by U
7:             SRi = Σj∈P dist(i, j)              ▷ Representative for the set of possible items
8:             SDi = Σj∈S dist(i, j)              ▷ Diverse based on the set of selected items
9:         r1 ← argsort(SU, descending)
10:        r2 ← argsort(SR, ascending)
11:        r3 ← argsort(SD, descending)
12:        rfinal ← AggregateRankings(r1, r2, r3)
13:        elemtop ← Top1(rfinal)
14:        P ← P \ {elemtop}
15:        S ← S ∪ {elemtop}
16:    return S                                   ▷ The profile consists of the items in S

Next, in an iterative manner we pick items from the set to be included in the summary until we reach a desired number of items. At every iteration, we start by assigning each item i multiple scores which reflect the following criteria:

• User Representativeness: SUi = the number of times i was used by the user — preferable items are commonly used by the user

• Semantic Representativeness: SRi = Σj∈P dist(i, j) — preferable items have a low total distance to all other candidate items

• Diversity: SDi = Σj∈S dist(i, j) — preferable items have a high total distance to previously selected items

We incorporate two different notions of representativeness in order to avoid picking extremely rare and meaningless items – due to the specifics of our training procedure, any rare items which are mainly used by the user would potentially have embeddings close in space to that of the user; in turn, they are likely to have low total distance to all other items of the user (hence, a highly ranked Semantic Representativeness score), without being meaningful enough to be considered a useful part of the summary. In order to counteract this, we additionally rank items based on the amount of times the user themselves used these items (User Representativeness score). In this way, we give more weight to common, descriptive items.

We use the scores in order to produce 3 different rankings of the items – based on ascending S^R scores and descending S^D and S^U scores, thus placing the most representative, most diverse and most commonly used items highest in the corresponding rankings. Finally, we aggregate the 3 rankings and select the single highest-scoring item to be added to the user summary.

To this end, we deploy the Borda method for aggregating the three rankings; future work could include experiments with different aggregation techniques. For every ranking i, the Borda aggregation [16] assigns a score BC_i(e) to every element e in the collection, where BC_i(e) corresponds to the number of elements ranked lower than e in ranking i. The overall score of each element is then simply the sum B(e) = Σ_i BC_i(e), and the items are ordered in decreasing order of their total score.

Figure 8: Example of automatically generated user profile

Figure 8 shows an automatically generated profile. Section 9 contains a qualitative analysis of the generated profiles.
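To make the procedure concrete, the following Python sketch implements the greedy selection of Algorithm 1 together with the Borda aggregation. The function names, the data structures and the use of NumPy/SciPy are illustrative assumptions; in ISOLDE the distances are computed over the StarSpace embeddings described earlier, and the usage counts come from the user's posts.

    import numpy as np
    from scipy.spatial.distance import cdist

    def borda_aggregate(rankings, items):
        """Sum Borda scores over rankings; each ranking lists items best-first."""
        scores = {i: 0 for i in items}
        for ranking in rankings:
            n = len(ranking)
            for pos, item in enumerate(ranking):
                scores[item] += n - 1 - pos      # number of items ranked lower
        return max(scores, key=scores.get)       # single highest-scoring element

    def generate_profile(candidates, embeddings, usage_counts, nn=10):
        """candidates: item ids (words, concepts, related users);
        embeddings: item id -> 1-D vector in the shared space;
        usage_counts: item id -> how often the user used the item."""
        P, S = list(candidates), []
        while len(S) < nn and P:
            E_P = np.stack([embeddings[i] for i in P])
            # S^U: user representativeness (usage counts)
            s_u = {i: usage_counts.get(i, 0) for i in P}
            # S^R: total distance to all candidate items (lower = more representative)
            s_r = dict(zip(P, cdist(E_P, E_P).sum(axis=1)))
            # S^D: total distance to already selected items (higher = more diverse)
            if S:
                E_S = np.stack([embeddings[i] for i in S])
                s_d = dict(zip(P, cdist(E_P, E_S).sum(axis=1)))
            else:
                s_d = {i: 0.0 for i in P}
            r1 = sorted(P, key=lambda i: -s_u[i])    # descending S^U
            r2 = sorted(P, key=lambda i: s_r[i])     # ascending S^R
            r3 = sorted(P, key=lambda i: -s_d[i])    # descending S^D
            top = borda_aggregate([r1, r2, r3], P)
            P.remove(top)
            S.append(top)
        return S

For a user with candidate items W_U ∪ C_U ∪ R_U, calling generate_profile(candidates, embeddings, usage_counts, nn=10) returns the at most nn items that form the profile.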

Generalization to community profiles

Extending our summarization approach to generating community rather than user profiles is a straightforward task. The main difference between the two is in Step 2 of Algorithm 1 – rather than collecting all items related to a single user, we simply aggregate all items used by any user within the community. In addition, this set can quickly become too large in the community case; therefore, for optimization reasons, we only pick the 1000 most common items from each modality before we proceed to the next steps of the algorithm.
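As an illustration, the sketch below shows how such a community candidate set could be built; the per-user item lists and modality labels are hypothetical data structures, and the result can then be passed to the profile-generation sketch above.

    from collections import Counter

    def community_candidates(members, items_per_user, top_k=1000):
        """members: list of user ids; items_per_user: user id -> dict
        mapping a modality ('word', 'concept', 'user') to the items used."""
        counts = {'word': Counter(), 'concept': Counter(), 'user': Counter()}
        for u in members:
            for modality, items in items_per_user[u].items():
                counts[modality].update(items)
        candidates, usage = [], {}
        for modality, counter in counts.items():
            # keep only the top_k most common items per modality
            for item, c in counter.most_common(top_k):
                candidates.append(item)
                usage[item] = c
        return candidates, usage    # feed these to generate_profile(...)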

Finally, it is important to note that our summarization approach can be extended to take into account any type of items provided that they have comparable embeddings. For example, some of our setups for learning user and content embeddings using StarSpace contain information about the visual concepts that occur in a post. Next to those concepts, we can also add yet another token representing a whole image in order to learn image embeddings directly, which in turn would allow us to include whole images in the user profiles.


7 Interactive learning

Once the actor has selected a number of positive and negative example items, they are used to train a classifier. The following section provides more details about this process.

Relevance feedback framework

There are currently two paradigms commonly used in interactive learning systems – active learning [43] and relevance feedback [63]. In the first case, the actor is presented with the items which the learner is least confident about. While this strategy ensures a higher performance gain in the next iterations, it means that the actor needs to be patient enough to label all hard examples before they receive results satisfying their own information need. Since ISOLDE's main purpose is to allow interactive search, exploration and finding users similar to previously selected ones, we find that optimizing precision in the long term is less important than immediately presenting the actor with relevant results. Thus, we deploy relevance feedback instead, i.e. the items highlighted after each learning cycle are the ones that the classifier is most certain about – a strategy that has a better effect on the actor's continuous knowledge gain. The number of highlighted items N is a parameter of the system that we further discuss in Section 11, where we formally evaluate its effect on the performance of the system.

Classifier

As noted by Zahálka et al. [62], one of the main performance requirements for multimedia analytics systems is interactivity – the user should receive results within seconds at most, in order to avoid them abandoning a non-responsive interface and the analysis altogether. This means that state-of-the-art machine learning algorithms cannot be deployed if they are too computationally expensive, regardless of the performance gains they would bring.

We experiment with two linear models – an SGD model [37] and an SVM model [15], which has proven effective in various tasks utilizing user relevance feedback in an interactive setup [52, 62]. The confidence scores obtained from the classifier are used to rank all users, and the ones considered most relevant are highlighted in the interface.
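A minimal sketch of this ranking step is shown below, assuming the user StarSpace embeddings are available as a NumPy matrix with one row per user; scikit-learn's LinearSVC and SGDClassifier stand in for the SVM and SGD models, and the exact hyperparameters used in ISOLDE are not implied.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import SGDClassifier

    def rank_users(X, labeled_idx, labels, model='svm', top_n=20):
        """X: user embeddings (n_users x dim); labeled_idx/labels: the rows the
        actor has marked and their relevance (1 = relevant, 0 = irrelevant)."""
        clf = LinearSVC() if model == 'svm' else SGDClassifier(loss='hinge')
        clf.fit(X[labeled_idx], labels)
        # Confidence: signed distance to the decision hyperplane, for all users.
        scores = clf.decision_function(X)
        ranking = np.argsort(-scores)        # most relevant users first
        highlighted = ranking[:top_n]        # the N users the classifier is most certain about
        return scores, ranking, highlighted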

Negative Sampling

However, aiming to present the system with good examples of what they are interested in, the actor is generally more likely to explicitly mark the relevance of positive examples – a problem which is further complicated by the relevance feedback framework, in which the actor's attention is explicitly directed at examples considered positive by the classifier. Therefore, in order to improve the performance of the classifier and avoid getting it stuck on the positive samples, we make use of negative sampling. At every learning iteration, we sample additional negative examples if the set of negative examples U^- is too small compared to the set of positive examples U^+, i.e. |U^-| ≤ c·|U^+|, where c is a parameter of the system. Initial experiments with fully random sampling did not perform well, thus we currently perform random sampling only at the first iteration. Afterwards, we uniformly sample users from the 10% of users which scored lowest at the previous iteration. Section 11 discusses the results of our experiments with different values for c and the parameter's impact on the overall performance of the system.
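The following sketch illustrates this sampling heuristic; the index arrays and the way the bottom 10% is intersected with the still-unlabeled users are illustrative assumptions rather than ISOLDE's exact implementation.

    import numpy as np

    def sample_negatives(all_idx, pos_idx, neg_idx, prev_scores=None, c=1.0, rng=None):
        """Sample extra negatives when |U^-| <= c * |U^+|: randomly in the first
        iteration, from the 10% lowest-scoring users afterwards."""
        rng = rng if rng is not None else np.random.default_rng()
        pos_idx = np.asarray(pos_idx, dtype=int)
        neg_idx = np.asarray(neg_idx, dtype=int)
        n_missing = int(c * len(pos_idx)) - len(neg_idx)
        if n_missing <= 0:
            return neg_idx
        unlabeled = np.setdiff1d(all_idx, np.concatenate([pos_idx, neg_idx]))
        if prev_scores is None:                    # first iteration: fully random
            pool = unlabeled
        else:                                      # later iterations: bottom 10% by previous score
            k = max(1, len(all_idx) // 10)
            bottom = np.argsort(prev_scores)[:k]
            pool = np.intersect1d(unlabeled, bottom)
            if len(pool) == 0:
                pool = unlabeled
        if len(pool) == 0:
            return neg_idx
        sampled = rng.choice(pool, size=min(n_missing, len(pool)), replace=False)
        return np.concatenate([neg_idx, sampled])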


8 ISOLDE

In this section we look at the different phases of ISOLDE’s interactive user sessions in order to see how all of the individual components enable them and how these components are combined into a single coherent system. Furthermore, we provide more information about ISOLDE’s interface and how it supports the analyst in gaining deeper insight about the collection.

ISOLDE’s Initial View

As already explained in Section 3, the interactive user sessions start with an aggregated view of the collection (shown in Figure 2), following the well-known approach 'overview first, zoom and filter, details on demand' formalized by Shneiderman [44]. We find this view to be the most important part of ISOLDE's interface, since the overall structure and the original placement of the nodes are preserved throughout the whole user session.

We visualize users based on their StarSpace embeddings (discussed in Section 5) in order to make sure that similar users are placed close to each other in space. This not only makes the initial exploration of the space and understanding of the overall structure of the collection easier, but it is also useful during the interactive learning stage – given that the system ranks the users' relevance based on the same embeddings, we also ensure that users with similar relevance are positioned close in space and the results are easy to interpret.

Standard techniques for visualizing high-dimensional data, such as applying PCA [30] or t-SNE [27] to reduce the dimensionality of the user representations, accurately convey the underlying local or global structure of the data, but they commonly create a lot of visual clutter, which makes topical parts of the space hard to identify. Thus, we chose to use a force-based layout instead – a common technique for reducing visual clutter in graph visualizations. Force-based graph drawing approaches, such as the ones proposed by Kamada et al. [22] and Fruchterman and Reingold [20], aim at finding aesthetically pleasing positions for the nodes of a graph using physical forces – attractive forces act on neighboring nodes in order to keep them close to each other, while repulsive forces between all pairs of nodes ensure that nodes are not drawn too close to each other.

Therefore, we first need to create a graph with all users as nodes. One could easily do this by exploiting the natural underlying graph structure of social media users, caused by them following each other or replying to one another; in our implementation, however, we do this using their user representations. We first cluster all users based on their StarSpace embeddings – each cluster centroid will also be a node in the graph. The number of clusters and the specific clustering algorithm can easily be adjusted in the interface based on the interacting user's needs. The system currently supports two clustering methods – hierarchical clustering using the method proposed by Ward Jr [53] and K-Means clustering, which was independently proposed by MacQueen et al. [28] and Lloyd [26].

Next, we connect user nodes to the centroid nodes and assign edge weights based on the distance between the user and the corresponding centroid. At this stage, it is important that users are connected to more than one cluster, in order to prevent them from drifting around their own centroid only and, in turn, losing similarity information between nodes of different clusters. However, to reduce the computational effort needed for finding optimal positions for the nodes, we only connect users to the 5 closest centroids.
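A minimal sketch of this graph-construction step is given below, using scikit-learn's KMeans and networkx purely for illustration; ISOLDE's actual layout is computed in the D3.js interface.

    import numpy as np
    import networkx as nx
    from sklearn.cluster import KMeans
    from scipy.spatial.distance import cdist

    def build_layout_graph(X, user_ids, n_clusters=20, k=5):
        """X: user StarSpace embeddings (n_users x dim); user_ids: node labels."""
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
        d = cdist(X, km.cluster_centers_)            # user-to-centroid distances
        G = nx.Graph()
        G.add_nodes_from(user_ids, kind='user')
        G.add_nodes_from([f'centroid_{c}' for c in range(n_clusters)], kind='centroid')
        for row, uid in enumerate(user_ids):
            for c in np.argsort(d[row])[:k]:         # connect to the k closest centroids
                # shorter distance -> stronger attraction in the force layout
                G.add_edge(uid, f'centroid_{c}', weight=1.0 / (1e-6 + d[row, c]))
        return G

For a quick offline inspection, node positions can be obtained with nx.spring_layout(G, weight='weight'), which implements the Fruchterman-Reingold force-directed algorithm.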

Figure 9 shows how the initial view of the system (given the same dataset) would look using a standard dimensionality reduction technique (PCA or t-SNE), and how it looks using a force-based layout with K-Means clustering. As can be seen, our proposed force-based layout not only makes better use of the available screen space, but the cluster nodes also provide additional structure to the dataset.

Figure 9: Initial view of the system using (a) PCA, (b) t-SNE, and (c) a force-based layout based on k-means clustering.

In this way, exploring the dataset is made easier not only by visually grouping similar users together, but also by the automatically generated profiles of the 'communities' formed by each cluster.

Interactive user sessions

Collection overview, search and exploration

Given this initial view of the system, the interacting user can start exploring the collection. Interacting with the clearly denoted cluster centroids provides a summary of the content of all users belonging to the cluster, which allows the user to easily grasp the structure of the collection and to understand the semantic space. The user can also search for specific terms or filter the collection. As already noted, the initial positions of the users are preserved during the whole session, which means that the results of such interactions with the collection are conveyed only by altering the visual properties of the nodes – changing their size, color or opacity.

User interactions and profiling

At this phase of the user sessions, our system is mostly static. The only nodes that change are the selected ones – we clearly denote nodes marked as relevant or irrelevant by coloring them in bright green or red – colors chosen to ensure that previously observed nodes and their corresponding relevance are easy to distinguish.

The multimodal user and community profiles, generated using the algorithm described in Section 6, are shown only on demand, by hovering or clicking any of the nodes. The items in the summary are grouped based on their type (users, visual or textual concepts) and denoted with intuitive symbols.

Figure 10: Close up of the different icons used to denote (top to bottom): hashtags, all diverse and representative items, users, visual concepts, words.

Identifying similar users

Finally, the actor can easily request a re-ranking of all users based on the previously selected ones. After the classifier is retrained on the fly, as described in Section 7, all user nodes are recolored – the brighter a node is, the more relevant it is considered by the system. Brighter colors have been shown to attract the user's attention, thus directing their focus to the nodes they might be interested in. However, to ensure that the most relevant nodes are easy to spot, we explicitly highlight the top N users that the classifier is most confident about, where N is a parameter of the system. Furthermore, by doing so, we aim to stimulate the user to explicitly mark the relevance of the highest-ranking users, providing useful input for the next iteration of the relevance feedback session.


Implementation Details

ISOLDE’s interface is implemented in JavaScript and the visualizations are made using D3.js.

Multimedia items are retrieved and fed to the interface from Elasticsearch [1] in order to provide search and filtering capabilities, and the analytics modules are implemented in Python. This implementation makes ISOLDE modular and allows it to be used either as a stand-alone web application or as part of a larger analytics system.
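As an illustration of the retrieval path, a hedged sketch using the official Python Elasticsearch client is shown below; the index name, the field name and the query structure are assumptions rather than ISOLDE's actual schema.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def search_posts(query, size=50):
        """Full-text search over an assumed 'posts' index with a 'text' field."""
        body = {"query": {"match": {"text": query}}, "size": size}
        return es.search(index="posts", body=body)["hits"]["hits"]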
