Automatic Multimodal Summarisation of Touristic Routes for Interactive City Exploration


SUBMITTED IN PARTIAL FULFILMENT FOR THE DEGREE OF MASTER OF SCIENCE

JORRIT VAN DEN BERG

10677518

MASTER INFORMATION STUDIES

HUMAN CENTERED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

August 17, 2015

Supervisor Second reader


Jorrit van den Berg

University of Amsterdam Science Park 904, Amsterdam

jorrit van den berg@hotmail.com

ABSTRACT

The potential of mining tourist information from social media data gives rise to interesting new use-cases. Until recently, most of the research in this domain has been based on image metadata (geolocations), primarily focusing on the flow of tourists in a city. In this thesis, we propose a multi-modal approach to visually and textually describing user-selected routes, such that a user can interactively view information about a route between two locations. The approach has been implemented in a prototype application for Android smartphones. The prototype is based on a dataset of Flickr and Foursquare user-contributed content from Amsterdam. We have evaluated the approach in a user study. In general, the idea of visually summarising routes and describing them is well received by the participants in our study.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.5 [Information Storage and Retrieval]: Online Information Services

General Terms

Design, Experimentation, Human Factors

Keywords

City navigation, deep nets, multimedia analysis, multimedia summarisation, latent topics, semantic concepts, social multimedia, urban computing

1. INTRODUCTION

When visiting a city, tourists often have to rely on travel guides to get information about interesting places in their vicinity or between two locations. Existing crowd-sourced tourist websites like TripAdvisor¹ primarily focus on providing Point Of Interest (POI) reviews. The available data on social media platforms allows for new use-cases, stemming from a much richer impression about places.

¹ http://www.tripadvisor.com

The detection and recommendation of POIs and tourist routes recently became a topic of intensive research. Related work shows how human flows can be detected [5]. Efforts to personalise recommendations have been made by extracting demographics from images [3] or incorporating user preferences by interaction [13][24]. In addition to generating recommendations, visualising query results in a user application is a closely related challenge [7], in the sense that a potential tourism use-case most likely requires easy-to-grasp visualisations on a mobile device. With the aim of gaining a better understanding of how visual summarisation and description of routes can aid tourists in exploring a city, in this thesis we seek to answer:

To what extent can touristic routes for interactive city exploration automatically be visually summarised and described using user-contributed multimedia?

Our main research question has three important aspects: visual summarisation, textual description and user preferences. Therefore, we intend to answer the research question by investigating the following sub-questions: How can routes be automatically visually summarised with images? How can places on a route be automatically described? How do users perceive an automatically summarised route? To answer these research questions, we adopt a realistic use-case of interactive city exploration, implement a prototype app and evaluate it in a user study. Figure 1 shows an overview of our approach.

The dataset with user-contributed content that we use is downloaded from Flickr² and Foursquare³. Flickr is a photo sharing website on which many amateur photographers upload their leisure photos. Foursquare, on the other hand, is a platform focusing on venues, which includes associated images.

² http://www.flickr.com ³


Figure 1: Overview of the approach

With their different purposes and user groups, images from these platforms complement each other. Geotagged images on both platforms correspond fairly precisely to geographical locations. We gather our dataset from the region of Amsterdam. The intended contribution of this thesis is a method to generate a visual summary of user-selected routes and a description of places on these routes, as far as both can be derived from the image content and associated annotations.

The remainder of this thesis is organised as follows: Section 2 contains the related work and Section 3 the approach overview. The prototype design is discussed in Section 4. Section 5 contains experimental results and their discussion. Section 6 concludes the thesis.

2. RELATED WORK

This section provides a synopsis of related work. With regard to mining tourist information, strictly metadata-based (geolocation) approaches are found in most literature and these are still a topic of ongoing research. More recently, the tourist information domain caught the attention of the multimedia community too, which additionally attempts to utilise the analysis of image content and multi-modal analysis. We discuss both metadata-based and multimedia efforts, as well as related work in visual summarisation.

2.1 Metadata-based Efforts

El Ali et al. adopted a method from bioinformatics to align similarities in route segments on which photographers took their Flickr images [5]. The segments are aggregated into routes using an altered implementation of Dijkstra’s algorithm. The authors performed a lab evaluation and a web survey, which provide valuable insights into user preferences.

Pippig et al. focus on allowing users to choose a semantic theme-based route with a particular start and end POI [15]. Their proposed approach uses geotagged Wikipedia⁴ articles, of which the semantic similarity is computed. The geographical areas with which the Wikipedia articles are associated are determined from the geotags of images that have the same tags or titles as the articles. The images are obtained from Flickr and Panoramio⁵. Multiple methods for route computation are proposed, depending on whether the users prefer short routes or routes with the most relevant landmarks on them.

Quercia et al. argue that navigation systems that solely provide short routes to a destination may not fully reflect the possible goals of users [17]. They researched routes from the viewpoint of emotional pleasantness with three alternative premises: beautiful, quiet and happy. Their user study found that the proposed routes indeed evoked these feelings.

⁴ http://www.wikipedia.org ⁵


Vu et al. researched the behaviour of tourists in Hong Kong [23]. The authors propose P-DBSCAN, an alternative version of the widely used Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm [6]. It captures the popularity of a clustered location with a threshold for the number of users that contributed the images. In their study, the authors compare the travel behaviour of Western and Asian tourists. One finding is that the POIs they visited and the routes they followed differ slightly.

2.2 Content-based and Multi-modal Efforts

We proceed by summarising the content-based and multi-modal approaches. Popescu et al. propose an approach to extract tourist information from Flickr data [16], such as the POIs tourists visited and how long they stayed there. The authors emphasise that tourists do not always take the shortest route between POIs. Personal preferences and opening hours of POIs were found to be important factors influencing the order in which POIs are visited. The authors note that for some of the locations, only a small fraction of Flickr images are geotagged, which makes it difficult to generalise the approach.

Cheng et al. consider the demographic characteristics of users [3]. Their system is designed both for personalised route recommendation and for next-destination suggestion. The demographic characteristics, such as gender, age group and race, are extracted from Flickr images. The authors notice a correlation between the extracted demographics and travel locations. Extracting demographics from the visual content for this purpose rests on the assumption that a person depicted in an image is a traveller.

Okuyama and Yanai propose an interactive approach for travel route recommendation [13]. First, POIs are clustered using hierarchical clustering and five representative images for each cluster are selected by matching SURF descriptors. Travel paths are derived from geotagged images on a user’s day-trip. When travel path data is scarce for a specific city, new routes are generated based on the co-occurrence of POIs in two routes. The generation of new trip models could to some extent circumvent the scarcity of data for certain cities. However, the authors do not investigate how the aggregated routes were perceived by the users.

New Yorker Melange, proposed by Zahalka et al., is a recent example of an interactive and multi-modal city explorer [24]. Instead of looking at the popularity of POIs, the system looks for social network users with topical interests similar to those of the interacting user and recommends the venues visited by them.

Information about venues and their related images is downloaded from Foursquare⁶ and complemented with images from Flickr and Picasa⁷. A deep net is used to detect visual concepts from the images. From user annotations, bag-of-words features are extracted and analysed using Latent Dirichlet Allocation (LDA), in order to identify 100 latent topics. Finally, clustering is applied to represent venues and users with a certain number of semantic topics.

2.3 Visual Summarisation

Li et al. propose a method to select iconic images of landmarks [12]. The approach is based on GIST descriptors. The images are clustered and the clusters are validated by comparing their representative images on geometric correspondence. The computed clusters are each represented by an iconic image.

Chen et al. combine detecting POIs for tourists and depicting those POIs on a map [2]. POIs are clustered with the objective of reducing the variability in geographical distance and tags (measured with TF-IDF). The dominant tags in the clusters are used to obtain additional images for each clustered POI. Subsequently, the images are clustered and the cluster members are aligned using homography, in order to generate a clutter-free image. An interesting user consensus weighting scheme is used to scale images in size for a tourist map. The authors notice that the approach sometimes produces unexpected results, but the paper does not mention specific evaluation measures. The animation-style rendered images do not seem suitable for visualisations other than a tourist map.

Rudinac et al. focus on selecting representative and diverse images in order to visually summarise a geographic area [18]. A multi-modal graph-based approach is used to select representative and diverse images depicting various aspects of a given geographic area. It has been proven effective in summarising both popular and less visited areas.

Following up on their previous work, Rudinac et al. researched the user preferences for the automatic creation of visual summaries [19]. In order to study human preferences in visual summarisation, Amazon Mechanical Turk was deployed. Subsequently, multiple criteria were identified that influence human choice of images, such as representativeness, diversity, aesthetic appeal and sentiment. The authors further propose an approach for directly mapping those criteria onto features. Finally, an approach to user-centred evaluation of visual summaries is proposed.

⁶ http://www.foursquare.com ⁷


In [20], Rudinac and Worring further study the user preferences in visual summarisation. The aim of the study was to identify which semantic concepts in images are most often selected by humans when summarising image collections. The results show that some semantic topics are more popular than others and that automatic selection of images with popular semantic concepts yields promising results. One finding is that panoramic images are often selected by users, along with images depicting “bodies of water against the skyline”.

In essence, the discussed related work focuses on route detection or recommendation, visualising POIs and visually summarising geographic areas. A particular novelty of our approach is an interactive generation of visually and textually summarised touristic routes for city exploration.

3. APPROACH OVERVIEW

In this section, we describe the acquisition of our dataset and the analysis methods that we use in our approach. The design of the prototype app that we implement will be discussed in Section 4. Our approach consists of the following steps:

1. Obtaining content from social media.
2. Detecting semantic concepts in the images.
3. Detecting topics in the annotation of the images.
4. Grouping images.
5. Applying multi-modal fusion.
6. Creating visual summaries.
7. Creating textual descriptions.

3.1 Obtaining Content From Social Media

At first, an approach similar to [24] was used to obtain content. POIs within a radius of 9 kilometres from the centre of Amsterdam were downloaded from the Facebook Graph API and the Foursquare API. The geo coordinates of those POIs, the text “Amsterdam” and a radius of 500 metres were used to search for Creative Commons images and their attributes on Flickr. Unfortunately, this approach only yielded around 30,000 images. Expanding the data collection by querying Foursquare as well resulted in 80,000 images. To further enlarge the dataset, more images taken in Amsterdam by already known Flickr user IDs were downloaded, for all license types. A limitation of this procedure is that only the Creative Commons subset of the images is suitable for our intended visualisation purposes. However, the final result is a much larger dataset of 157,000 images and their attributes, which can be considered a reasonable trade-off.

Figure 2 shows a heat map of the GPS coordinates of the images in the dataset, to illustrate their geographical distribution.

Figure 2: Heat map of image GPS coordinates

In accordance with intuition, the density is the highest in the centre of Amsterdam and the areas surrounding it. The suburban areas have a much lower image density, which limits the use-case to city tourism. Figure 3 illustrates the content analysis pipeline.

Figure 3: Data processing pipeline

3.2 Detecting Semantic Concepts

The detection of semantic concepts in images has become a relatively robust method to describe them. Krizhevsky et al. created a convolutional neural network model consisting of 650,000 neurons to learn 1000 image classes from ImageNet synsets [10]. The images are first scaled down to a 256 x 256 pixel resolution and the mean of the input images is subtracted, resulting in centred RGB values. To prevent the model from overfitting, 224 x 224 pixel patches are randomly extracted from the images. This model showed a substantial improvement over the state of the art in 2012.


Szegedy et al. propose an improved convolutional neural network architecture, which they named the Inception architecture [21]. The optimisation is achieved by approximating the optimal sparse structure. This approximation is computationally expensive, which they circumvent by applying dimension reductions and projections when necessary. In addition, their model uses multiple scales of input images, instead of a single scale. In comparison with the work by Krizhevsky et al., this approach results in improved accuracy while having fewer parameters. We adopt this approach to detect 15,293 semantic concepts. Each image is represented by a probability distribution over the 15,293 concepts.

3.3 Detecting Topics in the Annotations

Apart from the semantic concepts depicted in them, images can also have valuable information in their annotations. The annotations, i.e. title, description and tags, are tokenised using the Python Natural Language Toolkit (NLTK). Stopwords, unique words and HTML markup are removed.
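The preprocessing described above can be sketched with the Python standard library alone. This is a minimal illustration and not the NLTK-based implementation used in the thesis: the stopword list is a tiny hypothetical stand-in, the regular-expression tokeniser is an assumption, and "unique words" is interpreted here as words that occur only once in the whole collection.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); NLTK ships a much fuller one.
STOPWORDS = {"the", "a", "an", "in", "of", "at", "and", "on"}


def tokenise(annotation: str) -> list[str]:
    """Strip HTML tags, lowercase, split into word tokens, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", annotation)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def preprocess(annotations: list[str]) -> list[list[str]]:
    """Tokenise every annotation and drop words that occur only once overall."""
    docs = [tokenise(a) for a in annotations]
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] > 1] for doc in docs]


docs = preprocess([
    "<b>Canal</b> in the Jordaan",
    "Boats on the canal at sunset",
    "Jordaan street market",
])
print(docs)
```

In this toy collection only "canal" and "jordaan" survive, because every other non-stopword occurs once.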

Hoffman et al. propose online Variational Bayes (VB) for LDA as a fast method to identify topics from text [8]. The online method is compared with a batch method of VB. In a comparison over 3,300,000 Wikipedia articles, the online method performs better.

Řehůřek et al. created the Gensim framework for topic modelling with large corpora [22]. They argue that the memory storage of large corpora forms a limitation, which they avoid by loading documents one by one from disk. The framework supports several vector space models, among which are TF-IDF and LDA. The Gensim framework is used to compute a probability distribution over 100 latent topics for each candidate image.

3.4 Grouping Images

A rectangular geographical grid is used to group the images. The geographical grid has 125 x 125 metre cells and its outer corners have the same geo coordinates as the bounding box used in [5]. Grouping the images into smaller subsets is needed to be able to compute similarity matrices that fit in memory. Furthermore, 125 metres is a reasonable walking distance for most people. This allows the geographical centroid of each cell to be used as a graph node for routing later on. For each grid cell, a pairwise cosine similarity matrix is computed for both the probability distributions over the visual concepts and those over the textual topics.
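As an illustration of this grouping step, the sketch below maps a GPS coordinate to a 125 metre grid cell and computes a pairwise cosine similarity matrix for the vectors in one cell. The grid origin, the flat-earth approximation and the function names are assumptions made for the example; the thesis does not state how the cell index was computed.

```python
import math

CELL_SIZE_M = 125.0           # cell edge length from Section 3.4
EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def grid_cell(lat, lon, origin_lat, origin_lon):
    """Return the (row, col) of the cell containing (lat, lon).

    Uses a flat-earth (equirectangular) approximation around the origin,
    an assumption that is adequate at city scale.
    """
    north_m = math.radians(lat - origin_lat) * EARTH_RADIUS_M
    east_m = (math.radians(lon - origin_lon) * EARTH_RADIUS_M
              * math.cos(math.radians(origin_lat)))
    return int(north_m // CELL_SIZE_M), int(east_m // CELL_SIZE_M)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarities for all vectors belonging to one cell."""
    return [[cosine(u, v) for v in vectors] for u in vectors]
```

The same `similarity_matrix` function would be called twice per cell, once with the concept distributions and once with the topic distributions.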

3.5 Multi-modal Fusion

We combine the visual and textual modality using the late fusion approach proposed by Ah-Pine et al. [1]. For both modalities, new (cross-modal) similarity matrices are computed with a thresholding function to keep the 3 highest similarity values and replace the lower values by zeros.

These matrices are multiplied by the similarity matrix of the opposite modality. The original mono-modal and the newly computed cross-modal similarity matrices are all given a weight and summed to get a final pairwise similarity matrix. This is formulated in equation (1), where Sim is the final matrix.

$$Sim = \alpha_t S_t + \alpha_i S_i + \alpha_{it} Sim_{img-txt} + \alpha_{ti} Sim_{txt-img} \qquad (1)$$

$S_t$ and $S_i$ stand for the mono-modal similarity matrices (text and image), $Sim_{img-txt}$ and $Sim_{txt-img}$ for the cross-modal similarity matrices. The weighting of (1) is equal to the weighting scheme that the authors used: $\alpha_t = 5/12$, $\alpha_i = 1/4$, $\alpha_{it} = 1/4$, $\alpha_{ti} = 1/12$.
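A minimal sketch of the fusion in equation (1) is given below, using plain nested lists. Two points are assumptions rather than statements from the thesis: the thresholding is applied per row, and each cross-modal matrix is the thresholded matrix of the first-named modality multiplied by the full matrix of the other modality.

```python
def keep_top_k(matrix, k=3):
    """Per row, keep the k largest values and replace the rest by zeros."""
    result = []
    for row in matrix:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top = set(order[:k])
        result.append([v if j in top else 0.0 for j, v in enumerate(row)])
    return result


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fuse(s_t, s_i, a_t=5 / 12, a_i=1 / 4, a_it=1 / 4, a_ti=1 / 12, k=3):
    """Weighted sum of mono-modal and cross-modal similarities, equation (1)."""
    sim_img_txt = matmul(keep_top_k(s_i, k), s_t)  # assumed direction
    sim_txt_img = matmul(keep_top_k(s_t, k), s_i)  # assumed direction
    n = len(s_t)
    return [[a_t * s_t[r][c] + a_i * s_i[r][c]
             + a_it * sim_img_txt[r][c] + a_ti * sim_txt_img[r][c]
             for c in range(n)]
            for r in range(n)]
```

Because the four weights sum to one, two identical identity matrices fuse back into the identity matrix, which is a convenient sanity check.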

3.6 Creating Visual Summaries

In order to create visual summaries we adopt Affinity Propagation clustering. Affinity Propagation by Dueck and Frey is a clustering algorithm which identifies exemplars by passing messages between data points [4]. This is done in multiple iterations to find the best fit. The authors have evaluated the algorithm for unsupervised image categorisation. It performed well in this task, as it does not require a predefined number of clusters and is designed to find exemplars.

The previously computed matrix for each grid cell is clustered using the Affinity Propagation clustering algorithm implemented in Scikit-learn [14]. Given the variance in the number of images per cell and possible representative images, the non-predetermined number of clusters is a desirable feature of this algorithm. The cluster centroids for each grid cell are sorted by cluster size in descending order. In this order, it is checked whether the centroid images have a Creative Commons license, and the first one that does is selected as the representative image.
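The selection rule at the end of this step can be sketched as follows. The data layout (a mapping from each cluster centroid to its member images) and the license predicate are hypothetical; the clustering itself is assumed to have been done already.

```python
def pick_representative(clusters, is_creative_commons):
    """Return the centroid of the largest cluster that is Creative Commons.

    clusters: dict mapping a centroid image id to the list of image ids
              in its cluster (hypothetical layout).
    is_creative_commons: callable returning True for licensed image ids.
    Returns None when no centroid in the cell has a suitable license.
    """
    by_size = sorted(clusters, key=lambda c: len(clusters[c]), reverse=True)
    for centroid in by_size:
        if is_creative_commons(centroid):
            return centroid
    return None


clusters = {
    "img7": ["img7", "img1", "img2"],
    "img4": ["img4", "img5"],
    "img9": ["img9"],
}
licensed = {"img4", "img9"}
print(pick_representative(clusters, licensed.__contains__))
```

Here the largest cluster is skipped because its centroid is not licensed, so the centroid of the second-largest cluster is returned.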

3.7 Creating Textual Descriptions

To get a textual description for an area, the previously preprocessed text is used for each cell. Two methods to select relevant terms have been used: TF-IDF, and a combination of TF-IDF and equation (2) from the tag ranking method proposed by Li et al. [11]. Their method can be used to determine the relevance of image tags or to propagate tags to unannotated images based on their visual neighbours. Individual tagging can result in subjective tags; by comparing tags from multiple users, more objective tags can be identified. The evaluation results indicate that this is a robust method to improve image retrieval.


3.7.1 TF-IDF

A TF-IDF vector is computed using the Gensim framework, considering each grid cell as a single document. The 10 most relevant terms are stored (i.e. terms with the highest TF-IDF weights). TF-IDF in general discriminates well between terms that are used at a certain location and terms that are used in the entire city. Unfortunately, it also has the tendency to give a high rank to rare (often unwanted) tags.
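To make the per-cell ranking concrete, the sketch below treats each grid cell as one document and keeps the highest-weighted terms. It uses the textbook weight tf x log(N / df) as an assumption; Gensim's own weighting and normalisation may differ in detail, and the cell identifiers and terms are invented for the example.

```python
import math
from collections import Counter


def top_terms(cell_docs, n_terms=10):
    """Rank the terms of every cell by TF-IDF and keep the best n_terms.

    cell_docs: dict mapping a cell id to the list of tokens in that cell.
    """
    n_cells = len(cell_docs)
    # Document frequency: in how many cells does each term occur?
    df = Counter(t for tokens in cell_docs.values() for t in set(tokens))
    ranked = {}
    for cell, tokens in cell_docs.items():
        tf = Counter(tokens)
        scores = {t: c * math.log(n_cells / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked[cell] = sorted(scores, key=lambda t: (-scores[t], t))[:n_terms]
    return ranked


cells = {
    "a": ["amsterdam", "canal", "canal", "boat"],
    "b": ["amsterdam", "museum", "museum"],
    "c": ["amsterdam", "park"],
}
print(top_terms(cells))
```

The city-wide term "amsterdam" receives a weight of zero in every cell, while a term that appears once in a single cell still ranks highly, which is the behaviour criticised above.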

3.7.2 TF-IDF and Tag Ranking

To mitigate this effect, a combination of TF-IDF and the tag ranking equation (2), proposed by Li et al. [11], is used.

$$n_w[N_f(I, k)] - Prior(w, k) \qquad (2)$$

The operator $n_w$ counts the frequency of a tag $w$ in the subset computed by $N_f(I, k)$, which are the $k$ visual nearest neighbours of the input image. The input image $I$ is in our case the representative image for a grid cell. Attempting to apply both a geographical and a visual constraint (on grid cell level) did not produce meaningful results, because the set of candidate terms is too small. This was tested by selecting the representative image from a grid cell as $I$, clustering all the images in the cell (applying spectral clustering with K = 3) and using the terms from all the images in the same cluster as the representative. This may be different when using the visual nearest neighbours over the entire set, but that would mean ignoring the GPS coordinates of the images. Therefore, the number of images in the respective grid cell is used as $k$ in the formula. All their terms are used as candidate terms. A unique user constraint is not applied, as we have no user IDs of the Foursquare images. $Prior(w, k)$ is an approximation of how frequently a term occurs in the entire set.

TF-IDF does use the actual frequency, which to some extent promotes terms with geographical correspondence. As tag ranking tends to give a low rank to rare tags, the two can be combined to mitigate the shortcoming of TF-IDF. To do so, the results of both methods are combined using Borda count. Borda count does not directly discard terms in either ranking. By experimenting we noticed that the ranking of TF-IDF has more geographical correspondence, which we would like to retain as much as possible. Using the combination of methods and giving double votes to the TF-IDF ranking might improve the results over solely using TF-IDF.
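The combination described above can be sketched in two small functions. Several details are assumptions for the sake of a runnable example: $Prior(w, k)$ is approximated as $k$ times the fraction of images in the whole set that carry the tag, every image contributes a set of tags, and a term at position $p$ among $n$ candidates receives $n - p$ Borda points, doubled for the TF-IDF ranking.

```python
from collections import Counter


def tag_relevance(cell_tags, all_tags):
    """Score each tag in a cell with equation (2), taking k = images in cell.

    cell_tags / all_tags: lists holding one set of tags per image.
    """
    k = len(cell_tags)
    in_cell = Counter(t for tags in cell_tags for t in tags)
    overall = Counter(t for tags in all_tags for t in tags)
    # Assumed prior: expected count of the tag among k random images.
    return {t: in_cell[t] - k * overall[t] / len(all_tags) for t in in_cell}


def borda(weighted_rankings):
    """Merge best-first rankings; each is given as a (ranking, weight) pair."""
    candidates = {t for ranking, _ in weighted_rankings for t in ranking}
    n = len(candidates)
    points = Counter()
    for ranking, weight in weighted_rankings:
        for position, term in enumerate(ranking):
            points[term] += weight * (n - position)
    return sorted(candidates, key=lambda t: (-points[t], t))


tfidf_ranking = ["dam", "xyz123", "tram", "palace"]
tag_ranking = ["tram", "palace", "dam", "xyz123"]
merged = borda([(tfidf_ranking, 2), (tag_ranking, 1)])
print(merged)
```

In this invented example the rare tag "xyz123" drops from second to third place, while the order preferred by TF-IDF is otherwise largely retained.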

4. PROTOTYPE DESIGN

4.1 Mobile Application

The prototype is an Android app for smartphones and smartwatches, although the latter is not formally evaluated due to the scarcity of potential participants with such a device.

Figure 4: Screenshot of screen to select locations

It is a native Android app which is developed in Java. The app allows the user to query locations (or use the current location), to select places in the vicinity or on a route. Figure 4 contains a screenshot of the app. For both features, the location coordinates are used to query a REST API endpoint.

4.2 REST API

The API endpoints return lists of 256 x 256 pixel image thumbnails, their GPS coordinates, as well as the frequently used tags in the area. We use a PostgreSQL database and the Django framework, utilising the geospatial features of PostGIS and GeoDjango. On the server, a graph is used to get the neighbour nodes of a node/grid cell that contains the user coordinates (for the explore function), or a route is computed using breadth-first search. Alternative routes are computed by selecting different neighbour nodes of the origin node.
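The unweighted routing can be illustrated with a standard breadth-first search over an adjacency mapping. The graph layout and node names below are hypothetical; the thesis does not specify how the grid graph is stored or which cells count as neighbours.

```python
from collections import deque


def bfs_route(neighbours, origin, destination):
    """Return a route with the fewest cells from origin to destination.

    neighbours: dict mapping a node to an iterable of adjacent nodes.
    Returns None when the destination cannot be reached.
    """
    parents = {origin: None}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if node == destination:
            route = []
            while node is not None:
                route.append(node)
                node = parents[node]
            return route[::-1]
        for nxt in neighbours.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_route(graph, "A", "D"))
```

An alternative route in the sense described above could be obtained by removing the first neighbour used ("B" in this example) from the origin's adjacency list and searching again.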


As a user also has the option to avoid crowds, there is a weighted version of the same graph which uses the number of images per cell for each node as a proxy for crowdedness. Dijkstra’s shortest path algorithm is used for the weighted version. The edges between the nodes are weighted with equation (3).

$$1 - \exp(C_1 + C_2) \qquad (3)$$

where $C_1$ is the number of images for the first cell and $C_2$ the number of images for the second cell. The weighting and the analysis steps described in Section 3 are precomputed offline, to keep the computations needed online to a minimum. Improving the measure for crowdedness is part of future work.
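For illustration, a crowd-avoiding route can be sketched with a textbook implementation of Dijkstra's algorithm. Since that algorithm assumes non-negative edge weights, the sketch takes the weight as a caller-supplied function; the example uses the plain sum of the image counts of both cells as a stand-in, which is an assumption for this sketch and not equation (3). Node names and counts are invented.

```python
import heapq


def dijkstra(neighbours, weight, origin, destination):
    """Return the cheapest route, or None if the destination is unreachable.

    neighbours: dict mapping a node to an iterable of adjacent nodes.
    weight: callable (u, v) -> non-negative cost of the edge u-v.
    """
    dist = {origin: 0.0}
    parents = {origin: None}
    heap = [(0.0, origin)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == destination:
            break
        if d > dist[node]:
            continue  # stale queue entry
        for nxt in neighbours.get(node, ()):
            candidate = d + weight(node, nxt)
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                parents[nxt] = node
                heapq.heappush(heap, (candidate, nxt))
    if destination not in parents:
        return None
    route, node = [], destination
    while node is not None:
        route.append(node)
        node = parents[node]
    return route[::-1]


counts = {"A": 1, "B": 50, "C": 2, "D": 1}  # images per cell (invented)
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}


def crowd_weight(u, v):
    """Stand-in crowdedness weight: sum of the image counts of both cells."""
    return counts[u] + counts[v]


print(dijkstra(graph, crowd_weight, "A", "D"))
```

The route through the busy cell "B" costs 102 under this stand-in weight, against 6 for the route through "C", so the quieter route is returned.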

4.3 User Interface

The images in the response from the API are shown as circular images on Google Maps. When a user interacts with the map by tapping on one of the images, the image is enlarged and an info-box with the most frequently used tags in the area is shown. Figure 5 illustrates this functionality.

Figure 5: Screenshot of info-box

If the smartphone is paired with a smartwatch, the images are shown as a slideshow on the smartwatch, which can be seen in Figure 6.

Figure 6: Slideshow on smartwatch

5. EVALUATION

As no ground-truth data is available, automated validation methods are not an option. A user study could provide valuable insights into how users perceive the system, both from a conceptual and a performance point of view. Those two dimensions are evaluated by e.g. Kofler et al. in [9]. To evaluate the approach in a user study, a prototype app is created. Two use-cases are featured in the prototype. Users can get images and descriptions for places in their vicinity or obtain the images and descriptions of places on a route between two locations. In the route mode, users have the option to obtain alternative routes. The prototype is tested by participants who are selected by convenience sampling. The participants could try the prototype and fill in a questionnaire using a button in the app. The questionnaire contains questions related to the concept, performance and the demographics of the participants.

The evaluation has been performed with 7 participants, who tested the application in Amsterdam and filled in the questionnaire afterwards. Five of the participants were male and 2 were female. There are 4 participants in the age group 25-34, 2 in the age group 55-64 and 1 in the age group 65-74. The questionnaire they filled in consisted of Likert scale questions with 5 categories: 1 = Strongly disagree, 2 = Disagree, 3 = Neither agree nor disagree, 4 = Agree and 5 = Strongly agree. Figure 7 contains a chart of the responses for each statement. The bars represent the percentages of respondents that selected the respective options. Each variable (V) corresponds to a numbered statement within brackets in the text. These brackets furthermore contain the median (Mdn) and interquartile range (IQR) as measures of central tendency and dispersion.


Figure 7: Questionnaire responses

5.1 Concept

Most participants agreed with the statement “A system like this is helpful for me when I am sightseeing” (V1, Mdn=4, IQR=1). Opinions are divided on the statement “In comparison with other travel information sources (e.g. maps, travel guides, tourist office) this system has added value”. Three participants agreed, 2 neither agreed nor disagreed and 2 disagreed (V2, Mdn=4, IQR=2). In relation to the visual summaries, the following statements were included. “The majority of pictures that are retrieved gave me a clear impression about the areas that I have seen”: most of the participants agreed on this (V3, Mdn=4, IQR=1). “The ability to get images taken in my vicinity (surrounding area) has added value” was also agreed on by most participants (V4, Mdn=4, IQR=0). All participants agreed on “The ability to get images taken in areas on a route between two places has added value” (V5, Mdn=4, IQR=0). When asked about tags with the statement “The info boxes with frequently used social media tags in an area have added value”, 4 participants agreed, 1 neither agreed nor disagreed and 2 disagreed (V6, Mdn=4, IQR=2). Opinions are divided over “Tags that might be offensive must be removed”: 3 participants agreed, 2 neither agreed nor disagreed and 2 disagreed (V7, Mdn=3, IQR=2). This might be due to the fact that offensive is a subjective term and there are many types of offensiveness. A more specific understanding of the preferences with regard to offensive tags requires further research.

Most participants agreed with the statement “I would like to get personalised route recommendations” (V8, Mdn=4, IQR=1). On the statement “I would use a system like this to relive a past trip” almost all participants expressed their agreement (V9, Mdn=4, IQR=0). Opinions were divided on the statement “I would like an additional feature automatically generating information about past trips for me based on my own travel photos”, with which 3 participants agreed, 2 neither agreed nor disagreed and 2 disagreed (V10, Mdn=3, IQR=2). This question may have been difficult to understand without an example of how such a feature could work in practice.
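As a worked example of the reported statistics, the sketch below recomputes the median and IQR for V7 from the response counts given above (3 agreed, 2 neither agreed nor disagreed, 2 disagreed). The quartile method is an assumption: Python's default "exclusive" method is used here, and other conventions can yield a different IQR for such a small sample.

```python
import statistics


def median_and_iqr(responses):
    """Return (median, interquartile range) of Likert-coded responses."""
    q1, _, q3 = statistics.quantiles(responses, n=4)  # default: exclusive
    return statistics.median(responses), q3 - q1


# V7: 3 x Agree (4), 2 x Neither (3), 2 x Disagree (2)
v7 = [4] * 3 + [3] * 2 + [2] * 2
print(median_and_iqr(v7))
```

Under this convention the result agrees with the values reported for V7 (Mdn=3, IQR=2).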

5.2 Performance

All participants agreed with the statement “The overall performance of the application is good” (V11, Mdn=4, IQR=0). However, opinions are divided over “The explore function retrieves images in a reasonable time”. With this statement, 2 participants agreed, 4 neither agreed nor disagreed and 1 strongly disagreed (V12, Mdn=3, IQR=1). With the statement “The get route function retrieves images in a reasonable time”, 4 participants agreed and 3 neither agreed nor disagreed (V13, Mdn=4, IQR=1). As a substantial amount of data has to be downloaded when using these functions, the user experience is strongly dependent on the available data connection. The opinions on the statement “The application crashed or froze frequently” are very divided: 1 strongly agreed, 1 agreed, 3 disagreed and 1 strongly disagreed (V14, Mdn=2, IQR=3). Identifying the cases in which users experience these issues requires further analysis.

5.3 Suggestions for Improvement

In an open question, users have been asked for suggestions on how to improve the app. Regarding the screen to search for places, it was stated that a list to choose places from would be helpful. “A list of popular places you can choose from, not everyone knows the correct Dutch name (of locations).” Searching for origin or destination locations requires that users know the names of the locations. Therefore, such a list would indeed be a good improvement. “An option to see more pictures for a specific location and the ability to view pictures full screen.” Retrieving more images for a location is already incorporated in the REST API and a full screen viewer will be added in the app. “I miss a visual line I must follow to come to a next interesting place.” A drawback of such a line is that the user interface becomes overloaded with items. However, Google Maps has a feature to open a new screen and get turn-by-turn directions to a selected place, which could be an interesting solution. “Information about places near the route would be nice. Instead of using tags I would prefer links.”


With regard to the terms, the aim of this project was to propose an approach to textually summarise locations on a route. Aggregating terms in links could be a next step. For example, download links for museum (audio) tour apps can be included.

5.4 Discussion

Data analysis techniques, like the ones used here, require a substantial amount of data to be effective. Cities with a small number of user-contributed images were already mentioned as a limitation in Popescu et al. [16]. Additionally, only a fraction of images on Flickr have a Creative Commons license, for Amsterdam at least. Here, only cluster centroids with such a license are used for visualisation purposes. At the expense of being less representative for a cluster, a wider range of candidate images for visualisation could be obtained by considering the nearest neighbours of a cluster centroid as well. The evaluation has been performed with the textual description method described in Section 3.7.1 for the majority of participants (N = 6). The goal was to evaluate both methods with equal-sized subgroups of participants. Unfortunately, 7 was the maximum number of participants with a supported Android smartphone that could be found. This may have an impact on question V6, where the one participant who had tested the method described in Section 3.7.2 agreed with the statement. The statement about the need for removal of offensive tags (V7) was included, because offensive crowd-contributed information in Google Maps search results recently caused debate. The dataset does contain a few tags that could be considered offensive or politically incorrect.

6. CONCLUSION

We proposed an approach for automatic visual summarisation and textual description of touristic routes. We visually summarise 125 × 125 metre geographical areas by selecting a representative image, based on semantic concepts in the content and latent topics in the annotations, i.e. title, tags and description. Subsequently, we textually describe these areas based on the aforementioned annotations. Using a graph-based approach, we then enable routing, offering a rich impression of the areas between two locations. We adopted a realistic use-case of city exploration, implemented a prototype app and evaluated it with a user study conducted in Amsterdam. In general, the visual summaries did give participants in our user study a clear impression of the routes. There was less consensus about the effectiveness of the method to describe places. The descriptive terms tend to be noisy. A comment in the evaluation furthermore indicated that the terms may need to be aggregated in links. Nonetheless, most participants perceived the application as helpful when sightseeing.
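To make the idea of routing over 125 × 125 metre areas concrete, the sketch below maps coordinates to grid cells and finds a shortest sequence of adjacent cells. It is an illustration under stated assumptions only: the equirectangular approximation, the 8-connected neighbourhood, the breadth-first search and the names cell_of and route are choices made for this example and do not reproduce the graph construction used in the prototype.

```python
import math
from collections import deque

CELL_SIZE_M = 125.0          # cell edge length, as in the approach
EARTH_RADIUS_M = 6371000.0   # mean Earth radius

def cell_of(lat, lon, origin_lat, origin_lon):
    """Map a coordinate to the (row, col) of a grid anchored at an
    origin, using an equirectangular approximation (an assumption that
    is reasonable at city scale)."""
    dy = math.radians(lat - origin_lat) * EARTH_RADIUS_M
    dx = (math.radians(lon - origin_lon) * EARTH_RADIUS_M
          * math.cos(math.radians(origin_lat)))
    return (math.floor(dy / CELL_SIZE_M), math.floor(dx / CELL_SIZE_M))

def route(cells, start, goal):
    """Breadth-first shortest path over 8-connected cells.
    `cells` is the set of cells that can be traversed (hypothetical
    input). Returns the list of cells from start to goal, or None."""
    if start not in cells or goal not in cells:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:   # walk back to the start
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nxt = (r + dr, c + dc)
                if nxt != cur and nxt in cells and nxt not in parent:
                    parent[nxt] = cur
                    queue.append(nxt)
    return None
```

Each cell on the returned path could then be shown with its representative image and descriptive terms.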

Although functionality and retrieval performance need to be improved to have more added value in comparison with other travel information sources, the concept appears to have potential. The extent to which user-selected routes in a city can be automatically visually summarised and described using user-contributed images and their annotations is mainly dependent on the availability of data in an area. For areas outside the city centre, this appeared to be a problem. By using a wider range of candidate images in a cluster, this could be mitigated to some extent. With user-contributed terms, it is also important to address offensive content, for instance through user feedback. Mining tourist information from social media will most likely continue to be an interesting area of multimedia research.

Future work will consist of improving the informativeness of descriptive terms. The visual and geographical correspondence of the images might be exploited by first applying the tag relevance approach by Li et al. [11] to visual neighbours and subsequently performing TF-IDF with a geographical constraint as we described in Section 3.7.1. A related potential improvement is generating a list of stopwords with no correspondence to a geographical area from a large dataset. Additionally, the static information could be complemented with a real-time Twitter feed about events at the location. The same applies to measuring crowdedness. The number of images per cell could be a rough indication but is not robust to temporal fluctuations. Adding real-time travel information could result in an evident added value over other travel information sources.
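One way to read the combination of geographically constrained TF-IDF and geography-independent stopwords is sketched below. It treats each grid cell as a document, which is an assumption of this example and not necessarily the formulation of Section 3.7.1. The function name cell_tfidf and the stop_fraction threshold are hypothetical.

```python
import math
from collections import Counter

def cell_tfidf(cell_terms, stop_fraction=0.5):
    """Score terms per cell with TF-IDF, treating every cell as one
    document (assumption). Terms occurring in more than `stop_fraction`
    of all cells are treated as stopwords without geographical
    correspondence and are left out of the scores.

    cell_terms : dict mapping a cell id to a list of terms
    returns    : (scores, stopwords), where scores maps each cell id
                 to a dict {term: tf-idf score}
    """
    n_cells = len(cell_terms)
    # Document frequency: in how many cells does a term occur?
    df = Counter()
    for terms in cell_terms.values():
        df.update(set(terms))
    stopwords = {t for t, d in df.items() if d / n_cells > stop_fraction}

    scores = {}
    for cell, terms in cell_terms.items():
        tf = Counter(t for t in terms if t not in stopwords)
        total = sum(tf.values())
        scores[cell] = {
            t: (count / total) * math.log(n_cells / df[t])
            for t, count in tf.items()
        }
    return scores, stopwords
```

With this reading, a term such as the city name, which appears in most cells, would be filtered out, while terms concentrated in a few cells receive high scores.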

7. ACKNOWLEDGEMENTS

I gratefully thank Dr. Stevan Rudinac for his help and supervision during this project, the University of Amsterdam for the education in multimedia information systems, Flickr and Foursquare for sharing their data, and the participants in the user study for their valuable insights for further improvement.

8. REFERENCES

[1] Ah-Pine, J., Clinchant, S., Gabriela, C., and Liu, Y. XRCE's participation in ImageCLEF 2009. In Working Notes of CLEF 2009 (2009).

[2] Chen, W.-C., Battestini, A., Gelfand, N., and Setlur, V. Visual summaries of popular landmarks from community photo collections. In Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on (2009), IEEE, pp. 1248–1255.

[3] Cheng, A.-J., Chen, Y.-Y., Huang, Y.-T., Hsu, W. H., and Liao, H.-Y. M. Personalized travel recommendation by mining people attributes from community-contributed photos. In Proceedings of the 19th ACM international conference on Multimedia (2011), ACM, pp. 83–92.

[4] Dueck, D., and Frey, B. J. Non-metric affinity propagation for unsupervised image categorization. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on (2007), IEEE, pp. 1–8.

[5] El Ali, A., van Sas, S. N., and Nack, F. Photographer paths: sequence alignment of geotagged photos for exploration-based route planning. In Proceedings of the 2013 conference on Computer supported cooperative work (2013), ACM, pp. 985–994.

[6] Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (1996), vol. 96, pp. 226–231.

[7] Gavalas, D., Konstantopoulos, C., Mastakas, K., and Pantziou, G. Mobile recommender systems in tourism. Journal of Network and Computer Applications 39 (2014), 319–333.

[8] Hoffman, M., Bach, F. R., and Blei, D. M. Online learning for latent Dirichlet allocation. In Advances in neural information processing systems (2010), pp. 856–864.

[9] Kofler, C., Caballero, L., Menendez, M., Occhialini, V., and Larson, M. Near2me: An authentic and personalized social media-based recommender for travel destinations. In Proceedings of the 3rd ACM SIGMM international workshop on Social media (2011), ACM, pp. 47–52.

[10] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (2012), pp. 1097–1105.

[11] Li, X., Snoek, C. G., and Worring, M. Learning social tag relevance by neighbor voting. Multimedia, IEEE Transactions on 11, 7 (2009), 1310–1322.

[12] Li, X., Wu, C., Zach, C., Lazebnik, S., and Frahm, J.-M. Modeling and recognition of landmark image collections using iconic scene graphs. In Computer Vision–ECCV 2008. Springer, 2008, pp. 427–440.

[13] Okuyama, K., and Yanai, K. A travel planning system based on travel trajectories extracted from a large number of geotagged photos on the web. In The Era of Interactive Media. Springer, 2013, pp. 657–670.

[14] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[15] Pippig, K., Burghardt, D., and Prechtel, N. Semantic similarity analysis of user-generated content for theme-based route planning. Journal of Location Based Services 7, 4 (2013), 223–245.

[16] Popescu, A., Grefenstette, G., and Moëllic, P.-A. Mining tourist information from user-supplied collections. In Proceedings of the 18th ACM conference on Information and knowledge management (2009), ACM, pp. 1713–1716.

[17] Quercia, D., Schifanella, R., and Aiello, L. M. The shortest path to happiness: Recommending beautiful, quiet, and happy routes in the city. In Proceedings of the 25th ACM conference on Hypertext and social media (2014), ACM, pp. 116–125.

[18] Rudinac, S., Hanjalic, A., and Larson, M. Generating visual summaries of geographic areas using community-contributed images. IEEE Transactions on Multimedia 15, 4 (2013), 921–932.

[19] Rudinac, S., Larson, M., and Hanjalic, A. Learning crowdsourced user preferences for visual summarization of image collections. Multimedia, IEEE Transactions on 15, 6 (2013), 1231–1243.

[20] Rudinac, S., and Worring, M. Making use of semantic concept detection for modelling human preferences in visual summarization. In Proceedings of the 2014 International ACM Workshop on Crowdsourcing for Multimedia (2014), ACM, pp. 41–44.

[21] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. arXiv preprint arXiv:1409.4842 (2014).

[22] Řehůřek, R., Sojka, P., et al. Software framework for topic modelling with large corpora. LREC (2010), 45–50.

[23] Vu, H. Q., Li, G., Law, R., and Ye, B. H. Exploring the travel behaviors of inbound tourists to Hong Kong using geotagged photos. Tourism Management 46 (2015), 222–232.

[24] Zahálka, J., Rudinac, S., and Worring, M. New Yorker Melange: Interactive brew of personalized venue recommendations. In Proceedings of the ACM International Conference on Multimedia (2014), ACM, pp. 205–208.
