Symeon Papadopoulos, Emmanouil Schinas, Theodoros Mironidis, Katerina Iliakopoulou, Eleftherios Spyromitros-Xioufis, Ioannis Tsampoulatidis and Yiannis Kompatsiaris

Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
{papadop, manosetro, mironidis, ailiakop, espyromi, itsam, ikom}@iti.gr

ABSTRACT We present a framework for targeted collection, indexing and browsing of shared media content. Such a framework is valuable in several use cases involving the tracking of social content related to an entity of interest, e.g. a topic, person, location or event. When the entity of interest is popular or trending, massive amounts of content are shared (including large amounts of near-duplicate content and spam), resulting in information overload for the end users. To this end, the described framework supports configurable targeted crawling and indexing of social multimedia content in tandem with real-time analysis and summarization. The framework is based on a real-time distributed architecture, including very efficient image indexing and clustering implementations. We evaluate the framework on a large-scale case study around the #OccupyGezi events.

1 INTRODUCTION The ubiquitous use of social media platforms for capturing and sharing people's opinions, information and experiences holds the potential of transforming online user communities into sensors of real-world phenomena and events [1]. However, the fast pace, the huge volume and the unpredictable nature of user-contributed content make it extremely challenging to obtain informative views of evolving news stories and events in real time and to quickly surface relevant media content. To this end, we present a framework that supports highly customizable crawling, indexing and browsing of social media content with the goal of facilitating the discovery of newsworthy and interesting social multimedia.
The presented framework employs a hybrid strategy for quickly fetching relevant content, comprising both a stream-based and a query-driven approach. In addition, it integrates very efficient and scalable image indexing and clustering implementations. We evaluate the media retrieval capabilities of the proposed framework in the context of the #OccupyGezi events. The framework will be made available as an open-source project through the SocialSensor GitHub account (https://github.com/socialsensor).

2 RELATED WORK The NUS-Tsinghua Extreme Search Centre (NExT) has recently presented a system for the large-scale crawling and indexing of social media content [2]. It supports many of the functionalities featured by the proposed framework, but it lacks the crawling customization capabilities, since its set of monitored sources is predefined. MediaFinder [3] offers this missing capability, i.e. it supports query-driven media discovery. However, it completely lacks functionalities for the persistent monitoring and indexing of content. As a result, query response times are very long and the results are one-off, i.e. there is no way to keep them up to date (e.g. by having the query run periodically in the background). Finally, in terms of the employed visual indexing and clustering technology, our system is similar to the event detection system of [4], which, however, is limited to media content published by online news sources, thus lacking the social media aspect in the coverage of news.

3 FRAMEWORK OVERVIEW The crawler is responsible for the collection of data and content from online sources in the form of Items (posts made on a social platform, e.g. tweets), WebPages (URLs embedded in collected Items) and MediaItems (images/videos embedded in Items or WebPages), given a set of CrawlingSpecs (a set of arguments specifying what to crawl, e.g. a hashtag on Twitter.
At the moment, there are three types of such arguments: KeywordsFeed, LocationFeed and SourceFeed, for collecting content around keywords/hashtags, locations and users/accounts respectively) as input arguments. The proposed crawling and indexing framework, depicted in Figure 1, comprises the following components: (a) Item collection, comprising stream-based (stream-manager) and query-driven (search-manager) components; (b) structured data repositories based on mongoDB and Solr; (c) fetching and indexing components, including WebPage fetching, article extraction, MediaItem detection and extraction, feature extraction and visual indexing; (d) aggregation components, including geo-based clustering, visual clustering and analytics. Below, we briefly describe how these components operate within the proposed framework.

Fig. 1: Crawler architecture

Item Collection: The stream-manager process adopts a stream monitoring paradigm, i.e. given a set of CrawlingSpecs, it 'listens' for new Item events. The search-manager, on the other hand, operates in a query-driven mode. In particular, it receives custom CrawlingSpecs (manually entered or automatically generated) and formulates appropriate queries to different social networks (currently Facebook, Flickr, YouTube, Tumblr, Google+ and Instagram).

Structured data repositories: Data access is implemented by appropriate wrapper methods around the mongoDB and Solr repositories. Item and MediaItem metadata are stored in mongoDB, and most of their fields are indexed in Solr. At the moment, a significant amount of data redundancy exists between the two repositories in order to speed up retrieval (i.e. a single request both finds and fetches the MediaItems and the accompanying metadata). A number of convenience access methods have been implemented on top of the respective repository clients.
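The three feed types accepted as CrawlingSpecs can be pictured as a small data model. The following is a minimal Python sketch; the class and field names mirror the terminology above, but the concrete fields (e.g. `radius_km`, `since`) are illustrative assumptions, not the framework's actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Feed:
    """Base class for a single CrawlingSpecs argument."""
    since: str = ""  # optional ISO date to crawl from (assumed field)

@dataclass
class KeywordsFeed(Feed):
    keywords: List[str] = field(default_factory=list)  # keywords/hashtags

@dataclass
class LocationFeed(Feed):
    latitude: float = 0.0
    longitude: float = 0.0
    radius_km: float = 1.0  # search radius around the point (assumed)

@dataclass
class SourceFeed(Feed):
    account: str = ""  # user/account to monitor

@dataclass
class CrawlingSpecs:
    feeds: List[Feed] = field(default_factory=list)

# Example: monitor a hashtag plus one account (names are hypothetical).
specs = CrawlingSpecs(feeds=[
    KeywordsFeed(keywords=["#occupygezi", "gezi park"]),
    SourceFeed(account="some_news_account"),
])
```

Both the stream-manager and the search-manager would consume such a specification, the former to filter incoming events and the latter to formulate per-network queries.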
MediaItem fetching and indexing: MediaItems enter the system in two ways: (a) MediaItems embedded in the collected Items, and (b) MediaItems extracted from shared links (WebPages). In the first case, MediaItems go through the following steps before their indexing is complete.
* Article extraction: This step checks whether the URL corresponds to an Article and, if so, whether it contains media content.
* Update in mongoDB: Here, further fields of the newly created MediaItem are updated, including its title, description and status. The WebPage fields title, text, status, media, domain, expandedUrl, isArticle and quality are also updated.

The above fetching and indexing steps are implemented as Bolts on top of Storm.
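The chaining of these steps can be sketched as a sequence of bolt-like stages. The Python snippet below is only an illustration with stand-in logic: the article heuristic, the field handling and the in-memory store are assumptions, while the real framework runs these stages as Storm Bolts against mongoDB:

```python
# Minimal, Storm-like chain of processing stages for one shared URL.
# Each stage takes a tuple (here a dict), enriches it and passes it on.

def article_extraction(tuple_):
    """Mark whether the WebPage is an Article and how much media it embeds.
    The length-based heuristic below is a stand-in, not the real detector."""
    page = tuple_["webpage"]
    page["isArticle"] = len(page.get("text", "")) > 200
    page["media"] = len(page.get("image_urls", []))
    return tuple_

def update_repository(tuple_, db):
    """Persist the enriched WebPage fields; `db` is an in-memory dict
    standing in for the mongoDB collection."""
    page = tuple_["webpage"]
    db[page["expandedUrl"]] = {
        "title": page.get("title", ""),
        "status": "processed",
        "isArticle": page["isArticle"],
        "media": page["media"],
    }
    return tuple_

# Run one hypothetical WebPage through the two stages.
db = {}
t = {"webpage": {"expandedUrl": "http://example.com/story",
                 "title": "A story",
                 "text": "x" * 500,
                 "image_urls": ["http://example.com/a.jpg"]}}
update_repository(article_extraction(t), db)
```

In the actual topology each stage would emit its tuple to the next Bolt over the Storm stream rather than being called directly.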
4 CASE STUDY: MEDIA COLLECTION AROUND #OCCUPYGEZI EVENTS To assess the effectiveness of the proposed media crawling and indexing framework, we conducted a large-scale media collection case study around the #OccupyGezi events (http://en.wikipedia.org/wiki/2013_protests_in_Turkey). These events refer to a series of massive protests in Turkey, sparked by outrage at the brutal eviction of a sit-in at Taksim Gezi Park held in protest against the park's urban development plan. The protests started on 28 May 2013 and quickly spread both to different neighbourhoods of Istanbul and to other Turkish cities, and even to large cities around the globe. The events continued until 31 July. We started the crawling exercise on June 4th and continued until July 17th. The crawling was conducted around a set of hand-picked keywords and a small number of selected accounts on Facebook and YouTube. Table 1 lists the crawling configuration for each of the monitored streams.

The framework components were distributed across three machines. More specifically, the Hadoop installation (serving the fetching, feature extraction and indexing of images) spanned all three machines, while the mongoDB, Solr and image NN search services were installed on a single machine. A simple web user interface for presenting and searching over the collected data was served by a fourth machine.

Table 1: Crawling configuration

Figure 2 illustrates the number of newly collected Items and MediaItems (per hour) over time. The timelines were created based on the Item timestamps; therefore, they also contain a few items that were created before June 4th. Note that the numbers depicted by the timelines refer to all incoming Items (including retweets) and to Items containing media content. A total of 16,785,785 Items and 319,095 MediaItems were collected, spanning the interval (Jun 1 - Jul 17).

Fig. 2: Frequency of Item and MediaItem arrival (per hour) over time.
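The per-hour timelines of Figure 2 amount to bucketing Item timestamps into hourly bins. A minimal sketch of that aggregation, with toy data standing in for the real collection:

```python
from collections import Counter
from datetime import datetime

def hourly_timeline(timestamps):
    """Bucket Item creation timestamps into per-hour counts,
    as done for the arrival-frequency timelines."""
    counts = Counter(ts.replace(minute=0, second=0, microsecond=0)
                     for ts in timestamps)
    # Return the buckets in chronological order.
    return dict(sorted(counts.items()))

# Toy data: three Items, two of them within the same hour.
items = [datetime(2013, 6, 4, 10, 15),
         datetime(2013, 6, 4, 10, 47),
         datetime(2013, 6, 4, 11, 5)]
timeline = hourly_timeline(items)
```

Because the bucketing keys on the Item's own creation timestamp, Items created before the crawl started (e.g. retweeted older posts) naturally appear in earlier buckets, which explains the pre-June-4th entries in the figure.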
Figure 3 provides a map-based view of the event at different zoom levels (based on the subset of images that are geotagged). From the figure, it becomes apparent that although the events were most intense in Istanbul (and Gezi Park in particular), there was considerable activity in other major Turkish cities (Ankara, Bursa, Izmir, Antalya, Eskisehir, Bodrum), and even in large European (London, Berlin, Paris) and North American (New York, San Francisco, Boston, Toronto) cities.

Fig. 3: Map-based views of the event.

To assess the media coverage of the crawler, as well as the retrieval capabilities of the visual indexing component, we devised the following test. We selected images related to the event from four different sources (listed in Table 2) and used them as visual queries to the NN search component. For each query, we manually checked whether the top 20 results contained the image or a near-duplicate, and at which positions. We then computed a coverage score (denoted C) by dividing the number of queries with at least one correct result (QC) by the total number of queries (Q). Note that this is a challenging test, since the crawler monitored a limited set of entities, while the queries were effectively selected from the whole Web and were diverse in nature. Figure 4 illustrates a few example queries for each source. Table 2 provides information on the selected sources, the number of queries per source and the selection method for the queries.

Table 2: Sources used for selecting image queries

Fig. 4: Samples of the used image queries.

Table 3 reports the coverage achieved by the crawler for each source. We expect these scores to slightly underestimate the coverage of the crawler, since the crawling period did not cover the full duration of the events (especially the first days, which were intensive in terms of media sharing activity). The lowest coverage (~0.32) is obtained for queries from Google image search.
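The coverage score C = QC/Q can be computed directly from per-query relevance judgments; a minimal sketch, where the query/result representation and the toy judgments are assumptions for illustration:

```python
def coverage(results_per_query, is_relevant):
    """Coverage C = QC / Q: the fraction of queries whose top-20
    results contain the query image or a near-duplicate of it."""
    q = len(results_per_query)
    qc = sum(1 for query, results in results_per_query.items()
             if any(is_relevant(query, r) for r in results[:20]))
    return qc / q

# Toy judgments: 1 marks a correct (near-duplicate) result, 0 an incorrect one.
# Three of the four queries have at least one hit, so C should be 0.75.
hits = {"q1": [1], "q2": [0, 1], "q3": [0], "q4": [1]}
c = coverage(hits, lambda query, result: result == 1)
```

In the actual evaluation the judgments came from manual inspection of the top-20 NN search results rather than precomputed labels.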
A closer look at the failed queries revealed that they mostly corresponded to photos posted on news portals (BBC, euronews, New Yorker) and on a freelance photojournalism portal (demotix.com). For the rest of the sources, coverage ranges from 0.5 to 0.95. A potential explanation for the higher percentages is the nature of these sources: the source with the highest coverage is a tumblr account, which makes its images much more likely to be shared through social networks than those of news sites. Source #4 (a collection of artistic depictions of the protests) also features highly viral content, which is potentially attractive to social network users. Source showdiscontent.com constitutes a 'median' case, in the sense that its photos resemble those featured on professional news sites, but at the same time it is closer to the social networks due to its positive stance towards the protesters (making it more social-network-friendly and a better candidate for sharing).

A second observation pertains to the quality of ranking for similar images. In the vast majority of cases where at least one result was the same as, or very similar to, the query image, the correct results were ranked above the incorrect ones. In only five (out of 87) cases did we find a few irrelevant images ranked above relevant ones. It is also noteworthy that in all cases the response times were sub-second, typically in the range of 100-200 msec. The above observations indicate that the NN search component is both accurate and highly efficient.

A final noteworthy finding of this study pertains to the emergence of visual memes [9]. It is often the case that popular images are retouched and re-shared by social network users, leading to very interesting (and often hilarious) means of self-expression. Two such examples, discovered with the help of image clustering on the #OccupyGezi collection, are illustrated in Figure 5.

Fig. 5: Visual memes discovered with the help of image clustering.

5 CONCLUSION The presented framework offers a versatile and effective solution to the problem of social multimedia crawling and indexing. It enables highly configurable crawling tasks and is built on top of scalable and highly efficient technologies. The evaluation of the crawler in the context of the #OccupyGezi events demonstrated that it is an effective tool for collecting diverse content from social networks and for browsing and searching it in multiple ways. In the future, we plan to work on three important improvements: (a) providing better support for the collection of professional news media content from major news sites (e.g. with the help of RSS feeds), (b) enabling the configuration of separate crawling tasks from a user-friendly web interface, and (c) distributing the stream management logic, enabling crawling with multiple (user-provided) API keys in order to support a large number of independent crawling tasks without violating the social network API limits. In addition, we plan to conduct more comprehensive evaluations of the crawling and retrieval capabilities of the framework in numerous case studies, involving professional end users (e.g. journalists).

Acknowledgment
STCSN E-Letter Vol. 1 No. 3