Social Multimedia Crawling and Search

Symeon Papadopoulos, Emmanouil Schinas, Theodoros Mironidis, Katerina Iliakopoulou, Eleftherios Spyromitros-Xioufis, Ioannis Tsampoulatidis and Yiannis Kompatsiaris
Information Technologies Institute, Centre for Research and Technology Hellas
Thessaloniki, Greece
{papadop, manosetro, mironidis, ailiakop, espyromi, itsam, ikom}

We present a framework for targeted collection, indexing and browsing of shared media content. Such a framework is valuable in several use cases involving the tracking of social content related to an entity of interest, e.g. topic, person, location or event. When the entity of interest is popular or trending, massive amounts of content are shared (including large amounts of near-duplicate content and spam) resulting in information overload for the end users. To this end, the described framework supports configurable targeted crawling and indexing of social multimedia content in tandem with real-time analysis and summarization. The framework is based on a real-time distributed architecture, including very efficient image indexing and clustering implementations. We evaluate the framework on a large-scale case study around the #OccupyGezi events.

The ubiquitous use of social media platforms for capturing and sharing people's opinions, information and experiences holds the potential of transforming online user communities into sensors of real-world phenomena and events [1]. However, the fast pace, the huge volume and the unpredictable nature of user-contributed content make it extremely challenging to obtain informative views of evolving news stories and events in real time and to quickly surface relevant media content. To this end, we present a framework that supports highly customizable crawling, indexing and browsing of social media content with the goal of facilitating the discovery of newsworthy and interesting social multimedia. The presented framework employs a hybrid strategy for quickly fetching relevant content, comprising both a stream-based and a query-driven approach. In addition, it integrates very efficient and scalable image indexing and clustering implementations. We evaluate the media retrieval capabilities of the proposed framework in the context of the #OccupyGezi events. The framework will be made available in the form of an open-source project through the SocialSensor GitHub account.

The NUS-Tsinghua Center for Extreme Search (NExT) has recently presented a system for the large-scale crawling and indexing of social media content [2]. It supports many of the functionalities featured by the proposed framework, but it lacks crawling customization capabilities, since the set of monitored sources is predefined. MediaFinder [3] offers this missing capability, i.e. it supports query-driven media discovery. However, it completely lacks functionalities for persistent monitoring and indexing of content. As a result, the query response time is very long and the results are one-time, i.e. there is no way to obtain up-to-date results (e.g. by having the query run periodically in the background). Finally, in terms of the employed visual indexing and clustering technology, our system is similar to the event detection system of [4], which, however, is limited to media content published by online news sources, thus lacking the social media aspect in the coverage of news.

The crawler is responsible for collecting data and content from online sources in the form of Items (posts made on a social platform, e.g. tweets), WebPages (URLs embedded in collected Items) and MediaItems (images/videos embedded in Items or WebPages). Its input is a set of CrawlingSpecs, i.e. arguments specifying what to crawl, e.g. a hashtag on Twitter. At the moment, there are three types of such arguments: KeywordsFeed, LocationFeed and SourceFeed, for collecting content around keywords/hashtags, locations and users/accounts respectively. The proposed crawling and indexing framework, depicted in Figure 1, comprises the following components: (a) Item collection, comprising stream-based (stream-manager) and query-driven (search-manager) components, (b) structured data repositories based on mongoDB and Solr, (c) fetching and indexing components, including WebPage fetching, article extraction, MediaItem detection and extraction, feature extraction and visual indexing, and (d) aggregation components, including geographic clustering, visual clustering and analytics. Below, we briefly describe how these components operate within the proposed framework.
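
To make the data model above more concrete, the following is a minimal Python sketch of the three content types and the three feed types. The class names mirror the terms used in the text; all field names and the helper `all_media` are assumptions made for illustration and are not taken from the framework's code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class KeywordsFeed:
    keywords: List[str]            # keywords or hashtags, e.g. ["#OccupyGezi"]


@dataclass
class LocationFeed:
    # Bounding box as (min_lon, min_lat, max_lon, max_lat); hypothetical layout.
    bounding_box: Tuple[float, float, float, float]


@dataclass
class SourceFeed:
    network: str                   # e.g. "Twitter"
    account: str                   # user or account name


# A CrawlingSpec argument is one of the three feed types.
Feed = Union[KeywordsFeed, LocationFeed, SourceFeed]


@dataclass
class MediaItem:
    url: str
    media_type: str                # "image" or "video"


@dataclass
class WebPage:
    url: str
    media: List[MediaItem] = field(default_factory=list)


@dataclass
class Item:
    item_id: str
    network: str
    text: str
    web_pages: List[WebPage] = field(default_factory=list)
    media_items: List[MediaItem] = field(default_factory=list)


def all_media(item: Item) -> List[MediaItem]:
    """Collect MediaItems embedded directly in an Item or in its linked WebPages."""
    found = list(item.media_items)
    for page in item.web_pages:
        found.extend(page.media)
    return found
```

The helper reflects the two entry paths for MediaItems described in the text: directly embedded in an Item, or reached through a shared WebPage.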

Fig. 1. Crawler architecture

Item Collection: The stream-manager process adopts a stream monitoring paradigm, i.e. given a set of CrawlingSpecs, it 'listens' to new Item events. The search-manager, on the other hand, operates in a query-driven mode. In particular, it receives custom CrawlingSpecs (manually entered or automatically generated), and formulates appropriate queries to different social networks (currently Facebook, Flickr, YouTube, Tumblr, Google+ and Instagram).

Structured data repositories: Data access is implemented by appropriate wrapper methods around the mongoDB and Solr repositories. Item and MediaItem metadata are stored in mongoDB, and most of their fields are indexed in Solr. At the moment, a significant amount of data redundancy exists between the two repositories in order to speed up retrieval (i.e. using a single request to both find and fetch the MediaItems and the accompanying metadata). A number of convenience access methods have been implemented on top of the respective repository clients.

MediaItem fetching and indexing: MediaItems enter the system in two ways: (a) MediaItems embedded in the collected Items, and (b) MediaItems that are extracted from shared links (WebPages). In the first case, MediaItems go through the following four steps before their indexing is complete.
  • Ranking: Priority scores are assigned to the incoming MediaItems, taking into account popularity (e.g. number of retweets) and time ('recency') scores.
  • Fetching: A priority queue is maintained. MediaItems are placed in the queue according to the priority score of the previous step, and their URLs are continuously consumed and fetched by 48 threads. URL expansion is also carried out in this step. The downloaded images (including video thumbnails) are sent to the next step.
  • Indexing: Compact descriptors are extracted from the collected images using the VLAD+SURF descriptor-aggregator combination, which was found to outperform VLAD+SIFT and to be amenable to aggressive dimensionality reduction, as shown in [5]. The extracted vectors are further encoded using Product Quantization [6], thus making Nearest Neighbour (NN) search even more efficient and scalable.
  • Update: The fields width, height and indexed of the processed MediaItems are updated in mongoDB.
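
The Ranking and Fetching steps above can be sketched as a scored priority queue consumed by worker threads. The scoring formula below (log-scaled popularity damped by an exponential recency decay with a one-hour half-life) is an assumption; the text only states that popularity and recency are taken into account. The `fetch` callable is a placeholder for the actual download and URL expansion.

```python
import math
import queue
import threading
import time
from typing import Callable, Iterable, List, Optional, Tuple


def priority_score(shares: int, created_at: float, now: float,
                   half_life_s: float = 3600.0) -> float:
    """Hypothetical score: log-scaled popularity damped by recency decay."""
    age = max(0.0, now - created_at)
    recency = 0.5 ** (age / half_life_s)
    return (1.0 + math.log1p(shares)) * recency


def fetch_all(media: Iterable[Tuple[str, int, float]],
              fetch: Callable[[str], object],
              n_threads: int = 48,
              now: Optional[float] = None) -> List[str]:
    """Fetch URLs in descending priority order using a pool of worker threads.

    `media` yields (url, shares, created_at) tuples. Returns the URLs in the
    order in which they were taken from the queue.
    """
    now = time.time() if now is None else now
    pq: "queue.PriorityQueue" = queue.PriorityQueue()
    for url, shares, created_at in media:
        # PriorityQueue pops the smallest entry first, so negate the score.
        pq.put((-priority_score(shares, created_at, now), url))

    order: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            # Dequeue and record under one lock so `order` matches queue order.
            with lock:
                try:
                    _, url = pq.get_nowait()
                except queue.Empty:
                    return
                order.append(url)
            fetch(url)  # the slow network call runs outside the lock

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

In this sketch the queue is filled once and then drained; a continuously running service would instead keep the workers alive and block on the queue while new MediaItems arrive.
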
In the second case, the collected WebPages also go through a set of processing steps in order to lead to new MediaItems. More specifically, the processing steps are as follows:
  • Ranking: This is implemented in the same way as described above, taking as inputs the time that a URL was shared and the number of times it was shared to compute the priority score.
  • Fetching: The shared URL is expanded and its content (HTML source) is fetched. Depending on the type of URL, there are two possible next steps:
    * MediaItem extraction: If the URL originates from one of a set of selected media sharing platforms (e.g. YouTube, DailyMotion), the corresponding id is extracted and the respective metadata are fetched with an API call.
    * Article extraction: This step checks whether the URL corresponds to an Article and, if so, whether it contains media content.
  • Update in mongoDB: Here, more fields of the newly created MediaItem need to be updated. These include the title, description and status. Also, the WebPage fields title, text, status, media, domain, expandedUrl, isArticle and quality are updated.
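
The branching between MediaItem extraction and Article extraction can be illustrated with a small routing function. This is a sketch under stated assumptions: it only recognizes two common YouTube URL shapes (`youtube.com/watch?v=<id>` and `youtu.be/<id>`), the function names are hypothetical, and the subsequent API call for metadata is not shown.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qs, urlparse


def extract_media_id(url: str) -> Optional[Tuple[str, str]]:
    """Return (platform, media id) for a recognized media URL, else None."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.strip("/")

    # Short links carry the id as the first path segment.
    if host == "youtu.be" and path:
        return ("youtube", path.split("/")[0])
    # Regular watch pages carry the id in the `v` query parameter.
    if host in ("youtube.com", "m.youtube.com") and parts.path == "/watch":
        ids = parse_qs(parts.query).get("v")
        return ("youtube", ids[0]) if ids else None
    return None


def route(url: str) -> str:
    """Choose the next processing step for an already expanded URL."""
    if extract_media_id(url) is not None:
        return "media_item_extraction"
    return "article_extraction"
```

Other platforms would be added as further host-specific rules; URLs that match none of them fall through to article extraction.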

The above fetching and indexing steps are implemented as Bolts on top of Storm.
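
As an illustration of the Product Quantization encoding used in the Indexing step [6], the sketch below splits a vector into subvectors, replaces each by the index of its nearest centroid, and compares a query to stored codes with an asymmetric distance (exact query against quantized database vectors). The codebooks are assumed to be given; in practice they would be learned (typically with k-means) on training descriptors. The function names are hypothetical, and the pure-Python form is chosen for readability rather than speed.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(v: Vector, m: int) -> List[Vector]:
    """Split a vector into m equal-length subvectors."""
    if len(v) % m != 0:
        raise ValueError("dimension must be divisible by m")
    d = len(v) // m
    return [v[i * d:(i + 1) * d] for i in range(m)]


def pq_encode(v: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Replace each subvector by the index of its nearest centroid."""
    subs = split(v, len(codebooks))
    return [min(range(len(cb)), key=lambda k: sq_dist(sub, cb[k]))
            for sub, cb in zip(subs, codebooks)]


def adc_distance(query: Vector, code: List[int],
                 codebooks: List[List[Vector]]) -> float:
    """Asymmetric distance: exact query subvectors vs. a quantized code."""
    subs = split(query, len(codebooks))
    return sum(sq_dist(sub, cb[k]) for sub, cb, k in zip(subs, codebooks, code))


def nearest(query: Vector, codes: List[List[int]],
            codebooks: List[List[Vector]]) -> int:
    """Index of the stored code closest to the query."""
    return min(range(len(codes)),
               key=lambda i: adc_distance(query, codes[i], codebooks))
```

Each stored image is thus reduced to a short list of centroid indices, which is what keeps the index compact and NN search cheap.
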
Aggregation: This is implemented through the processes clustering (geo), clustering (visual) and analytics. Clustering (geo) performs geographic clustering periodically (e.g. every 20 minutes) using the BIRCH algorithm [8] to identify geographic hotspots. This is applicable only to MediaItems that are already geotagged (in the future it could also be applied to automatically geotagged items). Clustering (visual) implements an efficient graph-based image clustering step, originally proposed in [7], on top of the NN visual search capability of the system. The extracted image clusters are computed on a regular basis (e.g. every 20 minutes) and serve as proxies of trending media content (i.e. large clusters of near-duplicate images imply that an image and its variants are independently shared by multiple social network users). Finally, the analytics component computes several useful statistics (e.g. most frequent tags, most frequent domains, etc.). In terms of deployment, all aggregation components are currently implemented as single-node services.
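
The idea of clustering images on top of NN search results can be illustrated as follows. The sketch builds an undirected graph from above-threshold neighbour pairs and takes its connected components as clusters. This is a deliberately simplified stand-in: the method of [7] applies community detection on the similarity graph, which is not reproduced here. The `neighbours` input format and the threshold values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def cluster_images(neighbours: Dict[str, List[Tuple[str, float]]],
                   min_similarity: float = 0.8,
                   min_size: int = 2) -> List[Set[str]]:
    """Group images into connected components of the thresholded NN graph.

    `neighbours` maps an image id to (neighbour id, similarity) pairs,
    e.g. the top results returned by a visual NN search for that image.
    """
    graph: Dict[str, Set[str]] = defaultdict(set)
    for image, results in neighbours.items():
        for other, similarity in results:
            if other != image and similarity >= min_similarity:
                graph[image].add(other)
                graph[other].add(image)

    seen: Set[str] = set()
    clusters: List[Set[str]] = []
    for start in list(graph):
        if start in seen:
            continue
        # Iterative depth-first traversal of one component.
        component: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        if len(component) >= min_size:
            clusters.append(component)
    # Largest clusters first: these act as proxies for trending images.
    return sorted(clusters, key=len, reverse=True)
```

Connected components tend to merge loosely related images through chains of neighbours, which is one reason a community detection step may be preferable on real data.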

To assess the effectiveness of the proposed media crawling and indexing framework, we conducted a large-scale media collection case study around the #OccupyGezi events. These events refer to several massive protests in Turkey that were sparked by outrage at the brutal eviction of a sit-in at Taksim Gezi Park, held in protest against the urban development plan for the park. The protests started on 28 May 2013 and quickly escalated, spreading to different neighbourhoods of Istanbul, to other Turkish cities, and even to large cities around the globe. The events continued until 31 July.

We started the crawling exercise on June 4th and continued until July 17th. The crawling was conducted around a set of hand-picked keywords and a small number of selected accounts on Facebook and YouTube. Table 1 lists the crawling configuration for each of the monitored streams. The framework components were distributed across three machines. More specifically, the Hadoop installation (serving the fetching, feature extraction and indexing of images) spanned all three machines, while the mongoDB, Solr and image NN search services were installed on a single machine. A simple web user interface for presenting and searching over the collected data was served by a fourth machine.

Table 1: Crawling configuration

Figure 2 illustrates the number of newly collected Items and Media Items (per hour) over time. The timelines were created based on the Item timestamps. Therefore, they also contain a few items that were created before June 4th. Note that the numbers depicted by the timelines refer to all incoming Items (including retweets) and to items containing media content. In total, 16,785,785 Items and 319,095 Media Items were collected, spanning the interval from June 1st to July 17th.

Fig. 2: Frequency of Item and Media Item arrival (per hour) over time.

Figure 3 provides a map-based view of the event at different zoom levels (based on the subset of images that are geotagged). From the figure, it becomes apparent that although the events were most intense in Istanbul (and Gezi Park in particular), there was considerable activity in major Turkish cities (Ankara, Bursa, Izmir, Antalya, Eskisehir, Bodrum), and even in large European (London, Berlin, Paris) and North American (New York, San Francisco, Boston, Toronto) cities.

Fig. 3: Map-based views of the event.

To assess the media coverage of the crawler, as well as the retrieval capabilities of the visual indexing component, we devised the following test. We selected images related to the event from four different sources (listed in Table 2) and used them as visual queries to the NN search component. For each query, we manually checked whether the top 20 results contained the image or a near-duplicate and in which positions. At the end, we computed a coverage score (denoted as C) by dividing the number of queries with at least one correct result (QC) by the total number of queries (Q). Note that this test is a challenging one, since the crawler monitored a limited set of entities, but the queries were effectively selected from the whole Web and were diverse in nature. Figure 4 illustrates a few example queries for each source. Table 2 provides information on the selected sources, the number of queries per source and the selection method for the queries.
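
The coverage score defined above, C = QC / Q, can be computed as in the following sketch. Representing the manual judgments as one list of booleans per query (one flag per ranked result) is an assumption made for illustration, as are the function names.

```python
from typing import Dict, List, Optional


def coverage(judgments: Dict[str, List[bool]], top_k: int = 20) -> float:
    """C = QC / Q: fraction of queries with a correct result in the top k."""
    if not judgments:
        raise ValueError("no queries")
    hits = sum(1 for flags in judgments.values() if any(flags[:top_k]))
    return hits / len(judgments)


def first_correct_rank(flags: List[bool], top_k: int = 20) -> Optional[int]:
    """1-based position of the first correct result, or None if there is none."""
    for rank, ok in enumerate(flags[:top_k], start=1):
        if ok:
            return rank
    return None
```

A correct result that appears only below position 20 does not count towards QC, matching the top-20 check described in the text.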

Table 2: Sources used for selecting image queries

Table 3: Coverage of crawler for different sources

Fig. 4: Samples of the used image queries.

Table 3 reports the coverage achieved by the crawler for each source. We expect these scores to slightly underestimate the coverage of the crawler, since the crawling period did not cover the full duration of the events (especially the first days, which were intensive in terms of media sharing activity). The lowest coverage (~0.32) is obtained for queries from Google image search. A closer look at the failed queries revealed that they mostly corresponded to photos posted in news portals (BBC, euronews, New Yorker) and in a freelance photojournalism portal. For the rest of the sources, we note that coverage ranges from 0.5 to 0.95. A potential explanation for the higher percentages could be the nature of these sources: the source with the highest coverage is a tumblr account, which makes its images much more likely to be shared through social networks compared to those of news sites. Source #4 (a collection of artistic depictions of the protests) also features highly viral content, which is potentially attractive for social network users.
Source constitutes a 'median' case, in the sense that its photos resemble those featured in professional news sites, but at the same time it is closer to the social networks due to its positive stance towards the protesters (thus making it more social network-friendly and a better candidate for sharing).

A second observation pertains to the quality of ranking for similar images. In the vast majority of cases where at least one result was the same as or very similar to the query image, the correct results were ranked above the incorrect ones. In only five (out of 87) cases did we find a few irrelevant images ranked above relevant ones. It is also noteworthy that in all cases the response times were sub-second, typically in the range of 100-200 ms. The above observations indicate that the NN search component is both accurate and highly efficient.

A final noteworthy finding coming out of this study pertains to the emergence of visual memes [9]. It is often the case that some popular images are retouched and shared by social network users, leading to very interesting (and often hilarious) means of self-expression. Two such examples that were discovered with the help of image clustering on the images of the #OccupyGezi collection are illustrated in Figure 5.

Fig. 5: Visual memes discovered with the help of image clustering.

The presented framework offers a versatile and effective solution to the problem of social multimedia crawling and indexing. It enables performing highly configurable crawling tasks and is built on top of scalable and highly efficient technologies. The evaluation of the crawler in the context of the #OccupyGezi events demonstrated that it is an effective tool for collecting diverse content from social networks and for browsing and searching it in multiple ways. In the future, we plan to work on three important improvements: (a) providing better support for the collection of professional news media content from major news sites (e.g. with the help of RSS feeds), (b) enabling the configuration of separate crawling tasks from a user-friendly web interface, (c) distributing the stream management logic, enabling the crawling with the use of multiple API keys (user-provided) in order to support a large number of independent crawling tasks without violating the social network API limits. In addition, we plan to conduct more comprehensive evaluations of the crawling and retrieval capabilities of the framework on numerous case studies, involving professional end users (e.g. journalists).

This work is supported by the SocialSensor FP7 project, partially funded by the EC under contract number 287975.


[1] S. Diplaris, S. Papadopoulos, I. Kompatsiaris, N. Heise, J. Spangenberg, N. Newman and H. Hacid. “Making sense of it all”: an attempt to aid journalists in analysing and filtering user generated content. In Proceedings of the 21st international conference companion on World Wide Web, ACM, New York, NY, USA, 1241-1246, 2012

[2] T.-S. Chua, H. Luan, M. Sun and S. Yang. NExT: NUS-Tsinghua Center for Extreme Search of User-Generated Content. IEEE Multimedia, vol. 19, no. 3, pp. 81-87, July-Sept. 2012

[3] G. Rizzo, T. Steiner, R. Troncy, R. Verborgh, J. García and R. van de Walle. What fresh media are you looking for?: retrieving media items from multiple social networks. In Proceedings of the 2012 international workshop on Socially-aware multimedia (SAM ’12). ACM, New York, NY, USA, 15-20, 2012

[4] A. Joly, J. Champ, P. Letessier, N. Hervé, O. Buisson and M.-L. Viaud. Visual-based transmedia events detection. In Proceedings of the 20th ACM international conference on Multimedia (MM ’12). ACM, New York, NY, USA, 1351-1352, 2012

[5] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas and I. Vlahavas. An Empirical Study on the Combination of SURF Features with VLAD Vectors for Image Search. In Proceedings of the 13th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2012), Dublin, Ireland, May 2012

[6] H. Jégou, M. Douze and C. Schmid. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011

[7] S. Papadopoulos, C. Zigkolis, G. Tolias, Y. Kalantidis, P. Mylonas, Y. Kompatsiaris and A. Vakali. Image Clustering through Community Detection on Hybrid Image Similarity Graphs. In Proceedings of ICIP 2010, International Conference on Image Processing (Hong Kong), IEEE, pp. 2353-2356, 2010

[8] T. Zhang, R. Ramakrishnan and M. Livny. BIRCH: an efficient data clustering method for very large databases. SIGMOD Record, vol. 25, no. 2, pp. 103-114, 1996

[9] L. Xie, A. Natsev, J. R. Kender, M. Hill, and J. R. Smith. Visual memes in social media: tracking real-world news in YouTube videos. In Proceedings of the 19th ACM international conference on Multimedia (MM ’11). ACM, New York, NY, USA, 53-62, 2011

Dr. Symeon Papadopoulos received the Diploma degree in Electrical and Computer Engineering from the Aristotle University of Thessaloniki (AUTH), Greece in 2004. In 2006, he received the Professional Doctorate in Engineering (P.D.Eng.) from the Technical University of Eindhoven, the Netherlands. Since September 2006, he has been working as a research associate with the Information Technologies Institute (ITI), part of the Centre for Research and Technology Hellas (CERTH), on a wide range of research areas such as information search and retrieval, social network analysis, data mining and web multimedia knowledge discovery. In 2009, he completed a distance-learning MBA degree at the Blekinge Institute of Technology, Sweden. In 2012, he defended his Ph.D. thesis in the Informatics department of AUTH on the topic of large-scale knowledge discovery from social multimedia.

Manos Schinas received the Diploma degree in Electrical and Computer Engineering from the Aristotle University of Thessaloniki (AUTH), Greece in 2011. His thesis concerned the development of a focused crawler, oriented to the discovery of personal identities and social interactions. Currently, he is a PhD student at the Aristotle University of Thessaloniki under the supervision of Professor P. Mitkas, on the topic of event detection in social media. His research interests include information retrieval and large-scale storage and computing. Since January 2012, he has been working as a research associate in the Information Technologies Institute (ITI), part of the Centre for Research and Technology Hellas (CERTH), in the SocialSensor FP7 project.

Theodoros Mironidis, M.Sc., received the Certificate degree in Applied Informatics from the University of Macedonia (UoM), Greece in 2009. In 2011, he received the Diploma in Applied Informatics from the UoM. Since July 2013, he has been working as a research associate with the Information Technologies Institute (ITI), part of the Centre for Research and Technology Hellas (CERTH), on web technologies and social network analysis.

Katerina Iliakopoulou received her diploma from the Electrical and Computer Engineering Department of the Aristotle University of Thessaloniki (AUTH), Greece in 2012. Her research interests include robotics, artificial intelligence, machine learning and data mining. In 2010-2011, she worked for the research project PANDORA (Program for the Advancement of Non Directed Operated Robotic Agents) of the Electrical and Computer Engineering Department (AUTH). The goal of the project was the construction of a mobile robot for urban search and rescue operations (USAR). Her part involved the development of software for autonomous robot navigation and victim identification. For the past year, she has been working as a research associate for the Information Technologies Institute of the Centre for Research and Technology Hellas, where her work focuses on data mining from social media.

Mr. Eleftherios Spyromitros-Xioufis received a B.Sc. in Informatics in 2009 and an M.Sc. in Information Systems in 2011 from the department of Informatics of the Aristotle University of Thessaloniki. Since April 2010, he has been a Ph.D. student in the same department and a member of the Machine Learning and Knowledge Discovery group. His Ph.D. research concerns the development of supervised learning methods for multi-target prediction and the use of multi-label classification methods for multimedia retrieval. Since January 2012, he has been working as a research assistant with the Information Technologies Institute (ITI), part of the Centre for Research and Technology Hellas (CERTH), on the development of visual feature extraction, vectorization and indexing methods for large-scale image retrieval, as well as on applications such as geolocation of images and diversification of image retrieval results.

Ioannis Tsampoulatidis received his Bachelor (Hons) in Computer Science from the University of Portsmouth, UK, in 1998 and since October 2000, he has been working as a research associate with the Information Technologies Institute (ITI), part of the Centre for Research and Technology Hellas (CERTH). He has participated in many National and European projects on a wide range of research areas such as semantic web, internet protocols, software patterns and open web standards. He is also freelancing as technical associate and consultant for various academic and industry clients with more than 10 years of experience in the IT field.

Dr. Ioannis (Yiannis) Kompatsiaris is a Senior Researcher (Researcher B’) with the Information Technologies Institute / Centre for Research and Technology Hellas, Thessaloniki, Greece. His research interests include semantic multimedia processing, social media analysis, knowledge structures, reasoning and personalization for multimedia applications. He is the co-author of 57 papers in refereed journals, 30 book chapters, 7 patents and more than 170 papers in international conferences. He has been the co-organizer of various international conferences and workshops and has served as a regular reviewer for a number of journals and conferences. He is a Senior Member of IEEE and member of ACM.