Scalable Social Analytics for Online Communities

Marcel Karnstedt
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway

With the constantly growing ecosphere of online communities, their managers, operators and members can benefit hugely from a rich set of tools to successfully understand, control, exploit and utilise them. This requires extracting reusable, interpretable analytics in real time from the streams of dynamically, socially produced data. In this article, we summarise our efforts in the context of the ROBUST project on producing a suite of novel, highly scalable methods and tools to enable this, aligned along four non-disjoint dimensions: structural analysis, behavioural analysis, content/social-semantic analysis and cross-community analysis.

ROBUST's risk-based approach is heavily dependent on a suite of advanced analytical methods for online communities. In this article, we summarise the objectives and efforts in the context of this core part of the ROBUST project, which is aligned along four non-disjoint dimensions: structural analysis, behavioural analysis, content/social-semantic analysis and cross-community analysis. Figure 1 illustrates the interplay of these components as the 'ROBUST analytics temple'.

Fig. 1. Interplay of different analytical dimensions in ROBUST.

The information entities, actors and links in online communities are defined using many types of attributes. Thus, a prerequisite for any analytical task is to define the subsets or merging of these attributes to provide appropriate, tractable abstractions that map to well-defined network representations and problems. Initial structural, textual, sentiment and discourse analyses indicate the degree to which members and their posts may be linked by different relations, such as frequent interaction, influence or agreement and disagreement. Such analyses provide different views on the raw community data and make it possible to extract and compute the features of interest for all subsequent analytics. With this basis in mind, in the following we briefly summarise our efforts along the four dimensions mentioned above.

This section summarises the four dimensions of analytical tools that we combine in the ROBUST project, i.e., the four pillars illustrated in Figure 1. In the structural analysis, we inspect the different types of community graphs that can be constructed from interactions between a community's members. Behavioural analysis combines this with content-based analytics to investigate and address the behaviour of individuals at the micro level as well as the resulting behaviour of communities at the macro level. Finally, research on cross-community dynamics merges aspects of all these dimensions to identify how communities influence each other and how information flows between them.

Fig. 2. Use-case application visualising the development of community graphs and corresponding fundamental structural properties.

2.1 Structural Analysis
Structural analysis focuses on core techniques for analysing and mining dynamic social graphs in order to identify and measure emergent structural characteristics, such as sub-communities, overlapping communities, growth and influence measures for those communities, their members and leaders. Conventionally, a sub-community is a highly connected substructure, defined according to an objective function or metric chosen to formalise the idea of groups with more intra-group than inter-group connectivity. Community members can be linked by participation in common forums, common threads, explicit reply-to structures or the quoting of text from a previous post. Communities offer multiple ways to cluster users -- a cluster might correspond to a sub-community of people who interact closely, who have the same preferences or who tend to act similarly.
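As one illustration of such an objective function, the following minimal sketch computes Newman's modularity for a given partition of an undirected, unweighted graph. Modularity is chosen here as an assumption for illustration only; the text does not prescribe a specific metric, and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph.

    edges:        list of (u, v) pairs, each undirected edge listed once
    community_of: dict mapping every node to a community label (assumed input)
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)       # number of edges inside each community
    degree_sum = defaultdict(int)  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    # Q = sum over communities of (fraction of intra edges - expected fraction)
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)
```

Higher values indicate more intra-group than inter-group connectivity than would be expected at random. Putting all nodes into a single community yields a value of zero.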

The initial feature extraction provides the basis for identifying and modelling all these different relationships. On top of these, the ROBUST platform offers a set of established (e.g., degree distribution, connectivity, growth) and novel fundamental structural properties that support different views on the dynamics of the resulting structures. One such novel notion is framed as 'fairness and diversity' [1], i.e., numerical measures of the equality and inequality of the distribution of users, blogs, forums, activity, etc. In many online communities only a few users are highly active, whereas most users exhibit only a little activity or none at all. For an owner, the amount of skewness is an important indicator for the health of her community.

The equality of user activity represents the fairness in an online community, whereas the equality of content activity represents the observed diversity.
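A common way to put a number on such (in)equality is the Gini coefficient. The sketch below uses it as an assumed stand-in; it is not necessarily the exact measure defined in [1], and the function name is hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative activity counts.

    0.0 means perfectly equal activity; values close to 1.0 mean that
    a few users (or forums, threads, ...) account for almost all activity.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum_i(i * x_i) / (n * total) - (n + 1) / n, i = 1..n over sorted x
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n
```

Applied to posts per user, the value would reflect fairness in the sense used above; applied to posts per forum or thread, it would reflect diversity.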
In terms of community-finding algorithms, we focus on particularly scalable approaches that can identify the usually overlapping community structures. [2] provides a very good overview of the state of the art in this field.

In ROBUST, we analyse the different approaches and define novel metrics and understandings of community structures in extremely large-scale graphs, where traditional approaches require a major rethinking. The outcome is a suite of structural analysis techniques that combines existing approaches and overcomes their limitations by newly developed extensions suited to handle the ROBUST data and use cases. As the data of every use case has its own characteristics, different approaches are needed depending on these specific characteristics. A key to success is to be able to identify the required techniques based on the character of the data, domain and potential risks, as well as being able to give general guidelines for use cases and scenarios not explicitly handled in the project. The analysis of fundamental structural properties and second-order 'social' properties, such as overlapping communities, quasi-cliques and spatio-temporal correlations, enables the ROBUST project to build realistic models and provides other analytical tasks with empirical data on dynamic social processes.

Figure 2 illustrates one of many use-case scenarios for structural analysis: a tool for community exploration that combines the dynamic visualisation of different community graphs with corresponding fundamental structural properties. 'Imbalance' represents one aspect of fairness, 'separation' one aspect of diversity as mentioned above. The nature of the tool supports retrospective analysis of community development as well as monitoring in real time.

2.2 Behavioural Analysis
Behavioural analysis focuses on scalable methods for mining and matching social behavioural motifs. A behavioural motif is a special structural pattern that identifies a particular relationship between several actors in a community. For example, a star-shaped pattern typifies the reply-to communication structure between a helpful expert (or sub-community) and community 'newbies'. Similar to the manifold options to link community members as outlined in the previous section, a community can be profiled according to the social roles of its members and its sub-communities, the observed trends in positive and negative behaviour, the community's tendency to maintain or lose its members and expertise, etc.
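To make the star-shaped example concrete, the following sketch scans a list of reply-to pairs for users who answered several others whose only contact in the data is that one user. The definition of a 'leaf', the `min_leaves` threshold and all names are assumptions for illustration, not the motif-matching method used in the project.

```python
from collections import defaultdict


def find_star_hubs(replies, min_leaves=3):
    """Return {hub: leaves} for star-shaped reply patterns.

    replies: iterable of (replier, original_author) pairs.
    A leaf is a user whose only contact (in either direction) is the hub.
    """
    neighbours = defaultdict(set)  # undirected contact sets
    replied_to = defaultdict(set)  # replier -> users they answered
    for replier, author in replies:
        if replier == author:
            continue  # ignore self-replies
        replied_to[replier].add(author)
        neighbours[replier].add(author)
        neighbours[author].add(replier)

    stars = {}
    for hub, targets in replied_to.items():
        leaves = {t for t in targets if neighbours[t] == {hub}}
        if len(leaves) >= min_leaves:
            stars[hub] = leaves
    return stars
```

Matching such motifs at scale would require indexing and incremental evaluation; this version simply makes one pass over the data.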

Fig. 3. Different features of churn.

Figure 4 illustrates one analysis of interest in this context: the role composition [3] of a discussion forum and how it changes over time, here indicating that the forum's 'supporters' are displaced by 'grunts' and 'ignored'.

Fig. 4. Analysing role composition dynamics

The ROBUST project focuses on two inter-dependent dimensions of behavioural analysis: the micro level, focusing on the behaviour of individual members of online communities, and macro-level analysis, combining such individual characteristics into a global view of the behaviour of (sub-)communities. The article 'Modelling and analysis of user behaviour in online communities' in this newsletter provides a closer look at the approaches the project takes and how these types of behavioural analytics are linked to the assessment of community health.

Behavioural analysis also draws upon content analysis techniques. As an example, consider the notion of churn [4], which is of interest at the micro as well as the macro level. Churn may be motivated by a variety of social, behavioural and also content-based features.
While churn is a prominent term in other fields such as the telecom business, the special interplay of intrinsic and extrinsic features in social networks (cf. Figure 3) makes modelling, detecting and predicting churn in online communities a particularly challenging problem. This consequently requires mining and matching behavioural and structural motifs and incorporating insights from the content and social-semantic analysis. [5] and [6] summarise some of the corresponding insights gained in the context of churn in online communities.
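Before churn can be detected or predicted, it has to be operationalised. One simple, purely activity-based option is sketched below: a member counts as churned when activity in the latest time window falls below a fraction of the average over the preceding windows. The window count, the threshold and the function name are assumptions for illustration; the definitions used in [4], [5] and [6] may differ, and this sketch ignores the social and content-based features discussed above.

```python
def has_churned(posts_per_window, prev_windows=3, threshold=0.2):
    """Activity-based churn test for a single member.

    posts_per_window: post counts per consecutive time window, oldest first.
    Returns True if the latest window's count is below
    threshold * (mean count of the prev_windows windows before it).
    """
    if len(posts_per_window) < prev_windows + 1:
        return False  # not enough history to judge
    history = posts_per_window[-(prev_windows + 1):-1]
    baseline = sum(history) / prev_windows
    if baseline == 0:
        return False  # member was already inactive
    return posts_per_window[-1] < threshold * baseline
```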

The provision of scalable methods for mining and matching behavioural motifs allows communities to be profiled according to the observed trends in positive and negative (anti- or non-social) behaviour and the communities' tendency to maintain or lose their members and expertise. This delivers empirical analysis to the risk framework and makes it possible to build accurate computational models to forecast risks to the community, such as high churn. Further, it will enable the provision of policies or interventions to stem loss of expertise and unnecessary decay of communities.

2.3 Content & Social-Semantic Analysis
Content/social-semantic analysis focuses on text-mining techniques for detecting, tracking and qualitative measurement of topic ebb and flow, providing views of current and recurrent interests, shifts in topic and sentiment [7], and attention management and information filtering for an entire community. This task also feeds directly into the behavioural analysis, augmenting the identification of community roles, such as topic innovators, summarisers, users likely to answer questions, users likely to engender disputes or users whose co-occurrence tends to lead to negative behaviour. To achieve this, ROBUST focuses on approaches to support automated and scalable behaviour and content analysis.

In conjunction with structural analysis, we use content data to summarise the topics motivating sub-communities, providing a view of the community's knowledge leaders, its competence and its knowledge deficits. Where explicit structural links are absent or incomplete, we use content data to infer links and design matrix factorisation approaches on link and content data to improve community-finding algorithms.
In conjunction with behavioural methods, analysis of content features is used to identify different communities or sub-communities that react similarly to external news events or breaking topics. Information filtering is based on algorithms for content recommendation for an entire community, enriched by traditional personalised algorithms to improve effectiveness [8]. Attention management focuses on news stream filtering from real-time web applications, such as Twitter or Facebook. News items are selected from the streams based on their relevance to the community. Examples of successfully developed and evaluated approaches are described in [9], which assesses the interestingness of tweets in a streaming fashion based on a rich set of different features, and [10], which aims at news filtering and event detection by following a novel approach of structural community identification accompanied by content analyses to determine the topics of the groups of interest found in this way.

2.4 Cross-Community Dynamics
Cross-community analysis combines aspects of all of the above-mentioned dimensions. It focuses on uncovering and measuring how community structures and substructures influence each other. In some cases, the presence of sub-communities may indicate redundancies or complementarities, where communities might be merged or linked. Spatio-temporal correlation analysis across different (sub-)communities can provide insight into external causalities, with either positive or negative effects, that result in similar structural dynamics (such as link bursts or large churn events) co-occurring in different parts of the network. A key objective is that the proposed novel metrics and the accompanying analysis and prediction techniques appropriately support the risk analysis and modelling techniques of ROBUST.
The challenge to overcome is to analyse and predict the development of communities and the influences between communities and events. [11] provides an overview of a corresponding framework focusing on the resulting life-cycle events of communities. While that work focuses on scientific communities and citation relations, we are currently working on adapting this approach to the online communities that are the focus of the ROBUST project.
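The temporal part of the correlation analysis mentioned above can be illustrated with a plain Pearson correlation between per-window event counts (for example, churn events) of two sub-communities. This is a minimal sketch under the assumption that both series are aligned to the same time windows; it is not the project's method and it does not establish causality.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long, time-aligned count series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 for two otherwise unconnected sub-communities would be a hint to look for a shared external cause.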

A second cornerstone of this dimension is the assessment of cross-community impact [12] and diffusion likelihood within and between communities [13]. This draws from empirical analysis of structural behaviour and topic diffusion. A related sub-task is the identification of communities that 'feed' other communities by consistently being the source of new topics of information. The identification of (external) sub-communities and pivotal users as well as 'spanning' communities, i.e., sub-communities that form strongly connected sub-graphs in one or more communities, can indicate redundancies, risks and opportunities where communities might be merged, created or require special support. Such interventions can only be attempted if the social and behavioural features of the communities are understood and if they would not result in loss of users and expertise.

Fig. 5. Community detection on Twitter graphs.

A key aspect of ROBUST is to design efficient algorithms and techniques to analyse particularly large data sets and, where required, in real time. This involves the investigation of methods for coarsening community graphs to a tractable size, such as feature selection, node merging or sampling.
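Node merging, one of the coarsening options named above, can be sketched as follows: nodes are collapsed into groups (for example, previously detected sub-communities) and the edges between groups are aggregated into weighted super-edges. The grouping input and all names are assumptions for illustration.

```python
from collections import Counter


def coarsen(edges, group_of):
    """Merge nodes by group and return weighted edges between groups.

    edges:    iterable of undirected (u, v) pairs
    group_of: dict mapping each node to a group label (labels must be orderable)
    Returns {(group_a, group_b): number_of_original_edges} with group_a <= group_b.
    """
    weights = Counter()
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        if gu == gv:
            continue  # intra-group edges vanish in the coarsened graph
        key = (gu, gv) if gu <= gv else (gv, gu)
        weights[key] += 1
    return dict(weights)
```

The coarsened graph has one node per group, so subsequent analyses run on a much smaller structure at the cost of losing detail inside groups.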

More importantly, the project provides high-performance computing capabilities, such as a massively parallel computing framework extending MapReduce [14] and bulk synchronous processing as in Google's Pregel [15] and its derivatives. Depending on the task type, algorithms are executed either by an extended MapReduce-based query processor or in a (potentially distributed) in-memory fashion. For instance, OLAP-style analysis tasks, such as topic modelling and matrix decomposition, are executed as distributed batch processes. Near real-time analysis tasks, such as graph maintenance and event detection, are executed in a distributed fashion in-memory.
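The bulk synchronous, vertex-centric style of processing can be illustrated on a single machine. The sketch below labels connected components by letting every vertex adopt the smallest label it receives and forward it to its neighbours, one superstep at a time. It only mimics the programming model; it uses no Pregel or Giraph API, and the function name is hypothetical.

```python
def connected_components_bsp(adjacency):
    """Label connected components with vertex-centric supersteps.

    adjacency: dict vertex -> list of neighbours (undirected, so symmetric;
               every neighbour must also be a key). Labels must be orderable.
    Returns dict vertex -> smallest vertex id in its component.
    """
    label = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own label to all neighbours.
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        for n in neighbours:
            inbox[n].append(label[v])

    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            if not messages:
                continue  # vertex stays halted in this superstep
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                for n in adjacency[v]:
                    outbox[n].append(smallest)
        inbox = outbox  # barrier: messages become visible in the next superstep
    return label
```

The computation stops once no messages are in flight, which mirrors the "all vertices halted" termination condition of the model.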

In many online communities, the rate of incoming information can present challenges for detecting and tracking community substructures, topics and social behaviour motifs in real time for risk assessment. Rerunning batch algorithms every time the social graph changes is not feasible for massive social graphs. Thus, one objective is to design efficient supervised and unsupervised stream-mining algorithms. This first requires the appropriate assessment of tasks and requirements in order to be able to decide on the correct trade-off between batch and stream processing.
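The difference between rerunning a batch algorithm and maintaining a result incrementally can be shown on a small example: the sketch below keeps track of connected groups of users while interaction edges arrive one by one, using a union-find structure. It is an assumed illustration of the streaming idea rather than one of the project's algorithms, and it handles edge insertions only, not deletions or expiry.

```python
class StreamingComponents:
    """Maintain connected groups incrementally over a stream of edges."""

    def __init__(self):
        self.parent = {}
        self.count = 0  # current number of groups among the nodes seen so far

    def _find(self, x):
        if x not in self.parent:
            self.parent[x] = x  # first sighting: x forms its own group
            self.count += 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def add_edge(self, u, v):
        """Process one incoming interaction; near-constant amortised cost."""
        root_u, root_v = self._find(u), self._find(v)
        if root_u != root_v:
            self.parent[root_u] = root_v
            self.count -= 1

    def same_group(self, u, v):
        """Note: registers previously unseen nodes as singleton groups."""
        return self._find(u) == self._find(v)
```

Each new edge updates the result directly, so the full graph never has to be reprocessed. Supporting deletions or sliding windows would need a different data structure.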

Fig. 6. Mobile event guide for the Volvo Ocean Race 2012.

As a result, ROBUST combines advanced MapReduce operators for matrix decomposition and the computing of structural features and clusters, Giraph for modelling the spread of information, and distributed stream-processing for attention management and event detection. The latter is an example particularly suited to showcase how we combine community detection (cf. Figure 5) to identify 'Tweet Cliques', topic detection to label them, and stream processing to enable a mobile event guide (cf. Figure 6) for Twitter users [10].

4 Conclusion
The key objective of the ROBUST community analytics is to provide a suite of novel, highly scalable analytical methods and tools for online communities. In this article, we summarised the four underlying analytical dimensions and highlighted the importance of adequately combining them to achieve this core objective of the project. We further highlighted the important aspects of scalability and real-time requirements. To date, the project partners have achieved most of the ROBUST goals and we are currently working on integrating the resulting contributions into an open-source platform. We were able to successfully combine the different analytical tools as outlined above into one rich and flexible framework for advanced community analytics. We believe that this opens a range of very valuable opportunities for anyone involved in managing or analysing online communities and that it offers a sophisticated basis for a wide range of application scenarios.

This work was jointly supported by the European Union (EU) under grant no. 257859 (ROBUST integrating project) and Science Foundation Ireland (SFI) under grant no. SFI/08/CE/I1380 (LION-2).

[1] J. Kunegis and J. Preusse, “Fairness on the web: alternatives to the power law,” in WebSci, 2012, pp. 175–184.

[2] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.

[3] J. Chan, C. Hayes, and E. M. Daly, “Decomposing discussion forums and boards using user roles,” in ICWSM, 2010.

[4] M. Karnstedt, T. Hennessy, J. Chan, P. Basuchowdhuri, C. Hayes, and T. Strufe, “Churn in social networks,” in Handbook of Social Network Technologies and Applications, B. Furht, Ed. Springer US, 2010, pp. 185–220.

[5] M. Karnstedt, T. Hennessy, J. Chan, and C. Hayes, “Churn in social networks: A discussion boards case study,” in SOCIALCOM ’10, 2010, pp. 233–240.

[6] M. Karnstedt, M. Rowe, J. Chan, H. Alani, and C. Hayes, “The effect of user features on churn in social networks,” in WebSci, 2011.

[7] Y. He, C. Lin, and A. E. Cano, “Online sentiment and topic dynamics tracking over the streaming data,” in SocialCom/PASSAT, 2012, pp. 258–266.

[8] I. Guy, I. Ronen, and A. Raviv, “Personalized activity streams: sifting through the ‘river of news’,” in RecSys ’11, 2011, pp. 181–188.

[9] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi, “Bad news travel fast: A content-based analysis of interestingness on Twitter,” in WebSci, 2011.

[10] H. Hromic, M. Karnstedt, M. Wang, A. Hogan, V. Belák, and C. Hayes, “Event Panning in a Stream of Big Data,” in LWA Workshop on Knowledge Discovery, Data Mining and Machine Learning (KDML), 2012.

[11] V. Belák, M. Karnstedt, and C. Hayes, “Life-cycles and mutual effects of scientific communities,” Procedia - Social and Behavioral Sciences, vol. 22, no. 0, pp. 37–48, 2011.

[12] V. Belák, S. Lam, and C. Hayes, “Cross-community influence in discussion fora,” in ICWSM, 2012.

[13] V. Belák, S. Lam, and C. Hayes, “Towards maximising cross-community information diffusion,” in Advances in Social Networks Analysis and Mining (ASONAM ’12), 2012, pp. 171–178.

[14] C. Boden, M. Karnstedt, M. Fernandez, and V. Markl, “Large-Scale Social-Media Analytics on Stratosphere,” in International World Wide Web Conference (WWW) - Demo Track, 2013.

[15] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: a system for large-scale graph processing,” in SIGMOD ’10, 2010, pp. 135–146.

Marcel Karnstedt received his Diploma (M.Sc.) from the Martin-Luther-Universität Halle-Wittenberg, Germany, at the end of 2003. From January 2004 to February 2009 he worked as a research associate and teaching assistant with the Databases & Information Systems Group (head: Prof. Dr.-Ing. habil. Kai-Uwe Sattler) at the Ilmenau University of Technology, Germany, where he also completed his Ph.D. (summa cum laude). Since March 2009 he has been affiliated with the Digital Enterprise Research Institute (DERI), National University of Ireland, Galway (NUIG). He is a member of the Unit for Information Mining and Retrieval (UIMR) and started as a Postdoctoral Researcher in the CLIQUE project on analyzing and visualizing large graphs and networks, specifically social networks and biological networks. Since December 2009 he has also held an adjunct lectureship at NUIG. Since November 2010, he has been employed as a Senior Postdoc and is currently responsible for DERI's part of the ROBUST project. Further, he contributes to the tasks of query processing and sensor mining in the SPITFIRE project, which aims at combining the "Internet of Things" with the "Web of Things".