Modelling and analysis of user behaviour in online communities

Matthew Rowe
School of Computing and Communications
Lancaster University, UK

Miriam Fernandez and Harith Alani
Knowledge Media Institute
Open University, UK

Online communities generate major economic value and currently form pivotal parts of corporate expertise management, marketing and product support. Exploiting the full value of these communities, as well as maintaining their growth, popularity and adoption, requires analysis methods that can help community owners and managers to monitor and understand the dynamics of their communities. Of particular importance is understanding the behaviour that users exhibit in online communities, since changes in this behaviour could affect the utility of the community. In this document we summarise the work produced within the ROBUST project to represent the behaviour of a community of users using numeric features and to discover the types of behaviour, or roles, that users assume within online communities. We explain the process of extracting behavioural features, the combined approach of clustering and role label derivation through which we identify the roles present within a given online community platform (using IBM Connections as an example), and the rule-based methodology that we have implemented to infer the roles that users assume over time.

Online communities are now an integral part of the World Wide Web. They provide web users with an environment in which they can interact, discuss topics of interest, and seek answers to questions and support requests. Such is the utility of online communities that companies now host discussion and support forums for their products and services. A recent McKinsey report estimates that between 900 billion and 1.3 trillion dollars in annual value could be obtained if social media were fully exploited.

Exploiting this knowledge requires the development of analysis tools that can help to understand and manage these communities as well as the social and economic objectives of their users, providers and managers. Of particular relevance is understanding users' actions and interactions with other community members, and thus how users behave online. Identifying the behavioural patterns or roles that emerge from the community (e.g. experts, leaders, ignored users, etc.) as well as assessing the distribution of users assuming different roles (i.e. the role composition of the community) can provide community managers with insights into the behavioural dynamics of their community, allowing for the prediction of community evolution. For instance, in a technical support community an increase in novice users who ask many questions, coupled with a decrease in expert or knowledgeable users who can answer those questions, could lead to an increase in unresolved problems. Understanding and identifying these types of phenomena can help community managers to better maintain the health of their communities.

In this document we summarise the work produced within the ROBUST project to represent and discover the types of behaviour, or roles, that users assume within online communities. The approach comprises three main stages: a) behaviour modelling, b) role mining, and c) role inference. The first stage, behaviour modelling, characterises the behaviour of users in terms of numeric features. The second stage, role mining, derives the roles that emerge from the community. It is based on statistical clustering: community users are segmented into distinct clusters and each cluster is then aligned with a role. From this alignment, rules are constructed that represent the behaviour characterising each particular role. The final stage, role inference, uses the feature representation of the users' behaviour together with the extracted rules to derive the community's role composition over time. This allows analyses to be made as to how the role composition correlates with community activity and how communities differ.

We illustrate this process by applying it to the IBM industrial test bed provided by the ROBUST project. The IBM employee test bed represents the use of online communities within an enterprise for sharing information and expertise between employees. With more than 386,000 employees, the challenge for IBM and its internal business communities is to maintain and exploit community-based expertise so that employees find the right answers and the right people in an efficient way. These internal online communities are created via IBM Lotus Connections and include person profiles, collaborative bookmarking, a wiki, blogs, file sharing, and a discussion forum. Note that this definition of community differs from previous works [9] [10], in which the boundaries of a community are not constrained by a software platform but change dynamically in accordance with the interactions of the users. Concrete details of the proposed approach, and of its application to the second ROBUST industrial use case, SAP Community Network, can be found in [7].

The remainder of the paper is structured as follows: Section 2 presents related work. Section 3 presents the proposed behaviour analysis approach and its application to the IBM test bed. Section 4 presents the conclusions and future work.

Social media users exhibit diverse understandings, preferences, habits, communication, relations and interactions (i.e. different behaviours). While behaviours are always context dependent, repeated interactions and agreements across practices (i.e. repeated behavioural patterns) can be translated into certain behavioural roles (information seekers, broadcasters, lurkers, etc.). A significant amount of work has been done in the last few years to understand the prominent behaviours in social networks and how to classify users under behavioural roles. Role composition (or role identification) approaches can be divided into two general methodological families: interpretive analyses [1] and structural analyses [3][4][6][5]. Interpretive analyses employ methods like ethnography, content analysis, and surveys to capture behaviours and relations within groups. While highly useful in identifying and understanding important social roles and the context in which these roles develop, interpretive studies are very difficult to reproduce and compare across social networking sites.

On the other hand, structural analysis approaches use formal methods like clustering or network analysis to identify relevant roles within the community. These approaches differ in their initial assumptions and in the methodology selected for the analysis. Approaches like [3] assume the existence of a pre-defined set of roles. In contrast, our approach aims to identify the roles that emerge from the community under analysis. Another key difference between our approach and similar works [5] [4] is that the role identification step is supported by empirical data and not only by human observation of behavioural clusters. Additionally, while some of the aforementioned works focus on identifying the key contributors of the community, we aim to investigate the complete role composition of a community.

This section presents specific details of the three different stages that compose the proposed behaviour analysis approach.

3.1 Behaviour Modelling
According to current literature the behaviour that users exhibit within differing types of online communities (e.g. discussion forums, question-answering platforms) can be described, in general, using six dimensions:
- Focus dispersion: the distribution of a user's contributions across different topics.
- Popularity: the number of users who reply to, or follow, a given user. This defines the popularity of a user within a given community.
- Engagement: the number of users whom a given user replies to, or follows. This indicates the extent to which the user communicates with the community.
- Initiation: the extent to which a user starts discussions or contributes new content to the community.
- Contribution: the extent to which a user contributes to the discussions initiated by other users in the community.
- Content quality: the quality of the content that a user shares with the community.
For each individual community, these dimensions are aligned to particular features of the community. For example, in the case of the IBM communities, explicit relations among users (e.g. the "friend" relation) are not created. Features like popularity or engagement are therefore extracted from the community's reply graph. Popularity, for example, is computed as the number of users that interact with or reply to a given user.
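As an illustration of how popularity and engagement could be read off a reply graph, the following is a minimal sketch rather than the implementation used in the project. It assumes the reply graph is available as a list of (replier, recipient) pairs; the function name, the input format and the decision to ignore self-replies are assumptions made for this example.

```python
from collections import defaultdict


def popularity_and_engagement(replies):
    """Compute popularity and engagement per user from a reply graph.

    replies: iterable of (replier, recipient) pairs (assumed input format).
    Popularity = number of distinct users who replied to the user.
    Engagement = number of distinct users the user replied to.
    """
    repliers_of = defaultdict(set)   # user -> distinct users who replied to them
    replied_to_by = defaultdict(set)  # user -> distinct users they replied to
    for replier, recipient in replies:
        if replier == recipient:
            continue  # assumption: self-replies carry no social signal
        repliers_of[recipient].add(replier)
        replied_to_by[replier].add(recipient)

    users = set(repliers_of) | set(replied_to_by)
    return {
        user: {
            "popularity": len(repliers_of[user]),
            "engagement": len(replied_to_by[user]),
        }
        for user in users
    }
```

Counting distinct users rather than individual replies follows the definitions above, which speak of "the number of users" on either side of the reply relation.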

The output of this phase is the representation of each user's behaviour as a six-dimensional vector of numeric features (where each of the previously described dimensions constitutes an element of the vector). Note that to track behavioural changes over time, the same user is represented by different vectors, each of them describing the user's behaviour at a particular time step. For the experiments reported in this paper, features are extracted using a window of six months of data. The window moves in one-week intervals to compute the behaviour of users over time.
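The sliding window described above can be sketched as follows. This is an illustrative example only: the function name is hypothetical, and approximating six months as 182 days is an assumption, since the text does not state how the window length is measured.

```python
from datetime import date, timedelta


def sliding_windows(start, end, window_days=182, step_days=7):
    """Yield (window_start, window_end) pairs, with window_end exclusive.

    window_days=182 approximates six months (an assumption);
    step_days=7 is the one-week step mentioned in the text.
    """
    window = timedelta(days=window_days)
    step = timedelta(days=step_days)
    current = start
    # Only emit windows that fit entirely inside the available data.
    while current + window <= end:
        yield current, current + window
        current += step


# Example: each window would be passed to the feature extraction step,
# producing one behaviour vector per user and per window.
for window_start, window_end in sliding_windows(date(2011, 1, 1), date(2011, 7, 16)):
    print(window_start, "to", window_end)
```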

3.2 Role Mining
Distinct types of social systems contain certain community roles: for example, discussion message boards contain conversation-driven roles, while microblogging platforms contain celebrity users. Decisions must be made as to what roles to monitor in a given community and whether those roles are appropriate. The role mining process aims to identify the particular roles that emerge from the community under analysis.

To derive the roles on the platform we partition users (previously represented as a six-dimensional vector of numeric features) into distinct clusters where each cluster is intended to represent a particular behaviour and therefore a concrete role. Partitioning is performed by running three different unsupervised clustering algorithms (EM, K-Means and Hierarchical clustering) over a tuning segment selected from the community's data. For our experiments this segment comprises six months of data. To select the appropriate number of clusters we iteratively increase the number of clusters used by the algorithms and record the silhouette coefficient to assess the separation and cohesion of the produced partitioning. We finally select the number of clusters and clustering model that maximise the silhouette coefficient.
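The model selection step can be illustrated with a small, dependency-free sketch. It assumes that the candidate partitionings (one list of cluster labels per algorithm and cluster count) have already been produced by a clustering library of choice; the function names and the use of Euclidean distance are assumptions for this example, not details taken from the experiments.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points (Euclidean distance assumed)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_distance(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes a score of 0
        a = mean_distance(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_distance(i, members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def select_partitioning(points, candidates):
    """candidates: dict mapping a description, e.g. ("k-means", 3), to a label list."""
    scores = {name: silhouette_coefficient(points, labels)
              for name, labels in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

In practice the candidates dictionary would hold one entry per combination of algorithm (EM, K-Means, hierarchical) and number of clusters, and the entry with the highest score would be kept.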

Fig. 1. Boxplots and level-mappings of the feature distributions for each of the 12 clusters induced for the IBM test bed. The focus dispersion, engagement, initiation and popularity features are denoted by their initials in the level-mappings table (right-hand side of the figure).

Once clusters have been generated we need to identify which particular behaviour (i.e. which role) each cluster represents and assign to this behaviour a human-readable role label that managers can understand and interpret. Role label derivation first involves inspecting the distributions of behaviour feature values in each cluster and aligning these distributions with a level mapping (i.e. low, mid, high) by using equal-frequency binning. This enables the conversion of continuous dimension ranges into discrete values. For the IBM use case, Figure 1 shows the distribution of the focus dispersion, engagement, initiation and popularity dimensions for each of the 12 identified clusters (left part of the figure) and the feature-to-level mapping generated (right part of the figure). For example, users belonging to cluster 0 are characterised by low levels of focus dispersion, initiation and popularity and a medium level of engagement.
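Equal-frequency binning into the three levels can be sketched as below. This is a minimal example under stated assumptions: the function names are hypothetical, a value equal to a boundary is placed in the higher level, and how a whole cluster distribution is summarised before being mapped to a level (for instance by its median) is not specified in the text and is left to the caller.

```python
from bisect import bisect_right

LEVELS = ("low", "mid", "high")


def equal_frequency_boundaries(values, n_bins=len(LEVELS)):
    """Return the n_bins - 1 cut points that split the sorted values into
    bins holding (as near as possible) the same number of observations."""
    if not values:
        raise ValueError("cannot derive bins from an empty list")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def to_level(value, boundaries):
    """Map a numeric value to low/mid/high given the cut points.
    Assumption: a value equal to a cut point falls into the higher level."""
    return LEVELS[bisect_right(boundaries, value)]


engagement_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
cuts = equal_frequency_boundaries(engagement_values)
print(cuts, [to_level(v, cuts) for v in engagement_values])
```

With heavily tied values the bins cannot be exactly equal in size, which is a general property of equal-frequency binning rather than of this sketch.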

Fig. 2. Maximum-entropy decision tree used to segment the clusters into minimal-distance paths.

In order to derive the role rules for each cluster we use a maximum-entropy decision tree to divide the clusters into branches that maximise the dispersion of dimension levels. Concrete details of this step can be found in [7].
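One plausible reading of "maximising the dispersion of dimension levels" is to split, at each node, on the feature whose levels are most evenly spread across the remaining clusters, as measured by Shannon entropy. The sketch below illustrates only that selection step under this assumption; the exact procedure is described in [7] and may differ. The function names and the example level mappings are hypothetical.

```python
import math
from collections import Counter


def level_entropy(levels):
    """Shannon entropy (in bits) of a list of levels such as ["low", "mid"]."""
    counts = Counter(levels)
    total = len(levels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def most_dispersed_feature(cluster_levels):
    """cluster_levels: {cluster_id: {feature: level}}.
    Returns the feature whose levels have the highest entropy across clusters
    (assumed splitting criterion; ties are broken by alphabetical feature order)."""
    features = sorted(next(iter(cluster_levels.values())))
    return max(
        features,
        key=lambda f: level_entropy([levels[f] for levels in cluster_levels.values()]),
    )


# Hypothetical feature-to-level mappings for three clusters.
example = {
    0: {"engagement": "mid", "initiation": "low"},
    1: {"engagement": "low", "initiation": "low"},
    2: {"engagement": "high", "initiation": "low"},
}
print(most_dispersed_feature(example))
```

Here initiation has the same level in every cluster and so cannot separate them, whereas engagement takes a different level in each cluster and is selected.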

Figure 2 shows how the clusters are separated, from a single complete grouping at the root to one cluster in each leaf (or merged clusters where the feature-to-level mappings are identical, as for clusters 3 and 11 and clusters 4 and 7). The path from the root to the leaf is then used to define the characteristics, and therefore the rules, defining each role. For example, cluster 1 represents a role that has low engagement and low focus dispersion (if engagement=low and focus dispersion=low then role=1). The generated rules are then applied to community users to infer their roles in the community at a particular point in time. This is known as the role inference process (explained in the next section).

Once this tree is generated, labels can be assigned to describe the roles in accordance with the features and levels that define them. This naming method, applied to the SAP use case [7], allows for an easier comparison of roles across platforms. In the particular case of IBM, community managers suggested their own preferred labels for the identified roles. The final assignment of labels to particular behaviour roles can be seen in Table 1. As the table shows, our discussions with IBM community managers led to the removal of two further roles: no distinction is made between clusters 0 and 1, or between clusters 2 and 10, leaving a total of 8 identified roles for IBM Communities.

Table 1. Roles identified for IBM communities.

3.3 Role Inference
Armed with the role rules extracted during the role mining process we can analyse how users change their behaviour over time and what common patterns exist in terms of role transitions. Figure 3 presents an overview of our approach for deriving the role composition of a community over time. We begin by taking all the users within a community over a given time segment (a six-month sliding window) and calculating the features that describe the behaviour of each community user. Next we take the values of each feature and derive bins using equal-frequency binning to determine the concrete boundaries of the feature levels. These boundaries are then substituted into the previously extracted rules, and the rules are applied to determine the role adopted by each user. For example, in the case of IBM Communities, Lurkers are identified by the following rule: if engagement=low and initiation=low then role=Lurker. Using the bins derived for each feature we can assign interval boundaries to the feature levels, for instance: if engagement < 0.5 and initiation < 0.3 then role=Lurker.
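The rule application step can be sketched as follows. This is an illustrative example, not the project's rule engine: the data structures and function names are assumptions, the lower boundaries (0.5 and 0.3) come from the Lurker example above, and the upper boundaries (0.8 and 0.6) are hypothetical values added only so that three levels exist.

```python
from bisect import bisect_right

LEVELS = ("low", "mid", "high")

# Cut points per feature for the current window. 0.5 and 0.3 are taken from
# the Lurker example in the text; 0.8 and 0.6 are hypothetical.
BOUNDARIES = {"engagement": [0.5, 0.8], "initiation": [0.3, 0.6]}

# Each rule pairs a set of required feature levels with a role label.
RULES = [({"engagement": "low", "initiation": "low"}, "Lurker")]


def to_level(value, boundaries):
    """Values strictly below the first cut point are 'low', and so on."""
    return LEVELS[bisect_right(boundaries, value)]


def infer_role(features, boundaries, rules, default=None):
    """Return the role of the first rule whose conditions all hold,
    or `default` when no rule matches."""
    levels = {
        name: to_level(value, boundaries[name])
        for name, value in features.items()
        if name in boundaries
    }
    for conditions, role in rules:
        if all(levels.get(feature) == level for feature, level in conditions.items()):
            return role
    return default


print(infer_role({"engagement": 0.2, "initiation": 0.1}, BOUNDARIES, RULES))
```

Because the boundaries are recomputed for every window, the same symbolic rule can correspond to different numeric thresholds at different time steps.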

Fig. 3. Overview of the approach for discovering the role composition of a community, and its repeated application over time.

Once every community user has been labelled with a role we can then derive a community's composition by the proportion of users that each role covers. The process of deriving the composition of a community can be repeated over time to detect changes in how the community evolves. An example of this process can be seen in Figure 4. This figure represents the role composition of a community at different time steps, and the evolution of the role composition over time, where each particular role is identified by a different colour.
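Deriving the composition from the labelled users amounts to computing the proportion of users per role. The sketch below assumes the inferred roles are held in a dictionary from user to role label; the function name and the role labels other than Lurker are hypothetical.

```python
from collections import Counter


def role_composition(user_roles):
    """user_roles: {user_id: role_label}.
    Returns {role_label: proportion of users holding that role}."""
    counts = Counter(user_roles.values())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {role: count / total for role, count in counts.items()}


# Repeating the computation per time window gives the evolution of the
# composition over time. The labels below are placeholders.
roles_by_window = {
    "window 1": {"u1": "Lurker", "u2": "Lurker", "u3": "Expert", "u4": "Lurker"},
    "window 2": {"u1": "Lurker", "u2": "Expert", "u3": "Expert", "u4": "Lurker"},
}
for window, user_roles in roles_by_window.items():
    print(window, role_composition(user_roles))
```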

Fig. 4. Example of the role composition evolution for one of the IBM Communities.

In this paper we have presented a three-stage approach to facilitate the process of community behaviour analysis through: a) the modelling of user behaviour, b) the mining of roles that are relevant to a given platform, and c) the inference of the roles that users adopt in a community over time. We have illustrated how this approach has been applied to one of the ROBUST industrial test beds, in this case IBM Connections, and how it can be used to monitor the role composition of a community over time. Correlations between this analysis and community health indicators (such as activity, number of users, etc.) can be used to predict how behavioural changes affect the health of the community. Examples of these predictions can be found in [8]. Our future work involves the application of behaviour analysis to particular community types to assess whether common behavioural patterns emerge from online communities that share goals and objectives (such as question-answering communities, message boards, or expertise communities).

The work of the authors was supported by the EU-FP7 project ROBUST (grant no. 257859). We would also like to thank IBM for the provision of the dataset for our analyses.

[1] S. A. Golder, J. Donath, Social roles in electronic communities, in: Association of Internet Researchers (AoIR) 5.0, 2004, pp. 19–22.

[2] J. Hautz, K. Hutter, J. Fuller, K. Matzler, M. Rieger, How to establish an online innovation community? The role of users and their innovative content, in: System Sciences (HICSS), 2010 43rd Hawaii International Conference on, 2010, pp. 1–11.

[3] R. Nolker, L. Zhou, Social computing and weighting to identify member roles in online communities, in: Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on, 2005, pp. 87–93.

[4] T. Zhu, B. Wang, B. Wu, C. Zhu, Role defining using behavior-based clustering in telecommunication network, Expert Syst. Appl. 38 (2011) 3902–3908.

[5] J. Chan, C. Hayes, E. M. Daly, Decomposing discussion forums and boards using user roles, in: ICWSM, 2010.

[6] S. Angeletou, M. Rowe, H. Alani, Modelling and analysis of user behaviour in online communities, in: L. Aroyo, C. Welty, H. Alani, J. Taylor, A. Bernstein, L. Kagal, N. Noy, E. Blomqvist (Eds.), The Semantic Web – ISWC 2011, Vol. 7031 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2011, pp. 35–50.

[7] M. Rowe, M. Fernandez, S. Angeletou, H. Alani, Community analysis through semantic rules and role composition derivation, Web Semantics: Science, Services and Agents on the World Wide Web (2012).

[8] M. Rowe, H. Alani, What makes communities tick? Community health analysis using role compositions, in: 2012 International Conference on Social Computing (SocialCom), IEEE, 2012.

[9] S. Fortunato, Community detection in graphs, Physics Reports 486 (3) (2010) 75–174.

[10] S. Papadopoulos, Y. Kompatsiaris, A. Vakali, P. Spyridonos, Community detection in social media, Data Mining and Knowledge Discovery 24 (3) (2011) 515–554, DOI: 10.1007/s10618-011-0224-z.

Dr Matthew Rowe
is a lecturer in Social Computing at the School of Computing and Communications of the University of Lancaster. Before joining Lancaster University Dr Rowe was a Research Associate at the Knowledge Media Institute, The Open University, where he was WP3 leader in the EU-funded Integrated Project ROBUST. Over the past 4 years Dr Rowe has published 35 peer-reviewed papers at several international conferences and journals such as ESWC, ISWC, WebSci, SocialCom and the Journal of Web Semantics. Dr Rowe is co-organiser of the Making Sense of Microposts workshop series. He has also been chair of the Social Web and Web Science track at ESWC 2012. Dr Rowe's research interests centre on the Social Semantic Web and how information can be processed and interpreted by machines via formal semantics.

Dr Miriam Fernandez
is currently a research associate at the Knowledge Media Institute, The Open University. She received her MSc and PhD from the Universidad Autonoma de Madrid, Spain. Her research is focused on the synergy of Information Retrieval, Semantic Web and Social Web technologies. She has participated in several European projects (aceMedia, Mesh, X-Media, SmartProducts, WeGov, ROBUST), published in top-level conference proceedings (ECIR, SIGIR, ESWC, ISWC) and journals (TKDE, TCSVT, JWS), and worked for one of the main companies in the search engine and Web development market (Google Zurich).

Dr Harith Alani
is a senior lecturer at the Knowledge Media Institute, The Open University, where he is heading a group specialising in Social Semantics and Web Science. Prior to joining KMi, Dr Alani was a senior research fellow at the School of Electronics and Computer Science, University of Southampton. Dr Alani has published more than 80 articles in various top-class journals and conferences, and has been involved as a principal investigator in several European and national research projects. Dr Alani is a frequent member of organising committees of leading conferences. He was a chair for the Semantic Data track at Hypertext 2012, the Semantic Web In-Use track at ISWC 2011, and the Sensor Web track at ESWC 2011. He is programme co-chair of ISWC 2013 and WebSci 2013. Dr Alani's research interests include social semantics, web science, social computing, social media analysis, ontology searching and ranking, offline-online social network tracking and analysis, and eGovernment 2.0.