1 Introduction

1.1 Context

Today social networks are the most popular communication channels for users looking to share their experiences and interests. They host considerable amounts of user-generated material about a wide variety of real-world events of different types and scales [2]. Social media has a significant impact on our daily lives: people share their opinions, stories, and news, and broadcast events through it. Monitoring and analyzing this rich and continuous flow of user-generated content can provide valuable information, enabling individuals and organizations to acquire insightful knowledge [8].

Due to the immediacy and rapidity of social media, news events are often reported and spread on Twitter, Instagram, or Facebook ahead of traditional news media [15]. With the fast growth of social media, Twitter has become one of the most popular platforms for people to post short messages. Events like breaking news can easily draw people’s attention and spread rapidly on Twitter. Therefore, the popularity and significance of an event can be measured by the volume of tweets covering the event. Furthermore, the relevant tweets are also indicators of opinions and reactions to events [6].

Obtaining demographic information about social media users, their interests and their behaviour is the main concern of user profiling, which in turn can be used to understand more about users and improve their satisfaction [16].

Various research works have been conducted on user profiling, for instance in the field of recommender systems. However, the number of studies on the impact of cultural and art events in social media is rather limited, and they focus on English-only content, overlooking other languages. Considering this, we propose a domain-specific approach to profile social media users engaged in a cultural or art event, regardless of their language and location.

1.2 Problem Statement

In this study, we intend to answer the following questions:

  • What are the topics of interest of the social media users who published their experiences or opinions about a cultural or artistic event?

  • What demographic features can be revealed about these users?

  • What are the predicted level of engagement and areas of interest of prospective users approaching the event?

To tackle the above questions, we suggest an approach that addresses two core aspects:

  1. User profiling: the process of extracting user features, raising the level of abstraction of the discussed concepts, and deriving the topics of interest, so as to characterize the interest domains and behaviour of social media users who share their opinions about a cultural or artistic event.

  2. User interest prediction: the anticipation of whether a social media user will be attracted by the current or similar events in the (near) future and, if so, with what kind of interest and background.

1.3 Proposed Solution

The first step of our approach is to collect the required data about an event from social media. After cleaning and transforming the collected data into a proper format, we perform data analysis at different levels. The first analysis step is to extract the main topics from the dataset using topic modeling techniques. After that, we apply different clustering algorithms to the outputs of topic modeling and employ cluster validation techniques to evaluate the obtained results. Ultimately, using the outcomes of the analysis, we employ a classification method to anticipate the interest areas of future users in similar events.

Notice therefore that in our specific setting we cluster users by topics of interest, and not merely by the lexical similarity of the words they use.

1.4 Structure of the Paper

The paper is organized as follows: Sect. 2 discusses the related work; Sect. 3 describes our approach, with practical implementation details reported in Sect. 4; Sect. 5 presents the case study and Sect. 6 reports the outcomes of the analysis. Finally, Sect. 7 concludes and outlines the future work.

2 Related Work

Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in the data. The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns that are deemed useful for increasing knowledge [9].

Past works have found that content extracted from social media is a meaningful reflection of the human behind the social network account posting that content. Works like [26, 28] mainly focus on clustering web users, while studies such as [1, 10, 22] specifically address clustering of people in social networks based on textual and non-textual features. There are also several works that address user profiling in online social networks. For instance, in [3] the authors propose a method to select experts within the population of social networks, according to the information about their social activities.

Other works [18, 21] focus on event analysis in social media. The former analyzes the resulting heterogeneous network and uses it to cluster posts by topics and events, while the latter analyzes and compares temporal events, ranks sightseeing places in a city, and studies the mobility of people using geo-tagged photos. Some works [11] leverage Twitter lists to extract the topics that users talk about, and [25] introduces validation methods to evaluate clustering results. All these works have delivered new solutions to the social media analysis field and investigated the problems of profiling users and analyzing events with different data mining approaches.

In comparison to the mentioned studies, our research proposes a multi-step approach that aims at clustering social media users based on their topics of interest, extracted from their distinct features. We design a specific analysis pipeline for art events and show it at work on a real case study.

3 Approach

The approach presented in this paper defines a specific KDD process that comprises data enrichment and preprocessing steps, followed by data mining phases which lead to significant knowledge extraction results in our scenarios. We propose the pipeline reported in Fig. 1: first, we extract all the required data from the social media platforms; next, we transform and enrich the data into proper formats for the subsequent analysis and store it; finally, data analysis techniques are applied to the clean, enriched and preprocessed data. The next sections describe each phase of the process in detail.

Fig. 1. Content analysis pipeline for art and culture events

3.1 Data Extraction

In this phase raw data is extracted by querying the social network API with an appropriate query that retrieves information on the event of interest. We concentrate on Twitter as a good representative of social content in the context of live events and participation. Therefore, we exploit the Twitter API for data extraction [24].
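As an illustrative sketch only (not the authors' original extraction script), this step could be carried out in R with the rtweet package; the query string and the column names below are assumptions:

```r
## Hypothetical sketch of the extraction step using the rtweet package.
## The query and column names are assumptions, not the original setup.
library(rtweet)

# Collect tweets mentioning the event (hypothetical hashtag)
tweets <- search_tweets("#TheFloatingPiers", n = 18000, include_rts = TRUE)

# Retrieve the profiles (bio, language, follower counts, ...) of the engaged users
users <- lookup_users(unique(tweets$screen_name))
```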

Table 1. Data schema

3.2 Data Preprocessing

Since the collected raw data is incomplete and inconsistent, as described next, we need to apply preprocessing techniques to prepare an appropriate dataset for the subsequent analyses and experiments. The preprocessing phase consists of the following main steps:

  • Text Normalization: Textual properties (especially in social media) include a great deal of non-standard characters, punctuation, symbols, stop words, etc. that must be removed to make the data clean and consistent. Furthermore, it is essential to reduce derived words to their word stem or root form (an illustrative sketch of this step follows the list).

  • Language Identification and Translation: Unsurprisingly, social media users do not always tweet in English, so text in different languages is to be expected. The majority of research works only focus on English content, and thus the importance of language as a demographic feature is overlooked. Hence, with the aim of making the data more coherent and unambiguous, and of expanding the coverage of the approach to world-wide scenarios, we apply language detection and translation into English, for homogenisation.

  • Gender Detection: Twitter does not provide users’ gender in its user objects. Since we consider this a crucial demographic feature, we enrich the data with gender information.

  • Data Loading: In this phase the clean and enriched data is stored in an appropriate format for large-scale analysis (CSV file).
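A minimal sketch of the text normalization step, assuming the tm package, a standard English stop-word list and the Porter stemmer (the actual configuration used in the study may differ):

```r
## Minimal normalization sketch with the tm package; stop-word list and stemmer
## are assumptions, not necessarily the authors' exact setup.
library(tm)
library(SnowballC)   # stemmer backing stemDocument()

corpus <- VCorpus(VectorSource(users$description))      # one document per user bio
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)
```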

3.3 Data Analysis Overview

In order to avoid manual tagging of data, which would be costly and not scalable across multiple experiments or usage scenarios, we opt for unsupervised techniques, namely clustering and topic modeling.

Since we want to profile users based on the texts they share on Twitter, first we need to create a Document-Term Matrix (DTM), which records the frequency of the terms occurring in a collection of documents. It is noteworthy that, from now on, our “document” of interest is the social network user. In practice, a document corresponds to one textual feature of a user, namely: personal biography, hashtags used in the tweets, text of the tweets posted, and Twitter lists the user belongs to (see Table 1). Therefore, each entry of the DTM contains the frequency with which a term occurs in a document.

As one can easily understand, this matrix is very large and extremely sparse. With the objective of getting a more high-level and understandable representation of the documents, we apply topic extraction by means of Latent Dirichlet Allocation (LDA) on the matrix. The output of LDA is also a matrix, which assigns a probability to each pair of document and extracted topic: in practice, we obtain the probability that a document (i.e., a user) is interested in a given topic. We then use this LDA output for clustering users.
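For illustration, the DTM described above could be built as follows (a sketch assuming the `corpus` object from the preprocessing sketch; the sparsity threshold is an assumption):

```r
## Sketch: user-level document-term matrix; the sparsity threshold is arbitrary.
library(tm)

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, sparse = 0.99)    # drop extremely rare terms
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]       # LDA requires non-empty documents
```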

We also define a prediction phase, where we use a classification method (specifically, Decision Trees), to create a model that can anticipate whether a newcomer user might be interested in the event, and can predict the topic(s) of interest for that user.

3.4 Topic Modeling

Topic models can help to organize and offer insights to understand large collections of unstructured text bodies [20]. They allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics [14]. The input data for topic models is a document-term matrix (DTM). The rows in this matrix correspond to the documents and the columns to the terms. The entry \(m_{ij}\) indicates how often the \(j^{th}\) term occurred in the \(i^{th}\) document.

In this study, the topic modeling phase consists of two steps:

  • Topic Extraction: To discover the abstract topics that occur in the collection of our documents, we apply a topic model, namely Latent Dirichlet Allocation (LDA) with the Gibbs sampling algorithm [12]. For fitting the LDA model to a given document-term matrix, the number of topics needs to be fixed a priori. Because the number of topics is in general not known, models with several different numbers of topics are fitted and the optimal number is determined in a data-driven way [14]. The maximum of the Deveaud et al. (2014) metric and the minimum of the Cao et al. (2009) metric are considered optimal and are used in this study to identify the number of topics for LDA. The output of this model is a topic probability matrix that contains the probability of each topic associated with each document. In practice, this tells us the probability that a given user is interested in a given topic.

  • Dimension Reduction: Since the topics extracted by LDA are possibly correlated, we employ Principal Component Analysis (PCA) to convert them into a set of linearly uncorrelated components. This transformation of the data to a lower-dimensional feature space not only reduces the time and storage required, but also makes data visualization and interpretation easier. An illustrative sketch of both steps follows this list.
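The sketch below illustrates how the two steps could be implemented in R with the topicmodels and ldatuning packages; the candidate topic range, the random seed and the number of retained components are assumptions made for illustration, not the original code:

```r
## Illustrative sketch of topic extraction and dimension reduction; candidate
## topic range, seed, and retained components are assumptions.
library(topicmodels)
library(ldatuning)

# Data-driven choice of the number of topics:
# Cao et al. (2009) is minimized, Deveaud et al. (2014) is maximized.
scores <- FindTopicsNumber(
  dtm,
  topics  = 2:15,
  metrics = c("CaoJuan2009", "Deveaud2014"),
  method  = "Gibbs",
  control = list(seed = 1234)
)
FindTopicsNumber_plot(scores)

# Fit LDA with Gibbs sampling for the chosen number of topics (6 in the case study)
lda <- LDA(dtm, k = 6, method = "Gibbs", control = list(seed = 1234))
topic_probs <- posterior(lda)$topics     # document x topic probability matrix

# Decorrelate and reduce the topic space with PCA
pca <- prcomp(topic_probs, scale. = TRUE)
summary(pca)                             # cumulative proportion of variance
reduced <- pca$x[, 1:3]                  # components passing the 95% threshold
```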

3.5 Clustering

Clustering aims to organize a collection of data items into clusters, such that items within a cluster are more “similar” to each other than they are to items in the other clusters [13]. In this work, this phase of the pipeline is divided into two steps:

  • Cluster Analysis: In order to profile social media users based on the texts they share about a specific event, different clustering algorithms, namely K-means, hierarchical clustering, and DBSCAN, are applied and compared, in order to select the best option in our specific setting.

  • Cluster Validity: Once cluster analysis has been performed, it is crucial to evaluate how good the resulting clusters are. The evaluation indices applied to judge various aspects of cluster validity are traditionally classified into three types: unsupervised (internal), supervised (external), and relative [23].

    In this study, the Silhouette Coefficient and Dunn’s Index as internal indices, and Entropy as an external criterion, are selected in order to evaluate and compare different aspects of the clustering results. An illustrative sketch of the clustering and validation steps follows this list.
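A possible R implementation of these two steps is sketched below; the number of clusters and the DBSCAN parameters are purely illustrative assumptions:

```r
## Sketch of clustering and internal validation on the PCA-reduced topic space
## (`reduced` from the previous sketch); cluster counts and DBSCAN parameters
## are illustrative assumptions.
library(cluster)   # silhouette()
library(clValid)   # dunn()
library(dbscan)    # dbscan()

d <- dist(reduced)

# K-means, hierarchical (Ward linkage) and DBSCAN clustering
km        <- kmeans(reduced, centers = 3, nstart = 25)
hc        <- hclust(d, method = "ward.D2")
hc_labels <- cutree(hc, k = 3)           # cut the dendrogram into three clusters
db        <- dbscan(reduced, eps = 0.5, minPts = 5)

# Internal validation: mean silhouette width and Dunn index (higher is better)
mean(silhouette(hc_labels, d)[, "sil_width"])
dunn(distance = d, clusters = hc_labels)
```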

3.6 Prediction of User Interest

In order to guide event planning professionals to market, plan and implement their events more effectively, we propose to predict the category, or interest area, of potential new users who might be involved in similar cultural or art events in the future. Accordingly, besides the outlined unsupervised techniques employed to profile users, we opt for a supervised learning algorithm, namely decision trees, which builds a classification model for the prediction of new users’ interests, based on the user categories obtained from clustering. Decision tree learning is a typical instance-based inductive algorithm, which infers classification rules, represented as decision trees, from a set of unordered and irregular instances [5].

To build the decision tree, our dataset is first divided into training and test sets, with the training set used to build the model and the test set used to validate it. Since our target is to predict the interest domain of new users in terms of user category, the tree is fed with the training dataset, to which the category of each user has been attached and used as the target feature.

In addition, the input variables comprise several features of the user, namely: the topic probabilities defined for each textual feature of the user (coming from the topic probability matrix generated by the LDA analysis), i.e., the biography, hashtags, tweets and lists; gender; language; number of followings and followers; and number of tweets. Consequently, the decision tree generates a set of prediction rules that determine new users’ interest areas with regard to a cultural or art event, based on the values of these features.
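As an illustrative sketch (not the original implementation), training and prediction could be performed with the rpart package; the `profiles` data frame and its column names are hypothetical placeholders for the features listed above:

```r
## Hypothetical sketch of the prediction step with rpart; `profiles` and its
## column names are placeholders for the features described in the text.
library(rpart)

set.seed(42)
idx   <- sample(nrow(profiles), size = 0.8 * nrow(profiles))   # 80:20 split
train <- profiles[idx, ]
test  <- profiles[-idx, ]

# Target: the user category obtained from clustering; inputs: per-feature topic
# probabilities plus demographic and activity features
fit <- rpart(category ~ bio_score + hashtag_score + tweet_score + list_score +
               gender + language + followers + followings + statuses_count,
             data = train, method = "class")

pred <- predict(fit, newdata = test, type = "class")
mean(pred == test$category)    # simple accuracy estimate on the held-out set
```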

4 Implementation

In the preprocessing phase, we use the Yandex API [27] for language identification and for translating the stems into English, and the NamSor API [19] for gender detection. In the preliminary phases we store the data in a relational database for fast sequential preprocessing, and then we generate CSV files to be fed to the analysis phases.

All analyses, statistics, evaluations and result representations are done in R, a flexible statistical programming language and environment that is open source and freely available for all mainstream operating systems [17].

5 Case Study

For sixteen days, from June 18 through July 3, 2016, Lake Iseo in Italy was reimagined by the world-renowned artists Christo and Jeanne-Claude. More than 100,000 square meters of shimmering yellow fabric, carried by a modular floating dock system of 220,000 high-density polyethylene cubes, undulated with the movement of the waves as The Floating Piers rose just above the surface of the water. Visitors were able to experience the work of art by walking on it from Sulzano to Monte Isola and to the island of San Paolo, which was framed by The Floating Piers [4] (see Fig. 2). More than 1.5 million people visited the installation in those two weeks.

We use this artistic event as a use case for our method. We extracted the social media content relevant to the event and applied the analysis pipeline over it. The dataset was obtained from Twitter over the period from June 10th to July 30th, 2016 and contains 14,062 tweets and 23,916 users. Figure 3 reports the total number of tweets, retweets, favorites and engaged users (per day), from one week before the start of the event until the end of July. As one can see, users tend to tweet about the installation in the early days of the event, while user engagement decreases dramatically afterwards.

Fig. 2. The Floating Piers by Christo and Jeanne-Claude

Fig. 3. Tweets, retweets, favorites and users timeline

According to the statistics, and unlike Instagram users, most Twitter users are not willing to specify the location of their published tweets. Therefore, we also extracted the Instagram posts (30,256 posts and 94,666 users) related to the event during the same time span and displayed the density of these posts on geographical plots in Fig. 4.

As one can see, the density of posts is strongly related to their locality: most Instagram posts have been published near the main venue of the event.

Fig. 4. Density of Instagram posts in different coordinates

6 Results and Discussion

In this section, the most significant results of the experiment over the case study are shown and discussed.

6.1 User Clustering

As discussed earlier, we apply and compare different clustering algorithms, namely K-means, hierarchical clustering and DBSCAN. Each of them is separately applied to our data collections. Each collection consists of documents corresponding to one textual property of the users, i.e., bio, hashtags, tweets or lists.

Subsequently, to achieve the most accurate results, different cluster validity measures are employed. Among the existing validation metrics, Silhouette width, Dunn index and Entropy are selected to evaluate the clustering results. Silhouette width and Dunn index combine measures of compactness and separation of the clusters; thus, algorithms that produce clusters with a high Dunn index and a high Silhouette width are more desirable. Entropy, on the other hand, measures the amount of disorder in a vector, so smaller entropy values indicate less disorder in a clustering, i.e., a better clustering [7].

Table 2 reports the values of these measures for each algorithm applied to each feature collection. It should be noted that the number of clusters in each experiment is either determined by the clustering algorithm itself, as in DBSCAN, or calculated through methods such as the Elbow method for K-means. According to this table, hierarchical clustering (with three clusters) can be considered the best algorithm, producing better results than the other two. Furthermore, among all four examined textual features, the users’ biography (Bio) performs best. Consequently, as the table suggests (bold numbers), from now on we only focus on hierarchical clustering performed on the users’ biographies.

Table 2. Evaluation of clustering results
Fig. 5. Word network representation of top terms in each topic

6.2 Applying Topic Modeling

As mentioned earlier, in this study the input data for the clustering models is a topic probability matrix that contains the probability of each topic associated with each document (user). This matrix is generated by applying LDA on the document-term matrix, in which each row is a user’s biography and each column is a term. As indicated in Sect. 3.4, the Deveaud et al. (2014) and Cao et al. (2009) metrics help determine the number of topics before LDA is applied. Accordingly, the optimal values of these metrics suggest 6 as the number of topics for LDA. With all required parameters set, LDA is applied and returns the topic probability matrix along with the top terms of each extracted topic, which are presented as a word network in Fig. 5.

Investigating these terms, it appears that the extracted topics are correlated and need to be transformed into a lower-dimensional set through the PCA procedure. The result of applying PCA on the topics is reported in Table 3.

Table 3. PCA quantitative results

As the table suggests, we only consider the first three principal components (topics), whose cumulative proportion of variance passes the 95% threshold. Consequently, in the clustering phase, we perform the hierarchical algorithm on the topic probability matrix, exploiting only these three PCA-selected topics.

Fig. 6. Dendrogram representation of Twitter users (Color figure online)

6.3 Cluster Hierarchy

As indicated in the previous sections, the hierarchical algorithm returns the most acceptable results. Its output is a dendrogram, illustrated in Fig. 6.

Unlike K-means, the hierarchical algorithm does not require the optimal number of clusters to be defined at the beginning. In this clustering algorithm, clusters are defined by cutting branches off the dendrogram. Various methods can be used to determine where to cut; we follow the convention that a dendrogram can be cut where the difference in merge heights is most significant.

To extract better insights over the situation, we report in Fig. 6 the three main clusters drawn in different colors. Each leaf in this tree represents a Twitter user engaged in the Floating Piers event through tweeting, retweeting or liking a post. According to this result, it can be concluded that nearly 60% of the users lie in the first cluster (green), over 35% in the second (blue) and the rest (about 5%) in the third cluster (red).

6.4 Cluster Labeling/User Profiling

Having all the user objects in each cluster, we are able to label the obtained clusters, or in other words to identify the categories of users. The five most frequent words published by the users about the event, along with the frequency of each word, are reported in Table 4. In this table, Cluster 1 refers to the biggest cluster (green), Cluster 2 to the second biggest cluster (blue) and Cluster 3 to the smallest cluster (red). It can be seen that the most frequent words in each cluster convey specific meanings. People in the first cluster mostly talk about “Travel” when introducing themselves in their Twitter bio, people in the second cluster are “Art” lovers, and people in the third cluster present themselves as “Technology” fans and social media marketing enthusiasts. Henceforth, we call the users in the first, second and third clusters Travel Lovers, Art Lovers and Tech Lovers, respectively.

Table 4. The five most frequent words and their frequency in each cluster
Fig. 7. Word network representation of top terms in each cluster; the thickness of the connections is proportional to the frequency of words in clusters

To depict a weighted list of the words used in the users’ bios in each cluster, we employ word networks and word clouds, which are visual representations of textual data. The word networks and word clouds related to the users’ bios in each cluster are illustrated in Figs. 7 and 8, respectively.

Fig. 8. Word cloud for each cluster
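For reference, such a word cloud could be produced in R with the wordcloud package; `bio_terms` below is a hypothetical data frame holding the terms and frequencies of one cluster (as in Table 4), not part of the original code:

```r
## Illustrative sketch: word cloud for a single cluster; `bio_terms` is a
## hypothetical data frame with one term and its frequency per row.
library(wordcloud)

wordcloud(words = bio_terms$term, freq = bio_terms$freq,
          min.freq = 2, max.words = 100, random.order = FALSE)
```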

6.5 Demographic Analysis - Language

We use demographic features such as language and gender to compare the users in the three clusters. As Fig. 9 shows, Italian is the most common language of the users in all three clusters, while second place belongs to English, followed by the sum of all the other languages (French, Dutch, etc.). As one can see, the language flows follow the flow of tweets in all three clusters and peak on the opening day of the event. The bias towards Italian is particularly evident in the travel lovers cluster, while it is weaker among the art lovers. This suggests that travelers visiting the event are mostly Italians, while people coming from abroad are not generic tourists but more specifically art lovers, who come on purpose for the event.

Fig. 9. Time series of posts by language for each cluster

6.6 Demographic Analysis - Gender

Figure 10 shows that the number of males who got involved in the Floating Piers outweighs the number of females, but the difference is not substantial and can be overlooked. In addition, since travel lovers are the largest group, the numbers of males and females are highest in this category. The presence of males is slightly higher among the art lovers.

Fig. 10. Time series of posts by gender for each cluster

6.7 Prediction of Interests of New Users

As mentioned in Sect. 3.6, we employ a decision tree to predict the possible interests of potential future users, based on the categories acquired from clustering the current users. There are two competing concerns: with less training data, the parameter estimates have greater variance; with less test data, the performance statistic has greater variance. Thus, we divide our dataset into training and test sets with a ratio of 80:20, such that neither variance is too high. Figure 11 gives an intuition of how the decision tree creates the prediction rules.

Fig. 11. The decision tree representation

In addition, the extracted rules from the tree are formulated as follows:

figure a

where the only relevant features identified by the decision tree are: the Bio_score, representing the topic probability for the biography feature; the Status_count, representing the total number of tweets of the user; and the Language.

In order to evaluate the decision tree, we use the test dataset, which yields an accuracy of 62%. For new users, we can then simply apply the above rules to assign them a category (Travel, Art or Tech) or to identify them as “not interested” in a similar event in the future.

Notice that the power of the solution, considering the obtained rules, lies in being able to classify a user as interested, and as engaged in travel, art or technology, essentially by looking at their biography.

7 Conclusion and Future Work

In this study, we proposed a multi-step approach that addresses user profiling and user interest prediction regarding arts and cultural events on social media. This approach is equipped with a preprocessing step that enriches user data in terms of language and gender. The outcomes of this research can help event organizers to understand what categories of users they are dealing with and to gain a clear picture of the characteristics of the users who are more likely to be attracted by similar events in the future.

We used “The Floating Piers” event as a case study to show how the proposed approach works in real-life scenarios. We clustered users based on their interests into three main categories and then described and compared the behavior and properties of the users in each cluster. In addition, decision tree modeling resulted in a set of rules that predicts the interest domain of future users.

However, since the current study only addresses the textual content that users share on social media, it can be extended by considering other types of media, such as photos, on other social network platforms such as Instagram, Facebook, Google+, Flickr, Foursquare, etc., which might result in a clearer and wider picture of the characteristics and behaviour of users with respect to cultural and artistic events. Last but not least, applying other techniques such as semantic analysis, image processing and network analysis can also help to improve the accuracy and coverage of the results.