1 Background

1.1 Introduction

Social media activity data, in this paper Twitter account activity, can be understood as consisting of two primary components: metadata (demographics) and content data. Metadata covers external characteristics such as time of activity, time of account creation, location, type of platform used for the activity, number of friends and followers, and more. Content data covers syntactic and semantic characteristics. The focus of this paper is content data, in particular content feature extraction that can be applied to a large set of text data in order to enable categorization of types of activities and classification of activities as automated versus non-automated.

1.2 The Content Data Elements and Their Encoding

Below are some linguistic features that can be extracted from the text content generated by Twitter users. These features can be used to generate mathematical “signatures” for different types of online behaviors. In this way, they augment account demographic features to create a rich, high-fidelity information space for behavior mining and modeling. A short sketch following the list illustrates how a few of these features might be computed.

  1. The relative size and diversity of the account vocabulary

    Content generated by automated means tends to reuse complex terms, while naturally generated content has a more varied vocabulary, and terms reused are generally simpler.

  2. The word length mean and variance

    Naturally generated content tends to use shorter but more varied language than automatically generated content.

  3. The presence/percentage of chat-speak

    Casual, social users often employ chat-speak and simple, easy-to-generate graphical icons called emoticons. Sophisticated, non-social users tend to avoid these unsophisticated artifacts.

  4. The presence and frequency of hashtags

    Hashtags are essentially topic words. Several hashtags taken together amount to a tweet “gist”. A table of these could be used for automated topic/content identification and categorization.

  5. The number of misspelled words

    It is assumed that sophisticated content generators, such as major retailers, will have a very low incidence of misspellings relative to casual users who are typing on a small device like a phone or tablet.

  6. The presence of vulgarity

    Major retailers are assumed to be unlikely to embed vulgarity in their content.

  7. The use of hot-button words and phrases (“act now”, “enter to win”, etc.)

    Marketing “code words” are regularly used to communicate complex ideas to potential customers in just a few words. Such phrases are useful precisely because they are hackneyed.

  8. The use of words rarely used by other accounts (e.g., tf-idf scores)

    Marketing campaigns often coin words around their products. These coined words occur nowhere else, and so will have high tf-idf (term frequency–inverse document frequency) scores.

  9. The presence of URLs

    To make a direct sale through a tweet, the customer must be engaged and directed to a location where a sale can be made. This is most easily accomplished by supplying a URL. URLs, even tiny URLs, can be automatically followed to facilitate screen scraping for identification/characterization.

  10. The generation of redundant content (same tweets repeated multiple times)

    It is costly and difficult to generate unique content for each of thousands of online recipients. Therefore, automated accounts (e.g., advertisers) tend to reuse a relatively small number of stylized units of content over and over. The result is an account with “redundant” content.
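
As a rough illustration, the sketch below computes a handful of the simpler features above for a single tweet. The helper name, the chat-speak and hot-button term lists, and the exact feature definitions are placeholders for illustration, not the paper's actual lexicons or formulas.

```python
import re
from statistics import mean, pvariance

# Placeholder lexicons; the paper's actual term lists are not published here.
CHAT_SPEAK = {"LOL", "OMG", "BRB", "IMO"}
HOT_BUTTON = {"ACT NOW", "ENTER TO WIN", "LIMITED TIME"}

def tweet_features(text: str) -> dict:
    """Compute a few of the content features described above for one tweet."""
    upper = text.upper()
    words = re.findall(r"[A-Z']+", upper)
    lengths = [len(w) for w in words] or [0]
    return {
        # Feature 1: vocabulary diversity (distinct words / total words)
        "vocab_diversity": len(set(words)) / max(len(words), 1),
        # Feature 2: word length mean and variance
        "word_len_mean": mean(lengths),
        "word_len_var": pvariance(lengths),
        # Feature 3: chat-speak presence
        "chat_speak": any(w in CHAT_SPEAK for w in words),
        # Feature 4: hashtag count
        "hashtags": len(re.findall(r"#\w+", text)),
        # Feature 7: hot-button phrase presence
        "hot_button": any(p in upper for p in HOT_BUTTON),
        # Feature 9: URL count
        "urls": len(re.findall(r"https?://\S+", text)),
    }

print(tweet_features("Enter to win! Visit https://example.com #deal #prize LOL"))
```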

2 Method

2.1 Data

Twitter account activity data is available through the Twitter API (application program interface), which returns random samples of data in the JSON (JavaScript Object Notation) data structure, containing both demographics and content.

Content data (tweets) are returned (in the JSON structure) as character strings of 1 to 140 characters. They may be in any language, or in no language at all. Tweets can contain any combination of free text, emoticons, chat-speak, hashtags, and URLs. Twitter does not filter tweets for content (e.g., vulgarisms, hate speech).
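
For concreteness, a minimal sketch of reading one such JSON object in Python follows; the field names (text, user, and so on) follow the style of the v1.1 Twitter API, and the sample object is fabricated for illustration.

```python
import json

# A minimal tweet object in the style returned by the (v1.1) Twitter API;
# real responses carry many more demographic fields.
raw = '''{
  "created_at": "Mon Apr 01 12:00:00 +0000 2019",
  "text": "Enter to win! https://example.com #deal",
  "user": {"screen_name": "example", "followers_count": 42, "friends_count": 10}
}'''

tweet = json.loads(raw)
content = tweet["text"]       # content data: the tweet string itself
demographics = tweet["user"]  # metadata: account demographics
print(len(content), demographics["screen_name"])
```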

For this study a sample of the activities of 8845 Twitter accounts containing the content of 1,048,395 tweets was collected for content analysis.

2.2 Procedures

A vector of text features is derived for each user. This is accomplished by deriving text features for each of the user’s tweets and then rolling them up, i.e. summing and normalizing the data. Therefore, one content feature vector is derived for each user from all of that user’s tweets.
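
A minimal sketch of this roll-up, reusing the hypothetical tweet_features helper from the earlier sketch: per-tweet features are summed and then normalized by the number of tweets.

```python
def user_vector(tweets: list[str]) -> dict:
    """Roll up per-tweet features into one normalized per-user feature vector."""
    totals: dict[str, float] = {}
    for t in tweets:
        for name, value in tweet_features(t).items():
            totals[name] = totals.get(name, 0.0) + float(value)
    n = max(len(tweets), 1)
    return {name: value / n for name, value in totals.items()}
```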

The extraction of numeric features from text is a multi-step process (a compressed sketch follows the list):

  1. Collect the user’s most recent (up to 200) tweet strings into a single set (a Thread).

  2. Convert the thread text to upper case for term matching.

  3. Scan the thread for the presence of emoticons, chat-speak, hashtags, URLs, and vulgarisms, setting bits to indicate the presence/absence of each of these text artifacts.

  4. Remove special characters from the thread to facilitate term matching.

  5. Create a Redundancy Score for the Thread. This is done by computing and rolling up (sum and normalize) the pairwise similarities of the tweet strings within the thread using six metrics: Euclidean Distance, RMS-Distance, L1 Distance, L-Infinity Distance, Cosine Distance, and the norm-weighted average of the five distances.

  6. The thread text feature vector then contains, as vector components, user scores based on features such as the emoticon flag, the chat-speak flag, the hashtag flag, the URL flag, the vulgarity flag, and the Redundancy Score.
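
A compressed sketch of steps 1–5 in Python might look like the following. The term-count vectorization, the specific distance formulas, and the plain averaging used in place of the paper's norm-weighted average are assumptions for illustration, not the authors' exact definitions.

```python
import itertools
import math
import re
from collections import Counter

def vectorize(tweet: str) -> Counter:
    # Steps 2 and 4: upper-case the text, keep only letters, count terms.
    return Counter(re.findall(r"[A-Z]+", tweet.upper()))

def pair_distances(a: Counter, b: Counter) -> dict:
    """Five pairwise distances between two term-count vectors (step 5)."""
    terms = set(a) | set(b)
    diffs = [a[t] - b[t] for t in terms]
    euclid = math.sqrt(sum(d * d for d in diffs))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return {
        "euclidean": euclid,
        "rms": euclid / math.sqrt(max(len(terms), 1)),
        "l1": sum(abs(d) for d in diffs),
        "linf": max((abs(d) for d in diffs), default=0),
        "cosine": 1.0 - sum(a[t] * b[t] for t in terms) / ((norm_a * norm_b) or 1.0),
    }

def redundancy_score(thread: list[str]) -> float:
    """Roll up (sum and normalize) pairwise scores over the Thread.

    Note: these are distances, so lower values indicate more redundant content.
    """
    vecs = [vectorize(t) for t in thread]
    pair_scores = []
    for u, v in itertools.combinations(vecs, 2):
        d = pair_distances(u, v)
        # Stand-in for the paper's norm-weighted average of the five distances.
        pair_scores.append(sum(d.values()) / len(d))
    return sum(pair_scores) / max(len(pair_scores), 1)

print(redundancy_score(["Buy now #deal", "Buy now #deal", "A different tweet"]))
```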

A list of 23 potential content related features was created and calculated for each of the 8845 Twitter accounts in the sample (Tables 1 and 2).

Table 1. Sample of raw data
Table 2. The list of 23 features for analysis

For the purpose of classifying accounts as automated (bots) versus non-automated, a sample of tweet content from 101 active accounts was manually rated. The sample was divided into 5 subsets of approximately 20 accounts each (each subset containing a few thousand tweets), and each subset was rated by multiple volunteers who read its content. Each account was classified as a bot or not, assigned a level of confidence in that classification, and annotated with a brief explanation of the main reasons for the decision. Of the 101 accounts, 65 were classified with a high level of confidence: 35 as bot accounts and 30 as non-bot accounts. Those 65 accounts were then assigned a dependent variable value of 1 if identified as a bot, and 0 otherwise.

3 Results

Excel was used to generate a correlation matrix for the 23 content features for the large sample of 8845 feature vectors (Table 3).

Table 3. Correlation among the 23 features of tweet data (correlation scores above 0.6 are bolded)
Table 4. Correlation among the 23 features of tweet data (highly correlated pairs highlighted)

Similarly, correlations between the 23 content features and the dependent variable for the small set of 65 accounts were calculated and sorted based on absolute value (Table 5).

Table 5. Correlation of the 23 features to the dependent variable (bot or not Boolean value)

Absolute values of the correlations between features and the dependent variable ranged from 0.003 to 0.603. Ranking these absolute values yielded the following list of top predictors of bot-like behavior: “redund”, “urls”, “good_len”, “adj”, “tweets”, “vulgar”, “good_cnt”, “commnoun”, “emo_chat”, and “art”.
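
For readers who prefer a scriptable alternative to Excel, both the feature correlation matrix (Table 3) and the ranking against the dependent variable (Table 5) can be reproduced with pandas. The file names and DataFrame layout below are assumptions; the 0/1 "bot" column mirrors the dependent variable described above.

```python
import pandas as pd

# features: 8845-row DataFrame with the 23 content features (assumed already built);
# labeled: the 65 manually classified accounts, with a 0/1 "bot" column appended.
features = pd.read_csv("content_features.csv")   # hypothetical file name
labeled = pd.read_csv("labeled_accounts.csv")    # hypothetical file name

# Table 3: correlation matrix of the 23 features over the large sample.
corr_matrix = features.corr()

# Table 5: correlation of each feature to the dependent variable,
# ranked by absolute value, as in the top-predictor list above.
target_corr = labeled.drop(columns=["bot"]).corrwith(labeled["bot"])
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked.head(10))   # e.g., redund, urls, good_len, ...
```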

Charts were created to examine the distributions of individual features deemed significant in terms of their correlation with the dependent variable in the small sample, as well as joint distributions. Following some interpretation of the nature of these distributions, hypotheses were formed as to which statistical learning tools may be useful in modeling based on such content features (Figs. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and 11).

Fig. 1. Histogram of the distribution of redundancy score

Fig. 2. Histogram of the distribution of number of tweets

Fig. 3. Histogram of the distribution of hashtag

Fig. 4. Histogram of the distribution of URLs

Fig. 5. Histogram of the distribution of emoticon_chat

Fig. 6. Scatter plot of dependent variable against redundancy score

Fig. 7. Scatter plot of dependent variable against URLs score

Fig. 8. Scatter plot of dependent variable against “good_len” score

Fig. 9. Scatter plot of dependent variable against “adj” score

Fig. 10. Scatter plot of dependent variable against number of tweets

Fig. 11. Scatter plot of dependent variable against vulgarism score

4 Discussion

4.1 Findings

Approximately 10% of the 8845 accounts had the maximum level of activity measured (200 tweets). This may provide a lower-bound estimate of the rate of accounts exhibiting bot-like behavior.

Examination of the content feature correlation matrix reveals that correlations are generally low, with some explainable exceptions. Features such as good_len and good_cnt refer to the number of characters in correctly spelled words and the number of correctly spelled words, respectively. Their high correlation of 0.86 is to be expected, and the same holds for bad_len and bad_cnt, with a correlation of 0.841 (both highlighted in Table 4). In both cases, consideration may be given to selecting only one feature of each pair for the purpose of predictive modeling.

The top ten content features appear to contain discriminating information that may be relevant in an attempt to classify Twitter accounts as bot or non-bot accounts. Separation issues and the skewed nature of the majority of the content feature distributions justify the expectation that a nonparametric approach may perform better than a parametric one.

The distribution of the redundancy scores appears to be approximately normal, while all other distributions examined are skewed. As in the case of an earlier study of external features, most relevant distributions that quantify social media behaviors do not appear to be normal, a fact that may later support preference for nonparametric modeling techniques or the application of some feature transformations.

Examination of the scatter plots of joint distributions seems to support the selection of the top content features listed above. Notably, in the case of the vulgarity score, vulgarity is entirely absent among the bot accounts, while non-bot accounts may or may not include vulgar language.

Taking all this into account, a starting set of content features that may be selected for modeling may involve the following nine features: redund, urls, good_len, adj, tweets, vulgar, commnoun, art, emo_chat.

4.2 Limitations

A number of significant limitations must be noted.

First, the data set may not be a representative sample of the current state of affairs when it comes to bot versus non-bot activity in the Twitter medium.

Second, the process of manually classifying a small set of accounts and reaching a consensus in roughly two-thirds of the cases may not be without errors.

Third, a larger sample set from the manual classification process may lead to different conclusions about content features and the type of modeling that may be expected to perform best.

Fourth, concentrating on content, which probably provides the most predictive power, may still ignore some critical external features, and thus may not produce an optimal perspective.

4.3 Further Investigations

Future work may attempt to consider a mix of external features and content features, calculated on a large set of known bot and non-bot accounts for better feature selection, description, and classification. This should enable a much more reliable subset of predictive or discriminating features, which in turn may lead to more reliable descriptive and predictive models.

5 Conclusion

This paper demonstrates one way in which the content of social media activities may be processed into mathematical “signatures” of different types of online behaviors, which may then be used for descriptive and predictive modeling of automated versus non-automated activities.