1 Introduction

Author profiling (AP) is a Natural Language Processing (NLP) task that aims to determine characteristics of the author(s) of a given text, such as gender, age, emotional state, and personality. AP can be performed on formal and informal textual sources. Formal texts have a defined structure and follow rules, while informal texts do not follow rules and are not standardized. Social network posts are a good example of the latter.

The writing style on social media has special features [10] that make NLP tasks considerably harder: abbreviation rules are not always followed, punctuation marks are used in non-standard ways, new symbols such as # (hashtag) appear, the @ sign is used to mention users, and so on.

Given the importance and the enormous amount of information that is produced daily in social media, it is necessary to have computational methods that allow us to automatically analyze the information generated in these networks.

With the information that people publish and consume on social media, companies can profile their clients and governments can improve security procedures, for example by identifying potential cases of pedophilia or virtual kidnapping. In fact, the providers of these services already profile their users; Twitter, for example, does so to learn usage patterns and personalize content. For these reasons, the aim of this work is to develop a model that automatically identifies the gender of Twitter users using transfer learning techniques. We also measure and evaluate the impact of text preprocessing on the accuracy of the author profiling model.

The work is presented in five sections, including this introduction. Section 2 describes the feature extraction methods and the machine learning algorithms typically used in AP. Section 3 introduces the concepts of transfer learning and explains the architecture used in this work. Section 4 presents the methodology for author profiling and the experimental results. Section 5 states the conclusions of this paper.

2 Related Work

Several supervised learning techniques have been used to model author profiles in different text sources. Supervised learning classifiers use a set of input-output pairs to learn a decision function that assigns a new data point to one of the established classes. Author profiling (AP) consists of identifying the demographic features of the author of a text [6]. These features describe the author in terms of gender, age, level of education, nationality, socio-economic level, among others. AP can therefore be treated as a multiclass classification problem.

The use of supervised learning algorithms for AP is shown in [15]. Decision Functions are a technique for binary classification whose training consists of finding decision functions from input-output pairs. Logistic Regression is used for multiclass classification problems to predict the probability that a data point belongs to one class or another. Support Vector Machines are used in the context of AP for binary classification, in which the classes are separated by hyperplanes. Neural Networks are another resource for AP; the goal of the method is to make a function \(g(\cdot)\), represented by the neural network, approximate a target function \(f(\cdot)\) as closely as possible. This approximate function is the one used to classify. Convolutional Networks are an important tool for AP, since they can be trained on large datasets and also act as feature extractors.

2.1 Features Extraction

Analyzing in detail the large amount of information currently generated in the form of written texts is very complicated. It is therefore of interest to create representations of these documents, that is, to obtain their representative features. The features obtained from a text are specific terms that make it possible to analyze the documents and extract useful patterns or knowledge from them. In the past, this task was performed by linguists and was limited to manual processing that was not very thorough. With the advance of science and technology, however, the methods for feature extraction changed. Some text representation schemes are:

  1. Bag of words. To deal with complete documents, a computationally viable structure is needed. To this end, documents are viewed as strings [7]. Let \(S = s_1, s_2, \ldots, s_k\) be a string, where a word is a substring of S of length 1. A word can refer to: an item in the text, an item in lowercase or uppercase, the word with its part-of-speech (POS) label, the word lemma, or any other variant of the word.

  2. N-grams: Let \(S = s_1, s_2, \ldots, s_k\) be a string. The N-grams are defined as substrings of S of length N. The 1-grams are called unigrams, the 2-grams are called bigrams, and so on. There are two types of N-grams: word N-grams and character N-grams. Word N-grams refer to N contiguous words in the document. Character N-grams, in contrast, refer to N contiguous characters within word boundaries, without spaces.

  3. Syntactic N-grams: Syntactic N-grams try to capture the linguistic structure of a text by organizing the words into nested components, in order to show through arrows which words depend on others.
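
The bag-of-words and N-gram representations described above can be illustrated with a minimal Python sketch. The function names, the whitespace tokenization, and the lowercasing step are assumptions made for illustration only; they are not the feature extractor used in the experiments.

```python
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())


def word_ngrams(text: str, n: int) -> List[str]:
    """Return the N contiguous-word N-grams of the text, left to right."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """Return character N-grams taken inside each word, never across spaces."""
    grams: List[str] = []
    for token in text.split():
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams


if __name__ == "__main__":
    print(word_ngrams("hola a todos", 2))   # ['hola a', 'a todos']
    print(char_ngrams("hola a todos", 3))   # ['hol', 'ola', 'tod', 'odo', 'dos']
```

Words shorter than N (such as "a" above) contribute no character N-grams under this word-boundary convention.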

2.2 Weighting Schemes

To obtain a representation of a document, preprocessing is carried out so that the document can be viewed as a vector. Each dimension of the vector stands for a feature of the document. Each feature is assigned a weight according to its relevance; this process is called a weighting scheme. The most relevant schemes are described below:

  1. Boolean model and Term Frequency (TF): There are some very intuitive ways to assign weights, such as recording whether a term appears or not (Boolean model), or counting how many times a term appears in a text and assigning each term a weight that depends on its number of occurrences (TF).

  2. Inverse Document Frequency: To handle words with very high frequencies (words that are constantly repeated because of the context), the Term Frequency (TF) weight is reduced by means of the Inverse Document Frequency (IDF). This adjusts the weight depending on whether or not the word appears in many documents. The Inverse Document Frequency of a term t is defined from its document frequency in the collection with the expression:

    $$idf_t = \log\left(\frac{N}{df_t}\right),$$

    where \(N\) is the total number of documents in the collection and \(df_t\) is the number of documents that contain the term t.

  3. TF-IDF: It is the product of the Term Frequency (TF) and the Inverse Document Frequency (IDF). Its purpose is to provide a measure of the relevance of words that makes it possible to distinguish between those that describe the document and those that do not. To assign a weight to a word in a document, its frequency in the document is combined with its frequency across all documents using the following expression:

    $$\text{tf-idf}_{t,d} = tf_{t,d} \times idf_t$$

  4. Word embeddings: A different kind of weighting scheme is given by word vectors (word embeddings), which follow two main approaches: discrete and distributional. The idea of the discrete approach is to represent a word as a vector of dimension n with a 1 in the position of that word and 0's elsewhere, where n is the number of words in the vocabulary; these are also known as one-hot vectors. The distributional approach takes into account the similarity between the vectors themselves: when a word appears in a text, its context is the set of words that appear near it (within a fixed-size window). This builds a dense vector for each word that is similar to the vectors of words appearing in similar contexts. The most widely used methods of this kind are Word2vec [13] and GloVe [14].
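
As a worked illustration of the TF, IDF and TF-IDF weights defined above, the following minimal sketch computes them for a tiny tokenized collection. It assumes raw counts for \(tf_{t,d}\) and the natural logarithm for \(idf_t\) (the base of the logarithm is not fixed by the definition above); the function name and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: tf-idf weight} dictionary per tokenized document."""
    n_docs = len(docs)
    # df_t: number of documents that contain term t at least once.
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc))
    # idf_t = log(N / df_t), using the natural logarithm (assumption).
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_{t,d}: raw count of t in d (assumption)
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights


if __name__ == "__main__":
    toy_docs = [["gol", "gol", "equipo"], ["equipo", "moda"]]
    for vector in tf_idf(toy_docs):
        print(vector)
```

In this toy collection the term "equipo" appears in every document, so its IDF is log(2/2) = 0 and it receives zero weight, while "gol" (two occurrences in one document only) receives 2 log 2.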

3 Transfer Learning

Transfer learning is a subfield of machine learning that has been studied for more than three decades [2]. It addresses the ability to take advantage of pre-existing datasets when learning from new data. One method that has proven effective is to pre-train a model on large amounts of previously available data and then fine-tune the pre-trained model on data from the new task [5]. When the new task provides only a few examples, this setting is also known as few-shot learning. In transfer learning, a neural network is first trained on a given dataset and a specific task; the features learned by the network are then reused, that is, transferred to a second network that is trained on another task and a different dataset.

The transfer learning technique consists of taking advantage of the weights of an already trained neural network and adjusting them to solve other tasks with only a few examples [16, 17]. The strategies for performing transfer learning with a new dataset are:

  • Fixed feature extractor: A pre-trained neural network is taken and its last fully connected layers are removed; the rest of the network is then used as a fixed extractor to obtain features for the new dataset. Finally, a linear classifier (for example, an SVM) is trained on the new dataset.

  • Fine tuning: In addition to replacing and re-training the classifier, the weights of the pre-trained network are adjusted by continuing backpropagation.

  • Pre-trained models: It consists of taking advantage of the final checkpoints of an already trained neural network to make adjustments.

To decide which type of transfer learning is most suitable, the following criteria are taken into account:

  • The new dataset is small and similar to the original dataset. Because fine tuning on so little data can overfit the model, fine tuning does not work here; it is best to train a linear classifier.

  • The new dataset is large and similar to the original dataset. As there is more information, the risk of overfitting is low; therefore fine tuning can be applied.

  • The new dataset is small but very different from the original dataset. Because there is little data, it is best to train a linear classifier. Because the data also differ from the original dataset, the features of the network that are most specific to the original data may not fit the new data well.

  • The new dataset is large and very different from the original. As there is enough data and it differs from the original, it is best to apply the strategy of pre-trained models.
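
The four criteria above amount to a small decision rule over two properties of the new dataset. The sketch below encodes that rule; the function name, the boolean arguments and the returned labels are hypothetical, and what counts as "large" or "similar" is left to the practitioner.

```python
def choose_strategy(is_large: bool, is_similar: bool) -> str:
    """Map the size and similarity of the new dataset to a transfer strategy."""
    if not is_large:
        # Small datasets: avoid fine tuning (overfitting risk) in both cases.
        return "linear classifier on fixed features"
    if is_similar:
        # Large and similar: enough data to adjust the pre-trained weights.
        return "fine tuning"
    # Large and very different from the original dataset.
    return "pre-trained models"


if __name__ == "__main__":
    # A relatively small corpus leads to the fixed feature extractor route.
    print(choose_strategy(is_large=False, is_similar=True))
```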

According to [4] there are two strategies for transfer learning for text:

  • Feature based: It consists of pre-training vectors that capture additional context through other tasks. New vectors are obtained for each layer and are then used as features, concatenated with the word vectors or with the intermediate layers. An example of this is ELMo [12].

  • Fine tuning: It consists of pre-training an architecture on a language modeling objective before refining it for a supervised downstream task, introducing a minimal number of task-specific parameters, and training on the downstream task simply by refining the pre-trained parameters [8].

In our case, we have a relatively small corpus to perform author profiling, so our strategy is to use the fixed feature extractor technique. Below we describe the algorithm for extracting features.

3.1 Universal Sentence Encoder

Here we describe the transfer-learning-based algorithm we used to extract features for author profiling, which is called Universal Sentence Encoder (USE) [3]. Although this method is not designed specifically for author profiling, it has characteristics that can be exploited for this task. The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks. The model is trained on a variety of text data sources and a variety of tasks in order to dynamically accommodate a wide range of natural language understanding tasks. Specifically, USE has two models to encode documents as vectors: one uses an averaging-based architecture called Deep Averaging Network (DAN) [9], while the other is based on a convolutional neural network for document classification [11]. These architectures are detailed below for a better understanding:

  1. Deep Averaging Network: This architecture works in three steps:

    • Average the vectors associated with a token sequence

    • Pass the average through one or more feed-forward layers

    • Perform the linear classification in the last layer

  2. Convolutional Neural Network: This type of network receives a document as a sequence of vectors in the input layer. It applies average pooling to convert the word vectors into a fixed-length document vector representation. The document vector is obtained by passing the averaged word vectors through one or more fully connected feed-forward layers.
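
The three steps of the Deep Averaging Network can be sketched in plain Python as a forward pass. This is only an illustration under stated assumptions: the ReLU activation, the tiny 2-dimensional vectors and the hand-picked weights are hypothetical, and no training is shown.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]
Layer = Tuple[Matrix, Vector]  # (weights, bias)


def average(vectors: List[Vector]) -> Vector:
    """Step 1: average the vectors associated with a token sequence."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dense(x: Vector, weights: Matrix, bias: Vector, relu: bool = True) -> Vector:
    """One fully connected layer; ReLU is an assumed activation."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


def dan_forward(tokens: List[Vector], hidden: List[Layer], output: Layer) -> Vector:
    """Steps 2 and 3: feed-forward layers, then a final linear layer."""
    h = average(tokens)
    for weights, bias in hidden:
        h = dense(h, weights, bias)
    out_weights, out_bias = output
    return dense(h, out_weights, out_bias, relu=False)


if __name__ == "__main__":
    token_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    hidden_layers = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
    output_layer = ([[1.0, 2.0], [-1.0, 0.0]], [0.0, 0.5])
    # One score per class; the largest score gives the predicted class.
    print(dan_forward(token_vectors, hidden_layers, output_layer))  # [2.0, 0.5]
```

Because the first step is an unweighted average, the representation has a fixed length regardless of how many tokens the input contains.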

For this work we used a USE model trained on multiple tasks across 16 languages, including Spanish. USE receives as input a text of variable length in any of the languages on which it was trained, and the output is a vector of 512 dimensions. The USE model we use is available from the TensorFlow Hub page and can be freely downloaded. In addition to this model, there are several versions of trained USE models with different objectives, including multilingual models, size/performance trade-offs, and question-answering systems.

So, in our approach, USE receives a sample of 100 tweets for each user. The convolutional network then transforms them into a vector of 512 dimensions, using the language model that had already been learned and updating it with the new textual samples from Twitter.
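
To make the shape of this step concrete, the sketch below reduces a list of tweets to a single fixed-length user vector. It rests on two loudly stated assumptions: `toy_encode` is a hypothetical stand-in and is not the Universal Sentence Encoder (it returns 3 crude counts instead of 512 learned dimensions), and averaging the per-tweet vectors is only one plausible way of aggregating them; the text does not specify how the 100 tweets are combined.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def user_vector(tweets: Sequence[str], encode: Callable[[str], Vector]) -> Vector:
    """Encode every tweet and average the results (assumed aggregation)."""
    if not tweets:
        raise ValueError("a user needs at least one tweet")
    vectors = [encode(tweet) for tweet in tweets]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def toy_encode(text: str) -> Vector:
    """Hypothetical stand-in encoder: word, hashtag and mention counts."""
    return [float(len(text.split())),
            float(text.count("#")),
            float(text.count("@"))]


if __name__ == "__main__":
    tweets = ["hola @ana", "gran partido #futbol hoy"]
    print(user_vector(tweets, toy_encode))  # [3.0, 0.5, 0.5]
```

With a real sentence encoder plugged in as `encode`, the resulting user vectors would be the fixed features on which a linear classifier is trained.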

4 Experimental Settings and Results

In this section, we describe the experiments carried out to obtain the author profile of Twitter users. First, we describe the evaluation corpus, then baseline results are presented, and finally results are shown using our proposed methodology. The baseline results are obtained by combining the different types of features (bag of words and N-grams), two classification algorithms, and several preprocessing variants (without emojis, without slang, etc.). For all baseline experiments, the TF-IDF weighting scheme (described above) is used.
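
As an illustration of what a preprocessing variant can look like, the following sketch removes user mentions, URLs or hashtags from a tweet with regular expressions. The patterns, the function name and its flags are assumptions for illustration; the exact rules used in the experiments (including emoji and slang handling) are not reproduced here.

```python
import re

# Simplified patterns (assumptions): real tweets may need more careful rules.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")


def preprocess(tweet: str, remove_mentions: bool = False,
               remove_urls: bool = False, remove_hashtags: bool = False) -> str:
    """Apply the selected removals and normalize whitespace."""
    if remove_urls:
        tweet = URL.sub(" ", tweet)  # URLs first, so their text is not re-matched
    if remove_mentions:
        tweet = MENTION.sub(" ", tweet)
    if remove_hashtags:
        tweet = HASHTAG.sub(" ", tweet)
    return " ".join(tweet.split())


if __name__ == "__main__":
    tweet = "Gracias @ana por el enlace https://t.co/abc #feliz"
    print(preprocess(tweet, remove_mentions=True))
    # Gracias por el enlace https://t.co/abc #feliz
```

Keeping each removal behind its own flag makes it possible to evaluate one preprocessing variant at a time.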

4.1 Corpus Description

For training and evaluating our AP approach we used the corpus of PAN2017 competition [15], which was compiled from Twitter in Spanish. Gender and age information has been provided by the users themselves based on an online questionnaire. The corpus consists of 600 users of various nationalities: Mexican, Colombian, Peruvian, Argentine, Chilean, Venezuelan. 50% are male and the other 50% female.

| Gender | Authors | Tweets |
|--------|---------|--------|
| Male   | 2100    | 21000  |
| Female | 2100    | 21000  |
| Total  | 4200    | 42000  |

4.2 Experimental Settings and Results

We performed several experiments considering bag of words and character N-grams as features. For each feature set we evaluated the impact of specific preprocessing strategies. The author profiling models obtained with the different settings were evaluated in terms of F-1, precision, recall and accuracy. Table 1 shows the results obtained with the logistic regression classification algorithm. The Characteristics column indicates whether bag of words (BoW) or character N-grams (N-char) are used, and the Dim column indicates the number of features extracted and therefore the dimensionality of the feature vector. The standard deviation of the accuracy (STD) is also reported. The Preprocessing column indicates which strategy was followed in each experiment: NONE indicates that no preprocessing was performed, "without Emojis" indicates that the emojis were removed, and likewise for URLs, hashtags, etc. The preprocessing strategy that yields the best results is the removal of user mentions, which suggests that mentions provide the least information about the gender of the person who wrote the tweet.
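
For reference, the four evaluation measures can be computed from the predicted and true labels as in the sketch below. It assumes a binary setting with one class treated as positive; the choice of "female" as the positive label and the function name are hypothetical, and the averaging convention used in the reported tables is not specified here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                   positive: str = "female") -> Dict[str, float]:
    """Accuracy, precision, recall and F-1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = ["female", "female", "male", "male"]
    predicted = ["female", "male", "male", "female"]
    print(binary_metrics(truth, predicted))
```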

Table 1. Results of experiments performed to predict gender using bag of words and logistic regression classifier.

Table 2 presents the results of gender identification using character 3-grams as the feature set and logistic regression as the classification algorithm. The best results are again obtained when user mentions are removed; however, when slang is removed, the performance of the algorithm drops considerably.

Table 2. Results of experiments performed to predict gender using character N-grams and the logistic regression classifier.

Table 3 presents the accuracy, recall, precision and F-1 score obtained by the Support Vector Machine when trained on the BoW feature set. The best results are obtained by removing user mentions and the worst when slang is removed, with a difference between them of approximately 10%.

Table 3. Results of the experiments performed to predict gender using bag of words and support vector machine classifier.

Table 4 presents the results of gender identification using character 3-grams as features and the Support Vector Machine as the classification algorithm. Likewise, the best results are obtained by removing user mentions and the worst when slang is removed. However, in the case of character 3-grams, the accuracy difference between the two is approximately 15%.

4.3 Experimental Settings and Results Using Transfer Learning

Table 5 presents the results of gender identification using Universal Sentence Encoder (USE) to obtain a 512-dimensional feature vector for each user; that is, the 100 tweets are reduced to one 512-dimensional vector. Logistic regression is used as the classification algorithm. The table structure is the same as in the previous tables, and in this case the dimensionality of the feature vector is always 512. We present the accuracy, recall, precision and F-1 score. The best results in terms of accuracy are obtained by removing the mentions and the worst by replacing slang.

Table 4. Results of the experiments performed to predict gender using character N-grams and the support vector machine classifier.

Table 5. Results of experiments using transfer learning features with the logistic regression classifier to identify gender.

Table 6 presents the results of gender identification using Universal Sentence Encoder (USE) to obtain 512-dimensional feature vectors and the support vector machine as the classification algorithm. The accuracy, recall, precision and F-1 score are presented. As with the previous classifier, the best results in terms of accuracy are obtained by removing the mentions and the worst by replacing slang. Although these results agree with those obtained with the traditional features in terms of the best and worst preprocessing, with Universal Sentence Encoder the difference between them does not exceed 3%.

Table 6. Results of experiments using transfer learning with the support vector machine classifier to identify gender

5 Conclusions

In this paper, we introduced an approach to identify the gender of Twitter users using transfer learning. The transfer learning technique is useful when there is not much data for properly training machine learning algorithms. In this case, we had a corpus of 4200 Twitter users available, which is relatively small for training a deep learning model from scratch.

Our approach is based on the Universal Sentence Encoder model to obtain low-dimensional vectors of documents (users' tweets) and use them as features to perform author profiling. To evaluate the quality of the vectors (representing all the tweets of a user) obtained by USE, we used them as features for training two machine learning algorithms that generally obtain good results in author profiling [1]. These experiments show that the vectors allow us to identify the author's gender with an accuracy of 71.98% on the PAN 2017 corpus, using an SVM classifier with user mentions removed. This result is better than that obtained with the traditional approach to gender classification.

A possible extension of this work is to evaluate other transfer learning techniques, such as Universal Language Model Fine-tuning (ULMFiT) [8], which has achieved very good results in text classification problems.