1 Introduction

For humans, experiences from the past are usually helpful when learning a new skill or solving a new problem. Analogously, in machine learning, meta-learning takes advantage of prior experience acquired while solving related tasks in order to approach new problems [12]. The main goals are to speed up the learning process and to improve the quantitative performance of models. Meta-learning has had an impact on several machine learning problems, such as learning to design optimization algorithms [1], automatically suggesting supervised learning pipelines [4], learning architectures for deep neural networks [3], and few-shot learning [10].

Text classification is one of the most studied tasks in NLP, owing to the number of problems and applications that can be approached as text classification tasks. Many techniques for pre-processing, feature extraction, feature selection and document representation have been developed over the last decades, each appropriate for different scenarios and types of tasks. However, despite the progress achieved by the NLP community, it is still an NLP expert who determines the pipeline of a text classification system, including the preprocessing methods, the representation, and the classification model together with its hyperparameters.

This paper takes a first step towards the characterization of text classification problems, with the ultimate goal of suggesting text classification pipelines for any type of problem, that is, meta-learning of text classification tasks. Earlier work in this direction (see Sect. 2) defined straightforward meta-features and worked over a small number of datasets. What is more, previous work has focused exclusively on tabular data (i.e., meta-features extracted from a document-term matrix). Since natural language presents different characteristics from those of generic tabular data, herein we define a set of meta-features derived from the analysis of raw text and combine them with traditional meta-features. To the best of our knowledge, this is the first work on meta-learning that extracts information directly from raw text.

As a first approximation, we approach the problem of learning to determine the type of task (e.g., topic-based vs. sentiment analysis) using the meta-features as predictive variables. We provide empirical evidence of the suitability of the proposed meta-features for characterizing text classification tasks. Additionally, we analyze the most important features for the approached meta-learning problem. Experimental results are encouraging and show that meta-learning of text classification is a promising research avenue for NLP.

Our contributions are threefold: (1) the introduction of the task-type prediction problem; (2) the introduction of novel and effective meta-features that can be used for other meta-learning tasks; and (3) experiments of larger scale than previous work (we propose 73 meta-features, compared to 11 in previous references, and report experiments on 81 corpora, compared to 9 in related work).

2 Background and Related Work

Meta-learning aims to learn from prior learning experience in order to speed up the learning process when approaching a new task. A common way to learn from and across tasks is to characterize them with a set of meta-features [13]. These attempt to describe a task (i.e., a dataset) through information readily available at the task/dataset level. In this way, each task is usually represented by a vector whose dimensions are associated with meta-features. Meta-features can be as simple as the number of instances and features in a dataset, or as complex as statistical measures of the data distribution. [11] provide a comprehensive description of the most commonly used meta-features.

In the machine learning context, meta-learning has been studied for a while [12, 13], but it is only recently that it has become a mainstream topic, mainly because of its successes in several tasks. For instance, Feurer et al. [5] successfully used a set of meta-features to warm-start a hyper-parameter optimization technique in the popular state-of-the-art AutoML solution Autosklearn. Likewise, the success of deep learning, together with the difficulty users face in defining appropriate architectures and hyperparameters, has motivated a boom in neural architecture search, where meta-learning is common [3].

2.1 Meta-learning in Text Classification

In the context of text mining, meta-features derived from clustering text documents have been used directly for classification [2]. In the meta-learning context, such features have been used only in very specific domains [8]. Efforts dealing with generic datasets and closely related to the proposed research are reviewed in the remainder of this section.

Lam and Lai [7] introduced a meta-learning approach for text classification: they characterized subsets of the Reuters corpus with 8 document-feature meta-characteristics extracted from the document-term matrix representation. These consisted of simple meta-data, such as the average document length, and simple term statistics. The meta-features were later used to estimate the classification error of 6 classifiers and to recommend a model depending on the prediction. Similarly, Gomez et al. [6] proposed 11 meta-features, also collected from a matrix representation of the documents, with which 9 different corpora were characterized. Their method learned a set of rules that determine a suitable algorithm depending on the meta-feature values of the corpus.

Unlike previous approaches, we do not assume a predefined representation of the documents; instead, we derive meta-features from the raw text and combine them with traditional ones. This allows us to capture more language-relevant information. We also perform experiments of larger scale than previous work, considering 81 datasets (previous work used 6–9 collections) characterized by 73 meta-features (in the past, 8–11 meta-features were considered).

3 Meta-learning Text Classification Tasks

We propose in this work a set of 73 meta-features with the aim of characterizing tasks (i.e., datasets); the proposed meta-features comprise both traditional and NLP-based ones. The ultimate goal of our work is to automatically suggest pipelines for solving text classification problems. As a first step in that direction, we show that the proposed meta-features can be used as predictive variables to learn models able to recognize the type of task associated with a dataset. Different text classification tasks can be derived from the same dataset; our set of meta-features acknowledges this, since some of the proposed measures provide statistical information about the classes.

In NLP it is empirically known that certain methods work better depending on the type of task at hand; for example, character n-grams are known to outperform other representations in authorship attribution tasks because they better capture an author’s style. Correctly identifying the type of task being tackled is therefore a fundamental step when designing a text classification pipeline, so we propose to automate it in pursuit of an automated recommendation system. In this work we limit ourselves to learning to discriminate among types of tasks, and postpone the problem of pipeline recommendation to future work.

3.1 Proposed Meta-features

Meta-features are a common way of characterizing tasks. Some sets of meta-features have proven useful for supervised machine learning problems; however, we consider them insufficient to characterize text classification tasks: extracting them usually requires a tabular representation of the data, which for text documents means adopting a representation such as Bag-of-Words. Once a representation is selected, some fundamental characteristics of language are lost, so extracting traditional meta-features from it results in a limited characterization of the task. We propose a set of 73 meta-features that combines traditional meta-learning features with NLP-based ones. Below we organize them into groups.

  • General meta-features. The number of documents and the number of categories.

  • Corpus hardness. Most of these measures were originally used in [9] to determine the hardness of short-text corpora. Domain broadness: measures related to the thematic broadness/narrowness of the words in documents; we included measures based on vocabulary length and overlap: Supervised Vocabulary Based (SVB), Unsupervised Vocabulary Based (UVD) and Macro-averaged Relative Hardness (MRH). Class imbalance: the Class Imbalance (CI) ratio. Stylometry: the Stylometric Evaluation Measure (SEM). Shortness: Vocabulary Length (VL), Vocabulary Document Ratio (VDR) and average word length.

  • Statistical and information theoretic. Meta-features derived from a document-term matrix representation of the corpus: the min, max, average, standard deviation, skewness, kurtosis, average-to-standard-deviation ratio, and entropy of the vocabulary distribution, of the documents per category, and of the words per document. Landmarking: 70% of the documents are used to train 4 simple classifiers and their performance on the remaining 30% is used as a meta-feature, on the intuition that some aspects of the dataset can be inferred from it: data sparsity - 1NN, data separability - Decision Tree, linear separability - Linear Discriminant Analysis, feature independence - Naïve Bayes. The percentage of zeros in the matrix was also added as a measure of sparsity. Principal Components (PC) statistics: statistics derived from a PC analysis: pcac from Gomez et al. [6]; for the first 100 components, the same statistics as for documents per category, together with the sum of their singular values, explained ratio and explained variance; and for the first component, its explained variance. (A minimal sketch of the statistical and landmarking measures is given at the end of this subsection.)

  • Lexical features. We incorporated the distribution of part-of-speech tags. We intuitively expect the frequency of some lexical items to be higher depending on the task associated with a corpus; for instance, a corpus for sentiment analysis may contain more adjectives, while a news corpus may contain fewer. We tagged the words in the documents and computed the average number of adjectives, adpositions, adverbs, conjunctions, articles, nouns, numerals, particles, pronouns, verbs, punctuation marks and untagged words in the corpus (see the sketch right after this list).

  • Corpus readability. Statistics computed from the text that estimate its readability, complexity and grade level, taken from the textstat library: Flesch reading ease:

    $$ 206.835 - 1.015 \left( \frac{total\_words}{total\_sentences}\right) - 84.6 \left( \frac{total\_syllables}{total\_words}\right) $$

    SMOG grade:

    $$ 1.043\sqrt{polysyllables\times \frac{30}{total\_sentences}}+3.1291 $$

    Flesch-Kincaid grade level:

    $$ 0.39\left( \frac{total\_words}{total\_sentences}\right) + 11.8\left( \frac{total\_syllables}{total\_words}\right) - 15.59 $$

    Coleman-Liau index:

    $$ 0.0588L - 0.296S - 15.8 $$

    where L is the average number of letters per 100 words and S is the average number of sentences per 100 words; automated readability index:

    $$ 4.71\left( \frac{total\_chars}{total\_words}\right) + 0.5\left( \frac{total\_words}{total\_sentences}\right) - 21.43 $$

    Dale-Chall readability score:

    $$ 0.1579 \left( \frac{difficult\_words}{total\_words}\times 100\right) + 0.0496\left( \frac{total\_words}{total\_sentences}\right) $$

    the number of difficult words; the Linsear Write formula:

    $$ \frac{3(complex\_words) + (non\_complex\_words)}{2(total\_sentences)} $$

    where complex words are those with three or more syllables; the Gunning fog scale:

    $$ 0.4\left( \frac{total\_words}{total\_sentences}\right) + 40\left( \frac{complex\_words}{total\_words}\right) $$

    and the estimated school level required to understand the text, which aggregates all of the above tests.
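To make the raw-text groups concrete, the sketch below shows one plausible way to extract the lexical (POS-distribution) and readability meta-features. It assumes NLTK's universal tagset and the textstat package; the helper names and the choice to score the concatenated corpus text are our own illustrative assumptions, not the original implementation.

```python
# Illustrative sketch (not the authors' code): lexical and readability
# meta-features computed directly from raw text.
# Requires: nltk (with 'punkt', 'averaged_perceptron_tagger' and
# 'universal_tagset' data downloaded) and textstat.
import nltk
import numpy as np
import textstat

UNIVERSAL_TAGS = ["ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
                  "NUM", "PRT", "PRON", "VERB", ".", "X"]

def lexical_meta_features(documents):
    """Average count of each universal POS tag per document."""
    counts = np.zeros((len(documents), len(UNIVERSAL_TAGS)))
    for i, doc in enumerate(documents):
        tagged = nltk.pos_tag(nltk.word_tokenize(doc), tagset="universal")
        for _, tag in tagged:
            counts[i, UNIVERSAL_TAGS.index(tag)] += 1
    return counts.mean(axis=0)

def readability_meta_features(documents):
    """Corpus-level readability scores (here computed on the concatenated text)."""
    text = " ".join(documents)
    return [
        textstat.flesch_reading_ease(text),
        textstat.smog_index(text),
        textstat.flesch_kincaid_grade(text),
        textstat.coleman_liau_index(text),
        textstat.automated_readability_index(text),
        textstat.dale_chall_readability_score(text),
        textstat.difficult_words(text),
        textstat.linsear_write_formula(text),
        textstat.gunning_fog(text),
        textstat.text_standard(text, float_output=True),  # estimated school level
    ]
```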

Apart from the general, statistical and PC-based ones, the rest of the listed meta-features have not been used in a meta-learning context before.
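As a rough illustration of the statistical and landmarking groups, the following sketch computes distribution statistics from a bag-of-words document-term matrix and trains the four landmarking classifiers on a 70/30 split. It relies on scikit-learn and SciPy and reflects our reading of the description above rather than the original implementation.

```python
# Illustrative sketch (our assumptions, not the authors' code): statistical
# and landmarking meta-features from a document-term matrix.
import numpy as np
from scipy.stats import entropy, kurtosis, skew
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def distribution_stats(values):
    """min, max, mean, std, skewness, kurtosis, mean/std ratio, entropy."""
    v = np.asarray(values, dtype=float)
    return [v.min(), v.max(), v.mean(), v.std(), skew(v), kurtosis(v),
            v.mean() / (v.std() + 1e-12), entropy(v + 1e-12)]

def statistical_meta_features(documents, labels):
    X = CountVectorizer().fit_transform(documents)           # document-term matrix
    vocab_dist = np.asarray(X.sum(axis=0)).ravel()           # vocabulary distribution
    words_per_doc = np.asarray(X.sum(axis=1)).ravel()        # words per document
    docs_per_cat = np.bincount(np.unique(labels, return_inverse=True)[1])
    feats = (distribution_stats(vocab_dist)
             + distribution_stats(words_per_doc)
             + distribution_stats(docs_per_cat))
    feats.append(1.0 - X.nnz / (X.shape[0] * X.shape[1]))    # fraction of zeros
    return feats, X

def landmarking_meta_features(X, labels):
    """Accuracy of four simple classifiers on a held-out 30% split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                              random_state=0)
    landmarkers = [KNeighborsClassifier(n_neighbors=1),      # data sparsity
                   DecisionTreeClassifier(random_state=0),   # data separability
                   LinearDiscriminantAnalysis(),             # linear separability
                   MultinomialNB()]                          # feature independence
    scores = []
    for clf in landmarkers:
        needs_dense = isinstance(clf, LinearDiscriminantAnalysis)
        tr = X_tr.toarray() if needs_dense else X_tr
        te = X_te.toarray() if needs_dense else X_te
        clf.fit(tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(te)))
    return scores
```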

Table 1. Meta-features identified as relevant after feature selection. We show the ranked features for each problem; in bold, the features used to obtain the results in Table 5.

3.2 Datasets

For the extraction of the meta-features and the experimental evaluation we collected 81 text corpora associated with different problems. We assigned each corpus a task-type label according to its classification problem; the considered labels were: authorship analysis, sentiment analysis, topic/thematic tasks, irony detection and hate speech detection. Table 2 shows the distribution of the datasets by task.

Table 2. Tasks by their type.

The full list of datasets is available in the appendix material. Some of the considered datasets are well-known benchmarks (e.g., Yelp), while the rest can be found on competition sites such as Kaggle and SemEval. After pre-processing each corpus so that all share the same format and encoding, we extracted the 73 meta-features for each of the 81 collections and assigned a task-type label to each dataset according to the associated classification problem. To speed up feature extraction we limited each collection to 90,000 documents, randomly sampled from the categories of the corpus. The resulting 81 \(\times \) 73 matrix comprises our knowledge base characterizing multiple corpora.
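A minimal sketch of how such a knowledge base could be assembled is shown below; `load_corpus` and `extract_meta_features` are hypothetical placeholders standing in for the corpus loaders and the meta-feature extractors described in Sect. 3.1, and the plain random subsample is a simplification of the per-category sampling described above.

```python
# Hypothetical sketch of building the meta-dataset (knowledge base); the
# helpers load_corpus and extract_meta_features are placeholders.
import numpy as np

MAX_DOCS = 90_000  # cap used to speed up meta-feature extraction

def build_knowledge_base(corpus_names, task_labels, rng=np.random.default_rng(0)):
    rows = []
    for name in corpus_names:
        documents, labels = load_corpus(name)                   # placeholder loader
        if len(documents) > MAX_DOCS:                            # subsample large corpora
            idx = rng.choice(len(documents), MAX_DOCS, replace=False)
            documents = [documents[i] for i in idx]
            labels = [labels[i] for i in idx]
        rows.append(extract_meta_features(documents, labels))    # 73 values per corpus
    X_meta = np.vstack(rows)          # shape: (number of corpora, 73), here 81 x 73
    y_meta = np.array(task_labels)    # one task-type label per corpus
    return X_meta, y_meta
```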

3.3 Meta-learning of Task Labels

We approached the problem of recognizing the classification task of a dataset using the proposed meta-features. We studied the prediction problem both as a multiclass problem (predicting one of the 5 task labels) and as binary problems (distinguishing one label from the rest at a time). The following classifiers were considered in the evaluation: Random Forest (RF), XGBoost (XG), Support Vector Machines and 1NN.

4 Experiments

For the evaluation we adopted leave-one-out cross-validation: 80 tasks were used for training and 1 for testing, repeating this process 81 times, each time changing the test task; the average performance over the 81 folds is reported. As evaluation measures we report accuracy and the \(f_1\) measure for the positive class; for the multiclass problem, average accuracy and macro-\(f_1\) are reported.
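A possible reconstruction of this protocol with scikit-learn is sketched below; the classifier and its hyperparameters are placeholders, and the metrics are computed over the pooled leave-one-out predictions.

```python
# Sketch of the leave-one-out evaluation; hyperparameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import LeaveOneOut

def evaluate_loo(X_meta, y_meta, clf):
    preds = np.empty_like(y_meta)
    for tr_idx, te_idx in LeaveOneOut().split(X_meta):    # 81 folds: 80 train / 1 test
        clf.fit(X_meta[tr_idx], y_meta[tr_idx])
        preds[te_idx] = clf.predict(X_meta[te_idx])
    # macro-f1 for the multiclass setting; use pos_label for the binary problems
    return accuracy_score(y_meta, preds), f1_score(y_meta, preds, average="macro")

# e.g., acc, f1 = evaluate_loo(X_meta, y_meta, RandomForestClassifier(random_state=0))
```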

Table 3. Task prediction results with 73 meta-features

Table 3 shows the results obtained by the two best-performing classifiers (XG and RF). Table 4 shows the results for the different classifiers. It can be seen that performance for all of the tasks is better than random guessing. The high accuracy contrasted with moderate \(f_1\) values reveals that the models favour the majority class. In fact, the high imbalance makes prediction quite difficult, especially for the hate speech and irony detection tasks, where there are only 6 and 7 positive examples, respectively.

Table 4. Task prediction \(f_1\) score for different classification models.

An additional experiment involved a feature selection step prior to the classification stage. Mutual information was used to select the top K features, which were then used for training and prediction. Table 5 shows the best performance obtained with feature selection, together with the number of meta-features selected. There is a performance improvement after meta-feature selection in all binary cases. Improvements are dramatic in terms of the \(f_1\) measure in some cases (e.g., Hate, Topics, Author). Surprisingly, for some problems only a few meta-features were required to achieve better performance (see, e.g., Hate). For the multiclass problem, meta-feature selection did not improve the initial results on either evaluation measure.
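One plausible way to implement this step with scikit-learn is sketched below; sweeping K and performing the selection once on the full meta-dataset are our simplifying assumptions, not necessarily the original procedure.

```python
# Sketch of top-K meta-feature selection by mutual information (simplification:
# selection is done once on the full meta-dataset before evaluation).
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def best_k_selection(X_meta, y_meta, clf, k_values=range(1, 74)):
    best_k, best_f1 = None, -1.0
    for k in k_values:
        selector = SelectKBest(mutual_info_classif, k=k)
        X_sel = selector.fit_transform(X_meta, y_meta)    # keep top-K meta-features
        _, f1 = evaluate_loo(X_sel, y_meta, clf)          # reuses the LOO sketch above
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1
```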

Table 5. Results with meta-feature selection

Table 1 shows the complete subsets of features used to obtain the results in Table 5. Meta-features are ordered by their mutual information values. It is hard to find a common pattern, but some features are part of almost every subset: the percentage of adverbs, the number of categories, the vocabulary overlap across classes (MRH), and some statistic of the documents per category. This shows the importance of the novel meta-features extracted from raw text. For hate speech detection and authorship analysis, simple statistical measures appear to describe the corpora better; for the rest of the tasks, the subsets that improved the original performance include a wide variety of meta-features from the groups presented in Sect. 3 (Fig. 1).

Fig. 1. Normalized confusion matrix of predicting all 5 tasks with XG.

5 Conclusions

We introduced the problem of automatically predicting the type of text classification task from meta-features derived from text. A set of 73 meta-features has been proposed and evaluated on 81 datasets associated with 5 types of tasks. Experimental results demonstrate that the proposed meta-features convey discriminative information that could be useful for other meta-learning tasks. A meta-feature selection analysis showed that traditional meta-features are not enough to characterize datasets by themselves, confirming the effectiveness of the newly introduced ones. This paper takes the first steps towards meta-learning directly from raw text; we foresee that our work will pave the way for the establishment of meta-learning in NLP.