
1 Introduction

1.1 Motivation

As an example, consider a data set (shown below) composed of many variables, all of which are numerical except for two categorical ones, gender and marital status [50]:

Table 1. Original mixed variables

Many machine learning models require the data to be of numerical type, so the categorical variables have to be converted into numerical ones. The most efficient way of converting a categorical variable is the introduction of dummy variables (one-hot encoding), in which a new (dummy) variable is created for each category of the categorical variable except the last one, since the last category is redundant: its value can be determined once all the other dummy variables are known. These dummy variables are binary and can assume only two values, 1 and 0. The value 1 means the sample belongs to that category and 0 means the opposite.

Here, for this example, we have two categorical variables:

  1. Gender: there are only two categories, so we need to create one dummy variable.

  2. Marital Status: there are three categories, so we need to create two new dummy variables.

The result after the creation of dummy variables is shown in Table 2.

Table 2. The original variables after the introduction of dummy variables.

After this transitional step, we can use any machine learning model on this data set, as all of its variables are now numerical.
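As an illustration of this step, a minimal sketch in Python using pandas is shown below; the data frame values are hypothetical stand-ins for Table 1, and drop_first=True keeps m − 1 dummies per variable, mirroring the rule described above.

```python
import pandas as pd

# Hypothetical version of the data in Table 1 (values are illustrative only).
df = pd.DataFrame({
    "Age":            [25, 32, 47, 51],
    "Gender":         ["Male", "Female", "Female", "Male"],
    "Marital Status": ["Single", "Married", "Divorced", "Married"],
    "Income":         [40000, 55000, 72000, 68000],
})

# One-hot encode the two categorical columns; drop_first=True keeps m - 1
# dummies per variable, since the dropped category is implied when all
# remaining dummies are 0.
encoded = pd.get_dummies(df, columns=["Gender", "Marital Status"], drop_first=True)
print(encoded)
```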

In general, for any categorical variable of “m” categories (classes), we need to create “m − 1” dummy variables. A problem arises when a categorical variable has a large number of categories (based on our work, larger than 8). In these cases, the number of dummy variables that must be created becomes so large that the data becomes high dimensional. High dimensionality leads to the “curse of dimensionality” and all of its related issues, such as the need for an exponential increase in the number of data rows and difficulties in computing distances. Obviously, one needs to avoid this situation since, in addition to these problems, the curse of dimensionality also leads to misleading results from machine learning models, such as false patterns discovered from noise or random chance. Moreover, higher dimensionality leads to higher computational cost, slower model response, and lower robustness, all of which should be avoided. Therefore, in the process of transforming categorical data into numerical data, we must limit the number of newly created numerical variables in order to keep the dimension of the data low [50].

Two examples of categorical variables with a large number of categories or classes are “country of residence” and URL-related data such as “the last site visited by the user”. For the first variable, there are more than 150 categories; for the second, there are potentially as many categories as there are users, which is a very large number (on the order of millions). To address these types of problems, this work establishes a new approach that reduces the number of categories (when the number of categories in a categorical variable is larger than 10) to K categories for \( {\text{K}} \le 10 \). This way, we create only a limited number of dummy variables to replace the categorical variable in the data set.

For some types of categorical variables, such as “country of residence”, we may find attributes online; using these attributes together with web scraping and clustering models, we can create only a handful of dummy variables to replace a categorical variable with many categories [50].

However, there are other types of categorical variables, such as the “URL” variable, for which it is not possible to scrape features online, and thus the above method [50] cannot be applied. This paper focuses on a method for dealing with this type of categorical data.

2 The Approach Used in This Work

2.1 The Difficulties in Dealing with Modern Data

Quite often, machine learning models can use only numerical data, while practically all data sets used in machine learning are of mixed type, containing both numerical and categorical data. When such data are fed to models that accept only numerical input, mixed data types are handled in one of three ways: the first approach is to use, instead, models that can handle mixed data types; the second is to ignore (drop) the categorical variables; the third is to convert the categorical variables to numerical type by introducing dummy variables. The first approach introduces many limitations, as only a limited number of models can handle mixed data and those models are often not the best fit for the data set. The second approach discards much of the information in the data set, namely the categorical data. The practical approach is the third one, i.e., the conversion of categorical data into numerical data. As explained above, this works well only when all categorical variables have a limited number of categories (10 or less); otherwise, it leads to high dimensional data that causes, among other problems, machine learning models to produce meaningless (biased) results. In other words, when a variable has many classes, this approach becomes infeasible because the number of variables will be too large for the numerical models to handle.

This work detects a much smaller number of “latent classes” that underpin the original categories of each categorical variable. This way, high dimensionality is avoided, and we can apply the dummy variable generation described above to these latent classes and then use any machine learning model. The small number of latent categories is detected using k-means clustering.

The basic idea is that categorical variables that have many values (or unique values for each sample) provide little information for other samples. To maintain the useful information in these variables, the best method is to keep their useful (latent) information. This work does so by finding the latent categories, i.e., by clustering all categories into similar groups. When applying k-means clustering to the categories of a categorical variable, we may encounter two distinct cases. The first is when each category comes with given features or attributes; this is rarely seen in real data sets. The second case is when there are no such attributes for the categories and we need to create them.

In cases where we have features for all categories or classes of a variable, we can use k-means clustering directly. Quite often, though, there is no attribute information about these classes in the data set. This work uses NLP (Natural Language Processing) [2, 13, 18,19,20, 53, 57] models to address the case of categorical variables without any attributes or features. The objective is to find a small number of dummy variables to replace the categorical variable that we want to convert into a numerical one.
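For the first case, where per-category attributes are available, the clustering step is direct. Below is a minimal sketch with scikit-learn; the attribute table for a “country of residence” variable is a hypothetical illustration, not data from this work.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical attribute matrix: one row per category (country), with
# numeric attributes such as GDP per capita and population (millions).
country_features = np.array([
    [65000, 330],   # country A
    [42000, 83],    # country B
    [2300, 1400],   # country C
    [1900, 210],    # country D
])

# Scale the attributes, then group the categories into a small number
# of latent classes with k-means.
X = StandardScaler().fit_transform(country_features)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Each original category is now represented by its cluster label, which
# can be one-hot encoded instead of the full category list.
print(kmeans.labels_)
```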

We show our approach using the important example of a URL variable.

2.2 Application of Our Model by Using the Example of URL Data

Categorical variables containing URLs are an important example of this type of categorical variable. They are frequently present in click data and often have a very large number of possible values, sometimes as many as the number of users.

To extract the latent categories from these URL variables, we cluster them into groups of similar URLs, i.e., URLs with similar paths. We extract word and character n-gram vector representations from the URLs and then cluster these vector representations using k-means clustering.

URL clustering is a good example because of the difficulty of the task. The difficulty stems not only from the number of URLs but also from the lack of information (attributes) about them that can be used for clustering. When no information about the variable is available, we need to use NLP. It is important that we use NLP to perform the clustering because we have no knowledge of the format of the URLs, i.e., we have no attributes for each URL, and clustering cannot be done without attributes. In this case, we use NLP to build the needed attributes for the URLs. When all URLs share the same domain, such as www.google.com, the clusters would all fall under www.google.com. However, the URLs could also span multiple domains, in which case the clusters would be spread across multiple domains. A predetermined algorithm would not be able to handle this variability dynamically. This is another reason that, in the case of URLs, we use NLP to cluster them based on syntactic similarity, specifically word bigrams, i.e., groups of two consecutive words. Our categorical variable has 500 categories, all under the domain www.adobe.com. A few of these categories are shown in Fig. 1:

Fig. 1. The example URL variable list with 500 different categories.

For the algorithm to work best, we first strip the URLs of any characters and words that provide little information for clustering (since they introduce no new information). These include punctuation and common words such as “http” and “www”. We therefore perform pre-processing on the list, which includes removing punctuation, queries (anything after the character “?”), and stop-words (http, com, www, html, etc.). After this step, we are left with the URLs as space-separated words representing the path of each URL (Fig. 2).

Fig. 2. The process of deleting noisy words from the URL variable.
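A minimal sketch of this pre-processing step in Python is given below; the stop-word list and the regular expression are illustrative assumptions, not the exact rules used in this work.

```python
import re

# Illustrative stop-words; in practice the list covers scheme and domain
# boilerplate and file extensions that carry no clustering signal.
STOP_WORDS = {"http", "https", "www", "com", "html"}

def clean_url(url: str) -> str:
    url = url.split("?", 1)[0]                 # drop the query string
    tokens = re.split(r"[^A-Za-z0-9]+", url)   # split on punctuation
    tokens = [t.lower() for t in tokens if t and t.lower() not in STOP_WORDS]
    return " ".join(tokens)                    # space-separated path words

# Hypothetical example URL:
print(clean_url("https://www.adobe.com/products/photoshop.html?promo=1"))
# -> "adobe products photoshop"
```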

A sample of the result is shown in Fig. 3:

Fig. 3. The URL data after the removal of words that may be irrelevant for clustering.

One of the most popular tools in NLP is the representation of words as numerical vectors in an n-dimensional space: using the context of a word, the word can be mapped into an n-dimensional vector space. Learned representations such as word embeddings are increasingly popular for modeling semantics in NLP, as they reduce semantic composition to simple vector operations. We have modified and extended traditional representation learning techniques [13, 18, 50] to support multiple word senses and uncertain representations.

In this work, we use a modification in which, instead of projecting individual words, we project whole URLs containing multiple words. We use these words and their contexts as features for the projection of the whole URL (Fig. 4).

Fig. 4. Vector representation of the URL data.

Using the cleaned list, we extract vector representations of the URLs with the tool “Sally”. Sally maps a set of strings to a set of vectors. The features that we use for this mapping are word bi-grams and character tri-grams. Thus, using these n-grams of the URLs as features, we project the URLs into vector space using Sally. Sally represents the URLs using a sparse matrix representation: the URLs are projected into very long vectors, with each dimension representing an n-gram that has been seen in the dataset. If an n-gram has been observed in a URL, its value in the vector is 1; otherwise, the value is 0. This results in a long vector with most values equal to 0 and a few values equal to 1. All the vectors together form a matrix that is sparse because of its many 0 values. Finally, we apply k-means clustering to this embedding. Given that the URLs have been transformed into points in an n-dimensional vector space, k-means clustering can find groups of points and partition them into clusters. Given a number K, the number of clusters for the algorithm to discover, k-means finds the best partitioning of the dataset such that the points within each cluster are as similar to each other as possible. In the context of URLs, this means finding the groups of URLs that share the most n-grams. Figure 5 shows that the best K value is 10.

Fig. 5. The computation of the optimal number of clusters using word and character n-grams.
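A minimal sketch of the vectorization and clustering steps is given below, using scikit-learn's CountVectorizer as a stand-in for Sally; the binary word-bigram and character-trigram features and K = 10 follow the description above, while the function name and any sample URLs are assumptions for illustration.

```python
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

def cluster_urls(cleaned_urls, k=10):
    """Cluster cleaned URL strings into k latent categories."""
    # Binary word-bigram and character-trigram features, analogous to
    # the sparse string representation produced by Sally.
    word_vec = CountVectorizer(analyzer="word", ngram_range=(2, 2), binary=True)
    char_vec = CountVectorizer(analyzer="char", ngram_range=(3, 3), binary=True)
    X = hstack([word_vec.fit_transform(cleaned_urls),
                char_vec.fit_transform(cleaned_urls)])
    # k-means groups the URLs that share the most n-grams.
    return KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)

# Usage (assuming the 500 cleaned URLs of Fig. 3 are held in `cleaned_urls`):
# labels = cluster_urls(cleaned_urls, k=10)
```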

2.3 Computing the Optimal Number of Clusters

To compute the optimal number of clusters, we use the Silhouette method, which is based on minimizing the dissimilarities inside each cluster and maximizing the dissimilarities between clusters [31, 50]:

The Silhouette model computes s(i) for each data point in the data set for each K:

$$ s(i) = \frac{b(i) - a(i)}{\max \left\{ a(i), b(i) \right\}} $$

where \( a\left( i \right) \) is the mean distance of point i to all the other points in its own cluster, and \( b\left( i \right) \) is the mean distance of point i to all the points in its closest neighboring cluster, i.e., \( b\left( i \right) \) is the minimum, over all clusters that i is not a member of, of the mean distance from point i to the points of that cluster.

The optimal K is the K that maximizes the average score s(i) over the whole data set. The score values lie in the range [−1, 1], with −1 being the worst possible score and +1 the optimal one. Thus, the K whose average score (over all points) is closest to +1 is the optimal K. Our experiments show that the value of K has an upper bound of 10. Here, we use not only the score but also the separation and compactness of the clusters, as measured by the distance between clusters and the uniformity of the cluster widths, to test and validate our model while computing the optimal K. Figure 6 depicts the Silhouette model for different K [50].

Fig. 6. Using the Silhouette model to compute the optimal number of clusters, found to be 10.
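A minimal sketch of this K selection is shown below, assuming the sparse n-gram matrix X from the previous step; scikit-learn's silhouette_score returns the mean s(i) over all points for each candidate K, and the helper name is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, k_min=2, k_max=10):
    """Return the K in [k_min, k_max] with the highest mean silhouette score."""
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # mean s(i) over all points
    return max(scores, key=scores.get), scores

# Usage (X is the sparse n-gram matrix of the URLs):
# k_opt, scores = best_k(X)
```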

Using the results from the Silhouette model, we apply k-means clustering to the URL data. Some of the clusters are shown in Fig. 7.

Fig. 7. Some of the clusters for the URL data.

As the figure above shows, our method has grouped together URLs with similar paths and separated URLs with dissimilar paths.

3 The Results and Conclusion

This work provides a method of converting categorical variables to numerical variables so that machine learning models can use the data. For this conversion to be feasible for categorical variables with many classes, we propose using clustering to reduce the number of classes in the variable to a small number before dummy variable generation. Some variables have accessible features that make it possible to cluster them directly, but many variables lack the information or features needed by clustering models. This work deals effectively with the latter type of categorical variable and assumes that no extra features or information are available, either explicitly or implicitly (e.g., by web scraping), for such variables. For the model to work, we use NLP to create a vector representation of the variables; we then use this vector representation to cluster the variables, i.e., to cluster the categories of each variable.

This work provides a new method, and the only practical one, for standardizing categorical variables when the variables have a large number of categories or classes and have no explicitly or implicitly available features. Our model avoids the deletion of categorical variables and thus the loss of information that causes machine learning models to produce meaningless results. This work also avoids the creation of high dimensional data, where the “curse of dimensionality” leads to high computational cost, the need for exponentially larger data sets, distorted values for distance metrics, and biased models.